CN106815274A - Hadoop-based log data mining method and system
- Publication number
- CN106815274A (application number CN201510875453.3A)
- Authority
- CN
- China
- Prior art keywords
- log data
- time period
- current time
- hadoop
- data set
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Hadoop-based log data mining method. A first log data set acquired for the current time period is saved to a Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, a preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension, and the resulting third log data sets, each corresponding to a different dimension, are saved to the Hadoop database. The invention also discloses a Hadoop-based log data mining system. The invention can mine massive data quickly and effectively and meets the storage and computing requirements of massive data mining.
Description
Technical field
The invention relates to the field of computer data processing, and in particular to a Hadoop-based log data mining method and system.
Background
Since the start of the Internet era, quickly finding more suitable, quantifiable and predictable precision marketing strategies within an ever-growing mass of user information has become a core need of many enterprises, including network operators.
However, traditional databases have limited data computing capability and high storage costs, and cannot meet the needs of massive data mining.
The above content is provided only to aid understanding of the technical solution of the invention and is not an admission that it constitutes prior art.
Summary of the invention
The main purpose of the invention is to provide a Hadoop-based log data mining method and system, aiming to solve the technical problem that traditional databases have limited data computing capability and high storage costs and cannot support massive data mining.
To achieve the above object, the invention provides a Hadoop-based log data mining method, comprising:
saving the acquired first log data set for the current time period to a Hadoop database;
if the number of first log data sets saved in the Hadoop database meets a preset value, performing parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
dividing the log data in the second log data set by dimension according to the dimensions of that log data, and saving the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
Preferably, the method further comprises:
acquiring log data for the current time period from the network side;
aggregating the log data for the current time period to obtain the first log data set for the current time period.
Preferably, after the step of acquiring log data for the current time period from the network side, the method further comprises:
cleaning the log data for the current time period to obtain cleaned log data for the current time period;
in which case the step of aggregating the log data for the current time period to obtain the first log data set for the current time period comprises:
aggregating the cleaned log data for the current time period to obtain the first log data set for the current time period.
Preferably, the method further comprises:
if a data query instruction is received, reading from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
Preferably, performing data analysis on the third log data set comprises:
grouping the users in the third log data set according to a preset clustering algorithm to obtain a user group list;
obtaining, from the log data of the users in the user group list, level configuration tables corresponding to at least two user dimensions, where the user dimensions are preset and each level configuration table contains the levels assigned to the users in the user group list when they are graded along that user dimension.
To achieve the above object, the invention also provides a Hadoop-based log data mining system, comprising:
a first saving module, configured to save the acquired first log data set for the current time period to a Hadoop database;
a parallel aggregation module, configured to, if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
a division and saving module, configured to divide the log data in the second log data set by dimension according to the dimensions of that log data, and to save the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
Preferably, the system further comprises:
an acquisition module, configured to acquire log data for the current time period from the network side;
a first aggregation module, configured to aggregate the log data for the current time period to obtain the first log data set for the current time period.
Preferably, the system further comprises a cleaning module;
the cleaning module is configured to clean the log data for the current time period after the acquisition module has acquired it, to obtain cleaned log data for the current time period;
and the first aggregation module is specifically configured to aggregate the cleaned log data for the current time period to obtain the first log data set for the current time period.
Preferably, the system further comprises:
a reading module, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
an analysis module, configured to perform data analysis on the third log data set and to display the result of the data analysis on a display interface.
Preferably, the analysis module comprises:
a clustering module, configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user group list;
an acquisition and display module, configured to obtain, from the log data of the users in the user group list, level configuration tables corresponding to at least two user dimensions, where the user dimensions are preset and each level configuration table contains the levels assigned to the users in the user group list when they are graded along that user dimension.
The invention provides a Hadoop-based log data mining method. The first log data set acquired for the current time period is saved to a Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, a preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is divided by dimension according to its dimensions, and the resulting third log data sets, each corresponding to a different dimension, are saved to the Hadoop database, which completes the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in it in a distributed manner and computing in parallel with the parallel computing model makes it possible to mine massive data quickly and effectively and meets the storage and computing requirements of massive data mining.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the Hadoop-based log data mining method according to the first embodiment of the invention;
Fig. 2 is a schematic flowchart of the steps added before step 101 of the first embodiment in Fig. 1;
Fig. 3 is a schematic flowchart of the steps added after step 103 of the first embodiment in Fig. 1;
Fig. 4 is a schematic diagram of the functional modules of the Hadoop-based log data mining system according to the second embodiment of the invention;
Fig. 5 is a schematic diagram of functional modules added to the second embodiment of Fig. 4;
Fig. 6 is a schematic diagram of functional modules added to the second embodiment of Fig. 4.
The realisation of the purpose of the invention, its functional features and its advantages are further described below in conjunction with the embodiments and with reference to the drawings.
Detailed description
It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
The invention provides a Hadoop-based log data mining method. The first log data set acquired for the current time period is saved to a Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, a preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is divided by dimension according to its dimensions, and the resulting third log data sets, each corresponding to a different dimension, are saved to the Hadoop database, which completes the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in it in a distributed manner and computing in parallel with the preset parallel computing model in Hadoop makes it possible to mine massive data quickly and effectively and meets the storage and computing requirements of massive data mining.
Fig. 1 is a schematic flowchart of the Hadoop-based log data mining method in the first embodiment of the invention, which comprises:
Step 101: save the acquired first log data set for the current time period to the Hadoop database.
In this embodiment, the Hadoop-based log data mining method can be applied in a Hadoop-based log data mining system (hereinafter the "mining system"). The mining system saves the acquired first log data set for the current time period to the Hadoop database.
The mining system acquires first log data sets by time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute period or for the current 30-minute period.
The time period is the data acquisition cycle, and its length can be chosen according to the data volume.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
Preferably, the parallel computing model is the MapReduce computing model.
Step 102: if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using the preset parallel computing model to obtain a second log data set.
In this embodiment, the mining system saves the first log data set acquired in each time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the preset parallel computing model in the Hadoop framework can be used to aggregate the first log data sets in the Hadoop database and obtain the second log data set.
In practice, this value can be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour are to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day are to be aggregated, the preset value is 48.
It will be understood that, building on the aggregation described above, the mining system can obtain log data sets for different time spans in a similar way. For example, four first log data sets of 15 minutes each yield the log data set for one hour, 24 one-hour log data sets yield the log data set for one day, 30 one-day log data sets yield the log data set for one month, and so on, so that log data sets for different time spans can be obtained to meet different needs.
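The preset values in the examples above follow from dividing the target span by the acquisition period. A minimal sketch of that arithmetic, in which the function name `sets_required` is a hypothetical helper and not part of the described system:

```python
from datetime import timedelta


def sets_required(period: timedelta, target: timedelta) -> int:
    """Number of per-period log data sets that make up one target span."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    # divmod on two timedeltas yields an integer quotient and a timedelta remainder.
    count, remainder = divmod(target, period)
    if remainder:
        raise ValueError("target span must be a whole multiple of the period")
    return count


# 15-minute periods aggregated per hour, 30-minute periods aggregated per day.
hourly_threshold = sets_required(timedelta(minutes=15), timedelta(hours=1))
daily_threshold = sets_required(timedelta(minutes=30), timedelta(days=1))
```

The same helper covers the roll-up chain, for instance `sets_required(timedelta(hours=1), timedelta(days=1))` for building a daily set from hourly sets.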
In this embodiment, when the mining system performs parallel aggregation with the preset parallel computing model, it accumulates the count values of identical log data.
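The accumulation of count values can be illustrated with a small single-process stand-in for the reduce phase. This is only a sketch: it does not use Hadoop's MapReduce API, and the representation of a first log data set as a mapping from a log entry to its count is an assumption made for illustration.

```python
from collections import Counter
from functools import reduce


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Combine two partial results by adding the counts of identical entries."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts instead of replacing them
    return merged


def aggregate_sets(first_sets):
    """Fold several first log data sets into one second log data set."""
    return reduce(merge_counts, (Counter(s) for s in first_sets), Counter())


# Two hypothetical 15-minute sets: entry -> number of occurrences.
window_1 = {"weibo": 3, "music": 1}
window_2 = {"weibo": 2, "movies": 4}
second_set = aggregate_sets([window_1, window_2])
```

Because `merge_counts` is associative and commutative, partial results can be merged in any order, which is the property a parallel reduce relies on.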
Step 103: divide the log data in the second log data set by dimension according to the dimensions of that log data, and save the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
In this embodiment, after obtaining the second log data set, the mining system divides the log data in it by dimension according to the dimensions of that log data and saves the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database, thereby mining the massive log data. The saved third log data sets can serve as the data source for user data queries and support chart queries, graphical queries and multi-dimensional queries on the display interface, so that the data can be presented from multiple angles and the results of data mining can be displayed.
Log data has many dimensions, including but not limited to online content, online location and online time. Online content refers to what the user browses: it may be a specific site, such as Baidu, Sohu or Sina Weibo, or a category of sites, such as music or movies. Online location refers to the geographic area of the IP address the user is using, and online time refers to the time at which the log data was generated. Dimensions are divided according to the requirements of the system, and the data along each dimension is used to further characterise users' overall behaviour. Note that different types of log data have different dimensions. For example, when the technical solution of this embodiment is used to mine users' traffic data in the log data, the dimensions may include, in addition to online content, online location and online time, online frequency, user age, monthly spending and so on. In practice, dimensions can therefore be divided according to specific needs, and no limitation is imposed here.
Preferably, in this embodiment, after saving the third log data sets corresponding to the different dimensions to the Hadoop database, the mining system may also save them to a column storage array, so that the Hadoop database and the column storage array work together and the data needs of different application scenarios can be met.
Preferably, because the mining system performs the parallel aggregation and dimension division described above only when the number of first log data sets saved in the Hadoop database meets the preset value, each resulting third log data set also corresponds to a time period. When saving, the mining system can therefore store the correspondence among the dimension, the time period and the third log data set.
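A minimal sketch of dimension division and of the stored correspondence among dimension, time period and third log data set. The record layout, the field names in `DIMENSIONS` and the use of an in-memory dictionary in place of the Hadoop database are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical dimension field names; the text leaves the actual set open.
DIMENSIONS = ("content", "location", "hour")


def split_by_dimension(second_set, period):
    """Return {(dimension, period): {dimension value: total count}}."""
    store = {}
    for dimension in DIMENSIONS:
        totals = defaultdict(int)
        for record in second_set:
            totals[record[dimension]] += record["count"]
        # One third log data set per dimension, keyed together with its period.
        store[(dimension, period)] = dict(totals)
    return store


second_set = [
    {"content": "music", "location": "Shenzhen", "hour": 9, "count": 3},
    {"content": "music", "location": "Beijing", "hour": 9, "count": 2},
    {"content": "movies", "location": "Beijing", "hour": 10, "count": 1},
]
third_sets = split_by_dimension(second_set, "2015-12-01")
```

Keying each result by `(dimension, period)` keeps a later lookup by dimension, by period, or by both a simple filter over the keys.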
In this embodiment, the mining system saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain the second log data set. The log data in the second log data set is divided by dimension according to its dimensions, and the resulting third log data sets, each corresponding to a different dimension, are saved to the Hadoop database, which completes the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in it in a distributed manner and computing in parallel with the parallel computing model in Hadoop makes it possible to mine massive data quickly and effectively and meets the storage and computing requirements of massive data mining.
Fig. 2 is a schematic flowchart of the steps added before step 101 in the first embodiment shown in Fig. 1, which comprise:
Step 201: acquire log data for the current time period from the network side.
In this embodiment, the mining system acquires the log data for the current time period from the network side. Specifically, the mining system may extract the log data for the current time period from the network side, or use web crawler technology to acquire it from the network side, or obtain it from the BOSS business and accounting database on the network side, or accept it from a third-party vendor on the network side, or combine at least two of these approaches.
Step 202: aggregate the log data for the current time period to obtain the first log data set for the current time period.
In this embodiment, after acquiring the log data for the current time period, the mining system aggregates it to obtain the first log data set for the current time period.
The aggregation in step 202 may classify the log data by content and treat log data with the same content, or with content of the same category, as a single record whose count is accumulated. The first log data set obtained after aggregation is orders of magnitude smaller than the log data acquired for the current time period, but the meaning of the data is fully preserved.
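The aggregation in step 202 can be sketched as counting records that share the same content or the same category. The record layout of `(user, site)` tuples and the optional `key` function are assumptions for illustration.

```python
from collections import Counter


def aggregate_period(raw_records, key=lambda record: record):
    """Collapse raw records into {key: count}; identical keys become one entry."""
    return Counter(key(record) for record in raw_records)


# Hypothetical raw log records for one time period: (user, site).
raw = [("u1", "weibo"), ("u1", "weibo"), ("u2", "music"), ("u3", "weibo")]

by_record = aggregate_period(raw)                       # identical records merged
by_site = aggregate_period(raw, key=lambda r: r[1])     # grouped by category
```

Four raw records shrink to three entries when identical records are merged and to two entries when grouped by site, while the counts still carry how often each one occurred.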
In this embodiment, the mining system obtains the first log data set through the additional steps shown in Fig. 2. Aggregating the log data acquired from the network side for the current time period effectively reduces the volume of log data, so that less storage space is needed in the Hadoop database.
Preferably, in this embodiment, the mining system may also perform the following step before step 202:
cleaning the log data for the current time period to obtain cleaned log data for the current time period.
In this embodiment, before aggregating the log data acquired for the current time period, the mining system may also clean that log data to obtain cleaned log data for the current time period.
If the mining system performs this step, step 202 must be adjusted accordingly, and becomes:
aggregating the cleaned log data for the current time period to obtain the first log data set for the current time period.
Cleaning the log data may consist of removing log data that does not match the preset data types, and/or detecting identifiable errors in the log data and correcting or deleting the log data in which they occur.
In this embodiment, by cleaning the log data for the current time period, the mining system can remove useless or erroneous log data, which reduces the amount of log data to be processed and makes for better data mining.
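A minimal sketch of the two cleaning actions described above: correcting an identifiable error, then dropping records that still do not match the preset data types. The schema in `REQUIRED` and the specific error being corrected (a numeric value stored as a string) are assumptions for illustration.

```python
# Hypothetical preset schema: field name -> expected type.
REQUIRED = {"user": str, "site": str, "bytes": int}


def clean(records):
    """Return the records that match REQUIRED after fixing recoverable errors."""
    cleaned = []
    for record in records:
        fixed = dict(record)
        # Correct an identifiable error: a byte count that arrived as a digit string.
        value = fixed.get("bytes")
        if isinstance(value, str) and value.strip().isdigit():
            fixed["bytes"] = int(value.strip())
        # Drop anything that still does not match the preset data types.
        if all(isinstance(fixed.get(name), kind) for name, kind in REQUIRED.items()):
            cleaned.append(fixed)
    return cleaned


raw = [
    {"user": "u1", "site": "weibo", "bytes": "120"},   # fixable
    {"user": "u2", "site": None, "bytes": 5},          # missing site: dropped
    {"user": "u3", "site": "music", "bytes": 7},       # already valid
]
cleaned = clean(raw)
```

Copying each record before fixing it leaves the raw input untouched, which makes it easy to compare the data before and after cleaning.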
Fig. 3 is a schematic flowchart of the steps added after step 103 in the first embodiment shown in Fig. 1, which comprise:
Step 301: if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction.
In this embodiment, after the mining system has saved the third log data sets to the Hadoop database, a user can request a data query by entering a data query instruction. When the mining system receives a data query instruction, it reads from the Hadoop database the third log data set corresponding to the query dimension contained in the instruction.
Preferably, the data query instruction may also contain a time period, in which case the mining system reads the third log data set that corresponds to the query dimension within that time period.
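A minimal sketch of step 301, assuming the third log data sets are stored under `(dimension, time period)` keys. The in-memory dictionary stands in for the Hadoop database, and the function name `query` is hypothetical.

```python
def query(store, dimension, period=None):
    """Return {period: third log data set} for one dimension.

    When period is given, only the set for that period is returned.
    """
    return {
        key_period: data
        for (key_dimension, key_period), data in store.items()
        if key_dimension == dimension and (period is None or key_period == period)
    }


# Hypothetical stored third log data sets.
store = {
    ("content", "2015-12-01"): {"music": 5},
    ("content", "2015-12-02"): {"music": 2, "movies": 1},
    ("location", "2015-12-01"): {"Shenzhen": 3},
}
all_content = query(store, "content")
one_day = query(store, "content", period="2015-12-02")
```

A dimension with no stored set yields an empty result rather than an error, which lets the caller decide how to report a query that matches nothing.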
Step 302: perform data analysis on the third log data set and display the result of the analysis on a display interface.
In this embodiment of the present invention, the mining system also performs data analysis on the third log data set and displays the result on the display interface. Specifically, the mining system groups the users in the third log data set using a preset clustering algorithm to obtain a user group list. From the log data of the users in that list, it derives level configuration tables for at least two user dimensions and displays these tables on the display interface. The user dimensions are preset, and each level configuration table contains the levels assigned to the users in the user group list when they are graded along that user dimension.
The user dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated under each. For example, suppose the user groups obtained by the mining system include an all-users group and a Weibo user group. For the all-users group, all users in the group are ranked by the amount of traffic they use: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, until every user in the all-users group has a star rating. This is the horizontal dimension rating. For the Weibo user group, users are ranked by the amount of traffic generated after they launch Weibo: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, until every user in the Weibo user group has a star rating. This is the vertical dimension rating. Together, the horizontal and vertical ratings make it possible to present a profile of each user group, so that business experts can devise targeted plans for specific group profiles.
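The ranking-based star rating described above can be sketched as follows. This is an illustrative reading, not the patent's implementation: the function name `star_ratings` is hypothetical, ties are broken by input order, and a user's band is taken from the fraction of the group ranked above them (under 20% gives five stars, 20% to under 40% gives four, and so on).

```python
def star_ratings(traffic_by_user):
    """Map each user to 1-5 stars based on rank by traffic within the group."""
    # Highest traffic first; Python's sort is stable, so ties keep input order.
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    n = len(ranked)
    stars = {}
    for position, user in enumerate(ranked):
        # position / n is the fraction of the group ranked above this user;
        # each 20% band costs one star.
        band = min(position * 5 // n, 4)
        stars[user] = 5 - band
    return stars

# Horizontal rating: the all-users group, ranked by total traffic (MB).
all_users = {"a": 500, "b": 400, "c": 300, "d": 200, "e": 100}
horizontal = star_ratings(all_users)

# Vertical rating: the Weibo user group, ranked by traffic after launching Weibo.
weibo_users = {"a": 20, "c": 250}
vertical = star_ratings(weibo_users)
```

The same function serves both ratings; only the group and the traffic measure passed in differ, which mirrors the way the text distinguishes the horizontal and vertical dimensions.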
Preferably, the preset clustering algorithm may be the K-means algorithm.
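As a rough illustration of how K-means could group users, the sketch below runs the standard assign-and-update loop in plain Python. The feature choice (total traffic, Weibo traffic), the fixed iteration count, and the use of the first `k` points as initial centroids are simplifying assumptions; the patent does not describe features or initialization, and a poor initialization can leave K-means in a bad local optimum.

```python
def kmeans(points, k, iterations=20):
    """Minimal K-means: returns (cluster index per point, centroids)."""
    # Simplification: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical per-user features: (total traffic in MB, Weibo traffic in MB).
features = [(100, 5), (900, 700), (120, 10), (950, 650)]
groups, centers = kmeans(features, k=2)
```

Each distinct value in `groups` corresponds to one entry of the user group list; in practice a library implementation with better initialization would likely be preferable.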
The query dimension is set based on the dimensions of the third log data sets stored in the Hadoop database. For example, the query dimension may be any one or more of online content, online time, online location, and so on.
In this embodiment of the present invention, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that set, and displays the result on the display interface, so that the results of data mining can be presented to the user effectively.
It should be noted that, in the embodiments of the present invention, the Hadoop-based log data mining method can be applied in a precision marketing system for traffic data. For example, the technical solutions described in the embodiments shown in Fig. 1 to Fig. 3 can be used to mine target users, marketing site selections, and so on, giving operators a data basis for targeted, fine-grained marketing to target users or target base-station cells.
If target users need to be determined, the query dimension in step 301 of the embodiment shown in Fig. 3 may be online content or online traffic; if target base-station cells need to be determined, the query dimension may be online location.
In practical applications, users can select query dimensions according to their specific needs, which is not limited here.
Referring to Fig. 4, a schematic diagram of the functional modules of the Hadoop-based log data mining system in the second embodiment of the present invention, the system includes:
A first saving module 401, configured to save the acquired first log data set for the current time period to the Hadoop database.
The mining system acquires first log data sets by time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute period or for the current 30-minute period, respectively.
The time period is the cycle at which data is acquired, and its length can be determined according to the volume of data.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and a parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over it.
Preferably, the parallel computing model is the MapReduce computing model.
A parallel aggregation module 402, configured to, if the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set.
In practical applications this value can be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
It can be understood that, building on this aggregation, the parallel aggregation module 402 can obtain log data sets for other time spans in a similar way. For example, four 15-minute first log data sets yield a one-hour log data set, 24 one-hour log data sets yield a one-day log data set, 30 one-day log data sets yield a one-month log data set, and so on, so that log data sets for different time spans can be obtained to meet different needs.
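The preset value and the roll-up it triggers can be sketched as below. This is a single-process illustration under stated assumptions: each log data set is modeled as a `Counter` keyed by a hypothetical `(user, content)` pair, and the function names `required_sets` and `roll_up` are invented for the example. In the described system the summation would be carried out by the parallel computing model (preferably MapReduce) rather than by a local loop.

```python
from collections import Counter

def required_sets(window_minutes, period_minutes):
    """Number of per-period sets needed to cover one aggregation window."""
    if window_minutes % period_minutes != 0:
        raise ValueError("window must be a whole number of periods")
    return window_minutes // period_minutes

def roll_up(saved_sets, required):
    """Merge the saved sets once enough have accumulated; otherwise return None."""
    if len(saved_sets) < required:
        return None
    merged = Counter()
    for log_set in saved_sets[:required]:
        merged.update(log_set)  # Counter.update adds counts key by key
    return merged

# Four 15-minute first log data sets covering one hour.
quarter_hours = [
    Counter({("u1", "music"): 2}),
    Counter({("u1", "music"): 1, ("u2", "news"): 4}),
    Counter(),
    Counter({("u2", "news"): 1}),
]
needed = required_sets(60, 15)
hourly = roll_up(quarter_hours, needed)
```

The same `roll_up` call can be repeated on 24 hourly results to produce a daily set, and on 30 daily results to produce a monthly set, following the hierarchy given in the text.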
A partition-and-save module 403, configured to divide the log data in the second log data set by dimension, according to the dimensions of that log data, and to save the resulting third log data sets, one per dimension, to the Hadoop database.
Log data has many dimensions, including but not limited to online content, online location, and online time. Online content refers to what the user browses, which may be a specific site such as Baidu, Sohu, or Sina Weibo, or a category of sites such as music or movies. Online location refers to the geographic area of the IP address the user is using, and online time refers to the time at which the log data was generated. Dimensions are divided according to the requirements of the system, and the data along each dimension is used to characterize the user's overall behavior further. It should be noted that different types of log data have different dimensions. For example, when the technical solution of this embodiment is used to mine users' traffic data in the log data, the dimensions may include, in addition to online content, online location, and online time, online frequency, user age, monthly spending, and so on. In practical applications, dimensions can therefore be divided according to specific needs, which is not limited here.
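One way to picture the dimension division is to regroup each record of the second log data set under every dimension it carries. The sketch below is an assumption-laden illustration: the record fields (`user`, `count`, and the dimension names) and the function `split_by_dimension` are hypothetical, and summing counts per `(user, value)` pair is one plausible form for a per-dimension set, not something the text prescribes.

```python
from collections import defaultdict

# Hypothetical dimension names taken from the examples in the text.
DIMENSIONS = ("online_content", "online_location", "online_time")

def split_by_dimension(second_set, dimensions=DIMENSIONS):
    """Build one third log data set per dimension from the second set."""
    third_sets = {dim: defaultdict(int) for dim in dimensions}
    for record in second_set:
        for dim in dimensions:
            if dim in record:
                key = (record["user"], record[dim])
                third_sets[dim][key] += record.get("count", 1)
    return {dim: dict(rows) for dim, rows in third_sets.items()}

second_set = [
    {"user": "u1", "online_content": "music", "online_location": "cell-17", "count": 3},
    {"user": "u1", "online_content": "music", "online_location": "cell-20", "count": 2},
    {"user": "u2", "online_content": "news", "online_location": "cell-17", "count": 1},
]
third = split_by_dimension(second_set, ("online_content", "online_location"))
```

Storing each dimension separately is what lets a later query name a dimension and read only the matching set.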
Preferably, in this embodiment of the present invention, after saving the third log data sets for the different dimensions to the Hadoop database, the mining system may also save them to a column storage array, so that the Hadoop database and the column storage array can work together to meet the data requirements of different application scenarios.
In this embodiment of the present invention, the first saving module 401 saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the parallel aggregation module 402 performs parallel aggregation on those sets using the preset parallel computing model to obtain a second log data set. Finally, the partition-and-save module 403 divides the log data in the second log data set by dimension and saves the resulting third log data sets, one per dimension, to the Hadoop database.
In this embodiment of the present invention, the mining system saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, it performs parallel aggregation on them using the parallel computing model in Hadoop to obtain a second log data set, divides the log data in the second log data set by dimension, and saves the resulting third log data sets, one per dimension, to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in it in a distributed manner and computing over it with Hadoop's parallel computing model makes it possible to mine massive data quickly and effectively, meeting the storage and computing requirements of such mining.
Referring to Fig. 5, a schematic diagram of functional modules added to the second embodiment shown in Fig. 4, the system further includes:
An acquisition module 501, configured to acquire the log data for the current time period from the network side.
In this embodiment of the present invention, the acquisition module 501 acquires the log data for the current time period from the network side. Specifically, the acquisition module 501 may acquire it by log data extraction, by web crawling, by reading it from the BOSS business-and-accounting database on the network side, by receiving it from a third-party vendor on the network side, or by combining at least two of these methods.
A first aggregation module 502, configured to aggregate the log data for the current time period to obtain the first log data set for the current time period.
The first aggregation module 502 may classify the log data by content, treating log data with the same content, or content of the same category, as a single record and accumulating its count. The first log data set obtained after aggregation is far smaller in order of magnitude than the log data acquired for the current time period, but the meaning of the data is fully preserved.
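The count-accumulating aggregation performed by the first aggregation module can be sketched with a counter. The record fields (`user`, `content`) and the choice of `(user, content)` as the grouping key are assumptions for illustration; the text only says that records with the same content or content category are merged into one record with an accumulated count.

```python
from collections import Counter

def aggregate_period(raw_records):
    """Collapse raw log records into one counted entry per (user, content)."""
    return Counter((r["user"], r["content"]) for r in raw_records)

# Raw log data for one time period (hypothetical records).
raw = [
    {"user": "u1", "content": "music"},
    {"user": "u1", "content": "music"},
    {"user": "u1", "content": "music"},
    {"user": "u2", "content": "news"},
    {"user": "u2", "content": "music"},
]
first_set = aggregate_period(raw)
```

Here five raw records shrink to three counted entries, while the total number of events remains recoverable from the counts, which is the sense in which the data's meaning is preserved.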
In this embodiment of the present invention, the mining system runs the first saving module 401 of the embodiment shown in Fig. 4 only after the first aggregation module 502 has run.
In this embodiment of the present invention, the system further includes a cleaning module 503.
The cleaning module 503 is configured to clean the log data for the current time period after the acquisition module 501 has acquired it, yielding cleaned log data for the current time period.
If the mining system runs the cleaning module 503, the first aggregation module 502 is specifically configured to aggregate the cleaned log data for the current time period to obtain the first log data set for the current time period.
In this embodiment of the present invention, the mining system obtains the first log data set through the additional steps shown in Fig. 2. Aggregating the log data acquired from the network side for the current time period effectively reduces its order of magnitude, which reduces the storage space required in the Hadoop database. The mining system can also clean the log data for the current time period to remove useless or erroneous records, which reduces the amount of log data to be processed and makes the subsequent data mining more effective.
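A cleaning pass of the kind described can be as simple as dropping records that lack fields the later stages depend on. The required field names and the rule itself are assumptions made for this sketch; the text says only that useless or erroneous log data is removed, without defining either.

```python
# Hypothetical fields a record must carry to be usable downstream.
REQUIRED_FIELDS = ("user", "content", "time")

def clean(records):
    """Keep only records in which every required field is present and non-empty."""
    cleaned = []
    for record in records:
        if all(record.get(field) for field in REQUIRED_FIELDS):
            cleaned.append(record)
    return cleaned

raw = [
    {"user": "u1", "content": "music", "time": "2015-12-02T09:00"},
    {"user": "", "content": "news", "time": "2015-12-02T09:01"},   # empty user
    {"user": "u2", "content": "music"},                            # missing time
]
usable = clean(raw)
```

Running this before aggregation means malformed records never contribute to the counts in the first log data set.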
Referring to Fig. 6, a schematic diagram of functional modules added to the second embodiment shown in Fig. 4, the system further includes:
A reading module 601, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the instruction.
An analysis module 602, configured to perform data analysis on the third log data set and display the result of the analysis on a display interface.
The analysis module 602 includes:
A clustering module 603, configured to group the users in the third log data set using a preset clustering algorithm to obtain a user group list.
An acquisition-and-display module 604, configured to derive, from the log data of the users in the user group list, level configuration tables for at least two user dimensions and to display these tables on the display interface. The user dimensions are preset, and each level configuration table contains the levels assigned to the users in the user group list when they are graded along that user dimension.
The user dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated under each. For example, suppose the user groups obtained by the mining system include an all-users group and a Weibo user group. For the all-users group, all users in the group are ranked by the amount of traffic they use: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, until every user in the all-users group has a star rating. This is the horizontal dimension rating. For the Weibo user group, users are ranked by the amount of traffic generated after they launch Weibo: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, until every user in the Weibo user group has a star rating. This is the vertical dimension rating. Together, the horizontal and vertical ratings make it possible to present a profile of each user group, so that business experts can devise targeted plans for specific group profiles.
Preferably, the preset clustering algorithm may be the K-means algorithm.
The query dimension is set based on the dimensions of the third log data sets stored in the Hadoop database. For example, the query dimension may be any one or more of online content, online time, online location, and so on.
In this embodiment of the present invention, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that set, and displays the result on the display interface, so that the results of data mining can be presented to the user effectively.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part of it that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510875453.3A (CN106815274B) | 2015-12-02 | 2015-12-02 | Hadoop-based log data mining method and system |
| PCT/CN2016/097363 WO2017092444A1 (en) | 2015-12-02 | 2016-08-30 | Log data mining method and system based on hadoop |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510875453.3A CN106815274B (en) | 2015-12-02 | 2015-12-02 | Hadoop-based log data mining method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106815274A | 2017-06-09 |
| CN106815274B | 2022-02-18 |
Family
ID=58796202
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201510875453.3A Active CN106815274B (en) | 2015-12-02 | 2015-12-02 | Hadoop-based log data mining method and system |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN106815274B (en) |
| WO (1) | WO2017092444A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107391645A (en) * | 2017-07-12 | 2017-11-24 | 广州市昊链信息科技股份有限公司 | A kind of logistics information automatic push and practical operation specification form system and method |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107241231B (en) * | 2017-07-26 | 2020-04-03 | 成都科来软件有限公司 | Rapid and accurate positioning method for original network data packet |
| CN112287208B (en) * | 2019-09-30 | 2024-03-01 | 北京沃东天骏信息技术有限公司 | User portrait generation method, device, electronic device and storage medium |
| WO2021102888A1 (en) * | 2019-11-29 | 2021-06-03 | 京东方科技集团股份有限公司 | Data processing device and method, and computer-readable storage medium |
| CN111597179B (en) * | 2020-05-18 | 2023-12-05 | 北京思特奇信息技术股份有限公司 | Method and device for automatically cleaning data, electronic equipment and storage medium |
| CN112632020B (en) * | 2020-12-25 | 2022-03-18 | 中国电子科技集团公司第三十研究所 | Log information type extraction method and mining method based on spark big data platform |
Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6732123B1 (en) * | 1998-02-23 | 2004-05-04 | International Business Machines Corporation | Database recovery to any point in time in an online environment utilizing disaster recovery technology |
| US20070055687A1 (en) * | 2005-09-02 | 2007-03-08 | International Business Machines Corporation | System and method for minimizing data outage time and data loss while handling errors detected during recovery |
| KR20090050405A (en) * | 2007-11-15 | 2009-05-20 | 한국전자통신연구원 | Method and apparatus for classifying user's behavior based on event log in context-aware system environment |
| CN101483557A (en) * | 2009-03-03 | 2009-07-15 | 中兴通讯股份有限公司 | Log statistic, storing method and system used for deep packet detection apparatus |
| US20100306286A1 (en) * | 2009-03-05 | 2010-12-02 | Chi-Hsien Chiu | Distributed steam processing |
| CN102685221A (en) * | 2012-04-29 | 2012-09-19 | 华北电力大学(保定) | Distributed storage and parallel mining method for state monitoring data |
| US20140304401A1 (en) * | 2013-04-06 | 2014-10-09 | Citrix Systems, Inc. | Systems and methods to collect logs from multiple nodes in a cluster of load balancers |
| CN104182506A (en) * | 2014-08-19 | 2014-12-03 | 浪潮(北京)电子信息产业有限公司 | Log management method |
| CN104301360A (en) * | 2013-07-19 | 2015-01-21 | 阿里巴巴集团控股有限公司 | Method, log server and system for recording log data |
| US20150081668A1 (en) * | 2013-09-13 | 2015-03-19 | Nec Laboratories America, Inc. | Systems and methods for tuning multi-store systems to speed up big data query workload |
| CN104616092A (en) * | 2014-12-16 | 2015-05-13 | 国家电网公司 | Distributed log analysis based distributed mode handling method |
| CN104969213A (en) * | 2013-01-31 | 2015-10-07 | 脸谱公司 | Data stream splitting for low-latency data access |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN100481077C (en) * | 2006-01-12 | 2009-04-22 | 国际商业机器公司 | Visual method and device for strengthening search result guide |
| CN103036921B (en) * | 2011-09-29 | 2015-09-23 | 北京新媒传信科技有限公司 | A kind of user behavior analysis system and method |
| CN103955502B (en) * | 2014-04-24 | 2017-07-28 | 科技谷(厦门)信息技术有限公司 | A kind of visualization OLAP application realization method and system |
| CN104317958B (en) * | 2014-11-12 | 2018-01-16 | 北京国双科技有限公司 | A kind of real-time data processing method and system |
- 2015-12-02: CN application CN201510875453.3A (CN106815274B), status: Active
- 2016-08-30: WO application PCT/CN2016/097363 (WO2017092444A1), status: Ceased
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107391645A (en) * | 2017-07-12 | 2017-11-24 | 广州市昊链信息科技股份有限公司 | A kind of logistics information automatic push and practical operation specification form system and method |
| CN107391645B (en) * | 2017-07-12 | 2018-04-10 | 广州市昊链信息科技股份有限公司 | A kind of logistics information automatic push and practical operation specification form system and method |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2017092444A1 (en) | 2017-06-08 |
| CN106815274B (en) | 2022-02-18 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |