CN106815274B - Hadoop-based log data mining method and system - Google Patents

Publication number
CN106815274B
Authority
CN
China
Prior art keywords
log data
time period
current time
data set
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510875453.3A
Other languages
Chinese (zh)
Other versions
CN106815274A (en)
Inventor
惠羿
熊伟
哈景楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201510875453.3A (CN106815274B)
Priority to PCT/CN2016/097363 (WO2017092444A1)
Publication of CN106815274A
Application granted
Publication of CN106815274B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based log data mining method. The method saves the first log data set acquired in the current time period into a Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, a preset parallel operation model performs parallel aggregation processing on those first log data sets to obtain a second log data set. The log data in the second log data set is then divided by dimension according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved into the Hadoop database. The invention also discloses a Hadoop-based log data mining system. The invention can mine massive data quickly and effectively, and meets the storage and computation requirements of mining massive data.

Description

Hadoop-based log data mining method and system
Technical Field
The invention relates to the field of computer data processing, in particular to a log data mining method and system based on Hadoop.
Background
In the internet era, a core demand of many enterprises, including operators, is to quickly derive appropriate, quantifiable and predictable precision-marketing strategies from an ever-growing mass of user information.
However, the traditional database has limited data operation capability and expensive storage cost, and cannot meet the requirement of mining mass data.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a log data mining method and system based on Hadoop, and aims to solve the technical problems that a traditional database is limited in data operation capacity, expensive in storage cost and incapable of providing massive data mining.
In order to achieve the above object, the invention provides a log data mining method based on Hadoop, comprising:
storing the acquired first log data set in the current time period into a Hadoop database;
if the number of the first log data sets stored in the Hadoop database meets a preset numerical value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set;
and performing dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and storing the obtained third log data sets corresponding to different dimensions into the Hadoop database.
Preferably, the method further comprises:
acquiring log data in the current time period from a network side;
and carrying out aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
Preferably, the step of obtaining the log data in the current time period from the network side further includes:
performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
the step of performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period includes:
and carrying out aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
Preferably, the method further comprises:
if a data query instruction is received, reading a third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
and performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
Preferably, the performing data analysis on the third log data set includes:
performing user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
obtaining a level configuration table corresponding to at least two user dimensions according to log data of users in a user grouping list, wherein the user dimensions are preset, and the level configuration table comprises levels determined by the users in the user grouping list according to the user dimensions in a grading manner.
In order to achieve the above object, the present invention further provides a log data mining system based on Hadoop, including:
the first storage module is used for storing the acquired first log data set in the current time period into a Hadoop database;
the parallel aggregation module is used for performing parallel aggregation processing on the first log data set in the Hadoop database by using a preset parallel operation model to obtain a second log data set if the number of the first log data sets stored in the Hadoop database meets a preset numerical value;
and the division and storage module is used for performing dimension division on the log data in the second log data set according to the dimension of the log data in the second log data set, and storing the obtained third log data sets corresponding to different dimensions into the Hadoop database.
Preferably, the system further comprises:
the acquisition module is used for acquiring the log data in the current time period from a network side;
and the first aggregation module is used for performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
Preferably, the system further comprises a cleaning module;
the cleaning module is used for cleaning the log data in the current time period after the acquisition module acquires the log data in the current time period to obtain the cleaned log data in the current time period;
and the first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
Preferably, the system further comprises:
the reading module is used for reading a third log data set corresponding to a query dimension from the Hadoop database according to the query dimension contained in the data query instruction if the data query instruction is received;
and the analysis module is used for carrying out data analysis on the third log data set and displaying the result of the data analysis on a display interface.
Preferably, the analysis module comprises:
the clustering module is used for carrying out user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
an obtaining and displaying module, configured to obtain a level configuration table corresponding to at least two user dimensions according to log data of users in a user grouping list, where the user dimensions are preset, and the level configuration table includes levels determined by users in the user grouping list in a hierarchical manner according to the user dimensions.
The invention provides a Hadoop-based log data mining method. A first log data set for the current time period is stored into a Hadoop database. If the number of first log data sets stored in the Hadoop database meets a preset value, a preset parallel operation model performs parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension according to the dimensions of that log data, and the third log data sets corresponding to the different dimensions are stored into the Hadoop database, which completes the mining of the log data. Because the Hadoop database offers good distributed storage and parallel operation capabilities, storing the log data in a distributed manner in the Hadoop database and processing it in parallel with the parallel operation model allows massive data to be mined quickly and effectively, and meets the storage and computation requirements of mining massive data.
Drawings
FIG. 1 is a schematic flow chart of a Hadoop-based log data mining method according to a first embodiment of the present invention;
FIG. 2 is a schematic flow chart showing additional steps before step 101 of the first embodiment of FIG. 1;
FIG. 3 is a flow chart illustrating additional steps after step 103 of the first embodiment of FIG. 1;
FIG. 4 is a diagram illustrating functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention;
FIG. 5 is a diagram of additional functional modules in the second embodiment of FIG. 4;
FIG. 6 is a schematic diagram of additional functional modules in the second embodiment of FIG. 4.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a Hadoop-based log data mining method. A first log data set for the current time period is stored into a Hadoop database. If the number of first log data sets stored in the Hadoop database meets a preset value, a preset parallel operation model performs parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension according to the dimensions of that log data, and the third log data sets corresponding to the different dimensions are stored into the Hadoop database, which completes the mining of the log data. Because the Hadoop database offers good distributed storage and parallel operation capabilities, storing the log data in a distributed manner in the Hadoop database and processing it in parallel with the preset parallel operation model in Hadoop allows massive data to be mined quickly and effectively, and meets the storage and computation requirements of mining massive data.
Referring to fig. 1, a schematic flow chart of a Hadoop-based log data mining method according to a first embodiment of the present invention includes:
Step 101, storing the acquired first log data set in the current time period into a Hadoop database;
In the embodiment of the invention, the Hadoop-based log data mining method can be applied to a Hadoop-based log data mining system (hereinafter referred to as the mining system), and the mining system stores the acquired first log data set in the current time period into a Hadoop database.
The mining system acquires the first log data set once per time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute or 30-minute time period.
The time period is a period for acquiring data, and the duration of the time period can be determined according to the size of the data volume.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and a parallel operation model: the Hadoop database provides distributed storage for massive data, and the parallel operation model provides parallel operation on massive data.
Preferably, the parallel operation model is the MapReduce operation model.
Step 102, if the number of the first log data sets stored in the Hadoop database meets a preset numerical value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set;
In the embodiment of the invention, the mining system stores the acquired first log data set into the Hadoop database in each time period. If the number of first log data sets stored in the Hadoop database meets a preset numerical value, the first log data sets in the Hadoop database can be aggregated by using the preset parallel operation model in the Hadoop framework to obtain a second log data set.
In practical applications, the value may be preset according to specific needs, for example, if the time period is 15 minutes and aggregation processing needs to be performed on the first log data set within one hour, the preset value is 4; if the time period is 30 minutes and the aggregation process needs to be performed on the first log data set within 1 day, the preset value is 48.
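The relationship between the time period, the aggregation window and the preset value can be sketched as follows. This is a minimal illustration only: the function names `preset_value` and `should_aggregate` are hypothetical, and reading "meets a preset numerical value" as "greater than or equal to" is an assumption, since the text does not specify the comparison.

```python
from datetime import timedelta


def preset_value(period: timedelta, window: timedelta) -> int:
    """Number of per-period first log data sets that make up one aggregation window."""
    if period <= timedelta(0) or window % period != timedelta(0):
        raise ValueError("window must be a positive whole multiple of period")
    return window // period


def should_aggregate(saved_set_count: int, preset: int) -> bool:
    """Assumed trigger: aggregate once enough first log data sets have been saved."""
    return saved_set_count >= preset


# The two examples from the text: 15-minute periods over one hour,
# and 30-minute periods over one day.
hourly_preset = preset_value(timedelta(minutes=15), timedelta(hours=1))
daily_preset = preset_value(timedelta(minutes=30), timedelta(days=1))
```

With these inputs `hourly_preset` is 4 and `daily_preset` is 48, matching the values given above.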
It will be appreciated that, based on the aggregation process described above, the mining system may derive log data sets for different time spans in a similar manner. For example, a one-hour log data set can be obtained from four first log data sets of 15 minutes each, a one-day log data set from 24 one-hour log data sets, and a one-month log data set from 30 one-day log data sets. Log data sets covering different time spans can thus be obtained to meet different requirements.
In the embodiment of the invention, when the mining system performs parallel aggregation processing with the preset parallel operation model, the count values of identical log data are accumulated.
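The accumulation of count values during parallel aggregation can be sketched in the MapReduce style. The code below is an in-process simulation in plain Python, not the Hadoop MapReduce API. The record layout (content, position, hour) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # hypothetical key: (content, position, hour)
LogSet = Dict[Record, int]     # one log data set: record -> count value


def map_phase(log_set: LogSet) -> Iterator[Tuple[Record, int]]:
    """Emit one (record, count) pair per entry of a first log data set."""
    for record, count in log_set.items():
        yield record, count


def shuffle(pairs: Iterable[Tuple[Record, int]]) -> Dict[Record, List[int]]:
    """Group emitted counts by record, as the framework would between map and reduce."""
    grouped: Dict[Record, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[Record, List[int]]) -> LogSet:
    """Accumulate the count values of identical records."""
    return {key: sum(values) for key, values in grouped.items()}


def parallel_aggregate(first_sets: List[LogSet]) -> LogSet:
    """Combine several first log data sets into one second log data set."""
    pairs = (pair for log_set in first_sets for pair in map_phase(log_set))
    return reduce_phase(shuffle(pairs))
```

In a real deployment the map and reduce functions would run as distributed tasks; here they run sequentially so that the data flow is easy to follow.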
Step 103, performing dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and storing the obtained third log data sets corresponding to different dimensions into the Hadoop database.
In the embodiment of the invention, after obtaining the second log data set, the mining system divides the log data in the second log data set by dimension according to the dimensions of that log data, and stores the resulting third log data sets corresponding to the different dimensions in the Hadoop database, thereby mining the massive log data. The stored third log data sets can serve as data sources for user data queries and support charts, graphical queries and multi-dimensional queries on a display interface, so that the data can be presented from multiple angles and the results of data mining can be shown.
Log data has many dimensions, including but not limited to internet surfing content, internet surfing position and internet surfing time. The internet surfing content refers to what the user browses, which may be a specific site, such as Baidu, Sohu or Sina Weibo, or a type of website, such as music or movies. The internet surfing position refers to the geographical range in which the IP address used by the user is located, and the internet surfing time refers to the time at which the log data was generated. Dimension division describes the user's overall behavior further through the data in each dimension, according to the requirements of the system. It should be noted that different types of log data have different dimensions. For example, when the technical scheme of the embodiment of the invention is used to mine users' traffic data in the log data, the dimensions may also include internet surfing frequency, user age, monthly consumption and the like, in addition to internet surfing content, position and time. In practical applications, dimension division can therefore be performed according to specific needs and is not limited here.
Preferably, in the embodiment of the present invention, after the mining system stores the third log data sets corresponding to different dimensions into the Hadoop database, the mining system may also store the third log data sets corresponding to different dimensions into the column storage array, so that cooperative work of the Hadoop database and the column storage array can be realized, and data requirements of different application scenarios can be met.
Preferably, the mining system executes the operations of parallel aggregation processing and dimension division only when the number of the first log data sets stored in the Hadoop database satisfies a preset value, so that the obtained third log data set actually corresponds to a time period, and the mining system can store the corresponding relationship among the dimension, the time period and the third log data set when storing the third log data set.
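One possible reading of dimension division, together with the stored correspondence among dimension, time period and third log data set, is sketched below. The field order in `DIMENSIONS`, the period label format and the function name are hypothetical, and a plain dictionary stands in for the Hadoop database.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical record layout: each record is a (content, position, time) tuple.
DIMENSIONS = ("content", "position", "time")


def divide_by_dimension(
    second_set: Dict[Tuple[str, str, str], int], period_label: str
) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Split a second log data set into one third log data set per dimension.

    The result is keyed by (dimension, period_label) so that the correspondence
    among dimension, time period and third log data set is kept.
    """
    third_sets: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for record, count in second_set.items():
        for dimension, value in zip(DIMENSIONS, record):
            third_sets[(dimension, period_label)][value] += count
    return {key: dict(values) for key, values in third_sets.items()}
```

Each third log data set here holds the accumulated count per value of a single dimension, for example the total count per type of internet surfing content.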
In the embodiment of the invention, the mining system stores the acquired first log data set in the current time period into the Hadoop database. If the number of first log data sets stored in the Hadoop database meets a preset numerical value, the first log data sets in the Hadoop database are subjected to parallel aggregation processing by using a preset parallel operation model to obtain a second log data set. The log data in the second log data set is divided by dimension according to the dimensions of that log data, and the obtained third log data sets corresponding to the different dimensions are stored into the Hadoop database, which completes the mining of the log data. Because the Hadoop database offers good distributed storage and parallel operation capabilities, storing the log data in a distributed manner in the Hadoop database and processing it in parallel with the parallel operation model in Hadoop allows massive data to be mined quickly and effectively, and meets the storage and computation requirements of mining massive data.
Referring to fig. 2, a flow chart illustrating an additional step before step 101 in the first embodiment of fig. 1 according to the present invention includes:
Step 201, obtaining log data in the current time period from a network side;
In the embodiment of the present invention, the mining system obtains log data in the current time period from the network side. Specifically, the mining system may extract the log data from the network side, collect it using web crawler technology, read it from a BOSS accounting database on the network side, receive it from a third-party vendor on the network side, or combine at least two of these approaches.
Step 202, performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
In the embodiment of the invention, after acquiring the log data in the current time period, the mining system performs aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
In step 202, the aggregation may classify the log data by content and merge log data with the same content, or the same class of content, into a single record with an accumulated count. The first log data set obtained after aggregation is orders of magnitude smaller than the log data acquired in the current time period, while the meaning of the data is fully preserved.
In the embodiment of the present invention, the mining system implements acquisition of the first log data set through the additional steps shown in fig. 2, and by aggregating the log data acquired from the network side in the current time period, the magnitude of the log data can be effectively reduced, so that the storage space required in the Hadoop database is reduced, and the storage space is saved.
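The per-period aggregation of step 202 can be sketched as collapsing identical records into one entry with a count. This is a minimal sketch that assumes each raw log record has already been reduced to a hashable tuple; the tuple layout and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def aggregate_period(raw_records: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Merge identical raw log records into one entry carrying an accumulated count."""
    return dict(Counter(raw_records))


# Three raw records collapse into two entries of the first log data set.
first_log_data_set = aggregate_period(
    [("u1", "music"), ("u1", "music"), ("u2", "video")]
)
```

The resulting set is smaller than the raw input whenever records repeat, which is the source of the storage saving described above.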
Preferably, in the embodiment of the present invention, before performing step 202, the mining system may further perform the following steps:
performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
In the embodiment of the invention, before the mining system aggregates the acquired log data in the current time period, the mining system can also perform data cleaning on the log data in the current time period to obtain the cleaned log data in the current time period.
If the mining system performs the above step, step 202 is adjusted accordingly to:
and carrying out aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
Data cleaning may remove log data that does not conform to a preset data type, and/or find recognizable errors in the log data and correct or delete the affected log data.
In the embodiment of the invention, the mining system can remove some useless or error log data by performing data cleaning on the log data in the current time period, reduce the number of log data processing and facilitate better data mining.
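A data cleaning pass of this kind might look like the sketch below. The schema in `REQUIRED_FIELDS` and the specific rules (numeric text converted to an integer, surrounding whitespace stripped, content lower-cased, negative byte counts rejected) are illustrative assumptions; the text does not define the preset data type or the recognizable errors.

```python
from typing import Any, Dict, Iterable, List

REQUIRED_FIELDS = ("user", "content", "bytes")  # hypothetical record schema


def clean_records(raw_records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop records that do not fit the assumed schema and fix simple errors."""
    cleaned = []
    for record in raw_records:
        # Remove records that lack a required field.
        if not all(field in record for field in REQUIRED_FIELDS):
            continue
        # Correct a recognizable error: a byte count stored as numeric text.
        try:
            byte_count = int(record["bytes"])
        except (TypeError, ValueError):
            continue  # not recoverable, so the record is deleted
        if byte_count < 0:
            continue
        cleaned.append(
            {
                "user": str(record["user"]).strip(),
                "content": str(record["content"]).strip().lower(),
                "bytes": byte_count,
            }
        )
    return cleaned
```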
Referring to fig. 3, a flow chart illustrating additional steps after step 103 in the first embodiment of fig. 1 according to the present invention includes:
Step 301, if a data query instruction is received, reading a third log data set corresponding to a query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
In the embodiment of the invention, after the mining system stores the obtained third log data sets in the Hadoop database, a user can request a data query by inputting a data query instruction. If the mining system receives the data query instruction, it reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the instruction.
Preferably, the data query instruction may further include a certain time period, and the mining system reads a third log data set corresponding to the query dimension in the time period.
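Reading by query dimension, optionally narrowed to a time period, can be sketched as a keyed lookup. A dictionary keyed by (dimension, time period) stands in for the Hadoop database here; that layout and the function name `read_third_sets` are assumptions.

```python
from typing import Dict, Optional, Tuple

Store = Dict[Tuple[str, str], Dict[str, int]]  # (dimension, period) -> third log data set


def read_third_sets(
    store: Store, dimension: str, period: Optional[str] = None
) -> Dict[str, Dict[str, int]]:
    """Return the third log data sets for a query dimension, keyed by time period.

    If `period` is given, only the set for that time period is returned.
    """
    return {
        stored_period: data
        for (stored_dimension, stored_period), data in store.items()
        if stored_dimension == dimension and (period is None or stored_period == period)
    }
```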
Step 302, performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
In the embodiment of the present invention, the mining system further performs data analysis on the third log data set, and displays a result of the data analysis on a display interface, specifically: the mining system carries out user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list; obtaining a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and displaying the level configuration table on a display interface; the user dimension is preset, and the level configuration table comprises the level determined by the user in the user grouping list according to the user dimension.
The user dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated in each dimension. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog user group. For the all-users group, all users in the group are ranked by the traffic they have used: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, which determines the star level of every user in the all-users group. This is the horizontal dimension rating. For the microblog user group, the users are ranked by the traffic generated after they open the microblog: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, which determines the star level of every user in the microblog user group. This is the vertical dimension rating. With the horizontal and vertical dimension ratings, a portrait of each user group can be displayed, and a service expert can devise a targeted scheme for the portrait of a specific group.
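The star rating described above can be sketched as a rank-based banding. The sketch assumes five equal bands of 20% (the text states only the five-star and four-star bands and continues "and so on") and breaks ties by user identifier; the function name is hypothetical.

```python
from typing import Dict


def star_ratings(traffic_by_user: Dict[str, float]) -> Dict[str, int]:
    """Rate users from 5 stars (top 20% by traffic) down to 1 star (bottom 20%)."""
    # Rank by traffic, highest first; ties are broken by user id (an assumption).
    ranked = sorted(traffic_by_user, key=lambda user: (-traffic_by_user[user], user))
    total = len(ranked)
    # Integer arithmetic maps rank index 0..total-1 onto bands 5..1.
    return {user: 5 - (index * 5) // total for index, user in enumerate(ranked)}
```

The horizontal rating would apply this function to the total traffic of the all-users group, and the vertical rating to the microblog traffic of the microblog user group.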
Preferably, the preset clustering algorithm may be a K-means algorithm.
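For reference, a minimal K-means grouping in plain Python is sketched below. The choice of features, the use of the first k points as initial centroids and the iteration limit are all assumptions made for illustration; the text does not specify how the algorithm is configured.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_means(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Assign each point (one per user) to one of k groups using Lloyd's algorithm."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k points seed the groups
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty group's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignments
```

Users that share a group index would form one entry of the user grouping list.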
The query dimension is set based on a dimension corresponding to a third log data set stored in the Hadoop database, for example: the query dimension can be any one or more of internet surfing content, internet surfing time, internet surfing position and the like.
In the embodiment of the invention, the mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction, performs data analysis on the third log data set, and displays the result of the data analysis on the display interface, so that the result of the data mining can be effectively displayed to a user.
It should be noted that, in the embodiment of the present invention, the method for mining log data based on the Hadoop database may be applied to a precise marketing system of traffic data, for example, mining of a target user, mining of a marketing site, and the like may be implemented by using the technical solutions described in the embodiments shown in fig. 1 to fig. 3, so as to provide a data basis for targeted and refined marketing of the target user or a target base station cell by an operator.
If the target user needs to be determined, the query dimension in step 301 of the embodiment shown in fig. 3 may be internet surfing content or internet traffic; if the target base station cell needs to be determined, the query dimension may be the internet surfing position.
In practical applications, the user may select the query dimension according to specific needs, which is not limited herein.
Referring to fig. 4, a schematic diagram of functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention includes:
the first saving module 401 is configured to save the acquired first log data set in the current time period to a Hadoop database;
The mining system acquires the first log data set once per time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute or 30-minute time period.
The time period is a period for acquiring data, and the duration of the time period can be determined according to the size of the data volume.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and a parallel operation model: the Hadoop database provides distributed storage for massive data, and the parallel operation model provides parallel operation on massive data.
Preferably, the parallel operation model is the MapReduce operation model.
A parallel aggregation module 402, configured to perform parallel aggregation processing on a first log data set in the Hadoop database by using a preset parallel operation model if the number of the first log data sets stored in the Hadoop database meets a preset numerical value, so as to obtain a second log data set;
In practical applications, the value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
It is understood that, based on the above aggregation process, the parallel aggregation module 402 can obtain log data sets for different time spans in a similar manner. For example, a one-hour log data set can be obtained from four 15-minute first log data sets, a one-day log data set from 24 one-hour log data sets, a one-month log data set from 30 one-day log data sets, and so on. Log data sets covering different time spans can thus be obtained to meet different requirements.
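As a non-limiting illustration, the relationship between the time period, the aggregation window and the preset value, together with the merge that is triggered once enough first log data sets have been saved, might be sketched in Python as follows. The names `preset_count` and `aggregate_if_ready` are assumptions introduced for this sketch, each data set is modelled as a `Counter` of record counts, "meets the preset value" is read here as "at least that many sets are stored", and the in-process `reduce` only stands in for the reduce step of a MapReduce job.

```python
from collections import Counter
from functools import reduce
from typing import List, Optional


def preset_count(window_minutes: int, period_minutes: int) -> int:
    """Number of per-period data sets that make up one aggregation window."""
    if window_minutes % period_minutes != 0:
        raise ValueError("window must be a whole multiple of the time period")
    return window_minutes // period_minutes


def aggregate_if_ready(first_sets: List[Counter], preset: int) -> Optional[Counter]:
    """Merge the stored first data sets once `preset` of them are available.

    Returns None while too few sets have been saved.
    """
    if len(first_sets) < preset:
        return None
    # Counter addition sums the counts of identical keys, which is the
    # same combine operation a reduce task would apply per key.
    return reduce(lambda a, b: a + b, first_sets[:preset], Counter())
```

With a 15-minute period and a one-hour window, `preset_count(60, 15)` gives 4; with a 30-minute period and a one-day window, `preset_count(24 * 60, 30)` gives 48, matching the examples above.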
The division and storage module 403 is configured to perform dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and store the obtained third log data sets corresponding to different dimensions into the Hadoop database.
The log data has many dimensions, including but not limited to internet surfing content, internet surfing position and internet surfing time. The internet surfing content refers to what the user browses, which may be a specific website, such as Baidu, Sohu or Sina Weibo, or a type of website, such as music or movies. The internet surfing position refers to the geographical range of the IP address used by the user, and the internet surfing time refers to the time at which the log data is generated. Dimension division uses the data along each dimension to further describe the overall behavior of the user according to the requirements of the system. It should be noted that different types of log data have different dimensions. For example, when the technical scheme of the embodiment of the invention is used to mine the traffic data of users in the log data, the dimensions may include internet surfing frequency, user age, monthly consumption and the like, in addition to internet surfing content, internet surfing position and internet surfing time. In practical applications, dimension division can therefore be performed according to specific needs, and is not limited here.
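One possible way to picture dimension division is shown in the following Python sketch. It assumes that each log record is a dictionary and that the dimensions are stored under the hypothetical field names `content`, `location` and `time`; neither the record layout nor the function name `divide_by_dimension` comes from the embodiment.

```python
from collections import defaultdict

# Assumed field names for the three dimensions named in the description.
DIMENSIONS = ("content", "location", "time")


def divide_by_dimension(records, dimensions=DIMENSIONS):
    """Build one data set per dimension: dimension -> value -> list of records."""
    third_sets = {dim: defaultdict(list) for dim in dimensions}
    for record in records:
        for dim in dimensions:
            # A record lacking a dimension is simply left out of that set.
            if dim in record:
                third_sets[dim][record[dim]].append(record)
    return third_sets
```

Each entry of the returned dictionary plays the role of a third log data set for one dimension, so that a later query on, say, `location` only has to read that one set.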
Preferably, in the embodiment of the present invention, after the mining system stores the third log data sets corresponding to different dimensions into the Hadoop database, the mining system may also store the third log data sets corresponding to different dimensions into the column storage array, so that cooperative work of the Hadoop database and the column storage array can be realized, and data requirements of different application scenarios can be met.
In this embodiment of the present invention, the first saving module 401 saves the acquired first log data set in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset numerical value, the parallel aggregation module 402 performs parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel operation model to obtain a second log data set. Finally, the division and storage module 403 performs dimension division on the log data in the second log data set according to the dimensions of that log data, and saves the resulting third log data sets corresponding to different dimensions to the Hadoop database.
In the embodiment of the invention, the mining system saves the acquired first log data set in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset numerical value, the parallel operation model is used to perform parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension according to its dimensions, and the third log data sets corresponding to different dimensions are saved to the Hadoop database, which completes the mining of the log data. Because Hadoop offers good distributed storage and parallel computing capabilities, storing the log data in a distributed manner in the Hadoop database and processing it in parallel with the parallel operation model allows massive data to be mined quickly and effectively, meeting the storage and computation requirements of mining massive data.
Please refer to fig. 5, which is a schematic diagram of additional functional modules in the second embodiment shown in fig. 4, including:
an obtaining module 501, configured to obtain log data in a current time period from a network side;
In this embodiment of the present invention, the obtaining module 501 may obtain the log data in the current time period from the network side in any of the following ways: by extracting the log data from the network side; by using a web crawler technology; by reading it from a BOSS accounting database on the network side; by receiving it from a third-party vendor on the network side; or by combining at least two of the above ways.
A first aggregation module 502, configured to perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
The first aggregation module 502 may classify the log data according to its content and merge log data having the same content, or the same class of content, into a single record that carries a count. The first log data set obtained after aggregation is orders of magnitude smaller than the raw log data acquired in the current time period, while the meaning of the data is fully preserved.
In the embodiment of the present invention, the mining system does not start executing the first saving module 401 of the embodiment shown in fig. 4 until the first aggregation module 502 has finished executing.
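The classify-then-count behavior described for the first aggregation module can be illustrated with a minimal Python sketch. The host names, the `CATEGORY_BY_HOST` table and the record fields `user` and `host` are hypothetical and serve only to make the example concrete.

```python
from collections import Counter

# Hypothetical classification table: host name -> content class.
CATEGORY_BY_HOST = {
    "music.example.com": "music",
    "radio.example.com": "music",
    "films.example.com": "movies",
}


def classify(host):
    """Map a host to its content class; unknown hosts fall into 'other'."""
    return CATEGORY_BY_HOST.get(host, "other")


def aggregate_period(log_records):
    """Collapse raw records into (user, content class) -> number of records."""
    return Counter((r["user"], classify(r["host"])) for r in log_records)
```

However many raw records a user produces for one content class within the time period, the aggregated set holds a single entry for that pair, which is why the aggregated set is so much smaller than the raw log while the count keeps the information needed for later analysis.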
In an embodiment of the present invention, the system further comprises a cleaning module 503;
the cleaning module 503 is configured to perform data cleaning on the log data in the current time period after the obtaining module 501 obtains the log data in the current time period, so as to obtain the cleaned log data in the current time period;
and if the mining system executes the cleaning module 503, the first aggregation module 502 is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
In the embodiment of the present invention, the mining system acquires the first log data set through the additional steps shown in fig. 2. Aggregating the log data acquired from the network side in the current time period effectively reduces the magnitude of the log data, so that less storage space is required in the Hadoop database. In addition, by performing data cleaning on the log data in the current time period, the mining system can remove useless or erroneous log data, which reduces the amount of log data to be processed and facilitates the data mining.
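The order of operations described above, cleaning first and aggregating the cleaned data second, can be sketched as follows. The embodiment does not state which records count as useless or erroneous, so the rule used here (a record is dropped when a required field is missing or empty) and the field names `user` and `category` are assumptions made for illustration.

```python
from collections import Counter

# Assumed schema: a usable record must carry both of these fields.
REQUIRED_FIELDS = ("user", "category")


def clean(records, required=REQUIRED_FIELDS):
    """Drop records in which a required field is missing or empty."""
    return [r for r in records if all(r.get(field) for field in required)]


def first_log_data_set(records):
    """Clean the raw records, then aggregate them into per-key counts."""
    cleaned = clean(records)
    return Counter((r["user"], r["category"]) for r in cleaned)
```

Cleaning before aggregation means that malformed records never contribute to a count, so the first log data set saved to the database reflects only usable log data.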
Please refer to fig. 6, which is a schematic diagram of additional functional modules of the second embodiment shown in fig. 4, including:
a reading module 601, configured to, if a data query instruction is received, read a third log data set corresponding to a query dimension from the Hadoop database according to the query dimension included in the data query instruction;
an analysis module 602, configured to perform data analysis on the third log data set, and display a result of the data analysis on a display interface.
Wherein the analysis module 602 comprises:
a clustering module 603, configured to perform user grouping on users in the third log data set according to a preset clustering algorithm, so as to obtain a user grouping list;
an obtaining and displaying module 604, configured to obtain a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and display the level configuration table on a display interface; the user dimensions are preset, and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.
The user dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated in each dimension. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog user group. For the all-users group, all users in the group are ranked by the traffic they have used: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star level of each user in the all-users group. This is the horizontal dimension rating. For the microblog user group, the users are ranked by the traffic generated after they start using the microblog: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star level of each user in the microblog user group. This is the vertical dimension rating. Together, the horizontal and vertical dimension ratings provide a portrait of each user group, from which a service expert can derive a targeted scheme for a specific group.
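The 20% banding used for both the horizontal and the vertical rating can be written as one small function. In this sketch, `star_ratings` is a hypothetical name and the input is assumed to be a mapping from user to traffic; the handling of ties and of group sizes that are not a multiple of five is a choice made here, not something the description specifies.

```python
def star_ratings(traffic_by_user):
    """Rank users by traffic (highest first) and assign 5 to 1 stars per 20% band."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    total = len(ranked)
    ratings = {}
    for position, user in enumerate(ranked):
        # position * 5 // total is the zero-based band: 0 for the top 20%,
        # 1 for the next 20%, and so on down to 4 for the bottom 20%.
        ratings[user] = 5 - (position * 5) // total
    return ratings
```

Calling the function with total traffic for the all-users group gives the horizontal rating, and calling it with microblog traffic for the microblog user group gives the vertical rating, so a user can hold a different star level in each dimension.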
Preferably, the preset clustering algorithm may be a K-means algorithm.
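For readers unfamiliar with K-means, the following is a minimal pure-Python sketch of the standard algorithm (Lloyd's iteration) applied to users represented as numeric feature tuples. The choice of features, the fixed iteration count and the naive initialization from the first `k` points are assumptions made to keep the example short and deterministic; the embodiment does not describe these details.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters; returns (assignment, centroids).

    points: list of equal-length numeric tuples, one per user.
    assignment[i] is the cluster index of points[i].
    """
    # Naive deterministic initialization: the first k points act as centroids.
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids
```

Users that receive the same cluster index form one entry of the user grouping list. A production system would more likely rely on a library implementation with a better initialization strategy.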
The query dimension is set based on a dimension corresponding to a third log data set stored in the Hadoop database, for example: the query dimension can be any one or more of internet surfing content, internet surfing time, internet surfing position and the like.
In the embodiment of the invention, the mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction, performs data analysis on the third log data set, and displays the result of the data analysis on the display interface, so that the result of the data mining can be effectively displayed to a user.
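The read path can be pictured with a short Python sketch in which a plain dictionary stands in for the Hadoop database. The table `TARGET_TO_DIMENSIONS`, the dimension keys and the function name `read_third_sets` are hypothetical; the table merely restates the earlier example in which a target user is found via content or traffic and a target base station cell via position.

```python
# Assumed mapping from what is being looked for to the query dimensions.
TARGET_TO_DIMENSIONS = {
    "user": ("content", "traffic"),
    "cell": ("location",),
}


def read_third_sets(store, query_dimensions):
    """Return the stored third log data sets for the requested dimensions.

    store: dict standing in for the database, dimension -> data set.
    """
    missing = [d for d in query_dimensions if d not in store]
    if missing:
        raise KeyError("no third log data set stored for: %s" % missing)
    return {d: store[d] for d in query_dimensions}
```

Because the third log data sets were saved per dimension, a query touches only the sets named in the instruction rather than the whole second log data set.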
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A Hadoop-based log data mining method, characterized by comprising:
saving the acquired first log data set in the current time period to a Hadoop database;
if the number of first log data sets saved in the Hadoop database meets a preset numerical value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set;
performing dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and saving the obtained third log data sets corresponding to different dimensions to the Hadoop database;
if a data query instruction is received, reading a third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
performing user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
obtaining a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and displaying the level configuration table on a display interface; the user dimensions are preset, and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.

2. The method according to claim 1, characterized in that the method further comprises:
acquiring log data in the current time period from a network side;
performing aggregation processing on the log data in the current time period to obtain the first log data set in the current time period.

3. The method according to claim 2, characterized in that, after the step of acquiring log data in the current time period from the network side, the method further comprises:
performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
and the step of performing aggregation processing on the log data in the current time period to obtain the first log data set in the current time period comprises:
performing aggregation processing on the cleaned log data in the current time period to obtain the first log data set in the current time period.

4. A Hadoop-based log data mining system, characterized by comprising:
a first saving module, configured to save the acquired first log data set in the current time period to a Hadoop database;
a parallel aggregation module, configured to, if the number of first log data sets saved in the Hadoop database meets a preset numerical value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set;
a division and storage module, configured to perform dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and save the obtained third log data sets corresponding to different dimensions to the Hadoop database;
a reading module, configured to, if a data query instruction is received, read a third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
a clustering module, configured to perform user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
an obtaining and displaying module, configured to obtain a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and display the level configuration table on a display interface; the user dimensions are preset, and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.

5. The system according to claim 4, characterized in that the system further comprises:
an obtaining module, configured to acquire log data in the current time period from a network side;
a first aggregation module, configured to perform aggregation processing on the log data in the current time period to obtain the first log data set in the current time period.

6. The system according to claim 5, characterized in that the system further comprises a cleaning module;
the cleaning module is configured to, after the obtaining module acquires the log data in the current time period, perform data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
and the first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain the first log data set in the current time period.
CN201510875453.3A 2015-12-02 2015-12-02 Hadoop-based log data mining method and system Active CN106815274B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510875453.3A CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system
PCT/CN2016/097363 WO2017092444A1 (en) 2015-12-02 2016-08-30 Log data mining method and system based on hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510875453.3A CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Publications (2)

Publication Number Publication Date
CN106815274A CN106815274A (en) 2017-06-09
CN106815274B true CN106815274B (en) 2022-02-18

Family

ID=58796202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510875453.3A Active CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Country Status (2)

Country Link
CN (1) CN106815274B (en)
WO (1) WO2017092444A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391645B (en) * 2017-07-12 2018-04-10 广州市昊链信息科技股份有限公司 A kind of logistics information automatic push and practical operation specification form system and method
CN107241231B (en) * 2017-07-26 2020-04-03 成都科来软件有限公司 Rapid and accurate positioning method for original network data packet
CN112287208B (en) * 2019-09-30 2024-03-01 北京沃东天骏信息技术有限公司 User portrait generation method, device, electronic device and storage medium
WO2021102888A1 (en) * 2019-11-29 2021-06-03 京东方科技集团股份有限公司 Data processing device and method, and computer-readable storage medium
CN111597179B (en) * 2020-05-18 2023-12-05 北京思特奇信息技术股份有限公司 Method and device for automatically cleaning data, electronic equipment and storage medium
CN112632020B (en) * 2020-12-25 2022-03-18 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102685221A (en) * 2012-04-29 2012-09-19 华北电力大学(保定) Distributed storage and parallel mining method for state monitoring data
CN104182506A (en) * 2014-08-19 2014-12-03 浪潮(北京)电子信息产业有限公司 Log management method
CN104616092A (en) * 2014-12-16 2015-05-13 国家电网公司 Distributed log analysis based distributed mode handling method

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732123B1 (en) * 1998-02-23 2004-05-04 International Business Machines Corporation Database recovery to any point in time in an online environment utilizing disaster recovery technology
US7552147B2 (en) * 2005-09-02 2009-06-23 International Business Machines Corporation System and method for minimizing data outage time and data loss while handling errors detected during recovery
CN100481077C (en) * 2006-01-12 2009-04-22 国际商业机器公司 Visual method and device for strengthening search result guide
KR20090050405A (en) * 2007-11-15 2009-05-20 한국전자통신연구원 Method and apparatus for classifying user's behavior based on event log in context-aware system environment
CN101483557B (en) * 2009-03-03 2011-07-13 中兴通讯股份有限公司 Log statistic, storing method and system used for deep packet detection apparatus
US9178935B2 (en) * 2009-03-05 2015-11-03 Paypal, Inc. Distributed steam processing
CN103036921B (en) * 2011-09-29 2015-09-23 北京新媒传信科技有限公司 A kind of user behavior analysis system and method
US10223431B2 (en) * 2013-01-31 2019-03-05 Facebook, Inc. Data stream splitting for low-latency data access
US10069677B2 (en) * 2013-04-06 2018-09-04 Citrix Systems, Inc. Systems and methods to collect logs from multiple nodes in a cluster of load balancers
CN104301360B (en) * 2013-07-19 2019-03-12 阿里巴巴集团控股有限公司 A kind of method of logdata record, log server and system
US20150081668A1 (en) * 2013-09-13 2015-03-19 Nec Laboratories America, Inc. Systems and methods for tuning multi-store systems to speed up big data query workload
CN103955502B (en) * 2014-04-24 2017-07-28 科技谷(厦门)信息技术有限公司 A kind of visualization OLAP application realization method and system
CN104317958B (en) * 2014-11-12 2018-01-16 北京国双科技有限公司 A kind of real-time data processing method and system

Also Published As

Publication number Publication date
CN106815274A (en) 2017-06-09
WO2017092444A1 (en) 2017-06-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant