
CN117591477A - A log aggregation query method for massive data - Google Patents


Info

Publication number
CN117591477A
CN117591477A
Authority
CN
China
Prior art keywords
data
log
hot
cold
fields
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311397570.4A
Other languages
Chinese (zh)
Other versions
CN117591477B (en)
Inventor
陈铭
梁忠辉
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smart Net Anyun Wuhan Information Technology Co ltd
Original Assignee
Smart Net Anyun Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smart Net Anyun Wuhan Information Technology Co ltd filed Critical Smart Net Anyun Wuhan Information Technology Co ltd
Priority to CN202311397570.4A priority Critical patent/CN117591477B/en
Publication of CN117591477A publication Critical patent/CN117591477A/en
Application granted granted Critical
Publication of CN117591477B publication Critical patent/CN117591477B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a log aggregation query method for massive data, comprising the following steps: log collection: collect different types of log data from the servers; log processing: process the data fields in the logs using combined dynamic and static fields to obtain processed data; log storage: analyze the processed data with a hot/cold separation model to obtain hot and cold data, and route the hot and cold data into separate stores; log query: query logs through the hot database first, and then through the cold database. By adopting combined dynamic/static field processing and the hot/cold separation model, the invention splits the log data so that users query directly against the hot data, which greatly narrows the query scope, reduces the data volume, and improves overall query speed.

Description

A log aggregation query method for massive data

Technical field

The present invention relates to the field of log processing, and in particular to a log aggregation query method for massive data.

Background

With the development of the information age, large volumes of log data are monitored in real time and then statistically analyzed, so that production problems can be discovered and resolved from different dimensions. How to analyze efficiently and quickly over a given time range, switch analysis dimensions flexibly, and retrieve the corresponding data fast has gradually become a focus of attention in the market. The log types most commonly seen on the market are:

1. Security log: records security-related events such as login attempts, permission changes, and detected attacks, used to audit and investigate security issues.

2. Application log: records application operations, errors, exceptions, performance indicators, and other information, used to track and troubleshoot application problems.

3. System log: records operating-system events, errors, and warnings, including login information, service starts and stops, and resource utilization.

4. Database log: records database operations, queries, transactions, and related information, helping to track and debug database issues.

5. Server log: records performance indicators, errors, and events of the server hardware and operating system, including CPU, memory, disk utilization, and network traffic.

6. User log: records users' operating activities in the system, login records, access rights, and similar information.

7. Debug log: logs written by developers during development and debugging, used to trace code execution paths, inspect variable values, and so on.

8. Business log: records specific business operations, such as transaction logs, order logs, and log queries, used to trace business processes and for subsequent analysis.

Log aggregation, as a management strategy for improving efficiency and reducing noise, has become an important practice in modern IT operations. Some background on the current state of log aggregation:

1. Log volume is surging: as IT systems grow in scale and complexity, the monitoring systems generate ever more logs. Large numbers of duplicate, redundant, and irrelevant logs put great pressure on operations staff, so log aggregation has become a necessity.

2. Accurate context is needed: it is often difficult to obtain the full picture and context of a problem from a single log line. Log aggregation consolidates related logs to provide more accurate context, helping the operations team better understand and analyze problems.

3. Faster fault handling: log aggregation helps the operations team locate and handle faults more quickly and accurately. By aggregating and filtering logs, operations staff can resolve problems in a targeted way instead of getting lost in a sea of logs.

All in all, log aggregation is receiving wide application and attention as a strategy for coping with log flooding and improving operational efficiency.

At present, log aggregation relies mainly on offline processing: data is computed offline with MapReduce at fixed intervals (minutes, hours, days), and the aggregated results are stored in a single table whose columns contain all of the original fields.

The drawbacks of offline processing are as follows:

1. Offline processing has a long response delay: data must first be written to the database and then queried back out, which increases database pressure and lowers processing performance.

2. The original data fields are duplicated into the aggregation fields; with conditional queries over so many fields, query retrieval is slow.

3. When the aggregation range changes, the data responds slowly. For example, if a user wants to change the aggregation window from 5 minutes to 1 hour, a new job must be run to recompute the data from some point in time to the present. This usually takes a long time, recovery is slow, and user needs cannot be met promptly. Alternatively, a separate hourly table or daily table can be maintained, but that consumes server resources.

Summary of the invention

To solve the above problems, the present invention proposes a log aggregation query method for massive data, which greatly reduces the volume of logs to be processed, speeds up queries, and shortens response time.

The method includes the following steps:

S1. Log collection: collect different types of log data from the servers;

S2. Log processing: process the data fields in the logs using combined dynamic and static fields to obtain processed data;

S3. Log storage: analyze the processed data with a hot/cold separation model to obtain hot and cold data, and route the hot and cold data into separate stores;

S4. Log query: query logs through the hot database first, and then through the cold database.

The beneficial effect of the present invention is that, by adopting combined dynamic/static field processing and the hot/cold separation model, the log data is split so that users query directly against the hot data; this greatly narrows the query scope, reduces the data volume, and improves overall query speed.

Brief description of the drawings

Figure 1 is a schematic flow chart of the method of the present invention;

Figure 2 is a flow chart of real-time data processing according to the present invention;

Figure 3 is a schematic diagram of the user query process;

Figure 4 is a schematic diagram of the data dimensionality-reduction process of the present invention.

Detailed description

To make the purpose, technical solution, and advantages of the present invention clearer, embodiments of the present invention are further described below with reference to the accompanying drawings.

Please refer to Figure 1, a schematic flow chart of the method of the present invention.

The present invention provides a log aggregation query method for massive data, including:

S1. Log collection: collect different types of log data from the servers;

It should be noted that the different log types include: security logs, application logs, system logs, database logs, server logs, user logs, debug logs, business logs, and so on.

S2. Log processing: process the data fields in the logs using combined dynamic and static fields to obtain processed data;

It should be noted that "combined dynamic and static fields" specifically means dividing the data fields into fixed fields and dynamic column fields.

The fixed fields are fields that can be aggregated. Specifically, fixed fields are usually the most basic attributes of a log; these attributes are mutable aggregate values, for example: earliest occurrence time, latest occurrence time, number of aggregated entries, highest severity level, whether blocked, and so on. Fixed fields serve the calculations and statistics shown on the results page.

As an example, suppose there is the following batch of log data:

1. IP: 172.10.12.3, traffic: 3512 MB, level: medium, generation time: 2023-09-09 08:08:00, attacks on external network: 1, blocked: yes;

2. IP: 172.10.12.3, traffic: 1024 MB, level: low, generation time: 2023-11-10 20:08:00, attacks on external network: 2, blocked: yes;

3. IP: 172.10.12.3, traffic: 7789 MB, level: medium, generation time: 2023-12-10 16:08:00, attacks on external network: 3, blocked: no;

Then the fixed fields can be defined as:

minimum traffic, maximum traffic, highest level, earliest occurrence time, latest occurrence time, total attacks on the external network, total blocked entries, and unblocked entries.

It should be noted that fixed fields can be custom-designed according to the attributes of the fields. With the definitions above, the final fixed-field result is: 1024 MB, 7789 MB, medium, 2023-09-09 08:08:00, 2023-12-10 16:08:00, 6, 2, 1.
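The fixed-field aggregation described above can be sketched in a few lines. This is an illustrative sketch, not part of the patented implementation; the record layout and the severity ordering `LEVEL_ORDER` are assumptions made for the example.

```python
from datetime import datetime

# Assumed severity ordering for picking the "highest level" field.
LEVEL_ORDER = {"low": 0, "medium": 1, "high": 2}

logs = [
    {"ip": "172.10.12.3", "traffic_mb": 3512, "level": "medium",
     "time": "2023-09-09 08:08:00", "attacks": 1, "blocked": True},
    {"ip": "172.10.12.3", "traffic_mb": 1024, "level": "low",
     "time": "2023-11-10 20:08:00", "attacks": 2, "blocked": True},
    {"ip": "172.10.12.3", "traffic_mb": 7789, "level": "medium",
     "time": "2023-12-10 16:08:00", "attacks": 3, "blocked": False},
]

def aggregate_fixed_fields(batch):
    """Collapse a batch of raw logs into one row of fixed (aggregable) fields."""
    times = [datetime.strptime(r["time"], "%Y-%m-%d %H:%M:%S") for r in batch]
    return {
        "min_traffic_mb": min(r["traffic_mb"] for r in batch),
        "max_traffic_mb": max(r["traffic_mb"] for r in batch),
        "highest_level": max((r["level"] for r in batch), key=LEVEL_ORDER.get),
        "earliest_time": min(times).strftime("%Y-%m-%d %H:%M:%S"),
        "latest_time": max(times).strftime("%Y-%m-%d %H:%M:%S"),
        "total_attacks": sum(r["attacks"] for r in batch),
        "blocked_count": sum(1 for r in batch if r["blocked"]),
        "unblocked_count": sum(1 for r in batch if not r["blocked"]),
    }

agg = aggregate_fixed_fields(logs)
# Matches the result stated in the text: 1024, 7789, medium,
# 2023-09-09 08:08:00, 2023-12-10 16:08:00, 6, 2, 1
```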

The dynamic column fields are user-defined feature fields.

It should be noted that the dynamic column fields are the fields that the user cares about.

For example, the user-defined dynamic column fields might be: source address, source port, destination address, destination port, and geographical location.

As an example, the actual feature values of several logs under the custom dynamic column fields above are:

172.16.1.11_8080_172.16.12.3_8081_武汉 (Wuhan);

172.16.1.11_8080_172.16.12.52_8082_黄石 (Huangshi);

222.16.1.12_8097_172.16.12.52_8082_杭州 (Hangzhou).
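Assembling a dynamic-column feature value like those above might look as follows; the record keys and the underscore join convention are assumptions inferred from the examples, not details fixed by the patent.

```python
# The user-selected dynamic columns, in display order (an assumption).
DYNAMIC_COLUMNS = ["src_addr", "src_port", "dst_addr", "dst_port", "geo"]

def feature_key(record):
    """Join the user-selected dynamic columns into one feature string."""
    return "_".join(str(record[c]) for c in DYNAMIC_COLUMNS)

key = feature_key({"src_addr": "172.16.1.11", "src_port": 8080,
                   "dst_addr": "172.16.12.3", "dst_port": 8081, "geo": "武汉"})
# key == "172.16.1.11_8080_172.16.12.3_8081_武汉"
```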

S3. Log storage: analyze the processed data with the hot/cold separation model to obtain hot and cold data, and route the hot and cold data into separate stores;

It should be noted that in step S3 the hot/cold separation model specifically means:

The processed data is represented by its dynamic column fields;

The dynamic column fields are converted into the corresponding Unicode values;

A cluster analysis algorithm compares each new batch of log data to be processed within a given data class. If the difference between the Unicode value of the new batch and the Unicode value of that class is less than or equal to a preset threshold, the new batch is tagged as hot and treated as hot data; otherwise it is cold data.

As an example, the dynamic column fields above are converted into Unicode values as follows:

source address_source port_destination address_destination port_geographical location:

20320_22909_65111_19990_123331;

20320_22909_65292_19220_256431;

52789_22898_65292_19220_751621;

The Unicode values above are analyzed and compared by the clustering algorithm. First the matching rules are set; the matching rules combine exact matching with fuzzy fields.

Exact match: the source address and source port must be exactly the same;

Fuzzy fields: the destination address, destination port, and geographical location must be fuzzily the same.

Regarding fuzzy sameness: destination ports are fuzzily the same if, for example, each destination port contains 80; geographical locations are fuzzily the same if, for example, they are both in Hubei.
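The exact-match plus fuzzy-field rule could be sketched as below. The field layout, the "contains 80" port test, and the Hubei city list are assumptions taken from the examples; the patent leaves the precise fuzzy predicates open.

```python
# Cities treated as fuzzily equal because they lie in the same region (assumed).
HUBEI_CITIES = {"武汉", "黄石", "孝感"}

def parse_key(key):
    """Split a feature value of the form src_sport_dst_dport_city."""
    src_addr, src_port, dst_addr, dst_port, city = key.split("_")
    return src_addr, src_port, dst_addr, dst_port, city

def fuzzy_match(a, b):
    sa, spa, da, dpa, ca = parse_key(a)
    sb, spb, db, dpb, cb = parse_key(b)
    exact = (sa == sb) and (spa == spb)                      # source: exact
    port_fuzzy = ("80" in dpa) and ("80" in dpb)             # dest port: both contain 80
    city_fuzzy = (ca in HUBEI_CITIES) and (cb in HUBEI_CITIES)  # both in Hubei
    return exact and port_fuzzy and city_fuzzy

a = "172.16.1.11_8080_172.16.12.3_8081_武汉"
b = "172.16.1.11_8080_172.16.12.52_8082_孝感"
# Source address/port match exactly, both destination ports contain "80",
# and both cities are in Hubei, so fuzzy_match(a, b) is True.
```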

Regarding the cluster analysis algorithm, the present invention uses K-means clustering:

1. For the first batch of logs, k records are chosen at random as the initial centroids (features).

2. The distance from the other log records to these k centroids is computed (in this application, the difference being within the preset threshold).

If a log record is closest to the n-th centroid, the point belongs to class n and is labeled n. When the next batch of logs arrives, the data is computed again and the mean of class n is recomputed.

3. Records with the same class n are compared quickly: those matching the same features (the matching rules above) are merged into the same aggregated record, while non-matching features generate a new aggregated record.

The purpose of this approach is to divide the massive data into multiple classification intervals, so that records are only compared within the interval of the same class. For example, for geographical comparison, a clustering algorithm can be used to divide cities into different categories.
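Steps 1 to 3 above amount to a standard K-means assignment pass; a minimal sketch follows. The numeric vectors and the squared-Euclidean distance are assumptions for illustration, since the patent measures distance on the encoded field values.

```python
import random

def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))

def kmeans_assign(vectors, k, rounds=5, seed=0):
    random.seed(seed)
    centroids = random.sample(vectors, k)        # step 1: random initial centroids
    labels = [0] * len(vectors)
    for _ in range(rounds):
        labels = [nearest(v, centroids) for v in vectors]   # step 2: assign class n
        for i in range(k):                                  # recompute the class mean
            members = [v for v, l in zip(vectors, labels) if l == i]
            if members:
                centroids[i] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels

vectors = [(20320, 22909), (20321, 22910), (52789, 22898)]
labels = kmeans_assign(vectors, k=2)
# The two nearby vectors land in the same class; the distant one is separate.
# Step 3 (merging by matching rules) would then run only within each class.
```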

Taking the dynamic column field data above as an example:

172.16.11.1_8080_武汉 (Wuhan) and 172.16.11_8080_孝感 (Xiaogan) are similar data;

the aggregation feature (matching rule) is 172.16.11.1_8080_湖北 (Hubei), which is hot data;

when the next batch of data arrives and collides with the same aggregation feature 172.16.11.1_8080_湖北, the data is updated as hot data.

If an aggregation feature is never collided with, it is cold data; for cold data, the records matching that aggregation feature do not change, the maximum and minimum values do not change, and the time is fixed.

The above is the distinction between cold and hot data. Hot and cold data are routed into different databases: a cold database and a hot database.

As another embodiment, hot and cold data in this application can also change: hot data can become cold data, and cold data can become hot data.

The specific strategy for turning hot data into cold data is:

Set up a scheduled task that scans the hot data; if the latest generation time of a log is more than t days before the present, the hot data for the corresponding feature is updated to cold data.

As an example, t can be set to 30 days, half a year, one year, and so on; the scheduled task can run once a day, once every few days, and so on.

The specific strategy for turning cold data into hot data is:

When a log record being processed is compared within a given data class and fails to match in the hot data but is found in the cold data, that record is updated to hot data.
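The two migration strategies can be sketched together as below. The in-memory dicts `hot_db` and `cold_db` standing in for the hot and cold databases, and the field names, are assumptions of this sketch.

```python
from datetime import datetime, timedelta

T_DAYS = 30  # the threshold t from the text (30 days, half a year, one year, ...)

hot_db, cold_db = {}, {}

def demote_stale(now):
    """Scheduled task: move hot entries whose latest log time is over t days old."""
    for feature, row in list(hot_db.items()):
        if now - row["latest_time"] > timedelta(days=T_DAYS):
            cold_db[feature] = hot_db.pop(feature)

def match_incoming(feature, now):
    """On ingest: collide with hot data first; a cold hit promotes the entry."""
    if feature in hot_db:
        hot_db[feature]["latest_time"] = now
        return "hot"
    if feature in cold_db:
        row = cold_db.pop(feature)
        row["latest_time"] = now
        hot_db[feature] = row        # cold data found -> promote to hot
        return "promoted"
    hot_db[feature] = {"latest_time": now}   # a new aggregate starts out hot
    return "new"
```

A new feature starts hot, is demoted by the scheduled scan once stale, and is promoted back the next time an incoming record collides with it.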

S4. Log query: query logs through the hot database first, and then through the cold database.

As an embodiment, the present invention stores the data in separate streams: cold data goes exclusively into the cold database and hot data exclusively into the hot database; meanwhile, per the configured strategies, the cold and hot databases are updated regularly.

When querying logs, colliding against the hot database first greatly reduces query time and improves query efficiency.

Please refer to Figure 2, a flow chart of real-time data processing according to the present invention.

After log data is collected from the servers and processed, the processed data still enters the original database first; at the same time, it is fed into another branch for cluster analysis, producing the cold and hot databases.

Please refer to Figure 3, a schematic diagram of the user query process.

When a user queries logs, the endpoints of the time range are specified first; then a data collision is performed against the hot database to obtain the fields that satisfy the conditions.

Finally, based on the matching fields, the more detailed log content is retrieved from the original database.
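The query flow of Figure 3 (hot collision first, cold fallback, then detail lookup in the original database) might be sketched as follows; the store layouts and the predicate interface are assumptions of this sketch.

```python
def query_logs(time_range, predicate, hot_db, cold_db, original_db):
    """Tiered query: hot store first, cold store as fallback, then details."""
    start, end = time_range

    def collide(store):
        # Keep feature keys that satisfy the user's condition and time range.
        return [k for k, row in store.items()
                if predicate(k) and start <= row["latest_time"] <= end]

    keys = collide(hot_db)          # step 1: collide with the hot database
    if not keys:
        keys = collide(cold_db)     # step 2: fall back to the cold database
    # Step 3: fetch the detailed log content from the original database.
    return [original_db[k] for k in keys if k in original_db]
```

Because most queries are answered from the much smaller hot store, the collision step touches far fewer rows than scanning the original database directly.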

Please refer to Figure 4, a schematic diagram of the data dimensionality-reduction process of the present invention.

The original data comprises 1 billion records. After processing with fixed fields and dynamic column fields, detailed data is obtained. The detailed data is analyzed with a clustering algorithm over the associated feature values to obtain hot data and cold data.

In its technical implementation, the present invention relies on: Spark + Kafka + Elasticsearch + HDFS + a Redis cluster + a big-data cluster;

The databases used are Elasticsearch and OpenTSDB;

Implementation: build the servers and collection servers, set up the big-data cluster environment, simulate sending security logs and program logs, observe the data aggregation, and use Kafka-Manager and Java JVM monitoring tools to monitor log ingestion. In terms of performance, on average 20,000 aggregated records per second can be processed, and querying tens of millions of records takes about 5 seconds.

Finally, the beneficial effect of the present invention is that, by adopting combined dynamic/static field processing and the hot/cold separation model, the log data is split so that users query directly against the hot data; this greatly narrows the query scope, reduces the data volume, and improves overall query speed.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (7)

1. A log aggregation query method for massive data, characterized by including:

S1. Log collection: collect different types of log data from the servers;

S2. Log processing: process the data fields in the logs using combined dynamic and static fields to obtain processed data;

S3. Log storage: analyze the processed data with a hot/cold separation model to obtain hot and cold data, and route the hot and cold data into separate stores;

S4. Log query: query logs through the hot database first, and then through the cold database.

2. The log aggregation query method for massive data of claim 1, characterized in that the combined dynamic and static fields specifically means dividing the data fields into fixed fields and dynamic column fields.

3. The log aggregation query method for massive data of claim 2, characterized in that the fixed fields are specifically fields that can be aggregated.

4. The log aggregation query method for massive data of claim 2, characterized in that the dynamic column fields are user-defined feature fields.

5. The log aggregation query method for massive data of claim 4, characterized in that in step S3 the hot/cold separation model specifically means: the processed data is represented by its dynamic column fields; the dynamic column fields are converted into the corresponding Unicode values; a cluster analysis algorithm compares each new batch of log data to be processed within a given data class, and if the difference between the Unicode value of the new batch and the Unicode value of that class is less than or equal to a preset threshold, the new batch is tagged as hot and treated as hot data, otherwise it is cold data.

6. The log aggregation query method for massive data of claim 5, characterized by setting a scheduled task that scans the hot data and, if the latest generation time of a log is more than t days before the present, updates it to cold data.

7. The log aggregation query method for massive data of claim 6, characterized in that when a log record being processed is compared within a given data class and fails to match in the hot data but is found in the cold data, that record is updated to hot data.
CN202311397570.4A 2023-10-24 2023-10-24 Log aggregation query method for mass data Active CN117591477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311397570.4A CN117591477B (en) 2023-10-24 2023-10-24 Log aggregation query method for mass data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311397570.4A CN117591477B (en) 2023-10-24 2023-10-24 Log aggregation query method for mass data

Publications (2)

Publication Number Publication Date
CN117591477A true CN117591477A (en) 2024-02-23
CN117591477B CN117591477B (en) 2025-03-11

Family

ID=89920860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311397570.4A Active CN117591477B (en) 2023-10-24 2023-10-24 Log aggregation query method for mass data

Country Status (1)

Country Link
CN (1) CN117591477B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119668513A (en) * 2024-12-05 2025-03-21 南京勤添科技有限公司 A user data processing optimization system for cloud computing environment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
US20160357809A1 (en) * 2015-06-02 2016-12-08 Vmware, Inc. Dynamically converting search-time fields to ingest-time fields
CN107577588A (en) * 2017-09-26 2018-01-12 北京中安智达科技有限公司 A kind of massive logs data intelligence operational system
CN109871367A (en) * 2019-02-28 2019-06-11 江苏实达迪美数据处理有限公司 A kind of distributed cold and heat data separation method based on Redis and HBase
CN111930886A (en) * 2020-07-06 2020-11-13 国网江西省电力有限公司电力科学研究院 Log processing method, system, storage medium and computer equipment
CN114036120A (en) * 2021-11-04 2022-02-11 上海欣方智能系统有限公司 A real-time analysis method and system based on massive log data
US20220121507A1 (en) * 2020-10-21 2022-04-21 Vmware, Inc. Methods and systems that sample log/event messages in a distributed log-analytics system
CN114385396A (en) * 2021-12-27 2022-04-22 华青融天(北京)软件股份有限公司 Log analysis method, device, equipment and medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160357809A1 (en) * 2015-06-02 2016-12-08 Vmware, Inc. Dynamically converting search-time fields to ingest-time fields
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
CN107577588A (en) * 2017-09-26 2018-01-12 北京中安智达科技有限公司 A kind of massive logs data intelligence operational system
CN109871367A (en) * 2019-02-28 2019-06-11 江苏实达迪美数据处理有限公司 A kind of distributed cold and heat data separation method based on Redis and HBase
CN111930886A (en) * 2020-07-06 2020-11-13 国网江西省电力有限公司电力科学研究院 Log processing method, system, storage medium and computer equipment
US20220121507A1 (en) * 2020-10-21 2022-04-21 Vmware, Inc. Methods and systems that sample log/event messages in a distributed log-analytics system
CN114036120A (en) * 2021-11-04 2022-02-11 上海欣方智能系统有限公司 A real-time analysis method and system based on massive log data
CN114385396A (en) * 2021-12-27 2022-04-22 华青融天(北京)软件股份有限公司 Log analysis method, device, equipment and medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119668513A (en) * 2024-12-05 2025-03-21 南京勤添科技有限公司 A user data processing optimization system for cloud computing environment

Also Published As

Publication number Publication date
CN117591477B (en) 2025-03-11

Similar Documents

Publication Publication Date Title
CN111984499B (en) Fault detection method and device for big data cluster
CN112416724B (en) Alarm processing method, system, computer device and storage medium
US11196756B2 (en) Identifying notable events based on execution of correlation searches
JP2022118108A (en) Log auditing method, device, electronic apparatus, medium and computer program
US20060074621A1 (en) Apparatus and method for prioritized grouping of data representing events
US20190079965A1 (en) Apparatus and method for real time analysis, predicting and reporting of anomalous database transaction log activity
EP3070620A1 (en) Lightweight table comparison
WO2021159834A1 (en) Abnormal information processing node analysis method and apparatus, medium and electronic device
WO2007068667A1 (en) Method and apparatus for analyzing the effect of different execution parameters on the performance of a database query
CN107402957B (en) Construction of User Behavior Pattern Library and Method and System for Abnormal User Behavior Detection
CN111258798A (en) Fault positioning method and device for monitoring data, computer equipment and storage medium
CN112583847B (en) Method for network security event complex analysis for medium and small enterprises
He et al. Graph based incident extraction and diagnosis in large-scale online systems
CN117591477A (en) A log aggregation query method for massive data
CN112395315A (en) Method for counting log files and detecting abnormity and electronic device
CN112968805A (en) Alarm log processing method and device
Roschke et al. A flexible and efficient alert correlation platform for distributed ids
Zou et al. Improving log-based fault diagnosis by log classification
CN114116614A (en) Log storage method, device, computer equipment and storage medium
CN117421640A (en) API asset identification method, device, equipment and storage medium
CN116795974A (en) Log retrieval method, log retrieval device, equipment and storage medium
CN114422324A (en) Alarm information processing method and device, electronic equipment and storage medium
Prashanthi et al. Generating analytics from web log
CN118174971B (en) Multi-source heterogeneous data management method and system for network threat
CN117857182B (en) Processing method and device for server abnormal access

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant