
CN119377232B - A method for storing and querying massive session data

Info

Publication number: CN119377232B
Application number: CN202411948779.XA
Authority: CN
Inventors: 唐潮; 方奕
Original assignee: Shenzhou Lingcloud Beijing Technology Co ltd
Current assignee: Shenzhou Lingcloud Beijing Technology Co ltd
Priority date: 2024-12-27
Filing date: 2024-12-27
Publication date: 2025-05-06
Anticipated expiration: 2044-12-27
Also published as: CN119377232A

Abstract

The present invention proposes a method for storing and querying massive session data, including: creating an index bucket in a database, where the index bucket is composed of multiple layers of index space and each layer is assigned a different time precision label; monitoring a target data interface, saving the session data obtained from the target data interface into a preset minute precision table in the database, and saving the index information of the minute precision table into the first layer of index space of the index bucket; configuring materialized views for the first through second-to-last layers of index space to monitor data changes in the preceding layer of index space, where the materialized view in the first layer of index space monitors data updates of the minute precision table; and, in response to a monitored data change in the minute precision table, updating the index information in the multiple layers of index space in sequence. By creating an index bucket organized by time precision while the data tables are generated, the invention improves data query efficiency and reduces CPU occupancy.

Description

Massive session data storage and query method

Technical Field

The invention belongs to the field of computer data storage, and particularly relates to a method for storing and querying massive session data.

Background

In the prior art, a data table is created and stored per interface. To support detailed statistics over multidimensional data (such as hosts, IP sessions, TCP sessions, UDP sessions, server accesses, and server two-tuples) and their presentation (such as time-series charts, pie charts, and TOP charts), the interface data table is generally designed to store data at minute precision; that is, each data table stores only the session data of one minute (hereinafter referred to as a minute precision table).

However, minute precision tables are poorly suited to table-splitting (sub-table) operations, so data queries become slow and memory is easily exhausted, which in turn can cause system anomalies.

Disclosure of Invention

In order to improve the query speed of session data and reduce memory occupation, a first aspect of the invention provides a method for storing massive session data, comprising: creating an index bucket in a database, wherein the index bucket is composed of multiple layers of index spaces and a different time precision label is assigned to each layer of index space; monitoring a target data interface, storing session data acquired from the target data interface into a preset minute precision table in the database, and storing index information of the minute precision table into a first layer of index space of the index bucket; configuring materialized views for the first through second-to-last layers of index space for monitoring data changes in the preceding layer of index space, wherein the materialized view in the first layer of index space monitors data updates of the minute precision table; and sequentially updating the index information in the multiple layers of index space in response to a monitored data change in the minute precision table.

In one or more embodiments, creating an index bucket in a database, the index bucket consisting of multiple layers of index spaces and assigning different time precision labels to each layer of index space, includes creating an index bucket in the database having a bucket depth of at least 3 to form at least three layers of index spaces and sequentially assigning ten minutes, hours, and days to the first through third layers of index spaces as time precision labels.

In one or more embodiments, storing session data obtained from the target data interface in a preset minute precision table in a database includes monitoring the target data interface, extracting the session data obtained from the target data interface into a memory, aggregating session data with the same quadruple and time precision in the memory according to a preset matching rule, and storing the aggregated session data in the minute precision table.

In one or more embodiments, aggregating session data having the same quadruple and time precision in the memory includes obtaining quadruple information and timestamp information in the session data, determining the time precision of the session data according to the timestamp information, and aggregating session data having the same source address and destination address and the same time precision.

In one or more embodiments, storing the aggregated session data in the minute precision table includes storing the aggregated session data in the minute precision table, and configuring a minute time precision of the first piece of session data as a table name of the minute precision table as index information.

In one or more embodiments, in response to monitoring that the data of the minute precision table changes, sequentially updating index information in the multi-layer index space includes, in response to monitoring that a new minute precision table is generated, judging whether the time precision of the new minute precision table belongs to a current 10-minute precision table in the first-layer index space, if so, saving the index information of the new minute precision table into the 10-minute precision table, and if not, generating a new 10-minute precision table in the first-layer index space, and saving the index information of the new minute precision table into the new 10-minute precision table.

In one or more embodiments, sequentially updating index information in the multi-layer index space in response to monitoring that the data of the minute precision table changes further comprises: in response to monitoring that a new 10-minute precision table is generated, judging whether the time precision of the new 10-minute precision table belongs to a current hour precision table in the second-layer index space; if so, storing the index information of the new 10-minute precision table into the hour precision table; and if not, generating a new hour precision table in the second-layer index space and storing the index information of the new 10-minute precision table into the new hour precision table.

In one or more embodiments, sequentially updating index information in the multi-layer index space in response to monitoring that the data of the minute precision table changes further comprises: in response to monitoring that a new hour precision table is generated, judging whether the time precision of the new hour precision table belongs to a current day precision table in the third-layer index space; if so, saving the index information of the new hour precision table into the day precision table; and if not, generating a new day precision table in the third-layer index space and saving the index information of the new hour precision table into the new day precision table.

In one or more embodiments, the method for judging whether the time precision of the newly generated precision table in the upper layer index space belongs to the current precision table in the lower layer index space comprises the steps of judging whether the time difference between the table name of the newly generated precision table in the upper layer index space and the table name of the current precision table in the lower layer index space is smaller than a preset threshold value, and judging that the time precision of the newly generated precision table in the upper layer belongs to the current precision table of the lower layer if the time difference is smaller than the preset threshold value, wherein the preset threshold values of the first layer index space to the third layer index space are respectively 10 minutes, 1 hour and 24 hours.

In a second aspect of the present invention, a massive data query method based on any of the massive data storage method embodiments is provided, where the query method includes inputting a time range of data to be queried, splitting the time range into the corresponding time precisions, performing index matching in the third-layer to first-layer index spaces in sequence according to the time precisions, and determining the range of minute data tables according to the final index precision.

The beneficial effect of the invention is that the index bucket is created, and the index information of the data tables is managed by time precision, at the same time as the data tables corresponding to the data interface are created. Subsequent data queries can therefore be served according to the time precision range of the requested query, which improves data query efficiency and reduces CPU occupancy.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a workflow diagram of a method of mass data storage according to an embodiment of the present invention;

FIG. 2 is a flow chart of data update of an index bucket through materialized views according to an embodiment of the present invention;

FIG. 3 is a flow chart of data query performed by the index bucket according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

It should be noted that, in the embodiments of the present invention, the expressions "first" and "second" are used to distinguish two entities or parameters that share a name but are not the same. They are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

In order to improve the query speed of session data and reduce memory occupation, in one embodiment the invention provides a massive session data storage method which, referring to FIG. 1, comprises: step S1, creating an index bucket in a database, wherein the index bucket is composed of multiple layers of index spaces and a different time precision label is assigned to each layer of index space; step S2, monitoring a target data interface, storing session data acquired from the target data interface into a minute precision table preset in the database, and storing index information of the minute precision table into a first layer of index space of the index bucket; step S3, configuring materialized views for the first through second-to-last layers of index space for monitoring data changes in the preceding layer of index space, wherein the materialized view in the first layer of index space monitors data updates of the minute precision table; and step S4, sequentially updating the index information in the multiple layers of index spaces in response to a monitored data change in the minute precision table.

Specifically, in this embodiment the index bucket is created, and the index information of the data tables is managed by time precision, at the same time as the data tables corresponding to the data interface are created. Subsequent data queries can therefore be served according to the time precision range of the requested query, which improves data query efficiency and reduces CPU occupancy.

In one or more embodiments, creating an index bucket in a database, the index bucket consisting of multiple layers of index spaces and assigning different time precision labels to each layer of index space, includes creating an index bucket in the database with a bucket depth of at least 3 to form at least three layers of index spaces and sequentially assigning ten minutes, hours, and days to the first through third layers of index spaces as time precision labels.

Specifically, the depth of the index bucket corresponds to the number of index space layers, and each layer corresponds to one index precision, so the depth of the index bucket must be set according to the index precisions required. Because the base precision of the application is the minute table, the index bucket stores the indexes of minute tables aggregated according to a preset rule, forming 10-minute tables, hour tables, and day tables respectively. The minute tables store the main content of the session data, while the 10-minute tables, hour tables, and day tables store only the index information of the precision tables in the preceding layer: a day table stores the index information of hour tables, an hour table stores the index information of 10-minute tables, and a 10-minute table stores the index information of minute tables.
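As a rough illustration of the layered structure described above, the following Python sketch models an index bucket as an ordered list of index spaces, each tagged with a time precision label. The names `IndexSpace`, `IndexBucket`, and `create_index_bucket`, and the dictionary layout, are assumptions made for illustration only; the text does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class IndexSpace:
    """One layer of the bucket (hypothetical layout)."""
    label: str        # time precision label, e.g. "10min"
    span: timedelta   # time covered by one precision table in this layer
    # precision-table name -> names of the preceding-layer tables it indexes
    tables: dict = field(default_factory=dict)


@dataclass
class IndexBucket:
    """Hypothetical container for the layered index spaces."""
    layers: list

    @property
    def depth(self) -> int:
        return len(self.layers)


def create_index_bucket() -> IndexBucket:
    # Layers 1 to 3 carry the labels named in the text: ten minutes, hour, day.
    return IndexBucket(layers=[
        IndexSpace("10min", timedelta(minutes=10)),
        IndexSpace("hour", timedelta(hours=1)),
        IndexSpace("day", timedelta(days=1)),
    ])


bucket = create_index_bucket()
```

Adding month or year precision would, under the same assumptions, amount to appending further `IndexSpace` entries and so increasing the bucket depth.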

In one embodiment, the index precisions may also include months and years, in which case the depth of the index bucket is increased accordingly.

In an alternative embodiment, index spaces of different sizes may be allocated to the first through third layers, since the amount of index information decreases from the first layer to the third layer.

In one embodiment, monitoring the target data interface and storing the session data acquired from the target data interface into a minute precision table preset in the database comprises: monitoring the target data interface; extracting the session data acquired from the target data interface into memory; aggregating, in memory and according to a preset matching rule, the session data that have the same quadruple and time precision; and storing the aggregated session data into the minute precision table.

Specifically, after data is obtained from the data interface and matched against the preset rule, the system aggregates session data with the same quadruple and time precision in memory at minute precision, and then writes the result into a minute table of the CH (ClickHouse) database. The time precision of session data is determined by splitting the time at which the session data was generated or transmitted and applying a preset rule (such as rounding up or down), for example:

The start time of a session is 2024-10-10 12:23:34.634718152. The corresponding time precisions of this record are as follows:

minute precision: 2024-10-10 12:23:00;

10-minute precision: 2024-10-10 12:30:00 (rounding up is used here);

hour precision: 2024-10-10 12:00:00;

day precision: 2024-10-10 00:00:00.
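The precisions listed above can be reproduced with a short Python sketch. The function name `time_precisions` is hypothetical, and the rounding directions simply follow this example (10-minute precision rounded up, the others truncated); the text allows other preset rules.

```python
from datetime import datetime, timedelta


def time_precisions(start: datetime) -> dict:
    """Return the minute, 10-minute, hour, and day precision of a start time."""
    minute = start.replace(second=0, microsecond=0)
    # Assumption taken from the example: 10-minute precision is rounded UP
    # to the next multiple of ten minutes (12:23 becomes 12:30).
    overshoot = minute.minute % 10
    ten_min = minute if overshoot == 0 else minute + timedelta(minutes=10 - overshoot)
    hour = minute.replace(minute=0)
    day = hour.replace(hour=0)
    return {"minute": minute, "10min": ten_min, "hour": hour, "day": day}


# Python datetimes hold microseconds only, so the nanosecond digits are dropped.
p = time_precisions(datetime(2024, 10, 10, 12, 23, 34, 634718))
```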

More specifically, data with the same quadruple is stored in the same minute table to facilitate subsequent queries, detailed statistics (such as hosts, IP sessions, TCP sessions, UDP sessions, server accesses, and server two-tuples), and the presentation of multidimensional data (such as time-series charts, pie charts, and TOP charts).

In one embodiment, aggregating session data having the same quadruple and time precision in memory according to a preset matching rule includes obtaining quadruple information and time stamp information in the session data, determining time precision of the session data according to the time stamp information, and aggregating session data having the same source address and destination address and the same time precision.

Specifically, the time precision of the data is determined according to the time stamp carried by the data, and the time stamp is applied by the data transmitting end to record the generation time or the transmission time of the data.
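A minimal in-memory aggregation along these lines might look as follows. The record fields (`src_ip`, `src_port`, `dst_ip`, `dst_port`, `ts`, `bytes`) and the choice of a byte counter as the aggregated value are assumptions for illustration; the text does not define the record layout or the exact composition of the quadruple.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_sessions(records):
    """Group records by (quadruple, minute precision) and sum a byte counter."""
    totals = defaultdict(int)
    for r in records:
        # Assumed quadruple: source/destination address and port.
        quadruple = (r["src_ip"], r["src_port"], r["dst_ip"], r["dst_port"])
        minute = r["ts"].replace(second=0, microsecond=0)
        totals[(quadruple, minute)] += r["bytes"]
    return dict(totals)


base = {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.2", "dst_port": 80}
records = [
    {**base, "ts": datetime(2024, 10, 10, 12, 23, 5), "bytes": 100},
    {**base, "ts": datetime(2024, 10, 10, 12, 23, 40), "bytes": 50},
    {**base, "ts": datetime(2024, 10, 10, 12, 24, 1), "bytes": 7},
]
agg = aggregate_sessions(records)
```

The first two records share a quadruple and a minute, so they collapse into one entry; the third falls into the next minute and stays separate.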

In one embodiment, storing the aggregated session data in a minute precision table includes storing the aggregated session data in a minute precision table and configuring a minute time precision of the first piece of session data as a table name of the minute precision table as index information.

Specifically, a minute table may store multiple pieces of session data whose generation or transmission times differ from one another. To give the aggregated minute table a single piece of index information, this embodiment selects the minute time precision of the first piece of session data as the index information of the minute table. The time difference between the last piece and the first piece of session data in a minute table is within one minute, that is, they share the same minute time precision.

In an alternative implementation, when the timestamp information of the session data is accurate to the second, it must be rounded to the minute. For example, if the timestamp of a piece of session data is 2020-12-1-15:30:59, the index information of the minute table should be 2020-12-1-15:30 (the seconds are dropped). The minute time precision of the session data is determined in memory.
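The naming rule can be sketched as below. The helper name `minute_table_name` is hypothetical, and the output format string is an assumption chosen to match the minute-table format given later in the description.

```python
from datetime import datetime


def minute_table_name(session_times):
    """Name a minute table after the minute precision of its first record."""
    first = session_times[0]
    # Dropping seconds and microseconds truncates the timestamp to the minute.
    return first.replace(second=0, microsecond=0).strftime("%Y-%m-%d %H:%M:00")


name = minute_table_name([
    datetime(2020, 12, 1, 15, 30, 59),
    datetime(2020, 12, 1, 15, 30, 12),
])
```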

In one embodiment, in response to monitoring that the data of the minute precision table changes, sequentially updating index information in the multi-layer index space comprises the steps of responding to monitoring that a new minute precision table is generated, judging whether the time precision of the new minute precision table belongs to a current 10-minute precision table in the first-layer index space, storing the index information of the new minute precision table into the 10-minute precision table if the time precision of the new minute precision table belongs to the current 10-minute precision table in the first-layer index space, generating a new 10-minute precision table in the first-layer index space if the time precision of the new minute precision table does not belong to the current 10-minute precision table in the first-layer index space, and storing the index information of the new minute precision table into the new 10-minute precision table;

The updating process of the 10-minute table comprises: in response to monitoring that a new 10-minute precision table is generated, judging whether the time precision of the new 10-minute precision table belongs to a current hour precision table in the second-layer index space; if so, storing the index information of the new 10-minute precision table into the hour precision table; and if not, generating a new hour precision table in the second-layer index space and storing the index information of the new 10-minute precision table into the new hour precision table.

The updating process of the hour table comprises: in response to monitoring that a new hour precision table is generated, judging whether the time precision of the new hour precision table belongs to the current day precision table in the third-layer index space; if so, storing the index information of the new hour precision table into the day precision table; and if not, generating a new day precision table in the third-layer index space and storing the index information of the new hour precision table into the new day precision table.
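The three update processes form a cascade, which the following Python sketch tries to capture. It is an assumption-laden simplification: each layer is a plain dictionary keyed by the table's start time, parent tables are named by rounding down (one possible preset rule), and membership is tested by key lookup rather than by the threshold comparison described elsewhere. The names `floor_to` and `register_minute_table` are hypothetical.

```python
from datetime import datetime, timedelta

# Spans of the first, second, and third index layers.
SPANS = [timedelta(minutes=10), timedelta(hours=1), timedelta(days=1)]


def floor_to(ts: datetime, span: timedelta) -> datetime:
    """Round ts down to a multiple of span, counted from midnight of that day."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((ts - midnight) // span) * span


def register_minute_table(layers, minute_ts: datetime) -> None:
    """Propagate a new minute table's index entry up through the layers."""
    child = minute_ts
    for layer, span in zip(layers, SPANS):
        parent = floor_to(child, span)
        is_new = parent not in layer
        layer.setdefault(parent, []).append(child)
        if not is_new:
            break        # parent already existed, so higher layers are unchanged
        child = parent   # a new table was generated: notify the next layer


layers = [{}, {}, {}]
for hh, mm in [(12, 23), (12, 24), (12, 31), (13, 5)]:
    register_minute_table(layers, datetime(2024, 10, 10, hh, mm))
```

After these four insertions, the 12:20 ten-minute table indexes two minute tables, the 12:00 hour table indexes two ten-minute tables, and the single day table indexes two hour tables.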

Specifically, referring to FIG. 2, in order to ensure that the data of the minute tables can be propagated to each precision table in real time and aggregated, the invention uses the materialized views provided by ClickHouse together with the SummingMergeTree table engine. The materialized views are mainly used to monitor data updates; corresponding materialized views only need to be established for the minute table, the 10-minute table, and the hour table, and the corresponding precision tables are generated according to the data updates. The name format of each precision table is as follows:

Minute table: yyyy-MM-dd HH:mm:00, formatted according to the session start time;

10-minute table: yyyy-MM-dd HH:mm:00, formatted according to the session start time;

Hour table: yyyy-MM-dd HH:00:00, formatted according to the session start time;

Day table: yyyy-MM-dd 00:00:00, formatted according to the session start time;

In the above formats, yyyy denotes the year, MM the month, dd the day, HH the hour, and mm the minute, where 00 ≤ mm ≤ 59. The materialized views execute automatically. When data is inserted into a minute table, its materialized view immediately aggregates the newly inserted data at 10-minute precision and inserts the result into the corresponding 10-minute table. When the materialized view of the 10-minute table detects the insertion, it immediately aggregates the inserted data at hour precision and inserts the result into the corresponding hour table. When the materialized view of the hour table detects the insertion, it immediately aggregates the inserted data at day precision and inserts the result into the corresponding day table. This is the main workflow of the materialized views of the index bucket.

For data inserted into a precision table, if rows with the same quadruple and the same precision already exist, the SummingMergeTree engine is used to identify duplicate rows based on designated fields and to sum the remaining data. In ClickHouse, the designated fields are declared in the ORDER BY clause that follows the engine declaration, and duplicates are determined by the fields listed after ORDER BY; the remaining data refers to all fields of the table other than those listed in ORDER BY.

The invention requires that SummingMergeTree engines be created for the minute tables, 10-minute tables, hour tables, and day tables. When data enters a table and the ORDER BY fields specified for it repeat those of existing rows, the SummingMergeTree engine is triggered and sums the newly arrived data with the existing rows that have the same key.

In an alternative embodiment, the summation in the present invention is customized per field using sum(column), and fields are not limited to summation. For example, if the maximum or minimum value of a field is needed, max(column1) or min(column2) may be used, where column1 and column2 correspond to different fields.
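The merge behaviour described above can be imitated in plain Python to make the rule concrete. This sketch does not call ClickHouse and is not its implementation; `merge_rows`, the column names, and the per-column aggregator mapping are all hypothetical.

```python
def merge_rows(rows, key_fields, aggregators):
    """Collapse rows sharing the same key, combining other columns per aggregator."""
    merged = {}
    for row in rows:
        # key_fields plays the role of the ORDER BY fields in the description.
        key = tuple(row[f] for f in key_fields)
        if key not in merged:
            merged[key] = dict(row)
            continue
        target = merged[key]
        for col, combine in aggregators.items():
            target[col] = combine(target[col], row[col])
    return list(merged.values())


rows = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 100, "peak_rtt": 12, "min_rtt": 4},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 50, "peak_rtt": 20, "min_rtt": 3},
]
result = merge_rows(
    rows,
    key_fields=["src", "dst"],
    aggregators={"bytes": lambda a, b: a + b, "peak_rtt": max, "min_rtt": min},
)
```

Here `bytes` is summed while `peak_rtt` and `min_rtt` take the maximum and minimum, mirroring the sum(column), max(column1), and min(column2) options mentioned in the text.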

In one embodiment, the method for judging whether the time precision of the newly generated precision table in the upper layer index space belongs to the current precision table in the lower layer index space comprises judging whether the time difference between the table name of the newly generated precision table in the upper layer index space and the table name of the current precision table in the lower layer index space is smaller than a preset threshold value, and judging that the time precision of the newly generated precision table in the upper layer belongs to the current precision table of the lower layer if the time difference is smaller than the preset threshold value, wherein the preset threshold values of the first layer index space to the third layer index space are respectively 10 minutes, 1 hour and 24 hours.
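The threshold comparison can be written as a small Python check. The function name `belongs_to_current`, the table-name format string, and the requirement that the difference be non-negative are assumptions; only the three threshold values come from the text.

```python
from datetime import datetime, timedelta

# Preset thresholds for the first to third index layers, as stated in the text.
THRESHOLDS = {1: timedelta(minutes=10), 2: timedelta(hours=1), 3: timedelta(hours=24)}
FMT = "%Y-%m-%d %H:%M:%S"  # assumed table-name format


def belongs_to_current(new_name: str, current_name: str, layer: int) -> bool:
    """True if the new table's name lies within the threshold of the current table."""
    diff = datetime.strptime(new_name, FMT) - datetime.strptime(current_name, FMT)
    # Assumption: tables arrive in time order, so negative differences do not belong.
    return timedelta(0) <= diff < THRESHOLDS[layer]
```

For example, a minute table named 2024-10-10 12:23:00 would belong to a current 10-minute table named 2024-10-10 12:20:00, whereas one named 2024-10-10 12:31:00 would trigger the creation of a new 10-minute table.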

In a second aspect of the present invention, a method for querying data based on an index bucket formed in the foregoing embodiment is provided, including:

step 100, inputting a time range of data to be queried;

step 200, splitting the time range into corresponding time precision;

step 300, performing index matching in the third-layer to first-layer index spaces in sequence according to the time precisions, and determining the range of minute data tables according to the final index precision.

Specifically, the invention uses a time splitting method to split the query time into ranges that match each precision, and queries the precision table designated for each range. For example, the time range of the data to be queried is 2024-06-12:34:00-2024-06-16:18:22:00, and the query process is shown in FIG. 3:

Taking the last 2 hours as the aggregation time for data entering the bucket, the above time range can be split into:

2024-06-12:34:00-2024-06-12:40:00: query the minute tables;

2024-06-12:40:00-2024-06-12:00:00: query the 10-minute tables;

2024-06-12:00:00-2024-06-13 00:00:00: query the hour tables;

2024-06-13 00:00:00-2024-06-16 00:00:00: query the day tables;

2024-06-16 00:00:00-2024-06-16 18:00:00: query the hour tables;

2024-06-16 18:00:00-2024-06-16 18:10:00: query the 10-minute tables;

2024-06-16 18:10:00-2024-06-16 18:22:00: query the minute tables;

And the like.
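One way to automate this kind of splitting is a greedy pass that, at each point, picks the coarsest table whose boundary is aligned and which still fits inside the range. The Python sketch below is an assumption rather than the patented procedure: `split_range` is a hypothetical name, inputs are assumed to be minute-aligned, and the start hour used in the usage line (12:34) is chosen purely for illustration. Because the rule always prefers the coarsest fit, its tail differs slightly from the worked example (it serves 18:10 to 18:20 from a 10-minute table).

```python
from datetime import datetime, timedelta

UNITS = [
    ("day", timedelta(days=1)),
    ("hour", timedelta(hours=1)),
    ("10min", timedelta(minutes=10)),
    ("minute", timedelta(minutes=1)),
]


def split_range(start: datetime, end: datetime):
    """Split [start, end) into runs, each served by the coarsest aligned table."""
    segments = []
    cursor = start
    while cursor < end:
        midnight = cursor.replace(hour=0, minute=0, second=0, microsecond=0)
        for label, span in UNITS:
            aligned = (cursor - midnight) % span == timedelta(0)
            if aligned and cursor + span <= end:
                break
        # If nothing matched, the loop ends on ("minute", 1 min) as a fallback.
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], cursor + span)  # extend the run
        else:
            segments.append((label, cursor, cursor + span))
        cursor += span
    return segments


# Illustrative range; the 12:34 start hour is an assumption.
segments = split_range(datetime(2024, 6, 12, 12, 34), datetime(2024, 6, 16, 18, 22))
```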

In one embodiment, to avoid memory exhaustion, query requests against the hour tables and day tables may be split into multiple query requests of finer precision, such as splitting a day into 24 hours or splitting an hour into six 10-minute segments, with the precision of the split requests determined by the memory currently available.

By using the index bucket to establish the index relationship between the index spaces and the minute tables, the invention enables precision-based lookup of data tables and makes it easy to split query requests, thereby avoiding excessive memory use and reducing the impact of data queries on system operation.
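A memory-aware split could be sketched as follows. The cost model (a fixed estimate of bytes per minute of data) and the names `split_request`, `free_bytes`, and `bytes_per_minute` are assumptions introduced for illustration; the text only states that the split precision depends on the remaining memory.

```python
from datetime import datetime, timedelta

# Candidate request sizes, coarsest first.
STEPS = [timedelta(days=1), timedelta(hours=1), timedelta(minutes=10)]


def split_request(start, end, free_bytes, bytes_per_minute):
    """Chunk [start, end) using the coarsest step whose estimated cost fits in memory."""
    step = STEPS[-1]  # fall back to the finest step if nothing fits
    for candidate in STEPS:
        estimated = candidate.total_seconds() / 60 * bytes_per_minute
        if estimated <= free_bytes:
            step = candidate
            break
    chunks = []
    cursor = start
    while cursor < end:
        nxt = min(cursor + step, end)
        chunks.append((cursor, nxt))
        cursor = nxt
    return chunks


day_chunks = split_request(datetime(2024, 6, 13), datetime(2024, 6, 14), 100, 1)
hour_chunks = split_request(datetime(2024, 6, 13, 0), datetime(2024, 6, 13, 1), 10, 1)
```

With these illustrative numbers, a one-day request is split into 24 hourly requests, and a one-hour request under a tighter budget is split into six 10-minute requests.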

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order.

It will be appreciated by persons skilled in the art that the foregoing discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples, that technical features of the above embodiments or different embodiments may be combined and that many other variations of the different aspects of the embodiments of the invention as described above exist within the spirit of the embodiments of the invention, which are not provided in detail for clarity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims

1. A method for mass session data storage, the method comprising:

Creating an index bucket in a database, wherein the index bucket consists of multiple layers of index spaces, and different time precision labels are allocated to each layer of index space;

Monitoring a target data interface, storing session data acquired from the target data interface into a minute precision table preset in a database, and storing index information of the minute precision table into a first layer index space of the index bucket;

configuring materialized views for the index spaces from the first layer to the next to last layer for monitoring the data change condition in the index space of the previous layer, wherein the materialized views in the index space of the first layer are used for monitoring the data update condition of the minute precision table;

Sequentially updating index information in the multi-layer index space in response to monitoring that the data of the minute precision table changes;

Wherein monitoring the target data interface and storing the session data acquired from the target data interface into the minute precision table preset in the database comprises: extracting the session data acquired from the target data interface into a memory; acquiring, in the memory, quadruple information and timestamp information in the session data; determining the time precision of the session data according to the timestamp information; aggregating the session data that have the same source address and destination address and the same time precision; and storing the aggregated session data into the minute precision table.

2. The mass session data storage method of claim 1, wherein creating an index bucket in a database, the index bucket consisting of multiple layers of index space and assigning different time precision labels to each layer of index space, comprises:

Creating an index bucket with a bucket depth of at least 3 in a database to form at least three layers of index spaces, and sequentially distributing ten minutes, hours and days to the first layer of index space to the third layer of index space as time precision labels.

3. The mass session data storage method of claim 1, wherein storing the aggregated session data into the minute precision table comprises:

And storing the aggregated session data into the minute precision table, and configuring the minute time precision of the first piece of session data as the table name of the minute precision table to serve as index information.

4. The mass session data storage method of claim 1, wherein sequentially updating the index information in the multi-layer index space in response to monitoring that the data of the minute precision table changes, comprises:

In response to monitoring that a new minute precision table is generated, judging whether the time precision of the new minute precision table belongs to the current 10-minute precision table in the first-layer index space;

If it belongs to the current 10-minute precision table in the first-layer index space, saving the index information of the new minute precision table into the 10-minute precision table;

If it does not belong to the current 10-minute precision table in the first-layer index space, generating a new 10-minute precision table in the first-layer index space, and storing the index information of the new minute precision table in the new 10-minute precision table.

5. The mass session data storage method of claim 4, wherein the method further comprises:

in response to monitoring that a new 10-minute precision table is generated, judging whether the time precision of the new 10-minute precision table belongs to the current hour precision table in the second-layer index space;

If it belongs to the current hour precision table in the second-layer index space, saving the index information of the new 10-minute precision table into the hour precision table;

If it does not belong to the current hour precision table in the second-layer index space, generating a new hour precision table in the second-layer index space, and storing the index information of the new 10-minute precision table in the new hour precision table.

6. The mass session data storage method of claim 5, wherein the method further comprises:

In response to monitoring that a new hour precision table is generated, judging whether the time precision of the new hour precision table belongs to a current day precision table in a third layer index space;

If it belongs to the current day precision table in the third-layer index space, storing the index information of the new hour precision table in the day precision table;

If it does not belong to the current day precision table in the third-layer index space, generating a new day precision table in the third-layer index space, and storing the index information of the new hour precision table in the new day precision table.

7. The mass session data storage method according to any one of claims 4 to 6, wherein the manner of determining whether the time precision of the newly generated precision table in the previous layer index space belongs to the current precision table in the next layer index space comprises:

Judging whether the time difference between the table name of the newly generated precision table in the upper layer index space and the table name of the current precision table in the lower layer index space is smaller than a preset threshold value;

If the time difference is smaller than the preset threshold value, judging that the time precision of the newly generated precision table in the upper layer belongs to the current precision table of the lower layer;

The preset thresholds of the index spaces of the first layer to the third layer are respectively 10 minutes, 1 hour and 24 hours.

8. A massive session data query method based on the massive session data storage method according to any one of claims 1-7, characterized in that the query method comprises:

Inputting a time range of data to be queried;

Splitting the time range into corresponding time precision;

And carrying out matching indexes in the index spaces of the third layer to the first layer in sequence according to the time precision, and determining the range of the minute data table according to the final index precision.