CN113778974A - Log data processing method and device, storage medium and electronic equipment

Info

Publication number: CN113778974A
Application number: CN202110320793.5A
Authority: CN
Inventors: 罗勇
Original assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-12-10
Anticipated expiration: 2041-03-25
Also published as: CN113778974B

Abstract

The present disclosure provides a log data processing method and device, a storage medium, and an electronic device, and relates to the technical field of data processing. The log data processing method includes: dividing log data to be processed into one or more primary data segments according to event identifiers in the log data; dividing each primary data segment into one or more secondary data segments; and determining a bitmap of each secondary data segment according to the intervals to which the event identifiers in that secondary data segment belong, so as to obtain index information of the secondary data segment, where each bit in the bitmap corresponds to one interval. The present disclosure reduces the storage cost of log data to a certain extent.

Description

Log data processing method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a log data processing method, a log data processing apparatus, a computer-readable storage medium, and an electronic device.

Background

By analyzing log data of user behavior types, such as click log data and browsing log data, a user's preferences can be learned and personalized services can then be provided to the user. To support such analysis, the log data must be stored.

In the related art, log data of a user behavior type is usually cached in Redis memory rather than on disk. Because memory has a small storage capacity and a high cost, the storage cost of the log data is high.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure provides a log data processing method, a log data processing apparatus, a computer-readable storage medium, and an electronic device, thereby at least to some extent solving the problem of high storage cost of log data in related technologies.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to a first aspect of the present disclosure, there is provided a log data processing method, including: dividing the log data into one or more primary data segments according to event identifications in the log data to be processed; dividing the primary data segment into one or more secondary data segments; determining a bitmap of the secondary data segment according to the interval to which the event identifier belongs in the secondary data segment to obtain index information of the secondary data segment; each bit in the bitmap corresponds to an interval.

In an exemplary embodiment of the present disclosure, the method further comprises: determining a plurality of intervals according to the maximum value and the minimum value of the event identifier.

In an exemplary embodiment of the present disclosure, the determining a plurality of the intervals according to the maximum value and the minimum value of the event identifier includes: determining a plurality of intervals according to the maximum value and the minimum value of the event identifier in the primary data segment.

In an exemplary embodiment of the present disclosure, when the data type of the event identifier is a character string type, the method further includes: performing high-order priority sorting on the event identifiers, and determining the maximum value and the minimum value of the event identifier according to the sorting result.

In an exemplary embodiment of the present disclosure, the method further comprises: adding the maximum value and the minimum value of the event identification to the index information of the secondary data segment.

In an exemplary embodiment of the present disclosure, the determining a bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs includes: determining whether the number of the event identifications falling into each interval in the secondary data segment is 0; when the number of the event identifications falling into the interval is 0, setting the corresponding bit value of the interval in the bitmap as 0; and when the number of the event identifications falling into the interval is not 0, setting the corresponding bit value of the interval in the bitmap as 1.

In an exemplary embodiment of the present disclosure, the determining a bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs includes: replacing the event identifier in the secondary data segment with a mapping identifier; and determining the bitmap of the secondary data segment according to the interval to which the mapping identifier belongs.

In an exemplary embodiment of the disclosure, the replacing the event identifier in the secondary data segment with a mapping identifier includes: when the event identification in the secondary data segment meets a preset distribution condition, determining a discrete event identification in the event identification; and replacing the discrete event identification with a mapping identification.

In an exemplary embodiment of the present disclosure, the method further comprises: storing the correspondence between the event identifier and the mapping identifier.

In an exemplary embodiment of the present disclosure, the method further comprises: when the number of pieces of log data in any secondary data segment is larger than a first preset threshold, dividing that secondary data segment into at least two new secondary data segments; and determining the bitmap of each new secondary data segment to obtain the index information of the new secondary data segment.

In an exemplary embodiment of the present disclosure, the method further comprises: after any secondary data segment is divided, updating version information in the primary data segment to which that secondary data segment belongs, so as to associate the version information with the new secondary data segments obtained by the division.

In an exemplary embodiment of the present disclosure, the method further comprises: acquiring an event identifier to be queried; determining candidate secondary data segments among the secondary data segments according to the event identifier to be queried and the index information of each secondary data segment; and searching the candidate secondary data segments for log data corresponding to the event identifier to be queried.

In an exemplary embodiment of the present disclosure, the determining candidate secondary data segments among the secondary data segments according to the event identifier to be queried and the index information of each secondary data segment includes: judging, for each secondary data segment, whether the bit value corresponding to a target interval in its index information is 1, where the target interval is the interval to which the event identifier to be queried belongs; and when the bit value corresponding to the target interval in the index information is 1, determining the secondary data segment corresponding to the index information as a candidate secondary data segment.

According to a second aspect of the present disclosure, there is provided a log data processing apparatus including: a first dividing module, configured to divide the log data into one or more primary data segments according to the event identifiers in the log data to be processed; a second dividing module, configured to divide the primary data segment into one or more secondary data segments; and a bitmap determining module, configured to determine a bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs, so as to obtain index information of the secondary data segment; each bit in the bitmap corresponds to an interval.

According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described log data processing method.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above-described log data processing method via execution of the executable instructions.

The technical scheme of the disclosure has the following beneficial effects:

in the process of processing the log data, dividing the log data into one or more primary data segments according to event identifiers in the log data to be processed; dividing the primary data segment into one or more secondary data segments; determining a bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs to obtain index information of the secondary data segment; each bit in the bitmap corresponds to an interval. On one hand, by dividing the log data to be processed, the distributed storage of the log data can be realized, so that a plurality of storage servers are used for sharing the storage load. On the other hand, the index is constructed by using the interval to which the event identifier belongs in the data fragment, so that the searching efficiency of the log data can be ensured even if the log data is stored in a low-cost disk, and the storage cost can be reduced while the searching efficiency is ensured.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be obtained from those drawings without inventive effort for a person skilled in the art.

FIG. 1 shows a flowchart of a log data processing method in the present exemplary embodiment;

FIG. 2 is a diagram showing an example of a primary data segment data structure in the present exemplary embodiment;

FIG. 3 illustrates a flow diagram for determining a bitmap for a secondary data segment in the exemplary embodiment;

FIG. 4 is a flowchart illustrating replacement of an event identifier with a mapping identifier in the exemplary embodiment;

FIG. 5 illustrates a flow diagram of dividing a secondary data segment in the exemplary embodiment;

FIG. 6 illustrates a flow diagram for looking up log data in the exemplary embodiment;

FIG. 7 illustrates a flow diagram for determining candidate secondary data segments in the exemplary embodiment;

FIG. 8 illustrates a flowchart for querying click log data in the exemplary embodiment;

FIG. 9 illustrates a flowchart of click log data processing in the exemplary embodiment;

FIG. 10 is a block diagram showing the configuration of a log data processing apparatus in the present exemplary embodiment;

FIG. 11 shows an electronic device for implementing the above method in the present exemplary embodiment.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

Herein, "first", "second", etc. are labels for specific objects, and do not limit the number or order of the objects.

In the related art, when storing log data of a user behavior type, the log data is generally cached in Redis memory. Redis is an in-memory database that stores data in memory rather than on disk. As the volume of log data keeps growing, more and more Redis memory is occupied, and because memory has a small storage capacity and a high cost, the storage cost of the log data is high.

In view of one or more of the above problems, exemplary embodiments of the present disclosure provide a log data processing method.

Fig. 1 shows a schematic flow of the log data processing method in the present exemplary embodiment, including the following steps S110 to S130:

step S110, according to the event identification in the log data to be processed, dividing the log data into one or more primary data segments;

step S120, dividing the primary data segment into one or more secondary data segments;

step S130, determining a bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs to obtain index information of the secondary data segment; each bit in the bitmap corresponds to an interval.

Each step in fig. 1 will be described in detail below.

Step S110, according to the event identification in the log data to be processed, dividing the log data into one or more primary data segments.

The log data to be processed may be log data of user behavior types, such as click log data and browsing log data, and may include data fields of dimensions such as event identifier, URL, and IP. The event identifier is an identifier used to distinguish the events in the log data; it is unique, may be of a numeric type or a character string type, and may be, for example, an event ID. A primary data segment is a data segment composed of log data and is obtained by dividing the log data to be processed.

In an alternative embodiment, the dividing of the log data into one or more primary data segments may be implemented by: acquiring a hash value of the event identifier in the log data to be processed; and dividing the log data into one or more primary data segments according to the hash value of the event identifier.

For example, a plurality of hash value range intervals may be preset, and log data whose event identifier hash values fall in the same hash value range interval may be divided into one primary data segment.

The log data is divided into a plurality of primary data segments, so that the log data can be stored in a distributed manner, each primary data segment can be stored in different storage servers, and a plurality of storage servers share the storage load.

In an optional implementation manner, the index information of the primary data segment may also be determined according to characteristics of the event identifier in the primary data segment.

The characteristic of the event identifiers in a primary data segment may be the hash value range interval to which their hash values belong. The correspondence between each primary data segment and its hash value range interval is recorded and used as the index information of the primary data segment, so that the primary data segment to which a piece of log data belongs can be determined from the hash value of its event identifier.
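The hash-based division and the primary index lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the MD5-based `stable_hash`, the 32-bit hash space, the four equal hash value range intervals, and the record layout (a dict with an `event_id` key) are all hypothetical choices that do not come from the disclosure.

```python
import hashlib
from bisect import bisect_right

# Hypothetical hash space and number of primary data segments (assumptions).
HASH_SPACE = 2 ** 32
NUM_PRIMARY = 4
# Lower bound of each preset hash value range interval; interval i covers
# [RANGE_STARTS[i], RANGE_STARTS[i + 1]) and the last one runs to HASH_SPACE.
RANGE_STARTS = [i * HASH_SPACE // NUM_PRIMARY for i in range(NUM_PRIMARY)]


def stable_hash(event_id) -> int:
    """Deterministic 32-bit hash of an event identifier (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(str(event_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def primary_segment_of(event_id) -> int:
    """Index of the hash value range interval that the identifier's hash falls in."""
    return bisect_right(RANGE_STARTS, stable_hash(event_id)) - 1


def partition(logs):
    """Group log records into primary data segments by hash value range interval."""
    segments = {i: [] for i in range(NUM_PRIMARY)}
    for record in logs:
        segments[primary_segment_of(record["event_id"])].append(record)
    return segments


logs = [{"event_id": i, "url": f"/item/{i}"} for i in range(1, 21)]
primary_segments = partition(logs)
```

Because `primary_segment_of` depends only on the event identifier, the same function serves both for placing a record at write time and for locating its primary data segment at query time.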

Step S120, dividing the primary data segment into one or more secondary data segments.

A secondary data segment is a data segment composed of log data and is obtained by dividing a primary data segment. In the example primary data segment data structure shown in FIG. 2, a primary data segment is divided into n secondary data segments.

It should be noted that, in an actual implementation process, the data segments may also be divided more finely, for example, the secondary data segments are further divided into smaller data segments for storage.

In an alternative embodiment, the plurality of intervals may be determined according to the maximum value and the minimum value of the event identifier.

The maximum value and the minimum value of the event identifier may be the maximum value and the minimum value of the event identifier of the log data to be processed, or may be the maximum value and the minimum value of the event identifier in the primary data segment. When the intervals are determined from the maximum value and the minimum value, the range spanned by the maximum value and the minimum value may be divided equally, or it may be divided unequally according to the distribution of the event identifiers, with a range in which the event identifiers are sparse being assigned a larger interval. For example, if the event identifiers of 5 pieces of log data are 1, 2, 9, 10, and 100, the range spanned by the maximum value and the minimum value of the event identifiers is [1,100], and it may be divided into the intervals [1,5], [6,10], and [11,100].

The data segments can be divided according to these intervals: log data whose event identifiers fall in the same interval can be divided into one data segment.

In an alternative embodiment, when the maximum value and the minimum value of the event identifier are the maximum value and the minimum value of the event identifier in the primary data segment, determining a plurality of intervals according to the maximum value and the minimum value of the event identifier may be implemented by: and determining a plurality of intervals according to the maximum value and the minimum value of the event identifier in the primary data segment.

A plurality of intervals are determined according to the maximum value and the minimum value of the event identifier in the primary data segment, and the log data in the primary data segment whose event identifiers fall in the same interval can be divided into one secondary data segment. In this process, the range spanned by the maximum value and the minimum value of the event identifier in the primary data segment is divided into a plurality of intervals, so that the primary data segment is divided into one or more secondary data segments according to those intervals.
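As a minimal sketch of the equal-division variant, the following code splits the closed range between the minimum and maximum event identifiers of a primary data segment into near-equal intervals and groups records by interval. The function names, the integer identifiers, the record layout, and the choice of four intervals are assumptions made for illustration; the unequal, distribution-aware division mentioned in the text is not shown.

```python
def equal_intervals(min_id: int, max_id: int, count: int):
    """Split the closed range [min_id, max_id] into `count` closed, near-equal intervals."""
    total = max_id - min_id + 1
    count = min(count, total)  # never create an empty interval
    bounds = []
    start = min_id
    for i in range(count):
        # Spread any remainder over the first intervals.
        width = total // count + (1 if i < total % count else 0)
        bounds.append((start, start + width - 1))
        start += width
    return bounds


def interval_index(event_id: int, intervals) -> int:
    """Position of the interval that contains event_id."""
    for pos, (low, high) in enumerate(intervals):
        if low <= event_id <= high:
            return pos
    raise ValueError(f"{event_id} is outside every interval")


def split_into_secondary(primary_segment, count: int):
    """Group the records of a primary data segment by the interval of their identifier."""
    ids = [record["event_id"] for record in primary_segment]
    intervals = equal_intervals(min(ids), max(ids), count)
    groups = [[] for _ in intervals]
    for record in primary_segment:
        groups[interval_index(record["event_id"], intervals)].append(record)
    return intervals, groups


primary = [{"event_id": i} for i in (1, 2, 9, 10, 100)]
intervals, secondary = split_into_secondary(primary, 4)
# intervals -> [(1, 25), (26, 50), (51, 75), (76, 100)]
# group sizes -> [4, 0, 0, 1]
```

With the identifiers 1, 2, 9, 10, and 100, equal division leaves two groups empty, which illustrates why the text also allows larger intervals where identifiers are sparse; an implementation would presumably skip empty groups rather than store them.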

In an alternative embodiment, when the data type of the event identifier is a character string type, the method further comprises: performing high-order priority sorting on the event identifiers, and determining the maximum value and the minimum value of the event identifier according to the sorting result.

High-order priority sorting is a character string sorting method that compares character strings from their leading (highest-order) character onward; the first and the last event identifiers after sorting can be taken as the minimum value and the maximum value, respectively. For example, if the event identifiers of 5 pieces of log data are c, a, b, abc, and ab, high-order priority sorting yields a, ab, abc, b, and c, so a may be taken as the minimum value of the event identifier and c as the maximum value. Because the minimum value and the maximum value can be determined in this way even when the event identifier is of a character string type, the data type of the event identifier is not limited to a numeric type, which widens the range of application.
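One way to picture high-order priority sorting is a recursive sort that buckets strings by their leading character and then sorts each bucket on the next character. The sketch below is an assumed, simplified rendering of that idea (the name `msd_sort` is hypothetical), not the sorting routine of the disclosure; for plain strings it produces the same order as Python's built-in `sorted`.

```python
def msd_sort(strings):
    """Sort strings by comparing characters from the leading position onward."""
    def sort_at(items, pos):
        if len(items) <= 1:
            return items
        # Strings that end at this position come before longer strings
        # that share the same prefix.
        result = [s for s in items if len(s) == pos]
        buckets = {}
        for s in items:
            if len(s) > pos:
                buckets.setdefault(s[pos], []).append(s)
        # Visit buckets in character order and sort each on the next position.
        for ch in sorted(buckets):
            result.extend(sort_at(buckets[ch], pos + 1))
        return result

    return sort_at(list(strings), 0)


event_ids = ["c", "a", "b", "abc", "ab"]
ordered = msd_sort(event_ids)
min_id, max_id = ordered[0], ordered[-1]  # min_id == "a", max_id == "c"
```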

A bitmap is implemented as a binary array in which each bit stores one state. It is suited to large data sets in which each item has only a few possible states, and can be used to determine whether a given piece of data exists. Each bit in the bitmap corresponds to one interval, and the value of the bit indicates whether any data exists in that interval. A bitmap with a length of 5 bits can represent 2^5 = 32 distribution states of the event identifiers. The index information of the secondary data segment is composed of the intervals to which the event identifiers in the secondary data segment belong and the bitmap.

In an alternative embodiment, the maximum value and the minimum value of the event identifier may be added to the index information of the secondary data segment.

Adding the maximum value and the minimum value of the event identifier in the secondary data segment to the index information of the secondary data segment means that an index need not be stored for each piece of log data in the secondary data segment, which reduces the size of the index file and makes it easier to load.

In an optional implementation manner, a plurality of intervals may be further determined according to the maximum value and the minimum value of the event identifier in the secondary data segment, or according to the maximum value and the minimum value of the event identifier in the primary data segment to which the secondary data segment belongs, so that the interval to which each event identifier in the secondary data segment belongs can be determined from the divided intervals. The intervals can be formed by equally dividing the range spanned by the maximum value and the minimum value of the event identifier in the secondary data segment or, as described above, by dividing it unequally according to the distribution of the event identifiers. For example, if there are 5 pieces of log data in the secondary data segment with event identifiers 1, 2, 9, 10, and 100, the range spanned by the maximum value and the minimum value of the event identifier is [1,100], and it can be divided into the intervals [1,5], [6,10], and [11,100]. Dividing the range in this way provides reference intervals for determining the interval to which each event identifier in the secondary data segment belongs.

In an optional implementation manner, determining a bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs may be implemented by: determining whether the number of the event identifications in the secondary data segment falling into each interval is 0; when the number of the event identifications falling into the interval is 0, setting the corresponding bit value of the interval in the bitmap as 0; when the number of the event identifications falling into the interval is not 0, setting the corresponding bit value of the interval in the bitmap as 1.

The bit value corresponding to an interval in the bitmap indicates whether any event identifier in the secondary data segment falls into that interval: a bit value of 0 indicates that no event identifier in the secondary data segment falls into the interval, and a bit value of 1 indicates that at least one event identifier in the secondary data segment falls into the interval. Setting the bit values in this way indexes the event identifiers, so that irrelevant secondary data segments can be filtered out effectively and the search efficiency is improved.

As shown in FIG. 2, each secondary data segment contains index information consisting of a corresponding bitmap and an index, and the index information of the secondary data segment is stored in the secondary data segment. The index information structure of the secondary data segment can be as shown in Table 1, where the bitmap position identification identifies each bit in the bitmap, each range of data values is one of the divided intervals, and the bit value is the value of the corresponding bit in the bitmap: a bit value of 0 indicates that no event identifier in the secondary data segment falls into the corresponding range of data values, and a bit value of 1 indicates that an event identifier in the secondary data segment falls into the corresponding range of data values.

TABLE 1

Bitmap position identification	Range of data values	Bit value
0	1-40	0
1	41-80	0
2	81-200	1
3	201-400	1
4	401-8000	0
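The rule for setting bit values can be sketched as below, using the five ranges of data values from Table 1. The event identifiers in `segment_ids` are hypothetical values chosen so that the resulting bit values match Table 1, and the names `build_bitmap`, `pack_bits`, and the `index_info` layout are assumptions for illustration only.

```python
# Intervals copied from Table 1; each tuple is an inclusive (low, high) range.
INTERVALS = [(1, 40), (41, 80), (81, 200), (201, 400), (401, 8000)]


def build_bitmap(event_ids, intervals):
    """One bit per interval: 1 if any identifier falls in the interval, else 0."""
    bits = [0] * len(intervals)
    for event_id in event_ids:
        for pos, (low, high) in enumerate(intervals):
            if low <= event_id <= high:
                bits[pos] = 1
                break
    return bits


def pack_bits(bits) -> int:
    """Pack the bit list into an integer; bitmap position 0 is the lowest bit."""
    value = 0
    for pos, bit in enumerate(bits):
        value |= bit << pos
    return value


# Hypothetical identifiers of one secondary data segment (not from the disclosure).
segment_ids = [95, 150, 230, 399]
bitmap = build_bitmap(segment_ids, INTERVALS)  # [0, 0, 1, 1, 0]

# Assumed layout of the index information: min/max, the intervals, and the bitmap.
index_info = {
    "min": min(segment_ids),
    "max": max(segment_ids),
    "intervals": INTERVALS,
    "bitmap": bitmap,
}
```

Packing the bits into a single integer (12 for this bitmap) is one compact way to store the index, at the cost of needing the interval list alongside it to interpret each position.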

In an optional implementation manner, determining a bitmap of a secondary data segment according to an interval to which an event identifier belongs in the secondary data segment may be implemented by steps S310 to S320 shown in fig. 3, and specifically includes the following steps:

step S310, replacing the event identifier in the secondary data segment with a mapping identifier.

The mapping identifier is an identifier that corresponds to a particular event identifier, can be used in place of that event identifier, and has the same data type as the event identifier.

Step S320, determining a bitmap of the secondary data segment according to the interval to which the mapping identifier belongs.

The bitmap of the secondary data segment can be determined according to the interval to which the replaced mapping identifier belongs in the secondary data segment.

In the process shown in fig. 3, the event identifier that needs to be replaced may be determined according to the event identifier distribution. The distribution condition of the identifiers can be optimized by replacing the event identifiers, and the searching efficiency of the log data is further improved.

In an alternative embodiment, replacing the event identifier in the secondary data segment with the mapping identifier may be implemented by steps S410 to S420 shown in fig. 4, and specifically includes the following steps:

step S410, when the event identifications in the secondary data segment meet the preset distribution condition, determining discrete event identifications in the event identifications.

The preset distribution condition may be an event identifier discrete distribution condition preset by a developer according to experience, and when the distribution condition of a certain event identifier satisfies the preset distribution condition, the event identifier is determined as a discrete event identifier.

Step S420, replacing the discrete event identifier with the mapping identifier.

For example, when the event identifiers in the secondary data segment are unevenly distributed, say most event identifiers fall in the range 1-1000 but one event identifier is 9999, then 9999 can be replaced with an identifier close to the range 1-1000.

In the steps shown in FIG. 4, setting a preset distribution condition makes it possible to determine the discrete event identifiers in the secondary data segment, and replacing those discrete event identifiers reduces the span of the event identifier range, which further improves the efficiency of searching log data.
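The replacement of discrete event identifiers might look like the following sketch. The disclosure leaves the preset distribution condition to the developer, so the condition used here (at least `min_share` of the identifiers lie in a developer-supplied dense range) and the rule for choosing mapping identifiers (consecutive integers just above the dense range) are purely hypothetical.

```python
def replace_discrete_ids(event_ids, dense_low, dense_high, min_share=0.8):
    """Replace identifiers outside a dense range with nearby mapping identifiers.

    Returns the identifier list after replacement and the stored correspondence
    {event identifier: mapping identifier}.
    """
    ids = list(event_ids)
    in_range = [i for i in ids if dense_low <= i <= dense_high]
    # Hypothetical preset distribution condition: most identifiers are dense.
    if not ids or len(in_range) / len(ids) < min_share:
        return ids, {}
    mapping = {}
    next_id = dense_high + 1  # mapping identifiers sit just above the dense range
    for event_id in sorted(set(ids) - set(in_range)):
        mapping[event_id] = next_id
        next_id += 1
    replaced = [mapping.get(i, i) for i in ids]
    return replaced, mapping


ids = list(range(1, 1001, 100)) + [9999]  # ten dense identifiers plus one outlier
replaced, mapping = replace_discrete_ids(ids, 1, 1000)
# mapping -> {9999: 1001}; the identifier range shrinks from [1, 9999] to [1, 1001]
```

A query for 9999 would first be translated through the stored correspondence to 1001 before the bitmap is consulted, which is why the correspondence has to be kept with the segment.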

In an alternative embodiment, the correspondence between the event identifier and the mapping identifier is stored.

The correspondence between the event identifier and the mapping identifier may be stored in the corresponding primary data segment, for example in the metadata list of the primary data segment shown in FIG. 2, so that a single primary data segment is self-describing and easy to search. Alternatively, the correspondence between the event identifier and the mapping identifier may be stored in the corresponding secondary data segment, so that a single secondary data segment is self-describing and easy to search.

In an optional implementation, a data field in the log data may also be replaced, specifically as follows: when the occurrence frequency of a data field in the log data is greater than a second preset threshold, the data field is replaced with a mapping field, and the correspondence between the data field and the mapping field is stored.

The correspondence between the data fields and the mapping fields may be stored in the metadata of the primary data segment shown in FIG. 2, in which case one primary data segment uses one set of replacement relationships; it may also be stored in the metadata of the secondary data segment, in which case one secondary data segment uses one set of replacement relationships.

It should be noted that the data size of the mapping field is smaller than the data size of the data field. By replacing the data fields with larger data volume and higher occurrence frequency, the storage capacity of the log data can be effectively reduced.
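The field replacement above resembles dictionary encoding, and a minimal sketch under that reading follows. The `#n` form of the mapping field, the function name, and the sample URLs are assumptions for illustration; the sketch also assumes that no original field value already has the `#n` form.

```python
from collections import Counter


def replace_frequent_fields(records, field, threshold):
    """Replace values of `field` that occur more than `threshold` times.

    Returns the rewritten records and the stored correspondence
    {data field value: mapping field}.
    """
    counts = Counter(record[field] for record in records)
    mapping = {}
    for value, count in counts.items():
        if count > threshold:
            short = f"#{len(mapping)}"  # hypothetical short mapping field
            if len(short) < len(value):  # only replace when it saves space
                mapping[value] = short
    replaced = [
        {**record, field: mapping.get(record[field], record[field])}
        for record in records
    ]
    return replaced, mapping


LONG_URL = "https://example.com/a/very/long/landing/page"
records = [{"event_id": i, "url": LONG_URL} for i in (1, 2, 3)]
records.append({"event_id": 4, "url": "https://example.com/other"})

replaced, mapping = replace_frequent_fields(records, "url", threshold=2)
# Reading back uses the inverse of the stored correspondence.
reverse = {short: value for value, short in mapping.items()}
```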

In an optional implementation manner, the segmenting of the secondary data segment may be implemented through steps S510 to S520 shown in fig. 5, so as to avoid an excessive data amount of the secondary data segment, which specifically includes the following steps:

step S510, when the number of pieces of log data in any secondary data segment is greater than a first preset threshold, dividing that secondary data segment into at least two new secondary data segments.

When the secondary data segment is divided, the interval spanned by the maximum value and the minimum value of the event identifiers in the secondary data segment can be divided into two intervals; the original secondary data segment is then divided into two new secondary data segments according to the interval to which each of its event identifiers belongs, and the original secondary data segment is deleted. Alternatively, the division can be based on the number of pieces of log data in the secondary data segment, so that the numbers of pieces of log data in the new secondary data segments are equal or differ by 1. For example, if a secondary data segment includes 21 pieces of log data, the log data is sorted by event identifier and the segment is split at the 11th piece of log data.

Step S520, determining a bitmap of the new secondary data segment to obtain index information of the new secondary data segment.

In the steps shown in fig. 5, splitting secondary data segments prevents any single secondary data segment from growing so large that it degrades search efficiency, which keeps searches for log data within a secondary data segment efficient.

In an optional implementation manner, after any secondary data segment is divided, the version information in the primary data segment to which that secondary data segment belongs is updated, so that the version information is associated with the new secondary data segments obtained by the division.

As shown in fig. 2, version information is also stored in the primary data segment. Updating the version information makes it possible to track how secondary data segments have been divided and to associate the new secondary data segments with the corresponding primary data segment, which makes clear which primary data segment each secondary data segment belongs to.
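The disclosure does not fix a format for the version information. Purely as an assumption for illustration, the sketch below models it as an integer counter plus a per-version record of the secondary data segments that belong to the primary data segment; the class, attribute and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PrimarySegment:
    """Hypothetical primary data segment holding version information."""
    version: int = 0
    secondary_ids: list = field(default_factory=list)
    history: dict = field(default_factory=dict)  # version -> tuple of secondary ids

    def replace_secondary(self, old_id, new_ids):
        """Swap a divided secondary segment for its new segments and bump the version."""
        position = self.secondary_ids.index(old_id)
        self.secondary_ids[position:position + 1] = list(new_ids)
        self.version += 1
        self.history[self.version] = tuple(self.secondary_ids)
```

With such a record, a reader holding an older version number can tell that a division has occurred and which secondary data segments currently belong to the primary data segment.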

In an alternative embodiment, as shown in fig. 6, the log data may be searched through the following steps S610 to S630.

Step S610, an event identifier to be queried is obtained.

A message to be queried, such as an order message or a traffic message, can be obtained from a media platform (such as an advertisement platform) through a message queue, and the event identifier to be queried is then extracted from that message.

Step S620, determining candidate secondary data segments in each secondary data segment according to the identifier of the event to be queried and the index information of each secondary data segment.

When determining the candidate secondary data segment, firstly determining the candidate primary data segment corresponding to the event identifier to be queried from the log data to be processed according to the event identifier to be queried and the index information of the primary data segment, and then determining the candidate secondary data segment from each secondary data segment contained in the candidate primary data segment according to the event identifier to be queried and the index information of each secondary data segment.

In an alternative embodiment, step S620 may determine the candidate secondary data segment through the steps shown in fig. 7, specifically including step S710 and step S720:

Step S710, determining, for each secondary data segment, whether the bit value corresponding to a target interval in the index information of that secondary data segment is 1, where the target interval is the interval to which the event identifier to be queried belongs.

Step S720, when the bit value corresponding to the target interval in the index information is 1, determining the secondary data segment corresponding to the index information as a candidate secondary data segment.

In the steps shown in fig. 7, the interval to which the event identifier to be queried belongs is located first, so it is not necessary to traverse all log data, and secondary data segments unrelated to the event identifier to be queried can be filtered out quickly. When the bit value corresponding to the target interval in the index information is 0, no event identifier in that secondary data segment falls into the target interval, so the segment contains no log data corresponding to the event identifier to be queried; no further downward search is needed, and unnecessary query operations are avoided. When the bit value corresponding to the target interval in the index information is 1, the secondary data segment contains event identifiers in the target interval and may therefore contain log data corresponding to the event identifier to be queried, so the search continues downward into that segment.
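A minimal sketch of this bitmap filtering is given below. It assumes integer event identifiers, a known overall range `[lo, hi]` and equal-width intervals; the equal-width layout and all function names are assumptions of the sketch, since the disclosure only requires that each bit correspond to an interval.

```python
def interval_index(event_id, lo, hi, n_bits):
    """Map an event identifier in [lo, hi] to one of n_bits equal-width intervals."""
    if not lo <= event_id <= hi:
        return None
    width = (hi - lo + n_bits) // n_bits  # ceiling of (hi - lo + 1) / n_bits
    return (event_id - lo) // width


def build_bitmap(event_ids, lo, hi, n_bits):
    """Set bit i when at least one event identifier falls into interval i."""
    bitmap = 0
    for event_id in event_ids:
        index = interval_index(event_id, lo, hi, n_bits)
        if index is not None:
            bitmap |= 1 << index
    return bitmap


def candidate_segments(query_id, bitmaps, lo, hi, n_bits):
    """Keep only the segments whose bit for the target interval is 1."""
    target = interval_index(query_id, lo, hi, n_bits)
    if target is None:
        return []
    return [name for name, bitmap in bitmaps.items() if (bitmap >> target) & 1]
```

For example, with identifiers in `[0, 99]` and 10 intervals, a segment holding identifiers 3 and 15 is a candidate for a query on 17, even though 17 itself is absent; the bitmap only rules segments out, and the exact search of step S630 decides whether the log data is actually present.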

Step S630, searching the log data corresponding to the event identifier to be queried in the candidate secondary data segment.

After the candidate secondary data segment is determined, the candidate secondary data segment needs to be accurately searched so as to query the log data corresponding to the event identifier to be queried.

In an optional implementation manner, an inverted index may be created for each secondary data segment, so that the log data corresponding to the event identifier to be queried can be located exactly in the candidate secondary data segment through the inverted index, thereby ensuring the accuracy of the query.
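As a hedged illustration of the exact search, the sketch below builds an inverted index that maps each event identifier to the positions of its log records within one secondary data segment. The record layout (a dictionary with a hypothetical `event_id` key) and the function names are assumptions of the sketch.

```python
from collections import defaultdict


def build_inverted_index(segment):
    """Map each event identifier to the positions of its records in the segment."""
    index = defaultdict(list)
    for position, record in enumerate(segment):
        index[record["event_id"]].append(position)
    return dict(index)


def lookup(segment, index, event_id):
    """Return every record for event_id, or an empty list if there is none."""
    return [segment[position] for position in index.get(event_id, [])]
```

The lookup touches only the records listed for the queried identifier, so a candidate segment that passed the bitmap check but does not contain the identifier simply yields an empty result.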

When the log data is click log data, fig. 8 provides an implementation manner of querying click log data, which specifically includes the following steps:

step S801, start;

step S802, extracting the click ID from the order and traffic information in the message queue to obtain the click ID value, wherein the extracted click ID serves as the event identifier to be queried;

step S803, matching the click ID against the ranges represented by the bitmap index bits of each block, wherein each block serves as a secondary data segment and the bitmap of each block serves as its index information, so as to determine the target block file to which the click ID belongs, namely the candidate secondary data segment;

step S804, locating the block that matches the click ID value to obtain the target block file, wherein the target block file found by the matching in step S803 serves as the candidate secondary data segment;

step S805, performing an inverted index matching query on the target block file to obtain the corresponding log data, wherein the inverted index is used to locate exactly the log information corresponding to the click ID in the target block file;

and step S806, ending.

When the log data is click log data, fig. 9 shows an embodiment of click log data processing applied in an OCPX (Optimized Cost Per X, i.e. optimized X bidding) promotional bidding scenario, where "X" stands for any of the different traditional settlement modes. The processing includes two phases: an access phase and an output phase.

The access phase specifically comprises the following steps:

step S901, returning the user click log, wherein the user click log serves as the log data to be processed, and the advertisement media returns the user click log to the advertiser log core processing logic;

step S902, click data structure construction layer: after receiving the user click log returned by the advertisement media, the advertiser log core processing logic passes it to the click data structure construction layer, which organizes the user click log into log data with a specific structure, such as n primary data segments and m secondary data segments;

step S903, distributed log storage, namely storing the log data with the specific structure formed in step S902 in a distributed manner, wherein the n primary data segments can be stored on different storage servers, and the m secondary data segments can be stored in different blocks of a disk of a given storage server;

after the access phase is executed, the output phase is entered, and the output phase specifically comprises the following steps:

step S904, acquiring real-time messages such as flow, orders and the like;

step S905, performing a stream processing query on the real-time messages at the Flink/Storm stream processing layer, that is, querying the log data stored in step S903 for log data matching the real-time messages, wherein Flink and Storm are two open-source stream processing frameworks that process large data streams through a distributed streaming dataflow engine;

step S906, outputting the data processed by Flink/Storm through the Sink output layer, which serves as the output end of the stream processing layer and can write the output to files, sockets, external systems and the like; the output of the Sink output layer is then returned to the advertisement media by the advertiser log core processing logic;

step S907, OCPX conversion: OCPX is a tool that takes conversion cost as the optimization objective and dynamically adjusts bids according to the click-through rate and conversion rate of each individual piece of traffic; in this step, the advertisement media performs OCPX conversion according to the output of the Sink output layer to achieve dynamic bidding.

The steps shown in fig. 9 query the massive, distributedly stored click log data in real time under the OCPX promotional bidding mode, so that matching click log data is hit quickly and in real time, and dynamic bidding is carried out in real time according to the click-through rate and conversion rate of the hit click log data, thereby ensuring the timeliness and efficiency of OCPX conversion.

Exemplary embodiments of the present disclosure also provide a log data processing apparatus. As shown in fig. 10, the log data processing apparatus 1000 includes:

the first dividing module 1010 is configured to divide the log data into one or more primary data segments according to an event identifier in the log data to be processed;

a second partitioning module 1020 for partitioning the primary data segments into one or more secondary data segments;

the bitmap determining module 1030 is configured to determine a bitmap of the secondary data segment according to an interval to which the event identifier in the secondary data segment belongs, so as to obtain index information of the secondary data segment; each bit in the bitmap corresponds to an interval.

In an alternative embodiment, the log data processing apparatus 1000 further includes: an interval determining module, configured to determine a plurality of intervals according to the maximum value and the minimum value of the event identifier.

In an alternative embodiment, the interval determining module is configured to: determine a plurality of intervals according to the maximum value and the minimum value of the event identifier in the primary data segment.

In an optional embodiment, when the data type of the event identifier is a character string type, the interval determining module is further configured to: sort the event identifiers in high-order-first order (that is, comparing from the most significant character), and determine the maximum value and the minimum value of the event identifiers according to the sorting result.
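One possible reading of high-order-first sorting is ordinary left-to-right (lexicographic) string comparison, which is what Python's built-in string ordering provides. The sketch below follows that reading; the function name and the optional left-padding with zeros for identifiers of unequal length are assumptions of the sketch, not part of the disclosure.

```python
def id_bounds(event_ids, pad_to=None):
    """Return (minimum, maximum) of string identifiers, comparing from the leftmost character.

    pad_to is an assumed option: when given, identifiers are left-padded with
    zeros to that width so numeric strings of different lengths order as expected.
    """
    keys = [e.zfill(pad_to) if pad_to else e for e in event_ids]
    ordered = sorted(keys)  # str comparison proceeds character by character from the left
    return ordered[0], ordered[-1]
```

Without padding, `"9"` sorts after `"10"` because the first characters are compared first, so an implementation along these lines would need identifiers of equal length or a normalization step such as the padding shown.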

In an alternative embodiment, the log data processing apparatus 1000 further includes: an adding module, configured to add the maximum value and the minimum value of the event identifier to the index information of the secondary data segment.

In an alternative embodiment, the bitmap determination module 1030 is configured to: determining whether the number of the event identifications in the secondary data segment falling into each interval is 0; when the number of the event identifications falling into the interval is 0, setting the corresponding bit value of the interval in the bitmap as 0; when the number of the event identifications falling into the interval is not 0, setting the corresponding bit value of the interval in the bitmap as 1.

In an optional implementation, the bitmap determining module 1030 further includes: a replacing module, configured to replace the event identifier in the secondary data segment with a mapping identifier; the bitmap determining module 1030 then determines the bitmap of the secondary data segment according to the interval to which the mapping identifier belongs.

In an alternative embodiment, the replacing module is further configured to: when the event identifiers in the secondary data segment meet a preset distribution condition, determine the discrete event identifiers among the event identifiers; and replace the discrete event identifiers with mapping identifiers.

In an alternative embodiment, the log data processing apparatus 1000 further includes: a storage module, configured to store the correspondence between the event identifier and the mapping identifier.

In an alternative embodiment, the log data processing apparatus 1000 further includes: a data segment dividing module, configured to divide any secondary data segment into at least two new secondary data segments when the number of pieces of log data in that secondary data segment is greater than a first preset threshold; and a bitmap determining submodule, configured to determine the bitmap of each new secondary data segment so as to obtain the index information of the new secondary data segment.

In an alternative embodiment, the log data processing apparatus 1000 further includes: a version updating module, configured to update, after any secondary data segment is divided, the version information in the primary data segment to which that secondary data segment belongs, so as to associate the version information with the new secondary data segments obtained by the division.

In an alternative embodiment, the log data processing apparatus 1000 further includes: the identification acquisition module is used for acquiring an identification of an event to be queried; the candidate segment determining module is used for determining candidate secondary data segments in all the secondary data segments according to the identifier of the event to be queried and the index information of all the secondary data segments; and the searching module is used for searching the log data corresponding to the event identifier to be inquired in the candidate secondary data segment.

In an alternative embodiment, the candidate segment determining module is further configured to: respectively judging whether the bit value corresponding to the target interval in the index information of each secondary data segment is 1, wherein the target interval is the interval to which the event identifier to be inquired belongs; and when the bit value corresponding to the target interval in the index information is 1, determining the secondary data segment corresponding to the index information as a candidate secondary data segment.

The specific details of each part in the log data processing apparatus 1000 are described in detail in the method part embodiment, and details that are not disclosed may refer to the method part embodiment, and thus are not described again.

Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described log data processing method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing an electronic device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the electronic device. The program product may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on an electronic device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the log data processing method. An electronic device 1100 according to such an exemplary embodiment of the present disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.

As shown in fig. 11, electronic device 1100 may take the form of a general-purpose computing device. The components of the electronic device 1100 may include, but are not limited to: at least one processing unit 1110, at least one storage unit 1120, a bus 1130 connecting the various system components (including the storage unit 1120 and the processing unit 1110), and a display unit 1140.

The storage unit 1120 stores program code that may be executed by the processing unit 1110 to cause the processing unit 1110 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 1110 may perform one or more of the method steps of any of figs. 1 and 3 to 9.

The storage unit 1120 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 1121 and/or a cache memory unit 1122, and may further include a read-only memory unit (ROM) 1123.

The storage unit 1120 may also include a program/utility 1124 having a set (at least one) of program modules 1125, such program modules 1125 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 1130 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 1100 may also communicate with one or more external devices 1200 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1100, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 1150. Also, the electronic device 1100 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1160. As shown, the network adapter 1160 communicates with the other modules of the electronic device 1100 over the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.

Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided among, and embodied by, a plurality of modules or units.

As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system." Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the following claims.

Claims

1. A log data processing method, comprising:

dividing the log data into one or more primary data segments according to event identifications in the log data to be processed;

dividing the primary data segment into one or more secondary data segments;

determining a bitmap of the secondary data segment according to the interval to which the event identifier belongs in the secondary data segment to obtain index information of the secondary data segment; each bit in the bitmap corresponds to an interval.

2. The method of claim 1, further comprising:

and determining a plurality of intervals according to the maximum value and the minimum value of the event identifier.

3. The method of claim 2, wherein determining the plurality of intervals according to the maximum and minimum of the event identifier comprises:

and determining a plurality of intervals according to the maximum value and the minimum value of the event identifier in the primary data segment.

4. The method according to claim 2, wherein when the data type of the event identifier is a character string type, the method further comprises:

and carrying out high-order priority sequencing on the event identifications, and determining the maximum value and the minimum value of the event identifications according to a sequencing result.

5. The method of claim 2, further comprising:

adding the maximum value and the minimum value of the event identification to the index information of the secondary data segment.

6. The method of claim 1, wherein determining the bitmap of the secondary data segment according to the interval of the event identifier in the secondary data segment comprises:

determining whether the number of the event identifications falling into each interval in the secondary data segment is 0;

when the number of the event identifications falling into the interval is 0, setting the corresponding bit value of the interval in the bitmap as 0;

and when the number of the event identifications falling into the interval is not 0, setting the corresponding bit value of the interval in the bitmap as 1.

7. The method according to claim 1, wherein the determining the bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs comprises:

replacing the event identifier in the secondary data segment with a mapping identifier;

and determining the bitmap of the secondary data segment according to the interval to which the mapping identifier belongs.

8. The method of claim 7, wherein replacing the event identifier in the secondary data segment with a mapping identifier comprises:

when the event identification in the secondary data segment meets a preset distribution condition, determining a discrete event identification in the event identification;

and replacing the discrete event identification with a mapping identification.

9. The method of claim 7, further comprising:

and storing the corresponding relation between the event identifier and the mapping identifier.

10. The method of claim 1, further comprising:

when the number of the log data in any secondary data segment is larger than a first preset threshold value, dividing any secondary data segment into at least two new secondary data segments;

and determining the bitmap of the new secondary data segment to obtain the index information of the new secondary data segment.

11. The method of claim 10, further comprising:

after any secondary data segment is divided, updating version information in the primary data segment to which that secondary data segment belongs, so as to associate the version information with the new secondary data segments obtained by the division.

12. The method of claim 1, further comprising:

acquiring an event identifier to be queried;

determining candidate secondary data segments in each secondary data segment according to the identifier of the event to be queried and the index information of each secondary data segment;

and searching log data corresponding to the event identifier to be inquired in the candidate secondary data segment.

13. The method according to claim 12, wherein the determining candidate secondary data segments among the secondary data segments according to the event identifier to be queried and the index information of each secondary data segment includes:

respectively judging whether a bit value corresponding to a target interval in the index information of each secondary data segment is 1, wherein the target interval is an interval to which the event identifier to be queried belongs;

and when the bit value corresponding to the target interval in the index information is 1, determining the secondary data segment corresponding to the index information as a candidate secondary data segment.

14. A log data processing apparatus characterized by comprising:

the first dividing module is used for dividing the log data into one or more primary data segments according to the event identification in the log data to be processed;

the second dividing module is used for dividing the primary data segment into one or more secondary data segments;

the bitmap determining module is used for determining a bitmap of the secondary data segment according to the interval to which the event identifier in the secondary data segment belongs, so as to obtain index information of the secondary data segment; each bit in the bitmap corresponds to an interval.

15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 13.

16. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the method of any of claims 1 to 13 via execution of the executable instructions.