CN121255810A

CN121255810A - Log storage method, search method, device, storage medium, and program product

Info

Publication number: CN121255810A
Application number: CN202511375402.4A
Authority: CN
Inventors: 梁帅; 王杰; 陈震; 徐茂红; 屈有军; 代秋芳; 王宏波; 王楠; 何攀登
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2025-09-24
Filing date: 2025-09-24
Publication date: 2026-01-02

Abstract

The application provides a log storage method, a retrieval method, a device, a storage medium and a program product, and relates to the technical field of computer data processing. It is used to achieve data compression and efficient data management, to let the template set evolve adaptively through closed-loop feedback, and to solve the problem of a rigid and bloated template library. The storage method comprises obtaining original log data, matching it against log templates in a log template library, converting the original log data into structured data, and storing the structured data in a columnar storage system. The retrieval method comprises, after receiving a log retrieval request from a user, generating a retrieval task, executing the retrieval task to obtain a retrieval result for each retrieval subtask, and reconstructing the results to obtain reconstructed log data.

Description

Log storage method, search method, device, storage medium, and program product

Technical Field

The present application relates to the field of computer data processing technology, and in particular, to a log storage method, a retrieval method, a device, a storage medium, and a program product.

Background

In modern information systems, logs are key data assets that support core functions such as system observability, fault investigation, and security auditing. With the spread of distributed and cloud-native technologies, the volume of log data has grown explosively, which poses a serious challenge to the storage cost and query efficiency of existing log management schemes.

Related log retrieval technologies fall mainly into two types. The first is the metadata extraction and original log storage scheme, which is the basis of current mainstream log centers; representative technologies include systems based on a document database or a columnar database. Its core idea is to extract a small amount of general metadata (such as timestamp and log level) from a log, and then store that metadata together with the unprocessed, complete original log text. The second is the separate storage scheme based on an external template library, which decomposes each log into a static "template" and dynamic "variables" through log parsing. Only the template ID and the extracted variable values are stored in the main data table, and the template itself is stored in a separate, external "template library".

However, the system performance of these schemes hits serious bottlenecks in massive-data scenarios, and template management becomes increasingly complex and rigid, forming "technical debt" that is difficult to pay off.

Disclosure of Invention

The application provides a log storage method, a retrieval method, a device, a storage medium and a program product, which are used to achieve data compression and efficient data management.

In a first aspect, the application provides a log storage method, comprising: obtaining original log data; determining whether a target log template matching the original log data exists in a log template library, wherein the original log data comprises a static part and a dynamic part, and the target log template is identical to the static part of the original log data; when a target log template matching the original log data exists in the log template library, converting the original log data into first structured data based on the target log template; and storing the first structured data in a columnar storage system.

The technical scheme provided by the application has the following advantages. An original log usually comprises dynamic variables and a fixed static part, so a template can be matched based on the static part of the original log, and the unstructured original log can be converted into structured data based on the target log template and stored in a columnar storage system, which facilitates quick classification and batch analysis of logs. The columnar storage system then provides efficient compression and fast queries over the structured data, which reduces storage cost while supporting flexible statistics and aggregation over the dynamic variables in the logs. Log processing efficiency can thus be significantly improved, and data compression and efficient data management are achieved.

In one possible implementation, the first structured data comprises at least one of a template hash identifier, a dynamic array, a template text and the original log data, wherein the template hash identifier is a unique identifier of the target log template matched with the original log data, the dynamic array is the set of dynamic variable data extracted from the original log data based on the target log template, and the template text is the text form of the target log template.
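The conversion into first structured data can be illustrated with a minimal Python sketch. It is not the implementation of the application: the `<*>` placeholder, the regular-expression matching, the truncated SHA-256 hash, and the names `StructuredLog`, `template_to_regex` and `to_first_structured` are all assumptions made for illustration.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import List, Optional

PLACEHOLDER = "<*>"  # assumed marker for a dynamic variable in a template


@dataclass
class StructuredLog:
    template_hash: str     # unique identifier of the matched template
    variables: List[str]   # dynamic array extracted from the log
    template_text: str     # text form of the template
    raw: Optional[str]     # original log data (left empty here for matched logs)


def template_to_regex(template: str) -> re.Pattern:
    # Escape the static parts and turn each placeholder into a capture group.
    parts = [re.escape(p) for p in template.split(PLACEHOLDER)]
    return re.compile("^" + "(.+?)".join(parts) + "$")


def to_first_structured(raw: str, templates: List[str]) -> Optional[StructuredLog]:
    # Return the structured form for the first matching template, or None.
    for template in templates:
        m = template_to_regex(template).match(raw)
        if m:
            digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]
            return StructuredLog(digest, list(m.groups()), template, None)
    return None
```

For example, with the assumed template `User <*> logged in from <*>`, the log `User alice logged in from 10.0.0.1` yields the dynamic array `["alice", "10.0.0.1"]`, while a log that matches no template returns `None`.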

In another possible implementation, storing the first structured data in the columnar storage system comprises: writing the template hash identifier of the first structured data into a first preset physical column of the columnar storage system; submitting the template text of the first structured data to a dictionary encoding processing unit of the columnar storage system, wherein the dictionary encoding processing unit is configured to, when the template text already exists in the current data partition, write the dictionary identifier corresponding to the template text into the physical row, and, when the template text does not exist in the current data partition, add the template text to the dictionary corresponding to the current physical partition and then write the dictionary identifier corresponding to the template text into the physical row; and writing the dynamic array of the first structured data into a second preset physical column of the columnar storage system, wherein the second preset physical column supports storing array-type data.
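The per-partition dictionary encoding described above can be sketched as follows. This is a toy in-memory model under stated assumptions, not the storage engine of the application: the class `PartitionWriter`, its field names, and the use of consecutive integers as dictionary identifiers are hypothetical.

```python
from typing import Dict, List


class PartitionWriter:
    """Toy model of one data partition with dictionary-encoded template text."""

    def __init__(self) -> None:
        self.hash_col: List[str] = []         # first preset physical column
        self.template_ids: List[int] = []     # dictionary identifier per row
        self.vars_col: List[List[str]] = []   # second preset column (array type)
        self.dictionary: Dict[str, int] = {}  # template text -> dictionary id

    def encode_template(self, text: str) -> int:
        # Reuse the identifier if the text is already in this partition's
        # dictionary; otherwise add the text first, then return its identifier.
        if text not in self.dictionary:
            self.dictionary[text] = len(self.dictionary)
        return self.dictionary[text]

    def write(self, template_hash: str, template_text: str,
              variables: List[str]) -> None:
        self.hash_col.append(template_hash)
        self.template_ids.append(self.encode_template(template_text))
        self.vars_col.append(list(variables))
```

In this sketch, each distinct template text is stored once per partition, and every row carries only a small integer identifier, which is why repeated templates add little storage overhead.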

In another possible implementation, the original log data is converted into second structured data in the event that there is no target log template in the log template library that matches the original log data, and the second structured data is stored in the columnar storage system.

In another possible implementation manner, the second structured data comprises a template hash identifier, a dynamic array, a template text and original log data, wherein the template hash identifier of the second structured data is configured to be a preset value and used for indicating that the original log data is not matched with a log template, and the dynamic array and the template text in the second structured data are null values.

In another possible implementation, storing the second structured data in the columnar storage system comprises: writing the template hash identifier of the second structured data into the first preset physical column of the columnar storage system, wherein the first preset physical column is used to store the unique template identifier; and writing the original log of the second structured data into a third preset physical column of the columnar storage system, wherein the third preset physical column supports storing unstructured text data.
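The handling of an unmatched log can be sketched as follows, purely as an illustration. The sentinel value `UNMATCHED_HASH`, the dictionary layout of the record, and the function names are assumptions; the application only states that the hash identifier is set to a preset value and that the dynamic array and template text are null.

```python
from typing import Dict, List, Optional

UNMATCHED_HASH = "0" * 16  # assumed preset value meaning "no template matched"


def to_second_structured(raw: str) -> Dict[str, Optional[object]]:
    # The dynamic array and the template text are null for unmatched logs.
    return {
        "template_hash": UNMATCHED_HASH,
        "variables": None,
        "template_text": None,
        "raw": raw,
    }


def write_unmatched(record: Dict[str, Optional[object]],
                    hash_col: List[object],
                    raw_col: List[object]) -> None:
    # The hash goes to the first preset column, the raw text to the third.
    hash_col.append(record["template_hash"])
    raw_col.append(record["raw"])
```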

In another possible implementation, the log templates in the log template library are generated by the following steps: acquiring an original log set from a historical time period; for each log in the original log set, extracting a structured feature vector of the log; clustering the structured feature vectors of the logs in the original log set using a preset clustering algorithm to obtain a plurality of clusters; and, for each of the plurality of clusters, identifying the static part of the logs in the cluster as a template and marking the dynamic part as a variable placeholder.
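The template generation steps can be illustrated with a deliberately simple sketch. The application does not specify the feature vector or the clustering algorithm, so the sketch substitutes an assumed feature (token count plus first token) and exact grouping on that feature in place of a real clustering algorithm; the `<*>` placeholder and the function names are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PLACEHOLDER = "<*>"  # assumed variable placeholder


def feature(log: str) -> Tuple[int, str]:
    # Assumed stand-in for the "structured feature vector" of a log.
    tokens = log.split()
    return (len(tokens), tokens[0] if tokens else "")


def extract_templates(logs: List[str]) -> List[str]:
    # Group logs with identical features (a stand-in for clustering).
    clusters: Dict[Tuple[int, str], List[List[str]]] = defaultdict(list)
    for log in logs:
        clusters[feature(log)].append(log.split())

    templates = []
    for members in clusters.values():
        # zip(*members) walks the cluster position by position: a position
        # whose tokens all agree is static, otherwise it is a variable.
        merged = [col[0] if len(set(col)) == 1 else PLACEHOLDER
                  for col in zip(*members)]
        templates.append(" ".join(merged))
    return templates
```

With the logs `User alice logged in` and `User bob logged in` in one cluster, the second position differs and becomes a placeholder, giving the template `User <*> logged in`.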

In another possible implementation, if, within a preset time period, the proportion of original log data that does not match any log template is greater than or equal to a preset threshold, a new log template is generated based on the unmatched original log data, and the new log template is stored in the log template library.
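The threshold check that triggers template regeneration can be sketched as below. The class `UnmatchedMonitor`, its method names, and the idea of resetting the counters at the end of each period are assumptions for illustration; how the new template is actually derived from the pending logs is outside this sketch.

```python
from typing import List


class UnmatchedMonitor:
    """Tracks the share of unmatched logs within one preset time period."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.total = 0
        self.unmatched: List[str] = []

    def observe(self, raw: str, matched: bool) -> None:
        self.total += 1
        if not matched:
            self.unmatched.append(raw)

    def should_relearn(self) -> bool:
        # True when the unmatched proportion reaches the preset threshold.
        if self.total == 0:
            return False
        return len(self.unmatched) / self.total >= self.threshold

    def reset(self) -> List[str]:
        # Hand back the unmatched logs for template generation and start
        # a new period.
        pending, self.unmatched, self.total = self.unmatched, [], 0
        return pending
```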

In a second aspect, the application provides a log retrieval method based on the above log storage method, comprising: receiving a log retrieval request from a user; generating a retrieval task based on the log retrieval request and the storage structure of the columnar storage system, wherein the retrieval task comprises a plurality of parallel retrieval subtasks, and each retrieval subtask is used to retrieve a preset physical column in the columnar storage system; executing the retrieval task to obtain a retrieval result for each retrieval subtask; and reconstructing the retrieval results of the plurality of parallel retrieval subtasks to obtain reconstructed log data.

The technical scheme provided by the application has the following advantages. On the one hand, splitting the retrieval task into parallel subtasks for each preset physical column makes full use of columnar storage: only the target column data is scanned, which greatly reduces invalid input and output operations and improves retrieval efficiency. On the other hand, executing the subtasks in parallel may reduce the overall retrieval time. Reconstructing the retrieval results of the plurality of parallel retrieval subtasks allows the log data to be completely restored, which ensures the integrity of the log data, improves the readability of the logs, and meets the user's need for fast log queries.

In one possible implementation, the columnar storage system comprises a first preset physical column, a second preset physical column and a third preset physical column, wherein the first preset physical column is used to store the template hash identifier of the structured data corresponding to the original log data, the second preset physical column is used to store the dynamic array of the structured data corresponding to the original log data, and the third preset physical column is used to store the original log data. The plurality of parallel retrieval subtasks comprises a first retrieval subtask, a second retrieval subtask and a third retrieval subtask, wherein the first retrieval subtask retrieves in the first preset physical column, the second retrieval subtask retrieves in the second preset physical column, and the third retrieval subtask retrieves in the first preset physical column and the third preset physical column.
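The three parallel retrieval subtasks and the reconstruction step can be illustrated with the following sketch over in-memory lists. Several points are assumptions rather than details of the application: the query shape (a template hash plus a keyword), the way the subtask results are combined (rows hit by both the first and second subtasks, plus rows hit by the third), the `<*>` placeholder, the sentinel `UNMATCHED_HASH`, and the `templates` lookup from hash to template text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Set

PLACEHOLDER = "<*>"        # assumed variable placeholder
UNMATCHED_HASH = "0" * 16  # assumed preset value for unmatched logs


def reconstruct(template: str, variables: List[str]) -> str:
    # Interleave the static parts of the template with the stored variables.
    parts = template.split(PLACEHOLDER)
    out = parts[0]
    for value, static in zip(variables, parts[1:]):
        out += value + static
    return out


def search(hash_col: List[str], vars_col: List[Optional[List[str]]],
           raw_col: List[Optional[str]], templates: Dict[str, str],
           template_hash: str, keyword: str) -> List[str]:
    def by_hash() -> Set[int]:       # first subtask: first column only
        return {i for i, h in enumerate(hash_col) if h == template_hash}

    def by_variable() -> Set[int]:   # second subtask: second column only
        return {i for i, vs in enumerate(vars_col) if vs and keyword in vs}

    def by_raw() -> Set[int]:        # third subtask: first and third columns
        return {i for i, h in enumerate(hash_col)
                if h == UNMATCHED_HASH and keyword in (raw_col[i] or "")}

    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f) for f in (by_hash, by_variable, by_raw)]
        hash_rows, var_rows, raw_rows = (f.result() for f in futures)

    rows = sorted((hash_rows & var_rows) | raw_rows)
    return [raw_col[i] if hash_col[i] == UNMATCHED_HASH
            else reconstruct(templates[hash_col[i]], vars_col[i])
            for i in rows]
```

Each subtask touches only the columns it needs, and the matched rows are rebuilt from the template text and the dynamic array, while unmatched rows are returned from the raw column as stored.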

In a third aspect, the application provides a log storage device comprising an acquisition module and a storage module, wherein the acquisition module is used to obtain original log data; the storage module is used to determine whether a target log template matching the original log data exists in a log template library, wherein the original log data comprises a static part and a dynamic part, and the target log template is identical to the static part of the original log data; the storage module is further used to convert the original log data into first structured data based on the target log template when a target log template matching the original log data exists in the log template library; and the storage module is further used to store the first structured data in a columnar storage system.

In one possible implementation, the storage module is specifically configured to store the first structured data in the columnar storage system by: writing the template hash identifier of the first structured data into a first preset physical column of the columnar storage system; submitting the template text of the first structured data to a dictionary encoding processing unit of the columnar storage system, wherein the dictionary encoding processing unit is configured to, when the template text already exists in the current data partition, write the dictionary identifier corresponding to the template text into the physical row, and, when the template text does not exist in the current data partition, add the template text to the dictionary corresponding to the current physical partition and then write the dictionary identifier corresponding to the template text into the physical row; and writing the dynamic array of the first structured data into a second preset physical column of the columnar storage system, wherein the second preset physical column supports storing array-type data.

In another possible implementation manner, the storage module is specifically configured to convert the original log data into second structured data when there is no target log template matched with the original log data in the log template library, and store the second structured data in the column storage system.

In another possible implementation, the storage module is specifically configured to store the second structured data in the columnar storage system by: writing the template hash identifier of the second structured data into the first preset physical column of the columnar storage system, wherein the first preset physical column is used to store the unique template identifier; and writing the original log of the second structured data into a third preset physical column of the columnar storage system, wherein the third preset physical column supports storing unstructured text data.

In another possible implementation, the log templates in the log template library are generated by the following steps, which the storage module is specifically configured to perform: acquiring an original log set from a historical time period; for each log in the original log set, extracting a structured feature vector of the log; clustering the structured feature vectors of the logs in the original log set using a preset clustering algorithm to obtain a plurality of clusters; and, for each of the plurality of clusters, identifying the static part of the logs in the cluster as a template and marking the dynamic part as a variable placeholder.

In another possible implementation, the storage module is specifically configured to, if the proportion of original log data that does not match any log template is greater than or equal to a preset threshold within a preset time period, generate a new log template based on the unmatched original log data and store the new log template in the log template library.

In a fourth aspect, the application provides a log retrieval device comprising a receiving module and a retrieval module, wherein the receiving module is used to receive a log retrieval request from a user; the retrieval module is used to generate a retrieval task based on the log retrieval request and the storage structure of the columnar storage system, wherein the retrieval task comprises a plurality of parallel retrieval subtasks, and each retrieval subtask is used to retrieve a preset physical column in the columnar storage system; the retrieval module is further used to execute the retrieval task to obtain a retrieval result for each retrieval subtask; and the retrieval module is further used to reconstruct the retrieval results of the plurality of parallel retrieval subtasks to obtain reconstructed log data.

In a fifth aspect, the application provides an electronic device comprising a processor and a memory, the memory storing instructions executable by the processor, the processor being configured to, when executing the instructions, cause the electronic device to implement the method of the first or second aspect described above.

In a sixth aspect, the application provides a computer readable storage medium comprising computer software instructions which, when run in an electronic device, cause the electronic device to implement the method of the first or second aspects described above.

In a seventh aspect, the present application provides a computer program product comprising a computer program for causing an electronic device to carry out the method of the first or second aspect described above when the computer program is run in the electronic device.

For the advantageous effects of the third to seventh aspects, refer to the corresponding descriptions of the first and second aspects; they are not repeated here.

Drawings

FIG. 1 is a system architecture diagram of a log storage and retrieval system provided by the present application;

FIG. 2 is a flow chart of a log storage method provided by the application;

FIG. 3 is a schematic diagram of a log real-time structuring method based on multimode matching provided by the application;

FIG. 4 is a schematic diagram of an adaptive incremental learning method for log templates according to the present application;

FIG. 5 is a flowchart of another log storage method according to the present application;

FIG. 6 is a flow chart of a log retrieval method provided by the application;

FIG. 7 is a schematic diagram of a log storing and retrieving process according to the present application;

FIG. 8 is a schematic diagram of the composition of an electronic device according to the present application;

FIG. 9 is a schematic diagram of a log storage device according to the present application;

FIG. 10 is a schematic diagram of the composition of a log retrieval device according to the present application.

Detailed Description

The log storage method and the retrieval method provided by the application are described in detail below with reference to the accompanying drawings.

The term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean that A exists alone, that A and B exist together, or that B exists alone.

The terms "first" and "second" and the like in the description and in the drawings are used for distinguishing between different objects or between different processes of the same object and not for describing a particular order of objects.

Furthermore, references to the terms "comprising" and "having" and any variations thereof in the description of the present application are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed but may optionally include other steps or elements not listed or inherent to such process, method, article, or apparatus.

It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

In order to clearly describe the technical solution of the embodiments of the present application, in the embodiments of the present application, the terms "first", "second", etc. are used to distinguish the same item or similar items having substantially the same function and effect, and those skilled in the art will understand that the terms "first", "second", etc. are not limited in number and execution order.

In the description of the present application, unless otherwise indicated, the meaning of "a plurality" means two or more.

However, the schemes described in the Background still have problems. The first type of scheme ignores the large amount of repeated static template information in the log message body, so directly storing massive original text causes huge storage redundancy and cost. Queries on log content rely either on inefficient full-text scanning or on inverted indexes that themselves require very high storage cost, making it difficult to balance storage cost and query performance. The second type of scheme must perform a mandatory JOIN operation between the main data table and the external template library to restore a log at query time, which becomes a serious bottleneck for system performance in massive-data scenarios. Meanwhile, the external template library keeps growing over time and cannot be cleaned up (because historical data depends on it), so template management becomes increasingly complex and rigid, forming "technical debt" that is difficult to pay off.

To address these technical problems, the application provides a log storage method and a retrieval method. The idea is as follows. An original log usually comprises dynamic variables and a fixed static part, so a template can be matched based on the static part of the original log, and the unstructured original log can be converted into structured data based on the target log template and stored in a columnar storage system, which facilitates quick classification and batch analysis of logs. The columnar storage system then provides efficient compression and fast queries over the structured data, which reduces storage cost and achieves data compression and efficient data management. The retrieval task is split into parallel subtasks for each preset physical column, which makes full use of columnar storage: only the target column data is scanned, invalid input and output operations are greatly reduced, and retrieval efficiency is improved. Reconstructing the retrieval results of the plurality of parallel retrieval subtasks allows the log data to be completely restored, which ensures the integrity of the log data and improves the readability of the logs.

The embodiments of the present application will be described in detail below with reference to the drawings attached to the specification.

As a possible implementation, after acquiring log data, the log storage and retrieval system can structure the logs based on templates and store them differently depending on the processing result. It can also parse a user's log query instruction and return the log query result.

The log storage method and the retrieval method provided by the application can be applied to the log storage and retrieval system 10 shown in fig. 1. As shown in fig. 1, the log storage and retrieval system 10 of the present application may include an electronic device 11 and a storage device 12.

A communication connection is established between the electronic device 11 and the storage device 12. As examples, the connection may be a wireless connection, such as a Bluetooth connection or a wireless fidelity (Wi-Fi) connection, or a wired connection, such as a fiber optic connection. Illustratively, the electronic device 11 and the storage device 12 may be connected to the internet through a router, thereby enabling a communication connection between the electronic device 11 and the storage device 12.

As a possible implementation, by interfacing with a remote interface (such as receiving log data from other platforms), the electronic device 11 may obtain multidimensional logs from the terminal system layer (such as operating system running state and hardware resource usage), the application layer (such as startup, interaction and error records of various software), the data transmission layer (such as network request and protocol interaction logs) and the like. It may also dynamically adjust the acquisition policy according to the resource condition of the terminal (such as a high-performance terminal or a lightweight Internet of Things terminal); for example, a low-power, lightweight log acquisition mode is adopted on a resource-limited terminal, and multi-threaded parallel acquisition is supported on a high-performance terminal.

As another possible implementation, the electronic device 11 may receive the user's log query instruction through a variety of interaction modes: it supports local graphical interface operations (such as the user entering keywords, selecting a time range, setting a log level, or selecting a service type on the terminal interface), is compatible with command line input (such as submitting query parameters in a specific instruction format), and can also interface with a remote interface (such as receiving an API query request from a management platform), thereby covering both local operation and remote management scenarios.

In some implementations, the electronic device 11 may also pre-process the data. For example, it may filter special characters (e.g., line feeds and tabs) from the logs, deduplicate the logs (based on a hash of the log content or on key field matching), supplement additional dimensional information (e.g., matching a device name based on the device IP, or associating an application name based on the log source process ID), or verify the validity and integrity of a received log query instruction (e.g., whether required parameters are missing and whether the format is correct).
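Two of the pre-processing steps above, special character filtering and content-hash deduplication, can be sketched as follows. The function names and the choice of SHA-256 as the content hash are assumptions for illustration only.

```python
import hashlib
from typing import Iterable, List, Set


def clean(line: str) -> str:
    # Replace line feeds, carriage returns and tabs with a space, then trim.
    return (line.replace("\r", " ").replace("\n", " ")
                .replace("\t", " ").strip())


def preprocess(lines: Iterable[str]) -> List[str]:
    seen: Set[str] = set()   # content hashes of logs already emitted
    out: List[str] = []
    for line in lines:
        cleaned = clean(line)
        if not cleaned:
            continue         # skip lines that are empty after cleaning
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(cleaned)
    return out
```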

In some embodiments, the electronic device 11 can parse the obtained logs to generate log templates and convert the data into structured data based on the log templates for storage. It can also decompose a user's query instruction into subtasks, execute them, and reconstruct the execution results.

In some embodiments, the log template is used to match the obtained log data, and the electronic device 11 may convert the log data into structured data in different ways depending on the matching result, and store the structured data accordingly.

For example, the electronic device 11 may extract the structured feature vector of each log and perform clustering to obtain a plurality of clusters, where the static portion of the log in each cluster is identified as a template and the dynamic portion is labeled as a variable placeholder.

Optionally, the electronic device 11 may also store the obtained raw log data, or an index of the raw data, to facilitate searching, processing and analyzing the raw data.

In some embodiments, the electronic device 11 may send the log data after structuring to the storage device 12, and the storage device 12 may process, query and store the data from the electronic device 11. The electronic device 11 may also initiate explicit log manipulation instructions to the storage device 12, such as storing or querying a log, whereby the storage device 12 may manipulate the stored log data.

In some embodiments, the storage device 12 may receive log data (such as structured log fields, original log content, etc.) processed by the electronic device 11 and perform secure and persistent storage, so as to avoid loss of data due to limited local space or faults of the electronic device, and implement long-term retention of log data. On the other hand, when the electronic device 11 initiates the log operation instruction, the storage device 12 locates and operates the corresponding log data according to the conditions (such as time range, keywords, etc.) in the instruction, and feeds back the result to the electronic device 11, so that the electronic device 11 can be used for subsequent processing (such as displaying to the user).

By way of example, the electronic device 11 may be a server, for example, a single server, or a server cluster made up of a plurality of servers. In some implementations, the server cluster may also be a distributed cluster.

By way of example, the electronic device 11 may be a terminal device, such as a cell phone, tablet, desktop, laptop, handheld computer, notebook, ultra-mobile personal computer (UMPC), netbook, cellular telephone, personal digital assistant (PDA), or augmented reality (AR)/virtual reality (VR) device. The embodiments of the application do not limit the specific form of the terminal device.

By way of example, the storage device 12 may be a columnar storage system.

In some embodiments, the electronic device 11 and the storage device 12 may be two separate devices as shown in FIG. 1, or the electronic device 11 and the storage device 12 may be integrated in the same device.

In some embodiments, a prompting device 13 may also be included in the log storage and retrieval system 10. A communication connection is established between the prompting device 13 and the electronic device 11.

The prompting device 13 is used to present log query results. For example, the prompting device 13 may be a voice prompting device that presents the log query result to the user by reading it aloud. Alternatively, the prompting device 13 may be a display device that presents the log query result to the user on a display screen in a variety of ways (e.g., text, pictures, video).

It should be noted that, the system architecture described in the embodiments of the present application is for more clearly describing the technical solution of the embodiments of the present application, and does not constitute a limitation on the technical solution provided by the embodiments of the present application, and those skilled in the art can know that, along with the evolution of the system architecture, the technical solution provided by the embodiments of the present application is equally applicable to similar technical problems.

Referring to fig. 2, a flowchart of a log storage method according to an embodiment of the present application is shown. As shown in FIG. 2, the log storage method provided by the application specifically comprises the following steps S201 to S204.

S201, acquiring original log data.

In some embodiments, raw log data is used to record the behavior and status of a terminal device, application or operating system, etc., during operation, typically a raw collection of information. The original log data can cover information of multiple layers of terminal systems, application programs, hardware devices and the like.

The raw log data may include, for example, a system layer log generated by a terminal operating system. The system layer log records the core running state of the system, such as the Windows Event Viewer log, which includes application errors, system startup/shutdown records and hardware driver abnormalities, and the logs under the Linux /var/log directory, which record user login authentication behavior, system-wide events and the like.

The raw log data may also include, for example, an application layer log generated by the various applications running on the terminal. The application layer log records the specific operations and interactions of an application, such as file open/save/modify records of office software, web page access history of a browser, plug-in loading logs, and the like.

The raw log data may also include, for example, a hardware layer log, generated by a terminal hardware component or an external device. The hardware layer log reflects the hardware state and the running condition, such as the CPU temperature of the server, the real-time record of the memory occupation, the data acquisition time and the original value of the Internet of things sensor, and the like.

In some embodiments, the electronic device combines the generation characteristics of the raw log data with the terminal scenario and acquires the various kinds of log data by active acquisition, passive monitoring, or a combination of the two. Active acquisition is mainly suited to scenarios in which the log generation rhythm is controllable and the storage location is known, so that the electronic device can actively initiate data acquisition requests. Passive monitoring is better suited to scenarios in which logs are generated in real time and must be captured immediately.

The electronic device may directly read the raw logs generated by the system layer or the hardware layer through a native log interface provided by the terminal operating system. For example, on Windows, the system event log is obtained by calling the "Event Log API"; on Linux, kernel and application logs are obtained by reading the /proc file system or calling syslog functions; on a mobile device, logs are read through an SDK provided by the system. For raw logs stored as files (such as .log/.txt logs output by an application, or the access log files of a server), the electronic device on the one hand captures newly appended content through a file system monitoring mechanism (such as inotify on Linux or FileSystemWatcher on Windows) and reads the most recently generated logs in real time, and on the other hand reads historical log files in batches at a preset period. For custom-developed applications, log instrumentation points can be preset in the application code. When the application performs a specific operation, a raw log containing the operation details is automatically generated and either pushed to the acquisition device in real time over a protocol such as Socket or HTTP, or temporarily stored in a local cache for the electronic device to pull actively.

S202, determining whether a target log template matched with the original log data exists in a log template library.

Wherein the original log data comprises a static portion and a dynamic portion, and the target log template is identical to the static portion of the original log data.

In some embodiments, a log template is a general framework obtained by standardizing and abstracting raw log data that has a similar structure and repeated content. It strips the variable information from the raw log and extracts the fixed common structure, so that raw logs can be classified quickly and differences between log patterns can be identified.

In some embodiments, the static part is the text content that is fixed in the raw log data and identical across logs of the same kind. It is the basis for judging whether a raw log matches a template, and it does not change with the time, scenario, or object of log generation. Illustratively, the static part of one log is identical to that of every other log of the same template; it is the feature common to all logs under that template.

In some embodiments, the dynamic part is the variable information in the raw log data that changes with the scenario, object, and time of log generation. It is the content specific to an individual log and is usually represented by placeholders in the log template. It does not affect which type the log belongs to, but carries the specific details of the log. For example, among logs of the same template, the content that fills the dynamic placeholders changes with the time, scenario, object, or specific record of log generation.

In some embodiments, whether a target log template exists in the log template library may be determined by checking whether the static part of the raw log is completely consistent with the fixed structure of a template.

As a possible implementation, the acquisition device first splits the raw log into its static and dynamic parts, for example through algorithms such as string comparison and regular-expression matching, identifying the static parts that stay fixed and the dynamic parts that change. The split static part is then used as the search key for a matching query against the preset log template library, looking for a template whose fixed structure is identical to the static part.

As another possible implementation, a matching engine may be constructed to determine whether a target log template exists in the log template library. The template set is precompiled into an Aho-Corasick automaton, which performs multi-pattern matching on an input string in time linear in the length of the input plus the number of matches. To achieve multi-tenant or multi-service isolation, multiple independent automaton instances may be built. Real-time matching is then performed by feeding each input log into the corresponding automaton for scanning. The failure pointer mechanism of the Aho-Corasick automaton keeps matching efficient even when the number of templates is large.
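The multi-pattern matching idea can be illustrated with a minimal sketch of an Aho-Corasick automaton. It assumes that each template contributes its static text as one pattern; how placeholders are handled is not specified in the text, so the class name `AhoCorasick` and its methods are hypothetical and shown only to make the failure-link mechanism concrete.

```python
from collections import deque


class AhoCorasick:
    """Minimal character-level Aho-Corasick automaton (illustrative sketch)."""

    def __init__(self, patterns):
        # Per node: outgoing edges, failure link, ids of patterns ending here.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for pid, pattern in enumerate(patterns):
            self._insert(pattern, pid)
        self._build_failure_links()

    def _insert(self, pattern, pid):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pid)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit the patterns that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (end_index, pattern_id) pairs in order of end position."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            hits.extend((i, pid) for pid in self.out[node])
        return hits


# One automaton per tenant or service could be built in the same way.
automaton = AhoCorasick(["Connection from", "timeout", "from"])
hits = automaton.search("Connection from 10.0.0.1 timeout")
```

Each input character is consumed once, and a mismatch follows failure links instead of restarting the scan, which is why the cost does not grow with the number of templates.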

A real-time log structuring method based on multi-pattern matching is shown in fig. 3. The template library supplies template data to the constructed automaton in two modes, initialization and incremental update. Log data acquired from the message system enters the automaton and undergoes efficient template matching. The storage device stores the log data differently according to the matching result (e.g., matched to a template or not matched to any template).

In some embodiments, the log templates in the log template library are generated based on the steps of:

Step a1, acquiring an original log set in a historical time period.

In some embodiments, the original log set in the historical time period may be all unprocessed log data collected from the target system within a predefined past time interval. For example, a set of past raw logs L = {l₁, l₂, …, l_n} from a particular business system or tenant.

Step a2, extracting a structured feature vector for each log in the original log set.

In some implementations, extracting the structured feature vector for each log is a process that converts the unstructured raw log into a computable, comparable numerical vector. By analyzing the log text, key information capable of reflecting the core characteristics of the log is extracted, and the key information is quantized into a group of ordered numerical values to form a characteristic vector.

Illustratively, for each log l_i, its structured feature vector V_i may be extracted. This vector characterizes the macrostructure of the log, not its content. The structured feature vector may be extracted as follows:

V_i = {len(l_i), R_sym(l_i), R_dig(l_i), ...}

Where len(l_i) is the log length, R_sym(l_i) is the symbol character ratio, and R_dig(l_i) is the digit ratio.
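As a minimal sketch of the three named features, the function below computes the log length, the symbol character ratio, and the digit ratio. Treating "symbol characters" as ASCII punctuation is an assumption, since the text does not define the symbol set, and the function name `structural_features` is hypothetical.

```python
import string


def structural_features(log_line: str) -> tuple:
    """Return (len, R_sym, R_dig) for one raw log line."""
    n = len(log_line)
    if n == 0:
        return (0, 0.0, 0.0)
    # Assumption: "symbol characters" means ASCII punctuation.
    symbols = sum(1 for ch in log_line if ch in string.punctuation)
    digits = sum(1 for ch in log_line if ch.isdigit())
    return (n, symbols / n, digits / n)


features = structural_features("ab12!?")
```

Because the length is not on the same scale as the two ratios, a real pipeline would probably normalize the components before clustering; the text does not say whether it does.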

Step a3, clustering the structured feature vectors of the logs in the original log set by adopting a preset clustering algorithm to obtain a plurality of clusters.

In some embodiments, the clustering algorithm is an unsupervised learning algorithm that can automatically classify data with similar features into one class. In the generation of the log template, a preset clustering algorithm (such as K-means, DBSCAN and the like) can divide logs with similar features into the same group according to the similarity between the log structured feature vectors, so that the automatic classification of the logs is realized.

In some embodiments, the preset clustering algorithm is applied to the structured feature vectors of all logs in the original log set to compute and group them. By analyzing the distance or similarity between vectors, the algorithm assigns logs whose feature vectors differ little to the same cluster and logs that differ greatly to different clusters. The process needs no manual intervention: the algorithm automatically discovers the intrinsic associations and similar patterns among logs, and finally yields a plurality of clusters, each representing a set of logs with similar features.

For example, coarse-grained clustering may employ a clustering algorithm such as K-Means to divide logs whose feature vectors V_i are close in feature space into preliminary clusters. This step can quickly separate logs of disparate structure (e.g., JSON format from plain text format).
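The coarse-grained step can be sketched with a very small K-Means loop over feature vectors. This is an illustrative sketch, not the clustering configuration of the application: the deterministic initialization (first k points as centroids), the iteration count, and the function name `kmeans` are assumptions chosen to keep the example reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny K-Means: returns one cluster index per input point."""
    # Assumption for reproducibility: the first k points seed the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical (symbol ratio, digit ratio) vectors for four logs.
vectors = [(0.05, 0.10), (0.40, 0.02), (0.06, 0.12), (0.38, 0.03)]
labels = kmeans(vectors, k=2)
```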

Step a4, for each of the plurality of clusters, identifying the static part of the logs in the cluster as a template, and marking the dynamic part as a variable placeholder.

In some embodiments, fine-grained template extraction refines the common template within each preliminary cluster C_j using a finer-grained algorithm. This step may employ an algorithm based on a fixed-depth prefix tree to identify the common prefix and fixed portion (i.e., the static part) of the log messages in the cluster as the template, and calculates a unique hash identification for each generated template T_j. The algorithm marks the differing portion, i.e., the dynamic part, as a variable placeholder (e.g.).
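To make the static/dynamic split concrete, the sketch below uses a simple position-wise token comparison rather than the fixed-depth prefix tree mentioned above; it is a deliberately simpler stand-in. The placeholder token `<*>` and the function name `extract_template` are hypothetical, since the text does not name a placeholder, and the sketch only handles clusters whose logs have the same number of whitespace-separated tokens.

```python
PLACEHOLDER = "<*>"  # hypothetical placeholder token, not taken from the text


def extract_template(cluster_logs):
    """Keep tokens shared by every log; replace differing positions."""
    token_rows = [line.split() for line in cluster_logs]
    if not token_rows or len({len(row) for row in token_rows}) != 1:
        raise ValueError("sketch handles non-empty clusters with equal token counts")
    template_tokens = [
        column[0] if len(set(column)) == 1 else PLACEHOLDER
        for column in zip(*token_rows)
    ]
    return " ".join(template_tokens)


template = extract_template([
    "user alice logged in from 10.0.0.1",
    "user bob logged in from 10.0.0.2",
])
```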

An exemplary adaptive incremental learning method for log templates is shown in fig. 4. After a service log file is obtained, coarse-grained clustering may be performed on it with the K-means clustering algorithm based on features such as the symbol character ratio and the log length, yielding a plurality of clusters. For the logs in each cluster, the static part is identified as a template and stored in the template library corresponding to that cluster, for use in parsing logs. The parsed logs are stored in the columnar storage system. The electronic device performs statistical checks on the columnar storage system according to a given rule, performs incremental training on the logs that matched no template to obtain new templates, and repeats the above steps.

S203, converting the original log data into first structured data based on the target log template when the target log template matched with the original log data exists in the log template library.

In some implementations, when a target log template matching the original log data exists in the log template library, the original log is converted into the first structured data based on that template. The original log, loosely formatted and with scattered information, is thereby organized according to the rules preset by the template into a standardized form with well-defined fields and a uniform structure that can be used directly for storage and analysis.

In some embodiments, the first structured data comprises at least one of a template hash identification, a dynamic array, template text, raw log data.

The template hash identification is a unique identification of the target log template that matches the original log data, e.g., the template hash identification is a unique digital fingerprint of the log template.

Illustratively, the template hash identification is generated by computing the static text of the log template by a hash algorithm (e.g., MD5, SHA-1, etc.). The template hash identifier converts the static part of the template into a fixed-length character string, so that the unique characteristic of the template is reserved, and the storage volume is compressed.
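A minimal sketch of such an identifier, using MD5 from Python's standard `hashlib` only because the text names it as an example; the function name `template_hash` and the UTF-8 encoding are assumptions.

```python
import hashlib


def template_hash(template_text: str) -> str:
    """Fixed-length hexadecimal identifier of a template's static text."""
    return hashlib.md5(template_text.encode("utf-8")).hexdigest()


h = template_hash("user <*> logged in from <*>")
```

Whatever the template length, the identifier is 32 hexadecimal characters, which is what makes it suitable for a fixed-width column.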

Illustratively, the template hash identification may be denoted as H(T_k), where T_k represents a log template.

The dynamic array is the set of dynamic variable data extracted from the original log data based on the target log template. It consists of the specific values in the original log that correspond to the dynamic placeholders of the template text, and the order of the elements in the array is strictly consistent with the order of the placeholders.

By way of example, assuming the raw log l_raw matches the log template T_k, the dynamic array may be generated by extracting all variable parts from l_raw according to the structure of T_k, producing an ordered variable array A_v = [v₁, v₂, …].
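The extraction of an ordered variable array can be sketched by turning the template into a regular expression. The placeholder token `<*>` and the function name `extract_variables` are hypothetical; the sketch assumes placeholders are separated by static text.

```python
import re

PLACEHOLDER = "<*>"  # hypothetical placeholder token, not taken from the text


def extract_variables(template_text, raw_log):
    """Return the values filling each placeholder, in placeholder order."""
    static_parts = [re.escape(part) for part in template_text.split(PLACEHOLDER)]
    # Each placeholder becomes one capture group between the static parts.
    pattern = "(.+?)".join(static_parts)
    match = re.fullmatch(pattern, raw_log)
    return list(match.groups()) if match else None


variables = extract_variables(
    "user <*> logged in from <*>",
    "user alice logged in from 10.0.0.1",
)
```

A return value of `None` signals that the raw log does not fit the template at all.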

The template text is the text form of the target log template. It is a description of the structure common to logs of the same kind, that is, the fixed descriptive content of the original logs.

S204, storing the first structured data in the columnar storage system.

In some embodiments, a columnar storage system is a data storage architecture with columns as basic storage and access units, where all values of the same field (column) in a data table are stored in a contiguous physical space, and different fields are stored separately and independently. When the data of a specific column is required to be queried, the system does not need to read all fields of the whole record, and only the storage block of the target column is required to be loaded, so that unnecessary data reading and writing are greatly reduced.

In some embodiments, storing the first structured data in a columnar storage system includes:

Step b1, writing the template hash identifier of the first structured data into a first preset physical column of the columnar storage system.

The first preset physical column is an independent physical storage unit in the columnar storage system dedicated to storing the template hash identifiers of all structured data, with a fixed data format (such as a fixed-length string type) and fixed storage rules. The template hash identifier that uniquely identifies the target log template is extracted from the first structured data and written to the designated record position of the first preset physical column according to the write rules of columnar storage.

For example, an input structured log record is received as Record = {template_hash, variables_array, template_text, raw_log}, where template_hash is the template hash identifier, variables_array is the dynamic array, template_text is the template text, and raw_log is the original log data. The template_hash value is written to the physical column designated as "template unique identifier", while the hash identifier remains associated, in the data record dimension, with the other fields of the same first structured data (e.g., the dynamic array and the original log information).

Step b2, submitting the template text of the first structured data to a dictionary encoding processing unit of the columnar storage system, wherein the dictionary encoding processing unit is used for writing the dictionary identifier corresponding to the template text into a physical row in the case that the template text already exists in the current data partition.

In some embodiments, the dictionary encoding processing unit is the module of the columnar storage system responsible for compressing, storing, and mapping highly repetitive text or fixed-value data. By building a dictionary of unique values and replacing the original data with short identifiers, it reduces storage redundancy and improves read and write efficiency.

Step b3, in the case that the template text does not exist in the current data partition, adding the template text to the dictionary corresponding to the current physical partition, and writing the dictionary identifier corresponding to the template text into the physical row.

As a possible implementation, after the template text is submitted to the dictionary encoding processing unit of the columnar storage system, the unit first checks whether the same template text already exists in the current data partition. If it does (i.e., a corresponding entry exists in the dictionary), the unique identifier corresponding to the template text in the dictionary (such as the dictionary index ID) is extracted directly and written into the physical row of the current data record.

If the template text is not found, the new template text is added as a new entry to the dedicated dictionary corresponding to the current data partition and is assigned a unique, system-generated dictionary identifier. The template text itself is then not stored in the row; instead, the newly assigned dictionary identifier is written into the corresponding position of the physical row of the current data record, establishing the mapping between the physical row and the template text in the dictionary.
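The two branches of steps b2 and b3 can be sketched as a per-partition dictionary. The class name `PartitionDictionary` and the choice of sequential integers as dictionary identifiers are assumptions for illustration; a real columnar engine performs this encoding internally.

```python
class PartitionDictionary:
    """Per-partition dictionary mapping template text to a short integer id."""

    def __init__(self):
        self._ids = {}    # template text -> dictionary id
        self._texts = []  # dictionary id -> template text

    def encode(self, template_text):
        """Return the existing id, or add the text and return a new id."""
        dict_id = self._ids.get(template_text)
        if dict_id is None:
            dict_id = len(self._texts)
            self._ids[template_text] = dict_id
            self._texts.append(template_text)
        return dict_id

    def decode(self, dict_id):
        return self._texts[dict_id]


dictionary = PartitionDictionary()
# The column stores only the ids; repeated templates cost one small integer each.
column = [dictionary.encode(t) for t in ["user <*> login", "disk <*> full", "user <*> login"]]
```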

Step b4, writing the dynamic array of the first structured data into a second preset physical column of the columnar storage system, wherein the second preset physical column supports storing array-type data.

In some embodiments, the second preset physical column is specifically designed for storing array-type data and supports the element structures (e.g., multi-type elements, variable length) and storage formats (e.g., JSON array, binary array) of dynamic arrays. When the dynamic array of the first structured data is written into the second preset physical column of the columnar storage system, the dynamic array is extracted from the first structured data and, without being split into individual elements, is stored whole at the corresponding position according to the write specification of the physical column, preserving the order and association of the elements in the array.

Illustratively, the variables_array in the first structured data may be written to a physical column that supports storing ordered lists of elements (i.e., an Array type column). Custom key-value mappings are optional: if the record contains custom key-value pairs, they are written into a physical column that supports key-value storage (i.e., a Map type column).

In some embodiments, while the data is written, instructions may also be issued to the storage system to direct the physical layout of the data, so that it is physically stored in a specific organization.

Illustratively, instruction one (data co-location) instructs the storage system to physically arrange data blocks according to a hierarchical ordering key (log timestamp, service identifier, template unique identifier). This step ensures that logs of the same template are highly aggregated both in time and logically, providing good data locality for read operations. Instruction two (auxiliary index construction) instructs the storage system to build and maintain a probabilistic skip index (e.g., a Bloom filter) on the "unstructured text" column. The purpose of this step is to let a subsequent query that must scan this column use the index to quickly filter out the large number of data blocks that do not contain the target keyword.
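The two layout instructions can be sketched in a few lines: a sort key for co-location, and one small Bloom filter per data block so that a keyword query can skip blocks that definitely lack the keyword. This is a sketch under stated assumptions: the filter size, the number of hash functions, whitespace tokenization, and all names (`BloomFilter`, `layout_key`, `build_block_filters`, `candidate_blocks`) are hypothetical, and a real storage engine would build such an index itself.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; size and hash count are illustrative defaults."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, token):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def layout_key(record):
    """Hierarchical ordering key: (timestamp, service, template identifier)."""
    return (record["timestamp"], record["service"], record["template_hash"])


def build_block_filters(blocks):
    """One filter per data block over the whitespace tokens of its raw logs."""
    filters = []
    for block in blocks:
        bloom = BloomFilter()
        for raw_log in block:
            for token in raw_log.split():
                bloom.add(token)
        filters.append(bloom)
    return filters


def candidate_blocks(filters, keyword):
    """Indices of blocks that may contain the keyword; the rest are skipped."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(keyword)]


blocks = [["kernel panic at boot", "oom killer invoked"], ["disk sda1 full"]]
filters = build_block_filters(blocks)
```

A Bloom filter never reports a present token as absent, so skipping blocks on a negative answer cannot lose results; a positive answer still requires scanning the block.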

In some embodiments, by means of efficient compression and fast query characteristics of column storage for structured data, flexible statistics and aggregation operation of dynamic variables in a log are supported while storage cost is reduced, log processing efficiency can be remarkably improved, and data compression and efficient data management are achieved.

Referring to fig. 5, a flowchart of another log storage method according to an embodiment of the present application is shown. Specifically, the method comprises the following steps S501-S502.

S501, converting the original log data into second structured data under the condition that a target log template matched with the original log data does not exist in a log template library.

In some embodiments, when no target log template in the log template library matches the original log data, the static part of the original log does not match the static structure of any template in the library, and the log belongs to a new, as yet undefined type.

In some embodiments, the second structured data comprises a template hash identifier, a dynamic array, a template text, and original log data. The template hash identifier of the second structured data is set to a preset value indicating that the original log data matched no log template, and the dynamic array and the template text of the second structured data are null.

In some embodiments, when the original log data matches no target template in the log template library, the template hash identifier that would otherwise reference the matched template is uniformly set to a preset value (such as "NULL", "UNMATCHED_tpl", or a specific code). The preset value directly indicates that the original log corresponding to the structured data belongs to the unmatched type, which facilitates subsequent rapid screening and distinguishing.

Illustratively, when the original log data does not match any target template in the log template library, a special unmatched record is generated: the template hash is a predefined special value (e.g., 0), the variable array is empty, and the complete original log is retained.

S502, storing the second structured data in the column type storage system.

In some embodiments, storing the second structured data in the columnar storage system includes: writing the template hash identifier of the second structured data to the first preset physical column of the columnar storage system, the first preset physical column being used to store template unique identifiers; and writing the original log of the second structured data to a third preset physical column of the columnar storage system, the third preset physical column supporting storage of unstructured text data.

In some embodiments, the template hash identifier of the second structured data is written to the first preset physical column of the columnar storage system in the manner described above. The first preset physical column therefore holds the template hash identifiers of both the first structured data and the second structured data.

In some embodiments, the third preset physical column supports storing unstructured text data, with storage capability adapted to the original log text format. According to the write specification of the columnar storage system, the extracted original log is stored directly at the corresponding record position of the third preset physical column, and the template hash identifier keeps it associated, in the record dimension, with the other fields of the second structured data (the dynamic array and the template text).

Illustratively, when the second structured data is stored in the columnar storage system, the predefined "unmatched" special value is written to the "template unique identification" column. The raw_log in the record is written in its entirety into the dedicated physical column designated as "unstructured text", i.e., the third preset physical column. The other columns associated with templates and variables (dynamic array, template text) are all set to null in this row. As with the physical layout instructions described above, instructions may be issued to the storage system while the data is written so that the data is physically stored in a specific organization.
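The column-wise write of an unmatched record can be sketched with an in-memory stand-in for the columnar storage system. The class `ColumnStore`, the helper `unmatched_record`, and the use of the string "0" as the special value are assumptions for illustration; the field names follow the Record layout given earlier in the text.

```python
UNMATCHED = "0"  # predefined special value; the text gives 0 as an example


class ColumnStore:
    """In-memory stand-in: one Python list per physical column."""

    def __init__(self):
        self.template_hash = []      # first preset physical column
        self.variables_array = []    # second preset physical column
        self.template_text = []      # dictionary-encoded in a real system
        self.unstructured_text = []  # third preset physical column

    def append(self, record):
        """Write one record column by column and return its row index."""
        self.template_hash.append(record["template_hash"])
        self.variables_array.append(record.get("variables_array"))
        self.template_text.append(record.get("template_text"))
        self.unstructured_text.append(record.get("raw_log"))
        return len(self.template_hash) - 1


def unmatched_record(raw_log):
    """Second structured data: special hash, null template fields, full raw log."""
    return {
        "template_hash": UNMATCHED,
        "variables_array": None,
        "template_text": None,
        "raw_log": raw_log,
    }


store = ColumnStore()
row = store.append(unmatched_record("kernel panic: unexpected trap"))
```

The shared row index plays the role of the record dimension: the same position in every column belongs to the same log.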

In some embodiments, if, within a preset time period, the proportion of raw log data that is not matched to any log template is greater than or equal to a preset threshold, a new log template is generated based on the unmatched raw log data and stored in the log template library.

In some implementations, the preset time period is the window over which the proportion of unmatched logs is computed. It may be set in several ways. For a high-frequency log scenario (such as a server generating tens of access logs every second), the preset time period may be set to 5-15 minutes. For a low-frequency log scenario (such as equipment fault logs, or system inspection logs produced once a day), it may be extended to 1-24 hours. It may be aligned with the change cycle of the service update rhythm (such as a single log or a plurality of logs stored in a database immediately or after a delay). It may also be set by referring to the historical fluctuation pattern of unmatched logs, that is, by analyzing the periods in which unmatched logs occur in historical data.

In some embodiments, the preset threshold is the critical proportion for deciding whether a new template needs to be generated. A properly chosen value avoids both frequent generation of invalid templates when the threshold is too low (for example, an occasional abnormal log triggering an update) and omission of important new log patterns when the threshold is too high (for example, a large number of valid new logs remaining unmatched for a long time).

In some embodiments, the preset threshold may be set as follows. A base threshold is set according to service tolerance: for core service scenarios (such as transaction logs and payment callback logs), which require extremely high template coverage, it is set low (such as 5%-10%); for non-core service scenarios (such as user behavior instrumentation logs and debugging logs of non-critical services), whose logs do not affect core functions in the short term, it may be raised to 15%-30%. The threshold is then adjusted for the resource cost of template generation, since generating a new template requires computation such as log clustering, static part extraction, and placeholder marking: if system resources are limited (such as edge devices and low-specification servers), the threshold may be raised appropriately (such as from 10% to 20%); if system resources are sufficient (such as a cloud log analysis platform), it may be lowered (such as to 5%). The effect of past threshold settings can also be analyzed by referring to the historical update results of the template library.

Illustratively, the system continuously monitors, in the online processing module, the proportion of logs that fail to match any existing template ("unmatched logs"). The monitoring index ρ is defined as follows:

ρ = N_unmatched / N_total

Where N_unmatched is the number of unmatched logs per unit time and N_total is the total number of logs. When ρ exceeds a certain threshold (which may be 0.01-0.1, such as 0.05), an incremental training task is automatically triggered. The task gathers these unmatched logs and learns new templates from them.
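The trigger condition can be sketched as a ratio check over one monitoring window. The function names and the default threshold of 0.05 (the example value above) are illustrative; the comparison uses "greater than or equal to", following the wording of the embodiment that defines the preset threshold.

```python
def unmatched_ratio(n_unmatched: int, n_total: int) -> float:
    """Proportion of logs in the window that matched no template."""
    return n_unmatched / n_total if n_total else 0.0


def should_trigger_incremental_training(n_unmatched, n_total, threshold=0.05):
    """True when the unmatched proportion reaches the preset threshold."""
    return unmatched_ratio(n_unmatched, n_total) >= threshold


# Hypothetical window: 120 unmatched logs out of 2000 gives a ratio of 0.06.
trigger = should_trigger_incremental_training(120, 2000)
```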

Referring to fig. 6, a flowchart of a log searching method according to an embodiment of the present application is shown. As shown in FIG. 6, the log searching method specifically includes the following steps S601 to S604.

S601, receiving a log search request of a user.

In some embodiments, the user's log search request may come from different entities and terminals: it may be a real-time query initiated by operation and maintenance personnel through a system management platform (such as a Web console or a desktop client), a batch search task submitted by business personnel through an API (such as fetching the "user payment failure log" on a daily schedule), or a related search triggered automatically by the system (such as automatically searching related context logs after an anomaly is detected).

In some embodiments, the search request may include explicit query elements that define the search scope. For example, the user may specify a search time range to avoid traversing the full log, may provide search conditions (e.g., matching logs by original log keywords), and in some scenarios may specify a return format (e.g., JSON, table) and a limit on the number of results. Illustratively, the user submits a full-text search request against a virtual log, for example a request intended to "find all logs containing a particular keyword K".

S602, generating a retrieval task based on the log retrieval request and a storage structure of the columnar storage system.

The search task comprises a plurality of parallel search subtasks, and each search subtask is used for searching a preset physical column in the column type storage system.

In some embodiments, the storage structure of the columnar storage system organizes data by column. Each field of the data table (such as the log timestamp, template hash, and original log content) is stored together as an independent column segment, and different columns are placed in different storage areas, so that a query only needs to load the target column segments, which reduces wasted I/O. Dedicated metadata containing the storage position, data range, and statistical information is maintained for each column segment, supporting rapid screening and exclusion of irrelevant column segments. Because values within one column are highly similar, dedicated compression algorithms such as dictionary encoding and delta encoding are applied according to the characteristics of each column (such as repeated strings or consecutive numerical values), which saves space and improves transmission efficiency. Column-level indexes such as B+ trees, hash indexes, or Bloom filters are built for frequently queried columns to further support query scenarios such as equality and range lookups.

In some embodiments, when a retrieval task is generated based on the log retrieval request and the storage structure of the columnar storage system, the core idea is to split the query logic by column so that it fits the characteristics of columnar storage. The retrieval conditions in the request (such as a template hash identifier, a dynamic variable value, or an original log keyword) are first parsed and mapped to the corresponding preset physical columns (the template hash to the first column, the dynamic array to the second column, and the original log to the third column), and the overall retrieval task is then split into a plurality of parallel search subtasks.

In some embodiments, the columnar storage system includes a first preset physical column, a second preset physical column, and a third preset physical column. The first preset physical column stores the template hash identifier of the structured data corresponding to the original log data, the second preset physical column stores the dynamic array of that structured data, and the third preset physical column stores the original log data. The plurality of parallel search subtasks includes a first search subtask, a second search subtask, and a third search subtask. The first search subtask searches the first preset physical column, the second search subtask searches the second preset physical column, and the third search subtask searches the first preset physical column and the third preset physical column.

By way of example, a multi-path parallel execution plan is generated for the user's query request. Rather than directly executing a full-text search for keyword K, the system dynamically generates an execution plan comprising a plurality of parallel search subtasks according to the organization of the underlying data. Search subtask A (template path search) creates a task that searches for keyword K in the dictionary of the "template unique identification" column. Since the dictionary is much smaller than the original data, this task executes very quickly. Search subtask B (variable path search) creates a task that searches for keyword K in the "ordered variable array" column, using the storage engine's efficient functions for iterating over the elements of an array. Search subtask C (unstructured path search) creates a task that searches for keyword K in the "unstructured text" column. This task applies two key optimizations. The first is pre-filtering: the rows are first filtered to those whose "template unique identification" column equals the "unmatched" special value, which quickly narrows the search range to the small proportion of raw logs that matched no template. The second is index utilization: the probabilistic skip index built on the column is used to further filter out data blocks that do not contain keyword K, minimizing disk I/O.

S603, executing the search task to obtain a search result of each search subtask.

In some embodiments, each subtask returns only the records that meet the search condition in the physical column for which it is responsible. For example, the subtask for the first physical column (template hash identifier) returns all record IDs whose hash value matches the target hash value, together with those hash values; the subtask for the second physical column returns the array content containing the target variable and the associated record identifiers; and the subtask for the third physical column returns the original text segments containing the keyword and their record locations. These results contain only matching data for the current column and no information from other columns, but each is accompanied by a uniform record unique identifier.

Illustratively, the generated subtasks are dispatched concurrently to the underlying storage engine for execution. After the result sets are received, the record unique identifiers (Record IDs) from all successfully returned subtasks are combined and deduplicated.
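A minimal sketch of the combine-and-deduplicate step follows. It assumes, purely for illustration, that each subtask returns an iterable of Record IDs and that a subtask which did not return successfully is represented as `None`; the function name is hypothetical.

```python
def merge_record_ids(subtask_results):
    """Union the Record IDs of all subtasks that returned successfully."""
    merged = set()
    for result in subtask_results:
        if result is None:  # assumption: a failed subtask is reported as None
            continue
        merged.update(result)
    # Sorting is only for deterministic output; the text requires no order.
    return sorted(merged)
```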

S604, reconstructing the search results of the plurality of parallel search subtasks to obtain reconstructed log data.

In some embodiments, reconstructing the results of the plurality of parallel search subtasks means associating and integrating the partial results scattered across different physical columns by means of the record unique identifier, and restoring them into complete log data. First, based on the common record identifier in each subtask result (such as a template unique identifier), the matching information of the same log recorded in different columns (such as the template hash of the first column, the dynamic array of the second column, and the original log of the third column) is aligned and associated. The fields are then reorganized according to a predetermined log data structure (e.g., the mapping between the original log and the structured fields), and the necessary context information (e.g., timestamp, data partition) is supplemented. At the same time, unmatched or invalid records in each column are filtered out. The result is a complete, structured log data set that the user can directly understand and use.

Illustratively, for each Record ID aggregated from the subtask results, the persisted components it requires (a template, a variable array, or original log text) are fetched from the storage engine on demand. A log reconstruction rule is then applied: if the record is associated with a template, the final log message is generated dynamically in memory by programmatically filling the variable array into the template placeholders; if the record is of the unmatched type, the original log text is used directly. The reconstructed log message is returned to the user as the final result.
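The reconstruction rule can be illustrated with a short Python sketch. The placeholder token `<*>`, the `UNMATCHED` marker, and the function signature are assumptions made for the example; the source does not specify a placeholder syntax.

```python
UNMATCHED = 0        # hypothetical marker for records that matched no template
PLACEHOLDER = "<*>"  # hypothetical placeholder token inside template text


def reconstruct(template_id, variables, raw_text, template_dict):
    """Rebuild one log message from its stored components."""
    if template_id == UNMATCHED:
        # Unmatched records keep their original text, so return it directly.
        return raw_text
    parts = template_dict[template_id].split(PLACEHOLDER)
    if len(parts) - 1 != len(variables):
        raise ValueError("variable count does not match placeholder count")
    # Interleave static template parts with the ordered variable values.
    pieces = [parts[0]]
    for value, tail in zip(variables, parts[1:]):
        pieces.append(value)
        pieces.append(tail)
    return "".join(pieces)
```

Filling by position rather than by repeated string replacement keeps the result correct even when a variable value itself happens to contain the placeholder token.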

Referring to fig. 7, a schematic diagram of a log storage and retrieval process according to an embodiment of the present application is shown. After the initially sampled log data is input into method 1, the adaptive incremental learning method for log templates, a batch of log templates is learned. The sampled log data may be, for example, log files of service A, service B, and service C. Log data from the service system is fed into a message queue, from which method 2, the real-time log structuring method based on multi-pattern matching, obtains the log data in real time. Method 2 also loads the log templates learned by method 1 and matches the obtained logs against them. Method 3, the columnar storage model that fuses common and differential parts, stores the logs differentially according to the matching result of method 2. After storage is complete, method 3 checks the database and, according to a preset template update rule, inputs the log data that matched no template into method 1. Method 1 then dynamically adapts and incrementally updates the log templates and pushes the updated templates to method 2 for template matching. The end user inputs a query instruction into method 4, the multi-path accelerated retrieval method based on query rewriting, and method 4 provides a transparent, high-performance query experience over the logs stored by method 3.

It can be seen that the foregoing description of the solution provided by the embodiments of the present application has been presented mainly from a method perspective. To achieve the above-mentioned functions, embodiments of the present application provide corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

According to the above method examples, the embodiments of the present application may divide the log storage device into functional modules. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or as software functional modules. The division of modules in the embodiments of the present application is schematic and is merely a division by logical function; other divisions may be used in practice.

In some embodiments, the present application further provides a log storage device. The log storage device may include one or more functional modules for implementing the log storage method of the above method embodiment.

For example, fig. 9 is a schematic diagram of a log storage device 90 according to an embodiment of the present application. As shown in fig. 9, the log storage device includes an acquisition module 91 and a storage module 92.

The acquisition module 91 is configured to acquire original log data. The storage module 92 is configured to determine whether a target log template matching the original log data exists in a log template library, where the original log data includes a static part and a dynamic part, and the target log template is identical to the static part of the original log data. The storage module 92 is further configured to convert the original log data into first structured data based on the target log template when a target log template matching the original log data exists in the log template library. The storage module 92 is further configured to store the first structured data in a columnar storage system.
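As a hedged illustration of the matching and conversion performed by the storage module, the sketch below compiles each template into an anchored regular expression and extracts the dynamic part as an ordered array. The `<*>` placeholder, the truncated SHA-1 used as the template hash identifier, and all function names are assumptions for the example, not details given in the source.

```python
import hashlib
import re

PLACEHOLDER = "<*>"  # hypothetical placeholder token


def template_hash(template_text):
    """Stable template identifier; a truncated SHA-1 is an illustrative choice."""
    return hashlib.sha1(template_text.encode("utf-8")).hexdigest()[:16]


def compile_template(template_text):
    """Build an anchored regex with one capture group per placeholder."""
    static_parts = [re.escape(part) for part in template_text.split(PLACEHOLDER)]
    return re.compile("^" + "(.+?)".join(static_parts) + "$")


def to_structured(raw_log, templates):
    """Return the first structured data for raw_log, or None if no template matches."""
    for text in templates:
        match = compile_template(text).match(raw_log)
        if match:
            return {
                "template_hash": template_hash(text),
                "template_text": text,
                "dynamic_array": list(match.groups()),
            }
    return None
```

A production matcher would precompile the patterns and index them rather than scanning the library linearly; the loop here only keeps the example short.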

In another possible implementation, to store the first structured data in the columnar storage system, the storage module 92 is specifically configured to: write the template hash identifier of the first structured data into a first preset physical column of the columnar storage system; submit the template text of the first structured data to a dictionary coding processing unit of the columnar storage system; and write the dynamic array of the first structured data into a second preset physical column of the columnar storage system, where the second preset physical column supports storing array-type data. The dictionary coding processing unit is configured to write the dictionary identifier corresponding to the template text into the physical row if the template text already exists in the current data partition, and, if the template text does not exist in the current data partition, to add the template text to the dictionary corresponding to the current physical partition and then write the dictionary identifier corresponding to the template text into the physical row.
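The behavior of the dictionary coding processing unit can be sketched as a per-partition dictionary that hands out small integer identifiers. The class and method names are hypothetical, and real columnar engines implement this internally; the sketch only mirrors the two branches described above.

```python
class PartitionDictionary:
    """Hypothetical per-partition dictionary for template text."""

    def __init__(self):
        self._ids = {}    # template text -> dictionary identifier
        self._texts = []  # dictionary identifier -> template text

    def encode(self, template_text):
        """Return the identifier to write into the row, adding the text if new."""
        dict_id = self._ids.get(template_text)
        if dict_id is None:
            # Text not yet present in this partition: append it first.
            dict_id = len(self._texts)
            self._ids[template_text] = dict_id
            self._texts.append(template_text)
        return dict_id

    def decode(self, dict_id):
        """Look up the template text for a stored identifier."""
        return self._texts[dict_id]
```

Because each distinct template text is stored once per partition and rows carry only the identifier, repeated templates cost a few bytes per row instead of the full text.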

In another possible implementation, the storage module 92 is specifically configured to convert the original log data into the second structured data in the case that the target log template matching the original log data does not exist in the log template library, and store the second structured data in the columnar storage system.

In another possible implementation, to store the second structured data in the columnar storage system, the storage module 92 is specifically configured to: write the template hash identifier of the second structured data into the first preset physical column of the columnar storage system, where the first preset physical column is used to store the unique identifier of the template; and write the original log of the second structured data into a third preset physical column of the columnar storage system, where the third preset physical column supports storing unstructured text data.
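The differential write paths for matched and unmatched logs can be pictured with three parallel Python lists standing in for the three preset physical columns. This is a simplified sketch: the `UNMATCHED_ID` preset value, the class name, and the choice to leave the third column empty for matched records are assumptions made for illustration.

```python
UNMATCHED_ID = "UNMATCHED"  # hypothetical preset value for the first column


class ThreeColumnStore:
    """Hypothetical in-memory stand-in for the three preset physical columns."""

    def __init__(self):
        self.template_ids = []    # first column: template hash identifier
        self.dynamic_arrays = []  # second column: ordered variable array
        self.raw_texts = []       # third column: unstructured original text

    def append_matched(self, template_id, dynamic_array):
        """Write first structured data; returns the new record position."""
        self.template_ids.append(template_id)
        self.dynamic_arrays.append(list(dynamic_array))
        self.raw_texts.append(None)  # assumption: no raw text for matched logs
        return len(self.template_ids) - 1

    def append_unmatched(self, raw_log):
        """Write second structured data: preset identifier plus original text."""
        self.template_ids.append(UNMATCHED_ID)
        self.dynamic_arrays.append(None)
        self.raw_texts.append(raw_log)
        return len(self.template_ids) - 1
```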

In another possible implementation, the log templates in the log template library are generated as follows. The storage module 92 is specifically configured to: obtain an original log set from a historical time period; extract a structured feature vector for each log in the original log set; cluster the structured feature vectors of the logs in the original log set using a preset clustering algorithm to obtain a plurality of clusters; and, for each cluster in the plurality of clusters, identify the static part of the logs in the cluster as a template and mark the dynamic part as a variable placeholder.
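The template generation steps can be illustrated with a deliberately simple stand-in. The source does not name the feature vector or the clustering algorithm, so this sketch swaps in a trivial feature (token count plus first token) and groups logs by exact feature equality instead of running a real clustering algorithm; token positions that agree across a cluster become the static part, and the rest become `<*>` placeholders.

```python
from collections import defaultdict

PLACEHOLDER = "<*>"  # hypothetical placeholder token


def feature(log):
    """Toy structural feature: (token count, first token)."""
    tokens = log.split()
    return (len(tokens), tokens[0] if tokens else "")


def learn_templates(logs):
    """Group logs by feature, then derive one template per group."""
    clusters = defaultdict(list)
    for log in logs:
        clusters[feature(log)].append(log.split())
    templates = []
    for members in clusters.values():
        # A position is static if every log in the cluster has the same token.
        tokens = [column[0] if len(set(column)) == 1 else PLACEHOLDER
                  for column in zip(*members)]
        templates.append(" ".join(tokens))
    return templates
```

A cluster containing a single log yields a template with no placeholders, which is one reason practical template learners require a minimum cluster size before emitting a template.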

In another possible implementation, the storage module 92 is specifically configured to: if, within a preset period of time, the proportion of original log data that matches no log template is greater than or equal to a preset threshold, generate a new log template based on the original log data that matched no log template, and store the new log template in the log template library.
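One way to picture the update trigger is a sliding-window monitor that tracks whether each incoming log matched a template. The class name, the window representation, and the numeric timestamps are hypothetical; the source only specifies a preset time period and a preset threshold on the unmatched proportion.

```python
from collections import deque


class UnmatchedMonitor:
    """Hypothetical monitor for the share of unmatched logs in a time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # (timestamp, matched) in arrival order

    def record(self, timestamp, matched):
        """Register one log and whether it matched a template."""
        self._events.append((timestamp, matched))

    def should_update(self, now):
        """True if the unmatched proportion in the window reaches the threshold."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()  # drop events that fell out of the window
        if not self._events:
            return False
        unmatched = sum(1 for _, matched in self._events if not matched)
        return unmatched / len(self._events) >= self.threshold
```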

In some embodiments, the application further provides a log retrieval device. The log retrieval device may include one or more functional modules for implementing the log retrieval method of the above method embodiment.

For example, fig. 10 is a schematic diagram of a log search device 100 according to an embodiment of the present application. As shown in fig. 10, the log search device 100 includes a receiving module 101 and a search module 102.

The receiving module 101 is configured to receive a log search request of a user. The search module 102 is configured to generate a search task based on the log search request and the storage structure of the columnar storage system, where the search task includes a plurality of parallel search subtasks, and each search subtask is used to search a preset physical column in the columnar storage system. The search module 102 is further configured to execute the search task to obtain a search result of each search subtask. The search module 102 is further configured to reconstruct the search results of the plurality of parallel search subtasks to obtain reconstructed log data.

In the case of implementing the functions of the integrated modules described above in the form of hardware, an embodiment of the present invention provides a possible schematic composition of the electronic device referred to in the above embodiment. As shown in fig. 8, the electronic device 900 includes a processor 902, a communication interface 903, and a bus 904. Optionally, the electronic device 900 may also include memory 901.

The processor 902 may be any means for implementing or executing the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 902 may be a central processing unit, a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor 902 may also be a combination that performs computing functions, e.g., a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.

The communication interface 903 is used to connect to other devices via a communication network. The communication network may be Ethernet, a radio access network, a wireless local area network (WLAN), or the like.

The memory 901 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

As a possible implementation, the memory 901 may exist separately from the processor 902, and the memory 901 may be connected to the processor 902 through the bus 904 for storing instructions or program code. When the processor 902 calls and executes the instructions or the program codes stored in the memory 901, the log storage method and the retrieval method provided by the embodiment of the invention can be realized.

In another possible implementation, the memory 901 may also be integrated with the processor 902.

The bus 904 may be an extended industry standard architecture (EISA) bus or the like. The bus 904 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.

From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional modules is illustrated. In practical applications, the above functions may be allocated to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above.

The embodiment of the application also provides a computer readable storage medium. All or part of the flow in the above method embodiments may be implemented by computer instructions that instruct related hardware. The program may be stored in the computer readable storage medium and, when executed, may include the flow of the above method embodiments. The computer readable storage medium may be an internal storage unit, such as a memory, of the device of any of the foregoing embodiments. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the device. Further, the computer readable storage medium may include both the internal storage unit and the external storage device of the device. The computer readable storage medium is used for storing the computer program and other programs and data required by the device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.

The embodiment of the application also provides a computer program product, which contains a computer program, and when the computer program product runs on a computer, the computer is caused to execute any one of the log storage method and the search method provided in the embodiment.

The present application is not limited to the above embodiments, and any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

1. A log storage method, characterized in that the method comprises:

Obtaining original log data;

Determining whether a target log template matching the original log data exists in a log template library, wherein the original log data includes a static part and a dynamic part, and the target log template is the same as the static part of the original log data;

If a target log template matching the original log data exists in the log template library, converting the original log data into first structured data based on the target log template; and

Storing the first structured data in a columnar storage system.

2. The method according to claim 1, wherein the first structured data comprises at least one of the following:

A template hash identifier, a dynamic array, and template text; wherein the template hash identifier is a unique identifier of the target log template that matches the original log data; the dynamic array is a collection of dynamic variable data extracted from the original log data based on the target log template; and the template text is the text form of the target log template.

3. The method according to claim 2, wherein storing the first structured data in a columnar storage system comprises:

Writing the template hash identifier of the first structured data into a first preset physical column of the columnar storage system;

Submitting the template text of the first structured data to a dictionary encoding processing unit of the columnar storage system, wherein the dictionary encoding processing unit is used to write the identifier of the dictionary corresponding to the template text into the physical row when the template text exists in the current data partition, and to add the template text to the dictionary corresponding to the current physical partition and write the identifier of the dictionary corresponding to the template text into the physical row when the template text does not exist in the current data partition; and

Writing the dynamic array of the first structured data into a second preset physical column of the columnar storage system, wherein the second preset physical column supports storing array-type data.

4. The method according to claim 1, characterized in that the method further comprises:

If no target log template matching the original log data exists in the log template library, converting the original log data into second structured data; and

Storing the second structured data in the columnar storage system.

5. The method according to claim 4, wherein the second structured data comprises: a template hash identifier, a dynamic array, template text, and original log data; the template hash identifier of the second structured data is configured to a preset value to indicate that the original log data does not match a log template; the dynamic array and the template text in the second structured data are null values.

6. The method according to claim 5, wherein storing the second structured data in a columnar storage system comprises:

The template hash identifier of the second structured data is written into the first preset physical column of the columnar storage system; the first preset physical column is used to store the unique identifier of the template;

The original log of the second structured data is written to the third preset physical column of the columnar storage system; the third preset physical column supports the storage of unstructured text data.

7. The method according to claim 1, wherein the log templates in the log template library are generated based on the following steps:

Obtaining an original log set within a historical time period;

Extracting a structured feature vector for each log in the original log set;

Clustering the structured feature vectors of the logs in the original log set using a preset clustering algorithm to obtain a plurality of clusters; and

For each of the plurality of clusters, identifying the static part of the logs in the cluster as a template and marking the dynamic part as a variable placeholder.

8. The method according to claim 1, characterized in that the method further comprises:

If, within a preset time period, the proportion of original log data that matches no log template is greater than or equal to a preset threshold, generating a new log template based on the original log data that matched no log template, and storing the new log template in the log template library.

9. A log retrieval method based on the log storage method according to any one of claims 1 to 8, characterized in that the method comprises:

Receiving a log retrieval request of a user;

Generating a retrieval task based on the log retrieval request and the storage structure of the columnar storage system, wherein the retrieval task includes multiple parallel retrieval subtasks, each of which is used to retrieve a preset physical column in the columnar storage system;

Executing the retrieval task to obtain the retrieval result of each retrieval subtask; and

Reconstructing the retrieval results of the multiple parallel retrieval subtasks to obtain reconstructed log data.

10. The log retrieval method according to claim 9, wherein the columnar storage system comprises a first preset physical column, a second preset physical column, and a third preset physical column;

The first preset physical column is used to store the template hash identifier of the structured data corresponding to the original log data;

The second preset physical column is used to store a dynamic array of structured data corresponding to the original log data; the third preset physical column is used to store the original log data.

The multiple parallel retrieval subtasks include a first retrieval subtask, a second retrieval subtask, and a third retrieval subtask;

The first retrieval subtask is used to perform a retrieval in the first preset physical column;

The second retrieval subtask is used to perform a retrieval in the second preset physical column;

The third retrieval subtask is used to perform a retrieval in the first preset physical column and the third preset physical column.

11. An electronic device, characterized in that it comprises: a processor and a memory;

The memory stores instructions that the processor can execute;

The processor is configured to execute the instructions, so that the electronic device implements the log storage method as described in any one of claims 1 to 8 and the log retrieval method as described in any one of claims 9 to 10.

12. A readable storage medium, characterized in that the readable storage medium comprises: computer-executable instructions; when the computer-executable instructions are executed on a computer, the computer performs the log storage method according to any one of claims 1 to 8 and the log retrieval method according to any one of claims 9 to 10.

13. A computer program product, characterized in that the computer program product includes a computer program that, when the computer program is run on an electronic device, causes the electronic device to perform the log storage method as described in any one of claims 1 to 8 and the log retrieval method as described in any one of claims 9 to 10.