
CN116701456A - A data analysis method and related equipment - Google Patents

A data analysis method and related equipment

Info

Publication number
CN116701456A
Authority
CN
China
Prior art keywords
data
field
sample data
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310635626.9A
Other languages
Chinese (zh)
Inventor
张良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Cloud Computing Ltd
Priority to CN202310635626.9A
Publication of CN116701456A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This specification provides a data parsing method and related equipment, applied to a cloud service platform that is connected to multiple data sources. The method includes: obtaining a sample data set corresponding to any target data source among the multiple data sources; analyzing the data structure of the sample data in the sample data set and, based on the analysis result, generating a data parsing rule corresponding to the target data source, where the data parsing rule indicates the data structure of each piece of data in the target data source; and parsing, based on the data parsing rule, the data structure of the target data to be parsed in the target data source.

Description

A data parsing method and related equipment

Technical field

One or more embodiments of this specification relate to the technical field of data processing, and in particular to a data parsing method and related equipment.

Background

A cloud service platform can connect to multiple data sources outside the cloud and manage their data in a unified way. However, the data formats of these data sources often differ and follow no unified standard, so the cloud service platform cannot manage the data accurately and effectively when ingesting it.

Therefore, when ingesting data from each data source, the cloud service platform often needs data parsing rules to be configured manually for that source, so that the platform can parse the source's data according to those rules and obtain structured, standardized data for subsequent management. However, manually configuring a large number of data parsing rules is time-consuming, labor-intensive, and error-prone, which greatly reduces the platform's data access efficiency and data management efficiency.

Summary of the invention

In view of this, one or more embodiments of this specification provide a data parsing method and related equipment.

In a first aspect, this specification provides a data parsing method applied to a cloud service platform, where the cloud service platform is connected to each of multiple data sources. The method includes:

obtaining a sample data set corresponding to any target data source among the multiple data sources;

analyzing the data structure of the sample data in the sample data set and, based on the analysis result, generating a data parsing rule corresponding to the target data source, where the data parsing rule indicates the data structure of each piece of data in the target data source; and

parsing, based on the data parsing rule, the data structure of the target data to be parsed in the target data source.

In a second aspect, this specification provides a data parsing device applied to a cloud service platform, where the cloud service platform is connected to each of multiple data sources. The device includes:

an acquisition unit, configured to obtain a sample data set corresponding to any target data source among the multiple data sources;

a first parsing rule generation unit, configured to analyze the data structure of the sample data in the sample data set and, based on the analysis result, generate a data parsing rule corresponding to the target data source, where the data parsing rule indicates the data structure of each piece of data in the target data source; and

a parsing unit, configured to parse, based on the data parsing rule, the data structure of the target data to be parsed in the target data source.

Correspondingly, this specification also provides a server applied to a cloud service platform. The server includes a memory and a processor; the memory stores a computer program executable by the processor; and when the processor runs the computer program, it performs the data parsing method described in the above embodiments.

Correspondingly, this specification also provides a computer-readable storage medium storing a computer program that, when run by a processor, performs the data parsing method described in the above embodiments.

In summary, the cloud service platform in this application can first obtain sample data from each data source, analyze the data structure of that sample data, and automatically generate, based on the analysis result, the data parsing rules corresponding to the data in each data source. The platform can then use the automatically generated rules to parse the data structure of the data to be parsed in each data source quickly and accurately, so that it can subsequently manage that data accurately and reliably based on the parsing results. In this way, by analyzing sample data from a data source, this application automatically generates the corresponding data parsing rules, which greatly improves the data parsing efficiency and data management efficiency of the cloud service platform, further improves its service quality, and meets customers' actual needs.

Brief description of the drawings

FIG. 1 is a schematic diagram of a system architecture provided by an exemplary embodiment;

FIG. 2 is a schematic flowchart of a data parsing method provided by an exemplary embodiment;

FIG. 3 is a schematic structural diagram of a data parsing device provided by an exemplary embodiment;

FIG. 4 is a schematic structural diagram of a server provided by an exemplary embodiment.

Detailed description

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of this specification. Rather, they are merely examples of devices and methods consistent with some aspects of one or more embodiments of this specification, as detailed in the appended claims.

It should be noted that, in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described here. In addition, a single step described in this specification may be split into multiple steps in other embodiments, and multiple steps described in this specification may be merged into a single step in other embodiments.

The user information (including but not limited to user equipment information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in this application are all information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and a corresponding operation entry is provided for users to grant or refuse authorization.

First, some terms used in this specification are explained to aid understanding by those skilled in the art.

(1) Structured data, also called quantitative data, is information that can be represented by data or by a uniform structure. If the information columns and information values contained in a piece of data can be clearly identified, that piece of data is structured data; for example, data in XML or JSON format is structured data. Structured data is generally stored and managed in relational databases and can be queried with Structured Query Language (SQL).

Correspondingly, unstructured data is essentially all data other than structured data, and it does not conform to any predefined model. Unstructured data may be textual or non-textual. For example, an ordinary text string that holds several pieces of information, whose meaning the system cannot interpret automatically, is unstructured data. Unstructured data is generally stored and managed in non-relational databases and can be queried through the query interfaces of NoSQL databases.

(2) Standardized data is data in which every piece of information has a unified, normalized description. If multiple different pieces of data use the same field name and field type for the same information (for example, source IP or destination IP), the data is standardized; otherwise, it is non-standardized.
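The definition of standardized data can be checked mechanically. The following is a minimal sketch under a simplifying assumption: records are already parsed into dictionaries, and a set of records counts as standardized only when all of them share exactly the same field names and value types. The helper names `field_schema` and `is_standardized` are hypothetical and do not come from the specification.

```python
from typing import Any


def field_schema(record: dict[str, Any]) -> dict[str, type]:
    """Map each field name in a record to the Python type of its value."""
    return {name: type(value) for name, value in record.items()}


def is_standardized(records: list[dict[str, Any]]) -> bool:
    """Return True when every record uses the same field names and types."""
    if not records:
        return True
    first = field_schema(records[0])
    return all(field_schema(record) == first for record in records[1:])


# Two records that describe the same information with the same field names.
same = [
    {"src_ip": "10.20.3.4", "dst_ip": "10.20.3.5"},
    {"src_ip": "10.20.3.6", "dst_ip": "10.20.3.7"},
]
# Two records that describe the same information with different field names.
mixed = [
    {"src_ip": "10.20.3.4", "dst_ip": "10.20.3.5"},
    {"srcIp": "10.20.3.4", "desIp": "10.20.3.5"},
]

print(is_standardized(same))
print(is_standardized(mixed))
```

The check is stricter than the definition requires, because it compares whole records rather than individual pieces of information; it is meant only to make the distinction concrete.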

In a hybrid cloud scenario, a customer purchases and uses products from multiple vendors and wants a unified platform (for example, a cloud service platform) to integrate the data from those vendors for unified and secure data management. For example, these products may be network security products from various vendors, such as firewall products; this specification does not specifically limit this.

In one illustrated embodiment, the cloud service platform can connect to the data sources of multiple vendors and ingest data from each of them for unified data management. However, the data formats of different vendors' data sources often differ; within one vendor, data related to different products can have different formats; different versions of the same product can use different formats; and different kinds of data related to the same product (for example, event logs, traffic logs, and domain name logs) can also be formatted differently. As a result, when the cloud service platform ingests data from these data sources, it often does not know how to parse the data and cannot accurately obtain the actual information in it, which makes effective and reliable data management difficult.

For this reason, in some possible implementations, a large number of data parsing rules must be configured manually for the various types of data related to different products in each data source, so that the cloud service platform can parse the data of each data source according to those rules and obtain structured, standardized data for subsequent management. However, manually configuring a large number of data parsing rules is time-consuming, labor-intensive, and error-prone. The cloud service platform then cannot ingest data from multiple data sources quickly and accurately and is more likely to lose data or produce malformed data, which greatly reduces its data management efficiency.

In other possible implementations, the cloud service platform does not perform structural parsing when ingesting data from a data source. Instead, it first ingests and stores the data, and parsing rules are configured manually later, at query time. Although this improves data access efficiency, it degrades query performance and places demanding special requirements on the data storage and the search engine, so it cannot meet practical application needs.

For this reason, this specification provides a technical solution that automatically generates data parsing rules from sample data in a data source and then uses those rules to parse the data structure of the data to be parsed in that data source efficiently and accurately.

In implementation, the cloud service platform can obtain a sample data set corresponding to any target data source among the multiple data sources connected to it. It then analyzes the data structure of the sample data in the sample data set and, based on the analysis result, generates a data parsing rule corresponding to the target data source. The data parsing rule can indicate the data structure of each piece of data in the target data source. The cloud service platform can then parse, based on the data parsing rule, the data structure of the target data to be parsed in the target data source, and manage the target data based on the parsing result.

In the above technical solution, the cloud service platform in this application can first obtain sample data from each data source, analyze the data structure of that sample data, and automatically generate, based on the analysis result, the data parsing rules corresponding to the data in each data source. The platform can then use the automatically generated rules to parse the data structure of the data to be parsed in each data source quickly and accurately, so that it can subsequently manage that data accurately and reliably based on the parsing results. In this way, by analyzing sample data from a data source, this application automatically generates the corresponding data parsing rules, which greatly improves the data parsing efficiency and data management efficiency of the cloud service platform, further improves its service quality, and meets customers' actual needs.

Refer to FIG. 1, which is a schematic diagram of a system architecture provided by an exemplary embodiment. One or more embodiments provided in this specification may be implemented in the system architecture shown in FIG. 1 or in a similar architecture. As shown in FIG. 1, the system may include a cloud service platform and multiple data sources outside the cloud, for example data source 100a, data source 100b, and data source 100c. In one illustrated embodiment, the cloud service platform can connect to data source 100a, data source 100b, and data source 100c, and it may establish communication connections with them through a wireless network or by other means.

In one illustrated embodiment, data source 100a, data source 100b, and data source 100c may be data sources from different vendors. For example, data source 100a may be the data source of vendor A and store data (such as log data) of multiple products owned by vendor A; data source 100b may be the data source of vendor B and store data of multiple products owned by vendor B; and data source 100c may be the data source of vendor C and store data of multiple products owned by vendor C.

In one illustrated embodiment, the products owned by these vendors may be software products, for example firewalls and other products used to protect network security; this specification does not specifically limit this. In some possible implementations, the products may also be audio-visual entertainment software products; this specification does not specifically limit this either.

In one illustrated embodiment, data source 100a, data source 100b, and data source 100c can send their data to the cloud service platform for unified data management. For example, the data sources may send data via syslog (system log), in which case the data sent to the cloud service platform is log data. The log data may include various types of logs, such as event logs, traffic logs, and domain name logs.

As described above, the data formats of different data sources often differ; within the same data source, data related to different products can have different formats; and even different types of data related to the same product can be formatted differently.

Taking data source 100a as an example, data source 100a can send log data to the cloud service platform, where the formats of the event log, traffic log, and domain name log may be as follows.

Event log: src_ip='10.20.3.4' dst_ip='10.20.3.5' module='syslog' severity='debug' time='1495093983'

Traffic log: srcIp:10.20.3.4 desIp:10.20.3.5 severity:info time:1495093983

Domain name log: src:'10.20.3.4',des:'10.20.3.5',severity:'info',timestamp:'1495093983'

As shown above, each type of log data can include multiple fields, and each field can contain a field name and a field value. Accordingly, each type of log data includes a field separator and a key-value (KV) separator. The field separator separates two adjacent fields in the data, and the KV separator separates the field name from the field value within each field. The field name serves as the key of the field, and the field value serves as the value corresponding to that key.

As shown above, the combination of field separator and KV separator differs among the event log, traffic log, and domain name log. The event log uses a space as the field separator and an equals sign as the KV separator. The traffic log uses a space as the field separator and a colon as the KV separator. The domain name log uses a comma as the field separator and a colon as the KV separator.
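Once the two separators are known, splitting a log line into fields is straightforward. The following is a minimal sketch, not the implementation described in the specification; the function name `parse_kv` and its parameters are hypothetical, and the three sample lines are the ones listed above with their field separators in place.

```python
def parse_kv(line: str, field_sep: str, kv_sep: str) -> dict[str, str]:
    """Split a log line into fields, then split each field into name and value."""
    fields: dict[str, str] = {}
    for chunk in line.split(field_sep):
        chunk = chunk.strip()
        if not chunk:
            continue
        # partition() splits on the first KV separator only.
        name, _, value = chunk.partition(kv_sep)
        # Drop the single quotes that some formats put around values.
        fields[name.strip()] = value.strip().strip("'")
    return fields


event_log = "src_ip='10.20.3.4' dst_ip='10.20.3.5' module='syslog' severity='debug' time='1495093983'"
traffic_log = "srcIp:10.20.3.4 desIp:10.20.3.5 severity:info time:1495093983"
domain_log = "src:'10.20.3.4',des:'10.20.3.5',severity:'info',timestamp:'1495093983'"

print(parse_kv(event_log, " ", "="))
print(parse_kv(traffic_log, " ", ":"))
print(parse_kv(domain_log, ",", ":"))
```

This sketch assumes that field values never contain the field separator; a value with an embedded space or comma would need quoting-aware splitting.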

Furthermore, as shown above, the field names that correspond to the same information also differ among the event log, traffic log, and domain name log. Taking the source IP as an example, the field describing it is named src_ip in the event log, srcIp in the traffic log, and src in the domain name log. Taking the destination IP as an example, the field describing it is named dst_ip in the event log, desIp in the traffic log, and des in the domain name log. The three differ from one another and follow no unified standard.
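To make the naming mismatch concrete, the sketch below renames the field names that appear in the three sample logs to a single canonical name. The alias table, the canonical names `source_ip` and `destination_ip`, and the function `normalize` are all hypothetical; this passage of the specification does not describe how field names are reconciled.

```python
# Hypothetical alias table: canonical name -> names seen in the sample logs.
ALIASES = {
    "source_ip": {"src_ip", "srcIp", "src"},
    "destination_ip": {"dst_ip", "desIp", "des"},
}

# Invert the table once so each alias can be looked up directly.
LOOKUP = {alias: canonical for canonical, names in ALIASES.items() for alias in names}


def normalize(record: dict[str, str]) -> dict[str, str]:
    """Rename known aliases to canonical names; keep unknown fields unchanged."""
    return {LOOKUP.get(name, name): value for name, value in record.items()}


event_fields = {"src_ip": "10.20.3.4", "dst_ip": "10.20.3.5"}
traffic_fields = {"srcIp": "10.20.3.4", "desIp": "10.20.3.5"}
domain_fields = {"src": "10.20.3.4", "des": "10.20.3.5"}

for fields in (event_fields, traffic_fields, domain_fields):
    print(normalize(fields))
```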

In one illustrated embodiment, the same type of data sent by different data sources can also be formatted differently. For example, the event logs in data source 100a and the event logs in data source 100b can have different formats. This is not elaborated further here.

Consequently, after receiving the data sent by each data source, the cloud service platform cannot parse the data according to a unified standard and therefore cannot accurately determine the data structure of each piece of data. For example, it cannot accurately split the fields in each piece of data or accurately determine the field name and field value contained in each field. In short, the data sent by each data source is effectively unstructured and non-standardized from the cloud service platform's point of view, which prevents the platform from managing the data accurately and effectively.

For this reason, in one illustrated embodiment, the cloud service platform can first obtain sample data sets from data source 100a, data source 100b, and data source 100c, analyze the data structure of the sample data in each sample data set, and then generate, based on the analysis results, the data parsing rules corresponding to the data in data source 100a, data source 100b, and data source 100c. The cloud service platform can subsequently use the generated rules to parse the data structure of the corresponding data in each data source quickly and accurately, so that it can manage the data effectively based on the parsing results.

Taking data source 100a as an example, the cloud service platform can obtain a sample data set from data source 100a and then analyze the data structure of the sample data in it, for example by identifying the field separator and the KV separator in the sample data. The cloud service platform can then generate the corresponding data parsing rule based on the analysis result. The data parsing rule can indicate the data structure of the corresponding data in data source 100a, for example the field separator and the KV separator used in the data. Later, when data source 100a sends data to the cloud service platform for data management, the platform can parse the data structure of that data based on the previously generated data parsing rule and manage the data based on the parsing result.
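This passage does not say how the separators are identified. One possible heuristic, shown purely as an assumption, is to try each candidate pair of separators and keep the pair under which the most sample lines split cleanly into name/value fields. The candidate lists, the `ParsingRule` class, and the functions `fits` and `infer_rule` are hypothetical names introduced for this sketch.

```python
import re
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Assumed candidate separators; a real system might consider more.
FIELD_SEPS = [" ", ",", ";", "|", "\t"]
KV_SEPS = ["=", ":"]
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


@dataclass(frozen=True)
class ParsingRule:
    field_sep: str
    kv_sep: str


def fits(line: str, field_sep: str, kv_sep: str) -> bool:
    """True when the line splits into two or more well-formed name/value fields."""
    chunks = [c.strip() for c in line.strip().split(field_sep) if c.strip()]
    if len(chunks) < 2:
        return False
    for chunk in chunks:
        name, sep, value = chunk.partition(kv_sep)
        if not sep or not value or not NAME.fullmatch(name.strip()):
            return False
    return True


def infer_rule(samples: list[str]) -> Optional[ParsingRule]:
    """Return the separator pair that fits the most samples, or None if none fits."""
    best, best_hits = None, 0
    for field_sep, kv_sep in product(FIELD_SEPS, KV_SEPS):
        hits = sum(fits(line, field_sep, kv_sep) for line in samples)
        if hits > best_hits:
            best, best_hits = ParsingRule(field_sep, kv_sep), hits
    return best


event_samples = ["src_ip='10.20.3.4' dst_ip='10.20.3.5' module='syslog' severity='debug' time='1495093983'"]
traffic_samples = ["srcIp:10.20.3.4 desIp:10.20.3.5 severity:info time:1495093983"]
domain_samples = ["src:'10.20.3.4',des:'10.20.3.5',severity:'info',timestamp:'1495093983'"]

for samples in (event_samples, traffic_samples, domain_samples):
    print(infer_rule(samples))
```

The resulting rule records only the two separators, which matches the example given in the text; a fuller rule could also record field names and value types.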

Because data in the same data source can also come in different formats, a data parsing rule generated from sample data generally indicates the data structure of data that has the same format as that sample data.

For example, the sample data set from data source 100a may include data related to a target product owned by vendor A; accordingly, the data parsing rule generated from that sample data set can indicate the data structure of each piece of data related to the target product.

For example, the sample data set from data source 100a may include event logs related to the target product; accordingly, the data parsing rule generated from that sample data set can indicate the data structure of each event log related to the target product.

For example, the sample data set from data source 100a may include traffic logs related to the target product; accordingly, the data parsing rule generated from that sample data set can indicate the data structure of each traffic log related to the target product.

For example, the sample data set from data source 100a may include domain name logs related to the target product; accordingly, the data parsing rule generated from that sample data set can indicate the data structure of each domain name log related to the target product. Further examples are not listed here.

In one illustrated embodiment, the sample data set may be a portion of the data that data source 100a sends to the cloud service platform in real time. That is, after receiving data from data source 100a, the cloud service platform may select part of that data as sample data, generate the corresponding data parsing rule from the sample data on the fly, and use the rule to parse the structure of all the data received in that batch. This specification does not specifically limit this.

For example, data source 100a sends 1000 event logs related to the target product to the cloud service platform for data management. After receiving the 1000 event logs, the cloud service platform may select 50 of them as sample data and generate the corresponding data parsing rule from those 50 event logs on the fly. The platform may then parse the structure of all 1000 received event logs using that rule. This specification does not specifically limit this.
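The sample-then-parse flow described above could be sketched as follows. This is a minimal illustration, not the method as claimed: the function name `parse_batch`, the random sampling strategy, and the two callbacks `build_rule` and `apply_rule` are assumptions introduced here, since the text does not say how the 50 samples are chosen.

```python
import random


def parse_batch(records, build_rule, apply_rule, sample_size=50, seed=0):
    """Derive a parsing rule from a sample of a batch, then parse the whole batch.

    build_rule(sample) -> rule and apply_rule(rule, record) -> parsed are
    hypothetical callbacks supplied by the caller.
    """
    rng = random.Random(seed)
    # Assumption: samples are drawn uniformly at random without replacement.
    sample = rng.sample(records, min(sample_size, len(records)))
    rule = build_rule(sample)
    # The rule derived from the sample is applied to every record received.
    return [apply_rule(rule, record) for record in records]
```

A caller would pass in its own rule-generation and rule-application functions; the batch size of 1000 and sample size of 50 come from the example in the text.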

In this way, the cloud service platform of this application can automatically generate a data parsing rule from sample data of a data source to be connected, and then use that rule to parse the structure of the data sent by the data source efficiently and accurately, so that the data can be managed effectively based on the parsing result. This improves the platform's data parsing efficiency and, in turn, its efficiency in connecting to and managing the data of each data source.

In one illustrated embodiment, each of data source 100a, data source 100b and data source 100c may be a single server having the functions described above or a server cluster composed of multiple servers. This specification does not specifically limit this.

In one illustrated embodiment, the cloud service platform may include a Security Operations Center (SOC), and may specifically include a single server having the functions described above or a server cluster composed of multiple servers. This specification does not specifically limit this.

Please refer to FIG. 2, which is a schematic flowchart of a data parsing method provided by an exemplary embodiment. The method may be applied to the system architecture shown in FIG. 1 or a similar system architecture; specifically, it may be applied to the cloud service platform shown in FIG. 1. As shown in FIG. 2, the method may include the following steps S101 to S103.

Step S101: obtain a sample data set corresponding to any target data source among multiple data sources.

In one illustrated embodiment, the cloud service platform may be connected to multiple data sources in order to manage the data of each data source in a unified way. For example, the multiple data sources may belong to different vendors, or may be data sources of the same vendor for different products. This specification does not specifically limit this.

In one illustrated embodiment, after connecting to the multiple data sources, the cloud service platform may obtain a sample data set corresponding to any target data source among them. The sample data set may include multiple pieces of sample data.

In one illustrated embodiment, the sample data set may include data related to a target product (for example, a firewall product) in the target data source. In one illustrated embodiment, the sample data set may include multiple types of data related to the target product in the target data source. In one illustrated embodiment, the sample data set may include data related to multiple products in the target data source. This specification does not specifically limit this.

In one illustrated embodiment, the version of any product in the target data source may be updated over time, and the format of the data related to that product may change accordingly. Therefore, when any product in the target data source undergoes a version update, the cloud service platform may obtain a sample data set related to the updated product.

In one illustrated embodiment, the target data source may send the sample data set to the cloud service platform over the syslog protocol, and the cloud service platform receives it. In that case the sample data in the sample data set may be log data, for example event logs, traffic logs, domain name logs or other types of log data. In some possible embodiments, the sample data may be any other possible type of data. This specification does not specifically limit this.

For example, the sample data set may include the three pieces of sample data shown below. These three pieces of sample data may be event logs related to a firewall product in the target data source.

Sample data 1: module='syslog#info' severity='debug-1' type='url-http' session_id='1' time='1495093983' src_ip='10.20.3.4' dst_ip='10.20.3.5' proto='TCP'

Sample data 2: module='syslog#info' severity='debug-1' type='web' session_id='2' time='1495093984' src_ip='10.20.3.4' dst_ip='10.20.3.5' proto='TCP'

Sample data 3: module='syslog#info' severity='debug-1' type='tomcat' session_id='3' time='1495093985' src_ip='10.20.3.50' dst_ip='10.20.3.10' proto='TCP'

Step S102: analyze the data structure of the sample data in the sample data set, and generate a data parsing rule corresponding to the target data source based on the analysis result.

In one illustrated embodiment, after obtaining the sample data set sent by the target data source, the cloud service platform may analyze the data structure of the sample data in the sample data set and generate a corresponding data parsing rule based on the analysis result.

In one illustrated embodiment, the generated data parsing rule may indicate the data structure of the corresponding data in the target data source. More specifically, it may indicate the data structure of the data in the target data source that has the same format as the sample data.

For example, if the sample data set includes event logs related to a firewall product in the target data source, the generated data parsing rule may indicate the data structure of the event logs related to that firewall product in the target data source. This specification does not specifically limit this.

For example, if the sample data set includes traffic logs related to a firewall product in the target data source, the generated data parsing rule may indicate the data structure of the traffic logs related to that firewall product in the target data source. This specification does not specifically limit this.

For example, if the sample data set includes event logs related to both a firewall product and a network traffic detection product in the target data source, two data parsing rules may be generated from the sample data set. The two rules respectively indicate the data structure of the event logs related to the firewall product and of the event logs related to the network traffic detection product. This specification does not specifically limit this.

For example, if the sample data set includes both event logs and domain name logs related to a firewall product in the target data source, two data parsing rules may be generated from the sample data set. The two rules respectively indicate the data structure of the event logs and of the domain name logs related to the firewall product. This specification does not specifically limit this.

In one illustrated embodiment, the data parsing rule may include a first-type parsing rule that indicates the field separator and the KV separator in the data.

In one illustrated embodiment, after obtaining the sample data set sent by the target data source, the cloud service platform may count how many times each of several separators contained in the sample data occurs in each piece of sample data. The separators may be ASCII characters other than upper-case and lower-case letters, single quotes and double quotes, such as the space, comma, colon, equals sign, exclamation mark, minus sign and hash sign. This specification does not specifically limit this.

For example, suppose the sample data set contains sample data 1, sample data 2 and sample data 3 above. Each of the three contains four kinds of separator: the hash sign, the minus sign, the equals sign and the space. Counting the occurrences of each of these four separators in sample data 1, sample data 2 and sample data 3 yields the following four arrays.

Hash sign: [1, 1, 1], meaning the hash sign occurs 1 time in sample data 1, 1 time in sample data 2 and 1 time in sample data 3.

Minus sign: [2, 1, 1], meaning the minus sign occurs 2 times in sample data 1, 1 time in sample data 2 and 1 time in sample data 3.

Equals sign: [8, 8, 8], meaning the equals sign occurs 8 times in sample data 1, 8 times in sample data 2 and 8 times in sample data 3.

Space: [7, 7, 7], meaning the space occurs 7 times in sample data 1, 7 times in sample data 2 and 7 times in sample data 3.
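The counting step above can be reproduced with a short sketch. The candidate list `CANDIDATES` and the function name `count_separators` are illustrative assumptions, and the sample strings are written with a single space between adjacent fields, which is what the stated space count of 7 per record implies.

```python
# Candidate separators taken from the examples listed in the text.
CANDIDATES = [" ", ",", ":", "=", "!", "-", "#"]

# Sample data 1 to 3, assuming one space between adjacent fields.
SAMPLES = [
    "module='syslog#info' severity='debug-1' type='url-http' session_id='1' "
    "time='1495093983' src_ip='10.20.3.4' dst_ip='10.20.3.5' proto='TCP'",
    "module='syslog#info' severity='debug-1' type='web' session_id='2' "
    "time='1495093984' src_ip='10.20.3.4' dst_ip='10.20.3.5' proto='TCP'",
    "module='syslog#info' severity='debug-1' type='tomcat' session_id='3' "
    "time='1495093985' src_ip='10.20.3.50' dst_ip='10.20.3.10' proto='TCP'",
]


def count_separators(samples, candidates=CANDIDATES):
    """Return {separator: [occurrences in sample 1, sample 2, ...]}."""
    return {c: [s.count(c) for s in samples] for c in candidates}


counts = count_separators(SAMPLES)
print(counts["#"], counts["-"], counts["="], counts[" "])
# prints: [1, 1, 1] [2, 1, 1] [8, 8, 8] [7, 7, 7]
```

Candidates that never occur (comma, colon and exclamation mark here) simply get all-zero arrays and can be ignored in later steps.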

Further, in one illustrated embodiment, the cloud service platform may determine the field separator and the KV separator of the sample data set from among the several separators, based on the counted number of each separator in each piece of sample data.

In one illustrated embodiment, the cloud service platform may determine the separator that occurs most often in each piece of sample data as the KV separator of the sample data set, and determine the separator that occurs exactly one time fewer than the KV separator in each piece of sample data as the field separator of the sample data set.

For example, still taking sample data 1, sample data 2 and sample data 3 above: the equals sign occurs most often in each piece of sample data, so it can be determined as the KV separator of the sample data set. The space occurs one time fewer than the equals sign in each piece of sample data, so it can be determined as the field separator of the sample data set.
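The selection rule just described (most frequent separator as KV separator, the one occurring exactly once fewer as field separator) might be sketched as below. The function name `pick_separators` is an assumption, and the counts are hard-coded from the four arrays given in the text.

```python
def pick_separators(counts):
    """Pick (kv_separator, field_separator) from per-sample separator counts.

    counts maps each candidate separator to a list with one count per sample.
    Returns None for a separator that cannot be determined.
    """
    n_samples = len(next(iter(counts.values())))
    # Highest count observed in each sample, across all candidates.
    per_sample_max = [max(n[i] for n in counts.values()) for i in range(n_samples)]
    # KV separator: the candidate that is the most frequent in every sample.
    kv = next((c for c, n in counts.items() if n == per_sample_max), None)
    if kv is None:
        return None, None
    # Field separator: exactly one fewer than the KV separator in every sample.
    target = [m - 1 for m in per_sample_max]
    field = next((c for c, n in counts.items() if c != kv and n == target), None)
    return kv, field


COUNTS = {"#": [1, 1, 1], "-": [2, 1, 1], "=": [8, 8, 8], " ": [7, 7, 7]}
print(pick_separators(COUNTS))
# prints: ('=', ' ')
```

The "one fewer" relationship holds because a record with n key-value pairs contains n KV separators and n - 1 gaps between adjacent fields.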

In one illustrated embodiment, the cloud service platform may also take into account how much the count of each separator differs between pieces of sample data, in order to determine the KV separator and the field separator of the sample data set more accurately.

In one illustrated embodiment, the cloud service platform may determine, as the KV separator of the sample data set, the separator that occurs most often in each piece of sample data and whose count differs least between pieces of sample data. Correspondingly, it may determine, as the field separator of the sample data set, the separator that occurs one time fewer than the KV separator in each piece of sample data and whose count differs least between pieces of sample data.

In one illustrated embodiment, based on the counted number of each separator in each piece of sample data, the cloud service platform may further compute the variance of each separator's count across the pieces of sample data (that is, how much the count fluctuates). The platform may then use these variances when determining the field separator and the KV separator of the sample data set.

For example, still taking sample data 1, sample data 2 and sample data 3 above: the equals sign occurs most often in each piece of sample data, and its count is 8 in every piece of sample data, so its variance is the smallest. The equals sign can therefore be determined as the KV separator of the sample data set. Likewise, the count of the space is 7 in every piece of sample data, so its variance is also the smallest, and its count is one fewer than that of the equals sign in each piece of sample data. The space can therefore be determined as the field separator of the sample data set.
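One way to combine counts with variance is sketched below, using the population variance from Python's standard library. How the two criteria are weighed is not spelled out in the text; this sketch assumes that only separators with the minimum variance are eligible and that the most frequent of those is the KV separator. The function name is hypothetical.

```python
from statistics import pvariance


def pick_separators_with_variance(counts):
    """Pick (kv_separator, field_separator) using counts and count variance.

    Assumes at least one candidate separator occurs in the samples.
    """
    # (variance across samples, total occurrences) for each separator that occurs.
    stats = {c: (pvariance(n), sum(n)) for c, n in counts.items() if sum(n) > 0}
    min_var = min(v for v, _ in stats.values())
    # Only separators whose count fluctuates least are considered stable.
    stable = [c for c, (v, _) in stats.items() if v == min_var]
    # KV separator: the most frequent of the stable separators.
    kv = max(stable, key=lambda c: stats[c][1])
    # Field separator: stable, and one fewer than the KV separator in every sample.
    field = next(
        (c for c in stable
         if c != kv and all(a == b - 1 for a, b in zip(counts[c], counts[kv]))),
        None,
    )
    return kv, field


COUNTS = {"#": [1, 1, 1], "-": [2, 1, 1], "=": [8, 8, 8], " ": [7, 7, 7]}
print(pick_separators_with_variance(COUNTS))
# prints: ('=', ' ')
```

In this data the minus sign has a non-zero variance ([2, 1, 1]) because it appears inside field values, so it is excluded before the frequency comparison.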

In one illustrated embodiment, the cloud service platform may generate the first-type parsing rule from the field separator and KV separator obtained by analyzing the sample data set. Correspondingly, the first-type parsing rule may indicate the field separator and the KV separator in the data to be parsed.

In one illustrated embodiment, the data parsing rule may further include a second-type parsing rule that indicates the field type of each field in the data.

In one illustrated embodiment, after obtaining the sample data set sent by the target data source and determining the field separator and KV separator of the sample data by analysis, the cloud service platform may use those separators to identify each field in each piece of sample data, together with the field name and field value that each field contains.
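Given the two separators, splitting a record into field names and field values could look like the sketch below. The function name `split_fields` and the stripping of surrounding quotes are assumptions; a value that itself contains the field separator would need more careful tokenizing than this shows.

```python
def split_fields(record, field_sep=" ", kv_sep="="):
    """Split one record into {field_name: field_value}."""
    fields = {}
    for token in record.split(field_sep):
        # partition splits on the first KV separator only.
        name, sep, value = token.partition(kv_sep)
        if not sep:
            continue  # skip empty tokens and tokens without a KV separator
        # Assumption: values are wrapped in single or double quotes.
        fields[name] = value.strip("'\"")
    return fields


record = ("module='syslog#info' severity='debug-1' type='url-http' "
          "session_id='1' time='1495093983' src_ip='10.20.3.4' "
          "dst_ip='10.20.3.5' proto='TCP'")
fields = split_fields(record)
print(fields["src_ip"], fields["module"])
# prints: 10.20.3.4 syslog#info
```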

Further, in one illustrated embodiment, the cloud service platform may match the field value of each field in each piece of sample data against several field types. The field types may include, for example, time, numeric, IP and string. This specification does not specifically limit this.

In one illustrated embodiment, a regular expression may first be defined for each field type. The field value of each field in each piece of sample data is then matched against these regular expressions in order of matching priority, to determine the one or more field types that the field value matches.

In one illustrated embodiment, for any target field in the sample data, if the field type matched by the target field's value is the same target type in every piece of sample data, that target type is determined as the field type of the target field.

In one illustrated embodiment, for any target field in the sample data, if the field values of the target field match more than one field type, the type among them that covers the wider range of data is determined as the field type of the target field. For example, if the target field matches the numeric type in one piece of sample data and the string type in another, the string type can be determined as the field type of the target field, because the string type covers a wider range of data than the numeric type.

For example, still taking sample data 1, sample data 2 and sample data 3 above: the src_ip field matches the IP type in every piece of sample data, so its field type can be determined as IP. The module field matches the string type in every piece of sample data, so its field type can be determined as string. The session_id field matches the numeric type in every piece of sample data, so its field type can be determined as numeric. The time field matches both the numeric type and the time type in each piece of sample data; considering that the data range of the time type can cover that of the numeric type, and that the field name contains a keyword such as "time", its field type can be determined as time.
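The type-matching rules above might be sketched as follows. The specific regular expressions, their priority order, and the treatment of a 10- or 13-digit number as an epoch timestamp are all assumptions made for illustration (the text does not give the patterns, and this sketch does not use the field-name keyword hint). The widening order numeric, then time, then string follows the text; mixing IP with any other type falls back to string here.

```python
import re

# (type name, pattern), tried in priority order; all patterns are illustrative.
PATTERNS = [
    ("ip", re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")),
    ("time", re.compile(r"\d{10}(?:\d{3})?")),   # assumed epoch seconds or millis
    ("number", re.compile(r"-?\d+(?:\.\d+)?")),
    ("string", re.compile(r".*", re.S)),
]


def match_type(value):
    """Return the first type, in priority order, whose pattern matches the whole value."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return name
    return "string"


def infer_field_type(values):
    """Infer one field type from the values of the same field across samples."""
    seen = {match_type(v) for v in values}
    if len(seen) == 1:
        return seen.pop()          # every sample agrees on one type
    if seen <= {"number", "time"}:
        return "time"              # time covers the numeric range
    return "string"                # string covers every other type


print(infer_field_type(["10.20.3.4", "10.20.3.4", "10.20.3.50"]))  # prints: ip
print(infer_field_type(["1", "2", "3"]))                            # prints: number
print(infer_field_type(["1495093983", "1495093984"]))               # prints: time
print(infer_field_type(["1", "abc"]))                               # prints: string
```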

In one illustrated embodiment, the cloud service platform may generate the second-type parsing rule from the field types determined for the fields in the sample data. Correspondingly, the second-type parsing rule may indicate the field type of each field in the data to be parsed.

In one illustrated embodiment, the data parsing rule may further include a third-type parsing rule that indicates the standard field corresponding to each field in the data.

In one illustrated embodiment, after obtaining the sample data set sent by the target data source and determining the field separator and KV separator of the sample data by analysis, the cloud service platform may use those separators to identify each field in each piece of sample data, together with the field name and field value that each field contains.

In one illustrated embodiment, the cloud service platform may compute the similarity between each field in each piece of sample data and each of several standard fields. The standard fields form a standard set that may be built into the cloud service platform. For example, the time field timestamp, the source IP field src.ip and the destination IP field dst.ip may be preset standard fields. This specification does not specifically limit this.

In one illustrated embodiment, the cloud service platform may use a preset similarity calculation method to compute the similarity between each field in each piece of sample data and each of the standard fields.

It should be noted that this specification does not specifically limit the preset similarity calculation method.

In one illustrated embodiment, the preset similarity calculation method may include a cosine similarity method and/or an edit distance method. In some possible embodiments, the cloud service platform may use any other possible calculation method. This specification does not specifically limit this.

In one illustrated embodiment, the cloud service platform may compare the field name of each field in each piece of sample data with each of the standard fields, and thereby compute the similarity between each field and each standard field. For example, still taking sample data 1, sample data 2 and sample data 3 above, the cloud service platform may compute the similarity between the src_ip field and the standard source IP field src.ip (for example, 95%), between the src_ip field and the standard destination IP field dst.ip (for example, 30%), and between the src_ip field and the standard time field timestamp (for example, 1%). Further examples are omitted here.

Further, in one illustrated embodiment, based on the similarity between each field in each piece of sample data and each of the standard fields, the cloud service platform may determine, from among the standard fields, the standard field corresponding to each field in the sample data, and map each field to its corresponding standard field.

In one illustrated embodiment, the cloud service platform may determine a standard field whose similarity to a field in the sample data is greater than a preset threshold as the standard field corresponding to that field, and map the field to it. The preset threshold may be, for example, 80%, 90% or 95%. This specification does not specifically limit this.

In one illustrated embodiment, if the similarity between a target field in the sample data and every one of the standard fields is less than or equal to the preset threshold, the standard field with the highest similarity may be determined as the standard field corresponding to the target field and labeled as a recommended mapping. The label prompts the user to check the mapping manually and correct it if necessary, which helps ensure the accuracy of the standard field mapping.
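The threshold-based mapping and the "recommended mapping" fallback can be sketched with an edit-distance similarity, one of the methods the text mentions. The normalization (1 minus distance divided by the longer length), the 0.8 threshold and the function names are assumptions; the resulting scores are therefore not the illustrative 95%, 30% and 1% figures quoted in the text.

```python
STANDARD_FIELDS = ["timestamp", "src.ip", "dst.ip"]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance normalized to [0, 1]; 1.0 means identical names."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def map_field(name, standards=STANDARD_FIELDS, threshold=0.8):
    """Return (standard_field, needs_review) for one field name.

    needs_review is True when no standard field exceeds the threshold and the
    best match is only a recommended mapping for manual checking.
    """
    best = max(standards, key=lambda s: similarity(name, s))
    return best, similarity(name, best) <= threshold


print(map_field("src_ip"))  # prints: ('src.ip', False)
print(map_field("time"))    # prints: ('timestamp', True)
```

With this particular metric the time field falls below the threshold and is only recommended, which is where the field-type check described in the text (a time-typed field mapping to timestamp) could be used to confirm the mapping.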

In one illustrated embodiment, the cloud service platform may additionally use the field type of each field in the sample data when determining the standard field corresponding to that field. This specification does not specifically limit this. For example, the time field in sample data 1, sample data 2 and sample data 3 above has the time type, and its similarity to the standard time field timestamp is greater than the preset threshold, so the standard time field timestamp can be determined as the standard field corresponding to the time field, and the two can be mapped to each other.

For example, still taking sample data 1, sample data 2 and sample data 3 above: src_ip in the sample data may be mapped to the standard source IP field src.ip, dst_ip may be mapped to the standard destination IP field dst.ip, and time may be mapped to the standard time field timestamp. Further examples are omitted here.

In one illustrated embodiment, the cloud service platform may generate the third-type parsing rule from the mapping between each field in the sample data and its corresponding standard field. Correspondingly, the third-type parsing rule may indicate the standard field corresponding to each field in the data to be parsed.

Step S103: based on the data parsing rule, parse the data structure of the target data to be parsed in the target data source.

In one illustrated embodiment, the cloud service platform may use the generated data parsing rule to parse the data structure of the target data sent by the target data source, and manage the target data based on the parsing result. It can be understood that parsing the target data with the data parsing rule yields structured and standardized target data, which the cloud service platform can then manage accurately and effectively.

In one illustrated embodiment, the target data to be parsed may be data that has the same format as the sample data above, for example event logs related to the same product.

In one illustrated embodiment, the cloud service platform may use the first-type parsing rule contained in the data parsing rule to determine the field separator and KV separator of the target data to be parsed, and thereby determine each field in the target data together with its field name and field value. In this way the cloud service platform can identify specific information in the target data, such as the source IP, the destination IP and the time.

Further, in one illustrated embodiment, the cloud service platform may use the second-type parsing rule contained in the data parsing rule to determine the field type of each field in the target data, for example time, numeric, string or IP.

Further, in one illustrated embodiment, the cloud service platform may use the third-type parsing rule contained in the data parsing rule to map each field in the target data to its corresponding standard field, and thereby obtain structured and standardized target data.
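Applying the three types of parsing rule in sequence could be sketched as below. The layout of the `RULE` dictionary, the cast functions and the function name `parse_record` are hypothetical; the text does not specify how a data parsing rule is stored. The sketch assumes quoted values, one space between fields, and epoch-second timestamps.

```python
import ipaddress
from datetime import datetime, timezone

# Hypothetical representation of one data parsing rule.
RULE = {
    "field_sep": " ",                                   # first-type rule
    "kv_sep": "=",
    "types": {"session_id": "number", "time": "time",  # second-type rule
              "src_ip": "ip", "dst_ip": "ip"},
    "mapping": {"time": "timestamp",                    # third-type rule
                "src_ip": "src.ip", "dst_ip": "dst.ip"},
}

# Converters for each field type; unlisted fields are kept as strings.
CASTS = {
    "number": int,
    "time": lambda v: datetime.fromtimestamp(int(v), tz=timezone.utc),
    "ip": ipaddress.ip_address,
    "string": str,
}


def parse_record(record, rule=RULE):
    """Return one record as {standard_field_name: typed_value}."""
    parsed = {}
    for token in record.split(rule["field_sep"]):
        name, sep, raw = token.partition(rule["kv_sep"])
        if not sep:
            continue  # token without a KV separator
        value = raw.strip("'\"")
        cast = CASTS[rule["types"].get(name, "string")]
        # Fields without a standard-field mapping keep their original name.
        parsed[rule["mapping"].get(name, name)] = cast(value)
    return parsed


record = ("module='syslog#info' severity='debug-1' type='url-http' "
          "session_id='1' time='1495093983' src_ip='10.20.3.4' "
          "dst_ip='10.20.3.5' proto='TCP'")
parsed = parse_record(record)
print(parsed["src.ip"], parsed["session_id"], parsed["timestamp"].year)
# prints: 10.20.3.4 1 2017
```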

In one illustrated embodiment, the target data to be parsed may itself include the sample data. After receiving the target data to be parsed from the target data source, the cloud service platform may select part of the target data as sample data, generate the corresponding data parsing rule from that sample data on the fly, and use the rule to parse the data structure of all the target data received in that batch. This specification does not specifically limit this.

In summary, the cloud service platform of this application may first obtain sample data from each data source, analyze the data structure of that sample data, and automatically generate, based on the analysis result, a data parsing rule corresponding to the data in each data source. The platform may then use the automatically generated rules to parse the structure of the data to be parsed in each data source quickly and accurately, so that it can subsequently manage that data effectively and reliably based on the parsing results. By generating data parsing rules automatically from an analysis of sample data, this application improves the data parsing and data management efficiency of the cloud service platform, and thereby its service quality, to meet customers' actual needs.

Corresponding to the method flow described above, the embodiments of this specification also provide a data parsing apparatus. Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a data parsing apparatus provided by an exemplary embodiment. The apparatus 30 may be applied to the cloud service platform in the system architecture shown in FIG. 1 and may interface with multiple data sources respectively, so as to manage the data in those data sources in a unified manner. As shown in FIG. 3, the apparatus 30 includes:

an acquisition unit 301, configured to acquire a sample data set corresponding to any target data source among the multiple data sources;

a first parsing rule generation unit 302, configured to analyze the data structure of the sample data in the sample data set and to generate, based on the analysis result, a data parsing rule corresponding to the target data source, where the data parsing rule indicates the data structure of each piece of data in the target data source; and

a parsing unit 303, configured to parse, based on the data parsing rule, the data structure of the target data to be parsed in the target data source.
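To make the role of the parsing unit concrete, the following Python sketch applies a parsing rule to one line of target data. It assumes, for illustration only, that the rule reduces to a field separator and a key-value separator and that a parsed record is a dictionary; `parse_record` and the sample log line are hypothetical.

```python
def parse_record(line, field_sep, kv_sep):
    """Split one raw line into a {field name: field value} dict."""
    record = {}
    for field in line.split(field_sep):
        if kv_sep not in field:
            continue  # skip fragments that do not look like key/value pairs
        key, value = field.split(kv_sep, 1)  # split on the first separator only
        record[key.strip()] = value.strip()
    return record


parsed = parse_record("user=alice,ip=10.0.0.1,status=200", ",", "=")
print(parsed)
```

Splitting on only the first key-value separator keeps values such as `a=b` intact; how malformed fragments should be handled is left open in the text, so this sketch simply skips them.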

In one illustrated embodiment, the data parsing rule includes a first type of parsing rule that indicates the field separator and the key-value separator in the data. The field separator separates two adjacent fields in the data, and the key-value separator separates the field name and the field value of each field in the data, where the field name serves as the key of the field and the field value serves as the value corresponding to that key.

The first parsing rule generation unit 302 is specifically configured to:

count, for each sample data in the sample data set, the number of occurrences of each of multiple separators contained in the sample data;

determine, based on the counted numbers, the field separator and the key-value separator of the sample data set from among the multiple separators; and

generate the first type of parsing rule based on the field separator and the key-value separator of the sample data set.

In one illustrated embodiment, the first parsing rule generation unit 302 is specifically configured to:

determine the separator that occurs most often in each sample data as the key-value separator of the sample data set; and

determine the separator whose count in each sample data is one less than that of the key-value separator as the field separator of the sample data set.
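The counting heuristic above can be sketched in Python as follows. A line with n key-value fields contains n key-value separators but only n - 1 field separators, which is why the field separator is expected to occur exactly one time fewer. The candidate separator list and the function name `infer_separators` are assumptions made for this sketch, not part of the specification.

```python
CANDIDATES = ["=", ":", ",", ";", "|", "&", " "]  # assumed candidate separators


def infer_separators(samples, candidates=CANDIDATES):
    """Return (field_sep, kv_sep), or None if no candidate fits every sample."""
    counts = [{d: s.count(d) for d in candidates} for s in samples]
    # Key-value separator: the most frequent candidate in every sample.
    kv_candidates = [
        d for d in candidates
        if all(c[d] > 0 and c[d] == max(c.values()) for c in counts)
    ]
    if not kv_candidates:
        return None
    kv = kv_candidates[0]
    # Field separator: occurs exactly one time fewer than kv in every sample.
    field_candidates = [
        d for d in candidates
        if d != kv and all(c[d] > 0 and c[d] == c[kv] - 1 for c in counts)
    ]
    if not field_candidates:
        return None
    return field_candidates[0], kv


samples = [
    "user=alice,ip=10.0.0.1,status=200",
    "user=bob,ip=10.0.0.2,status=404",
]
result = infer_separators(samples)
print(result)
```

This strict version requires every sample to agree, so a single sample whose values contain a separator character makes it return None; it also cannot handle samples that contain only one field.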

In one illustrated embodiment, the first parsing rule generation unit 302 is specifically configured to:

determine, as the key-value separator of the sample data set, the separator that occurs most often in each sample data and whose count differs least between different sample data; and

determine, as the field separator of the sample data set, the separator whose count in each sample data is one less than that of the key-value separator and whose count differs least between different sample data.

In one illustrated embodiment, the first parsing rule generation unit 302 is specifically configured to:

calculate, for each separator, the variance of its count across the different sample data;

determine, as the key-value separator of the sample data set, the separator that occurs most often in each sample data and whose count has the smallest variance across different sample data; and

determine, as the field separator of the sample data set, the separator whose count in each sample data is one less than that of the key-value separator and whose count has the smallest variance across different sample data.
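The text does not say how the "most frequent" condition and the "smallest variance" condition are combined when samples disagree. The Python sketch below takes one possible reading: rank candidates first by the number of samples in which they satisfy the count condition, and use the lower population variance as the tie-breaker. The function name and this ranking scheme are assumptions.

```python
from statistics import pvariance


def infer_separators_by_variance(samples, candidates):
    """Return (field_sep, kv_sep) using counts plus variance, or None."""
    if not samples or len(candidates) < 2:
        return None
    counts = {d: [s.count(d) for s in samples] for d in candidates}
    n = len(samples)

    def wins(d):
        # Number of samples in which d is the most frequent candidate.
        return sum(
            1 for i in range(n)
            if counts[d][i] > 0
            and counts[d][i] == max(counts[c][i] for c in candidates)
        )

    kv = max(candidates, key=lambda d: (wins(d), -pvariance(counts[d])))

    def matches(d):
        # Number of samples in which d occurs exactly one time fewer than kv.
        return sum(
            1 for i in range(n)
            if counts[d][i] > 0 and counts[d][i] == counts[kv][i] - 1
        )

    others = [d for d in candidates if d != kv]
    field = max(others, key=lambda d: (matches(d), -pvariance(counts[d])))
    if wins(kv) == 0 or matches(field) == 0:
        return None
    return field, kv


samples = [
    "k1=v1;k2=v2;k3=v3",
    "k1=v1;k2=v2;k3=v3",
    "k1=a:b:c:d;k2=v2;k3=v3",  # a value that happens to contain ':' three times
]
result = infer_separators_by_variance(samples, ["=", ";", ":"])
print(result)
```

In the third sample ':' is as frequent as '=', but it is absent from the other samples, so '=' is still selected as the key-value separator and ';' as the field separator.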

In one illustrated embodiment, the data parsing rule further includes a second type of parsing rule that indicates the field type corresponding to each field in the data.

The apparatus 30 further includes a second parsing rule generation unit 304, configured to:

match the field value of each field in each sample data against multiple field types;

if, for any target field, the field type matched by its field value in every sample data is the same target type among the multiple field types, determine that target type as the field type corresponding to the target field;

if the field types matched by the field values of the target field across the sample data include more than one field type, determine the target type that covers the wider data range among those field types as the field type corresponding to the target field; and

generate the second type of parsing rule based on the field types determined for the fields in the sample data.
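A minimal Python sketch of this type inference follows. The three type names and their narrow-to-wide ordering (`int`, `float`, `string`) are assumptions chosen for illustration; the specification only requires that, when samples disagree, the type with the wider data range wins.

```python
TYPE_ORDER = ["int", "float", "string"]  # assumed ordering, narrowest first


def match_type(value):
    """Return the narrowest assumed type that the string value matches."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_field_types(records):
    """Keep, for every field, the widest type seen across all sample records."""
    widest = {}
    for record in records:
        for name, value in record.items():
            matched = match_type(value)
            previous = widest.get(name)
            if previous is None or TYPE_ORDER.index(matched) > TYPE_ORDER.index(previous):
                widest[name] = matched
    return widest


records = [
    {"status": "200", "latency": "12", "user": "alice"},
    {"status": "404", "latency": "3.5", "user": "42"},
]
field_types = infer_field_types(records)
print(field_types)
```

Here `status` matches `int` in every sample and stays `int`, `latency` is widened to `float`, and `user` is widened to `string` because one sample value is not numeric.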

In one illustrated embodiment, the data parsing rule further includes a third type of parsing rule that indicates the standard field corresponding to each field in the data.

The apparatus 30 further includes a third parsing rule generation unit 305, configured to:

calculate, using a preset similarity calculation method, the similarity between each field in each sample data and each of multiple standard fields;

determine, from among the multiple standard fields, the standard field whose similarity to a given field is greater than a preset threshold, and map each field to its corresponding standard field; and

generate the third type of parsing rule based on the mapping between the fields in the sample data and their corresponding standard fields.

In one illustrated embodiment, the preset similarity calculation method includes a cosine similarity calculation method and/or an edit distance calculation method.
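The edit distance variant can be sketched in Python as below. The normalization (one minus the distance divided by the longer name's length), the 0.6 threshold, and the standard field names are illustrative assumptions; the specification names the similarity methods but does not fix these details.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical names, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def map_to_standard(fields, standard_fields, threshold=0.6):
    """Map each field to its most similar standard field above the threshold."""
    mapping = {}
    for field in fields:
        best = max(standard_fields, key=lambda s: similarity(field, s))
        if similarity(field, best) > threshold:
            mapping[field] = best
    return mapping


standard = ["source_ip", "user_name", "status_code", "timestamp"]  # hypothetical
mapping = map_to_standard(["src_ip", "username", "status"], standard)
print(mapping)
```

With these assumptions `src_ip` and `username` are mapped, while `status` falls below the threshold against `status_code` (similarity 6/11) and stays unmapped, which shows how sensitive the result is to the chosen threshold.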

In one illustrated embodiment, the sample data in the sample data set and the target data to be parsed include log data in the target data source.

For how the functions and roles of the units in the apparatus 30 are implemented, see the description of the embodiments corresponding to FIG. 1 and FIG. 2 above; details are not repeated here. It should be understood that the apparatus 30 may be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, the apparatus in the logical sense is formed when the processor (CPU) of the device on which it resides reads the corresponding computer program instructions into memory and runs them. At the hardware level, in addition to the CPU and memory, the device on which the apparatus resides typically also includes other hardware, such as chips for transmitting and receiving wireless signals and/or boards for implementing network communication functions.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the units or modules may be selected according to actual needs to achieve the purpose of the solution in this specification. A person of ordinary skill in the art can understand and implement it without creative effort.

The apparatuses, units, and modules set out in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementing device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Corresponding to the method embodiments above, the embodiments of this specification also provide a server. Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a server provided by an exemplary embodiment. By way of example, the server 1000 may include one or more servers in the cloud service platform shown in FIG. 1 above; this specification places no specific limitation on this. As shown in FIG. 4, the server 1000 may include a processor 1001 and a memory 1002, and may further include an input device 1004 (for example, a keyboard) and an output device 1005 (for example, a display). The processor 1001, the memory 1002, the input device 1004, and the output device 1005 may be connected by a bus or in other ways. As shown in FIG. 4, the memory 1002 includes a computer-readable storage medium 1003, which stores a computer program that can be run by the processor 1001. The processor 1001 may be a general-purpose processor, a microprocessor, or an integrated circuit for controlling the execution of the method embodiments above.
When running the stored computer program, the processor 1001 may perform the steps of the data parsing method in the embodiments of this specification, including: acquiring a sample data set corresponding to any target data source among multiple data sources; analyzing the data structure of the sample data in the sample data set, and generating, based on the analysis result, a data parsing rule corresponding to the target data source, where the data parsing rule indicates the data structure of each piece of data in the target data source; parsing, based on the data parsing rule, the data structure of the target data to be parsed in the target data source, and managing the target data based on the parsing result; and so on.

For a detailed description of each step of the data parsing method, see the preceding content; details are not repeated here.

Corresponding to the method embodiments above, the embodiments of this specification also provide a computer-readable storage medium storing computer programs that, when run by a processor, perform the steps of the data parsing method in the embodiments of this specification. For details, see the description of the embodiments corresponding to FIG. 1 and FIG. 2 above; they are not repeated here.

The above are merely preferred embodiments of this specification and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within its scope of protection.

In a typical configuration, a terminal device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data.

Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.

Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

Claims (11)

1. A data parsing method, applied to a cloud service platform that interfaces with a plurality of data sources respectively, the method comprising:
acquiring a sample data set corresponding to any target data source among the plurality of data sources;
analyzing a data structure of sample data in the sample data set, and generating a data parsing rule corresponding to the target data source based on an analysis result, wherein the data parsing rule indicates a data structure of each piece of data in the target data source; and
parsing, based on the data parsing rule, a data structure of target data to be parsed in the target data source.
2. The method of claim 1, wherein the data parsing rule comprises a first type of parsing rule indicating a field separator and a key-value separator in data; the field separator is used for separating two adjacent fields in the data, and the key-value separator is used for separating a field name and a field value of each field in the data, wherein the field name serves as a key of the field and the field value serves as a value corresponding to the key of the field;
and wherein analyzing the data structure of the sample data in the sample data set and generating the data parsing rule corresponding to the target data source based on the analysis result comprises:
counting, for each sample data in the sample data set, the number of occurrences of each of a plurality of separators contained in the sample data;
determining the field separator and the key-value separator of the sample data set from the plurality of separators based on the counted numbers; and
generating the first type of parsing rule based on the field separator and the key-value separator of the sample data set.
3. The method of claim 2, wherein determining the field separator and the key-value separator of the sample data set from the plurality of separators based on the counted numbers comprises:
determining the separator that is most numerous in each sample data as the key-value separator of the sample data set; and
determining the separator whose number in each sample data is one less than that of the key-value separator as the field separator of the sample data set.
4. The method of claim 3, wherein determining the separator that is most numerous in each sample data as the key-value separator of the sample data set comprises:
determining, as the key-value separator of the sample data set, the separator that is most numerous in each sample data and whose number differs least between different sample data;
and wherein determining the separator whose number in each sample data is one less than that of the key-value separator as the field separator of the sample data set comprises:
determining, as the field separator of the sample data set, the separator whose number in each sample data is one less than that of the key-value separator and whose number differs least between different sample data.
5. The method of claim 4, wherein determining, as the key-value separator of the sample data set, the separator that is most numerous in each sample data and whose number differs least between different sample data comprises:
calculating, for each separator, a variance of its number across the different sample data; and
determining, as the key-value separator of the sample data set, the separator that is most numerous in each sample data and whose number has the smallest variance across different sample data;
and wherein determining, as the field separator of the sample data set, the separator whose number in each sample data is one less than that of the key-value separator and whose number differs least between different sample data comprises:
determining, as the field separator of the sample data set, the separator whose number in each sample data is one less than that of the key-value separator and whose number has the smallest variance across different sample data.
6. The method of claim 2, wherein the data parsing rule further comprises a second type of parsing rule indicating a field type corresponding to each field in the data;
the method further comprising:
matching a field value of each field in each sample data against a plurality of field types;
if the field type matched by the field value of any target field in every sample data is a same target type among the plurality of field types, determining the target type as the field type corresponding to the target field;
if the field types matched by the field values of the target field across the sample data comprise more than one field type, determining the target type with the wider data range among those field types as the field type corresponding to the target field; and
generating the second type of parsing rule based on the determined field types corresponding to the fields in the sample data.
7. The method of claim 2, wherein the data parsing rule further comprises a third type of parsing rule indicating a standard field corresponding to each field in the data;
the method further comprising:
calculating, using a preset similarity calculation method, a similarity between each field in each sample data and each of a plurality of standard fields;
determining, from the plurality of standard fields, the standard field whose similarity to each field is greater than a preset threshold, and mapping each field to the corresponding standard field; and
generating the third type of parsing rule based on the mapping between each field in the sample data and the corresponding standard field.
8. The method of any one of claims 1 to 7, wherein the sample data in the sample data set and the target data to be parsed comprise log data in the target data source.
9. A data parsing apparatus, applied to a cloud service platform that interfaces with a plurality of data sources respectively, the apparatus comprising:
an acquisition unit, configured to acquire a sample data set corresponding to any target data source among the plurality of data sources;
a first parsing rule generation unit, configured to analyze a data structure of sample data in the sample data set and generate a data parsing rule corresponding to the target data source based on an analysis result, wherein the data parsing rule indicates a data structure of each piece of data in the target data source; and
a parsing unit, configured to parse, based on the data parsing rule, a data structure of target data to be parsed in the target data source.
10. A server applied to a cloud service platform, the server comprising a memory and a processor; the memory has stored thereon a computer program executable by the processor; the processor, when running the computer program, performs the method of any one of claims 1 to 8.
11. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 8.
CN202310635626.9A 2023-05-31 2023-05-31 A data analysis method and related equipment Pending CN116701456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310635626.9A CN116701456A (en) 2023-05-31 2023-05-31 A data analysis method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310635626.9A CN116701456A (en) 2023-05-31 2023-05-31 A data analysis method and related equipment

Publications (1)

Publication Number Publication Date
CN116701456A true CN116701456A (en) 2023-09-05

Family

ID=87828586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310635626.9A Pending CN116701456A (en) 2023-05-31 2023-05-31 A data analysis method and related equipment

Country Status (1)

Country Link
CN (1) CN116701456A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118364787A (en) * 2024-04-28 2024-07-19 北京优特捷信息技术有限公司 Automatic processing method, device, equipment and medium for non-standardized log structure

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8620928B1 (en) * 2012-07-16 2013-12-31 International Business Machines Corporation Automatically generating a log parser given a sample log
CN110321457A (en) * 2019-04-19 2019-10-11 杭州玳数科技有限公司 Access log resolution rules generation method and device, log analytic method and system
CN111708860A (en) * 2020-06-15 2020-09-25 北京优特捷信息技术有限公司 Information extraction method, device, equipment and storage medium
CN114169311A (en) * 2021-12-07 2022-03-11 航天信息股份有限公司 Data analysis method and device
CN114328517A (en) * 2021-12-22 2022-04-12 上海欣兆阳信息科技有限公司 Business-oriented dynamically-configurable data aggregation management method and device
CN115174158A (en) * 2022-06-14 2022-10-11 阿里云计算有限公司 Cloud product configuration checking method based on multi-cloud management platform
CN115344456A (en) * 2022-08-12 2022-11-15 华能澜沧江水电股份有限公司 Automatic parsing method for Syslog

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination