WO2022047659A1

WO2022047659A1 - Multi-source heterogeneous log analysis method

Info

Publication number: WO2022047659A1
Application number: PCT/CN2020/113002
Authority: WO
Inventors: 汪祖民; 田纪宇; 秦静; 季长清
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2022-03-10
Anticipated expiration: 2023-03-02

Abstract

A multi-source heterogeneous log analysis method in the field of log data processing, addressing the problem of log analysis and comprising: step 1: determining the size of a time window according to the response time required by an information system; step 2: using an SGSE algorithm to process the log data within each time window into samples that can be called by an ECC log analysis algorithm; step 3: training an ECC log analysis model and using it to analyze whether the state within a time window is normal; and step 4: presenting the log analysis result. The invention enables anomaly analysis of logs.

Description

Multi-source heterogeneous log analysis method

Technical field

The invention belongs to the field of log data processing and relates to a multi-source heterogeneous log analysis method and system.

Background art

With the development of Internet technology, the devices in an information system generate an ever-growing number of logs. Analyzing logs that are produced by different devices and have different data characteristics is an important part of operation and maintenance work. Analyzing multi-source heterogeneous log data by automated means reveals promptly whether the information system is running normally or abnormally, which helps keep the system secure and stable and reduces an enterprise's operation and maintenance costs.

Current technical approaches to multi-source heterogeneous log analysis include single analysis followed by aggregation ^[1] and association analysis ^[2].

In the single-analysis-then-aggregation approach, the logs generated by each individual device in the information system are analyzed first to obtain the running state of that device, and preset rules are then applied to the device states to judge whether the information system as a whole is abnormal. However, this approach does not analyze the logs of different devices in combination; it judges each device's state separately before aggregating, so it cannot uncover relationships between the logs of different devices.

In the association-analysis approach, characteristic events are first generated from the contents of the fields in each log. The events generated by different devices within one time window are clustered and compared for similarity, and duplicate events of the same kind are removed. Events of the same kind from different devices are then merged, and a statistical report on each kind of event is produced. However, because the goal of this approach is a statistical report, it does not mine the relationships between events in depth or present them directly to the user, and a clustering algorithm alone cannot classify each event accurately.

Summary of the invention

To solve the problem of analyzing logs, the present invention proposes the following technical solution: a multi-source heterogeneous log analysis method comprising the following steps:

Step 1: Determine the size of the time window according to the response time required by the information system;

Step 2: Use the SGSE algorithm to process the log data in each time window into samples that can be called by the ECC log analysis algorithm;

Step 3: Train the ECC log analysis model and use it to analyze whether the state within a time window is normal;

Step 4: Present the log analysis results.

Further, the model is trained as follows:

Step 1: For the multi-source heterogeneous log data in the normal and abnormal time windows, count the number of logs to generate the log quantity state subsequence; count the number of logs of each log kind generated on each device within the time window to generate the user behavior state subsequence; and count how many times each type occurs in certain important fields of each device within the time window to generate the field state subsequence;

Step 2: For each time window, generate (n+m+j) sample data sets from the n features in the log quantity state subsequence, the m features in the user behavior state subsequence, and the j features in the field state subsequence;

Step 3: For each normal and abnormal time window, take the sample data set whose label is a feature in the log quantity state subsequence and, according to the ECC expression, calculate its difference value pairwise against the sample data sets of the other normal and abnormal time windows. The calculation expressions are:

v1 = 1 - v2 (3)

f(tableα) = v1*M′_tableα + v2*M″_tableα + bias (4)

where

tableα is the sample obtained when a feature in the log quantity state subsequence is used as the label,

tableα′ is the sample obtained when a feature in the log quantity state subsequence of a normal time window is used as the label,

tableα″ is the sample obtained when a feature in the log quantity state subsequence of an abnormal time window is used as the label,

M′_tableα is the mean squared error between tableα and tableα′,

M″_tableα is the mean squared error between tableα and tableα″,

bias is the bias term,

v1 and v2 are the variation coefficients of the mean squared errors for the normal and abnormal time windows, respectively, with v1 = v2 during training, which together with equation (3) gives v1 = v2 = 0.5,

f(tableα) is the difference value calculated for the time window being trained when a feature in the log quantity state subsequence is used as the label;

Step 4: Using equations (1)-(4), calculate the difference values of each normal time window against the other normal and abnormal time windows and save them as the set U₁; likewise calculate the difference values of each abnormal time window against the normal and abnormal time windows and save them as the set P₁. The confidence interval σ(α) of the log quantity state subsequence for normal time windows is then

σ(α) = [min(U₁), max(U₁)] ∩ [min(P₁), max(P₁)]

Step 5: For each normal and abnormal time window, take the sample data set whose label is a feature in the user behavior state subsequence and, according to the ECC expression, calculate its difference value pairwise against the sample data sets of the other normal and abnormal time windows. The calculation expressions are:

v3 = 1 - v4 (7)

f(tableβ) = v3*M′_tableβ + v4*M″_tableβ + bias (8)

where

tableβ is the sample obtained when a feature in the user behavior state subsequence is used as the label,

tableβ′ is the sample obtained when a feature in the user behavior state subsequence of a normal time window is used as the label,

tableβ″ is the sample obtained when a feature in the user behavior state subsequence of an abnormal time window is used as the label,

M′_tableβ is the mean squared error between tableβ and tableβ′,

M″_tableβ is the mean squared error between tableβ and tableβ″,

bias is the bias term,

v3 and v4 are the variation coefficients of the mean squared errors for the normal and abnormal time windows, respectively, with v3 = v4 during training,

f(tableβ) is the difference value calculated for the time window being trained when a feature in the user behavior state subsequence is used as the label;

Step 6: Using equations (5)-(8), calculate the difference values of each normal time window against the normal and abnormal time windows and save them as the set U₂; likewise calculate the difference values of each abnormal time window against the normal and abnormal time windows and save them as the set P₂. The confidence interval σ(β) of the user behavior state subsequence for normal time windows is then

σ(β) = [min(U₂), max(U₂)] ∩ [min(P₂), max(P₂)]

Step 7: For each normal and abnormal time window, take the sample data set whose label is a feature in the field state subsequence and, according to the ECC expression, calculate its difference value pairwise against the sample data sets of the other normal and abnormal time windows. The calculation expressions are:

v5 = 1 - v6 (11)

f(tableγ) = v5*M′_tableγ + v6*M″_tableγ + bias (12)

where

tableγ is the sample obtained when a feature in the field state subsequence is used as the label,

tableγ′ is the sample obtained when a feature in the field state subsequence of a normal time window is used as the label,

tableγ″ is the sample obtained when a feature in the field state subsequence of an abnormal time window is used as the label,

M′_tableγ is the mean squared error between tableγ and tableγ′,

M″_tableγ is the mean squared error between tableγ and tableγ″,

bias is the bias term,

v5 and v6 are the variation coefficients of the mean squared errors for the normal and abnormal time windows, respectively, with v5 = v6 during training,

f(tableγ) is the difference value calculated for the time window being trained when a feature in the field state subsequence is used as the label;

Step 8: Using equations (9)-(12), calculate the difference values of each normal time window against the normal and abnormal time windows and save them as the set U₃; likewise calculate the difference values of each abnormal time window against the normal and abnormal time windows and save them as the set P₃. The confidence interval σ(γ) of the field state subsequence for normal time windows is then

σ(γ) = [min(U₃), max(U₃)] ∩ [min(P₃), max(P₃)].
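
As an illustration of the training steps, the following Python sketch computes a difference value in the form of equations (3) and (4) and an interval σ as the intersection of the two ranges. It is a minimal sketch, not the patented implementation: the function names, the representation of a sample as a flat list of numbers, the toy values, and the use of the standard mean squared error for M′ and M″ (equations (1) and (2) are not reproduced in this text) are all assumptions.

```python
def mse(a, b):
    """Standard mean squared error between two equal-length samples (assumed form)."""
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def difference_value(sample, normal_ref, abnormal_ref, v1=0.5, bias=0.0):
    """Equation (4), with v2 derived from v1 through equation (3)."""
    v2 = 1.0 - v1
    return v1 * mse(sample, normal_ref) + v2 * mse(sample, abnormal_ref) + bias


def confidence_interval(u_values, p_values):
    """[min(U), max(U)] intersected with [min(P), max(P)]; None if they do not overlap."""
    low = max(min(u_values), min(p_values))
    high = min(max(u_values), max(p_values))
    return (low, high) if low <= high else None


# Hypothetical toy data: one sample, one normal reference, one abnormal reference.
f_alpha = difference_value([1.0, 2.0], [1.0, 4.0], [5.0, 2.0])
# Hypothetical difference-value sets U and P.
sigma_alpha = confidence_interval([1.0, 3.0, 2.0], [2.5, 6.0])
```

The same two functions would apply unchanged to the β and γ subsequences, since equations (8) and (12) have the same form as equation (4).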

Further, the log analysis method comprises the following steps:

Step 1: To analyze a time window under test, randomly select one feature from each subsequence as a label, giving three labels and three samples that represent the time window under test;

Step 2: Set the initial values v1 = v2, v3 = v4, v5 = v6, calculate the difference values f(tableα), f(tableβ), f(tableγ) of the three samples of the time window under test with equations (1)-(12), and check whether they lie within the corresponding confidence intervals σ(α), σ(β), σ(γ);

Step 3: If all three difference values lie within their confidence intervals, constrain v1, v2, v3, v4, v5, v6 with the constraint formulas

v1 = a1*v1 (0 ≤ a1 < 1)

v3 = a2*v3 (0 ≤ a2 < 1)

v5 = a3*v5 (0 ≤ a3 < 1)

v1, v3, and v5 are the variation coefficients of the mean squared errors for the normal time windows. Shrinking v1, v3, and v5 reduces the influence of the normal time windows on the difference value and increases the influence of the abnormal time windows. With the new v1, v2, v3, v4, v5, v6 and the three samples, apply equations (1)-(12) again and check whether the three difference values of the time window under test lie within the corresponding confidence intervals σ(α), σ(β), σ(γ); the number of times the constraint is repeated is determined by the required false negative rate;

Step 4: If, after the repeated constraints are finished, the difference values of the three samples of the time window under test still lie within the confidence intervals, the information system is considered normal in that time window; otherwise it is considered abnormal.
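
The analysis steps above can be sketched as a short loop. This is a hedged illustration only: the function name and argument layout are hypothetical, the mean squared errors M′ and M″ of the three samples are taken as precomputed inputs, and v2, v4, v6 are assumed to be recomputed as 1 minus the shrunken coefficient after every constraint round.

```python
def analyse_window(mse_pairs, intervals, shrink, rounds, bias=0.0):
    """Return True if the window under test is judged normal.

    mse_pairs: three (M_normal, M_abnormal) pairs, one per sample.
    intervals: three (low, high) confidence intervals sigma.
    shrink:    three factors a1, a2, a3, each in [0, 1).
    rounds:    how many times the constraint is repeated.
    """
    weights = [0.5, 0.5, 0.5]  # initial v1, v3, v5 (equal to v2, v4, v6)
    for _ in range(rounds + 1):  # one unconstrained check, then `rounds` constrained ones
        for (m_norm, m_abn), (low, high), v in zip(mse_pairs, intervals, weights):
            f = v * m_norm + (1.0 - v) * m_abn + bias
            if not (low <= f <= high):
                return False  # a difference value left its confidence interval
        weights = [a * v for a, v in zip(shrink, weights)]  # v = a * v
    return True


# Hypothetical values: the abnormal MSE exceeds the normal MSE for every sample.
pairs = [(1.0, 2.0)] * 3
sigmas = [(0.0, 1.6)] * 3
factors = [0.5, 0.5, 0.5]
passes_initial_check = analyse_window(pairs, sigmas, factors, rounds=0)
passes_after_one_round = analyse_window(pairs, sigmas, factors, rounds=1)
```

In this toy case the window passes the initial check but fails after one constraint round, which shows how repeating the constraint makes the test stricter.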

Beneficial effects: The SGSE algorithm proposed by the invention processes multi-source heterogeneous logs into multi-dimensional samples that represent the state of the information system and can be analyzed by an algorithm. The ECC algorithm proposed for multi-source heterogeneous logs analyzes these multi-dimensional samples and derives the running state of the information system in a time window from the multi-source heterogeneous logs of that window.

Description of drawings

Figure 1: Flowchart of multi-source heterogeneous log analysis.

Figure 2: Sample generation for the time window under test.

Detailed description

The present invention proposes an SGSE (State Generation Sequential Extraction) algorithm for processing the multi-source heterogeneous log data in an information system and, in view of the characteristics of such logs, a new ECC (Error Coefficient Constraint) algorithm for determining the running state of the information system. Referring to Figure 1, the multi-source heterogeneous log analysis method of the invention comprises the following steps:

Step 1: Determine the size of the time window according to the response time required by the information system.

Step 2: Use the SGSE algorithm to process the logs in each time window, turning the log data of each time window into samples.

Step 3: Train the ECC log analysis model and use it to analyze the time windows that need to be analyzed.

Step 4: Present the log analysis results.

In a preferred solution, the ECC log analysis model is trained by the model training steps 1 to 8 given in the summary of the invention, using the same equations and symbol definitions.

In a preferred solution, the log analysis method comprises the log analysis steps 1 to 4 given in the summary of the invention, with the same constraint formulas v1 = a1*v1 (0 ≤ a1 < 1), v3 = a2*v3 (0 ≤ a2 < 1), and v5 = a3*v5 (0 ≤ a3 < 1).

The present invention also proposes a log anomaly detection system, comprising:

A time window division module, which determines the size of the time window according to the information system's response time requirement.

An SGSE data processing module, which processes the log data by time window into sample data that the ECC log analysis model can call.

An ECC model training module, which trains the ECC log analysis model.

An ECC log analysis module, which uses the ECC log analysis model to analyze a time window under test and judge whether it is normal, determining from the logs of the devices in the information system whether the state of the information system in that time window is normal.

In one solution, the time window division module makes the time window as short as possible within the response time that the user requires of the information system.
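
To make the role of the time window concrete, the following Python sketch groups log records into fixed-size windows. It is an assumption-laden illustration: the record layout (timestamp, device name, field dictionary), the function name `split_into_windows`, and the 60-second window are hypothetical and are not specified by the patent.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_into_windows(records, window):
    """Group (timestamp, device, fields) records by fixed-size time window."""
    epoch = datetime(1970, 1, 1)
    windows = defaultdict(list)
    for timestamp, device, fields in records:
        index = (timestamp - epoch) // window  # integer number of the window
        windows[index].append((device, fields))
    return dict(windows)


# Hypothetical records from two devices.
records = [
    (datetime(2020, 9, 2, 10, 0, 5), "waf", {"action": "alert"}),
    (datetime(2020, 9, 2, 10, 0, 40), "firewall", {"protocol": "TCP"}),
    (datetime(2020, 9, 2, 10, 1, 10), "waf", {"action": "block"}),
]
windows = split_into_windows(records, timedelta(seconds=60))
```

With a 60-second window the first two records share one window and the third falls into the next one.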

In one solution, the SGSE data processing module is divided into an SG state generation sub-module and an SE sequential extraction sub-module.

The SG state generation sub-module identifies the devices in the information system, such as the WAF, load balancer, and firewall. It counts the logs of each device to generate the log quantity state subsequence, whose features include the number of WAF logs, the number of load balancer logs, and so on.

Table 1: Log quantity state subsequence

Time window / device type | Number of WAF logs | Number of load balancing logs | Number of firewall logs | ... | Number of Nginx logs
Time window N | α₁ | α₂ | α₃ | ... | α_n
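
A row of Table 1 can be produced with a simple per-device count. The sketch below is illustrative only: the record layout (device name, field dictionary), the function name, and the device identifiers are assumptions.

```python
from collections import Counter


def log_count_subsequence(window_records, devices):
    """Return the number of logs per device, in the order given by `devices`."""
    counts = Counter(device for device, _fields in window_records)
    return [counts[device] for device in devices]  # Counter yields 0 for absent devices


# Hypothetical records of one time window.
window_records = [("waf", {}), ("waf", {}), ("firewall", {})]
alpha = log_count_subsequence(window_records, ["waf", "load_balancer", "firewall"])
```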

Counting the number of logs of each log kind generated on each device within the time window gives the user behavior state subsequence. The log kinds are determined by combining the types found in every field whose entropy is not 0. For example, a WAF device has an action field that records the action the WAF took on a request, with two types, alert and block, and an http_method field that records the HTTP status, with four types, 200, 404, 500, and 501; from these two fields the WAF device can produce 2*4 = 8 different kinds of logs. The subsequence contains features such as the counts for the WAF log kinds, the firewall log kinds, and so on.
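
The rule for log kinds can be sketched as follows. This is one possible reading, stated as an assumption: the entropy is the Shannon entropy of a field's values over the logs passed in, a field with a single distinct value has entropy 0 and is ignored, and a log kind is the tuple of values of the remaining fields. The function names and the sample logs are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in Counter(values).values())


def log_kind_counts(device_logs):
    """Count logs per kind, where a kind combines the fields with non-zero entropy."""
    fields = sorted({name for log in device_logs for name in log})
    informative = [f for f in fields
                   if entropy([log.get(f) for log in device_logs]) > 0]
    return Counter(tuple(log.get(f) for f in informative) for log in device_logs)


# Hypothetical WAF logs; the constant "vendor" field has entropy 0 and is ignored.
waf_logs = [
    {"action": "alert", "http_method": "200", "vendor": "x"},
    {"action": "block", "http_method": "404", "vendor": "x"},
    {"action": "alert", "http_method": "200", "vendor": "x"},
]
kinds = log_kind_counts(waf_logs)
```

Each count in `kinds` would then become one feature of the user behavior state subsequence.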

Table 2: User behavior state subsequence

Counting how many times each type occurs in certain important fields of each device within the time window gives the field state subsequence. Its features include, for example, the number of alert entries in the WAF action field and the proportion of TCP entries in the firewall protocol field.
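
A minimal sketch of the field counts, assuming that the important fields are supplied as (device, field, value) triples chosen by the operator; the record layout and all names are hypothetical, and proportion features such as the TCP share are not covered here.

```python
def field_state_subsequence(window_records, tracked):
    """Count, for each tracked (device, field, value), the matching logs in the window."""
    result = []
    for device, field, value in tracked:
        matches = sum(
            1 for rec_device, fields in window_records
            if rec_device == device and fields.get(field) == value
        )
        result.append(matches)
    return result


# Hypothetical records of one time window.
window_records = [
    ("waf", {"action": "alert"}),
    ("waf", {"action": "block"}),
    ("waf", {"action": "alert"}),
    ("firewall", {"protocol": "TCP"}),
]
gamma = field_state_subsequence(
    window_records,
    [("waf", "action", "alert"), ("firewall", "protocol", "TCP")],
)
```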

Table 3: Field state subsequence

Within a single time window, the SE sequential extraction sub-module takes the features of a subsequence one at a time, uses each as a label, and merges it with all the features of the other two subsequences into one sample, as shown in the following tables:

Table 4: Merged feature table with the number of WAF logs (log quantity state subsequence) as the label

Table 5: Merged feature table with WAF log kind 1 (user behavior state subsequence) as the label

Table 6: Merged feature table with the number of alert entries in the WAF action field (field state subsequence) as the label

Through the SE sequential extraction sub-module, every feature of every subsequence is paired with all the features of the other two subsequences, so the features of the generated samples are linked to one another. To analyze a time window, one feature is randomly selected from each subsequence, and the resulting three labels represent the time window under test.
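
The sequential extraction can be sketched as below. The dictionary-based sample format and the names are assumptions made for illustration; the point is only that n + m + j samples result, each pairing one label with all features of the other two subsequences.

```python
def sequential_extraction(alpha, beta, gamma):
    """Build one sample per feature: the feature is the label, the other two
    subsequences supply the remaining features."""
    subsequences = {"alpha": alpha, "beta": beta, "gamma": gamma}
    samples = []
    for name, sequence in subsequences.items():
        others = [x for other, seq in subsequences.items() if other != name for x in seq]
        for index, label in enumerate(sequence):
            samples.append(
                {"source": name, "index": index, "label": label, "features": others}
            )
    return samples


# Hypothetical subsequences of one time window (n = 3, m = 2, j = 4).
samples = sequential_extraction([120, 30, 45], [7, 2], [5, 0.8, 3, 1])
```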

In one solution, the ECC model training module is divided into a sub-module for determining a small number of normal and abnormal time windows and an ECC model training sub-module.

The sub-module for determining a small number of normal and abnormal time windows addresses the fact that a clustering algorithm alone cannot classify accurately, and that inaccurate classification is amplified during training of the analysis model and distorts the analysis results. In the existing historical data of the information system, expert knowledge can identify a small number of time windows precisely as normal or abnormal.

The ECC model training sub-module trains the model as follows:

Step 1: Process the multi-source heterogeneous log data in the normal and abnormal time windows with the SG state generation sub-module: count the number of logs to generate the log quantity state subsequence; count the number of logs of each log kind generated on each device within the time window to generate the user behavior state subsequence; and count how many times each type occurs in certain important fields of each device within the time window to generate the field state subsequence.

Step 2: For each time window, use the SE sequential extraction sub-module to generate (n+m+j) sample data sets from the n features in the log quantity state subsequence, the m features in the user behavior state subsequence, and the j features in the field state subsequence.

Steps 3 to 8 are the same as steps 3 to 8 of the model training procedure given in the summary of the invention, with the same equations, symbol definitions, and confidence intervals σ(α), σ(β), σ(γ).

In one solution, the ECC log analysis module performs the analysis as follows:

Step 1: To analyze a time window under test, randomly select one feature from each subsequence as a label, giving three labels and three samples that represent the time window under test; the samples are shown in Figure 2.

Step 2: Set the initial values v1 = v2, v3 = v4, v5 = v6, calculate the difference values f(tableα), f(tableβ), f(tableγ) of the three samples of the window under test with equations (1)-(12) described for the ECC model training sub-module, and check whether they lie within the corresponding confidence intervals σ(α), σ(β), σ(γ).

Step 3: If all three difference values lie within their confidence intervals, improve the detection accuracy by constraining v1, v2, v3, v4, v5, v6 with the constraint formulas

v1 = a1*v1 (0 ≤ a1 < 1)

v3 = a2*v3 (0 ≤ a2 < 1)

v5 = a3*v5 (0 ≤ a3 < 1)

v1, v3, and v5 are the variation coefficients for the normal time windows. Shrinking v1, v3, and v5 reduces the influence of the normal time windows on the difference value and increases the influence of the abnormal time windows. The new v1, v2, v3, v4, v5, v6 and the three samples are substituted again into equations (1)-(12) described for the ECC model training sub-module to check whether the difference values lie within the confidence intervals. The number of times the constraint is repeated is determined by the preset false negative rate requirement: if the requirement is lenient, the constraint is repeated few times; if a low false negative rate is required, it is repeated more times.

Step 4: If, after the repeated constraints are finished, the difference values of the time window still lie within the confidence intervals, the information system is considered normal in the time window under test; otherwise it is considered abnormal.
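
The effect of repeating the constraint can be written out explicitly. The following worked derivation assumes that equation (3) is re-applied after every constraint round, so that v2 = 1 - v1 always holds, and that the same factor a1 is used in every round; the superscript (k) for the round number is notation introduced here for illustration.

```latex
\begin{aligned}
v_1^{(k)} &= a_1^{k}\, v_1^{(0)}, \qquad v_2^{(k)} = 1 - v_1^{(k)}, \\
f^{(k)}(\mathrm{table}\alpha)
  &= v_1^{(k)} M'_{\mathrm{table}\alpha} + v_2^{(k)} M''_{\mathrm{table}\alpha} + \mathrm{bias} \\
  &= M''_{\mathrm{table}\alpha}
     + a_1^{k}\, v_1^{(0)} \left( M'_{\mathrm{table}\alpha} - M''_{\mathrm{table}\alpha} \right)
     + \mathrm{bias}.
\end{aligned}
```

Since 0 ≤ a1 < 1, the factor a1^k shrinks with every round, and the difference value moves toward M″_tableα + bias, the value determined by the abnormal time windows alone. The same reasoning applies to v3 and v5.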

The single-analysis-then-aggregation approach does not analyze the logs of different devices in combination; it judges each device's state separately before aggregating, so it cannot uncover relationships between the logs of different devices. The SGSE algorithm proposed in the invention processes the logs of different devices effectively and aggregates them into samples that reflect the state of the information system in a time window, so the state of the information system as a whole is judged comprehensively rather than by analyzing single devices and aggregating the results afterwards.

The association-analysis approach aims at statistical reports on events and cannot mine the relationships between events in depth or present them directly to the user. The SGSE-ECC algorithm analyzes multi-source heterogeneous logs in depth: rather than only generating statistical reports on events by clustering, it mines the relationships between logs and presents the state of the information system to the user directly.

Overall, the invention uses the SGSE-ECC algorithm model to perform data processing, sample generation, and state analysis on multi-source heterogeneous logs, automating log analysis and reducing operation and maintenance costs.

The SGSE algorithm proposed by the invention processes the log data of each device and aggregates it into samples that reflect the state of a time window; the ECC algorithm analyzes these multi-dimensional samples and adjusts the detection accuracy and false negative rate by constraining the variation coefficients.

Claims

A method for analyzing multi-source heterogeneous logs, comprising the following steps:

Step 1: Determine the size of the time window according to the response time required by the information system;

Step 2: Use the SGSE algorithm to process the log data in each time window into a sample that can be called by the ECC log analysis algorithm;

Step 3: Train and use the ECC log analysis model to analyze whether the time window is normal;

Step 4: Present the log analysis results.
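Step 1 and the first part of Step 2 can be pictured with a minimal Python sketch that groups log records into fixed-size time windows. The tuple layout `(timestamp_seconds, device, message)` and the function name `split_into_windows` are assumptions for this example only; the claim itself does not prescribe a data layout.

```python
from collections import defaultdict

def split_into_windows(records, window_seconds):
    """Group (timestamp_seconds, device, message) records by time window index.

    window_seconds would be chosen from the response time the information
    system requires; the value passed in here is up to the caller.
    """
    windows = defaultdict(list)
    for ts, device, message in records:
        index = int(ts // window_seconds)  # which window the record falls into
        windows[index].append((device, message))
    return dict(windows)
```

A smaller `window_seconds` gives faster detection at the cost of fewer logs per sample; this trade-off is a general observation, not a statement from the source.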

The multi-source heterogeneous log analysis method according to claim 1, wherein the step of model training is as follows:

Step 1: For multi-source heterogeneous log data in normal and abnormal time windows, count the number of logs to generate a log quantity state subsequence, count the number of logs of each type generated on each device within the time window to generate a user behavior state subsequence, and count the number of occurrences of each type of value in selected important fields of each device within the time window to generate a field state subsequence;

Step 2: Generate (n+m+j) samples from n features in the log quantity state subsequence, m features in the user behavior state subsequence, and j features in the field state subsequence under each time window data set;

Step 3: For each normal and abnormal time window, take a feature in the log quantity state subsequence as the label of the sample data set and, according to the ECC expression, calculate difference values pairwise against the sample data sets of the other normal and abnormal time windows. The calculation expression is:

v1=1-v2 (3)

f(tableα)=v1*M′_tableα+v2*M″_tableα+bias (4)

tableα represents the sample that takes a feature in the log quantity state subsequence as its label,

tableα′ represents the sample that takes a feature in the log quantity state subsequence as its label under a normal time window,

tableα″ represents the sample that takes a feature in the log quantity state subsequence as its label under an abnormal time window,

M′_tableα is the mean squared error between tableα and the samples tableα′ under normal time windows,

M″_tableα is the mean squared error between tableα and the samples tableα″ under abnormal time windows,

bias is the bias (offset) term,

v1 and v2 are the variation coefficients of the mean squared error of the normal time window and the abnormal time window, respectively, and v1=v2 during training;

f(tableα) is the difference value calculated for a training time window with a feature in the log quantity state subsequence as the label;

Step 4: Using equations (1)-(4), calculate the difference values between each normal time window and the other normal and abnormal time windows and save them as a set U₁; likewise, using equations (1)-(4), calculate the difference values between each abnormal time window and the other normal and abnormal time windows and save them as a set P₁. The confidence interval σ(α) of the log quantity state subsequence under the normal time window is obtained as

σ(α)=[min(U1),max(U1)]∩[min(P1),max(P1)]

Step 5: For each normal and abnormal time window, take a feature in the user behavior state subsequence as the label of the sample data set and, according to the ECC expression, calculate difference values pairwise against the sample data sets of the other normal and abnormal time windows. The calculation expression is:

v3=1-v4 (7)

f(tableβ)=v3*M′_tableβ+v4*M″_tableβ+bias (8)

tableβ represents the sample that takes a feature in the user behavior state subsequence as its label,

tableβ′ represents the sample that takes a feature in the user behavior state subsequence as its label under a normal time window,

tableβ″ represents the sample that takes a feature in the user behavior state subsequence as its label under an abnormal time window,

M′_tableβ is the mean squared error between tableβ and the samples tableβ′ under normal time windows,

M″_tableβ is the mean squared error between tableβ and the samples tableβ″ under abnormal time windows,

bias is the bias (offset) term,

v3 and v4 are the variation coefficients of the mean squared error of the normal time window and the abnormal time window, respectively, and v3=v4 during training;

f(tableβ) is the difference value calculated for a training time window with a feature in the user behavior state subsequence as the label;

Step 6: Using equations (5)-(8), calculate the difference values between each normal time window and the other normal and abnormal time windows and save them as a set U₂; likewise, using equations (5)-(8), calculate the difference values between each abnormal time window and the other normal and abnormal time windows and save them as a set P₂. The confidence interval σ(β) of the user behavior state subsequence under the normal time window is obtained as

σ(β)=[min(U2),max(U2)]∩[min(P2),max(P2)]

Step 7: For each normal and abnormal time window, take a feature in the field state subsequence as the label of the sample data set and, according to the ECC expression, calculate difference values pairwise against the sample data sets of the other normal and abnormal time windows. The calculation expression is:

v5=1-v6 (11)

f(tableγ)=v5*M′_tableγ+v6*M″_tableγ+bias (12)

tableγ represents the sample that takes a feature in the field state subsequence as its label,

tableγ′ represents the sample that takes a feature in the field state subsequence as its label under a normal time window,

tableγ″ represents the sample that takes a feature in the field state subsequence as its label under an abnormal time window,

M′_tableγ is the mean squared error between tableγ and the samples tableγ′ under normal time windows,

M″_tableγ is the mean squared error between tableγ and the samples tableγ″ under abnormal time windows,

bias is the bias (offset) term,

v5 and v6 are the variation coefficients of the mean squared error of the normal time window and the abnormal time window, respectively, and v5=v6 during training;

f(tableγ) is the difference value calculated for a training time window with a feature in the field state subsequence as the label;

Step 8: Using equations (9)-(12), calculate the difference values between each normal time window and the other normal and abnormal time windows and save them as a set U₃; likewise, using equations (9)-(12), calculate the difference values between each abnormal time window and the other normal and abnormal time windows and save them as a set P₃. The confidence interval σ(γ) of the field state subsequence under the normal time window is obtained as

σ(γ)=[min(U3),max(U3)]∩[min(P3),max(P3)].
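The training steps for one subsequence can be sketched in Python as follows. Equations (1), (2), (5), (6), (9) and (10) are not reproduced in this text, so the sketch assumes that M′ and M″ are the mean squared error of a sample against each reference sample, averaged over the normal and abnormal windows respectively. That assumption, the function names, and the default bias of 0 are illustrative and are not taken from the source.

```python
def mse(x, y):
    """Mean squared error between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def mean_mse(sample, references):
    """Average MSE of a sample against all reference samples (assumed form of M' / M'')."""
    return sum(mse(sample, r) for r in references) / len(references)

def difference_value(sample, normal, abnormal, v_normal=0.5, bias=0.0):
    """f = v_normal * M' + v_abnormal * M'' + bias, with v_abnormal = 1 - v_normal."""
    v_abnormal = 1.0 - v_normal
    return (v_normal * mean_mse(sample, normal)
            + v_abnormal * mean_mse(sample, abnormal) + bias)

def confidence_interval(normal, abnormal):
    """[min(U), max(U)] intersected with [min(P), max(P)]; None if they do not overlap.

    Needs at least two normal and two abnormal samples, because each sample
    is compared against the *other* windows of its own class.
    """
    def others(pool, i):
        return pool[:i] + pool[i + 1:]

    # U: difference values of each normal window against the other windows.
    u = [difference_value(s, others(normal, i), abnormal) for i, s in enumerate(normal)]
    # P: difference values of each abnormal window against the other windows.
    p = [difference_value(s, normal, others(abnormal, i)) for i, s in enumerate(abnormal)]
    lo, hi = max(min(u), min(p)), min(max(u), max(p))
    return (lo, hi) if lo <= hi else None
```

With `v_normal=0.5`, the training condition v1=v2 together with v1=1-v2 is satisfied. The same functions would be called once per subsequence to obtain σ(α), σ(β), and σ(γ).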

The method for analyzing multi-source heterogeneous logs according to claim 2, wherein the method for analyzing logs comprises the following steps:

Step 1: When a time window under test is analyzed, randomly select one feature from each subsequence as a label, forming three samples that represent the tested time window;

Step 2: Set the initial values v1=v2, v3=v4, v5=v6, use formulas (1)-(12) to calculate the difference values f(tableα), f(tableβ), and f(tableγ) of the three samples under the tested time window, and check whether each lies within its corresponding confidence interval σ(α), σ(β), σ(γ);

Step 3: If the three samples are all within the confidence interval, then constrain v1, v2, v3, v4, v5, v6, and the constraint formula is

v1=a1*v1(0≤a1＜1)

v3=a2*v3(0≤a2＜1)

v5=a3*v5(0≤a3＜1)

v1, v3, and v5 are the variation coefficients of the mean squared error of the normal time window. Reducing v1, v3, and v5 reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window. Using the new v1, v2, v3, v4, v5, v6 and the three samples, recalculate through formulas (1)-(12) whether the difference values of the three samples under the tested time window are within the corresponding confidence intervals σ(α), σ(β), σ(γ), and determine the number of repeated constraints according to the required false negative rate;

Step 4: If the difference values of the three samples under the tested time window are still within the confidence intervals after the repeated constraints are complete, the information system is considered normal in the tested time window; otherwise it is considered abnormal.
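The detection steps above can be sketched as a short Python loop. The sketch takes the mean squared errors M′ and M″ of each of the three samples and the three trained confidence intervals as inputs. The function name, the argument layout, the default shrink factors, and the default number of rounds are assumptions for illustration; the source only requires 0≤a<1 and leaves the number of repetitions to the required false negative rate.

```python
def window_is_normal(errors, intervals, shrink=(0.5, 0.5, 0.5), rounds=3, bias=0.0):
    """Decide whether a tested time window is normal.

    errors:    three (M_normal, M_abnormal) pairs, one per subsequence
    intervals: three (low, high) confidence intervals, in the same order
    shrink:    hypothetical factors a1, a2, a3, each in [0, 1)
    rounds:    number of repeated constraints after the initial check
    """
    v = [0.5, 0.5, 0.5]  # v1, v3, v5; v2, v4, v6 follow as 1 - v
    for _ in range(rounds + 1):  # initial check plus `rounds` constrained checks
        for (m_normal, m_abnormal), (low, high), vi in zip(errors, intervals, v):
            f = vi * m_normal + (1.0 - vi) * m_abnormal + bias
            if not (low <= f <= high):
                return False  # any sample outside its interval: abnormal
        # Constrain the normal-window coefficients: v = a * v.
        v = [a * vi for a, vi in zip(shrink, v)]
    return True
```

For example, with an interval of (100, 140) and errors of (230, 20), the first check gives 125 and passes, but after one constraint the value drops to 72.5 and the window is reported abnormal. With `rounds=0` the same window would have been accepted, which shows how more repetitions trade detection effort for a lower false negative rate.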