CN111163057B - User identification system and method based on heterogeneous information network embedding algorithm

Info

Publication number: CN111163057B
Application number: CN201911246787.9A
Authority: CN
Inventors: 于爱民; 李梦; 蔡利君; 马建刚; 孟丹; 于海波
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2021-04-02
Anticipated expiration: 2039-12-09
Also published as: CN111163057A

Abstract

The invention relates to a user identification system and method based on heterogeneous information network embedding algorithms, comprising a data processing module, a joint embedding module, and an evaluation and analysis module. Following the idea of behavior analysis, the invention constructs a normal behavior model from multi-source heterogeneous user behavior data; when behavior data for a new time period arrives, user identification is performed by comparing the similarity between the current behavior and the normal behavior model. When identification is incorrect, the invention also provides a ranking of suspicious behaviors based on dot-product similarity. The invention can be applied to detect potential insider threats in an enterprise intranet. Combining two heterogeneous information network embedding algorithms yields a more comprehensive and accurate behavior model and improves user identification accuracy by about 10%. In addition, the invention provides event-level traceability clues for further analysis by security monitoring personnel.

Description

A user identification system and method based on heterogeneous information network embedding algorithm

Technical Field

The invention relates to a user identification system and method based on a heterogeneous information network embedding algorithm. It belongs to the technical field of information security detection and is intended for use in enterprise intranet environments.

Background Art

Today's most damaging security threats come not from external attackers or malware but from trusted insiders. Members of an organization are granted access rights according to their responsibilities, and effective identity authentication is an important defense against insider attacks. However, authentication mechanisms such as account passwords and fingerprint recognition are effective only at login, which leaves many security risks. Existing research therefore typically builds a model of a user's normal behavior through behavior analysis, so that user identity can be monitored continuously after login. Any form of insider attack shows some degree of behavioral deviation, so comparing the current behavior with the historical normal model makes it possible to identify the user and, in turn, to detect abnormal operations.

User identification based on behavior analysis falls into two categories: single-domain behavior analysis and multi-domain behavior analysis. Single-domain methods model normal behavior with a single type of behavior data, such as file behavior or email behavior. Their problems are that a single data source cannot describe a comprehensive normal behavior model and that simple machine learning classifiers are usually used, so the user identification rate is low. Multi-domain methods draw on the idea of multi-source data fusion and attempt to build a comprehensive behavior model by combining several behavior types. However, these methods either compare workflow similarity or use feature engineering to extract multi-source behavior features, and neither approach considers the associations among the different kinds of behavior data.

Against this background, the invention converts multi-source heterogeneous behavior data into a heterogeneous information network, which creates the conditions for analyzing behavior associations. A local embedding algorithm and a global embedding algorithm extract local features and global features respectively, so that a comprehensive behavior model can be built and the association information between behaviors can be captured. When the model misidentifies a user, a ranking of suspicious behaviors can additionally be produced by similarity calculation for security personnel to analyze and trace.

Summary of the Invention

Problem solved by the invention: existing user identification methods based on behavior analysis have difficulty modeling the associations among multi-source heterogeneous behavior data. The invention provides a user identification system and method based on heterogeneous information network embedding algorithms that can build a more comprehensive behavior model, greatly improve user identification accuracy, and analyze suspicious cases to provide an event-level ranking of suspicious operations.

Technical solution of the invention: a user identification system based on heterogeneous information network embedding algorithms, characterized in that: the heterogeneous information network embedding algorithms are a local embedding algorithm implemented with a neural network and a global embedding algorithm implemented with meta-paths; user identification means identifying the potential operating user from the multi-source heterogeneous audit log data collected on each host in the enterprise intranet; and the system comprises a data processing module, a joint embedding module, and an evaluation and analysis module, wherein:

Data processing module: this module has two functions. The first is to extract standardized audit log data from the historical behavior database; these log data serve as the training set used to construct the heterogeneous information network G. The second is to preprocess the raw multi-source heterogeneous audit log data newly collected from intranet hosts. Both the standardized audit log data in the historical behavior database and the newly collected raw audit log data contain five types of multi-source heterogeneous audit log data: login log data, file log data, email log data, HTTP log data, and device log data, which record the user's login behavior, file behavior, email behavior, web behavior, and external device connection behavior respectively. Preprocessing the raw multi-source heterogeneous audit log data means standardizing each log entry: a log parser extracts key information according to predefined fields, which comprise four parts, namely subject, device, object, and timestamp. The subject is the user identifier and the device is the host identifier; the object is determined by the log data type and identifies the specific behavior for that type (in file log data, the object is the combination of file path and file name); the timestamp is the time at which the log entry occurred. The parsed newly collected log data are used as the test set. The heterogeneous information network G treats the information extracted from the standardized log data according to the predefined fields as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as neighbor nodes of the host identifier; the constructed heterogeneous information network follows the network schema shown in Figure 2;

Joint embedding module: takes the heterogeneous information network G constructed in the data processing module as input and trains a model that reflects the operating pattern of each host, called the user predictor. The user predictor performs user prediction on the test set and ultimately produces a ranking of the potential operating users for the log data in the test set. Training the user predictor means learning the vector representations of the nodes in G together with the model parameters. So that the learned node vectors retain both network structure information and inter-node similarity information, the joint embedding module uses two heterogeneous information network embedding algorithms, called the local embedding algorithm and the global embedding algorithm. The local embedding algorithm learns the interactions between each host and its neighbor nodes, embedding normal behavior pattern information; the global embedding algorithm uses the semantics defined by meta-paths to embed the association information between nodes of different types; finally, a joint objective function combines the two embedding algorithms for iterative training;

Evaluation and analysis module: evaluates the prediction results obtained in the joint embedding module and determines whether the real operating user of the host is consistent with the prediction. The joint embedding module yields the model's prediction result A for the log data in the test set. The result is a sequence in which the rank order reflects the probability that the behavior in the test set belongs to each user. If the real operating user corresponding to the behavior in the test set appears among the top K entries of the predicted sequence, the identification is considered correct; otherwise, the user behavior in the test set has deviated from the normal behavior pattern in the training set, which is called a suspicious case. For such suspicious cases, similarity-based anomaly analysis produces a ranking of the suspicious behaviors that caused the identification error, so that security analysts or other responsible staff can trace and verify the source using the clues the system provides.

In the data processing module, the heterogeneous information network is constructed as follows: based on the standardized log data in the historical database, the extracted host, user, and behavior identifiers are used as nodes to construct the heterogeneous information network G, with the host identifier as the central node and the user identifiers and behavior identifiers as its neighbor nodes.
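As a rough illustration of this construction, the sketch below builds the host-centered neighbor sets from records that have already been standardized. The tuple layout, the function name `build_hin`, and the sample identifiers are assumptions made for illustration; the text does not prescribe a data structure.

```python
from collections import defaultdict


def build_hin(records):
    """Build host-centered neighbor sets from standardized log records.

    records: iterable of (host_id, user_id, behavior_type, behavior_id)
    tuples; this layout is a hypothetical stand-in for the parsed fields.
    Returns two mappings: host -> user nodes, and
    host -> behavior type -> behavior nodes.
    """
    users = defaultdict(set)
    behaviors = defaultdict(lambda: defaultdict(set))
    for host_id, user_id, behavior_type, behavior_id in records:
        users[host_id].add(user_id)                          # user neighbor
        behaviors[host_id][behavior_type].add(behavior_id)   # behavior neighbor
    return users, behaviors


# Hypothetical records: two hosts that both touch the same email entity.
sample_records = [
    ("PC-1", "alice", "email", "bob@example.com"),
    ("PC-1", "alice", "file", "/docs/plan.txt"),
    ("PC-2", "carol", "email", "bob@example.com"),
]
hin_users, hin_behaviors = build_hin(sample_records)
```

In this toy input the email entity ends up in the neighbor sets of both hosts, which is the sharing of behavior nodes across hosts that lets a graph capture associations between them.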

In the joint embedding module, the local embedding algorithm is implemented with a neural network. The specific process is as follows:

(1) First, map all nodes in the heterogeneous information network G into a latent space; that is, randomly initialize the vector representations of all nodes to form the embedding vector table V;

(2) Given a host p, the host vector V_p is obtained by aggregation in two steps. In the first step, the node type vector v_p^t is computed for each type of behavior-identifier neighbor node of host p by averaging the vectors v_n of all behavior-identifier neighbor nodes of that type:

$$v_p^t = \frac{1}{|E_p^t|} \sum_{n \in E_p^t} v_n$$

where E_p^t denotes the set of behavior-identifier neighbor nodes of the t-th type contained in host p;

In the second step, the host vector V_p is obtained as the weighted combination of the node type vectors v_p^t:

$$V_p = \sum_{t} w_t \, v_p^t$$

where w_t is the weight of the t-th node type vector. The invention has five types of behavior-identifier neighbor nodes, so t ranges from 1 to 5, denoting the login, file, email, HTTP, and device node types respectively;

(3) Based on the host vector V_p, compute the dot-product similarity f(p, u) = V_p · v_u between the host and each user, where v_u is the user vector, and rank the potential operating users by this score;

(4) Use stochastic gradient descent (SGD) to update the embedding vector table V and to learn the weight w_t of each node type vector, with a max-margin objective as the loss function, defined as:

max(0, f(p, u′) − f(p, u) + ε)

where u is the real operating user of host p (the positive sample), u′ is a negative sample, and ε is the margin. A loss penalty is incurred if f(p, u) exceeds f(p, u′) by less than ε.
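The forward pass of steps (2) to (4) can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: the function names, the dictionary-based data layout, and the toy vectors are hypothetical, and the SGD update itself is omitted.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def host_vector(neighbors_by_type, embeddings, type_weights):
    """Aggregate a host vector from its behavior neighbors.

    neighbors_by_type: node type -> list of neighbor node ids (assumed layout)
    embeddings: node id -> vector (the embedding table V)
    type_weights: node type -> weight w_t
    """
    dim = len(next(iter(embeddings.values())))
    result = [0.0] * dim
    for node_type, nodes in neighbors_by_type.items():
        if not nodes:
            continue  # a host may have no neighbors of some type
        # First step: average the neighbor vectors of this type.
        type_vec = [sum(embeddings[n][d] for n in nodes) / len(nodes)
                    for d in range(dim)]
        # Second step: add the type vector scaled by its weight w_t.
        for d in range(dim):
            result[d] += type_weights[node_type] * type_vec[d]
    return result


def max_margin_loss(host_vec, pos_user_vec, neg_user_vec, margin):
    """max(0, f(p, u') - f(p, u) + margin) with f the dot product."""
    return max(0.0, dot(host_vec, neg_user_vec)
               - dot(host_vec, pos_user_vec) + margin)


# Toy two-dimensional example with hypothetical node ids.
toy_embeddings = {"f1": [1.0, 0.0], "f2": [3.0, 0.0], "m1": [0.0, 2.0]}
toy_neighbors = {"file": ["f1", "f2"], "email": ["m1"]}
toy_weights = {"file": 0.5, "email": 1.0}
toy_host = host_vector(toy_neighbors, toy_embeddings, toy_weights)
```

The loss is zero once the real user outscores the negative sample by at least the margin, so training pressure comes only from hosts whose real user is not yet clearly separated from other users.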

In the specific implementation of the joint embedding module, the global embedding algorithm is implemented with meta-paths. The process is as follows:

(1) A meta-path defines a higher-order semantic association between nodes of different types, that is, association information that the edges of the original network cannot capture. Given a meta-path set R, the meta-path-based global embedding algorithm first models the conditional neighbor distribution of a node. In the heterogeneous information network G, several meta-paths may start from node i, so the neighbor distribution of a node depends both on node i and on the given meta-path r. The conditional neighbor distribution function is defined as follows:

where v_i and v_j are the vector representations of nodes i and j, and DST(r) is the set of all possible nodes on the target side of meta-path r;

(2) The set DST(r) of all possible nodes on the target side of meta-path r contains a very large number of nodes. To reduce the computational burden, a negative sampling strategy yields an approximate solution given by the following formula, whose left-hand side denotes the approximation to the previous formula;

where j′ is a negative node sampled from the noise distribution predefined for meta-path r, k negative nodes are sampled for each node i, and the bias term b_r adjusts for the density of the different meta-paths;

(3) Use stochastic gradient descent (SGD) to learn the embedding vector table V and the parameter b_r, with the goal of maximizing the likelihood function.
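The two formulas referred to in steps (1) and (2) are not reproduced in this text. One form consistent with the surrounding definitions, assuming a softmax over dot products for the conditional distribution and the usual negative-sampling surrogate, would be the following. The noise-distribution symbol P_n^r, the sigmoid σ, and the exact placement of b_r are assumptions, not the patent's own notation.

```latex
P(j \mid i; r) =
  \frac{\exp\left(v_i^{\top} v_j\right)}
       {\sum_{j' \in \mathrm{DST}(r)} \exp\left(v_i^{\top} v_{j'}\right)}

\log \hat{P}(j \mid i; r) \approx
  \log \sigma\left(v_i^{\top} v_j + b_r\right)
  + \sum_{l=1}^{k} \mathbb{E}_{j' \sim P_n^{r}}
    \left[ \log \sigma\left(-v_i^{\top} v_{j'} - b_r\right) \right]
```

Under this reading, the sum over all of DST(r) in the denominator is replaced by k sampled negative nodes per node i, which is what makes each SGD step inexpensive.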

In the specific implementation of the joint embedding module, the joint objective function serves to combine the local features captured by the local embedding algorithm with the global features captured by the global embedding algorithm. It is defined as follows:

where ω ∈ [0, 1] is a predefined parameter tuned to balance the importance of the two models, and a regularization term is added to prevent overfitting; Z_united is the objective function of the joint embedding model, Z_global is the objective function of the global embedding model, Z_local is the objective function of the local embedding model, and λ is the regularization parameter;

Iterative training with the joint objective function proceeds as follows:

(1) Sample one of the local embedding algorithm and the global embedding algorithm according to a Bernoulli distribution with parameter ω;

(2) If the local embedding algorithm is sampled, train the embedding vector table V and learn the weight w_t of each node type vector following the steps of the local embedding algorithm; likewise, if the global embedding algorithm is sampled, train the embedding vector table V and learn the parameter b_r following the steps of the global embedding algorithm. The embedding vector table V is shared by the two embedding algorithms;

(3) Repeat steps (1) and (2) until the model converges, yielding the user predictor.
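The alternating schedule in steps (1) to (3) can be sketched as follows. The callables `local_step`, `global_step`, and `has_converged` are hypothetical placeholders for the two update routines and a convergence check. The text does not say which algorithm is selected with probability ω, so the sketch assumes it is the local one.

```python
import random


def train_jointly(local_step, global_step, has_converged,
                  omega, max_iters=1000, seed=0):
    """Alternate between two update routines that share one embedding table.

    local_step / global_step: callables performing one update each
    (hypothetical; they would modify the shared table V in place).
    has_converged: callable returning True when training should stop.
    Assumption: the local algorithm is chosen with probability omega.
    Returns the list of choices made, for inspection.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(max_iters):
        if rng.random() < omega:      # Bernoulli(omega) draw
            local_step()
            schedule.append("local")
        else:
            global_step()
            schedule.append("global")
        if has_converged():
            break
    return schedule
```

Because both routines update the same table V, every node vector receives gradient signal from the host-level loss and from the meta-path likelihood over the course of training.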

In the evaluation and analysis module, suspicious cases are analyzed as follows:

A suspicious case arises when the user behavior in the test set deviates from the normal behavior pattern in the training set. For each suspicious case, the evaluation and analysis module computes, in turn, the dot product between each behavior-identifier neighbor node of the host and the node of the host's real operating user as an anomaly reference. The lower the dot-product score, the lower the similarity between the two entities and the higher the anomaly risk. The suspicious behaviors are finally ranked from highest to lowest anomaly risk:

where L_p is the resulting sequence of suspicious behaviors, E_p is the set of behavior-identifier neighbor nodes of host p, v_i is the vector representation of node i, and u_p is the vector representation of the real operating user of host p.
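The ranking described here amounts to sorting the behavior nodes in E_p by their dot product with u_p in ascending order. A minimal sketch, with a hypothetical function name and toy vectors:

```python
def rank_suspicious(behavior_vectors, user_vector):
    """Return behavior ids ordered from most to least suspicious.

    behavior_vectors: behavior id -> vector, standing in for the set E_p
    user_vector: vector of the host's real operating user (u_p)
    A lower dot product means lower similarity, hence higher anomaly risk,
    so the ids are sorted by ascending score.
    """
    scores = {
        behavior_id: sum(x * y for x, y in zip(vec, user_vector))
        for behavior_id, vec in behavior_vectors.items()
    }
    return sorted(scores, key=scores.get)


# Toy example with hypothetical behavior ids.
toy_behaviors = {
    "visit:intranet-wiki": [0.9, 0.0],
    "device:usb-unknown": [-0.5, 0.0],
    "file:/tmp/archive.zip": [0.1, 1.0],
}
toy_ranking = rank_suspicious(toy_behaviors, [1.0, 0.0])
```

In this toy case the unknown USB device scores lowest against the user vector and is therefore listed first.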

The meta-path set R is determined through a meta-path selection process, as follows:

(1) Compute the identification accuracy achieved after adding each meta-path individually and rank the meta-paths accordingly, which gives the effect of each meta-path on identification when used alone;

(2) Add the meta-paths step by step in the resulting order and, according to the change in identification accuracy, greedily select the combination that achieves the highest user identification accuracy as the optimal meta-path set R.
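One reading of this two-step procedure is sketched below: rank the candidates by their stand-alone accuracy, then walk the ranking and keep a meta-path only if it raises accuracy. The `evaluate` callable, which would train and score a model for a given set of meta-paths, is hypothetical, and the rule of discarding non-improving paths is an assumption about how the greedy choice is made.

```python
def select_meta_paths(candidates, evaluate):
    """Greedy forward selection of meta-paths.

    candidates: iterable of meta-path labels
    evaluate: callable mapping a tuple of meta-paths to an accuracy value
    Returns (selected meta-paths, best accuracy reached).
    """
    # Step (1): score each meta-path on its own, best first.
    ranked = sorted(candidates, key=lambda path: evaluate((path,)),
                    reverse=True)
    # Step (2): add paths in that order, keeping one only if it helps.
    selected = []
    best_accuracy = float("-inf")
    for path in ranked:
        accuracy = evaluate(tuple(selected) + (path,))
        if accuracy > best_accuracy:
            selected.append(path)
            best_accuracy = accuracy
    return selected, best_accuracy


def toy_evaluate(paths):
    """Hypothetical accuracy (in percent) for meta-paths labeled A, B, C."""
    return 50 + 10 * ("A" in paths) + 5 * ("B" in paths) - 2 * ("C" in paths)


toy_selected, toy_best = select_meta_paths(["C", "B", "A"], toy_evaluate)
```

Each call to `evaluate` implies a full training run, so the procedure costs on the order of twice the number of candidate meta-paths in training runs rather than one run per subset.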

A user identification method of the invention, based on heterogeneous information network embedding algorithms, comprises the following steps:

Step (1), data processing: collect the audit log data of a host in the intranet over a time interval. The audit log types include login logs, file logs, email logs, HTTP logs, and device logs. A log parser parses the logs of each type entry by entry and extracts the predefined key fields, which are subject, object, device, and timestamp. For a file log entry, the extracted subject is the user account, the object is the combination of file path and file name, the device is the host number, and the timestamp is the access time recorded in the log. The parsed log data are used as the test set. In addition, the standardized log data within a time window of the historical behavior database are used as the training set for constructing the heterogeneous information network G;

Step (2), heterogeneous information network construction: construct the heterogeneous information network G from the training set. The information extracted according to the predefined fields from the standardized log data of the historical behavior database is treated as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as the host's neighbor nodes. For each host p, all behavior-identifier neighbor nodes related to it form the set E_p, and its real operating user is denoted u_p. Each individual behavior identifier can be associated with several hosts: if two hosts p and q both have log records involving an email entity e, then e is a neighbor node of both p and q;

Step (3), joint embedding: once the heterogeneous information network G has been obtained, the vector representation of each node is learned iteratively. First, the embedding vector table V is randomly initialized. Then one of the local embedding algorithm and the global embedding algorithm is sampled according to the parameter ω of the joint objective function. If the local embedding algorithm is sampled, the embedding vector table V is trained and the weight w_t of each node type vector is learned following the steps of the local embedding algorithm; if the global embedding algorithm is sampled, the embedding vector table V is trained and the parameter b_r is learned following the steps of the global embedding algorithm. This iterative training process is repeated until the model converges, and the resulting trained model is called the user predictor;

Step (4), user prediction: the user predictor contains the trained node embedding vector table V and the respective parameters of the local and global embedding algorithms. It then performs the user prediction task on the test set: given a host p to be predicted, it predicts which operating user the log data on host p belong to. The prediction result is a sequence in which the rank order reflects the probability that the log data in the test set belong to each user; the ranking is based on the dot-product similarity score between the host vector and the user vector;

Step (5), evaluation and analysis: for the prediction result obtained in step (4), if the real operating user corresponding to the behavior in the test set appears among the top K entries of the predicted sequence, the identification is considered correct. Otherwise, the user behavior in the test set has deviated from the normal behavior pattern in the training set, which is called a suspicious case. For such suspicious cases, similarity-based anomaly analysis produces a ranking of the suspicious behaviors that caused the identification error, so that security analysts or other responsible staff can trace and verify the source using the clues the system provides.
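Steps (4) and (5) reduce to ranking users by dot-product score and checking whether the real user falls within the top K. A minimal sketch, with hypothetical function names and toy vectors:

```python
def rank_users(host_vec, user_vectors):
    """Return user ids sorted by dot-product score with the host, best first.

    host_vec: aggregated host vector V_p
    user_vectors: user id -> user vector v_u
    """
    scores = {
        user_id: sum(x * y for x, y in zip(host_vec, vec))
        for user_id, vec in user_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def is_identified(ranked_users, true_user, k):
    """True if the real operating user appears among the top k predictions."""
    return true_user in ranked_users[:k]


# Toy example with hypothetical users.
toy_users = {
    "alice": [1.0, 1.0],
    "bob": [0.0, 1.0],
    "carol": [-1.0, 0.0],
}
toy_ranked = rank_users([1.0, 2.0], toy_users)
```

If `is_identified` returns False for a host, that host's period of activity is the suspicious case handed on to the anomaly analysis.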

Compared with the prior art, the invention has the following advantages:

(1) The key to defending against insider attacks is user privilege management, and an effective way to manage user privileges is to monitor user identity continuously through behavior analysis. Traditional user identification methods do not make full use of multi-source heterogeneous behavior data and have difficulty modeling the complex associations among the data. The invention cleverly uses a heterogeneous information network to represent structured audit log data as a graph, which creates the conditions for analyzing data associations;

(2) The invention combines two heterogeneous information network embedding algorithms to learn node vector representations automatically. This is an innovative attempt to apply heterogeneous information network embedding to the security field, and it removes the dependence of traditional methods on manual expert knowledge for feature extraction. The two embedding algorithms focus on local behavior pattern features and on network-wide association features respectively, which allows user behavior patterns to be characterized comprehensively and greatly improves user identification accuracy;

(3) For suspicious cases with wrong predictions, the invention can also rank potential abnormal operations by the similarity between entities, providing event-level clues about suspicious behavior. Security analysts can trace and verify the source on the basis of these event-level clues;

(4) In summary, the invention proposes a user identification system based on heterogeneous information network embedding algorithms. Its core advantages are that it can model comprehensive user behavior features, improve user identification accuracy, and provide fine-grained anomaly analysis.

Brief Description of the Drawings

Figure 1 is an implementation block diagram of the system of the invention;

Figure 2 shows the network schema of the heterogeneous information network in the invention;

Figure 3 shows the framework of the local embedding algorithm in the invention.

Detailed Description

The invention is described in detail below with reference to the drawings and embodiments.

The invention mainly addresses how to identify potential operating users from multi-source heterogeneous host audit logs and how to provide instructive anomaly analysis for suspicious cases in which identification fails.

As shown in Figure 1, the system of the invention comprises a data processing module, a joint embedding module, and an evaluation and analysis module. The data processing module parses the raw multi-source heterogeneous host audit log data, retains the predefined key fields, and uses standardized historical log data to construct the heterogeneous information network. The joint embedding module uses two heterogeneous information network embedding algorithms, called the local embedding algorithm and the global embedding algorithm, to learn the operating pattern of each individual host and to capture network-wide associations respectively; a joint function combines the two for iterative training to obtain the user predictor. The evaluation and analysis module evaluates the user identification results and, for suspicious cases in which identification fails, provides a ranking of suspicious behaviors through similarity-based anomaly analysis.

The data processing module is implemented as follows:

(1) Data processing: collect the audit log data of a host in the intranet over a time interval. The audit log types include login logs, file logs, email logs, HTTP logs, and device logs. A log parser parses the logs of each type entry by entry and extracts the predefined key fields, which are subject, object, device, and timestamp. For a file log entry, the extracted subject is the user account, the object is the combination of file path and file name, the device is the host number, and the timestamp is the access time recorded in the log. The parsed log data are used as the test set. In addition, the standardized log data within a time window of the historical behavior database are used as the training set for constructing the heterogeneous information network G in the next step.

(2) Heterogeneous information network construction: construct the heterogeneous information network G from the training set. The information extracted according to the predefined fields from the standardized log data of the historical behavior database is treated as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as the host's neighbor nodes. The constructed heterogeneous information network follows the network schema shown in Figure 2. This schema involves seven node types: PC, user, login, file, email, HTTP, and device, where PC is the super node connecting the other six node types; the edge types include "PC accesses file", "PC sends email", and so on. For each host p, all behavior-identifier neighbor nodes related to it form the set E_p, and its real operating user is denoted u_p. Each individual behavior identifier can be associated with several hosts: if two hosts p and q both have log records involving an email entity e, then e is a neighbor node of both p and q;
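The parsing in step (1) can be illustrated for the file log case as below. The comma-separated input layout, the class and function names, and the sample values are assumptions for illustration; real audit logs would need a parser per log type and per source format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NormalizedRecord:
    """One standardized log entry with the four predefined fields."""
    log_type: str        # login, file, email, http, or device
    subject: str         # user identifier
    device: str          # host identifier
    obj: str             # behavior identifier; meaning depends on log_type
    timestamp: datetime  # time the log entry occurred


def parse_file_log(line):
    """Parse a hypothetical file-log line: time,user,host,path,filename."""
    ts, user, host, path, filename = [part.strip() for part in line.split(",")]
    return NormalizedRecord(
        log_type="file",
        subject=user,
        device=host,
        # For file logs the object combines file path and file name.
        obj=f"{path.rstrip('/')}/{filename}",
        timestamp=datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
    )


sample_record = parse_file_log(
    "2019-12-09 08:30:00, U1001, PC-17, /srv/docs/, report.docx"
)
```

Each such record supplies one host node, one user node, and one behavior node, which is the input needed to construct the network in step (2).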

The joint embedding module is implemented as follows:

Traditional methods usually rely on feature engineering to extract high-dimensional features by hand, which requires expert knowledge. The present invention instead automatically learns feature vectors that carry rich structural and semantic associations to represent users and entities, and then uses the trained user predictor to predict the operating user of the log data in the test set. The steps are as follows:

(1) Initialization: randomly initialize the embedding vector table V, which holds the vector representations of all nodes in the constructed heterogeneous information network G;

(2) Iterative training: in each iteration, sample either the local embedding algorithm or the global embedding algorithm according to the parameter ω of the joint objective function. If the local embedding algorithm is sampled, train according to the steps of the local embedding model and update the embedding vector table V and the weight w_t of each node type vector; if the global embedding algorithm is sampled, train according to the steps of the global embedding model and update the embedding vector table V and the parameter b_r. Repeat until the model converges; the trained model is called the user predictor;

(3) User prediction: feed the test set into the trained user predictor, which predicts the potential operating users from the standardized log data in the test set. The prediction result is a ranked sequence in which the position of a user reflects the probability that the log data in the test set belong to that user; the ranking is based on the dot-product similarity score between the host vector and the user vector;
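The alternating schedule of step (2) can be sketched as below. This is only an illustration under stated assumptions: the function name `train_jointly` is hypothetical, a draw below ω is assumed to select the local algorithm (the text does not say which branch ω favors), and convergence is approximated by each algorithm's loss changing by less than a tolerance. The two step functions are toy stand-ins for one SGD update on the shared embedding table.

```python
import random


def train_jointly(steps, omega, max_iters=10_000, tol=1e-6, seed=0):
    """Pick one of two update routines per iteration until both losses settle.

    steps: {"local": callable, "global": callable}; each callable performs one
    update and returns its current loss.
    """
    rng = random.Random(seed)
    last = {"local": None, "global": None}
    stable = {"local": False, "global": False}
    counts = {"local": 0, "global": 0}
    for _ in range(max_iters):
        # Assumption: probability omega for the local algorithm.
        name = "local" if rng.random() < omega else "global"
        loss = steps[name]()
        counts[name] += 1
        stable[name] = last[name] is not None and abs(last[name] - loss) < tol
        last[name] = loss
        if stable["local"] and stable["global"]:
            break
    return counts


# Toy stand-ins: each "update" simply halves a stored loss value.
state = {"local": 1.0, "global": 1.0}


def make_step(name):
    def step():
        state[name] *= 0.5
        return state[name]
    return step


counts = train_jointly(
    {"local": make_step("local"), "global": make_step("global")}, omega=0.5
)
print(counts)
```

With ω equal to 0 or 1 only one routine is ever drawn, so this sketch would run until `max_iters`; a practical implementation would need its own stopping rule for that case.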

As shown in Figure 3, the local embedding algorithm proceeds as follows:

(1) Given a host p, the host vector V_p is obtained by aggregation in two steps. In the first step, compute the node type vector of each type of behavior-identifier neighbor node of host p by averaging the vectors v_n of all behavior-identifier neighbor nodes of that type:

$$v_p^{t} = \frac{1}{\lvert E_p^{t} \rvert} \sum_{n \in E_p^{t}} v_n$$

where E_p^t denotes the set of behavior-identifier neighbor nodes of type t of host p;

In the second step, compute the weighted combination of the node type vectors to obtain the host vector V_p:

$$V_p = \sum_{t=1}^{5} w_t \, v_p^{t}$$

where w_t is the weight of the node type vector of type t. In the present invention there are five behavior-identifier neighbor node types, so t ranges from 1 to 5, corresponding to the login, file, mail, HTTP and device node types;

(2) Based on the host vector V_p, compute the dot-product similarity between the host and each user, and rank the potential operating users accordingly:

$$f(p, u) = V_p^{\top} v_u$$

where v_u is the user vector;

(3) Use stochastic gradient descent (SGD) to update the embedding vector table V and to learn the weight w_t of each node type vector, with the max-margin objective as the loss function:

$$\max\bigl(0,\ f(p, u') - f(p, u) + \varepsilon\bigr)$$

where u is the real operating user of host p (the positive sample), u' is a negative sample, and ε is the margin: a loss is incurred whenever f(p, u) exceeds f(p, u') by less than ε.
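The forward computation of the local embedding algorithm (type-wise averaging, weighted combination, dot-product ranking and the max-margin loss) can be sketched in plain Python. The function names and the toy vectors are illustrative assumptions; the SGD update itself is omitted.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def host_vector(neighbors_by_type, emb, weights):
    """Average neighbor vectors per type, then combine the averages with w_t."""
    dim = len(next(iter(emb.values())))
    vp = [0.0] * dim
    for t, nodes in neighbors_by_type.items():
        if not nodes:
            continue  # a host may have no neighbors of some type
        type_vec = mean([emb[n] for n in nodes])
        vp = [acc + weights[t] * x for acc, x in zip(vp, type_vec)]
    return vp


def rank_users(vp, user_emb):
    """Order users by descending dot-product similarity f(p, u)."""
    return sorted(user_emb, key=lambda u: dot(vp, user_emb[u]), reverse=True)


def max_margin_loss(vp, v_pos, v_neg, eps=1.0):
    """max(0, f(p, u') - f(p, u) + eps) for positive u and negative u'."""
    return max(0.0, dot(vp, v_neg) - dot(vp, v_pos) + eps)


# Toy two-dimensional embeddings (made up for the example).
emb = {"f1": [1.0, 0.0], "f2": [3.0, 0.0], "m1": [0.0, 2.0]}
neighbors_by_type = {"file": ["f1", "f2"], "mail": ["m1"]}
weights = {"file": 0.5, "mail": 1.0}
user_emb = {"alice": [1.0, 1.0], "bob": [1.0, -1.0]}

vp = host_vector(neighbors_by_type, emb, weights)
ranking = rank_users(vp, user_emb)
loss_ok = max_margin_loss(vp, user_emb["alice"], user_emb["bob"])
loss_bad = max_margin_loss(vp, user_emb["bob"], user_emb["alice"])
print(vp, ranking, loss_ok, loss_bad)
```

In this toy case the loss is zero when the true user already outscores the negative sample by at least the margin, and positive otherwise, which is what drives the SGD update.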

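The global embedding algorithm proceeds as follows:

Given a set of meta-paths R, the algorithm models the conditional neighbor distribution of each node i under each meta-path r and approximates it with a negative sampling strategy: k negative nodes are sampled for each node i, and the bias term b_r adjusts for the different densities of different meta-paths. The embedding vector table V and the bias term b_r are learned with stochastic gradient descent (SGD);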
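The negative-sampling objective for one node pair under one meta-path can be sketched as follows. The sketch assumes the common skip-gram-style form log σ(v_i·v_j + b_r) + Σ log σ(−v_i·v_j' − b_r) over k sampled negatives, and it assumes uniform sampling with replacement from the target-side nodes; neither the noise distribution nor the function names come from the patent text.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sample_negatives(candidates, emb, k, rng):
    """Draw k negative node vectors (assumption: uniform, with replacement)."""
    return [emb[rng.choice(candidates)] for _ in range(k)]


def neg_sampling_objective(v_i, v_j, negatives, b_r):
    """Approximate log-likelihood of neighbor j for node i under meta-path r."""
    positive = log_sigmoid(dot(v_i, v_j) + b_r)
    negative = sum(log_sigmoid(-dot(v_i, v_n) - b_r) for v_n in negatives)
    return positive + negative


v_i = [1.0, 0.0]
# An aligned neighbor with an opposed negative scores higher than the reverse.
good = neg_sampling_objective(v_i, [1.0, 0.0], [[-1.0, 0.0]], b_r=0.0)
bad = neg_sampling_objective(v_i, [-1.0, 0.0], [[1.0, 0.0]], b_r=0.0)

rng = random.Random(0)
emb = {"a": [0.5, 0.5], "b": [-0.5, 0.2], "c": [0.1, -0.9]}
negs = sample_negatives(list(emb), emb, k=3, rng=rng)
sampled = neg_sampling_objective(v_i, emb["a"], negs, b_r=0.1)
print(good, bad, sampled)
```

Maximizing this quantity with SGD pulls true meta-path neighbors together and pushes sampled negatives apart, while a single scalar b_r per meta-path absorbs differences in how densely each meta-path connects nodes.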

The evaluation and analysis module is implemented as follows:

(1) Evaluation: evaluate the prediction result obtained in the joint embedding module and determine whether the real operating user of the host is consistent with it. The joint embedding module outputs, for the log data in the test set, a prediction result A, which is a ranked sequence in which the position of a user reflects the probability that the behavior in the test set belongs to that user. If the real operating user corresponding to the behavior in the test set appears among the top K users of the predicted sequence, the identification is considered correct; otherwise the case is called a "suspicious situation".

(2) Analysis: a "suspicious situation" is attributed to the user behavior in the test set deviating from the normal behavior pattern in the training set. For each suspicious situation, the evaluation and analysis module computes, in turn, the dot product between each behavior-identifier neighbor node of the host and the real operating user of the host as an anomaly reference. The lower the dot-product score, the lower the similarity between the two entities and the higher the anomaly risk. The suspicious behaviors are finally sorted from highest to lowest anomaly risk:

$$L_p = \operatorname{sort}_{i \in E_p}\bigl(v_i^{\top} u_p\bigr)$$

where L_p is the resulting suspicious behavior sequence, E_p is the set of behavior-identifier neighbor nodes of host p, v_i is the vector of node i, u_p is the vector of the real operating user of host p, and the nodes are listed in ascending order of dot product.
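The evaluation and analysis steps can be sketched as a top-K membership check followed by an ascending sort on dot-product scores. The function names, node labels and vectors below are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def is_identified(ranked_users, true_user, k):
    """Identification is correct if the true user is among the top K predictions."""
    return true_user in ranked_users[:k]


def rank_suspicious(behavior_nodes, emb, u_p):
    """Sort behavior nodes by ascending similarity to the true user's vector,
    so the least similar (highest-risk) behaviors come first."""
    return sorted(behavior_nodes, key=lambda n: dot(emb[n], u_p))


ranked_users = ["bob", "carol", "alice"]  # hypothetical predictor output
true_user = "alice"
correct = is_identified(ranked_users, true_user, k=2)

emb = {
    "login@03:00": [-1.0, 0.5],
    "report.docx": [1.0, 1.0],
    "usb-7": [0.0, -0.2],
}
u_p = [1.0, 1.0]  # vector of the real operating user of host p

suspicious = [] if correct else rank_suspicious(list(emb), emb, u_p)
print(correct, suspicious)
```

The sorted list gives analysts an event-level starting point: the first entries are the behaviors least consistent with the user's learned profile.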

Claims

1. A user identification system based on a heterogeneous information network embedding algorithm, characterized in that: the heterogeneous information network embedding algorithm refers to a local embedding algorithm implemented with a neural network and a global embedding algorithm implemented with meta-paths; user identification refers to identifying potential operating users from the multi-source heterogeneous audit log data collected by each host in the enterprise intranet. The system comprises a data processing module, a joint embedding module and an evaluation and analysis module, wherein:

Data processing module: this module has two functions. The first is to extract standardized audit log data from the historical behavior database; these log data are used as the training set to build the heterogeneous information network G. The second is to preprocess the raw multi-source heterogeneous audit log data newly collected on the network host. Both the standardized audit log data in the historical behavior database and the newly collected raw audit log data comprise five types of log data: login, file, mail, HTTP and device log data, which record the user's login behavior, file behavior, email behavior, web behavior and external device connection behavior, respectively. Preprocessing the raw multi-source heterogeneous audit log data means standardizing each log record by using a log parser to extract key information according to predefined fields. The predefined fields comprise four parts: subject, device, object and timestamp. The subject is the user identifier; the device is the host identifier; the object depends on the log data type and identifies the specific behavior within that type (for file logs, the object is the combination of file path and file name); the timestamp is the time at which the logged event occurred. The newly collected log data, once parsed, are used as the test set. The heterogeneous information network G treats the information extracted from the predefined fields of the standardized log data as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as neighbor nodes of the host identifier;

Joint embedding module: taking the heterogeneous information network G constructed in the data processing module as input, train a model that reflects the operation pattern of each host, called the user predictor. The user predictor performs user prediction on the test set and outputs a ranking of the potential operating users for the log data in the test set. Training the user predictor means learning the vector representations of the nodes in the heterogeneous information network G and the parameters of the model. So that the learned node vectors retain both the network structure information and the association information between nodes, the joint embedding module adopts two heterogeneous information network embedding algorithms, called the local embedding algorithm and the global embedding algorithm. The local embedding algorithm learns the interaction between each host and its neighbor nodes and embeds the normal behavior pattern information; the global embedding algorithm uses the semantics defined by meta-paths to embed the association information between different types of nodes; finally, the two embedding algorithms are trained iteratively together through a joint objective function;

Evaluation and analysis module: evaluate the prediction result obtained in the joint embedding module and determine whether the real operating user of the host is consistent with it. The joint embedding module outputs, for the log data in the test set, a prediction result A, which is a ranked sequence in which the position of a user reflects the probability that the behavior in the test set belongs to that user. If the real operating user corresponding to the behavior in the test set appears among the top K users of the predicted sequence, the identification is considered correct; otherwise the user behavior in the test set is taken to deviate from the normal behavior pattern in the training set, which is called a suspicious situation. For such suspicious situations, similarity-based anomaly analysis yields a ranked sequence of the suspicious behaviors that caused the identification error, so that security analysts or other relevant staff can carry out traceability verification based on the clues given by the system.

2. The user identification system based on a heterogeneous information network embedding algorithm according to claim 1, characterized in that: in the data processing module, the heterogeneous information network is constructed as follows: based on the standardized log data in the historical behavior database, the extracted host, user and behavior identifiers are used as nodes to construct the heterogeneous information network G, where the host identifier is the central node and the user identifiers and behavior identifiers are neighbor nodes of the host identifier.

3. The user identification system based on a heterogeneous information network embedding algorithm according to claim 1, characterized in that: in the joint embedding module, the local embedding algorithm is implemented with a neural network, and the specific process is as follows:

(1) First, map all nodes in the heterogeneous information network G into a latent space, that is, randomly initialize the vector representations of all nodes to form the embedding vector table V;

(2) Given a host p, obtain the host vector V_p by aggregation in two steps. In the first step, compute the node type vector of each type of behavior-identifier neighbor node of host p by averaging the vectors v_n of all neighbor nodes of that type:

$$v_p^{t} = \frac{1}{\lvert E_p^{t} \rvert} \sum_{n \in E_p^{t}} v_n$$

where E_p^t denotes the set of behavior-identifier neighbor nodes of type t of host p;

In the second step, compute the weighted combination of the node type vectors to obtain the host vector V_p:

$$V_p = \sum_{t=1}^{5} w_t \, v_p^{t}$$

where w_t is the weight of the node type vector of type t. In the present invention there are five behavior-identifier neighbor node types, so t ranges from 1 to 5, corresponding to the login, file, mail, HTTP and device node types;

(3) Based on the host vector V_p, compute the dot-product similarity between the host and each user, and rank the potential operating users accordingly:

$$f(p, u) = V_p^{\top} v_u$$

where v_u is the user vector;

(4) Use stochastic gradient descent (SGD) to update the embedding vector table V and to learn the weight w_t of each node type vector, with the max-margin objective as the loss function:

$$\max\bigl(0,\ f(p, u') - f(p, u) + \varepsilon\bigr)$$

where u is the real operating user of host p, that is, the positive sample, u' is a negative sample, and ε is the margin; a loss penalty is incurred if the difference between f(p, u) and f(p, u') is less than ε.

4. The user identification system based on a heterogeneous information network embedding algorithm according to claim 1, characterized in that: in the implementation of the joint embedding module, the global embedding algorithm is implemented with meta-paths, and the process is as follows:

(1) Meta-paths define higher-order semantic associations between different types of nodes, that is, association information that cannot be captured by the edges of the original network. Given a set of meta-paths R, the meta-path-based global embedding algorithm first models the conditional neighbor distribution of nodes. In the heterogeneous information network G, many kinds of meta-paths start from node i, so the neighbor distribution of a node depends not only on node i but also on the given meta-path r. The conditional neighbor distribution function is defined as follows:

$$P(j \mid i; r) = \frac{\exp\bigl(v_i^{\top} v_j\bigr)}{\sum_{j' \in DST(r)} \exp\bigl(v_i^{\top} v_{j'}\bigr)}$$

where v_i and v_j are the vector representations of nodes i and j, and DST(r) is the set of all possible nodes on the target side of meta-path r for node i;

(2) The set DST(r) of all possible nodes on the target side of meta-path r contains a very large number of nodes. To reduce the computational burden, an approximate solution is obtained with a negative sampling strategy using the following formula, whose left-hand side denotes the approximation of the previous formula:

$$\log \hat{P}(j \mid i; r) \approx \log \sigma\bigl(v_i^{\top} v_j + b_r\bigr) + \sum_{l=1}^{k} \mathbb{E}_{j' \sim P_n^{r}(j')}\Bigl[\log \sigma\bigl(-v_i^{\top} v_{j'} - b_r\bigr)\Bigr]$$

where σ is the sigmoid function, j' is a negative node sampled from the noise distribution P_n^r(j'), each node i samples k negative nodes, and the bias term b_r, a bias term of the neural network training, adjusts for the different densities of different meta-paths;

(3) Use stochastic gradient descent (SGD) to learn the embedding vector table V and the bias term b_r, with the goal of maximizing the likelihood function.

5. The user identification system based on a heterogeneous information network embedding algorithm according to claim 1, characterized in that: in the implementation of the joint embedding module, the purpose of the joint objective function is to effectively combine the local features captured by the local embedding algorithm with the global features captured by the global embedding algorithm; it is defined as follows:

where ω∈[0, 1] is a predefined, tunable parameter that balances the importance of the two models, and a regularization term is added to prevent overfitting; Z_united denotes the objective function of the joint embedding model, Z_global the objective function of the global embedding model, Z_local the objective function of the local embedding model, and λ is the regularization parameter;

The iterative training process using the joint objective function is as follows:

(1) Sample either the local embedding algorithm or the global embedding algorithm according to a Bernoulli distribution with parameter ω;

(2) If the local embedding algorithm is sampled, train the embedding vector table V and learn the weight w_t of each node type vector according to the steps of the local embedding algorithm; likewise, if the global embedding algorithm is sampled, train the embedding vector table V and learn the parameter b_r according to the steps of the global embedding algorithm, where b_r adjusts for the different densities of different meta-paths;

The embedding vector table V is shared by the two embedding algorithms;

(3) Repeat steps (1) and (2) until the model converges, yielding the user predictor.

6. The user identification system based on a heterogeneous information network embedding algorithm according to claim 1, characterized in that: in the evaluation and analysis module, the analysis process for a suspicious situation is:

A suspicious situation is attributed to the user behavior in the test set deviating from the normal behavior pattern in the training set. For each suspicious situation, the evaluation and analysis module computes, in turn, the dot product between each behavior-identifier neighbor node of the host and the real operating user node of the host as an anomaly reference; the lower the dot-product score, the lower the similarity between the two entities and the higher the anomaly risk. The suspicious behaviors are finally sorted from highest to lowest anomaly risk:

$$L_p = \operatorname{sort}_{i \in E_p}\bigl(v_i^{\top} u_p\bigr)$$

where L_p is the resulting suspicious behavior sequence, E_p is the set of behavior-identifier neighbor nodes of host p, v_i is the vector representation of node i, u_p is the vector representation of the real operating user of host p, and the nodes are listed in ascending order of dot product.

7. The user identification system based on a heterogeneous information network embedding algorithm according to claim 4, characterized in that the meta-path set R is determined through a meta-path selection process, specifically as follows:

(1) Compute the identification accuracy obtained when each meta-path is added on its own, and sort the meta-paths by this accuracy to obtain the influence of each meta-path on the identification result when used alone;

(2) Add the meta-paths one by one in the order of this ranking and, following the change in identification accuracy, greedily select the combination that achieves the highest user identification accuracy as the optimal meta-path set R.
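The selection procedure of claim 7 can be sketched as below. This is one possible reading, not the patented procedure itself: meta-paths are added in order of their stand-alone accuracy and the best-scoring prefix is kept. The `evaluate` callable, the meta-path names and the accuracy figures are all hypothetical stand-ins for training and scoring the real model.

```python
def select_meta_paths(candidates, evaluate):
    """Greedy meta-path selection.

    evaluate(paths) -> identification accuracy when training with `paths`
    (a caller-supplied, hypothetical callable).
    """
    # Step 1: rank each meta-path by its accuracy when used alone.
    ranked = sorted(candidates, key=lambda r: evaluate([r]), reverse=True)

    # Step 2: add meta-paths in ranked order and remember the best prefix.
    best_set, best_acc = [], float("-inf")
    current = []
    for r in ranked:
        current.append(r)
        acc = evaluate(current)
        if acc > best_acc:
            best_set, best_acc = list(current), acc
    return best_set, best_acc


# Toy accuracy model: a base accuracy plus a made-up gain per meta-path.
gains = {"PC-file-PC": 0.05, "PC-mail-PC": 0.03, "PC-http-PC": -0.02}


def evaluate(paths):
    return 0.80 + sum(gains[p] for p in paths)


best_set, best_acc = select_meta_paths(list(gains), evaluate)
print(best_set, round(best_acc, 2))
```

Each call to `evaluate` would in practice mean retraining the joint model, so the prefix-based search keeps the number of training runs linear in the number of candidate meta-paths.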

8. A user identification method based on a heterogeneous information network embedding algorithm, characterized in that the steps are as follows:

Step (1) Data processing: collect the audit log data of a host in the intranet over a time interval. The audit log types are login logs, file logs, mail logs, HTTP logs and device logs. A log parser parses each type of log record by record and extracts the predefined key fields: subject, object, device and timestamp. For a file log, the subject is the user account, the object is the combination of file path and file name, the device is the host number, and the timestamp is the access time recorded in the log. The parsed log data are used as the test set. In addition, the standardized log data within one time window of the historical behavior database are used as the training set for constructing the heterogeneous information network G;

Step (2) Heterogeneous information network construction: use the training set to construct the heterogeneous information network G. The information extracted from the predefined fields of the standardized log data in the historical behavior database is treated as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as neighbor nodes of the host. For each host p, all behavior-identifier neighbor nodes related to it form the set E_p, and its real operating user is denoted u_p. Each behavior identifier can be associated with multiple hosts: if two hosts p and q both have log records involving mail entity e, then e is a neighbor node of both p and q;

Step (3) Joint embedding: after the heterogeneous information network G is obtained, the vector representation of each node is learned iteratively. First, the embedding vector table V is randomly initialized; then either the local embedding algorithm or the global embedding algorithm is sampled according to the parameter ω of the joint objective function. If the local embedding algorithm is sampled, train the embedding vector table V and learn the weight w_t of each node type vector according to the steps of the local embedding algorithm; if the global embedding algorithm is sampled, train the embedding vector table V and learn the parameter b_r according to the steps of the global embedding algorithm. Repeat this iterative training process until the model converges; the trained model is called the user predictor;

Step (4) User prediction: the user predictor contains the trained node embedding vector table V and the respective parameters of the local and global embedding algorithms. It performs the user prediction task on the test set: given a host p to be predicted, predict which operating user the log data on host p belong to. The prediction result is a ranked sequence in which the position of a user reflects the probability that the log data in the test set belong to that user; the ranking is based on the dot-product similarity score between the host vector and the user vector;

Step (5) Evaluation and analysis: for the prediction result obtained in step (4), if the real operating user corresponding to the behavior in the test set appears among the top K users of the predicted sequence, the identification is considered correct; otherwise the user behavior in the test set is taken to deviate from the normal behavior pattern in the training set, which is called a suspicious situation. For such suspicious situations, similarity-based anomaly analysis yields a ranking of the suspicious behaviors that caused the identification error, so that security analysts or other relevant staff can carry out traceability verification based on the clues given by the system.