CN114257565B

CN114257565B - Method, system and server for mining potential threat domain names

Info

Publication number: CN114257565B
Application number: CN202010945102.6A
Authority: CN
Inventors: 何振财; 李彬; 全俊斌; 乔雅莉; 邓太良; 郝建忠; 钟雪慧; 刘峥; 余筱蕙; 孙际勇
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guangdong Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guangdong Co Ltd
Priority date: 2020-09-10
Filing date: 2020-09-10
Publication date: 2023-09-05
Anticipated expiration: 2040-09-10
Also published as: CN114257565A

Abstract

The application discloses a method, a system and a server for mining potential threat domain names. It relates to the field of communications and aims to solve the problem that existing threat domain name identification methods analyze massive data inefficiently. The method comprises: acquiring a second feature set based on a first feature set acquired in advance, wherein the first feature set is a feature set related to potential threat domain names, and the second feature set is a subset of the first feature set and comprises dynamic scene features and static scene features; and obtaining a suspected threat domain name set by performing association calculation on the features in the second feature set. The application is used for mining threat domain names.

Description

Method, system and server for mining potentially threatening domain names

Technical Field

The present application relates to the field of communications, and in particular to a method, system and server for mining potential threat domain names.

Background

With the progress of science and the development of communication technology, mobile phones have become an indispensable part of daily life. In recent years, more and more users browse web pages and watch video content over the mobile Internet, and use all kinds of mobile applications (APPs) for social networking, entertainment, study and everyday life, generating a huge volume of mobile Internet access data. Many criminals attack users' mobile phones through network servers, which poses a serious threat to network security. Effectively mining potential threat domain names has therefore become an urgent task.

In the related art, potential threat domain names are usually mined by building a domain name content identification engine. This approach relies on similarity recognition of web page content: it mines features from the elements of a web page and then classifies and predicts them. However, because the approach must successfully visit the web page before it can obtain the content elements, its analysis efficiency is low when facing massive data, and it can hardly achieve effective analysis of network-wide logs.

A method for mining potential threat domain names is therefore urgently needed, so as to improve the efficiency of mining threat domain names.

Summary of the Invention

The embodiments of the present application provide a method for mining potential threat domain names, to solve the problem that existing threat domain name identification methods analyze massive data inefficiently.

The embodiments of the present application also provide a system for mining potential threat domain names, to solve the problem that existing threat domain name identification methods analyze massive data inefficiently and can hardly achieve effective analysis of network-wide logs.

The embodiments of the present application adopt the following technical solutions:

In a first aspect, a method for mining potential threat domain names is provided, including:

obtaining a second feature set based on a pre-acquired first feature set, wherein the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features;

obtaining a suspected threat domain name set by performing association calculation on the features in the second feature set.

In a second aspect, a system for mining potential threat domain names is provided, including:

an acquisition module, configured to obtain a second feature set based on a pre-acquired first feature set, wherein the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features;

a processing module, configured to obtain a suspected threat domain name set by performing association calculation on the features in the second feature set.

In a third aspect, a server is provided, including:

In a fourth aspect, a computer-readable storage medium is provided, wherein a program is stored in the computer-readable storage medium, and when the program is executed, the following process is implemented:

At least one of the above technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:

In the embodiments of the present invention, a second feature set is obtained based on a pre-acquired first feature set, wherein the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features; a suspected threat domain name set is obtained by performing association calculation on the features in the second feature set. By acquiring the first feature set in advance and deriving the second feature set from it, the feature set related to potential threat domain names can be delineated quickly within big data. Obtaining the suspected threat domain name set from the second feature set greatly reduces the amount of data to be processed, which effectively improves the efficiency of front-end identification of threat domain names and enables effective analysis of network-wide logs.

Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a method for mining potential threat domain names provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an example feature association relationship link in an embodiment of the present application;

FIG. 3 is a flow chart of determining a suspected threat domain name set based on association analysis in an embodiment of the present application;

FIG. 4 is a flow chart of a method for mining potential threat domain names provided by an embodiment of the present application;

FIG. 5 is a structural block diagram of the system provided by an embodiment of the present application;

FIG. 6 is a structural block diagram of the server provided by an embodiment of the present application.

Detailed Description

The embodiments of the present application provide a method and a system for mining potential threat domain names.

To enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application, without creative effort, shall fall within the scope of protection of the present application.

FIG. 1 is a flow chart of a method for mining potential threat domain names provided by an embodiment of the present application. As shown in FIG. 1, the method may include the following steps:

Step 110: obtain a second feature set based on a pre-acquired first feature set, wherein the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features.

In the embodiments of the present application, the first feature set may specifically include, but is not limited to, at least two of the following: the number of IPs of the domain name, the number of visiting users, the number of visits, the number of Uniform Resource Locators (URLs), the maximum number of visiting users, the average number of URLs per user, the average number of users per URL, the average number of visits per URL, the standard deviation of the number of users, the standard deviation of the number of visits, the dispersion coefficient of the number of URL visits, the dispersion of the number of visiting users, the number of users who frequently visit the domain name within a single day, the share of visits taken by the most visited URL of the domain name, the number of URLs accessed by a single user, the domain name survival time, the terminal operating system attribute, the access carrier attribute, the domain name return code attribute, the IP region attribute, and the URL return code attribute.

In the embodiments of the present application, the second feature set may specifically include dynamic scene features and static scene features.

The dynamic scene features include at least one of communication features and association features, wherein the communication features include a Domain Generation Algorithm (DGA) attribute, and the association features include at least one of the domain name return code attribute, the domain name IP region attribute and the URL return code attribute. The static scene features include at least one of behavior features and fingerprint features, wherein the behavior features include at least one of the number of visiting users, the dispersion of the number of visiting users, the share of visits taken by the most visited URL of the domain name, the number of URLs accessed by a single user, and the domain name survival time, and the fingerprint features include at least one of the terminal operating system attribute and the access carrier attribute.

The DGA attribute measures how randomly dispersed the characters of a domain name are. It is the ratio of the domain name entropy value to the domain name vowel ratio, where the entropy value measures the degree of randomness of the characters, and the vowel ratio is the number of vowel letters in the domain name divided by the total length of the domain name.
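The DGA attribute defined above can be sketched in a few lines. This is only an illustration under assumptions the text does not settle: Shannon entropy in bits is used as the entropy value, the vowels are taken to be `a`, `e`, `i`, `o`, `u`, the caller decides which part of the domain name is passed in, and a name without vowels is mapped to infinity. The function names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: the five Latin vowels


def shannon_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of the characters in label."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def vowel_ratio(label: str) -> float:
    """Number of vowel letters divided by the total length of label."""
    if not label:
        return 0.0
    return sum(ch in VOWELS for ch in label.lower()) / len(label)


def dga_attribute(label: str) -> float:
    """Entropy divided by vowel ratio, as the text defines the DGA attribute."""
    ratio = vowel_ratio(label)
    if ratio == 0.0:
        # Assumption: a name with no vowels gets the highest possible score.
        return math.inf
    return shannon_entropy(label) / ratio
```

Under this reading, a random-looking string with few vowels scores higher than a pronounceable name of similar length.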

The domain name return code attribute refers to the status code returned when the domain name is accessed. For example, the domain name returns a 4xx error while a URL under it returns 200 (request successful) or 204 (no content). Domain names that show such return codes are highly suspicious.

The IP region attribute refers to the location to which the IP address resolved from the domain name belongs. In general, overseas IPs are more dangerous than domestic IPs.

The URL return code attribute refers to the status code of a URL under the domain name when that URL is accessed. For example, if the URL returns a 302 status code, the URL has been redirected and is suspicious.
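The two return-code examples given in the text can be expressed as simple boolean indicators. The sketch below covers only those two examples; the function name and the dictionary keys are hypothetical, and a real feature extractor would likely consider more status codes.

```python
def return_code_flags(domain_status: int, url_status: int) -> dict:
    """Flag the two return-code patterns that the text calls suspicious."""
    # Domain returns a 4xx error while a URL under it answers 200 or 204.
    domain_mismatch = 400 <= domain_status <= 499 and url_status in (200, 204)
    # URL answers 302, meaning it was redirected.
    url_redirected = url_status == 302
    return {
        "domain_return_code_suspicious": domain_mismatch,
        "url_return_code_suspicious": url_redirected,
    }
```

For instance, `return_code_flags(404, 200)` sets only the domain-level flag, while `return_code_flags(200, 302)` sets only the URL-level flag.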

The number of visiting users: apart from whitelisted domain names, the more users visit a domain name on a given day, the wider its influence and the higher its degree of danger.

The dispersion of the number of visiting users: because users choose what to visit subjectively and at random, visits to whitelisted domain names are spread fairly evenly across users. If the visiting users of a domain name and its URLs become concentrated, the domain name may be in a burst-access phase, and its degree of danger is high.
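The text does not give a formula for this dispersion. One common choice, shown here purely as an assumption, is the coefficient of variation (population standard deviation divided by the mean) of the per-URL visiting-user counts of a domain name. The function name is hypothetical.

```python
import statistics


def coefficient_of_variation(counts) -> float:
    """Population standard deviation divided by the mean; 0.0 for empty or all-zero input."""
    counts = list(counts)
    if not counts:
        return 0.0
    mean = statistics.fmean(counts)
    if mean == 0:
        return 0.0
    return statistics.pstdev(counts) / mean
```

With this measure, evenly spread counts such as `[10, 10, 10, 10]` give 0.0, while counts concentrated on one URL such as `[37, 1, 1, 1]` give a value above 1.5.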

The share of visits taken by the most visited URL of the domain name: among the URLs under a domain name, if visits concentrate on one URL, the larger that URL's share of the visits, the more likely the URL is being exploited and the higher the degree of danger.

The number of URLs accessed by a single user: once a single user's access behavior has been taken over, that user does not interact frequently with the domain name of the malicious server, which differs markedly from the behavior of a normal, uncontrolled user. Therefore, the fewer URLs a single user accesses, the higher the degree of danger.

The domain name survival time: to evade monitoring, malicious and gray domain names are regularly replaced with new domain names. The shorter the survival time, the higher the degree of danger.

In the embodiments of the present application, the second feature set is a subset of the first feature set. That is, every feature in the second feature set comes from the first feature set. For example, the first feature set may consist of the standard deviation of the number of users and the DGA attribute, and the second feature set may consist of the DGA attribute.

In the embodiments of the present application, before step 110, the method for mining potential threat domain names may further include a step of acquiring the first feature set. Specifically, the process of acquiring the first feature set may include: acquiring operator-side data; and obtaining, based on the acquired operator-side data, the first feature set related to potential threat domain names.

The operator side has a communication network with wide coverage that generates massive amounts of data at every moment. Operator-side data includes not only users' basic information but also information in multiple dimensions, such as users' communication data, social activity data, consumption behavior data and location data. Operator-side data has a completeness, continuity and richness that data from other industries cannot match. The embodiments of the present application therefore take advantage of the massive data on the operator side to obtain a large amount of user information, for example information that does not involve user privacy, and aggregate this massive data to obtain the first feature set related to potential threat domain names, which helps ensure the accuracy of the first feature set.

In step 110, obtaining the second feature set based on the pre-acquired first feature set includes: training on the feature data in the pre-acquired first feature set with a random forest algorithm, so as to obtain the second feature set related to potential threat domain names.

The second feature set is obtained by training on the first feature set with the random forest algorithm. Because the second feature set is a subset of the first feature set, it contains markedly less data than the original first feature set, which helps improve data processing efficiency.

Specifically, in the embodiments of the present application, training with the random forest algorithm may proceed according to the following sub-steps:

Sub-step 1: construct n new training sets from the feature data in the pre-acquired first feature set by sampling with replacement;

The feature data may repeat across different new training sets, and may also repeat within the same new training set.

The feature data refers to the specific value that each feature in the first feature set takes for a given domain name.
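Sampling with replacement, as in sub-step 1, can be sketched as follows. The assumptions here are that each row holds the feature values of one domain name, that every new training set has the same size as the original data, and that a seed is accepted for reproducibility; none of these details is stated in the text, and the function name is hypothetical.

```python
import random


def bootstrap_training_sets(rows, n, seed=None):
    """Build n new training sets by drawing len(rows) rows with replacement."""
    rng = random.Random(seed)
    # The same row may appear several times in one set and in several sets.
    return [[rng.choice(rows) for _ in rows] for _ in range(n)]
```

Because rows are drawn independently each time, a row can be repeated within one training set and across training sets, which matches the description above.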

Sub-step 2: build sub-decision trees from the new training sets, wherein each sub-decision tree corresponds to one new training set;

The random forest consists of n sub-decision trees (n is a positive integer whose specific value is random), and each node in a decision tree is a judgment condition on specific feature data. Taking one node as an example: if the node is a parent node, it has two child nodes, one on its left and one on its right, referred to as the left child node and the right child node. If the parent node tests the DGA value, its children can be generated by sending values greater than a certain threshold to the left child node and values less than that threshold to the right child node; if the parent node tests the terminal operating system attribute, its left and right child nodes can be generated by yes/no judgment logic.
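The two kinds of judgment condition described for a parent node can be modeled with a small data structure. This is a sketch with hypothetical names; sending "yes" to the left child and sending a value equal to the threshold to the right child are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: str                       # name of the feature tested at this node
    threshold: Optional[float] = None  # None marks a yes/no feature
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def route(node: Node, sample: dict) -> str:
    """Return which child ('left' or 'right') the sample is sent to."""
    value = sample[node.feature]
    if node.threshold is None:
        # Yes/no judgment, e.g. a terminal operating system attribute.
        return "left" if value else "right"
    # Numeric judgment, e.g. the DGA value: greater than the threshold goes left.
    return "left" if value > node.threshold else "right"
```

For example, `route(Node("dga", threshold=3.0), {"dga": 4.2})` returns `"left"`.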

Sub-step 3: obtain the second feature set related to potential threat domain names by voting over the terminal nodes of the n sub-decision trees.

Voting here means statistically analyzing the terminal nodes of the n decision trees and selecting the nodes that occur most often as the voting result.
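One possible reading of this voting step is a frequency count: each tree reports the features tested at its terminal nodes, and the features reported by the most trees are kept. The sketch below follows that reading only; the `top_k` cut-off, the one-vote-per-tree rule and the alphabetical tie-break are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def vote_features(terminal_features_per_tree, top_k):
    """Keep the top_k features that appear at terminal nodes of the most trees."""
    votes = Counter()
    for features in terminal_features_per_tree:
        votes.update(set(features))  # each tree votes at most once per feature
    # Sort by descending vote count, then by name so that ties are deterministic.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]
```

With three trees reporting `["dga", "ip_region"]`, `["dga", "survival"]` and `["dga", "ip_region"]`, asking for the top two features yields `["dga", "ip_region"]`.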

To present the second feature set related to potential threat domain names intuitively, example data of the second feature set obtained through the above sub-steps is shown in Table 1 below.

Table 1

Step 120: obtain a suspected threat domain name set by performing association calculation on the features in the second feature set.

In the method for mining potential threat domain names provided by the embodiments of the present application, a second feature set is obtained based on a pre-acquired first feature set, wherein the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features; a suspected threat domain name set is obtained by performing association calculation on the features in the second feature set. By acquiring the first feature set in advance and deriving the second feature set from it, the feature set related to potential threat domain names can be delineated quickly within big data. Obtaining the suspected threat domain name set from the second feature set greatly reduces the amount of data to be processed, which effectively improves the efficiency of front-end identification of threat domain names and enables effective analysis of network-wide logs.

In the embodiments of the present application, the association calculation on the features in the second feature set can be performed in several ways, for example with the Apriori algorithm or the FP-Tree algorithm. The Apriori algorithm is used as the example below.

The Apriori algorithm is an algorithm for mining association rules in data. It can be used to mine the degree of association among the features in the second feature set, and thereby form the suspected threat domain name set.

Specifically, referring to FIG. 3, in the embodiments of the present application, step 120 may include the following process:

Step 1201: determine the evaluation criteria for the potential threat domain name feature set, the evaluation criteria including at least one of support and confidence;

The evaluation criteria for the potential threat domain name feature set include support, confidence and so on. In general, selecting the frequent K-itemsets of a data set requires user-defined evaluation criteria. In the Apriori algorithm described in the present application, at least one of support and confidence is used as the evaluation index.

A frequent K-itemset consists of K closely associated data items, where K is a positive integer and K≥2.

The support is the probability that several specific feature values from the second feature set appear together in one domain name.

If, in one domain name, there are two specific feature values X and Y from the second feature set whose association is to be analyzed, the corresponding support is

Support(XY) = P(XY) = P(X)P(Y|X)

where P(XY) is the probability that X and Y occur together; P(X) is the probability that X occurs; P(Y|X) is the probability that Y occurs given that X occurs; and Support(XY) is the support of XY.

Similarly, if, in one domain name, there are three specific feature values X, Y and Z from the second feature set whose association is to be analyzed, the corresponding support is

Support(XYZ) = P(XYZ) = P(X)P(Y|X)P(Z|XY)

where P(XYZ) is the probability that X, Y and Z occur together; P(X) is the probability that X occurs; P(Y|X) is the probability that Y occurs given that X occurs; P(Z|XY) is the probability that Z occurs given that X and Y occur; and Support(XYZ) is the support of XYZ.
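Support can be estimated from data as a relative frequency. The sketch below assumes that each record is the set of feature values observed for one domain name, and that support is the fraction of records containing all the requested values; the record layout and the function name are hypothetical.

```python
def support(records, items) -> float:
    """Fraction of records that contain every value in items."""
    wanted = set(items)
    if not records:
        return 0.0
    hits = sum(1 for record in records if wanted <= set(record))
    return hits / len(records)
```

With the four records `{"X", "Y", "Z"}`, `{"X", "Y"}`, `{"X"}` and `{"Y"}`, Support(XY) is 2/4 = 0.5 and Support(XYZ) is 1/4 = 0.25. The chain rule in the formulas holds on the same data: P(X) = 0.75 and P(Y|X) = 2/3, whose product is 0.5.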

The confidence is the conditional probability in the statistical sense. Continuing the example above:

Confidence(X⇐Y) = P(X|Y)

Confidence(X⇐YZ) = P(X|YZ)

where Confidence(X⇐Y) is the confidence of X with respect to Y, Confidence(X⇐YZ) is the confidence of X with respect to YZ, P(X|Y) is the probability that X occurs given that Y occurs, and P(X|YZ) is the probability that X occurs given that Y and Z occur.
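Confidence can be computed from two supports, because P(X|Y) = P(XY) / P(Y). The sketch below assumes the same hypothetical record layout as before (one set of feature values per domain name) and returns 0.0 when the condition never occurs; both function names are hypothetical.

```python
def support(records, items) -> float:
    """Fraction of records that contain every value in items."""
    wanted = set(items)
    if not records:
        return 0.0
    return sum(1 for record in records if wanted <= set(record)) / len(records)


def confidence(records, target, given) -> float:
    """Estimate P(target | given) as Support(target and given) / Support(given)."""
    base = support(records, given)
    if base == 0.0:
        return 0.0  # assumption: undefined confidence is reported as 0.0
    return support(records, set(target) | set(given)) / base
```

On the records `{"X", "Y", "Z"}`, `{"X", "Y"}`, `{"X"}` and `{"Y"}`, the confidence of X with respect to Y is 0.5 / 0.75 = 2/3, and the confidence of X with respect to YZ is 0.25 / 0.25 = 1.0.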

Step 1202: based on the determined evaluation criteria and the Apriori algorithm, perform association analysis on the features in the second feature set to determine the suspected threat domain name set.

Specifically, step 1202 may include the following sub-steps 1 to 3.

Sub-step 1: perform data connection on the data in the second feature set, and prune the candidate 1-itemsets obtained after the connection according to the preset evaluation criteria to obtain the frequent 1-itemsets;

The candidate 1-itemsets are obtained by connecting the initial valid feature data in pairs and removing identical items. For example, connecting data 0 and data 1 in pairs yields itemsets 01 and 10. Itemsets 01 and 10 are in fact the same item in a different order, so after identical items are removed, only one of them (01 or 10) remains in the candidate 1-itemsets. All other data are connected in the same pairwise manner.
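The pairwise connection with removal of identical items corresponds to taking unordered pairs. A minimal sketch, assuming each feature is identified by a one-character label as in the 0 and 1 example; the function name is hypothetical.

```python
from itertools import combinations


def pairwise_connect(labels):
    """Connect labels in pairs, keeping one ordering of each pair (01, never 10)."""
    unique = sorted(set(labels))
    return ["".join(pair) for pair in combinations(unique, 2)]
```

`combinations` emits each unordered pair exactly once, so 01 and 10 never both appear.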

Pruning means deleting the candidate 1-itemsets that fall below the preset evaluation criteria.

Sub-step 2: connect the elements of the frequent 1-itemsets to obtain the candidate 2-itemsets, and prune the elements of the candidate 2-itemsets that fall below the preset evaluation criteria to obtain the frequent 2-itemsets;

The elements of the frequent 1-itemsets are connected in the manner of sub-step 1 above to obtain the candidate 2-itemsets, and the candidate 2-itemsets are pruned in the manner of sub-step 1 to obtain the frequent 2-itemsets. Note that once a candidate 1-itemset has been pruned, the candidate 2-itemsets related to it need not be checked against the preset evaluation criteria; they are pruned directly.

Sub-step 3: for the frequent K-itemsets, iterate according to the above process until no frequent (K+1)-itemset can be found, and determine the resulting frequent K-itemsets as the suspected threat domain name set, where K is a positive integer and K≥2.

In the present application, a candidate K-itemset may be a set composed of K-1 initial valid features; a frequent K-itemset is a candidate K-itemset that meets the preset evaluation criteria.

During the operation of the Apriori algorithm, if a K-itemset is frequent, then all of its subsets are also frequent. It follows that, during pruning, if a candidate itemset falls below the preset evaluation criteria, it is deleted, and every larger itemset that contains it is pruned as well. This greatly shortens the traversal time of the algorithm and further improves its processing efficiency.
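The connect, prune and iterate loop of sub-steps 1 to 3 can be sketched as a level-wise search. This is a minimal illustration of the general Apriori technique, not the patented implementation. It assumes that support alone is the evaluation criterion, that each record is the set of feature labels observed for one domain name, and it counts itemsets by their number of items (the usual convention), so a pair of features is a 2-itemset here. The function name and `min_support` parameter are hypothetical.

```python
from itertools import combinations


def apriori(records, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    records = [frozenset(r) for r in records]
    total = len(records)
    if total == 0:
        return {}

    def support(itemset):
        return sum(1 for r in records if itemset <= r) / total

    # Level 1: single items that meet the threshold.
    items = {item for r in records for item in r}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {s: support(s) for s in current}

    size = 1
    while current:
        size += 1
        # Connection: unite pairs of frequent itemsets that give `size` items.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Pruning without counting: drop candidates that contain an itemset
        # already found to be infrequent at the previous level.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        # Pruning by the evaluation criterion (support).
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
    return frequent
```

The loop ends when a level produces no frequent itemset, which mirrors the stopping rule of sub-step 3. The subset check before counting is what saves traversal time: a candidate containing an already rejected itemset is discarded without scanning the records.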

The Apriori algorithm makes it possible to capture the network operation behavior in all domain name scenarios. The following illustration uses a simple single-chain model; the actual model is a hybrid model composed of multiple single chains.

An example analysis of a relationship link in a simple dimension follows. FIG. 3 is a schematic diagram of the feature association link analysis of this application.

Each sub-step of step 1202 is illustrated below with a specific example. As shown in FIG. 3, domain name A has initial effective feature data for the following features of the second feature set: DGA, IP attribute, number of users visiting on the day, and domain name survival time. DGA is labeled 0, the IP attribute 1, the number of users visiting on the day 2, and the domain name survival time 3.

Sub-step 1 above is applied to the labeled features: the initial effective feature data of the features in the second feature set are joined to obtain the joined data sets.

Specifically, joining DGA (0) and the IP attribute (1) yields data set 01; joining DGA (0) and the number of users visiting on the day (2) yields data set 02; joining DGA (0) and the domain name survival time (3) yields data set 03; joining the IP attribute (1) and the number of users visiting on the day (2) yields data set 12; joining the IP attribute (1) and the domain name survival time (3) yields data set 13; and joining the number of users visiting on the day (2) and the domain name survival time (3) yields data set 23.

The joined data sets are then pruned according to the preset evaluation criteria. Assuming that data set 23 falls below the preset evaluation criteria, it is pruned, which gives the frequent 1-itemsets: 01, 02, 03, 12, 13.

Sub-step 2 above is applied to these frequent 1-itemsets: their elements are joined to obtain the candidate 2-itemsets.

Specifically, joining frequent 1-itemsets 01, 02 and 12 yields candidate 2-itemset 012; joining frequent 1-itemsets 01, 03 and 13 yields candidate 2-itemset 013; joining frequent 1-itemsets 02 and 03 with candidate 1-itemset 23 yields candidate 2-itemset 023; and joining frequent 1-itemsets 12 and 13 with candidate 1-itemset 23 yields candidate 2-itemset 123.

The elements of the candidate 2-itemsets that fall below the preset evaluation criteria are pruned to obtain the frequent 2-itemsets. Note that candidate 1-itemset 23 was already pruned in sub-step 1 above, so candidate 2-itemsets 023 and 123, which have candidate 1-itemset 23 as a parent node, are pruned along with it. The frequent 2-itemsets after pruning are therefore 012 and 013.

Sub-step 3 above is applied to these frequent 2-itemsets: for the frequent K-itemsets (K = 2 in this example), the above process is iterated until no frequent (K+1)-itemset (K = 2 in this example) can be found, and the resulting frequent K-itemsets are determined as the set of suspected threat domain names.

Specifically, joining frequent 2-itemsets 012 and 013 with candidate 2-itemsets 023 and 123 yields candidate 3-itemset 0123. Note that candidate 2-itemsets 023 and 123 were already pruned in sub-step 2 above, so candidate 3-itemset 0123, which has them as parent nodes, is pruned along with them. No frequent 3-itemset can be found at this point, and the frequent 2-itemsets obtained are determined as the set of suspected threat domain names.
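The worked example can be reproduced with a short level-wise search. The sketch below is illustrative only: the names `mine_levels` and `example_criteria` are hypothetical, the predicate stands in for the preset evaluation criteria (here it simply rejects the pair 23, as the example assumes), and the join is written as a pairwise union followed by a sub-itemset check, which is assumed to give the same candidates as the three-way joins described in the text.

```python
from itertools import combinations


def mine_levels(features, meets_criteria):
    """Return the last non-empty level of frequent itemsets.

    meets_criteria is a caller-supplied predicate standing in for the preset
    evaluation criteria (support and/or confidence); it is an assumption here.
    """
    # Sub-step 1: join single features into pairs, then prune.
    frequent = [frozenset(p) for p in combinations(features, 2)
                if meets_criteria(frozenset(p))]
    while frequent:
        known = set(frequent)
        size = len(frequent[0]) + 1
        # Join: unions of two frequent itemsets that are one feature larger.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == size}
        # Prune: drop candidates with a pruned sub-itemset, evaluate the rest.
        next_frequent = [
            c for c in candidates
            if all(frozenset(s) in known for s in combinations(c, size - 1))
            and meets_criteria(c)
        ]
        if not next_frequent:
            break  # no larger frequent itemset exists; keep the current level
        frequent = next_frequent
    return sorted(frequent, key=sorted)


def example_criteria(itemset):
    # Hypothetical stand-in: only the pair {2, 3} fails the criteria.
    return itemset != frozenset({2, 3})


result = mine_levels([0, 1, 2, 3], example_criteria)
labels = ["".join(str(f) for f in sorted(s)) for s in result]
print(labels)  # ['012', '013']
```

The itemsets 023, 123 and 0123 are never evaluated against the criteria; they are rejected by the sub-itemset check alone, which matches the direct pruning described in the example.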

Optionally, in one embodiment, the method for mining potential threat domain names provided in the embodiments of this application may further include the following steps:

acquiring the dynamic and static features of the domain name data in a monitoring data source;

where the domain name data in the monitoring data source refers to the specific domain name data under the domain names in the monitored live network; and

comparing the acquired dynamic and static features of the domain name data with the acquired set of suspected threat domain names to obtain a domain name detection result.
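The text does not spell out at this point how the comparison is carried out. One plausible reading, sketched below purely as an assumption, is that a monitored domain name is flagged when it exhibits every feature of at least one mined feature set. The function `detect` and the feature names are hypothetical and are not taken from this application.

```python
def detect(domain_features, suspected_itemsets):
    """Return the mined itemsets whose features are all present for a domain.

    domain_features: iterable of feature names observed for one domain name.
    suspected_itemsets: list of frozensets produced by the mining stage.
    """
    observed = frozenset(domain_features)
    return [items for items in suspected_itemsets if items <= observed]


# Hypothetical mined feature sets, mirroring itemsets 012 and 013.
suspected = [
    frozenset({"dga", "ip_attribute", "daily_users"}),
    frozenset({"dga", "ip_attribute", "survival_time"}),
]

hits = detect({"dga", "ip_attribute", "survival_time", "os_attribute"}, suspected)
print(bool(hits))  # a non-empty result marks the domain as a suspected threat
```

An empty result means that none of the mined feature sets is fully matched, so the domain name is not flagged under this reading.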

Optionally, in one embodiment, the method for mining potential threat domain names provided in the embodiments of this application may further include the following step: when the domain name detection result indicates that a threat domain name has been detected, feeding the domain name detection result back to the malicious link library.

FIG. 4 is a flowchart of a method for mining potential threat domain names provided by an embodiment of this application. As shown in FIG. 4, the method may include the following steps:

Step 410: Acquire operator-side data and, based on the acquired operator-side data, obtain a first feature set related to potential threat domain names.

Step 420: Train on the feature data in the pre-acquired first feature set using a random forest algorithm to obtain a second feature set related to potential threat domain names.

For the specific process by which the random forest algorithm processes the data, see step 110 above; it is not repeated here.

Step 430: Determine the evaluation criteria for the feature set of potential threat domain names, the evaluation criteria including at least one of support and confidence.

Step 440: Join the initial effective feature data of the features in the second feature set, and prune the resulting candidate 1-itemsets according to the preset evaluation criteria to obtain the frequent 1-itemsets.

Step 450: Join the elements of the frequent 1-itemsets to obtain the candidate 2-itemsets, and prune the elements of the candidate 2-itemsets that fall below the preset evaluation criteria to obtain the frequent 2-itemsets.

Step 460: For the frequent K-itemsets, iterate according to the above process until no frequent (K+1)-itemset can be found, and determine the resulting frequent K-itemsets as the set of suspected threat domain names.

Step 470: Acquire the dynamic and static features of the domain name data in the monitoring data source.

Step 480: Compare the acquired dynamic and static features of the domain name data with the acquired set of suspected threat domain names to obtain a domain name detection result.

FIG. 5 is a structural block diagram of a system for mining potential threat domain names provided by an embodiment of this application. Referring to FIG. 5, the system may include:

an acquisition module 502, configured to acquire a second feature set based on a pre-acquired first feature set, where the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features; and

a processing module 504, configured to obtain a set of suspected threat domain names by performing association calculation on the features in the second feature set.

The system for mining potential threat domain names provided by this embodiment of the present invention acquires a second feature set based on a pre-acquired first feature set, where the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features; it then obtains a set of suspected threat domain names by performing association calculation on the features in the second feature set. By acquiring the first feature set in advance and deriving the second feature set from it, the feature set related to potential threat domain names can be quickly delineated within big data. Obtaining the set of suspected threat domain names from the second feature set greatly reduces the amount of data to be processed, which effectively improves the efficiency with which the front end identifies threat domain names and enables effective analysis of network-wide logs.

Optionally, in one embodiment of this application, in the process of acquiring the second feature set, the acquisition module 502 may be configured to train on the feature data in the pre-acquired first feature set using a random forest algorithm to obtain the second feature set related to potential threat domain names.

Optionally, in one embodiment of this application, before the second feature set is acquired based on the pre-acquired first feature set, the acquisition module 502 is further configured to:

acquire operator-side data and, based on the acquired operator-side data, obtain the first feature set related to potential threat domain names.

In the embodiments of this application, optionally, the first feature set includes at least two of: the number of domain name IPs, the number of visiting users, the number of visits, the number of URLs, the maximum number of visiting users, the average number of URLs per user, the average number of users per URL, the average number of visits per URL, the standard deviation of the number of users, the standard deviation of the number of visits, the dispersion coefficient of the number of URL visits, the dispersion of the number of visiting users, the frequency with which a single user visits the domain name on the day, the number of visiting users, the proportion of the maximum URL visit count of the domain name, the number of URLs accessed by a single user, the domain name survival time, the terminal operating system attribute, the access carrier attribute, the domain name return code attribute, the IP region attribute, and the URL return code attribute.
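A few of the statistical features listed above can be derived from raw access records as follows. This is a sketch under stated assumptions: the record layout of `(user_id, url)` tuples, the function name `domain_statistics`, and the reading of "dispersion coefficient" as the coefficient of variation (standard deviation divided by mean) are all hypothetical and are not defined in this passage.

```python
from collections import Counter
from statistics import mean, pstdev


def domain_statistics(records):
    """Compute sample features for one domain name on one day.

    records: non-empty list of (user_id, url) tuples (hypothetical layout).
    """
    visits_per_url = Counter(url for _, url in records)
    users = {user for user, _ in records}
    counts = list(visits_per_url.values())
    average = mean(counts)
    return {
        "visiting_users": len(users),
        "visits": len(records),
        "url_count": len(visits_per_url),
        "avg_visits_per_url": average,
        # Coefficient of variation: the assumed meaning of "dispersion coefficient".
        "url_visit_dispersion": pstdev(counts) / average,
    }


records = [("u1", "/a"), ("u1", "/b"), ("u2", "/a"), ("u3", "/a")]
stats = domain_statistics(records)
print(stats["visiting_users"], stats["visits"], stats["url_count"])  # 3 4 2
```

In this toy data the URL `/a` is visited three times and `/b` once, so the mean is 2, the population standard deviation is 1 and the dispersion value is 0.5.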

In the embodiments of this application, optionally, the dynamic scene features include at least one of communication features and associated features, where the communication features include the DGA attribute and the associated features include at least one of the domain name return code attribute, the domain name IP region attribute and the URL return code attribute. The static scene features include at least one of behavioral features and fingerprint features, where the behavioral features include at least one of the number of visiting users, the dispersion of the number of visiting users, the proportion of the maximum URL visit count of the domain name, the number of URLs accessed by a single user and the domain name survival time, and the fingerprint features include at least one of the terminal operating system attribute and the access carrier attribute.
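Claim 1 of this application defines the DGA attribute as the ratio of the entropy value of the domain name to its vowel value, the vowel value being the number of vowels divided by the total length of the domain name. The sketch below illustrates that ratio. It assumes that the entropy value is the Shannon entropy of the character distribution, that the vowels are `a`, `e`, `i`, `o`, `u`, and that a label with no vowels maps to infinity; none of these choices is stated in the source, and the function names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: ASCII vowels only


def shannon_entropy(text):
    """Shannon entropy (bits per character) of a non-empty string."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def dga_attribute(domain_label):
    """Entropy of the label divided by its vowel ratio; larger looks more random."""
    label = domain_label.lower()
    vowel_ratio = sum(ch in VOWELS for ch in label) / len(label)
    if vowel_ratio == 0:
        # Assumption: a label without vowels is treated as maximally suspicious.
        return math.inf
    return shannon_entropy(label) / vowel_ratio


print(dga_attribute("google") < dga_attribute("qxzkwvbe"))  # True
```

A readable label such as `google` has repeated characters and many vowels, so its ratio is small, while a random-looking label with eight distinct characters and a single vowel scores much higher.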

In the embodiments of this application, optionally, the processing module 504 is configured to:

determine the evaluation criteria for the feature set of potential threat domain names, the evaluation criteria including at least one of support and confidence; and

perform association analysis on the features in the second feature set, based on the determined evaluation criteria and the Apriori algorithm, to determine the set of suspected threat domain names.
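Support and confidence can be computed as in the following sketch, which uses the standard definitions from association rule mining. The transaction layout (one set of feature labels per observed domain name) and the sample data are assumptions made for illustration, not values from this application.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    joint = support(frozenset(antecedent) | frozenset(consequent), transactions)
    return joint / support(antecedent, transactions)


# Hypothetical data: the feature labels observed for four domain names.
transactions = [frozenset(t) for t in ({0, 1, 2}, {0, 1, 3}, {0, 1}, {2, 3})]

print(support({0, 1}, transactions))          # 0.75
print(confidence({0, 1}, {2}, transactions))  # about 0.33
```

An itemset is kept as frequent when these values reach the chosen thresholds; `confidence` assumes that the antecedent occurs in at least one transaction.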

Optionally, in the process of performing association analysis on the features in the second feature set, based on the determined evaluation criteria and the Apriori algorithm, to determine the set of suspected threat domain names, the processing module 504 is configured to:

join the initial effective feature data of the features in the second feature set, and prune the resulting candidate 1-itemsets according to the preset evaluation criteria to obtain the frequent 1-itemsets;

join the elements of the frequent 1-itemsets to obtain the candidate 2-itemsets, and prune the elements of the candidate 2-itemsets that fall below the preset evaluation criteria to obtain the frequent 2-itemsets; and

for the frequent K-itemsets, iterate according to the above process until no frequent (K+1)-itemset can be found, and determine the resulting frequent K-itemsets as the set of suspected threat domain names, where K is a positive integer and K ≥ 2.

Optionally, the acquisition module 502 is further configured to acquire the dynamic and static features of the domain name data in the monitoring data source. Correspondingly, the processing module 504 is further configured to compare the acquired dynamic and static features of the domain name data with the acquired set of suspected threat domain names to obtain a domain name detection result.

For the specific process of the steps performed by each module of the system for mining potential threat domain names provided in the embodiments of this application, refer to the method embodiments; details are not repeated here.

FIG. 6 is a structural block diagram of a server provided by an embodiment of this application. Referring to FIG. 6, the server for potential threat domain names provided by the embodiment of this application may include:

an acquisition module 602, configured to acquire a second feature set based on a pre-acquired first feature set, where the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features; and

a processing module 604, configured to obtain a set of suspected threat domain names by performing association calculation on the features in the second feature set.

In the embodiments of this application, a second feature set is acquired based on a pre-acquired first feature set, where the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features; a set of suspected threat domain names is then obtained by performing association calculation on the features in the second feature set. By acquiring the first feature set in advance and deriving the second feature set from it, the feature set related to potential threat domain names can be quickly delineated within big data. Obtaining the set of suspected threat domain names from the second feature set greatly reduces the amount of data to be processed, which effectively improves the efficiency with which the front end identifies threat domain names and enables effective analysis of network-wide logs.

Optionally, in one embodiment of this application, in the process of acquiring the second feature set, the acquisition module 602 may be configured to train on the feature data in the pre-acquired first feature set using a random forest algorithm to obtain the second feature set related to potential threat domain names.

Optionally, in one embodiment of this application, before the second feature set is acquired based on the pre-acquired first feature set, the acquisition module 602 is further configured to:

In the embodiments of this application, optionally, the processing module 604 is configured to:

Optionally, in the process of performing association analysis on the features in the second feature set, based on the determined evaluation criteria and the Apriori algorithm, to determine the set of suspected threat domain names, the processing module 604 is configured to:

Optionally, the acquisition module 602 is further configured to acquire the dynamic and static features of the domain name data in the monitoring data source. Correspondingly, the processing module 604 is further configured to compare the acquired dynamic and static features of the domain name data with the acquired set of suspected threat domain names to obtain a domain name detection result.

For the specific process of the steps performed by each module of the server provided in the embodiments of this application, refer to the method embodiments; details are not repeated here.

In addition, an embodiment of this application further provides a computer-readable storage medium storing a program which, when executed, implements the following process:

For the specific implementation of each of the above steps, refer to the description above; details are not repeated here.

The storage medium provided by the embodiments of this application acquires a second feature set based on a pre-acquired first feature set, where the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features; a set of suspected threat domain names is then obtained by performing association calculation on the features in the second feature set. By acquiring the first feature set in advance and deriving the second feature set from it, the feature set related to potential threat domain names can be quickly delineated within big data. Obtaining the set of suspected threat domain names from the second feature set greatly reduces the amount of data to be processed, which effectively improves the efficiency with which the front end identifies threat domain names and enables effective analysis of network-wide logs.

It should be pointed out that, in the embodiments of this application, the random forest algorithm is used to obtain, through training, the effective features associated with threat domain names, which effectively avoids interference from features unrelated to threat domain names. Because the random forest algorithm introduces two sources of randomness (the random selection of samples for each new training set and the random selection of frequent features), it is not prone to overfitting, and the effective features selected by voting are relatively accurate. The dynamic and static features of threat domain names are extracted, and the Apriori association rule algorithm is used to mine feature sets of potential threat domain names in mixed scenarios, which fully reflects the static and dynamic patterns of domain names. A feature set composed of multiple effective features fits unknown domain names more accurately and improves the ability to identify them. Mining the massive data of a mobile communication network according to the processing flow of this application makes it possible to delineate the range of potential threat domain names in big data quickly and accurately, thereby improving the efficiency of domain name identification.

It should be noted that, in this document, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that includes the element.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific implementations described above, which are merely illustrative rather than restrictive. Guided by the present invention, a person of ordinary skill in the art may devise many further forms without departing from the spirit of the present invention and the scope protected by the claims, all of which fall within the protection of the present invention.

Claims

1. A method for mining potential threat domain names, characterized by comprising:

acquiring a second feature set based on a pre-acquired first feature set, wherein the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features;

obtaining a set of suspected threat domain names by performing association calculation on the features in the second feature set;

wherein the dynamic scene features include at least one of communication features and associated features, the communication features including a domain generation algorithm (DGA) attribute and the associated features including at least one of a domain name return code attribute, a domain name IP region attribute and a URL return code attribute; the static scene features include at least one of behavioral features and fingerprint features, the behavioral features including at least one of the number of visiting users, the dispersion of the number of visiting users, the proportion of the maximum URL visit count of the domain name, the number of URLs accessed by a single user and the domain name survival time, and the fingerprint features including at least one of a terminal operating system attribute and an access carrier attribute; and wherein the DGA attribute specifically refers to the computed degree of random dispersion of the characters of the domain name, that is, the DGA attribute is the ratio of the entropy value of the domain name to the vowel value of the domain name, where the entropy value refers to the degree of character randomness and the vowel value of the domain name refers to the ratio of the number of vowels in the domain name to the total length of the domain name.

2. The mining method according to claim 1, characterized in that acquiring the second feature set based on the pre-acquired first feature set comprises: training on the feature data in the pre-acquired first feature set using a random forest algorithm to obtain a second feature set related to potential threat domain names.

3. The mining method according to claim 1, characterized in that, before the second feature set is acquired based on the pre-acquired first feature set, the method further comprises:

acquiring operator-side data;

obtaining, based on the acquired operator-side data, a first feature set related to potential threat domain names;

wherein the first feature set includes at least two of: the number of domain name IPs, the number of visiting users, the number of visits, the number of URLs, the maximum number of visiting users, the average number of URLs per user, the average number of users per URL, the average number of visits per URL, the standard deviation of the number of users, the standard deviation of the number of visits, the dispersion coefficient of the number of URL visits, the dispersion of the number of visiting users, the frequency with which a single user visits the domain name on the day, the number of visiting users, the proportion of the maximum URL visit count of the domain name, the number of URLs accessed by a single user, the domain name survival time, the terminal operating system attribute, the access carrier attribute, the domain name return code attribute, the IP region attribute, the URL return code attribute, and the DGA attribute.

4. The mining method according to any one of claims 1-3, wherein the obtaining of the collection of suspected threat domain names by performing association analysis on features in the second feature set comprises:

Determining evaluation criteria for a feature set of potentially threatening domain names, where the evaluation criteria include at least one of support and confidence;

Based on the determined evaluation criteria and the Apriori algorithm, an association analysis is performed on the features in the second feature set to determine a set of suspected threat domain names.

5. The mining method according to claim 4, characterized in that performing association analysis on the features in the second feature set, based on the determined evaluation criteria and the Apriori algorithm, to determine the set of suspected threat domain names comprises:

joining the initial effective feature data of the features in the second feature set, and pruning the resulting candidate 1-itemsets according to preset evaluation criteria to obtain frequent 1-itemsets;

joining the elements of the frequent 1-itemsets to obtain candidate 2-itemsets, and pruning the elements of the candidate 2-itemsets that fall below the preset evaluation criteria to obtain frequent 2-itemsets;

for the frequent K-itemsets, iterating according to the above process until no frequent (K+1)-itemset can be found, and determining the resulting frequent K-itemsets as the set of suspected threat domain names, where K is a positive integer and K ≥ 2.

6. The mining method according to claim 1, characterized in that the method further comprises:

Obtain the dynamic and static characteristics of the domain name data in the monitoring data source;

The acquired dynamic and static features of the domain name data are compared with the acquired set of suspected threat domain names to obtain a domain name detection result.

7. A system for mining potentially threatening domain names, comprising:

an acquisition module, configured to acquire a second feature set based on a pre-acquired first feature set, wherein the first feature set is a feature set related to potential threat domain names, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features;

a processing module, configured to obtain a set of suspected threat domain names by performing association calculation on the features in the second feature set;

8. A server, characterized in that, comprising:

9. A computer-readable storage medium, wherein a program is stored in the computer-readable storage medium, and when the program is executed, the following process is implemented: