CN105897714B - Botnet detection method based on DNS traffic characteristics

Info

Publication number: CN105897714B
Application number: CN201610222674.5A
Authority: CN
Inventors: 喻梅; 李鑫; 于健; 王建荣; 赵越; 雷霆
Original assignee: Tianjin University
Current assignee: Ningbo Zhiwei Ruichi Information Technology Co ltd
Priority date: 2016-04-11
Filing date: 2016-04-11
Publication date: 2018-11-09
Anticipated expiration: 2036-04-11
Also published as: CN105897714A

Abstract

A botnet detection method based on DNS traffic characteristics, comprising two parts. The Domain-Flux botnet detection method: combine legitimate main domain names and illegitimate main domain names into a target set; process the set and select the domain names longer than 6 characters as research objects; compute the domain name entropy, word-formation feature, phonetic feature and grouping feature of each name; and feed these features into a random forest classifier to obtain a trained model. The Fast-Flux botnet detection method, which builds on the Domain-Flux method: process the raw data of the DNS server; evaluate the preprocessed domain names with the trained model obtained above to get a DGA score; score the domain names and IPs against a whitelist, a blacklist and a graylist; compute the time characteristics of the IP addresses; compute the stability of the IP addresses; and feed these features into a random forest classifier to obtain the trained model SFF. The method achieves a relatively high accuracy in experiments.

Description

Botnet detection method based on DNS traffic characteristics

Technical field

The invention relates to DNS domain name technology and machine-learning classification algorithms, and in particular to a botnet detection method based on DNS traffic characteristics.

Background

Current domain name generation techniques mainly include:

(1) Domain-Flux: Domain-Flux refers to continuously changing and assigning multiple domain names to one or more IP addresses.

(2) Fast-Flux: this technique has two variants, Single-Flux and Double-Flux.

Single-Flux can be compared to the Tor network. In a Single-Flux botnet every bot host acts as a redirection node, and addressing is carried out through redirection across different bot hosts. This prevents any single node from affecting the whole botnet and also makes the botnet difficult for researchers to trace.

Compared with Single-Flux, Double-Flux adds a controllable DNS service layer: the controller manages the modification and publication of domain names directly instead of relying on the resolution service of a public domain name provider. The resolution servers are part of the Double-Flux architecture, and their addresses also change continuously.

Classification algorithms are widely used in many fields, especially data mining, where they are generally implemented with statistical models from probability theory. Commonly used classifiers include:

(1) Decision tree: the decision tree algorithm is a common technique for classification and prediction. It infers a set of classification rules from known instance data that is irregular and unordered.

(2) Random forest: a random forest is essentially a classifier made up of multiple decision trees, which are built independently of one another. Once the forest is built, an input sample is classified by the decision trees in the forest, each tree routing it down to a leaf node, and the class of the leaf node reached is taken as the prediction for the sample.
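In the usual formulation of a random forest, the predictions of the individual trees are combined by majority vote. The following is a minimal sketch of that voting step only; the three "trees" are hand-written stand-ins for trained trees, and their feature names and thresholds are hypothetical values chosen for illustration.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees: each is a single split.
trees = [
    lambda s: "dga" if s["entropy"] > 3.0 else "benign",
    lambda s: "dga" if s["vowel_ratio"] < 0.2 else "benign",
    lambda s: "dga" if s["groups"] > 2 else "benign",
]

sample = {"entropy": 3.4, "vowel_ratio": 0.1, "groups": 1}
print(forest_predict(trees, sample))
```

Two of the three stand-in trees vote for the same class here, so that class wins even though the third tree disagrees.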

Botnets merge traditional malware techniques such as network worms, Trojan backdoors and viruses with newer techniques, and have become one of the more widespread and stealthy forms of malicious code. An operator with malicious intent spreads a bot program widely to form a botnet, then carries out various attacks through commands and a control channel. Botnet platforms are now mature, which makes attacks more profitable for the attackers.

In recent years, researchers abroad have proposed a new detection approach for botnets: analysis of DNS data flows. Most existing detection methods based on DNS data flows are validated on simulated botnets rather than tested on real network traffic. In addition, the data sets used in those tests are relatively small and do not represent the characteristics of traffic in a real network.

Summary of the invention

The technical problem to be solved by the invention is to provide a botnet detection method based on DNS traffic characteristics. By extracting and analyzing the query traffic of Domain Name System (DNS) servers, the method identifies the command and control (C&C) servers behind the query strategies of two kinds of botnets with distinctive domain name resolution behavior, Domain-Flux and Fast-Flux.

The technical solution adopted by the invention is a Domain-Flux botnet detection method based on DNS traffic characteristics, comprising the following steps:

1) Read domain names: read legitimate domain names and extract the legitimate main domain names, read illegitimate domain names generated by a domain generation algorithm (DGA) and extract the illegitimate main domain names, then combine the legitimate and illegitimate main domain names into a target set;

2) Process the target set, compute the length of each processed domain name, and select the domain names longer than 6 characters as research objects;

3) Compute the domain name entropy, the word-formation feature, the phonetic feature and the grouping feature, which are used to identify random domain names generated by a DGA;

4) Split the computed domain name entropy, word-formation feature, phonetic feature and grouping feature into a training set and a test set, then feed them into a random forest classifier to obtain the trained model mDGA.

The processing in step 2) includes removing junk data and splitting each record at the comma into a serial number part and a domain name part.
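The processing in step 2) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the input is assumed to be text with one `serial,domain` record per line, "junk data" is taken to mean rows that do not have exactly two fields or whose serial number is not numeric, and the length is measured on the whole domain string. The function name and sample data are hypothetical.

```python
import csv
import io

MIN_LENGTH = 6  # step 2) keeps names whose length is greater than 6


def load_candidates(text):
    """Parse 'serial,domain' rows and keep domain names longer than MIN_LENGTH."""
    kept = []
    for row in csv.reader(io.StringIO(text)):
        # Assumed junk filter: wrong field count or a non-numeric serial number.
        if len(row) != 2 or not row[0].strip().isdigit():
            continue
        domain = row[1].strip().lower()
        if len(domain) > MIN_LENGTH:
            kept.append(domain)
    return kept


sample = "1,google.com\n2,qq.com\nnot a record\n3,xkqzjvtrwp.net\n"
print(load_candidates(sample))
```

In the sample, `qq.com` is dropped because its length is exactly 6, and the malformed line is dropped by the junk filter.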

The domain name entropy in step 3) is computed as the Shannon information entropy of the name, as shown in the following formula:

$$E = -\sum_{i} \frac{C_i}{L} \log \frac{C_i}{L} \qquad (1)$$

where E is the Shannon information entropy of the domain name, i.e. a measure of how dispersed the different characters in the string are, L is the length of the string, and C_i is the number of occurrences of character i, with i ranging over the distinct characters that appear in the string;
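As an illustration of the entropy feature, the sketch below computes the Shannon entropy of the characters in a name from the counts C_i and the length L. The base-2 logarithm and the function name are assumptions; the text does not fix the logarithm base.

```python
import math
from collections import Counter


def domain_entropy(name):
    """Shannon entropy of the characters in `name`, using log base 2 (assumed)."""
    length = len(name)
    if length == 0:
        return 0.0
    counts = Counter(name)  # C_i for every distinct character i
    return sum(-(c / length) * math.log2(c / length) for c in counts.values())


print(domain_entropy("aaaaaaaa"))  # a single repeated character
print(domain_entropy("abcdefgh"))  # eight distinct characters
```

A name built from one repeated character has the lowest possible entropy, while a name in which every character is different has the highest entropy for its length, which is why randomly generated names tend to score high.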

The word-formation feature is computed with the basic N-gram model, which evaluates the probability that a sentence occurs. A domain name is represented as a sequence S = w_1 w_2 w_3 ... w_n, and the probability p(S) of the domain name, i.e. the word-formation feature, is expressed by the following formula:

$$p(S) = p(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} p(w_i \mid w_1 w_2 \dots w_{i-1}) \qquad (2)$$

where p(w_i | w_1 ... w_{i-1}) is the probability that the i-th letter w_i occurs given the letters before it, n is the length of the sequence S, and i indexes the letters that appear in the string;
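One way to make the word-formation feature concrete is a character bigram model, a common N-gram simplification in which each letter is conditioned only on the letter before it. The sketch below is an assumption-laden illustration: the choice N = 2, the add-one smoothing, the start marker, the alphabet and the tiny training list are all hypothetical and not taken from the patent.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"
START = "^"  # hypothetical start-of-name marker


def train_bigrams(names):
    """Count character bigrams and their left contexts over known-good names."""
    pair_counts, context_counts = Counter(), Counter()
    for name in names:
        padded = START + name
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts


def log_prob(name, model):
    """log p(S) under the bigram approximation with add-one smoothing."""
    pair_counts, context_counts = model
    vocab = len(ALPHABET)
    padded = START + name
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab)
        total += math.log(p)
    return total


model = train_bigrams(["google", "facebook", "wikipedia", "amazon", "youtube"])
print(log_prob("goobook", model) / len("goobook"))  # per-character log-probability
print(log_prob("xqzvkjw", model) / len("xqzvkjw"))
```

Working in log space avoids numerical underflow for long names, and dividing by the length keeps names of different lengths comparable. A pronounceable name built from familiar letter pairs scores higher than a random-looking string of the same length.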

The phonetic feature takes the ratio of the number of vowels to the total length of the domain name as the statistical characteristic of the vowels, as shown in the following formula:

where d_i is the number of occurrences of the vowels, L is the length of the string, and E is the resulting vowel feature, referred to as the entropy of the vowels;
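Following the prose description of the phonetic feature (number of vowels divided by the total length), a minimal sketch could look like this. It assumes the vowel set a, e, i, o, u and a case-insensitive comparison, neither of which is stated in the text.

```python
VOWELS = set("aeiou")  # assumed vowel set


def vowel_ratio(name):
    """Fraction of characters in `name` that are vowels."""
    letters = name.lower()
    if not letters:
        return 0.0
    return sum(ch in VOWELS for ch in letters) / len(letters)


print(vowel_ratio("google"))
print(vowel_ratio("xkqzjv"))
```

Names meant to be read by people usually contain a steady share of vowels, whereas algorithmically generated strings often contain few or none.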

The grouping feature is the number of parts into which each domain name is divided when it is split into runs of digits and runs of letters.
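The grouping feature can be read as counting the maximal runs of letters and of digits in a name. The sketch below follows that reading, which is an interpretation of the text; treating any other character (such as a hyphen) as a separator is an additional assumption.

```python
import re


def group_count(name):
    """Number of maximal letter-only or digit-only runs in `name`."""
    return len(re.findall(r"[a-z]+|[0-9]+", name.lower()))


print(group_count("google"))
print(group_count("abc123def"))
print(group_count("a1b2c3"))
```

Names that alternate frequently between letters and digits produce many short groups, a pattern that is unusual for names chosen by people.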

A Fast-Flux botnet detection method based on the above Domain-Flux botnet detection method based on DNS traffic characteristics, comprising the following steps:

1) Process the raw data of the DNS server with the Passivedns tool, keep only the A records returned by the DNS server, and preprocess the raw data;

2) Evaluate the preprocessed domain names with the trained model obtained from the random forest classifier in the Domain-Flux botnet detection method, and obtain a DGA score;

3) Score the domain names and IPs against a whitelist, a blacklist and a graylist, then cross-score to obtain the confidence of each domain name, where the whitelist holds the domain names and IPs of the main sites of servers known to be safe, the graylist holds the domain names and IPs of public cloud services provided by companies with a certain degree of credibility, and the blacklist holds malicious domain names and IPs confirmed to be controlled by botnet owners;

4) Compute the time characteristics of the IP addresses;

5) Compute the stability of the IP addresses;

6) Split the DGA score, the domain name confidence, and the time characteristics and stability of the IP addresses into a training set and a test set, then feed them into a random forest classifier to obtain the trained model SFF.

The preprocessing in step 1) includes processing the domain names and IPs, and matching each IP to its AS information using the AS data from MaxMind as the lookup dictionary for AS numbers.
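The filtering and AS matching of step 1) might be sketched as below. Several things here are assumptions: the log layout is taken to be nine `||`-separated fields (timestamp, client, server, class, query, type, answer, TTL, count), which should be checked against the Passivedns version in use; `AS_TABLE` is a hypothetical in-memory stand-in for the MaxMind AS data; and the addresses and AS numbers come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical stand-in for an AS lookup dictionary (network -> AS number).
AS_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}


def lookup_asn(ip):
    """Return the AS number whose network contains `ip`, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, asn in AS_TABLE.items():
        if addr in network:
            return asn
    return None


def parse_a_records(lines):
    """Yield (domain, ip, asn) for A-record answers only."""
    for line in lines:
        fields = line.strip().split("||")
        # Skip malformed lines and every record type other than A.
        if len(fields) != 9 or fields[5] != "A":
            continue
        domain = fields[4].rstrip(".").lower()
        ip = fields[6]
        yield domain, ip, lookup_asn(ip)


log = [
    "1460361600.000001||10.0.0.5||10.0.0.1||IN||example.com.||A||192.0.2.10||300||1",
    "1460361601.000002||10.0.0.5||10.0.0.1||IN||example.com.||NS||ns1.example.com.||300||1",
]
print(list(parse_a_records(log)))
```

A linear scan over the table is enough for a sketch; a real AS database has far too many prefixes for that and would need an indexed lookup.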

The scoring of domain names and IPs in step 3) works as follows. If the domain name is in the whitelist, its confidence is increased by 1; if it is in the blacklist, its confidence is decreased by 1; if it is in the graylist, its confidence is increased by 0.5. The result is recorded as P_with. Likewise, if the IP is in the whitelist, its confidence is increased by 1; if it is in the blacklist, decreased by 1; if it is in the graylist, increased by 0.5. The result is recorded as P_geo. If the domain name or IP is in none of the three lists, its confidence is zero.
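The list-based scoring rule can be expressed directly in code. In this sketch the function name and the example list contents are hypothetical; the increments (+1, -1, +0.5) and the zero default are the ones given in the text. The same function serves for both domain names (yielding P_with) and IPs (yielding P_geo).

```python
def list_score(item, whitelist, blacklist, graylist):
    """Confidence from list membership: +1 white, -1 black, +0.5 gray, else 0."""
    score = 0.0
    if item in whitelist:
        score += 1.0
    if item in blacklist:
        score -= 1.0
    if item in graylist:
        score += 0.5
    return score


# Hypothetical list contents for illustration only.
white = {"example.com", "192.0.2.10"}
black = {"bad.example", "198.51.100.66"}
gray = {"cdn.example"}

p_with = list_score("example.com", white, black, gray)   # domain score
p_geo = list_score("198.51.100.66", white, black, gray)  # IP score
print(p_with, p_geo)
```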

The cross-scoring in step 3) combines the scores of the domain name and the IP to obtain the confidence of the domain name:

$$P_{domain} = \lambda P_{geo} + \mu P_{with} \qquad (4)$$

where P_domain is the resulting confidence of the domain name, P_geo is the score for whether the IP is in the three lists, P_with is the score for whether the domain name is in the three lists, λ is the weight of P_geo, and μ is the weight of P_with.
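Formula (4) is a weighted sum, sketched below. The text does not give values for the weights λ and μ, so the equal defaults of 0.5 are placeholders and purely an assumption.

```python
def domain_confidence(p_geo, p_with, lam=0.5, mu=0.5):
    """P_domain = lam * P_geo + mu * P_with; the default weights are placeholders."""
    return lam * p_geo + mu * p_with


# A whitelisted domain (P_with = 1) resolving to a blacklisted IP (P_geo = -1).
print(domain_confidence(p_geo=-1.0, p_with=1.0))
```

With equal weights, a trusted domain name that resolves to a known-bad IP ends up with a neutral confidence; giving λ a larger value would let the IP evidence dominate.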

The time characteristic of the IP in step 4) refers to the number of times the domain name has been resolved to the IP.

The computation of the statistical characteristics of the IP addresses in step 4) comprises:

Computing the numerical characteristics of the IP addresses and the distribution characteristics of the autonomous systems (AS) corresponding to the IP addresses, using the following formula:

In the formula, when computing the numerical characteristics of the IP addresses, X is an IP address corresponding to the domain name, α is the mean of the IP addresses, and N is the number of IP addresses corresponding to the domain name; when computing the distribution characteristics of the autonomous systems, X is an autonomous system corresponding to the domain name, α is the mean of the autonomous systems, and N is the number of autonomous systems.
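The formula referred to above is not reproduced in this text. The symbols it defines (values X, their mean α and their count N) are consistent with a dispersion measure, so the sketch below uses the population standard deviation. That choice is an assumption, not the patent's confirmed formula, as is converting each IPv4 address to an integer before averaging.

```python
import ipaddress
import math


def dispersion(values):
    """Population standard deviation: sqrt(sum((X - mean)^2) / N). Assumed form."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def ip_dispersion(ips):
    """Dispersion of the addresses after converting each one to an integer."""
    return dispersion([int(ipaddress.ip_address(ip)) for ip in ips])


print(ip_dispersion(["192.0.2.1", "192.0.2.3"]))       # neighbours in one network
print(ip_dispersion(["192.0.2.1", "198.51.100.200"]))  # addresses far apart
```

The same `dispersion` function can be applied to the AS numbers behind a domain name. Under this reading, a domain whose addresses are scattered across many networks yields a much larger value than one served from a single network.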

The stability of the IP addresses in step 5) is computed by combining the number of queries, the statistical characteristics of the IP addresses and the characteristics of DNS queries, which gives the following formula for IP address stability:

where S is the IP address stability, C_ip and C_hit are the number of IP addresses obtained and the number of queries respectively, both subject to the same threshold c_th, which is the upper limit on the number of IP addresses or queries, and c_ip and c_hit are the values of the number of IP addresses and the number of queries after the threshold is applied, i.e. capped at c_th.

The botnet detection method based on DNS traffic characteristics of the invention offers a new approach to botnet detection. It applies pronunciation and word-formation features to the detection of malicious DGA domain names and proposes a confidence-based evaluation criterion, which broadens the available means of botnet detection and improves detection accuracy. When computing the false positive rate and the false negative rate, the invention uses average values, so that false positives and false negatives can easily be extracted for further analysis and evaluation. The accuracy achieved in experiments is relatively high.

Description of drawings

Fig. 1 is a flow chart of the Domain-Flux botnet detection method based on DNS traffic characteristics of the invention;

Fig. 2 compares the accuracy of detecting DGA domain names with the mDGA model of the invention and with the existing tDGA;

Fig. 3 shows the Fast-Flux botnet detection method based on DNS traffic characteristics of the invention.

Detailed description

The botnet detection method based on DNS traffic characteristics of the invention is described in detail below with reference to the embodiments and the drawings.

The botnet detection method based on DNS traffic characteristics of the invention comprises a Domain-Flux botnet detection method and a Fast-Flux botnet detection method, both based on DNS traffic characteristics.

As shown in Fig. 1, the Domain-Flux botnet detection method based on DNS traffic characteristics of the invention comprises the following steps:

2) Process the target set, compute the length of each processed domain name, and select the domain names longer than 6 characters as research objects. The processing includes removing junk data and splitting each record at the comma into a serial number part and a domain name part;

3) Compute the domain name entropy, the word-formation feature, the phonetic feature and the grouping feature, which are used to identify random domain names generated by a DGA, wherein:

The domain name entropy is computed as the Shannon information entropy of the name, as shown in the following formula:

$$E = -\sum_{i} \frac{C_i}{L} \log \frac{C_i}{L} \qquad (1)$$

where E is the Shannon information entropy of the domain name, i.e. a measure of how dispersed the different characters in the string are, L is the length of the string, and C_i is the number of occurrences of character i, with i ranging over the distinct characters that appear in the string.

The word-formation feature is computed with the basic N-gram model, which evaluates the probability that a sentence occurs. A domain name is represented as a sequence S = w_1 w_2 w_3 ... w_n, and the probability p(S) of the domain name, i.e. the word-formation feature, is expressed by the following formula:

As shown in Fig. 3, the Fast-Flux botnet detection method based on DNS traffic characteristics of the invention builds on the Domain-Flux botnet detection method described above. Working from DNS traffic data, the invention extends existing IP address analysis with the number and entropy of fuzzy autonomous systems, and proposes a method for computing an IP address confidence score and a method for computing the confidence of a domain name. The method comprises the following steps:

The preprocessing includes processing the domain names and IPs, and matching each IP to its AS information using the AS data from MaxMind as the lookup dictionary for AS numbers.

3) Score the domain names and IPs against the whitelist, blacklist and graylist, then cross-score to obtain the confidence of each domain name, wherein:

The whitelist holds the domain names and IPs of the main sites of servers known to be safe, the graylist holds the domain names and IPs of public cloud services provided by companies with a certain degree of credibility, and the blacklist holds malicious domain names and IPs confirmed to be controlled by botnet owners;

The scoring of domain names and IPs works as follows. If the domain name is in the whitelist, its confidence is increased by 1; if it is in the blacklist, its confidence is decreased by 1; if it is in the graylist, its confidence is increased by 0.5. The result is recorded as P_with. Likewise, if the IP is in the whitelist, its confidence is increased by 1; if it is in the blacklist, decreased by 1; if it is in the graylist, increased by 0.5. The result is recorded as P_geo. If the domain name or IP is in none of the three lists, its confidence is zero.

The cross-scoring combines the scores of the domain name and the IP to obtain the confidence of the domain name:

$$P_{domain} = \lambda P_{geo} + \mu P_{with} \qquad (4)$$

4) Compute the time characteristics of the IP addresses, wherein:

The time characteristic of the IP refers to the number of times the domain name has been resolved to the IP.

The computation of the statistical characteristics of the IP addresses comprises:

5) Compute the stability of the IP addresses. The stability is computed by combining the number of queries, the statistical characteristics of the IP addresses and the characteristics of DNS queries, which gives the following formula for IP address stability:

The botnet detection method based on DNS traffic characteristics of the invention targets the domain name query behavior of two kinds of botnets, Domain-Flux and Fast-Flux, and proposes a DNS-traffic-based detection method for each. It offers a new approach to botnet detection, applies pronunciation and word-formation features to the detection of malicious DGA domain names, and proposes a confidence-based evaluation criterion. This broadens the available means of botnet detection and improves detection accuracy.

The method of the invention is evaluated and compared with the original algorithm below.

For the Domain-Flux botnet, the evaluation criteria are the prediction accuracy, i.e. the proportion of correctly predicted samples among all samples, and the indicators of the confusion matrix. The baseline in the comparison is the DGA detection algorithm that does not consider the phonetic and grouping features, denoted tDGA; the algorithm of the invention is denoted mDGA.

The accuracy of the invention and of the original algorithm is compared first. The experiments use the k-fold method with k set to the empirical value 10. The false positive rate and the false negative rate are computed as averages, so that false positives and false negatives can easily be extracted for further analysis and evaluation.

The experimental results are shown in Fig. 2. The mDGA algorithm of the invention is more accurate than the original tDGA algorithm, but the improvement is small: accuracy rises by only 0.2%. This is because the test set is large and the ratio of normal domain names to DGA domain names is about 3:1, so DGA domain names are comparatively few.

For the Fast-Flux botnet, the evaluation criterion is accuracy. The average score can be obtained either by running the k-fold method several times or by repeatedly splitting the data at random into a training set and a test set. The invention uses both approaches to evaluate the experimental results.

The SFF model of the invention was trained with 10-fold cross-validation, each run producing 10 scores, and the procedure was repeated six times. In each fold, 90% of the data is used as the training set to train the random forest and the remaining 10% is used as the test set. The results are shown in Table 1.
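The 90%/10% split used in 10-fold cross-validation can be sketched in plain Python as follows. The helper names, the fixed random seed and the sample count are hypothetical; training and scoring the classifier on each fold is left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def mean_score(scores):
    """Average of the per-fold scores."""
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With k = 10, every fold trains on nine tenths of the samples and tests on the remaining tenth, and averaging the ten per-fold scores gives one result per run. Repeating the run with a different seed corresponds to the repetitions described above.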

Table 1. Scoring results using the SFF model

The experimental results show a relatively high accuracy. Because of the nature of DNS, queried domain names are widely scattered and numerous while malicious domain names are comparatively rare, so the samples contain far more whitelist and graylist data than blacklist data. The high accuracy may therefore be due to the whitelist data being classified accurately.

Etienne Stalmans' fastfluxanalysis project was used for comparison; it provides a set of algorithms for deciding whether a domain name is a Fast-Flux domain name. The comparison shows that using geographic location information alone, without considering the correlation between IPs and domain names, impairs the detection of botnet C&C nodes. This indicates that the IP and domain name confidence proposed by the invention improves the detection of botnet control nodes to some extent.

Claims

1. A Domain-Flux botnet detection method based on DNS traffic characteristics, characterized by comprising the following steps:

1) reading domain names, including reading legitimate domain names and extracting the legitimate main domain names, reading illegitimate domain names generated by a domain generation algorithm (DGA) and extracting the illegitimate main domain names, and combining the legitimate main domain names and the illegitimate main domain names to form a target set;

2) processing the obtained target set, extracting the length of each processed domain name, and selecting the domain names with a length greater than 6 as research objects;

3) respectively calculating a domain name entropy value, a word-formation feature, a phonetic feature and a grouping feature to identify random domain names generated by the DGA;

wherein the domain name entropy value is calculated by the Shannon information entropy method, as shown in the following formula:

$$E = -\sum_{i} \frac{C_i}{L} \log \frac{C_i}{L} \qquad (1)$$

wherein E is the Shannon information entropy of the domain name, i.e. the dispersion of the different characters in the character string, L is the length of the string, and C_i is the number of occurrences of the letter i, the letter i being a variable that represents a letter appearing in the character string;

the word-formation feature is calculated using the basic N-gram model, which evaluates the probability of a sentence, a domain name being expressed as a sequence S = w_1 w_2 w_3 ... w_n, so that the probability p(S) of the domain name, i.e. the word-formation feature, is expressed as the following formula:

$$p(S) = p(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} p(w_i \mid w_1 w_2 \dots w_{i-1}) \qquad (2)$$

wherein p(w_i | w_1 ... w_{i-1}) is the probability that the i-th letter w_i occurs given the letters before it, n is the length of the sequence S, and the letter i is a variable that represents a letter appearing in the character string;

the phonetic feature takes the ratio of the number of vowels to the total length of the domain name as the statistical characteristic of the vowels, i.e. the phonetic feature, as shown in the following formula:

wherein d_i represents the number of occurrences of the vowels, L is the length of the character string, and E is the entropy of the vowels;

the grouping feature is the number of parts into which each domain name is divided according to digits and letters;

4) dividing the obtained domain name entropy value, word-formation feature, phonetic feature and grouping feature into a training set and a test set, and then putting them into a random forest classifier to obtain a trained model mDGA.

2. The Domain-Flux botnet detection method based on DNS traffic characteristics according to claim 1, wherein the processing in step 2) includes removing junk data and splitting the data at the comma into a serial number part and a domain name part.

3. A Fast-Flux botnet detection method based on the Domain-Flux botnet detection method based on DNS traffic characteristics of claim 1, comprising the steps of:

1) processing the raw data of the DNS server using the Passivedns tool, retaining only the A records returned by the DNS server, and preprocessing the raw data;

2) evaluating the preprocessed domain names using the trained model obtained by the random forest classifier in the Domain-Flux botnet detection method to obtain a DGA score;

3) scoring the domain names and IPs using a whitelist, a blacklist and a graylist, and then cross-scoring to obtain the confidence of each domain name, wherein the whitelist represents the domain names and IPs of the main sites of servers known to be safe, the graylist stores the domain names and IPs of public cloud services provided by companies with a certain degree of credibility, and the blacklist stores malicious domain names and IPs determined to be controlled by botnet owners;

4) calculating the time characteristics of the IP addresses;

5) calculating the stability of the IP addresses;

6) dividing the obtained DGA score, domain name confidence, and time characteristics and stability of the IP addresses into a training set and a test set, and then putting them into a random forest classifier to obtain a trained model SFF.

4. The method of claim 3, wherein the preprocessing of step 1) comprises processing the domain names and IPs, and using the AS information of MaxMind as the AS number lookup dictionary to match AS information to the IPs.

5. The method of claim 3, wherein the scoring of the domain name and the IP in step 3) comprises: adding 1 to the confidence if the domain name exists in the whitelist, subtracting 1 from the confidence if the domain name exists in the blacklist, adding 0.5 to the confidence if the domain name exists in the graylist, and recording the result as P_with; adding 1 to the confidence if the IP exists in the whitelist, subtracting 1 from the confidence if the IP exists in the blacklist, adding 0.5 to the confidence if the IP exists in the graylist, and recording the result as P_geo; the confidence being zero if the domain name and IP are not in the whitelist, blacklist or graylist.

6. The Domain-Flux botnet detection method of claim 3, wherein the cross-scoring of step 3) is performed by combining the scores of the domain name and the IP to obtain the confidence of the domain name:

$$P_{domain} = \lambda P_{geo} + \mu P_{with} \qquad (4)$$

wherein P_domain is expressed as the IP confidence, P_geo and P_with respectively indicate the result of whether the IP is in the three lists and the result of whether the domain name is in the three lists, λ is the weight of P_geo, and μ is the weight of P_with.

7. The method of claim 3, wherein the time characteristic of the IP in step 4) is the number of resolutions of the domain name to the IP.

8. The Domain-Flux botnet detection method according to claim 3, wherein the calculating of the statistical characteristics of the IP addresses in step 4) comprises:

respectively calculating the numerical characteristics of the IP addresses and the distribution characteristics of the autonomous systems corresponding to the IP addresses by using the following formula:

wherein, when the numerical characteristics of the IP addresses are calculated, X represents an IP address corresponding to the domain name, α represents the average value of the IP addresses, and N represents the number of IP addresses corresponding to the domain name; and when the distribution characteristics of the autonomous systems corresponding to the IP addresses are calculated, X represents an autonomous system corresponding to the domain name, α represents the average value of the autonomous systems, and N represents the number of autonomous systems.

9. The Domain-Flux botnet detection method according to claim 3, wherein the stability of the IP addresses in step 5) is calculated by combining the number of queries, the statistical characteristics of the IP addresses and the characteristics of DNS queries, giving the following formula for IP address stability:

wherein S represents the IP address stability, C_ip and C_hit respectively represent the number of IP addresses obtained and the number of queries, C_ip and C_hit both being subject to the same threshold c_th, i.e. the upper limit on the number of IP addresses or the number of queries, and c_ip and c_hit are respectively the values of the number of IP addresses and the number of queries under the threshold c_th, i.e. with the threshold c_th as their upper limit.