WO2019228613A1

WO2019228613A1 - Device and method for detecting malicious domain names

Info

Publication number: WO2019228613A1
Application number: PCT/EP2018/064092
Authority: WO
Inventors: Dmitry MEYTIN; Elad TZOREFF
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2018-05-29
Filing date: 2018-05-29
Publication date: 2019-12-05
Also published as: CN112204930B; CN112204930A

Abstract

The present invention relates to the detection of malicious domain names, particularly those generated by a Domain Generation Algorithm. To this end, the present invention provides a device, system, and method. The device is configured to receive, as an input, a Fully-Qualified Domain Name (FQDN) and a public suffix index. The device can determine a public suffix sequence and a domain characters sequence in the FQDN based on the public suffix index. Then, the device is configured to process the public suffix sequence to obtain a first result indicative of whether the FQDN is malicious or not, to process the domain characters sequence to obtain a second result indicative of whether the FQDN is malicious or not, and to merge the first result and the second result and determine based on the merged result whether the FQDN is malicious or not.

Description

DEVICE AND METHOD FOR DETECTING MALICIOUS DOMAIN NAMES

TECHNICAL FIELD

The present invention relates generally to malware detection, and particularly to the detection of malicious domain names. More specifically, the invention concerns identifying malicious domain names produced by a Domain Generation Algorithm (DGA). To this end, the present invention proposes a device, system and method for detecting such malicious domain names.

BACKGROUND

Many botnets, Trojans and other new malware families use DGAs to generate a large number of domain names to connect to a command and control (C&C) server. Older families of malware relied on static lists of domains or IP addresses that were hardcoded in the malware code running on the infected hosts. Once a given malware was discovered, it could then be neutralized by blocking the connections to these network addresses, in order to prevent further communications between the infected hosts and the C&C server. However, starting from the Kraken botnet (released in 2008), the newer families of malware started using DGAs to circumvent such takedown attempts. Instead of relying on a fixed list of domains or IP addresses, such malware executes an algorithm generating a large number (up to tens of thousands per day) of possible domain names, and attempts to connect to a portion of these generated domains until it finds a working server.

Detecting and blocking such newer malware families using DGAs presents several challenges:

• Each DGA uses its own grammar and its own seeding mechanism (time, currency exchange rate and more).

• Some DGAs use combinations of known (e.g. English) words (e.g. abobehaven.net, actionfight.net, etc.).

• Some DGAs purposely collide with benign domains (wdmlmofa.net, yahoo.com, finlwx.com).

• The frequency of Domain Name System (DNS) lookup queries can vary significantly.

There are several possible techniques to identify malicious domains:

• Domain blacklisting, which is a fully reactive approach with an almost zero rate of false positives.

• Heuristic approaches for identifying DGAs by modelling their lexical structures or by observing queries that point to non-existent domains. These heuristic approaches require data accumulation over large time windows, and are therefore of little help for real-time malware detection.

• Shallow machine learning based methods, such as a combination of clustering and classification algorithms. These methods use large sets of benign and malicious domains, in order to build a domain classifier.

• Deep Neural Network (DNN) based algorithms. These algorithms show the best performance and accuracy:

■ The first Recurrent Neural Network (RNN) based implementation of DGA detection proposed a one-hot based, unidirectional RNN using domain information only.

■ This implementation was then extended by implementing a bidirectional RNN, adding a dense feed-forward layer, and predicting the type of DGA (for instance, Suppobox).

■ DNN-based RNN and Convolutional Neural Network (CNN) models were also compared with shallow learning Random Forest models.

However, all of these techniques (including DNNs) generalize poorly to unseen DGAs and are, in essence, effective only at identifying previously seen attacks. Several types of DGAs cannot be identified by these techniques, even if they are present in the training set.

In summary, all of these techniques achieve very low detection rates for previously undetected and unseen DGAs.

SUMMARY

In view of the above-mentioned challenges, the present invention aims to improve the conventional methods and the mentioned techniques. The present invention has the objective to provide a device and method that are able to detect malicious domain names with a higher detection rate. In particular, they should be able to detect accurately even previously unseen DGAs. Furthermore, the detection of false positives should be reduced.

The objective of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims. The invention is generally based on the realization that a public suffix can be helpful for DGA identification.

A “public suffix” is a domain name under which internet users can (or historically could) directly register their own domain names (e.g. pvt.k12.ma.us).

The “public suffix list” is an initiative of Mozilla, but is maintained as a community resource. It allows browsers to, for example:

• Avoid privacy-damaging "supercookies" being set for high-level domain name suffixes.

• Highlight the most important part of a domain name in the user interface.

• Accurately sort history entries by site.

There are two major factors that influence DGA identification accuracy:

• Many DGAs hide behind known domains as a subdomain (e.g. dydns.org, mooo.com and others).

• Many web applications/services use pseudo-random subdomains for their own needs (e.g. kdsksksue.cdn.google.com).

For the first use-case, the usage of a public suffix allows learning the “language” of the subdomain separately, and obtaining the “bias” for the public suffix. For example, for the FQDN sdlsjdkjks.dydns.com, the separation of the subdomain and the public suffix creates two outputs: sdlsjdkjks and dydns.com. This allows learning separately the “language” model of sdlsjdkjks and the probability of dydns.com being used by a DGA.

For the second use-case, the subdomain can be omitted from the prediction. For example, for the FQDN kdsksksue.cdn.google.com the output will be google and com, since cdn.google.com is not a public suffix.
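The two use-cases above can be illustrated with a minimal Python sketch. It assumes the public suffix index is available as a plain set of suffix strings; the function name `split_fqdn`, the sample suffix set, and the fallback for names without a known suffix are hypothetical and not taken from the application. The real public suffix list also contains wildcard and exception rules, which are ignored here.

```python
def split_fqdn(fqdn, public_suffixes):
    """Split an FQDN into (domain characters, public suffix labels).

    `public_suffixes` is a hypothetical set of suffix strings standing in
    for the public suffix index; wildcard and exception rules are ignored.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in public_suffixes:
            # The label directly left of the suffix is the domain part;
            # anything further left (sub-domains) is dropped.
            domain = labels[i - 1] if i > 0 else ""
            return domain, labels[i:]
    # Assumption: with no known suffix, keep the whole name as the domain part.
    return ".".join(labels), []


suffixes = {"com", "dydns.com", "au", "edu.au", "act.edu.au"}
print(split_fqdn("sdlsjdkjks.dydns.com", suffixes))      # ('sdlsjdkjks', ['dydns', 'com'])
print(split_fqdn("kdsksksue.cdn.google.com", suffixes))  # ('google', ['com'])
```

In the first call the pseudo-random label is kept because dydns.com is treated as a public suffix; in the second call it is dropped because only com matches.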

In particular the present invention thus proposes detecting malicious domain names based on a public suffix. Further, the present invention employs particularly a deep neural network model built for processing of domain name and public suffix separately.

A first aspect of the present invention provides a device for detecting malicious domain names, the device being configured to receive, as an input, a FQDN and a public suffix index, determine a public suffix sequence and a domain characters sequence in the FQDN based on the public suffix index, process the public suffix sequence to obtain a first result indicative of whether the FQDN is malicious or not, process the domain characters sequence to obtain a second result indicative of whether the FQDN is malicious or not, and merge the first result and the second result and determine based on the merged result whether the FQDN is malicious or not.

By calculating the first result and the second result separately, and by then merging these two results to determine whether the domain name is malicious or not, the detection accuracy is much improved. In particular, even domain names generated by a DGA can be detected more accurately, and particularly with fewer false positives. Furthermore, by separating the domain name based on the public suffix index into the public suffix sequence and the domain characters sequence, the efficiency of the device is significantly improved. This is because the separation itself requires only little processing, and the calculation of the result based on the public suffix sequence is not complex either. Moreover, the domain characters sequence is thus as short as possible, i.e. the necessary processing is reduced.

In an implementation form of the first aspect, the device comprises a first Long Short-Term Memory (LSTM) network for processing the public suffix sequence, and/or a second LSTM network for processing the domain characters sequence.

Using two such LSTM networks on the respective sequences yields an efficient and accurate detection of the malicious domain names.

In a further implementation form of the first aspect, the first LSTM network and/or the second LSTM network is a Recurrent Neural Network.

Such RNNs are optimal for the algorithm provided by the device of the first aspect. They can efficiently process the two sequences separately. Thereby, they can be individually trained to reach higher detection accuracy.

In a further implementation form of the first aspect, for processing the public suffix sequence, the device is configured to compute a probability that the public suffix sequence and the domain character sequence are used for a malicious FQDN based on determined previous events.

This computation of the probability based on the previous events requires only little processing load, but is rather accurate.

In a further implementation form of the first aspect, the device is further configured to compute a probability that the public suffix sequence is used by a DGA.

Thus, the device of the first aspect is particularly suitable for detecting malicious domain names generated by DGAs.

In a further implementation form of the first aspect, the device is configured to receive, as an input, a training set for learning the determined previous events.

This allows the device of the first aspect to operate with an even higher detection accuracy. In particular, false positive detections can be better avoided.

In a further implementation form of the first aspect, for processing the domain characters sequence, the device is configured to calculate a probability that the domain characters sequence is used for a malicious FQDN based on a likelihood of one or more next characters in the sequence.

This leads to more accurate results. Further, since the domain characters sequence is as short as possible, the device efficiency is high.

In a further implementation form of the first aspect, for determining whether the FQDN is malicious or not, the device is configured to classify the merged result.

By using such a classification, the final determination of whether the domain name is malicious or not can be carried out accurately and fast.

A second aspect of the present invention provides a system for detecting malicious domain names, the system comprising a monitoring device configured to monitor incoming DNS traffic and determine at least one FQDN from the incoming DNS traffic, and a device according to the first aspect or any of its implementation forms to determine whether the determined FQDN is malicious or not.

Accordingly, the system of the second aspect achieves all advantages and effects of the device of the first aspect and its implementation forms. This system of the second aspect can be implemented, for instance, in a host intrusion detection system, and can provide higher security.

In an implementation form of the second aspect, the system is configured to, after a number of FQDNs has been determined to be malicious, wherein the number is above a determined threshold number, block a process that is an origin of the incoming DNS traffic, from which the FQDNs were determined, or output an alert message.

A third aspect of the present invention provides a method for detecting malicious domain names, the method comprising receiving, as an input, a FQDN and a public suffix index, determining a public suffix sequence and a domain characters sequence in the FQDN based on the public suffix index, processing the public suffix sequence to obtain a first result indicative of whether the FQDN is malicious or not, processing the domain characters sequence to obtain a second result indicative of whether the FQDN is malicious or not, and merging the first result and the second result and determining based on the merged result whether the FQDN is malicious or not.

In an implementation form of the third aspect, the method comprises processing the public suffix sequence with a first LSTM network, and/or processing the domain characters sequence with a second LSTM network.

In a further implementation form of the third aspect, the first LSTM network and/or the second LSTM network is a RNN.

In a further implementation form of the third aspect, if the method determines that the FQDN does not include any public suffix sequence, the method further comprises taking the FQDN and omitting the processing of any sub-domain characters sequence of the FQDN.

In a further implementation form of the third aspect, for processing the public suffix sequence, the method comprises computing a probability that the public suffix sequence and the domain character sequence are used for a malicious FQDN based on determined previous events.

In a further implementation form of the third aspect, the method further comprises computing a probability that the public suffix sequence is used by a DGA.

In a further implementation form of the third aspect, the method comprises receiving, as an input, a training set for learning the determined previous events.

In a further implementation form of the third aspect, for processing the domain characters sequence, the method comprises calculating a probability that the domain characters sequence is used for a malicious FQDN based on a likelihood of one or more next characters in the sequence.

In a further implementation form of the third aspect, for determining whether the FQDN is malicious or not, the method comprises classifying the merged result.

The method of the third aspect and its implementation forms achieve the same effects and advantages described with respect to the device of the first aspect and its respective implementation forms.

A fourth aspect of the present invention provides a computer program product comprising program code for controlling a device according to the first aspect or any of its implementation forms, or for performing, when implemented on a processor, a method according to the third aspect or any of its implementation forms.

Accordingly, with the program code, which is e.g. stored on the computer program product, the above-described advantages and effects of the method of the third aspect and of the device of the first aspect can respectively be achieved. The computer program product may be a data carrier carrying the program code or may be a hardware storage device or the like.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities.

Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a device according to an embodiment of the present invention.

FIG. 2 shows a device according to an embodiment of the present invention.

FIG. 3 shows a device according to an embodiment of the present invention.

FIG. 4 shows a performance of a device according to an embodiment of the present invention.

FIG. 5 shows a detection rate for several DGAs achieved by a device according to an embodiment of the present invention.

FIG. 6 shows a system according to an embodiment of the present invention.

FIG. 7 shows an integration of a system according to an embodiment of the present invention into a host intrusion detection system.

FIG. 8 shows a cloud botnet detection service including a device according to an embodiment of the present invention.

FIG. 9 shows a method according to an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows a device 100 according to an embodiment of the present invention. The device 100 is especially suited for detecting malicious domain names, particularly generated by DGA, and thus for identifying DGAs. The device 100 may comprise at least one processor and/or at least one LSTM network configured for implementing functions (a detection algorithm) described in the following. Thereby the at least one LSTM network may be implemented by processing circuitry.

The device 100 is configured to receive, as an input, a FQDN 101 and a public suffix index 102. The public suffix index 102 may also be referred to as a public suffix list. Further, the device 100 is configured to determine a public suffix sequence 103 and a domain characters sequence 104 in the FQDN 101 based on the public suffix index 102. In other words, the device 100 can extract from the FQDN 101 the public suffix sequence 103 as a first part, and the domain characters sequence 104 as a second part. These parts of the FQDN 101 can then be processed separately by the device 100.

In particular, the device 100 is configured to process the public suffix sequence 103 to obtain a first result 105 indicative of whether the FQDN 101 is malicious or not, and to process the domain characters sequence 104 to obtain a second result 106 indicative of whether the FQDN 101 is malicious or not. To this end the device 100 can comprise at least one LSTM network to carry out the processing. An LSTM network may be, for instance, an RNN or CNN. For obtaining the first result 105, the device 100 may be configured to compute a probability that the public suffix sequence 103, or the public suffix sequence 103 and the domain characters sequence 104, are used for a malicious FQDN 101, based on determined previous events such as a recorded history. For instance, the more often a public suffix sequence 103 was already used for a malicious FQDN 101, the higher the probability that it is again used maliciously. For obtaining the second result 106, the device 100 may be configured to calculate a probability that the domain characters sequence 104 is used for a malicious FQDN 101 based on a likelihood of one or more next characters in the domain characters sequence 104. For instance, the lower the likelihood of the one or more next characters, the higher the probability that the domain characters sequence 104 is used maliciously.
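The two intuitions above can be sketched in plain Python. This is not the LSTM-based processing of the embodiment: a smoothed frequency estimate stands in for the history-based first result, and a character bigram model stands in for the next-character likelihood behind the second result. All function names, the add-one smoothing, and the alphabet size of 38 are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def suffix_probability(suffix, malicious_counts, benign_counts):
    """Smoothed share of past sightings of `suffix` that were malicious."""
    bad = malicious_counts.get(suffix, 0)
    good = benign_counts.get(suffix, 0)
    # Add-one smoothing: an unseen suffix scores 0.5 instead of dividing by zero.
    return (bad + 1) / (bad + good + 2)


def train_bigrams(benign_domains):
    """Count character transitions in benign names ('^' marks the start)."""
    counts = defaultdict(Counter)
    for name in benign_domains:
        prev = "^"
        for ch in name:
            counts[prev][ch] += 1
            prev = ch
    return counts


def mean_surprise(name, counts, alphabet_size=38):
    """Average negative log-likelihood of each next character.

    Higher values mean the characters are less predictable under the
    benign model, which the text associates with malicious use.
    """
    total, prev = 0.0, "^"
    for ch in name:
        seen = counts.get(prev, Counter())
        p = (seen[ch] + 1) / (sum(seen.values()) + alphabet_size)
        total -= math.log(p)
        prev = ch
    return total / max(len(name), 1)


bigrams = train_bigrams(["google", "yahoo", "facebook", "wikipedia"])
print(mean_surprise("google", bigrams) < mean_surprise("kmcokkdoqwvfgk", bigrams))  # True
print(suffix_probability("dydns.com", {"dydns.com": 8}, {"dydns.com": 2}))  # 0.75
```

A suffix seen eight times in malicious and twice in benign names scores 0.75, and the pseudo-random label is more surprising than a name the toy model was trained on.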

Finally, the device 100 is configured to merge the first result 105 and the second result 106 to obtain a merged result 107. Based on the merged result 107, the device 100 is configured to determine as an end result, whether the FQDN 101 is malicious or not. When merging the first result 105 and the second result 106, the device 100 may also be configured to weight the results.
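A weighted merge of the two results can be sketched as follows. In the embodiments the merge is performed by a learned layer of the deep learning model; the fixed weights, the threshold of 0.5, and the name `merge_results` below are purely illustrative assumptions.

```python
def merge_results(first_result, second_result,
                  w_first=0.4, w_second=0.6, threshold=0.5):
    """Weighted average of two probabilities followed by a threshold.

    Returns the merged score and the verdict (True means malicious).
    The weights and threshold are hypothetical, not values from the text.
    """
    merged = (w_first * first_result + w_second * second_result) / (w_first + w_second)
    return merged, merged >= threshold


score, is_malicious = merge_results(0.75, 0.9)
print(round(score, 2), is_malicious)  # 0.84 True
```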

FIG. 2 shows a device 100 according to an embodiment of the present invention, which builds on the device 100 shown in FIG. 1. Same elements in FIG. 1 and FIG. 2 are labelled with same reference signs and function likewise. Accordingly, also the device 100 of FIG. 2 is configured to receive the public suffix index 102 and the FQDN 101, respectively, and to determine in a two-step process, whether the FQDN 101 is malicious or not. For the two-step process, the device 100 particularly uses two different paths 202, 203 in a deep learning model, particularly two different LSTM networks.

FIG. 2 shows particularly that the public suffix index 102 and the FQDN 101 are input to a unit 200 of the device 100 configured for top level domain extraction. This extraction unit 200 yields the domain characters sequence 104 and the public suffix sequence 103, which is here referred to as a public suffix array. The public suffix sequence 103 then takes a first path through e.g. a first LSTM 202, which yields the first result 105. The domain characters sequence 104 takes a second path through e.g. a second LSTM 203, which yields the second result 106. These two results 105 and 106 are then merged in a merge layer 204 of the device 100, which yields the merged result 107. Based on the merged result 107, the deep learning model decides whether the input FQDN 101 is malicious or not. Notably, the LSTMs 202 and 203 and the merge layer 204 are part of the deep learning model.

FIG. 3 shows a device 100 according to an embodiment of the present invention, which builds on the device 100 shown in FIG. 2. Same elements in FIG. 2 and FIG. 3 are labelled with the same reference signs and function likewise. In particular, FIG. 3 shows the deep learning model of the device 100 shown in FIG. 2 in more detail.

The deep learning model consists particularly of the two separate LSTM networks 202 and 203. The second LSTM 203 is for the processing of the domain characters sequence 104 (e.g. kmcokkdoqwvfgk) and the first LSTM 202 is for the processing of the public suffix sequence 103 or array (e.g. the public suffix act.edu.au is represented by the array ['act', 'edu', 'au']).
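Before the two sequences can be fed to sequence models, they are typically turned into integer indices. The sketch below shows one plausible encoding under stated assumptions: the domain characters are indexed per character and the public suffix array per label. The alphabet, the sample vocabulary, and the use of 0 for unknown symbols are hypothetical choices, not details given in the application.

```python
def encode_chars(domain, alphabet="abcdefghijklmnopqrstuvwxyz0123456789-_"):
    """Map each character to a 1-based index; 0 marks unknown characters."""
    index = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return [index.get(ch, 0) for ch in domain.lower()]


def encode_suffix(labels, vocabulary):
    """Map each suffix label to its vocabulary index; 0 marks unseen labels."""
    return [vocabulary.get(label, 0) for label in labels]


# Hypothetical label vocabulary built from a public suffix list.
vocab = {"com": 1, "net": 2, "au": 3, "edu": 4, "act": 5}
print(encode_chars("kmcokkdoqwvfgk"))
print(encode_suffix(["act", "edu", "au"], vocab))  # [5, 4, 3]
```

Encoding the suffix per label rather than per character keeps that sequence very short, which matches the point made earlier that processing the public suffix sequence is not complex.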

The respective results 105 and 106 are merged in the merge concatenation layer 204, and are processed by fully connected layers 306 and 308, respectively. A first fully connected layer 306 processes the output of the first LSTM 202 to produce the first result 105, and a second fully connected layer 308 processes the merged result 107. The output of the device 100, i.e. whether the FQDN 101 is malicious or not, is then predicted.

The deep learning model has, for instance, been trained on the Alexa index of the 1M most popular sites, the DMOZ index with more than 3M manually edited non-malicious domains, and about 1M DGA samples taken from open-source intelligence (OSINT) and DGArchive (DGArchive.caad.fkie.fraunhofer.de).

FIG. 4 compares the performance of a device 100 according to the present invention, which particularly implements the above-described deep learning model, with a device implementing a conventional model (e.g. an algorithm presented by the Norwegian Computing Center). It can be seen that the device 100 according to an embodiment of the present invention shows a significantly improved performance with respect to both validation accuracy and validation loss over the conventional device. That is, the validation accuracy of the device 100 is considerably higher than that of the conventional device, while its validation loss is considerably lower.

FIG. 5 shows a detection rate achieved by a device 100 according to an embodiment of the present invention for several DGAs. In particular, the table shown in FIG. 5 names the various DGAs, describes them briefly, and demonstrates that the probability of the device 100 detecting unseen domains produced with the various DGAs is consistently high. Further, the probability of false positives on non-malicious domains from unseen sources is low.

FIG. 6 shows a system 600 according to an embodiment of the present invention. The system 600 is particularly for detecting malicious domain names, especially malicious domain names generated by a DGA. The system 600 comprises a monitoring device 601 configured to monitor incoming DNS traffic 602 and to determine at least one FQDN 101 from the incoming DNS traffic 602. The system 600 also comprises a device 100 according to an embodiment of the present invention, as for example shown in FIG. 1, 2 or 3. The device 100 is configured to determine whether the determined FQDN 101 is malicious or not. This determination is achieved with the two-step process explained above.

FIG. 7 shows that the system 600 (and thus respectively also the device 100) may be implemented in, or may even be, a host intrusion detection system (HIDS). The HIDS may be a Cloud Service provided to consumers. The HIDS may be composed of several plugins running on a common agent-based platform at the side of a Guest Virtual Machine (VM). The DGA plugin, i.e. the system 600, may run on top of the HIDS Agent platform, and may passively sniff the DNS traffic 602. Once a new DNS lookup is detected, the FQDN 101 is sent to a Cloud Botnet Detection Service, which includes the device 100. If the Cloud Botnet Detection Service detects a malicious domain, the HIDS waits until a certain threshold is reached (e.g. 10 positively detected DGA domains) and then blocks (or, alternatively, raises an alert for) the process that is the origin of the DGA traffic. In other words, the system 600 in the HIDS is configured to, after a number of FQDNs 101 has been determined to be malicious, wherein the number is above a determined threshold number, block a process that is an origin of the incoming DNS traffic 602, from which the FQDNs 101 were determined, or output an alert message.
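The threshold behaviour described above can be sketched as a small per-process counter. The class name `DgaGuard`, the process identifier format, and the decision to count per process are assumptions for illustration; how the HIDS actually attributes lookups to processes and performs the blocking is not specified here.

```python
from collections import Counter


class DgaGuard:
    """Counts malicious verdicts per process and reports when to act."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.hits = Counter()

    def record(self, process_id, is_malicious):
        """Return True once the process has exceeded the threshold,
        i.e. when it should be blocked or an alert should be raised."""
        if is_malicious:
            self.hits[process_id] += 1
        return self.hits[process_id] > self.threshold


guard = DgaGuard(threshold=2)
verdicts = [guard.record("pid-4711", True) for _ in range(3)]
print(verdicts)  # [False, False, True]
```

The comparison is strict because the text requires the number of malicious FQDNs to be above the threshold before the system reacts.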

FIG. 8 shows a Cloud Botnet Detection Service, e.g. the one used in the system 600 of FIG. 7, including the device 100. The Cloud Botnet Detection Service may be part of a Galaxy Big Data and AI platform. Galaxy is responsible for aggregating data from multiple sources and for processing it (including model building). A DGA Feed Aggregation component is responsible for data aggregation both for proven benign domains (e.g. Alexa, DMOZ, Huawei DNSaaS) and malicious domains (e.g. DGArchive, Malwaredomainlist, OSINT and more). The aggregated data is stored in the Big Data platform. A botnet detection service, implemented by or including the device 100 according to an embodiment of the present invention, is responsible for periodic training of the model described above. The trained model is used for inference on domain lists that come from HIDS agents.

FIG. 9 shows a method 900 according to an embodiment of the present invention. The method 900 may be carried out by a device 100 according to an embodiment of the present invention, as e.g. shown in FIG. 1, 2 or 3, or a system 600 as shown in FIG. 6 or FIG. 7. The method 900 comprises a step 901 of receiving, as an input, a FQDN 101 and a public suffix index 102. Further, it comprises a step 902 of determining a public suffix sequence 103 and a domain characters sequence 104 in the FQDN 101 based on the public suffix index 102. Further, it comprises a step 903 of processing the public suffix sequence 103 to obtain a first result 105 indicative of whether the FQDN 101 is malicious or not. Further, it comprises a step 904 of processing the domain characters sequence 104 to obtain a second result 106 indicative of whether the FQDN 101 is malicious or not. Finally, it comprises a step 905 of merging the first result 105 and the second result 106 and determining based on the merged result 107 whether the FQDN 101 is malicious or not.

The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. Device (100) for detecting malicious domain names, the device (100) being configured to

receive, as an input, a Fully-Qualified Domain Name, FQDN (101) and a public suffix index (102),

determine a public suffix sequence (103) and a domain characters sequence (104) in the FQDN (101) based on the public suffix index (102),

process the public suffix sequence (103) to obtain a first result (105) indicative of whether the FQDN (101) is malicious or not,

process the domain characters sequence (104) to obtain a second result (106) indicative of whether the FQDN (101) is malicious or not, and

merge the first result (105) and the second result (106) and determine based on the merged result (107) whether the FQDN (101) is malicious or not.

2. Device (100) according to claim 1, comprising

a first Long Short-Term Memory, LSTM, network (202) for processing the public suffix sequence (103), and/or

a second LSTM network (203) for processing the domain characters sequence (104).

3. Device (100) according to claim 2, wherein

the first LSTM network (202) and/or the second LSTM network (203) is a Recurrent Neural Network.

4. Device (100) according to one of the claims 1 to 3, wherein for processing the public suffix sequence (103), the device (100) is configured to

compute a probability that the public suffix sequence (103) and the domain character sequence (104) are used for a malicious FQDN (101) based on determined previous events.

5. Device (100) according to claim 4, further configured to

compute a probability that the public suffix sequence (103) is used by a Domain Generation Algorithm.

6. Device (100) according to claim 4 or 5, configured to

receive, as an input, a training set for learning the determined previous events.

7. Device (100) according to one of the claims 1 to 6, wherein for processing the domain characters sequence (104), the device (100) is configured to

calculate a probability that the domain characters sequence (104) is used for a malicious FQDN (101) based on a likelihood of one or more next characters in the domain characters sequence (104).

8. Device (100) according to one of the claims 1 to 7, wherein for determining whether the FQDN (101) is malicious or not, the device (100) is configured to

classify the merged result (107).

9. System (600) for detecting malicious domain names, the system (600) comprising

a monitoring device (601) configured to monitor incoming Domain Name System, DNS, traffic (602) and determine at least one FQDN (101) from the incoming DNS traffic (602), and

a device (100) according to one of the claims 1 to 8 configured to determine whether the determined FQDN (101) is malicious or not.

10. System (600) according to claim 9, configured to,

after a number of FQDNs (101) has been determined to be malicious, wherein the number is above a determined threshold number,

block a process that is an origin of the incoming DNS traffic (602), from which the FQDNs (101) were determined, or

output an alert message.

11. Method (900) for detecting malicious domain names, the method (900) comprising

receiving (901), as an input, a Fully-Qualified Domain Name, FQDN (101) and a public suffix index (102),

determining (902) a public suffix sequence (103) and a domain characters sequence (104) in the FQDN (101) based on the public suffix index (102),

processing (903) the public suffix sequence (103) to obtain a first result (105) indicative of whether the FQDN (101) is malicious or not,

processing (904) the domain characters sequence (104) to obtain a second result (106) indicative of whether the FQDN (101) is malicious or not, and

merging (905) the first result (105) and the second result (106) and determining based on the merged result (107) whether the FQDN (101) is malicious or not.

12. Computer program product comprising program code for controlling a device (100) according to one of the claims 1 to 8, or for performing, when implemented on a processor, a method (900) according to claim 11.