GB2567749A

GB2567749A - Method for associating domain name with website access behavior

Info

Publication number: GB2567749A
Application number: GB1816195.0A
Authority: GB
Inventors: Zhang Dashun
Original assignee: Shanghai Yamu Communication Tech Co Ltd
Current assignee: Shanghai Yamu Communication Tech Co Ltd
Priority date: 2016-04-14
Filing date: 2016-08-17
Publication date: 2019-04-24
Also published as: JP2019514137A; WO2017177590A1; RU2709647C1; RU2709647C9; CN105763633A; JP6703621B2; CN105763633B

Abstract

The present invention provides a method for associating a domain name with a website access behavior, comprising the following steps: step S1, simulating a user's website access behavior by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set; step S2, slicing a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the domain name request sets obtained by DNS log slicing in step S2, and if one of the domain name request sets obtained by DNS log slicing is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user clicks on the domain name of a URL requested by the crawler program during crawling. By means of the method for associating a domain name with a website access behavior of the present invention, the analysis on a user's Internet browsing behavior can also be implemented by means of a DNS log.

Description

The disclosure relates to the fields of Internet DNS domain name resolution and web crawler technology, and more particularly to a method for associating a domain name with a website visiting behavior.

Background Art

DNS (Domain Name System) is a distributed database which provides a mapping between domain names and IP addresses on the Internet. DNS allows a user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. “The DNS name resolution technique” means that, when visiting a website, the user first types its domain name in the browser and presses the enter key. The browser then initiates a DNS request. Through DNS, the browser obtains the IP address of the server corresponding to the domain name and initiates an HTTP request to this IP address.

A web crawler is a program or script which crawls web information automatically according to certain rules. In the web crawler technique used here, the crawler simulates a user initiating an HTTP request to a website and records the DNS requests generated during the process.

The value of DNS data has long been overlooked: it has been regarded only as a correspondence between IP addresses and domain names, so that no one on the market at present associates user behavior via DNS data.

Summary of the Disclosure

The disclosure provides a method for associating a domain name with a website visiting behavior. By means of a combination of DNS log collection and web crawler technique, the analysis on an Internet browsing behavior of a user can also be implemented via a DNS log.

A method for associating a domain name with a website visiting behavior in this disclosure includes the following steps: step S1, simulating a website visiting behavior of a user by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set; step S2, segmenting a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the domain name request sets obtained by the DNS log segmentation in step S2, and if one of the domain name request sets obtained by the DNS log segmentation is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user has clicked the domain name of a URL requested by the crawler program during crawling.

Preferably, the DNS log in step S2 is a DNS log logging on the day of the visiting behavior.

Preferably, the DNS log segmentation in step S2 includes two-stage segmentation, that is, a segmentation based on the source IP first and then another segmentation based on the timestamp difference.

Preferably, the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.

Preferably, the segmentation based on the time stamp difference is to segment, based on the time stamp difference in DNS logs, the log after being segmented based on the source IP, and if the time stamp difference in two DNS logs is longer than a specified time length, the two DNS logs are split up.

Preferably, the specified time length is three seconds.

By means of the method for associating a domain name with a website visiting behavior of the disclosure, the analysis on an Internet browsing behavior of a user can also be implemented by means of a DNS log.

Brief Description of Drawings

FIG. 1 is a schematic representation for a DNS domain name request set crawled by a crawler program.

FIG. 2 is a flow chart of a method for associating a domain name with a website visiting behavior of the disclosure.

Description of Embodiments

The disclosure will be described in detail below with reference to the accompanying drawings and embodiments. The following embodiments are not intended to limit the invention. Variations and advantages that can be conceived by those skilled in the art are included in the disclosure without departing from the spirit and scope of the disclosure.

As mentioned above, DNS (Domain Name System) is a distributed database which provides a mapping between domain names and IP addresses on the Internet. DNS allows a user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. When visiting a website, the user first types its domain name in the browser and presses the enter key. The browser then initiates a DNS request. Through DNS, the browser obtains the IP address of the server corresponding to the domain name and initiates an HTTP request to this IP address. This is the DNS name resolution technique.

The DNS log may be generated during the above domain name resolution process. The DNS log can record the response content of every DNS request and can record the domain name information of almost all user requests. The format of the DNS log is described below:

14.***.***.10|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0

Source IP | Domain name | Time stamp | Resolution IP | Status code

That is, DNS logs consist of “Source IP”, “Domain name”, “Time stamp”, “Resolution IP” and “Status code”. The method for associating a domain name with a website visiting behavior of the disclosure will be described in detail below with reference to FIG. 1.
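The five-field record described above can be read with a few lines of code. The following is a minimal sketch, not the applicant's implementation: the names DnsLogEntry and parse_dns_log_line are hypothetical, and it assumes that the fields are separated by "|", that the timestamp has the form YYYYMMDDHHMMSS, and that multiple resolution IPs are separated by ":".

```python
from datetime import datetime
from typing import List, NamedTuple


class DnsLogEntry(NamedTuple):
    """One DNS log record (hypothetical container, for illustration only)."""
    source_ip: str
    domain: str
    timestamp: datetime
    resolved_ips: List[str]
    status_code: str


def parse_dns_log_line(line: str) -> DnsLogEntry:
    """Split one pipe-delimited log line into its five fields."""
    source_ip, domain, stamp, resolved, status = line.strip().split("|")
    return DnsLogEntry(
        source_ip=source_ip,
        domain=domain,
        # Assumption: timestamps look like YYYYMMDDHHMMSS.
        timestamp=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        # Assumption: several resolution IPs are joined with ':'.
        resolved_ips=resolved.split(":"),
        status_code=status,
    )


entry = parse_dns_log_line(
    "14.***.***.10|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0"
)
```

Here the masked octets ("***") are kept as plain text, so the IP fields are treated as opaque strings rather than validated addresses.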

First, a website visiting behavior of a user is simulated by means of a crawler program so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set (step S1). For example, when a page is opened or a URL (link) is clicked, the crawler program may crawl all DNS domain name requests in a current HTTP request. Since a user may also request other domain names in addition to the domain name of the current URL when clicking on a URL, all DNS domain name requests generated after the URL is clicked can be obtained through the crawler technique. Here, a Uniform Resource Locator (URL) is a compact representation of the location and access method for resources available from the Internet, and is the address for standard resources on the Internet. Every file on the Internet has a unique URL, which contains the information showing the file location and how the browser should process it.

For instance, a user has clicked a specific URL (link), as shown below: “http://baike.baidu.com/link?url=Lm-TkKUzV687IRoPCDVUAG5qslgMyZtNa6e6A3nPnWXorcXEAII5006XHZWpTJat”.

The crawler program can crawl all DNS domain name requests generated after the URL is clicked, i.e., a DNS domain name request set, as shown in detail in FIG. 1.
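As an illustration of how a crawled DNS domain name request set might be derived, the sketch below assumes that the crawler has already recorded the URL of every resource requested while loading the page; the recording step itself (for example, by a headless browser) is outside this sketch. The function name domains_from_requests and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def domains_from_requests(request_urls):
    """Return the set of host names seen across the recorded request URLs."""
    domains = set()
    for url in request_urls:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            domains.add(host)
    return domains


# Hypothetical requests recorded while loading one page.
recorded = [
    "http://www.a.com/doc/1234",
    "http://www.b.com/static/app.js",
    "http://www.c.com/img/logo.png",
    "http://www.a.com/style.css",
]
crawled_set = domains_from_requests(recorded)
```

Because the result is a set, a domain name requested several times during one page load appears only once.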

Next, a DNS log is segmented to obtain n domain name request sets, n being an integer greater than or equal to 1 (step S2). Here, the DNS log is the DNS log recorded on the day of the visiting behavior. The segmentation is a two-stage segmentation, that is, a segmentation based on the source IP first and then another segmentation based on the timestamp difference.

1) The DNS log is segmented based on the source IP, that is, consecutive DNS log entries are split up wherever the source IP changes. The segmentation based on the source IP yields continuous DNS logs with the same source IP over a period of time, as shown below:

1.1.1.1|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0

1.1.1.1|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0

---------------------------------------Log Cutting Line----------------------------------------

2.2.2.2|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0

2.2.2.2|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0

2) The segmentation based on the time stamp difference means that the logs after being segmented based on the source IP are segmented based on the timestamp difference in the DNS logs. If the timestamp difference in two continuous logs is longer than a specified time length, the two logs are split up (the reason for this is the interval of the logs is so long that they are considered as two different behaviors). The specified time length may be adjusted as desired. In this embodiment, the specified time length is three seconds, i.e., the log may be split up if the timestamp interval is longer than three seconds.

For example, the DNS log of source IP 2.2.2.2 can be further segmented based on its timestamp difference, as shown below. (Timestamp 20141211035932 represents 3 (hour):59 (minute):32 (second), Dec 11, 2014)

Source IP | Domain name | Timestamp | Resolution IP | Status code

2.2.2.2|www.baidu.com|20141211000001|180.***.***.107:180.***.***.108|0

2.2.2.2|a.qq.com|20141211000002|180.***.***.107:180.***.***.108|0

2.2.2.2|b.baidu.com|20141211000003|180.***.***.107:180.***.***.108|0

2.2.2.2|c.tanx.com|20141211000004|180.***.***.107:180.***.***.108|0

2.2.2.2|c.allyes.com|20141211000005|180.***.***.107:180.***.***.108|0

---------------------------------------Log Cutting Line----------------------------------------

2.2.2.2|www.sina.com|20141211000009|180.***.***.107:180.***.***.108|0

As described above, since the difference between 05 seconds in timestamp 20141211000005 and 09 seconds in timestamp 20141211000009 is four seconds (longer than three seconds), the log is split up.

The domain names before the cutting line, namely www.baidu.com, a.qq.com, b.baidu.com, c.tanx.com and c.allyes.com, form one domain name request set in the DNS log.
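The two-stage segmentation of step S2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the applicant's implementation: the function name segment_dns_log is hypothetical, the log lines are assumed to be pipe-delimited with a YYYYMMDDHHMMSS timestamp in the third field, and the three-second threshold of this embodiment is exposed as the adjustable parameter max_gap.

```python
from datetime import datetime, timedelta
from itertools import groupby


def segment_dns_log(lines, max_gap=timedelta(seconds=3)):
    """Split log lines into (source IP, domain name request set) pairs.

    Stage 1 groups consecutive lines that share a source IP.
    Stage 2 cuts each group wherever two adjacent timestamps
    differ by more than max_gap.
    """
    records = []
    for line in lines:
        source_ip, domain, stamp = line.strip().split("|")[:3]
        records.append((source_ip, domain, datetime.strptime(stamp, "%Y%m%d%H%M%S")))

    segments = []
    # groupby merges only adjacent items, which matches "continuous" logs.
    for source_ip, run in groupby(records, key=lambda r: r[0]):
        current, previous_time = set(), None
        for _, domain, when in run:
            if previous_time is not None and when - previous_time > max_gap:
                segments.append((source_ip, current))
                current = set()
            current.add(domain)
            previous_time = when
        segments.append((source_ip, current))
    return segments


tail = "|180.***.***.107:180.***.***.108|0"
log = [
    "2.2.2.2|www.baidu.com|20141211000001" + tail,
    "2.2.2.2|a.qq.com|20141211000002" + tail,
    "2.2.2.2|b.baidu.com|20141211000003" + tail,
    "2.2.2.2|c.tanx.com|20141211000004" + tail,
    "2.2.2.2|c.allyes.com|20141211000005" + tail,
    "2.2.2.2|www.sina.com|20141211000009" + tail,
]
segments = segment_dns_log(log)
```

With the sample log of source IP 2.2.2.2, the four-second gap before www.sina.com exceeds the threshold, so the sketch yields two request sets: one with the first five domain names and one containing only www.sina.com.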

Next, set-to-set matching is performed between the domain name request set crawled by the crawler in step S1 and the domain name request sets obtained by the DNS log segmentation in step S2 (step S3). The matching rule is [(a,b,c) = (b,c,a) = (a,c,b)], that is, the order of the domain names within a set is irrelevant.

After the matching, if a domain name request set from the DNS log is included in the domain name request set crawled by the crawler, or the two sets are equal to each other, the DNS log is considered to indicate that the user has clicked the domain name (i.e., the domain name of the URL requested by the crawler during crawling). For example:

the URL (as a click behavior of a user) crawled by the crawler is www.a.com/doc/1234. The set A of all the crawled domain name requests is “www.a.com, www.b.com, www.c.com, www.d.com, and www.e.com”.

A domain name request set B obtained after the segmentation of the DNS log is “www.a.com, www.b.com, www.e.com, and www.d.com”.

As mentioned above, when set B is included in set A, the domain name request set B is considered to reflect www.a.com/doc/1234, which is a visiting behavior of a user mapped by the domain name set A. In this way, the Internet browsing behaviors of users can also be analyzed via a DNS log.
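The matching of step S3 with the example sets A and B can be sketched as below. The function name matches_crawled_set and the dictionary crawled_sets_by_url are hypothetical; the rejection of an empty log set is an added assumption (an empty set would otherwise be contained in every crawled set) and is not stated in the disclosure.

```python
def matches_crawled_set(log_set, crawled_set):
    """True when the log set equals or is contained in the crawled set.

    Assumption: an empty log set is never treated as a match.
    """
    log_set, crawled_set = set(log_set), set(crawled_set)
    return bool(log_set) and log_set <= crawled_set


# Set A: domain names crawled for the URL www.a.com/doc/1234.
set_a = {"www.a.com", "www.b.com", "www.c.com", "www.d.com", "www.e.com"}
# Set B: one domain name request set obtained by segmenting the DNS log.
set_b = {"www.a.com", "www.b.com", "www.e.com", "www.d.com"}

crawled_sets_by_url = {"www.a.com/doc/1234": set_a}
matched_urls = [
    url
    for url, crawled in crawled_sets_by_url.items()
    if matches_crawled_set(set_b, crawled)
]
```

Using sets makes the comparison independent of the order in which the domain names appear, which corresponds to the matching rule [(a,b,c) = (b,c,a) = (a,c,b)].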

The above-described aspects are only the preferred embodiments of the disclosure and are not intended to limit the scope of the disclosure. Any equivalent variations or modifications made according to the content of the claims of the disclosure should fall within the technical scope of the disclosure.

Claims

1. A method for associating a domain name with a website visiting behavior, comprising the following steps:

step S1, simulating a website visiting behavior of a user by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set;

step S2, segmenting a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and

step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the n domain name request sets obtained by DNS log segmentation in step S2, and if one of the n domain name request sets obtained by the DNS log segmentation is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user has clicked the domain name of a URL requested by the crawler program during crawling.

2. The method for associating a domain name with a website visiting behavior according to claim 1, wherein the DNS log in step S2 is a DNS log on the day of the visiting behavior.

3. The method for associating a domain name with a website visiting behavior according to claim 1, wherein the DNS log segmentation in step S2 includes two-stage segmentation, that is, a segmentation based on a source IP first and then another segmentation based on a time stamp difference.

4. The method for associating a domain name with a website visiting behavior according to claim 3, wherein the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.

5. The method for associating a domain name with a website visiting behavior according to claim 4, wherein the segmentation based on the timestamp difference is to segment, based on the time stamp difference in DNS logs, the log after being segmented based on the source IP, and if the time stamp difference in two DNS logs is longer than a specified time length, the two DNS logs are split up.

6. The method for associating a domain name with a website visiting behavior according to claim 5, wherein the specified time length is three seconds.
