GB2567749A - Method for associating domain name with website access behavior - Google Patents
Method for associating domain name with website access behavior
- Publication number
- GB2567749A
- Authority
- GB
- United Kingdom
- Prior art keywords
- domain name
- dns
- log
- associating
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/535—Tracking the activity of the user
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/40—Data acquisition and logging
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/45—Network directories; Name-to-address mapping
- H04L61/4505—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
- H04L61/4511—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Hardware Design (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Transfer Between Computers (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention provides a method for associating a domain name with a website access behavior, comprising the following steps: step S1, simulating a user's website access behavior by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set; step S2, slicing a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the domain name request sets obtained by DNS log slicing in step S2, and if one of the domain name request sets obtained by DNS log slicing is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user clicked on the domain name of a URL requested by the crawler program during crawling. By means of this method, a user's Internet browsing behavior can also be analyzed by means of a DNS log.
Description
The disclosure relates to the field of Internet DNS domain name resolution and web crawler techniques, and more particularly, to a method for associating a domain name with a website visiting behavior.
Background Art
DNS (Domain Name System) is a distributed database which provides a mapping between domain names and IP addresses on the Internet. DNS allows a user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. DNS name resolution works as follows: when visiting a website, the user first types its domain name in the browser and presses the enter key. The browser then initiates a DNS request. Through DNS, the browser obtains the IP address of the server corresponding to the domain name and initiates an HTTP request to that IP address.
A web crawler is a program or script which crawls web information automatically according to certain rules. In this disclosure, the crawler simulates a user initiating an HTTP request to a website and records the DNS requests generated during the process.
The value of DNS data has long been overlooked: it is regarded only as a mapping between IP addresses and domain names, and no product currently on the market uses DNS data to associate domain names with user behavior.
Summary of the Disclosure
The disclosure provides a method for associating a domain name with a website visiting behavior. By means of a combination of DNS log collection and web crawler technique, the analysis on an Internet browsing behavior of a user can also be implemented via a DNS log.
A method for associating a domain name with a website visiting behavior in this disclosure includes the following steps: step S1, simulating a website visiting behavior of a user by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set; step S2, segmenting a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the domain name request sets obtained by the DNS log segmentation in step S2, and if one of the domain name request sets obtained by the DNS log segmentation is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user has clicked the domain name of a URL requested by the crawler program during crawling.
Preferably, the DNS log in step S2 is a DNS log recorded on the day of the visiting behavior.
Preferably, the DNS log segmentation in step S2 includes two-time segmentation, that is, a segmentation based on the source IP first and then another segmentation based on the timestamp difference.
Preferably, the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.
Preferably, the segmentation based on the timestamp difference further segments the log, after it has been segmented based on the source IP, according to the timestamp difference between DNS log entries: if the timestamp difference between two consecutive DNS log entries is longer than a specified time length, the two entries are split up.
Preferably, the specified time length is three seconds.
By means of the method for associating a domain name with a website visiting behavior of the disclosure, the analysis on an Internet browsing behavior of a user can also be implemented by means of a DNS log.
Brief Description of Drawings
FIG. 1 is a schematic representation for a DNS domain name request set crawled by a crawler program.
FIG. 2 is a flow chart of a method for associating a domain name with a website visiting behavior of the disclosure.
Description of Embodiments
The disclosure will be described in detail below with reference to the accompanying drawings and embodiments. The following embodiments are not intended to limit the invention. Variations and advantages that can be conceived by those skilled in the art are included in the disclosure without departing from the spirit and scope of the disclosure.
As mentioned above, DNS (Domain Name System) is a distributed database which provides a mapping between domain names and IP addresses on the Internet. DNS allows a user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. When visiting a website, the user first types its domain name in the browser and presses the enter key. The browser then initiates a DNS request. Through DNS, the browser obtains the IP address of the server corresponding to the domain name and initiates an HTTP request to that IP address. This is the DNS name resolution technique.
A DNS log may be generated during the above domain name resolution process. The DNS log records the response content of every DNS request and therefore captures the domain name information of almost all user requests. The format of a DNS log entry is as follows:
14.***.***.10|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0
Source IP | Domain name | Time stamp | Resolution IP | Status code
That is, DNS logs consist of “Source IP”, “Domain name”, “Time stamp”, “Resolution IP” and “Status code”. The method for associating a domain name with a website visiting behavior of the disclosure will be described in detail below with reference to FIG. 1.
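The five-field layout can be illustrated with a minimal parsing sketch. It assumes the fields are separated by `|`, that the timestamp uses the 14-digit `YYYYMMDDhhmmss` form shown in the examples, and that a colon separates multiple resolution IPs (inferred from the sample entry, not stated in the disclosure). The class name `DnsLogEntry`, the function name `parse_dns_log_line` and the unmasked placeholder addresses are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DnsLogEntry:
    source_ip: str
    domain: str
    timestamp: datetime
    resolved_ips: tuple
    status_code: int


def parse_dns_log_line(line: str) -> DnsLogEntry:
    """Split one pipe-delimited log line into its five fields."""
    source_ip, domain, stamp, resolved, status = line.strip().split("|")
    return DnsLogEntry(
        source_ip=source_ip,
        domain=domain,
        # Assumed timestamp layout: YYYYMMDDhhmmss
        timestamp=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        # Assumed: ':' separates multiple resolution IPs
        resolved_ips=tuple(resolved.split(":")),
        status_code=int(status),
    )


# Placeholder addresses stand in for the masked octets of the sample entry.
entry = parse_dns_log_line(
    "2.2.2.2|www.baidu.com|20141211035932|180.0.0.107:180.0.0.108|0"
)
```

A line with more or fewer than five fields raises `ValueError` in this sketch; a production parser would need to decide how to handle such lines.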
First, a website visiting behavior of a user is simulated by means of a crawler program so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set (step S1). For example, when a page is opened or a URL (link) is clicked, the crawler program may crawl all DNS domain name requests in the current HTTP request. Since clicking a URL may cause requests for other domain names in addition to the domain name of that URL, all DNS domain name requests generated after the URL is clicked can be obtained through the crawler technique. Here, a Uniform Resource Locator (URL) is a compact representation of the location and access method of a resource available on the Internet, and is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, which contains information showing the file location and how the browser should process it.
For instance, a user has clicked a specific URL (link), as shown below: "http://baike.baidu.com/link?url=Lm-TkKUzV687IRoPCDVUAG5qslgMyZtNa6e6A3nPnWXorcXEAII5006XHZWpTJat".
The crawler program can crawl all DNS domain name requests generated after the URL is clicked, i.e., a DNS domain name request set, as shown in detail in FIG. 1.
Next, a DNS log is segmented to obtain n domain name request sets, n being an integer greater than or equal to 1 (step S2). Here, the DNS log is a DNS log recorded on the day of the visiting behavior. The segmentation includes two-time segmentation, that is, a segmentation based on the source IP first and then another segmentation based on the timestamp difference.
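The disclosure does not specify how the crawler records DNS requests. As one possible illustration, the sketch below assumes the crawler can report the URLs of all resources fetched while loading a page, and derives the domain name request set from their hostnames. The function name `domain_request_set` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def domain_request_set(resource_urls):
    """Return the set of hostnames a page load would need to resolve."""
    domains = set()
    for url in resource_urls:
        host = urlsplit(url).hostname
        if host:  # skip URLs without a network location
            domains.add(host.lower())
    return domains


# Hypothetical resources fetched after clicking one link.
crawled = domain_request_set([
    "http://www.a.com/doc/1234",
    "http://www.b.com/static/app.js",
    "http://www.c.com/pixel.gif",
    "http://www.b.com/static/site.css",
])
```

Because the result is a set, a domain requested several times during one page load appears only once, which suits the set-to-set matching of step S3.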
1) The DNS log segmentation is based on the source IP, that is, consecutive DNS log entries are split up wherever the source IP changes. The segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time, as shown below:
1.1.1.1|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0
1.1.1.1|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0
--------------- Log Cutting Line ---------------
2.2.2.2|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0
2.2.2.2|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0
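The source-IP segmentation can be sketched with `itertools.groupby`, which groups only adjacent items and therefore cuts the log exactly where the source IP changes. This is a minimal illustration, assuming pipe-delimited lines with the source IP as the first field; the helper names `source_ip` and `split_by_source_ip` are hypothetical.

```python
from itertools import groupby


def source_ip(line):
    """Return the first pipe-delimited field, the source IP."""
    return line.split("|", 1)[0]


def split_by_source_ip(lines):
    """Cut the log wherever the source IP changes between adjacent lines."""
    return [list(run) for _, run in groupby(lines, key=source_ip)]


log_lines = [
    "1.1.1.1|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0",
    "1.1.1.1|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0",
    "2.2.2.2|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0",
    "2.2.2.2|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0",
]
ip_segments = split_by_source_ip(log_lines)
```

If entries from different source IPs are interleaved in the log, this sketch yields many short runs; sorting by source IP and timestamp beforehand would be one way to avoid that, though the disclosure does not say whether such sorting is applied.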
2) The segmentation based on the timestamp difference means that the logs, after being segmented based on the source IP, are further segmented based on the timestamp difference between DNS log entries. If the timestamp difference between two consecutive entries is longer than a specified time length, the two entries are split up, because an interval that long suggests two different behaviors. The specified time length may be adjusted as desired. In this embodiment, the specified time length is three seconds, i.e., the log is split up if the timestamp interval is longer than three seconds.
For example, the DNS log of source IP 2.2.2.2 can be further segmented based on its timestamp difference, as shown below. (Timestamp 20141211035932 represents 3 (hour):59 (minute):32 (second), Dec 11, 2014)
Source IP | Domain name | Timestamp | Resolution IP | Status code
2.2.2.2|www.baidu.com|20141211000001|180.***.***.107:180.***.***.108|0
2.2.2.2|a.qq.com|20141211000002|180.***.***.107:180.***.***.108|0
2.2.2.2|b.baidu.com|20141211000003|180.***.***.107:180.***.***.108|0
2.2.2.2|c.tanx.com|20141211000004|180.***.***.107:180.***.***.108|0
2.2.2.2|c.allyes.com|20141211000005|180.***.***.107:180.***.***.108|0
--------------- Log Cutting Line ---------------
2.2.2.2|www.sina.com|20141211000009|180.***.***.107:180.***.***.108|0
As described above, since the difference between 05 seconds in timestamp 20141211000005 and 09 seconds in timestamp 20141211000009 is four seconds (longer than three seconds), the log is split up.
Thus www.baidu.com, a.qq.com, b.baidu.com, c.tanx.com and c.allyes.com form one of the domain name request sets obtained from the DNS log.
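The timestamp-based cut can be sketched as follows, using the three-second threshold of this embodiment and the entries of source IP 2.2.2.2 from the example. The function name `split_by_time_gap`, the `max_gap` parameter and the `(domain, timestamp)` tuple layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def split_by_time_gap(records, max_gap=timedelta(seconds=3)):
    """Split (domain, timestamp) records wherever the gap exceeds max_gap."""
    segments = []
    previous = None
    for domain, stamp in records:
        current = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        # Start a new segment on the first record or after a long gap.
        if previous is None or current - previous > max_gap:
            segments.append([])
        segments[-1].append(domain)
        previous = current
    return segments


records = [
    ("www.baidu.com", "20141211000001"),
    ("a.qq.com", "20141211000002"),
    ("b.baidu.com", "20141211000003"),
    ("c.tanx.com", "20141211000004"),
    ("c.allyes.com", "20141211000005"),
    ("www.sina.com", "20141211000009"),
]
time_segments = split_by_time_gap(records)
```

Here the four-second gap before www.sina.com exceeds the threshold, so that entry starts a second segment. A gap of exactly three seconds does not trigger a cut, matching the "longer than three seconds" wording.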
Next, set-to-set matching is performed between the domain name request set crawled by the crawler in step S1 and the domain name request sets obtained by the DNS log segmentation in step S2 (step S3). The matching ignores the order of elements, i.e., the matching rule is [(a,b,c) = (b,c,a) = (a,c,b)].
After the matching, it is considered that the DNS log indicates that the user has clicked the domain name (i.e., the URL domain name requested by the crawler during crawling) if one of the domain name request sets from the DNS log is contained in the domain name request set crawled by the crawler, or the two sets are equal to each other. For example:
The URL (as a click behavior of a user) crawled by the crawler is www.a.com/doc/1234. The set A of all the crawled domain name requests is "www.a.com, www.b.com, www.c.com, www.d.com, and www.e.com".
A domain name request set B obtained after the segmentation of the DNS log is "www.a.com, www.b.com, www.e.com, and www.d.com".
As mentioned above, when set B is included in set A, the domain name request set B is considered to reflect www.a.com/doc/1234, which is a visiting behavior of a user mapped by the domain name set A. In this way, the Internet browsing behaviors of users can also be analyzed via a DNS log.
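The order-independent matching of step S3 maps directly onto a subset test. The sketch below uses sets A and B from the example; the function name `matches_crawled_set` is hypothetical, and rejecting an empty log segment is an added assumption, since the disclosure does not address that case.

```python
def matches_crawled_set(log_set, crawled_set):
    """True when the log segment equals or is contained in the crawled set."""
    log_set = set(log_set)
    # Assumption: an empty segment is never treated as a match.
    return bool(log_set) and log_set <= set(crawled_set)


set_a = {"www.a.com", "www.b.com", "www.c.com", "www.d.com", "www.e.com"}
set_b = ["www.a.com", "www.b.com", "www.e.com", "www.d.com"]

is_match = matches_crawled_set(set_b, set_a)
```

Converting both inputs to sets discards ordering and duplicates, which reflects the rule (a,b,c) = (b,c,a) = (a,c,b). A segment containing any domain outside set A does not match.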
The above-described aspects are only the preferred embodiments of the disclosure and are not intended to limit the scope of the disclosure. Any equivalent variations or modifications made according to the content of the claims of the disclosure should fall within the technical scope of the disclosure.
Claims (6)
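1. A method for associating a domain name with a website visiting behavior, comprising the following steps: step S1, simulating a website visiting behavior of a user by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set;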
step S2, segmenting a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the n domain name request sets obtained by DNS log segmentation in step S2, and if one of the n domain name request sets obtained by the DNS log segmentation is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user has clicked the domain name of a URL requested by the crawler program during crawling.
2. The method for associating a domain name with a website visiting behavior according to claim 1, wherein the DNS log in step S2 is a DNS log on the day of the visiting behavior.
3. The method for associating a domain name with a website visiting behavior according to claim 1, wherein the DNS log segmentation in step S2 includes two-time segmentation, that is, a segmentation based on a source IP first and then another segmentation based on a timestamp difference.
4. The method for associating a domain name with a website visiting behavior according to claim 3, wherein the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.
5. The method for associating a domain name with a website visiting behavior according to claim 4, wherein the segmentation based on the timestamp difference is to segment, based on the time stamp difference in DNS logs, the log after being segmented based on the source IP, and if the time stamp difference in two DNS logs is longer than a specified time length, the two DNS logs are split up.
6. The method for associating a domain name with a website visiting behavior according to claim 5, wherein the specified time length is three seconds.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610230263.0A CN105763633B (en) | 2016-04-14 | 2016-04-14 | A method for associating domain name and website access behavior |
| PCT/CN2016/095670 WO2017177590A1 (en) | 2016-04-14 | 2016-08-17 | Method for associating domain name with website access behavior |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| GB2567749A (en) | 2019-04-24 |
Family
ID=56333890
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1816195.0A (withdrawn; published as GB2567749A) | 2016-04-14 | 2016-08-17 | Method for associating domain name with website access behavior |
Country Status (5)
| Country | Link |
|---|---|
| JP (1) | JP6703621B2 (en) |
| CN (1) | CN105763633B (en) |
| GB (1) | GB2567749A (en) |
| RU (1) | RU2709647C9 (en) |
| WO (1) | WO2017177590A1 (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105763633B (en) * | 2016-04-14 | 2019-05-21 | 上海牙木通讯技术有限公司 | A method for associating domain name and website access behavior |
| CN111131370B (en) * | 2018-11-01 | 2022-09-27 | 百度在线网络技术(北京)有限公司 | Method, device and system for detecting whether service call is correct |
| CN110798545B (en) * | 2019-11-05 | 2020-08-18 | 中国人民解放军国防科技大学 | A Web-based Domain Name Data Acquisition Method |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101079064A (en) * | 2007-06-25 | 2007-11-28 | 腾讯科技(深圳)有限公司 | Web page sequencing method and device |
| CN103389983A (en) * | 2012-05-08 | 2013-11-13 | 阿里巴巴集团控股有限公司 | Webpage content grabbing method and device applied to network crawler system |
| CN104065532A (en) * | 2014-06-26 | 2014-09-24 | 国家计算机网络与信息安全管理中心 | A search method and system for unregistered websites based on multi-channel data access |
| CN105005600A (en) * | 2015-07-02 | 2015-10-28 | 焦点科技股份有限公司 | Preprocessing method of URL (Uniform Resource Locator) in access log |
| CN105704260A (en) * | 2016-04-14 | 2016-06-22 | 上海牙木通讯技术有限公司 | Method for analyzing where Internet traffic comes from and goes to |
| CN105763633A (en) * | 2016-04-14 | 2016-07-13 | 上海牙木通讯技术有限公司 | Association method of domain name and website visiting behavior |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7734815B2 (en) * | 2006-09-18 | 2010-06-08 | Akamai Technologies, Inc. | Global load balancing across mirrored data centers |
| US8145762B2 (en) * | 2007-05-22 | 2012-03-27 | Kount Inc. | Collecting information regarding consumer click-through traffic |
| CN101729288B (en) * | 2008-10-31 | 2014-02-05 | 中国科学院计算机网络信息中心 | Method and device for counting network access behaviours of internet users |
| US8527658B2 (en) * | 2009-04-07 | 2013-09-03 | Verisign, Inc | Domain traffic ranking |
| JP5770652B2 (en) * | 2012-01-31 | 2015-08-26 | 日本電信電話株式会社 | Source / destination organization identification apparatus, method and program |
| CN105357054B (en) * | 2015-11-26 | 2019-01-29 | 上海晶赞科技发展有限公司 | Website traffic analysis method, device and electronic equipment |
-
2016
- 2016-04-14 CN CN201610230263.0A patent/CN105763633B/en active Active
- 2016-08-17 RU RU2018139988A patent/RU2709647C9/en active
- 2016-08-17 JP JP2018554480A patent/JP6703621B2/en active Active
- 2016-08-17 GB GB1816195.0A patent/GB2567749A/en not_active Withdrawn
- 2016-08-17 WO PCT/CN2016/095670 patent/WO2017177590A1/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| JP2019514137A (en) | 2019-05-30 |
| WO2017177590A1 (en) | 2017-10-19 |
| RU2709647C1 (en) | 2019-12-19 |
| RU2709647C9 (en) | 2020-04-02 |
| CN105763633A (en) | 2016-07-13 |
| JP6703621B2 (en) | 2020-06-03 |
| CN105763633B (en) | 2019-05-21 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US11606384B2 (en) | Clustering-based security monitoring of accessed domain names | |
| AU2011200024B2 (en) | Method and System of Measuring and Recording User Data in a Communications Network | |
| CN102663062B (en) | Method and device for processing invalid links in search result | |
| JP4358188B2 (en) | Invalid click detection device in Internet search engine | |
| CN101079768B (en) | A method of counting web page link click data | |
| CN109905288B (en) | Application service classification method and device | |
| US20110004850A1 (en) | Methods and apparatus for determining website validity | |
| RU2702048C1 (en) | Method of analyzing a source and destination of internet traffic | |
| CA2442190A1 (en) | Dynamic web page referrer tracking and ranking | |
| CN102663054A (en) | Method and device for determining weight of website | |
| CN110266661B (en) | Authorization method, device and equipment | |
| JP2006146882A (en) | Content evaluation | |
| CN102436564A (en) | Method and device for identifying tampered webpage | |
| US10360133B2 (en) | Analyzing analytic element network traffic | |
| GB2567749A (en) | Method for associating domain name with website access behavior | |
| US20210383059A1 (en) | Attribution Of Link Selection By A User | |
| CN118740675A (en) | Network supportability testing method, device, equipment, medium and program product | |
| CN104021143A (en) | Method and device for recording webpage access behavior | |
| CN104363309B (en) | Pan-domain name identification processing unit and method | |
| CN102918527B (en) | Investigation method and system for web application hosting | |
| CN102724129B (en) | For the queue scheduling of wide area name and access control apparatus and method | |
| JP5851251B2 (en) | Communication packet storage device | |
| HK1111545B (en) | A method for compiling statistics on webpage linkage click-through data | |
| CN115118624A (en) | Production flow shunting method and device, electronic equipment and storage medium | |
| Kar et al. | Knowledge Retrieval from Web Server Logs Using Web Usage Mining |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |