GB2567749A - Method for associating domain name with website access behavior - Google Patents

Info

Publication number
GB2567749A
Authority
GB
United Kingdom
Prior art keywords
domain name
dns
log
associating
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1816195.0A
Inventor
Zhang Dashun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yamu Communication Tech Co Ltd
Original Assignee
Shanghai Yamu Communication Tech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yamu Communication Tech Co Ltd
Publication of GB2567749A
Legal status: Withdrawn

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/535 Tracking the activity of the user
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/40 Data acquisition and logging
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention provides a method for associating a domain name with a website access behavior, comprising the following steps: step S1, simulating a user's website access behavior by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set; step S2, slicing a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the domain name request sets obtained by DNS log slicing in step S2, and if one of the domain name request sets obtained by DNS log slicing is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user clicks on the domain name of a URL requested by the crawler program during crawling. By means of the method for associating a domain name with a website access behavior of the present invention, the analysis of a user's Internet browsing behavior can also be implemented by means of a DNS log.

Description

The disclosure relates to the field of Internet DNS domain name resolution and web crawler technique, and more particularly, to a method for associating a domain name with a website visiting behavior.
Background Art
DNS (Domain Name System) is a distributed database which provides a mapping between domain names and IP addresses on the Internet. DNS allows a user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. “DNS name resolution” means that when visiting a website, the user first types its domain name in the browser and presses the enter key. The browser then initiates a DNS request. Through DNS, the browser obtains the IP address of the server corresponding to the domain name and initiates an HTTP request to this IP address.
A web crawler is a program or script which crawls web information automatically according to certain rules. Here, the web crawler simulates the user by initiating an HTTP request to the website, and records the DNS requests generated during the process.
The value of DNS data has long been overlooked: it is usually regarded only as a mapping between IP addresses and domain names, so that no product currently on the market uses DNS data to associate domain names with website visiting behavior.
Summary of the Disclosure
The disclosure provides a method for associating a domain name with a website visiting behavior. By means of a combination of DNS log collection and web crawler technique, the analysis on an Internet browsing behavior of a user can also be implemented via a DNS log.
A method for associating a domain name with a website visiting behavior in this disclosure includes the following steps: step S1, simulating a website visiting behavior of a user by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set; step S2, segmenting a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the domain name request sets obtained by the DNS log segmentation in step S2, and if one of the domain name request sets obtained by the DNS log segmentation is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user has clicked the domain name of a URL requested by the crawler program during crawling.
Preferably, the DNS log in step S2 is a DNS log recorded on the day of the visiting behavior.
Preferably, the DNS log segmentation in step S2 includes two-time segmentation, that is, a segmentation based on the source IP first and then another segmentation based on the timestamp difference.
Preferably, the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.
Preferably, the segmentation based on the time stamp difference is to segment, based on the time stamp difference in DNS logs, the log after being segmented based on the source IP, and if the time stamp difference in two DNS logs is longer than a specified time length, the two DNS logs are split up.
Preferably, the specified time length is three seconds.
By means of the method for associating a domain name with a website visiting behavior of the disclosure, the analysis on an Internet browsing behavior of a user can also be implemented by means of a DNS log.
Brief Description of Drawings
FIG. 1 is a schematic representation for a DNS domain name request set crawled by a crawler program.
FIG. 2 is a flow chart of a method for associating a domain name with a website visiting behavior of the disclosure.
Description of Embodiments
The disclosure will be described in detail below with reference to the accompanying drawings and embodiments. The following embodiments are not intended to limit the invention. Variations and advantages that can be conceived by those skilled in the art are included in the disclosure without departing from the spirit and scope of the disclosure.
As mentioned above, DNS (Domain Name System) is a distributed database which provides a mapping between domain names and IP addresses on the Internet. DNS allows a user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. When visiting a website, the user first types its domain name in the browser and presses the enter key. The browser then initiates a DNS request. Through DNS, the browser obtains the IP address of the server corresponding to the domain name and initiates an HTTP request to this IP address. This is DNS name resolution.
A DNS log may be generated during the above domain name resolution process. The DNS log records the response content of every DNS request and therefore captures the domain name information of almost all user requests. The format of the DNS log is as follows:
14.***.***.10|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0
Source IP | Domain name | Time stamp | Resolution IP | Status code
That is, DNS logs consist of “Source IP”, “Domain name”, “Time stamp”, “Resolution IP” and “Status code”. The method for associating a domain name with a website visiting behavior of the disclosure will be described in detail below with reference to FIG. 1.
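As a minimal sketch of the five-field format above, a log line can be split on the `|` separator. The class name `DnsLogEntry`, the function name `parse_dns_log_line`, and the unmasked IP addresses in the sample line are assumptions made for illustration (the source masks its addresses with `***`); the timestamp layout `YYYYMMDDhhmmss` follows the example given in the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DnsLogEntry:
    """One DNS log record (hypothetical container, not named in the source)."""
    source_ip: str
    domain: str
    timestamp: datetime
    resolved_ips: tuple
    status: int


def parse_dns_log_line(line: str) -> DnsLogEntry:
    """Parse 'source_ip|domain|YYYYMMDDhhmmss|ip1:ip2|status' into an entry."""
    source_ip, domain, stamp, resolved, status = line.strip().split("|")
    return DnsLogEntry(
        source_ip=source_ip,
        domain=domain,
        timestamp=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        resolved_ips=tuple(resolved.split(":")),  # resolution IPs are ':'-separated
        status=int(status),
    )


# Sample line with placeholder (assumed) addresses in place of the masked ones.
entry = parse_dns_log_line(
    "2.2.2.2|www.baidu.com|20141211035932|180.0.0.107:180.0.0.108|0"
)
print(entry.domain, entry.timestamp.isoformat())
```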
First, a website visiting behavior of a user is simulated by means of a crawler program so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set (step S1). For example, when a page is opened or a URL (link) is clicked, the crawler program may crawl all DNS domain name requests in the current HTTP request. Since clicking a URL may cause the user to request other domain names in addition to the domain name of that URL, all DNS domain name requests generated after the URL is clicked can be obtained through the crawler technique. Here, a Uniform Resource Locator (URL) is a compact representation of the location and access method of a resource available on the Internet, and is the standard address of resources on the Internet. Every file on the Internet has a unique URL, which contains information showing the file location and how the browser should process it.
For instance, a user has clicked a specific URL (link), as shown below: “http://baike.baidu.com/link?url=Lm-TkKUzV687IRoPCDVUAG5qslgMyZtNa6e6A3nPnWXorcXEAII5006XHZWpTJat”.
The crawler program can crawl all DNS domain name requests generated after the URL is clicked, i.e., a DNS domain name request set, as shown in detail in FIG. 1.
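The description does not say how the crawler records the DNS requests. One possible approach, sketched here purely as an assumption, is to collect the URLs of every resource fetched while the page loads and reduce them to their hostnames. The function name `crawled_domain_set` and the URLs in `page_requests` are hypothetical.

```python
from urllib.parse import urlsplit


def crawled_domain_set(requested_urls):
    """Return the set of hostnames appearing in the URLs fetched for one page."""
    hosts = set()
    for url in requested_urls:
        host = urlsplit(url).hostname  # lowercased hostname, or None if absent
        if host:
            hosts.add(host)
    return frozenset(hosts)


# Hypothetical resource URLs recorded while loading a single page.
page_requests = [
    "http://www.a.com/doc/1234",
    "http://www.b.com/static/site.css",
    "http://www.c.com/js/app.js",
    "http://www.b.com/img/logo.png",  # repeated host collapses into one entry
]
domains = crawled_domain_set(page_requests)
print(sorted(domains))
```

A set is used because step S3 compares sets of domain names, so repeated requests to the same host carry no extra information.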
Next, a DNS log is segmented to obtain n domain name request sets, n being an integer greater than or equal to 1 (step S2). Here, the DNS log is a DNS log recorded on the day of the visiting behavior. The segmentation is performed in two passes: first a segmentation based on the source IP, and then another segmentation based on the timestamp difference.
1) The DNS log is first segmented based on the source IP, that is, consecutive DNS log entries are split apart wherever the source IP changes. The purpose of this segmentation is to obtain consecutive DNS log entries with the same source IP over a period of time, as shown below:
1.1.1.1|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0
1.1.1.1|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0
---------------------------------------Log Cutting Line----------------------------------------
2.2.2.2|www.baidu.com|20141211035932|180.***.***.107:180.***.***.108|0
2.2.2.2|www.qq.com|20141211035932|180.***.***.107:180.***.***.108|0
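The source-IP pass can be sketched with `itertools.groupby`, which groups only consecutive items sharing a key and so mirrors the "cutting line" between runs of entries. The function name `segment_by_source_ip` and the unmasked placeholder addresses are assumptions for illustration.

```python
from itertools import groupby


def source_ip(line: str) -> str:
    """The source IP is the first '|'-separated field of a log line."""
    return line.split("|", 1)[0]


def segment_by_source_ip(lines):
    """Split consecutive log lines into runs that share the same source IP."""
    return [list(run) for _, run in groupby(lines, key=source_ip)]


# Placeholder (assumed) resolution IPs stand in for the masked addresses.
log = [
    "1.1.1.1|www.baidu.com|20141211035932|180.0.0.107:180.0.0.108|0",
    "1.1.1.1|www.qq.com|20141211035932|180.0.0.107:180.0.0.108|0",
    "2.2.2.2|www.baidu.com|20141211035932|180.0.0.107:180.0.0.108|0",
    "2.2.2.2|www.qq.com|20141211035932|180.0.0.107:180.0.0.108|0",
]
segments = segment_by_source_ip(log)
for segment in segments:
    print(source_ip(segment[0]), len(segment))
```

If entries from different source IPs are interleaved in the raw log, they would presumably need to be ordered by source IP (with a stable sort, so time order is kept) before this step; the source does not specify this.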
2) The segmentation based on the time stamp difference means that the logs after being segmented based on the source IP are segmented based on the timestamp difference in the DNS logs. If the timestamp difference in two continuous logs is longer than a specified time length, the two logs are split up (the reason for this is the interval of the logs is so long that they are considered as two different behaviors). The specified time length may be adjusted as desired. In this embodiment, the specified time length is three seconds, i.e., the log may be split up if the timestamp interval is longer than three seconds.
For example, the DNS log of source IP 2.2.2.2 can be further segmented based on its timestamp difference, as shown below. (Timestamp 20141211035932 represents 3 (hour):59 (minute):32 (second), Dec 11, 2014)
Source IP | Domain name | Timestamp | Resolution IP | Status code
2.2.2.2|www.baidu.com|20141211000001|180.***.***.107:180.***.***.108|0
2.2.2.2|a.qq.com|20141211000002|180.***.***.107:180.***.***.108|0
2.2.2.2|b.baidu.com|20141211000003|180.***.***.107:180.***.***.108|0
2.2.2.2|c.tanx.com|20141211000004|180.***.***.107:180.***.***.108|0
2.2.2.2|c.allyes.com|20141211000005|180.***.***.107:180.***.***.108|0
---------------------------------------Log Cutting Line----------------------------------------
2.2.2.2|www.sina.com|20141211000009|180.***.***.107:180.***.***.108|0
As described above, since the difference between 05 seconds in timestamp 20141211000005 and 09 seconds in timestamp 20141211000009 is four seconds (longer than three seconds), the log is split up.
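The timestamp pass can be sketched as follows, using the domains and timestamps of the example and the three-second threshold of this embodiment. The function name `segment_by_time_gap` and the `(domain, timestamp)` tuple layout are assumptions; the input is assumed to hold the entries of a single source IP in time order.

```python
from datetime import datetime, timedelta


def segment_by_time_gap(entries, max_gap=timedelta(seconds=3)):
    """Split (domain, 'YYYYMMDDhhmmss') pairs wherever the gap between
    two consecutive entries is longer than max_gap."""
    segments = []
    previous = None
    for domain, stamp in entries:
        current = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        # Start a new segment for the first entry or after a long gap.
        if previous is None or current - previous > max_gap:
            segments.append([])
        segments[-1].append(domain)
        previous = current
    return segments


entries = [
    ("www.baidu.com", "20141211000001"),
    ("a.qq.com", "20141211000002"),
    ("b.baidu.com", "20141211000003"),
    ("c.tanx.com", "20141211000004"),
    ("c.allyes.com", "20141211000005"),
    ("www.sina.com", "20141211000009"),  # 4 s after the previous entry
]
segments = segment_by_time_gap(entries)
print(segments)
```

The comparison is strict (`>`), matching the wording "longer than a specified time length": a gap of exactly three seconds does not split the log.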
www.baidu.com, a.qq.com, b.baidu.com, c.tanx.com and c.allyes.com thus form one domain name request set in the DNS log.
Next, set-to-set matching is performed between the domain name request set crawled by the crawler in step S1 and the domain name request sets obtained by the DNS log segmentation in step S2 (step S3). The matching rule does not depend on the order of the elements: [(a,b,c) = (b,c,a) = (a,c,b)].
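The order-independent rule corresponds to ordinary set equality. A small sketch, with the helper name `same_request_set` chosen for illustration:

```python
def same_request_set(first, second):
    """Compare two sequences of domain names as sets, ignoring order."""
    return frozenset(first) == frozenset(second)


print(same_request_set(["a", "b", "c"], ["b", "c", "a"]))
print(same_request_set(["a", "b", "c"], ["a", "c", "b"]))
print(same_request_set(["a", "b", "c"], ["a", "b"]))
```

Note that converting to a set also discards duplicates, so a domain requested twice within a segment counts once. The source does not say whether repeats should matter; this is a consequence of treating the requests as sets.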
After the matching, it is considered that the DNS log indicates that the user has clicked the domain name (i.e., the URL domain name requested by the crawler during crawling) if a part of domain name request set in DNS log is included in the domain name request set crawled by the crawler, or the two sets are equal to each other. For example:
The URL crawled by the crawler (representing a click behavior of a user) is www.a.com/doc/1234. The set A of all the crawled domain name requests is “www.a.com, www.b.com, www.c.com, www.d.com, and www.e.com”.
A domain name request set B obtained after the segmentation of the DNS log is “www.a.com, www.b.com, www.e.com, and www.d.com”.
As mentioned above, when set B is included in set A, the domain name request set B is considered to reflect www.a.com/doc/1234, which is a visiting behavior of a user mapped by the domain name set A. In this way, the Internet browsing behaviors of users can also be analyzed via a DNS log.
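The matching of step S3 can be sketched as a subset test over the n log segments, using sets A and B from the example. The function names and the third, non-matching segment are hypothetical. Rejecting an empty segment is an added assumption, since an empty set is trivially contained in any set and would otherwise always match.

```python
def matches_crawled_set(log_segment, crawled_set):
    """True if the log segment equals, or is contained in, the crawled set."""
    segment = set(log_segment)
    # Assumption: an empty segment carries no evidence of a visit.
    return bool(segment) and segment <= set(crawled_set)


def find_matching_segments(log_segments, crawled_set):
    """Return the log segments attributed to the crawled URL."""
    return [seg for seg in log_segments if matches_crawled_set(seg, crawled_set)]


# Set A: domains crawled for www.a.com/doc/1234 (from the example above).
set_a = {"www.a.com", "www.b.com", "www.c.com", "www.d.com", "www.e.com"}
# Set B: one segment of the DNS log (from the example above).
set_b = ["www.a.com", "www.b.com", "www.e.com", "www.d.com"]
# A hypothetical segment containing a domain the crawler never saw.
other = ["www.a.com", "www.x.com"]

matched = find_matching_segments([set_b, other], set_a)
print(matched)
```

Here only set B is attributed to the visit of www.a.com/doc/1234; the other segment is rejected because www.x.com is not in set A.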
The above-described aspects are only the preferred embodiments of the disclosure and are not intended to limit the scope of the disclosure. Any equivalent variations or modifications made according to the content of the claims of the disclosure should fall within the technical scope of the disclosure.

Claims (6)

1. A method for associating a domain name with a website visiting behavior, comprising the following steps:
step S1, simulating a website visiting behavior of a user by means of a crawler program, so as to obtain all DNS domain name requests in a current HTTP request, i.e., a crawled DNS domain name request set;
step S2, segmenting a DNS log to obtain n domain name request sets, n being an integer greater than or equal to 1; and
step S3, performing set-to-set matching on the crawled DNS domain name request set in step S1 and the n domain name request sets obtained by DNS log segmentation in step S2, and if one of the n domain name request sets obtained by the DNS log segmentation is equal to or contained in the crawled DNS domain name request set, considering that the DNS log indicates that the user has clicked the domain name of a URL requested by the crawler program during crawling.
2. The method for associating a domain name with a website visiting behavior according to claim 1, wherein the DNS log in step S2 is a DNS log on the day of the visiting behavior.
3. The method for associating a domain name with a website visiting behavior according to claim 1, wherein the DNS log segmentation in step S2 includes two-time segmentation, that is, a segmentation based on a source IP first and then another segmentation based on a time stamp difference.
4. The method for associating a domain name with a website visiting behavior according to claim 3, wherein the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.
5. The method for associating a domain name with a website visiting behavior according to claim 4, wherein the segmentation based on the timestamp difference is to segment, based on the time stamp difference in DNS logs, the log after being segmented based on the source IP, and if the time stamp difference in two DNS logs is longer than a specified time length, the two DNS logs are split up.
6. The method for associating a domain name with a website visiting behavior according to claim 5, wherein the specified time length is three seconds.
GB1816195.0A 2016-04-14 2016-08-17 Method for associating domain name with website access behavior Withdrawn GB2567749A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610230263.0A CN105763633B (en) 2016-04-14 2016-04-14 A method for associating domain name and website access behavior
PCT/CN2016/095670 WO2017177590A1 (en) 2016-04-14 2016-08-17 Method for associating domain name with website access behavior

Publications (1)

Publication Number Publication Date
GB2567749A (en) 2019-04-24

Family

ID=56333890

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1816195.0A Withdrawn GB2567749A (en) 2016-04-14 2016-08-17 Method for associating domain name with website access behavior

Country Status (5)

Country Link
JP (1) JP6703621B2 (en)
CN (1) CN105763633B (en)
GB (1) GB2567749A (en)
RU (1) RU2709647C9 (en)
WO (1) WO2017177590A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105763633B (en) * 2016-04-14 2019-05-21 上海牙木通讯技术有限公司 A method for associating domain name and website access behavior
CN111131370B (en) * 2018-11-01 2022-09-27 百度在线网络技术(北京)有限公司 Method, device and system for detecting whether service call is correct
CN110798545B (en) * 2019-11-05 2020-08-18 中国人民解放军国防科技大学 A Web-based Domain Name Data Acquisition Method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079064A (en) * 2007-06-25 2007-11-28 腾讯科技(深圳)有限公司 Web page sequencing method and device
CN103389983A (en) * 2012-05-08 2013-11-13 阿里巴巴集团控股有限公司 Webpage content grabbing method and device applied to network crawler system
CN104065532A (en) * 2014-06-26 2014-09-24 国家计算机网络与信息安全管理中心 A search method and system for unregistered websites based on multi-channel data access
CN105005600A (en) * 2015-07-02 2015-10-28 焦点科技股份有限公司 Preprocessing method of URL (Uniform Resource Locator) in access log
CN105704260A (en) * 2016-04-14 2016-06-22 上海牙木通讯技术有限公司 Method for analyzing where Internet traffic comes from and goes to
CN105763633A (en) * 2016-04-14 2016-07-13 上海牙木通讯技术有限公司 Association method of domain name and website visiting behavior

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7734815B2 (en) * 2006-09-18 2010-06-08 Akamai Technologies, Inc. Global load balancing across mirrored data centers
US8145762B2 (en) * 2007-05-22 2012-03-27 Kount Inc. Collecting information regarding consumer click-through traffic
CN101729288B (en) * 2008-10-31 2014-02-05 中国科学院计算机网络信息中心 Method and device for counting network access behaviours of internet users
US8527658B2 (en) * 2009-04-07 2013-09-03 Verisign, Inc Domain traffic ranking
JP5770652B2 (en) * 2012-01-31 2015-08-26 日本電信電話株式会社 Source / destination organization identification apparatus, method and program
CN105357054B (en) * 2015-11-26 2019-01-29 上海晶赞科技发展有限公司 Website traffic analysis method, device and electronic equipment

Also Published As

Publication number Publication date
JP2019514137A (en) 2019-05-30
WO2017177590A1 (en) 2017-10-19
RU2709647C1 (en) 2019-12-19
RU2709647C9 (en) 2020-04-02
CN105763633A (en) 2016-07-13
JP6703621B2 (en) 2020-06-03
CN105763633B (en) 2019-05-21

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)