RU2702048C1

RU2702048C1 - Method of analyzing a source and destination of internet traffic

Info

Publication number: RU2702048C1
Application number: RU2018139991A
Authority: RU
Inventors: Дашунь ЧЖАН
Original assignee: Шанхай Яму Коммюникейшн Текнолоджи Ко., Лтд
Priority date: 2016-04-14
Filing date: 2016-08-17
Publication date: 2019-10-03
Also published as: JP2019514303A; CN105704260B; GB2564057A; CN105704260A; JP7075348B2; WO2017177591A1

Abstract

FIELD: information technology. SUBSTANCE: the invention relates to DNS name resolution on the Internet. The method obtains the source and destination of Internet traffic by processing DNS logs and includes: a log filtering step, in which DNS logs that cannot reflect the user's actual access path are filtered out; a log segmentation step, in which the DNS logs obtained after the log filtering step are sequentially segmented based on the source IP address, the difference between timestamps, and the central domain to obtain segmented access paths; and a data summation step, in which all segmented access paths are summarized. EFFECT: wider range of tools. 9 cl, 4 dwg

Description

FIELD OF THE INVENTION

The disclosure relates to the field of DNS name resolution on the Internet and, in particular, to a method for analyzing the source and destination of Internet traffic.

BACKGROUND

The so-called source and destination of Internet traffic refer to the sequence of website access paths: the particular website that a user accesses first and the other websites that the user accesses afterwards. There is only one basic approach to identifying the source of a website's traffic, namely adding JavaScript tracking code to the website's pages. The most common tools of this kind are third-party analytics tools such as Google Analytics and Baidu Analytics.

These statistical models are severely limited: each website can learn only which website the visitor accessed immediately before it; it cannot learn about the other websites the visitor accessed earlier, nor where the visitor goes after leaving it. DNS (Domain Name System) is a distributed database that maps domain names to IP addresses on the Internet. DNS lets users access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly.
“DNS name resolution technology” means that when a user needs to access a website, the user enters its domain name in the browser; after the user presses the Enter key, the browser first initiates a DNS query; using DNS, the browser obtains the server IP address corresponding to the domain name and then initiates an HTTP request to that IP address.

DNS logs can record the response content of every DNS query and can capture the domain name information of almost all user queries. However, the logs may contain a large amount of incorrect and invalid information. For example, servers also initiate DNS queries and thereby generate a large amount of domain name information, and Internet crawlers and even network attacks generate large numbers of DNS queries. Such queries cannot correctly and effectively reflect users' real access paths.

At present there is no good method on the market for analyzing all access paths of Internet visitors. The present disclosure fills this gap: by reprocessing DNS logs, it provides a method of analyzing a website's traffic to find out which websites the traffic comes from and which websites it goes to after leaving.

SUMMARY OF THE INVENTION

In view of the above disadvantages, the disclosure provides a method for analyzing the source and destination of Internet traffic. With the method of the disclosure, non-human access behavior is removed from the logs as far as possible, so that the source and destination of Internet traffic can be obtained effectively.

The method for analyzing the source and destination of Internet traffic according to the disclosure, which obtains the source and destination of Internet traffic by processing DNS logs, includes the following steps:

a log filtering step, in which DNS logs that cannot reflect the user's actual access path are filtered out; a log segmentation step, in which the DNS logs obtained after the log filtering step are sequentially segmented based on the source IP address, the difference between timestamps, and the central domain to obtain segmented access paths; and a data summation step, in which all segmented access paths are summarized.

Preferably, by setting a blacklist and a whitelist in the log filtering step, DNS logs containing domain name queries of significant interest are retained, and DNS logs containing non-human domain name queries generated by servers are deleted.

Preferably, deleting DNS logs further includes deleting logs accessed from corporate IP addresses and logs in which the IP address was not resolved.

Preferably, segmenting the DNS log based on the source IP address consists in obtaining the consecutive DNS logs that share the same source IP address within a period of time.

Preferably, segmenting the logs based on the difference between timestamps consists in further segmenting the logs, after they have been segmented based on the source IP address, according to the difference between the timestamps in the DNS logs: if the difference between the timestamps of two DNS logs is greater than a certain time span, the two DNS logs are separated.

Preferably, the certain time span is three seconds.

Preferably, after the step of segmenting the DNS logs based on the difference between timestamps, the analysis method further includes a merging step, in which each domain name in the access paths obtained by segmentation is converted into a domain, and consecutive identical domains are merged to obtain the path of the source IP address.

Preferably, the segmentation based on the central domain consists in segmenting the path of the source IP address around the central domain; the access path obtained after segmentation is: source domain name n + … + source domain name 1 + central domain name + destination domain name 1 + … + destination domain name n, where the central domain is the domain that is the main subject of analysis according to user/system requirements.

Preferably, all access paths of the source IP address obtained after the segmentation step based on the central domain are summarized in the data summation step.

With the analysis method of this disclosure, the source and destination of Internet traffic can be determined, enabling better analysis and optimization of website traffic. In addition, with full knowledge of the flow direction of all Internet traffic, the traffic status of other websites can also be analyzed and understood comprehensively.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1(a) and 1(b) are flowcharts of the method for analyzing the source and destination of Internet traffic according to the disclosure; and

FIG. 2(a) and 2(b) are schematic diagrams of the traffic source obtained by the method for analyzing the source and destination of Internet traffic according to the disclosure.

DETAILED DESCRIPTION OF THE INVENTION

The disclosure is described in detail below with reference to the accompanying drawings and embodiments. The following embodiments are not intended to limit the invention. Changes and advantages that can be understood by those skilled in the art are included in the present disclosure without departing from its spirit and scope.

As stated above, DNS (Domain Name System) is a distributed database that maps domain names to IP addresses on the Internet. DNS lets users access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. When accessing a website, the user first enters its domain name in the browser and presses the Enter key. The browser then initiates a DNS query. Using DNS, the browser obtains the server IP address corresponding to the domain name and then initiates an HTTP request to that IP address. These steps constitute DNS name resolution technology.

DNS logs can be generated during the above domain name resolution process. DNS logs can record the response content of every DNS query and can capture the domain name information of almost all user queries. The format of the DNS logs is as follows:

Thus, a DNS log consists of “Source IP Address”, “Domain Name”, “Timestamp”, “Resolved IP” and “Status Code”.
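The five fields named above can be modeled as a small record type. The following is only a minimal sketch: the pipe-separated line layout, the class name `DnsLogRecord` and the function `parse_record` are assumptions made for illustration, since the original format listing is not reproduced here. Only the 14-digit timestamp layout (YYYYMMDDhhmmss) is taken from the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DnsLogRecord:
    """One DNS log entry with the five fields named in the description."""
    source_ip: str
    domain_name: str
    timestamp: datetime
    resolved_ip: str
    status_code: str


def parse_record(line: str, sep: str = "|") -> DnsLogRecord:
    """Parse one log line. The separator is an assumption, not the patent's format."""
    source_ip, domain_name, stamp, resolved_ip, status_code = line.strip().split(sep)
    return DnsLogRecord(
        source_ip=source_ip,
        domain_name=domain_name,
        # Timestamps such as 20141211035932 follow the YYYYMMDDhhmmss layout.
        timestamp=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        resolved_ip=resolved_ip,
        status_code=status_code,
    )


# Hypothetical log line used only to exercise the parser.
record = parse_record("2.2.2.2|www.baidu.com|20141211035932|3.3.3.3|0")
```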

Since the DNS log includes the domain name information of all user queries, the inventors propose analyzing the source and destination of website traffic by reprocessing the DNS log. However, the log also includes a large amount of incorrect and invalid information. For example, servers also initiate DNS queries and thereby generate a large amount of domain name information, and Internet crawlers and even network attacks generate large numbers of DNS queries. Such queries cannot correctly and effectively reflect users' real access paths. Based on this situation, the inventors consider that the source and destination of Internet traffic can be obtained effectively by removing non-human access behavior from the log as far as possible.

FIG. 1 is a flowchart of the method for analyzing the source and destination of Internet traffic according to the disclosure. As shown in FIG. 1, the method for analyzing the source and destination of Internet traffic in this disclosure includes the following steps.

First, DNS logs that cannot reflect the actual access path are filtered out (step S1). As described above, since DNS queries include many domain names that cannot correctly and effectively reflect the actual access path, cleaning is required.
For example, by setting a blacklist and a whitelist, DNS logs containing domain name queries of significant interest are retained, and DNS logs containing non-human domain name queries generated by servers are deleted. Non-human domain name queries generated by servers can be removed by means of the blacklist. Certain domain names of significant interest can be retained by means of the whitelist. The whitelist has higher priority than the blacklist. In addition, deleting DNS logs further includes deleting logs accessed from corporate IP addresses and logs in which the IP address was not resolved: logs from a corporate IP address are deleted because such an address can generate logs from many people accessing simultaneously, which distorts the tracking of individual access; and a log with an unresolved IP address is deleted because it is a log of a failed access. Filtering on these various criteria yields DNS logs that reflect the user's actual access path.
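The filtering rules of step S1 can be sketched as a single predicate. This is a minimal illustration, not the patented implementation: the function name `keep_record`, the use of plain sets for the lists, and the choice to test corporate IP addresses and unresolved records before consulting the whitelist are assumptions. The description itself only states that the whitelist takes priority over the blacklist.

```python
def keep_record(source_ip, domain_name, resolved_ip,
                whitelist, blacklist, corporate_ips):
    """Return True if a DNS log record should survive the filtering step."""
    # Corporate addresses mix many people's accesses, so their logs are dropped.
    if source_ip in corporate_ips:
        return False
    # An empty resolved IP is taken here to mean a failed (unresolved) query.
    if not resolved_ip:
        return False
    # The whitelist has higher priority than the blacklist.
    if domain_name in whitelist:
        return True
    return domain_name not in blacklist


# Hypothetical lists, for illustration only.
whitelist = {"www.baidu.com"}
blacklist = {"www.baidu.com", "ntp.example.net"}
corporate_ips = {"9.9.9.9"}

kept = keep_record("2.2.2.2", "www.baidu.com", "3.3.3.3",
                   whitelist, blacklist, corporate_ips)
```

Because the whitelist is checked first, a domain that appears on both lists is retained.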

Then, the DNS logs obtained after the log filtering step are segmented based on the source IP address, the difference between timestamps, and the central domain to obtain segmented domains (step S2).

The detailed steps are as follows.

1) Segmentation is performed based on the source IP address (step S21). The DNS log is segmented by source IP address to obtain the consecutive DNS logs that share the same source IP address within a period of time.
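Step S21 amounts to grouping records by source IP address and keeping each group in time order. The sketch below assumes records are simple `(source_ip, domain_name, timestamp)` tuples with 14-digit timestamp strings; the function name and the sample records are hypothetical.

```python
from collections import defaultdict


def segment_by_source_ip(records):
    """Group (source_ip, domain_name, timestamp) tuples by source IP address.

    Returns a dict mapping each source IP to its (timestamp, domain_name)
    entries in chronological order. Fixed-width YYYYMMDDhhmmss strings sort
    chronologically when compared as plain strings.
    """
    by_ip = defaultdict(list)
    for source_ip, domain_name, stamp in records:
        by_ip[source_ip].append((stamp, domain_name))
    return {ip: sorted(entries) for ip, entries in by_ip.items()}


# Hypothetical records, for illustration only.
records = [
    ("2.2.2.2", "www.sina.com", "20141211000009"),
    ("1.1.1.1", "www.qq.com", "20141211000001"),
    ("2.2.2.2", "www.baidu.com", "20141211000005"),
]
segments = segment_by_source_ip(records)
```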

For example, the source IP address 1.1.1.1 differs from the source IP address 2.2.2.2, so the log is segmented between them. This is shown as follows:

2) Next, the logs segmented by source IP address are segmented based on the difference between timestamps (step S22). That is, after the logs have been segmented by source IP address, they are further segmented according to the difference between the timestamps in the DNS logs. If the difference between the timestamps of two DNS logs is greater than a certain time span, the two DNS logs are separated (the reason being that the interval between them is so large that the two logs are regarded as two different access behaviors). The time span can be configured as required. In this embodiment it is three seconds, i.e. the log is segmented wherever the interval between timestamps exceeds three seconds.

For example, as shown below, the DNS log of source IP address 2.2.2.2 is further segmented based on the differences between its timestamps (the timestamp 20141211035932 denotes 03:59:32 on December 11, 2014).

As described above, since the difference between second 05 in the timestamp 20141211000005 and second 09 in the timestamp 20141211000009 is four seconds (greater than three seconds), the log is split there. The difference between 20141211000009 and 20141211000015 is six seconds, so the log is split there as well.
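The time-gap rule of step S22 can be sketched as follows. The three-second threshold and the timestamps ending in 05, 09 and 15 seconds come from the description; the function name, the tuple layout and the pairing of particular domain names with particular timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def split_by_time_gap(entries, max_gap=timedelta(seconds=3)):
    """Split one source IP's chronological (timestamp, domain_name) entries.

    A new segment starts whenever the gap to the previous entry is strictly
    greater than max_gap.
    """
    segments = []
    previous = None
    for stamp, domain_name in entries:
        current = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        if previous is None or current - previous > max_gap:
            segments.append([])
        segments[-1].append(domain_name)
        previous = current
    return segments


# The 4 s and 6 s gaps both exceed the 3 s threshold, giving three segments.
entries = [
    ("20141211000005", "www.baidu.com"),
    ("20141211000005", "a.qq.com"),
    ("20141211000009", "www.sina.com"),
    ("20141211000015", "www.youku.com"),
]
parts = split_by_time_gap(entries)
```

The comparison is strict, so two entries exactly three seconds apart stay in the same segment, matching the "greater than" wording of the description.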

As described above, the log is segmented into six parts. In the first part of the log, source IP address 2.2.2.2 queried five domain names: www.baidu.com, a.qq.com, b.baidu.com, c.tanx.com and c.allyes.com. According to the method for determining user access behavior, it can be concluded that the user actually accessed only www.baidu.com; the other four domain names are merely domain name queries additionally generated after the user clicked through to www.baidu.com and do not represent actual user access behavior. Hence the first part of the log yields the domain name path accessed by the user, i.e. www.baidu.com.
The method for determining user access behavior mentioned here is as follows: when a user opens a URL, some other domain names are queried in addition to the domain name of that URL. All queries for other domain names that follow the query for the URL's domain name can be collected using crawler technology, and the crawled domain name queries are matched against the group of domain names segmented from the DNS log, so that a correspondence is obtained between the DNS log and the domain name that the user actually accessed. Based on the correspondence obtained in this way, it can be determined that this part of the log reflects the user actually accessing www.baidu.com. The second part of the log contains only www.sina.com, so www.sina.com is the domain name path accessed by the user.
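One way to read the matching described above is sketched below. It assumes that the crawler output is already available as a mapping from a page's domain name to the set of domain names that the page triggers; the description does not specify the matching algorithm, so the greedy rule, the function name `visited_domains` and the `dependencies` mapping are illustrative assumptions.

```python
def visited_domains(segment, dependencies):
    """Reduce one time segment to the domain names the user actually opened.

    segment: domain names in query order within one time segment.
    dependencies: crawler-derived map from a page's domain name to the set of
        domain names queried as a side effect of loading that page.
    """
    visited = []
    covered = set()
    for domain_name in segment:
        # A name already explained by an earlier page is a follow-up query.
        if domain_name in covered:
            continue
        visited.append(domain_name)
        covered |= dependencies.get(domain_name, set())
    return visited


# Assumed crawl result: the four follow-up names from the example above.
dependencies = {
    "www.baidu.com": {"a.qq.com", "b.baidu.com", "c.tanx.com", "c.allyes.com"},
}
first_part = ["www.baidu.com", "a.qq.com", "b.baidu.com",
              "c.tanx.com", "c.allyes.com"]
result = visited_domains(first_part, dependencies)
```

With this assumed crawl result, the first part of the log reduces to www.baidu.com alone, in line with the conclusion stated in the description.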

After the paths from the above log parts are joined, the resulting paths are as follows:

Next, the paths obtained by segmentation based on the difference between timestamps are merged by identical domain, i.e. the second-level domain in the present example, and the merged result is as follows:

The above path is the access-behavior path of one source IP address; the access paths of all source IP addresses can be computed by the same rule.
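The merging step can be sketched as mapping each domain name to its second-level domain and collapsing consecutive repeats. The function names and the sample path are hypothetical. Taking the last two labels is a simplification: it gives the wrong result for suffixes such as .co.uk, where a public suffix list would be needed.

```python
from itertools import groupby


def to_domain(domain_name):
    """Naively reduce a domain name to its last two labels."""
    labels = domain_name.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def merge_consecutive(domain_names):
    """Convert names to domains and merge runs of identical consecutive domains."""
    return [domain for domain, _ in groupby(to_domain(n) for n in domain_names)]


# Hypothetical path of one source IP address, for illustration only.
path = ["www.baidu.com", "tieba.baidu.com", "www.sina.com", "www.baidu.com"]
merged = merge_consecutive(path)
```

Only consecutive repeats are merged, so a domain that the user returns to later still appears again in the path.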

3) Next, the above results are segmented based on the central domain (step S23). The central domain, i.e. the domain that is the main subject of analysis according to user/system requirements, is analyzed to find out from where users arrived at the central domain and to which domains they went on from it. For example, a.com in the log is taken as the central domain, as shown below:

For example, four paths of the above source IP address are listed below. As an example, each path shows only the source domains of the first three layers before the central domain; the processing logic for the path after the central domain is the same as for the path before it. The actual number of layers can be adjusted to specific needs. The paths are also shown in FIG. 2(a):
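The central-domain segmentation of step S23 can be sketched as follows. This block is an illustration and not the original listing of the four paths, which is not reproduced here: the function name, the `depth` parameter and the sample path are assumptions. Source domain 1 is taken to be the domain visited immediately before the central domain, as in the access path format given in the summary.

```python
def split_on_central(path, central, depth=3):
    """Extract source and destination layers around each visit to `central`.

    Returns a list of (sources, destinations) pairs, one per occurrence of
    the central domain. sources[0] is the domain visited immediately before
    the central domain; destinations[0] is the one visited immediately after.
    """
    results = []
    for index, domain in enumerate(path):
        if domain != central:
            continue
        # Take up to `depth` preceding domains, nearest first.
        sources = path[max(0, index - depth):index][::-1]
        destinations = path[index + 1:index + 1 + depth]
        results.append((sources, destinations))
    return results


# Hypothetical merged path of one source IP address.
path = ["youku.com", "sina.com", "qq.com", "a.com", "baidu.com"]
layers = split_on_central(path, "a.com")
```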

Finally, all four access paths of the above source IP address are summed in the data summation step. The summation diagram is shown in FIG. 2(b).

The summed result for the central domain is four a.com.

The summed result for Source Domain 1 is two qq.com, one baidu.com and one youku.com.

The summed result for Source Domain 2 is two sina.com, one baidu.com and one qq.com.

The summed result for Source Domain 3 is two baidu.com, one sina.com and one youku.com.
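The per-layer summation can be illustrated with a short sketch. The four paths below are hypothetical: they were chosen only so that their per-layer counts match the summed results stated above, and they are not claimed to be the paths shown in FIG. 2(a). The function name `summarize` is likewise an assumption.

```python
from collections import Counter

CENTRAL = "a.com"

# Hypothetical paths, each written as [source 3, source 2, source 1, central].
paths = [
    ["baidu.com", "sina.com", "qq.com", "a.com"],
    ["baidu.com", "sina.com", "qq.com", "a.com"],
    ["sina.com", "qq.com", "baidu.com", "a.com"],
    ["youku.com", "baidu.com", "youku.com", "a.com"],
]


def summarize(paths, central, layers=3):
    """Count visits to the central domain and the domains seen in each source layer."""
    central_count = 0
    layer_counts = [Counter() for _ in range(layers)]
    for path in paths:
        i = path.index(central)  # position of the central domain in this path
        central_count += 1
        for layer in range(1, layers + 1):
            j = i - layer  # layer 1 is the hop just before the central domain
            if j >= 0:
                layer_counts[layer - 1][path[j]] += 1
    return central_count, layer_counts


central_count, layer_counts = summarize(paths, CENTRAL)
```

With these sample paths, `central_count` is 4 and `layer_counts[0]` holds two qq.com, one baidu.com and one youku.com, which corresponds to the summed result for Source Domain 1.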

From the visualization shown in FIG. 2(b), it can be clearly seen which domains a user visited immediately before accessing the central domain a.com, which domains the user visited before that, and so on.

When all source IP addresses are processed according to this logic, the source and destination of all Internet traffic can be seen.
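Processing every source IP address in this way could look roughly like the sketch below. It is a simplified illustration under stated assumptions: records are hypothetical `(timestamp_seconds, source_ip, queried_name)` tuples that have already passed the log filtering step, logs are simply grouped by source IP address, the three-second gap is the value given in claim 6, and `registered_domain` naively keeps the last two labels of a name (a real implementation would need a public suffix list for names such as example.co.uk).

```python
from collections import defaultdict
from itertools import groupby

GAP_SECONDS = 3.0  # time gap that separates two segments (value from claim 6)


def registered_domain(name):
    # Naive assumption: the domain is the last two labels of the queried name.
    return ".".join(name.rstrip(".").split(".")[-2:])


def access_paths(records, gap=GAP_SECONDS):
    """Return {source_ip: [segment, ...]}, each segment being a list of domains."""
    by_ip = defaultdict(list)
    for ts, ip, name in sorted(records, key=lambda r: r[0]):
        by_ip[ip].append((ts, registered_domain(name)))

    paths = {}
    for ip, entries in by_ip.items():
        segments, current, last_ts = [], [], None
        for ts, domain in entries:
            # Start a new segment when the gap between two logs exceeds the limit.
            if last_ts is not None and ts - last_ts > gap:
                segments.append(current)
                current = []
            current.append(domain)
            last_ts = ts
        segments.append(current)
        # Merge consecutive identical domains inside each segment.
        paths[ip] = [[d for d, _ in groupby(seg)] for seg in segments]
    return paths


# Hypothetical filtered DNS log records.
records = [
    (0.0, "10.0.0.1", "www.baidu.com"),
    (0.5, "10.0.0.1", "img.baidu.com"),
    (1.0, "10.0.0.1", "www.a.com"),
    (2.0, "10.0.0.2", "sina.com"),
    (10.0, "10.0.0.1", "www.qq.com"),
]
paths = access_paths(records)
```

The resulting segments per source IP address could then be passed to the central-domain segmentation and the summation step.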

With the analysis method according to this disclosure, the source and destination of Internet traffic can be determined for the central domain name under analysis, which makes it possible to improve traffic analysis and optimization for the website of that central domain name. In addition, with full knowledge of the flow direction of all Internet traffic, the traffic status of other websites can also be analyzed and understood comprehensively.

The aspects described above are only preferred embodiments of the disclosure and are not intended to limit its scope. Any equivalent changes and modifications made in accordance with the claims of this disclosure fall within the technical scope of this disclosure.

Claims

1. A method of analyzing the source and destination of Internet traffic, in which the source and destination of a user's Internet traffic are obtained by processing DNS logs, the method comprising the following steps:

a log filtering step, in which the DNS logs are filtered to obtain DNS logs that reflect the user's actual access path;

a log segmentation step, in which the DNS logs obtained after the log filtering step are sequentially segmented based on the source IP address, on the difference between the timestamps in the DNS logs, and on data about where the user moved to the central domain from, to determine the user's segmented access paths; and

a data summation step, in which all the segmented access paths of the user's IP address are summed.

2. The analysis method according to claim 1, wherein, by setting a black list and a white list at the log filtering step, DNS logs containing domain name queries of significant interest are kept, and DNS logs containing non-human domain name queries generated by a server are deleted.

3. The analysis method according to claim 2, wherein deleting the DNS logs further includes deleting logs of accesses made from corporate IP addresses and logs in which the IP address is not mapped.

4. The analysis method according to claim 3, wherein the segmentation of the DNS log based on the source IP address is to obtain consecutive DNS logs with the same source IP address over a period of time.

5. The analysis method according to claim 4, wherein the log segmentation based on the difference between timestamps is to further segment the logs, after they have been segmented based on the source IP address, based on the difference between the timestamps in the DNS logs, and if the difference between the timestamps of two DNS logs is greater than a certain time period, the two DNS logs are separated.

6. The analysis method according to claim 5, wherein the certain time period is three seconds.

7. The analysis method according to claim 6, further comprising, after the step of segmenting the DNS logs based on the difference between the timestamps, a combining step of converting the domain names in the access paths obtained by segmentation into domains and merging consecutive identical domains so as to obtain the path of the source IP address.

8. The analysis method according to claim 7, wherein the segmentation based on the central domain is to segment the path of the source IP address based on the central domain, and the access path obtained after segmentation is:

source domain name n + ... + source domain name 1 + central domain name + destination domain name 1 + ... + destination domain name n, and

the central domain is a domain that is mainly analyzed based on user/system requirements.

9. The analysis method of claim 8, wherein all access paths of the source IP address that are obtained after the segmentation step based on the central domain are summed up at the data summation step.