Tor core site discovery method based on hidden service association
Technical Field
The invention belongs to the technical field of anonymous networks (Anonymity Network), and particularly relates to a Tor core site discovery method based on hidden service association.
Background
Because of its strong anonymity, Tor is used by many criminals to conduct illegal transactions and engage in illegal activities such as gun sales, drug sales, and trading of private information; it is also used by some organizations to mount large-scale network attacks. Effective governance of darknet content therefore requires efficient crawlers for Tor darknet content. However, different hidden services in the darknet differ greatly in importance and in the amount of useful information they carry, so a whole-network crawler misses much valuable information and the quality of the collected hidden service data is low. In addition, a large number of darknet domain names serve nearly identical site content, i.e. the web pages of different domain names are basically the same; this causes crawlers to spend a great deal of analysis, storage, and computation on duplicated site content, which severely restricts detection and coverage of the darknet space. Core site discovery for the Tor darknet is therefore necessary.
Disclosure of Invention
Aiming at the problems that the Tor darknet contains a large amount of illegal content in urgent need of governance and that the quality of currently acquired darknet data is low, the invention provides a Tor core site discovery method based on hidden service association.
The invention adopts the following technical scheme:
A method for discovering a Tor core site based on hidden service association, the method comprising the steps of:
(1) Designing a hidden service association algorithm based on page structure and content, aimed at Web sites with similar content but different domain names;
(2) Calculating the survival rate of each hidden service, namely indirectly judging whether the hidden service is online through whether its descriptor exists, and taking the survival rate as one of the features for core site judgment;
(3) Measuring hidden service access volume, namely collecting the requests for hidden service blinded public keys by deploying a hidden service directory server (HSDir), and then analyzing and comparing the collected data to calculate the access volume of each hidden service;
(4) Tor core site discovery, namely analyzing the hidden services in each group clustered in (1) using the survival rate and access volume obtained in (2) and (3), and identifying the core sites.
Further, the step (1) specifically includes:
(11) Clustering by the redirection links in the Response Header, wherein some domain names, when accessed, return a 301 status code and are automatically redirected to another page, and the Location field in the Response Header gives the redirected page's domain name, so the domain name and the redirection domain name are clustered into one group;
(12) Clustering by meaningful titles, wherein the titles of default Web server pages (such as those of Apache and Nginx) in the darknet, including 'Index of /', 'Apache2 Debian Default Page', and '401 Authorization Required', are defined as meaningless; the sites with meaningless titles and the sites without title information are each grouped separately, and sites with meaningful title information and identical title text are grouped together;
(13) Clustering by combining the HTML DOM tree, CSS style, and page keywords, namely extracting one page from each set of meaningful titles, computing each page's DOM tree structure, class attribute values, id attribute values, and top 20 keywords, and comparing the DOM tree structure similarity, class/id attribute value similarity, and page keyword similarity of each pair of pages with a similarity algorithm.
Further, the step (2) specifically includes:
(21) Reading the domain name of the hidden service survival rate to be calculated from the database;
(22) Deploying a plurality of Tor processes, with the client sending descriptor query requests to the hidden service directory server through the Tor control protocol, so that multiple processes execute concurrently;
(23) If the query completes without exception, judging from the returned information whether the descriptor exists and saving the result, wherein if the descriptor exists the domain name is considered online, and if it does not exist the domain name is considered offline;
(24) If the descriptor query raises an exception and the domain name has been queried no more than 5 times, putting the domain name back into the queue for a later retry and returning to step (22);
(25) Storing the detection results according to the returned information, for use in calculating the hidden service survival rate.
Further, the step (3) specifically includes:
(31) Calculating all blind public keys in a certain period for each v3 domain name;
(32) Comparing the off-line calculated blind public key result with the blind public key data collected from the hidden service directory server to obtain the total access quantity of each v3 domain name;
(33) The daily average access amount of the hidden service v3 domain name is calculated by dividing the total access amount of each v3 domain name by the statistical days.
Further, the step (4) specifically includes:
(41) For each group of clusters in (1), the survival rate sr_j_i of the group is calculated as the maximum survival rate over all domain names in the group, where each domain name's survival rate is expressed as follows:

sr = online_num / total_num

wherein online_num is the number of probes in which the domain name was measured online, and total_num is the total number of probes;
(42) For each group of clusters in (1), an access volume view_j_i is calculated for the group: for websites with declared mirror sites, view_j_i is the sum of the access volumes of all domain names in the group, and for websites without declared mirror sites, view_j_i is the maximum access volume over all domain names in the group;
(43) Modeling core site discovery as a classification problem in machine learning, taking the access volume, survival rate, number of similar pages, and access degree as classification attributes, and using an XGBoost model to discover core sites;
(44) For pages classified as core sites, the classifier's discrimination probability x is also computed, and based on this probability the identified core sites are further divided into 3 importance levels: pages with x ≥ 0.9 are regarded as the most important core sites, pages with 0.75 ≤ x < 0.9 as the next most important, and pages with 0.5 ≤ x < 0.75 as the least important core sites.
Compared with the prior art, the invention has the remarkable advantages that:
1. The hidden service detection efficiency is improved. From the moment the Tor client sends a request until the hidden service is reached, the full process passes through 15 onion router hops; with the detection method of the invention, only 3 onion router hops are needed, so hidden service detection efficiency is significantly improved.
2. The traditional scheme of deploying hidden service directory servers to collect access volume is based on the Tor v2 protocol; the present method instead computes the blinded public keys of v3 domain names offline by extracting the relevant code from the Tor source, and obtains the Tor v3 hidden service access volume through analysis and comparison.
3. Existing importance rankings of Tor hidden services do not consider the characteristics of the Tor protocol. The invention combines core site discovery with Tor protocol characteristics, including hidden service survival rate and access volume, so that hidden service core sites can be discovered more effectively.
Drawings
FIG. 1 is a schematic diagram of the comprehensive cluster analysis algorithm of the present invention.
Fig. 2 is a hidden service probe flow chart of the present invention.
FIG. 3 is a system deployment diagram of the hidden service probe activity and access volume measurement of the present invention.
FIG. 4 is a flow chart of model training for core site discovery of the present invention.
Detailed Description
The invention designs and implements a Tor core site discovery technique based on hidden service association to discover core sites in the darknet. The method comprises hidden service association, hidden service activity detection, hidden service access volume measurement, and a core site discovery scheme, as follows:
1. Hidden service association
The hidden service association algorithm comprises three steps: clustering by the redirection links in the Response Header, clustering by meaningful titles, and clustering by combining the HTML DOM tree, CSS style, page keywords, and the like.
Step one clusters by the redirection links in the Response Header: the Location field in the Response Header gives the redirected page's domain name, so the domain name and its redirection domain name are clustered into one group.
Step two clusters by meaningful titles: the invention regards the titles of default Web server pages (such as those of Apache and Nginx), e.g. 'Index of /', 'Apache2 Debian Default Page', and '401 Authorization Required', as meaningless. On the basis of step one (the invention takes the title of a group successfully clustered in step one to be the title of the redirected domain name), the sites with meaningless titles and the sites without title information are each grouped separately, and sites with meaningful title information and identical title text are grouped together.
Step three clusters by combining content such as the HTML DOM tree, CSS style, and page keywords: the invention extracts one page from each set of meaningful titles, computes each page's DOM tree structure, class attribute values, id attribute values, and top 20 keywords, and compares the DOM tree structure similarity, class/id attribute value similarity, and page keyword similarity of each pair of pages with a similarity algorithm; the overall flow is shown in figure 1. Specifically, the DOM tree similarity of each pair of pages is calculated with a sequence comparison method and denoted similarity 1; the similarity of the class and id attribute values of each pair of page documents is calculated with the Jaccard coefficient (Jaccard similarity coefficient) and denoted similarity 2; and the similarity of the keyword information of each pair of pages is calculated with the same coefficient and denoted similarity 3. The three similarities are combined to decide whether two pages should be clustered into one group.
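The three similarity computations described above can be sketched as follows. This is a minimal illustration, assuming each page has already been parsed into a flattened DOM tag sequence, a set of class/id attribute values, and a top-20 keyword list; the function names, the dictionary layout, and the equal weighting of the three similarities are illustrative assumptions, not taken from the original.

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Jaccard similarity coefficient of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def page_similarity(p1, p2):
    """Combine DOM-structure, class/id-attribute, and keyword similarity.

    Each page is a dict with keys:
      'dom'      - flattened DOM tag sequence, e.g. ['html', 'head', ...]
      'attrs'    - set of class and id attribute values
      'keywords' - top-20 keyword list extracted from the page text
    """
    # similarity 1: DOM tree structure, via sequence comparison
    sim1 = SequenceMatcher(None, p1['dom'], p2['dom']).ratio()
    # similarity 2: class/id attribute values, via Jaccard coefficient
    sim2 = jaccard(p1['attrs'], p2['attrs'])
    # similarity 3: page keywords, via Jaccard coefficient
    sim3 = jaccard(p1['keywords'], p2['keywords'])
    # equal weighting of the three similarities is an assumption here
    return (sim1 + sim2 + sim3) / 3

a = {'dom': ['html', 'head', 'body', 'div'], 'attrs': {'nav', 'main'},
     'keywords': ['market', 'btc']}
b = {'dom': ['html', 'head', 'body', 'div'], 'attrs': {'nav', 'main'},
     'keywords': ['market', 'btc']}
print(page_similarity(a, b))  # identical pages -> 1.0
```

In practice the combined score would be compared against a clustering threshold; the threshold value is not specified in the text.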
2. Hidden service activity detection scheme
In this scheme, whether a hidden service is online is judged indirectly by whether its descriptor exists. By analyzing the Tor protocol, it is found that before communicating with a hidden service, the client must query the hidden service directory server for the hidden service descriptor, and the directory server's response falls into three cases:
(1) Query success: the descriptor exists and is returned successfully;
(2) Query failure: the descriptor does not exist;
(3) Query exception: no descriptor information is returned for some reason, including query timeout, the hidden service directory server rejecting the request, and so on.
Each hidden service will send its own descriptor to the hidden service directory server periodically (no more than two hours), and the hidden service directory server will also clear the expiration descriptor periodically, so whether the hidden service is online can be determined indirectly by whether the hidden service's descriptor is present.
The whole activity detection flow is shown in fig. 2, and the specific steps are as follows:
(1) Reading the domain name to be tested from the database;
(2) Deploying a plurality of Tor processes, with the client sending descriptor query requests to the hidden service directory server through the Tor control protocol, so that multiple processes execute concurrently;
(3) If the query completes without exception, judging from the returned information whether the descriptor exists and saving the result, wherein if the descriptor exists the domain name is considered online, and if it does not exist the domain name is considered offline;
(4) If the descriptor query raises an exception and the domain name has been queried no more than 5 times, putting the domain name back into the queue for a later retry and returning to step (2);
(5) Storing the hidden service online detection results according to the returned information.
For each hidden service, the survival rate sr can be expressed as:

sr = online_num / total_num

wherein online_num is the number of probes in which the domain name was measured online, and total_num is the total number of probes.
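The probing loop with exception re-queuing and the survival-rate computation can be sketched as follows. The `probe` callable is a stand-in for the real descriptor query sent over the Tor control protocol (an assumption of this sketch); here it is injected so the logic can be shown in isolation.

```python
from collections import deque

def probe_round(domains, probe, max_retries=5):
    """Run one detection round over all domain names.

    `probe(domain)` stands in for a descriptor query over the Tor control
    protocol: it returns True if the descriptor exists (domain online),
    False if it does not (domain offline), and raises on a query exception.
    Domains whose query raises are re-queued up to `max_retries` times.
    """
    results = {}
    queue = deque((d, 0) for d in domains)
    while queue:
        domain, tries = queue.popleft()
        try:
            results[domain] = probe(domain)
        except Exception:
            if tries < max_retries:
                queue.append((domain, tries + 1))  # retry later
    return results

def survival_rate(rounds, domain):
    """sr = online_num / total_num over all rounds that produced a result."""
    obs = [r[domain] for r in rounds if domain in r]
    return sum(obs) / len(obs) if obs else 0.0
```

In the full system `probe` would be implemented over a real Tor controller connection; that integration is omitted here.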
3. Hidden service access volume measurement scheme
When a hidden service domain name in the Tor network is accessed, the client must first query an HSDir, so a corresponding modification can be made in the Tor source code to record and count client access requests. This is the general idea of the hidden service access volume measurement method proposed by the invention.
When the client sends the descriptor Id value corresponding to a domain name to the selected HSDir, the cache_lookup_v3_as_dir function in the src/feature/hs/hs_cache.c file of the Tor source code is called to look up whether that descriptor Id value exists in the cache; if it is found, 1 is returned, otherwise 0 is returned. Code can therefore be added to this function to record client access requests. However, the HSDir cannot obtain the hidden service domain name directly: only the blinded public key is visible when collecting access volume, and the blinded public key can be computed offline. The specific measurement flow is as follows:
(1) Calculating all blind public keys in a certain period for each domain name;
(2) Comparing the blind public key result of the offline calculation with blind public key data collected from HSDir to obtain the total access quantity of each v3 domain name;
(3) The average daily access for the hidden service domain name is calculated by dividing the total access for each domain name by the number of days counted.
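The matching-and-counting step of the flow above can be sketched as follows. Deriving the blinded public keys requires the v3 key-blinding computation taken from the Tor source and is not shown; the `blind_key_table` mapping is therefore assumed to have been computed offline, and all names here are illustrative.

```python
from collections import Counter

def daily_average_access(hsdir_log, blind_key_table, days):
    """Map blinded-key request records back to v3 domains and average per day.

    hsdir_log       - iterable of blinded public keys recorded by our HSDir
    blind_key_table - {blinded_key: v3_domain}, computed offline for the
                      measurement period
    days            - number of days in the statistics window
    """
    totals = Counter()
    for key in hsdir_log:
        domain = blind_key_table.get(key)
        if domain is not None:   # ignore keys for domains we did not compute
            totals[domain] += 1
    # daily average = total access volume / number of days counted
    return {d: n / days for d, n in totals.items()}
```

For example, two requests for one domain's blinded key over a two-day window yield a daily average of 1.0 for that domain.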
Fig. 3 shows the overall deployment for hidden service activity detection and access volume measurement.
4. Core site discovery scheme
The core site discovery scheme combines the survival rate and access volume features obtained in sections 2 and 3; the overall algorithm flow is as follows:
(1) Calculating the survival rate and access volume of each group of hidden services: the survival rate of each group is denoted sr_j_i, whose value is the maximum survival rate over all domain names in the group; the access volume of each group is denoted view_j_i, where for websites with declared mirror sites view_j_i is the sum of the access volumes of all domain names in the group, and for websites without declared mirror sites view_j_i is the maximum access volume over all domain names in the group.
(2) Data preprocessing: the access volume of each group of hidden services is normalized. The normalized access volume of each group is denoted view'_j_i.
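Assuming the normalization referred to above is min-max scaling (the original formula is not reproduced in this text, so this choice is an assumption), the preprocessing step can be written as:

```python
def min_max_normalize(views):
    """Min-max normalization: view' = (view - min) / (max - min).

    This is an assumed form of the normalization; a constant input
    (max == min) is mapped to all zeros to avoid division by zero.
    """
    lo, hi = min(views), max(views)
    if hi == lo:
        return [0.0] * len(views)
    return [(v - lo) / (hi - lo) for v in views]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```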
(3) Training a classification model to discover core sites: core site discovery is modeled as a classification problem in machine learning, the preprocessed data are used as classification attributes, and an XGBoost model is used for core site discovery; the overall model training flow is shown in figure 4.
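The three-level grading applied to the classifier's output, with the thresholds given in the disclosure (x ≥ 0.9, 0.75 ≤ x < 0.9, 0.5 ≤ x < 0.75), can be sketched as follows. In the full scheme the probability x would come from the trained XGBoost classifier; the function name and level labels here are illustrative.

```python
def importance_level(x):
    """Map the classifier's discrimination probability x to 3 levels.

    Thresholds follow the disclosure: x >= 0.9 -> most important,
    0.75 <= x < 0.9 -> next most important, 0.5 <= x < 0.75 ->
    least important core site; below 0.5 the page is not a core site.
    """
    if x >= 0.9:
        return 'most important'
    if x >= 0.75:
        return 'next most important'
    if x >= 0.5:
        return 'least important core site'
    return 'not a core site'

print(importance_level(0.92))  # most important
```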
It should be noted that the above embodiments are only for illustrating the technical solution of the present application and not for limiting the scope of protection thereof, and although the present application has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that various changes, modifications or equivalents may be made to the specific embodiments of the application after reading the present application, and these changes, modifications or equivalents are within the scope of protection of the claims appended hereto.