Tor core site discovery method based on hidden service association
Technical Field
The invention belongs to the technical field of anonymous networks (Anonymity Network), and particularly relates to a Tor core site discovery method based on hidden service association.
Background
Because of its strong anonymity, Tor is used by many criminals to conduct illegal transactions and engage in illegal activities such as gun sales, drug sales, and trading of private information; it is also used by some organizations to mount large-scale network attacks. Effective governance of darknet content therefore requires efficient crawlers for Tor darknet content. However, different hidden services in the darknet differ greatly in importance and in the amount of useful information they carry, so a whole-network crawler misses much valuable information and the quality of the collected hidden service data is low. In addition, a large number of darknet domain names serve nearly identical site content, i.e. the web pages of different domain names are basically the same; this causes crawlers to spend a great deal of analysis, storage, and computation on duplicated site content, which severely restricts detection and coverage of the darknet space. Core site discovery for the Tor darknet is therefore necessary.
Disclosure of Invention
Aiming at the problems that the Tor darknet contains a large amount of illegal content in urgent need of governance and that the quality of currently acquired darknet data is low, the invention provides a Tor core site discovery method based on hidden service association.
The invention adopts the following technical scheme:
A method for discovering a Tor core site based on hidden service association, the method comprising the steps of:
(1) Designing a hidden service association algorithm based on page structure and content, aimed at Web sites with similar content but different domain names;
(2) Calculating the survival rate of each hidden service, namely indirectly judging whether the hidden service is online through whether its descriptor exists, and taking the survival rate as one of the features for core site judgment;
(3) Measuring hidden service access volume, namely collecting the requests for hidden service blinded public keys by deploying a hidden service directory server (HSDir), and then analyzing and comparing the collected data to calculate the access volume of each hidden service;
(4) Tor core site discovery, namely analyzing the hidden services in each group clustered in (1) using the survival rate and access volume obtained in (2) and (3), and identifying the core sites.
Further, the step (1) specifically includes:
(11) Clustering by the redirection links in the Response Header, wherein some domain names, when accessed, return a 301 status code and are automatically redirected to another page, and the Location field in the Response Header gives the redirected page's domain name, so the domain name and the redirection domain name are clustered into one group;
(12) Clustering by meaningful titles, wherein the titles of default Web server pages (such as those of Apache and Nginx) in the darknet, including 'Index of /', 'Apache2 Debian Default Page', and '401 Authorization Required', are defined as meaningless; the sites with meaningless titles and the sites without title information are each grouped separately, and sites with meaningful title information and identical title text are grouped together;
(13) Clustering by combining the HTML DOM tree, CSS style, and page keywords, namely extracting one page from each set of meaningful titles, computing each page's DOM tree structure, class attribute values, id attribute values, and top 20 keywords, and comparing the DOM tree structure similarity, class/id attribute value similarity, and page keyword similarity of each pair of pages with a similarity algorithm.
Further, the step (2) specifically includes:
(21) Reading the domain name of the hidden service survival rate to be calculated from the database;
(22) Deploying a plurality of Tor processes, with the client sending descriptor query requests to the hidden service directory server through the Tor control protocol, so that multiple processes execute concurrently;
(23) If the query completes without exception, judging from the returned information whether the descriptor exists and saving the result, wherein if the descriptor exists the domain name is considered online, and if it does not exist the domain name is considered offline;
(24) If the descriptor query raises an exception and the domain name has been queried no more than 5 times, putting the domain name back into the queue for a later retry and returning to step (22);
(25) Storing the detection results according to the returned information, for use in calculating the hidden service survival rate.
Further, the step (3) specifically includes:
(31) Calculating all blind public keys in a certain period for each v3 domain name;
(32) Comparing the off-line calculated blind public key result with the blind public key data collected from the hidden service directory server to obtain the total access quantity of each v3 domain name;
(33) The daily average access amount of the hidden service v3 domain name is calculated by dividing the total access amount of each v3 domain name by the statistical days.
Further, the step (4) specifically includes:
(41) For each group of clusters in (1), the survival rate sr_j_i of the group is calculated as the maximum survival rate over all domain names in the group, where each domain name's survival rate is expressed as follows:

sr = online_num / total_num

wherein online_num is the number of probes in which the domain name was measured online, and total_num is the total number of probes;
(42) For each group of clusters in (1), an access volume view_j_i is calculated for the group: for websites with declared mirror sites, view_j_i is the sum of the access volumes of all domain names in the group, and for websites without declared mirror sites, view_j_i is the maximum access volume over all domain names in the group;
(43) Modeling core site discovery as a classification problem in machine learning, taking the access volume, survival rate, number of similar pages, and access degree as classification attributes, and using an XGBoost model to discover core sites;
(44) For pages classified as core sites, the classifier's discrimination probability x is also computed, and based on this probability the identified core sites are further divided into 3 importance levels: pages with x ≥ 0.9 are regarded as the most important core sites, pages with 0.75 ≤ x < 0.9 as the next most important, and pages with 0.5 ≤ x < 0.75 as the least important core sites.
Compared with the prior art, the invention has the remarkable advantages that:
1. The hidden service detection efficiency is improved. From the moment the Tor client sends a request until the hidden service is reached, the full process passes through 15 onion router hops; with the detection method of the invention, only 3 onion router hops are needed, so hidden service detection efficiency is significantly improved.
2. The traditional scheme of deploying hidden service directory servers to collect access volume is based on the Tor v2 protocol; the present method instead computes the blinded public keys of v3 domain names offline by extracting the relevant code from the Tor source, and obtains the Tor v3 hidden service access volume through analysis and comparison.
3. Existing importance rankings of Tor hidden services do not consider the characteristics of the Tor protocol. The invention combines core site discovery with Tor protocol characteristics, including hidden service survival rate and access volume, so that hidden service core sites can be discovered more effectively.
Drawings
FIG. 1 is a schematic diagram of the comprehensive cluster analysis algorithm of the present invention.
Fig. 2 is a hidden service probe flow chart of the present invention.
FIG. 3 is a system deployment diagram of the hidden service probe activity and access volume measurement of the present invention.
FIG. 4 is a flow chart of model training for core site discovery of the present invention.
Detailed Description
The invention designs and implements a Tor core site discovery technique based on hidden service association to discover core sites in the darknet. The method comprises hidden service association, hidden service activity detection, hidden service access volume measurement, and a core site discovery scheme, as follows:
1. Hidden service association
The hidden service association algorithm comprises three steps: clustering by the redirection links in the Response Header, clustering by meaningful titles, and clustering by combining the HTML DOM tree, CSS style, page keywords, and the like.
Step one clusters by the redirection links in the Response Header: the Location field in the Response Header gives the redirected page's domain name, so the domain name and its redirection domain name are clustered into one group.
Step two clusters by meaningful titles: the invention regards the titles of default Web server pages (such as those of Apache and Nginx), e.g. 'Index of /', 'Apache2 Debian Default Page', and '401 Authorization Required', as meaningless. On the basis of step one (the invention takes the title of a group successfully clustered in step one to be the title of the redirected domain name), the sites with meaningless titles and the sites without title information are each grouped separately, and sites with meaningful title information and identical title text are grouped together.
Step three clusters by combining content such as the HTML DOM tree, CSS style, and page keywords: the invention extracts one page from each set of meaningful titles, computes each page's DOM tree structure, class attribute values, id attribute values, and top 20 keywords, and compares the DOM tree structure similarity, class/id attribute value similarity, and page keyword similarity of each pair of pages with a similarity algorithm; the overall flow is shown in figure 1. Specifically, the DOM tree similarity of each pair of pages is calculated with a sequence comparison method and denoted similarity 1; the similarity of the class and id attribute values of each pair of page documents is calculated with the Jaccard coefficient (Jaccard similarity coefficient) and denoted similarity 2; and the similarity of the keyword information of each pair of pages is calculated with the same coefficient and denoted similarity 3. The three similarities are combined to decide whether two pages should be clustered into one group.
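The three similarity computations described above can be sketched as follows. This is a minimal illustration, assuming each page has already been parsed into a flattened DOM tag sequence, a set of class/id attribute values, and a top-20 keyword list; the function names, the dictionary layout, and the equal weighting of the three similarities are illustrative assumptions, not taken from the original.

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Jaccard similarity coefficient of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def page_similarity(p1, p2):
    """Combine DOM-structure, class/id-attribute, and keyword similarity.

    Each page is a dict with keys:
      'dom'      - flattened DOM tag sequence, e.g. ['html', 'head', ...]
      'attrs'    - set of class and id attribute values
      'keywords' - top-20 keyword list extracted from the page text
    """
    # similarity 1: DOM tree structure, via sequence comparison
    sim1 = SequenceMatcher(None, p1['dom'], p2['dom']).ratio()
    # similarity 2: class/id attribute values, via Jaccard coefficient
    sim2 = jaccard(p1['attrs'], p2['attrs'])
    # similarity 3: page keywords, via Jaccard coefficient
    sim3 = jaccard(p1['keywords'], p2['keywords'])
    # equal weighting of the three similarities is an assumption here
    return (sim1 + sim2 + sim3) / 3

a = {'dom': ['html', 'head', 'body', 'div'], 'attrs': {'nav', 'main'},
     'keywords': ['market', 'btc']}
b = {'dom': ['html', 'head', 'body', 'div'], 'attrs': {'nav', 'main'},
     'keywords': ['market', 'btc']}
print(page_similarity(a, b))  # identical pages -> 1.0
```

In practice the combined score would be compared against a clustering threshold; the threshold value is not specified in the text.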
2. Hidden service activity detection scheme
In this scheme, whether a hidden service is online is judged indirectly by whether its descriptor exists. By analyzing the Tor protocol, it is found that before communicating with a hidden service, the client must query the hidden service directory server for the hidden service descriptor, and the directory server's response falls into three cases:
(1) Query success: the descriptor exists and is returned successfully;
(2) Query failure: the descriptor does not exist;
(3) Query exception: no descriptor information is returned for some reason, including query timeout, the hidden service directory server rejecting the request, and so on.
Each hidden service will send its own descriptor to the hidden service directory server periodically (no more than two hours), and the hidden service directory server will also clear the expiration descriptor periodically, so whether the hidden service is online can be determined indirectly by whether the hidden service's descriptor is present.
The whole activity detection flow is shown in fig. 2, and the specific steps are as follows:
(1) Reading the domain name to be tested from the database;
(2) Deploying a plurality of Tor processes, with the client sending descriptor query requests to the hidden service directory server through the Tor control protocol, so that multiple processes execute concurrently;
(3) If the query completes without exception, judging from the returned information whether the descriptor exists and saving the result, wherein if the descriptor exists the domain name is considered online, and if it does not exist the domain name is considered offline;
(4) If the descriptor query raises an exception and the domain name has been queried no more than 5 times, putting the domain name back into the queue for a later retry and returning to step (2);
(5) Storing the hidden service online detection results according to the returned information.
For each hidden service, the survival rate sr can be expressed as:

sr = online_num / total_num

wherein online_num is the number of probes in which the domain name was measured online, and total_num is the total number of probes.
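The probing loop with exception re-queuing and the survival-rate computation can be sketched as follows. The `probe` callable is a stand-in for the real descriptor query sent over the Tor control protocol (an assumption of this sketch); here it is injected so the logic can be shown in isolation.

```python
from collections import deque

def probe_round(domains, probe, max_retries=5):
    """Run one detection round over all domain names.

    `probe(domain)` stands in for a descriptor query over the Tor control
    protocol: it returns True if the descriptor exists (domain online),
    False if it does not (domain offline), and raises on a query exception.
    Domains whose query raises are re-queued up to `max_retries` times.
    """
    results = {}
    queue = deque((d, 0) for d in domains)
    while queue:
        domain, tries = queue.popleft()
        try:
            results[domain] = probe(domain)
        except Exception:
            if tries < max_retries:
                queue.append((domain, tries + 1))  # retry later
    return results

def survival_rate(rounds, domain):
    """sr = online_num / total_num over all rounds that produced a result."""
    obs = [r[domain] for r in rounds if domain in r]
    return sum(obs) / len(obs) if obs else 0.0
```

In the full system `probe` would be implemented over a real Tor controller connection; that integration is omitted here.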
3. Hidden service access volume measurement scheme
When a hidden service domain name in the Tor network is accessed, the client must first query an HSDir, so a corresponding modification can be made in the Tor source code to record and count client access requests. This is the general idea of the hidden service access volume measurement method proposed by the invention.
When the client sends the descriptor Id value corresponding to a domain name to the selected HSDir, the cache_lookup_v3_as_dir function in the src/feature/hs/hs_cache.c file of the Tor source code is called to look up whether that descriptor Id value exists in the cache; if it is found, 1 is returned, otherwise 0 is returned. Code can therefore be added to this function to record client access requests. However, the HSDir cannot obtain the hidden service domain name directly: only the blinded public key is visible when collecting access volume, and the blinded public key can be computed offline. The specific measurement flow is as follows:
(1) Calculating all blind public keys in a certain period for each domain name;
(2) Comparing the blind public key result of the offline calculation with blind public key data collected from HSDir to obtain the total access quantity of each v3 domain name;
(3) The average daily access for the hidden service domain name is calculated by dividing the total access for each domain name by the number of days counted.
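The matching-and-counting step of the flow above can be sketched as follows. Deriving the blinded public keys requires the v3 key-blinding computation taken from the Tor source and is not shown; the `blind_key_table` mapping is therefore assumed to have been computed offline, and all names here are illustrative.

```python
from collections import Counter

def daily_average_access(hsdir_log, blind_key_table, days):
    """Map blinded-key request records back to v3 domains and average per day.

    hsdir_log       - iterable of blinded public keys recorded by our HSDir
    blind_key_table - {blinded_key: v3_domain}, computed offline for the
                      measurement period
    days            - number of days in the statistics window
    """
    totals = Counter()
    for key in hsdir_log:
        domain = blind_key_table.get(key)
        if domain is not None:   # ignore keys for domains we did not compute
            totals[domain] += 1
    # daily average = total access volume / number of days counted
    return {d: n / days for d, n in totals.items()}
```

For example, two requests for one domain's blinded key over a two-day window yield a daily average of 1.0 for that domain.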
Fig. 3 shows the overall deployment for hidden service activity detection and access volume measurement.
4. Core site discovery scheme
The core site discovery scheme combines the survival rate and access volume features obtained in sections 2 and 3; the overall algorithm flow is as follows:
(1) Calculating the survival rate and access volume of each group of hidden services: the survival rate of each group is denoted sr_j_i, whose value is the maximum survival rate over all domain names in the group; the access volume of each group is denoted view_j_i, where for websites with declared mirror sites view_j_i is the sum of the access volumes of all domain names in the group, and for websites without declared mirror sites view_j_i is the maximum access volume over all domain names in the group.
(2) Data preprocessing: the access volume of each group of hidden services is normalized. The normalized access volume of each group is denoted view'_j_i.
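Assuming the normalization referred to above is min-max scaling (the original formula is not reproduced in this text, so this choice is an assumption), the preprocessing step can be written as:

```python
def min_max_normalize(views):
    """Min-max normalization: view' = (view - min) / (max - min).

    This is an assumed form of the normalization; a constant input
    (max == min) is mapped to all zeros to avoid division by zero.
    """
    lo, hi = min(views), max(views)
    if hi == lo:
        return [0.0] * len(views)
    return [(v - lo) / (hi - lo) for v in views]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```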
(3) Training a classification model to discover core sites: core site discovery is modeled as a classification problem in machine learning, the preprocessed data are used as classification attributes, and an XGBoost model is used for core site discovery; the overall model training flow is shown in figure 4.
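The three-level grading applied to the classifier's output, with the thresholds given in the disclosure (x ≥ 0.9, 0.75 ≤ x < 0.9, 0.5 ≤ x < 0.75), can be sketched as follows. In the full scheme the probability x would come from the trained XGBoost classifier; the function name and level labels here are illustrative.

```python
def importance_level(x):
    """Map the classifier's discrimination probability x to 3 levels.

    Thresholds follow the disclosure: x >= 0.9 -> most important,
    0.75 <= x < 0.9 -> next most important, 0.5 <= x < 0.75 ->
    least important core site; below 0.5 the page is not a core site.
    """
    if x >= 0.9:
        return 'most important'
    if x >= 0.75:
        return 'next most important'
    if x >= 0.5:
        return 'least important core site'
    return 'not a core site'

print(importance_level(0.92))  # most important
```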
It should be noted that the above embodiments are only for illustrating the technical solution of the present application and not for limiting the scope of protection thereof, and although the present application has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that various changes, modifications or equivalents may be made to the specific embodiments of the application after reading the present application, and these changes, modifications or equivalents are within the scope of protection of the claims appended hereto.