
CN112764882B - Docker-based onion address and hidden service content collection method - Google Patents


Info

Publication number
CN112764882B
Authority
CN
China
Prior art keywords
onion
hidden service
addresses
home page
hidden
Prior art date
Legal status
Active
Application number
CN202110085622.9A
Other languages
Chinese (zh)
Other versions
CN112764882A (en)
Inventor
杨力
应世睿
张岩
贾竣博
李茜
秦文静
马卓茹
王志鑫
李成
李江涛
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110085622.9A
Publication of CN112764882A
Application granted
Publication of CN112764882B
Legal status: Active


Classifications

    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/35 Clustering; Classification
    • G06F 16/93 Document management systems
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods
    • Y02D 30/50 Reducing energy consumption in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Transfer Between Computers (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a Docker-based onion address and hidden service content collection method, which mainly addresses the long collection time and low collection speed of the prior art. The scheme is as follows: 1) obtain a small number of onion addresses by two different methods, and collect the hidden service home pages corresponding to these addresses; 2) construct a deep neural network, train it on a training set to obtain a classifier, and compute the probability that each hidden service home page belongs to each category; 3) from this probability and the number of onion addresses in the hidden service home page, predict the number of onion addresses embedded in all pages of the corresponding website, and set the number of collection images accordingly; 4) start containers from the images to collect the onion addresses and content embedded in all pages of the website corresponding to the home page; 5) repeat 1) to 4) to complete the collection of Tor onion addresses and hidden service content. The method consumes less time and collects data faster, and can be used to monitor the onion routing network Tor.

Description

Docker-based onion address and hidden service content collection method

Technical Field

The invention belongs to the field of computer technology and more particularly relates to a method for collecting onion addresses and hidden service content, which can be used to monitor the onion routing network Tor and to perceive trends in the hidden service content of the network.

Background Art

In recent years, with the development of the Internet and the spread of computing devices, the total amount of information on the Internet has reached an unprecedented scale. According to statistics, as of December 2020 the Internet contained more than 1.8 billion web pages, with average daily traffic of nearly 6 EB. The Internet can be divided into the surface web and the corresponding deep web. The surface web comprises websites that can be indexed by standard search engines, while the deep web comprises websites that are not. Surface web content accounts for only about 10% of all content on the Internet; the rest belongs to the deep web, which resembles the surface web in that it includes common services such as e-mail, online banking, and social forums, but which requires a password or granted permission for access. Anonymous networks in the deep web such as Tor, I2P, and Freenet require special software, special configuration, and authentication, or use their own communication protocols, and therefore offer strong anonymity. Such networks are easily abused for illegal activity and are rife with sensitive content of all kinds.

Among the various anonymous networks, Tor is the most widely used. As of December 2020, Tor had about 7,000 active nodes, about 2.5 million active users, and more than 170,000 hidden services. Tor's anonymity relies on relay nodes voluntarily provided by volunteers and related organizations, together with a unique communication mechanism. When a user accesses Tor, the client typically uses three Tor relay nodes as communication proxies and negotiates a session key with each of them in turn using public-key encryption. During communication, the content is first encrypted in three layers at the sending end and then passed through the three selected relay nodes to the receiving end. When the encrypted content reaches a node, the node removes one layer of encryption with the negotiated key, learns the IP address of the next relay node, and forwards the content. Throughout the process, each relay node knows only the IP addresses of its adjacent nodes and can never learn the IP addresses of both the sender and the receiver at once, thereby achieving anonymous transmission of the content.
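The layered encryption described above can be illustrated with a toy sketch. This is not Tor's actual cryptography (Tor negotiates per-hop keys and uses a real cipher); the SHA-256-based XOR keystream below is only a stand-in to show how each relay peels exactly one layer:

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudo-random keystream from a per-hop key (toy stand-in for a real cipher)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def apply_layer(cell: bytes, key: bytes) -> bytes:
    """XOR with the keystream: applying it once adds a layer, applying it again removes it."""
    return bytes(a ^ b for a, b in zip(cell, keystream(key, len(cell))))

# session keys negotiated with the guard, middle, and exit relay, in path order
hop_keys = [b"guard-key", b"middle-key", b"exit-key"]

# the sender wraps the payload in three layers, innermost layer for the last hop
cell = b"request for the hidden service"
for key in reversed(hop_keys):
    cell = apply_layer(cell, key)

# each relay peels exactly one layer and forwards the remainder
for key in hop_keys:
    cell = apply_layer(cell, key)

print(cell)  # the last relay recovers the original payload
```

Because each relay holds only one of the three keys, it can remove only its own layer, which is the property that hides the full path from any single node.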

Most illegal content in Tor is provided through hidden services. The operation of a hidden service and a user's access to it are built on the three-hop communication mechanism, with a more elaborate design. The implementation of a hidden service involves six important components: the hidden server, the directory server, the hidden service directory server, the client, the introduction nodes, and the session node.

Hidden server: provides hidden services in Tor, such as web services and instant messaging. When a hidden service is created, it generates a public key and other related information.

Directory server: stores information about relay nodes such as IP address, bandwidth, flags, and fingerprint. After the Tor client starts, it sends a request to a directory server, obtains information about a subset of the relay nodes, randomly selects qualified nodes from it, and establishes a communication circuit.

Hidden service directory server: stores the hidden service descriptor, which contains the introduction nodes selected by the hidden service, its public key, and other information, and responds to access requests for specific hidden services sent from clients. A relay node must provide sufficient bandwidth and run continuously for 96 hours before it can be designated a hidden service directory server by the authoritative directory servers. The hidden service directory server functions much like a DNS server on the surface web.

Client: the program through which users access Tor. During connection, the client selects and connects to the three-hop relay nodes using the information provided by the directory servers.

Introduction nodes: a hidden service usually selects multiple introduction nodes through which to offer the service, and the hidden server connects to them through the three-hop mechanism. To access a hidden service, a client first obtains the introduction node information from a hidden service directory server through the three-hop mechanism, then contacts an introduction node, again through the three-hop mechanism, to exchange information about the session node.

Session node: the key communication node negotiated between the client and the hidden server. With the session node as the central node, the client and the hidden server each establish a three-hop circuit, forming a six-hop communication path over which they communicate.

A hidden service is accessed via its onion address. The onion address is formed by computing the secure hash algorithm SHA-1 over the hidden service's 1024-bit RSA public key, taking the first 80 bits of the digest, base32-encoding them to obtain a 16-character string, and appending the .onion suffix. The working process of a hidden service is as follows:

1. The hidden service selects 3 relay nodes as its introduction nodes and provides the service through them; the hidden server connects to each introduction node through a 3-hop relay circuit.

2. The hidden server uploads its hidden service descriptor, containing the selected introduction nodes and public key information, to 6 hidden service directory servers.

3. To access the hidden service, the client sends requests to the 6 hidden service directory servers, obtains the hidden service descriptor, and learns the introduction node information.

4. The client selects a session node and connects to it through a three-hop relay circuit.

5. The client connects to an introduction node through a three-hop relay circuit and sends it the session node information.

6. If the hidden server can provide the service, it establishes a three-hop connection to the session node and authenticates the link.

7. The client and the hidden server communicate over the resulting 6-hop link through the session node.

A user must know a hidden service's onion address in order to access it. Onion addresses are published only in private forums and within hidden service content, and only a small fraction are indexed by surface web or deep web search engines, so they are difficult to obtain at scale. Moreover, the multi-hop communication mechanism of Tor hidden services makes access slow, which limits large-scale, rapid acquisition of onion addresses and hidden service content. How to acquire this data effectively and quickly has therefore become a research hotspot in the Tor field.

Song Shengnan of Beijing Jiaotong University proposed an onion address and hidden service content collection method in a master's thesis, which deploys the collection program on rented overseas servers. Because such servers are generally low-spec, large-scale rapid collection requires purchasing and deploying many servers, which is expensive; after collection finishes, a large number of files must be downloaded from the servers, adding extra time. Furthermore, because the collection program runs directly on the server with a low degree of concurrency, the computing power of the rented servers cannot be fully utilized, wasting resources.

The hidden service content acquisition method proposed by Ding Xiang of Southeast University in a master's thesis uses multiple threads for concurrent collection. Although this reduces collection time to some extent, improves concurrency, and allows the acquisition speed to be increased by scaling up the crawlers and proxies, doing so requires modifying configuration files and project code, which is cumbersome and prevents rapid scaling.

Both of the above existing methods for obtaining onion addresses and hidden service content use the file system of a single host without file system isolation between projects, so their robustness is low. In addition, upgrading project dependencies is troublesome: each dependency must be updated manually, and conflicts between multiple dependency versions may arise. Although these projects run for a long time, they obtain relatively few onion addresses and little hidden service content, so their efficiency is low.

Summary of the Invention

The purpose of the present invention is to address the above deficiencies of the prior art by proposing an efficient Tor onion address and hidden service content collection method based on Docker virtualization technology, so as to collect more onion addresses in a shorter time, reduce the time needed to acquire hidden service content, and improve the efficiency of obtaining onion addresses and hidden service content.

To achieve the above objective, the technical scheme of the present invention is: 1) use two different methods to obtain a small number of onion addresses, and collect the hidden service home pages corresponding to these addresses; 2) construct a deep neural network and train it on a training set to obtain a classifier that computes the probability that a hidden service home page belongs to each category; 3) from this probability and the number of onion addresses in the hidden service home page, predict the number of onion addresses embedded in all pages of the corresponding website, and set the number of collection images; 4) start containers from the images and collect the onion addresses and content embedded in all pages of the website corresponding to the home page; 5) repeat 1) to 4) to complete the collection of Tor onion addresses and hidden service content. The specific implementation steps are as follows:

(1) Obtain onion addresses:

1a) Search for specific keywords with deep web and surface web search engines, and extract the onion addresses from the search results to obtain the onion addresses Z1 that are related to the keywords and indexed by the search engines;

1b) Deploy a relay node in Tor and let it become a hidden service directory server; by modifying the source code, apply to the hidden service public keys stored on the server the hash computation and encoding needed to generate onion addresses, obtaining the onion addresses Z2 corresponding to these public keys;

Store the onion addresses obtained by the above two methods in a file;

(2) Collect the hidden service home pages corresponding to all onion addresses obtained by the two methods in (1) to obtain the set of hidden service home pages to be classified;

(3) Preprocess the darknet address text dataset DUTA by file deduplication, data cleaning, text tokenization, and vectorization in sequence, and construct a multi-class text classifier:

(3a) Perform file deduplication, data cleaning, text tokenization, and vectorization in sequence on the darknet address text dataset DUTA, which belongs to C categories and is already labeled, to form the training set word vectors;
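The four preprocessing stages can be sketched as follows. The function name, the HTML-stripping regex, and the whitespace tokenizer are illustrative assumptions, not the patent's exact pipeline:

```python
import re
from collections import Counter

def preprocess(documents):
    """Deduplicate, clean, tokenize, and vectorize a list of raw page texts."""
    # 1) file deduplication: drop exact duplicates, keeping first occurrences
    unique_docs = list(dict.fromkeys(documents))
    # 2) data cleaning: strip HTML tags, lowercase, keep letters only (assumed rules)
    cleaned = [re.sub(r"[^a-z ]+", " ", re.sub(r"<[^>]*>", " ", d).lower())
               for d in unique_docs]
    # 3) tokenization: simple whitespace split
    tokenized = [c.split() for c in cleaned]
    # 4) vectorization: map each token to an integer index (0 reserved for padding)
    counts = Counter(tok for doc in tokenized for tok in doc)
    vocab = {tok: idx + 1 for idx, (tok, _) in enumerate(counts.most_common())}
    vectors = [[vocab[tok] for tok in doc] for doc in tokenized]
    return vectors, vocab

vectors, vocab = preprocess([
    "<html><body>Hidden wiki directory</body></html>",
    "<html><body>Hidden wiki directory</body></html>",  # exact duplicate, removed
    "<p>Market listing page</p>",
])
print(len(vectors))  # 2 documents remain after deduplication
```

The resulting integer sequences are what the embedding layer of the network in (3b) consumes.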

(3b) Set the parameters of a deep neural network consisting of one embedding layer, two one-dimensional convolutional layers, two one-dimensional pooling layers, one fully connected layer, and one softmax layer; use the training set word vectors obtained from the DUTA dataset after the preprocessing in (3a) as training data, and feed them into the deep neural network for training to obtain the multi-class text classifier;

(4) Perform file deduplication, data cleaning, text tokenization, and vectorization on each hidden service home page of unknown category in the set obtained in (2) to produce the test set word vectors; feed the test set word vectors into the trained multi-class text classifier to obtain the probability Pi, 1 ≤ i ≤ C, that each preprocessed hidden service home page belongs to the i-th of the C categories, and at the same time count the number N of onion addresses in each home page;

(5) Set the threshold for the number of onion addresses embedded in a hidden service home page to T = 100, and set the threshold for the number of onion addresses contained in all pages of the home page's website to K = 500; compare the number N of onion addresses in each hidden service home page with the threshold T, or examine its probabilities P:

If the number of onion addresses in a hidden service home page is N ≥ T, or the probabilities Pi, 1 ≤ i ≤ C, of the home page belonging to each category are close to one another, the home page probably contains content of multiple categories and the corresponding website is a directory-type site; predict that the number M of onion addresses contained in all pages of the corresponding website is greater than K, and go to (6);

If the number of onion addresses in a hidden service home page is N < T, or the probability Pj of the home page belonging to the j-th of the C categories is much greater than its probability of belonging to any other category, the home page probably belongs only to the j-th category; predict that the number M of onion addresses contained in all pages of the corresponding website is less than the threshold K, and go to (7);

(6) Set up D = 50 Docker images for hidden service content collection, start them to generate 50 containers, run the prepared hidden service content collection code in these containers, and collect the hidden service content corresponding to the onion addresses embedded in all pages of each home page's website;

(7) Set up D = 20 Docker images for hidden service content collection, start them to generate 20 containers, run the prepared hidden service content collection code in these containers, and collect the hidden service content corresponding to the onion addresses embedded in all pages of each home page's website.
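The choice between steps (6) and (7) can be sketched as follows. The "probabilities are close" test is implemented here as the gap between the two largest Pi falling below an assumed margin of 0.2, which the patent does not specify; the image name in the generated commands is likewise a placeholder:

```python
def docker_image_count(n_onion_addresses, probs, T=100, margin=0.2):
    """Return the number D of collection images: 50 for directory-type sites, else 20."""
    top_two = sorted(probs, reverse=True)[:2]
    probabilities_close = (top_two[0] - top_two[1]) < margin
    directory_like = n_onion_addresses >= T or probabilities_close
    return 50 if directory_like else 20

def docker_run_commands(image, count):
    """Build the `docker run` commands that would start `count` collection containers."""
    return [f"docker run -d --name crawler-{i} {image}" for i in range(count)]

# home page with many embedded addresses and no dominant class -> directory type, D = 50
d = docker_image_count(120, [0.3, 0.35, 0.35])
cmds = docker_run_commands("hs-crawler:latest", d)
print(d, len(cmds))
```

Sizing the container pool before launching is what keeps the per-container workloads, and hence their finishing times, roughly balanced.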

(8) Count the total number W of hidden service content items collected in (6) and (7), parse and extract all embedded onion addresses Z3 from this content, and add these addresses to those obtained in (1a) and (1b), giving a total of Z = Z1 + Z2 + Z3 onion addresses;
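Extracting the embedded addresses Z3 from collected page content can be sketched with a regular expression for version-2 onion addresses (16 base32 characters, matching the address format described earlier; v3 addresses are 56 characters and would need a second pattern). The sample pages below are illustrative:

```python
import re

# a v2 onion address is 16 base32 characters (a-z, 2-7) followed by ".onion"
ONION_RE = re.compile(r"\b([a-z2-7]{16}\.onion)\b")

def extract_onion_addresses(pages):
    """Collect the set of distinct v2 onion addresses embedded in a list of page texts."""
    found = set()
    for page in pages:
        found.update(ONION_RE.findall(page))
    return found

pages = [
    '<a href="http://expyuzz4wqqyqhjn.onion/about">Tor Project</a>',
    "mirror: expyuzz4wqqyqhjn.onion and duskgytldkxiuqc6.onion",
]
print(sorted(extract_onion_addresses(pages)))
```

Using a set automatically deduplicates addresses that appear on multiple pages before they are added to Z.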

(9) According to actual needs, set the target number of onion addresses to X and the target number of hidden service content items to Y, and compare X with Z and Y with W:

If X > Z or Y > W, the number of onion addresses Z or hidden service content items W obtained so far has not reached the set target; repeat (1)-(8) until both Z and W reach the set targets;

If both X ≤ Z and Y ≤ W hold, the numbers of onion addresses Z and hidden service content items W obtained have reached the set targets, and data collection stops.
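The stopping rule in step (9) reduces to a single predicate, sketched here with illustrative numbers:

```python
def collection_complete(z_addresses, w_contents, x_target, y_target):
    """Stop only when both the address target X and the content target Y are met."""
    return z_addresses >= x_target and w_contents >= y_target

# another round of steps (1)-(8) is needed while either total is below its target
print(collection_complete(120_000, 80_000, 100_000, 75_000))  # True
print(collection_complete(90_000, 80_000, 100_000, 75_000))   # False
```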

Compared with the prior art, the present invention has the following advantages:

First, it facilitates rapid deployment when migrating from the development environment to the production environment.

Because the invention uses Docker to collect hidden service content, it avoids the tedious environment-setup steps of existing methods: one only needs to download the relevant dependency images, configure them, and package them into a project image, which is then deployed directly in the production environment, avoiding the tedious steps of migrating from development to production.

Second, the containers finish their data collection at similar times, further shortening the overall collection time.

By predicting the probability that the hidden service home page corresponding to an onion address belongs to each defined category and counting the number of onion addresses in each home page, the invention predicts how many onion addresses the pages of the corresponding website contain and sets the number of Docker images accordingly. This makes the time for each container to finish collecting the onion addresses of its pages roughly equal, prevents a container assigned a website with too many onion addresses from dragging out the collection process, and thus improves the efficiency of collecting onion addresses and hidden service content.

Brief Description of the Drawings

Fig. 1 is the overall flow chart of an implementation of the present invention;

Fig. 2 is a sub-flow chart of acquiring hidden service content in the present invention.

Detailed Description

下面结合附图对本发明的实施例做进一步的详细描述。The embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

参照图1,对本发明的实现步骤如下:Referring to Fig. 1, the implementation steps of the present invention are as follows:

步骤1,通过两种方法搜集洋葱地址。Step 1. Collect onion addresses through two methods.

由于使用Tor需要使用专用软件并进行配置，为了使用户可以更为便捷的使用Tor服务，Tor2web于2008年由志愿者建立，项目在牺牲部分匿名性的前提下，为用户提供特定的关键词，如.sh,.guide,.city，用户需要在被项目代理的洋葱地址后拼接关键词，即可使用普通浏览器访问该洋葱地址对应的隐藏服务网站，具体实现如下：Since using Tor requires dedicated software and configuration, Tor2web was established by volunteers in 2008 to let users access the Tor service more conveniently. At the cost of some anonymity, the project provides users with specific keywords such as .sh, .guide, and .city; by appending one of these keywords to an onion address proxied by the project, a user can visit the hidden service website behind that onion address with an ordinary browser. The specific implementation is as follows:

1.1)利用深层网络搜索引擎和表层网络搜索引擎对特定的关键词分别进行搜索，对搜索结果中的洋葱地址进行提取，得到与该关键词有关且被搜索引擎索引的洋葱地址Z1；1.1) Use a deep web search engine and a surface web search engine to search for the specific keywords, extract the onion addresses from the search results, and obtain the onion addresses Z1 that are related to the keywords and indexed by the search engines;

1.2)通过修改Tor源码的方式，对服务器中的隐藏服务公钥进行安全散列算法1即SHA 1计算，对计算结果的前80bits进行base 32编码得到长度为16bytes的字符串，在字符串后拼接.onion得到洋葱地址，这种方法虽然较为复杂耗时较长，然而是一种不可或缺的洋葱地址收集方式，可以得到未被搜索引擎或者其他隐藏服务内容索引的洋葱地址Z2；1.2) By modifying the Tor source code, compute the Secure Hash Algorithm 1 (SHA-1) digest of the hidden service public key held by the server, base32-encode the first 80 bits of the digest to obtain a 16-byte string, and append .onion to that string to obtain the onion address. Although this method is comparatively complex and time-consuming, it is an indispensable way of collecting onion addresses, because it yields onion addresses Z2 that are indexed neither by search engines nor by other hidden service content;

将上述两种方法得到的洋葱地址存储为文件。Store the onion addresses obtained by the above two methods as a file.
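The hash-and-encode derivation described in 1.2) can be sketched in Python. This is a minimal illustration of the v2-style onion-address scheme (SHA-1 over the public key bytes, first 80 bits, base32), not the patent's actual modification of the Tor source:

```python
import base64
import hashlib

def onion_address_from_pubkey(pubkey_der: bytes) -> str:
    """Derive an onion address from a hidden service public key:
    SHA-1 the key material, keep the first 80 bits (10 bytes), base32-encode
    them into a 16-character string, and append the .onion suffix."""
    digest = hashlib.sha1(pubkey_der).digest()
    label = base64.b32encode(digest[:10]).decode("ascii").lower()
    return label + ".onion"
```

Because 10 bytes is exactly 80 bits, the base32 label always comes out as 16 characters with no padding, matching the 16-character string mentioned in 1.2).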

本实例在Tor匿名网络中布设中继节点，提供200KB/S以上的带宽，获取快速Fast和稳定Stable标志并持续运行96小时之后即可获得隐藏服务目录服务器标志。In this example, a relay node is deployed in the Tor anonymous network and provides more than 200KB/s of bandwidth; after it obtains the Fast and Stable flags and runs continuously for 96 hours, it acquires the hidden service directory server flag.

步骤2,编写隐藏服务首页搜集代码,获取依赖镜像,将隐藏服务首页搜集代码和依赖镜像进行整合打包,形成项目镜像。Step 2: Write the homepage collection code of the hidden service, obtain the dependent image, and integrate and package the hidden service homepage collection code and the dependent image to form a project image.

2.1)编写隐藏服务首页搜集代码;2.1) Write hidden service homepage collection code;

2.2)分别获取Ubuntu操作系统镜像和Tor镜像,并对Tor镜像进行配置,设置网络代理;2.2) Obtain the Ubuntu operating system image and the Tor image respectively, configure the Tor image, and set the network proxy;

2.3)将隐藏服务首页搜集代码与上述Ubuntu操作系统镜像和Tor镜像这两个镜像进行整合打包,形成进行隐藏服务首页搜集的项目镜像;2.3) Integrate and package the hidden service home page collection code with the above-mentioned Ubuntu operating system image and Tor image, to form a project image for the hidden service home page collection;

2.4)创建Docker配置文件，指定项目镜像启动生成容器的过程中需要完成的操作：将步骤1所得的包含洋葱地址的文件拷贝至该容器中、运行隐藏服务首页获取代码；2.4) Create a Docker configuration file specifying the operations to be completed while the project image is started to generate a container: copy the file containing the onion addresses obtained in step 1 into the container, and run the hidden service home page acquisition code;

2.5)启动项目镜像,读取已经创建的Docker配置文件,完成设定的操作;2.5) Start the project image, read the created Docker configuration file, and complete the set operation;

2.6)运行隐藏服务首页获取代码,对步骤1所得包含洋葱地址文件中洋葱地址对应的隐藏服务首页进行搜集,得到待分类隐藏服务首页集合。2.6) Run the hidden service home page acquisition code, collect the hidden service home pages corresponding to the onion addresses in the onion address file obtained in step 1, and obtain the set of hidden service home pages to be classified.
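Steps 2.4)–2.6) can be driven from a host-side script. The sketch below only constructs the `docker` CLI invocations that launch a container, copy the onion-address file in, and run the collection code; the image name, paths, and script name are illustrative assumptions, not values fixed by the invention:

```python
import shlex

def docker_commands(image: str, onion_file: str, container: str):
    """Return the shell commands for one collection container: start it from
    the project image, copy the onion-address file into it, and run the
    home-page collection code (all names are hypothetical examples)."""
    q = shlex.quote
    return [
        f"docker run -d --name {q(container)} {q(image)} sleep infinity",
        f"docker cp {q(onion_file)} {q(container)}:/data/onions.txt",
        f"docker exec {q(container)} python collect_homepages.py /data/onions.txt",
    ]
```

In practice the copy-and-run steps would instead be declared in the Docker configuration file, as the patent does; the list form above just makes the sequence of operations explicit.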

步骤3,对暗网地址文本数据集DUTA依次进行预处理。Step 3: Preprocess the darknet address text data set DUTA in sequence.

3.1)文件去重:3.1) File deduplication:

依次计算DUTA数据集中每个隐藏服务内容文件的第五代信息摘要MD5值，并存储于集合，如果某文件的MD5值已经存在于集合之中，则表明此文件与已经完成MD5值计算的文件包含的内容完全相同，对同一种内容只保留一个保存页面内容的文件，删除多余文件，得到去重文件集合；Calculate the Message-Digest Algorithm 5 (MD5) value of each hidden service content file in the DUTA data set in turn and store it in a set. If a file's MD5 value already exists in the set, the file's content is identical to that of a file whose MD5 value has already been computed; for each distinct content only one file preserving the page content is kept, redundant files are deleted, and the deduplicated file set is obtained;
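A minimal sketch of the MD5 deduplication in 3.1), operating on in-memory page contents rather than on-disk files for brevity:

```python
import hashlib

def deduplicate(contents):
    """Keep the first occurrence of each distinct page content, judged by
    its MD5 digest, mirroring the file-level deduplication of step 3.1)."""
    seen = set()
    kept = []
    for data in contents:
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            continue  # identical content already kept: drop the duplicate
        seen.add(digest)
        kept.append(data)
    return kept
```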

3.2)数据清洗:3.2) Data cleaning:

使用正则表达式,去除去重文件集合每个隐藏服务内容文件中无用的修饰内容,得到数据清洗后文件集合;Use regular expressions to remove the useless decoration content in each hidden service content file of the deduplication file set, and obtain the file set after data cleaning;

3.3)文本分词:3.3) Text segmentation:

使用自然语言处理工具NLTK，对数据清洗后文件集合中每个文件的隐藏服务内容进行文本分词，即将隐藏服务内容中每一个句子进行切分，得到组成这个句子的所有单词，再去除这些单词中的副词、介词、连词、助动词，保留其中的名词、动词、形容词、数量词这些实词；Use the natural language processing toolkit NLTK to tokenize the hidden service content of each file in the cleaned file set: split every sentence of the hidden service content into its constituent words, then remove the adverbs, prepositions, conjunctions, and auxiliary verbs among them, retaining the content words, namely nouns, verbs, adjectives, and quantifiers;

3.4)向量化:3.4) Vectorization:

使用向量生成器word2Vec对3.3)得到的实词进行编码，得到向量化的实词，形成训练集词向量。Use the vector generator word2vec to encode the content words obtained in 3.3), producing vectorized words that form the training set word vectors.

步骤4,构建多分类文本类别分类器。Step 4, build a multi-category text category classifier.

4.1)建立依次由嵌入层、第一个卷积层、第一个池化层、第二个卷积层、第二个池化层、全连接层和softmax层组成的深度神经网络,各层参数设置如下:4.1) Establish a deep neural network consisting of an embedding layer, the first convolution layer, the first pooling layer, the second convolution layer, the second pooling layer, the fully connected layer and the softmax layer. The parameter settings are as follows:

嵌入层:输入维度为200,输出维度为128;Embedding layer: the input dimension is 200, and the output dimension is 128;

第一个卷积层:卷积核数目为64,卷积核大小为7,滑动步长为2,激活函数设置为relu;The first convolution layer: the number of convolution kernels is 64, the size of the convolution kernel is 7, the sliding step size is 2, and the activation function is set to relu;

第一个池化层,池化窗口大小为7,池化步长为2;The first pooling layer, the pooling window size is 7, and the pooling step size is 2;

第二个卷积层:卷积核数目为64,卷积核大小为3,滑动步长为2,激活函数设置为relu;The second convolution layer: the number of convolution kernels is 64, the size of the convolution kernel is 3, the sliding step size is 2, and the activation function is set to relu;

第二个池化层,池化窗口大小为20,池化步长为1;The second pooling layer, the pooling window size is 20, and the pooling step size is 1;

全连接层:激活函数设置为relu;Fully connected layer: the activation function is set to relu;

Softmax层:输出类别数目为8类。Softmax layer: The number of output categories is 8.
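The layer sizes listed above chain together consistently. The helper below traces the sequence length through the two convolution/pooling stages, assuming 'valid' (no-padding) convolutions; it is a sanity check of the hyperparameters, not the training code itself:

```python
def out_len(n: int, window: int, stride: int) -> int:
    """Output length of a 'valid' 1-D convolution or pooling pass."""
    return (n - window) // stride + 1

seq = 200                  # embedding layer input dimension
seq = out_len(seq, 7, 2)   # first conv: kernel 7, stride 2 -> 97
seq = out_len(seq, 7, 2)   # first pool: window 7, stride 2 -> 46
seq = out_len(seq, 3, 2)   # second conv: kernel 3, stride 2 -> 22
seq = out_len(seq, 20, 1)  # second pool: window 20, stride 1 -> 3
```

The final pooling window of 20 fits the remaining length of 22, which is why the chosen kernel sizes and strides are mutually compatible.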

4.2)将DUTA数据集经过步骤3预处理后得到的训练集词向量作为训练数据,输入到4.1)建立的深度神经网络进行训练,得到多分类文本类别分类器:4.2) Use the training set word vector obtained after the preprocessing of the DUTA data set in step 3 as the training data, and input it into the deep neural network established in 4.1) for training to obtain a multi-category text category classifier:

4.2.1)设置深度神经网络的损失函数为交叉熵函数Ceem，设定目标输出阈值数据为Ri；4.2.1) Set the loss function of the deep neural network to the cross-entropy function Ceem, and set the target output threshold data to Ri;

4.2.2)深度神经网络模型中的各层对输入的训练数据进行运算，得到输出数据Ei，1≤i≤C；4.2.2) Each layer in the deep neural network model operates on the input training data to obtain the output data Ei, 1≤i≤C;

4.2.3)通过Ceem函数计算输出数据Ei与DUTA训练集给定的目标输出数据Ri的偏差，并使用Adam优化算法，对深度神经网络的参数进行更新，不断减小Ei与Ri的偏差程度；4.2.3) Use the Ceem function to compute the deviation between the output data Ei and the target output data Ri given by the DUTA training set, and use the Adam optimization algorithm to update the parameters of the deep neural network, continually reducing the deviation between Ei and Ri;

4.2.4)重复步骤4.2.2)~步骤4.2.3)训练1000轮,完成对深度神经网络参数的优化,得到多分类文本类别分类器。4.2.4) Repeat steps 4.2.2) to 4.2.3) for 1000 rounds of training to complete the optimization of the parameters of the deep neural network, and obtain a multi-category text category classifier.
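For a one-hot target Ri, the cross-entropy loss Ceem used in 4.2.1)–4.2.3) reduces to the negative log of the probability the network assigns to the true class. A minimal sketch:

```python
import math

def cross_entropy(outputs, targets):
    """Ceem loss between softmax outputs E_i and one-hot targets R_i:
    -sum_i R_i * log(E_i); only the true-class term contributes."""
    return -sum(r * math.log(e) for e, r in zip(outputs, targets) if r > 0)
```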

步骤5,使用多分类文本类别分类器计算隐藏服务首页属于C种类别中每种类别的概率。Step 5, using a multi-category text category classifier to calculate the probability that the home page of the hidden service belongs to each of the C categories.

5.1)对步骤2得到的隐藏服务首页进行与步骤3相同的预处理,得到测试集词向量;5.1) Perform the same preprocessing as step 3 on the hidden service home page obtained in step 2 to obtain the test set word vector;

5.2)将5.1)处理后得到的测试集词向量作为多分类文本类别分类器的输入，经过分类器运算得到第i个输出结果zi，i表示文本类别编号，1≤i≤C，C代表文本类别数目；5.2) Use the test set word vectors obtained after the processing in 5.1) as the input of the multi-category text category classifier, and obtain the i-th output result zi through the classifier operation, where i represents the text category number, 1≤i≤C, and C represents the number of text categories;

5.3)根据分类器的运算结果zi计算隐藏服务首页分别属于C种类别中第i种类别的概率Pi：5.3) According to the operation result zi of the classifier, calculate the probability Pi that the hidden service home page belongs to the i-th of the C categories:

Pi = exp(zi) / Σ_{j=1..C} exp(zj)
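The probability in 5.3) is the standard softmax over the classifier outputs. A numerically stable sketch:

```python
import math

def softmax(z):
    """Map classifier outputs z_1..z_C to probabilities P_1..P_C with
    P_i = exp(z_i) / sum_j exp(z_j); the maximum is subtracted first
    for numerical stability (it cancels in the ratio)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```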

步骤6,获取隐藏服务内容。Step 6, obtain hidden service content.

参照图2,本步骤的具体实现如下:Referring to Fig. 2, the concrete realization of this step is as follows:

6.1)统计步骤2得到的每一个隐藏服务首页页面内洋葱地址数量N;6.1) Count the number N of onion addresses in the home page of each hidden service obtained in step 2;

6.2)设置区分隐藏服务首页内嵌洋葱地址数量的阈值T=100,设置区分首页对应网站中所有页面内包含的洋葱地址数量的阈值K=500;6.2) Set the threshold value T=100 to distinguish the number of onion addresses embedded in the homepage of the hidden service, and set the threshold value K=500 to distinguish the number of onion addresses contained in all pages in the website corresponding to the homepage;

6.3)根据统计的洋葱地址数量N和设置的洋葱地址数量的阈值T，及隐藏服务首页分别属于C种类别中第i种类别的概率Pi，确定需要搜索的内容：6.3) According to the counted number N of onion addresses, the set threshold T of the number of onion addresses, and the probability Pi that the hidden service home page belongs to the i-th of the C categories, determine the content to be searched:

如果某个隐藏服务首页内洋葱地址个数N≥T或该首页属于每种类别的概率Pi之间相近，表明该首页可能包含多个类别的内容，则该首页对应的网站为目录类型，预判该首页对应网站中所有页面内包含的洋葱地址数量M大于K，执行步骤6.4)；If the number of onion addresses in a hidden service home page satisfies N≥T, or the probabilities Pi that the home page belongs to each category are close to one another, the home page may contain content of several categories and the corresponding website is of the directory type; the number M of onion addresses contained in all pages of that website is predicted to be greater than K, and step 6.4) is executed;

如果某个隐藏服务首页内洋葱地址个数N<T或该首页属于C种类别中第j种类别的概率Pj远大于该首页属于C种类别中除第j种类别的概率，表明该首页可能只属于第j种类别，则预判该首页对应网站中所有页面内包含的洋葱地址数量M小于阈值K，执行步骤6.5)；If the number of onion addresses in a hidden service home page satisfies N<T, or the probability Pj that the home page belongs to the j-th of the C categories is far greater than the probability that it belongs to any other category, the home page may belong only to the j-th category; the number M of onion addresses contained in all pages of the corresponding website is predicted to be less than the threshold K, and step 6.5) is executed;

6.4)设置D=50个隐藏服务内容搜集的Docker镜像，启动这些Docker镜像生成50个容器，在这些容器中运行已构建的隐藏服务内容搜集代码，使用步骤7的方法对每一个首页对应网站中所有页面内嵌洋葱地址对应的隐藏服务内容进行搜集，得到隐藏服务内容W；6.4) Set up D=50 Docker images for hidden service content collection and start them to generate 50 containers; in these containers, run the built hidden service content collection code and use the method of step 7 to collect the hidden service content corresponding to the onion addresses embedded in all pages of the website of each home page, obtaining the hidden service content W;

6.5)设置D=20个隐藏服务内容搜集的Docker镜像，启动这些Docker镜像生成20个容器，在这些容器中运行已构建的隐藏服务内容搜集代码，使用步骤7的方法对每一个首页对应网站中所有页面内嵌洋葱地址对应的隐藏服务内容进行搜集，得到隐藏服务内容W。6.5) Set up D=20 Docker images for hidden service content collection and start them to generate 20 containers; in these containers, run the built hidden service content collection code and use the method of step 7 to collect the hidden service content corresponding to the onion addresses embedded in all pages of the website of each home page, obtaining the hidden service content W.
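The branching in 6.3)–6.5) can be condensed into one predicate. In the sketch below, `close_eps` (how near the probabilities must be to count as "close") is an assumed tolerance, since the patent does not fix a value for it:

```python
def docker_image_count(n_onion, probs, T=100, close_eps=0.1):
    """Return the number of collection containers D for one home page:
    50 when the page looks like a directory (N >= T, or no category
    probability clearly dominates, i.e. predicted M > K), else 20."""
    top_two_gap = (max(probs) - sorted(probs)[-2]) if len(probs) > 1 else 1.0
    dominant = top_two_gap > close_eps
    if n_onion >= T or not dominant:
        return 50  # directory-type site: predicted to hold M > K addresses
    return 20      # single-category site: predicted M < K
```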

步骤7,对每一个首页对应网站中所有页面内嵌洋葱地址对应的隐藏服务内容进行搜集。Step 7: Collect hidden service content corresponding to the embedded onion addresses in all pages of the website corresponding to each home page.

7.1)编写代码,对隐藏服务首页对应网站中所有页面内嵌洋葱地址对应的隐藏服务内容进行搜集;7.1) Write code to collect the hidden service content corresponding to the embedded onion address on all pages in the website corresponding to the hidden service home page;

7.2)分别获取Ubuntu操作系统镜像和Tor镜像,并对Tor镜像进行配置,设置网络代理;7.2) Obtain the Ubuntu operating system image and the Tor image respectively, configure the Tor image, and set the network proxy;

7.3)将7.1)编写的代码与上述Ubuntu操作系统镜像和Tor镜像这两个镜像进行整合打包,形成项目镜像;7.3) Integrate and package the code written in 7.1) with the two images of the above-mentioned Ubuntu operating system image and Tor image to form a project image;

7.4)创建Docker配置文件,指定项目镜像启动生成容器的过程中需要完成的操作:运行7.1)编写的代码;7.4) Create a Docker configuration file and specify the operations that need to be completed in the process of starting the project image to generate the container: run the code written in 7.1);

7.5)启动项目镜像,读取已经创建的Docker配置文件,完成设定的操作;7.5) Start the project image, read the created Docker configuration file, and complete the set operation;

7.6)运行7.1)编写的代码,对步骤2得到的每一个首页对应网站中所有页面内嵌洋葱地址对应的隐藏服务内容进行搜集。7.6) Run the code written in 7.1), and collect the hidden service content corresponding to the onion address embedded in all pages in the website corresponding to each home page obtained in step 2.

步骤8,解析提取出隐藏服务内容中所有的内嵌洋葱地址,并统计目前得到的洋葱地址总数和隐藏服务内容总数。Step 8: Parse and extract all embedded onion addresses in the hidden service content, and count the total number of onion addresses and hidden service content currently obtained.

8.1)读取步骤6得到的隐藏服务内容W;8.1) Read the hidden service content W obtained in step 6;

8.2)根据洋葱地址的格式特点：即前缀为http或https，后缀为.onion，前缀和后缀之间，是由16位字母或数字组成的字符串，使用正则表达式：https*://\w+\.onion，对读取得到的隐藏服务内容进行正则匹配，得到隐藏服务内容中所有的内嵌洋葱地址Z3，并将这些内嵌洋葱地址与1.1)和1.2)所获的洋葱地址相加，得到洋葱地址的总量为：Z=Z1+Z2+Z3。8.2) According to the format of an onion address, namely a prefix of http or https, a suffix of .onion, and a 16-character string of letters or digits in between, use the regular expression https*://\w+\.onion to perform regular matching on the hidden service content that was read, obtaining all embedded onion addresses Z3 in the hidden service content; adding these embedded onion addresses to the onion addresses obtained in 1.1) and 1.2) gives the total number of onion addresses: Z = Z1 + Z2 + Z3.
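The pattern in 8.2) can be applied directly with Python's `re` module. Note that the patent's own `\w+` accepts a label of any length; in practice the 16-character constraint of v2 addresses could be tightened to `\w{16}`:

```python
import re

# The patent's pattern: http or https scheme, a label, the .onion suffix.
ONION_RE = re.compile(r"https*://\w+\.onion")

def extract_onions(text: str):
    """Return the distinct embedded onion addresses found in page content."""
    return sorted(set(ONION_RE.findall(text)))
```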

步骤9,判断洋葱地址数量和隐藏服务内容数量是否达到设定目标。Step 9: Determine whether the number of onion addresses and the number of hidden service contents reach the set target.

9.1)根据实际需要设定要获得的洋葱目标地址数量为X和隐藏服务内容目标数量为Y;9.1) According to actual needs, set the number of onion target addresses to be obtained as X and the number of hidden service content targets as Y;

9.2)将X与Z以及Y与W的大小进行比较:9.2) Compare the magnitudes of X with Z and Y with W:

如果X>Z或Y>W，则表明目前获得的洋葱地址数量Z或隐藏服务内容数量W未达到设定的目标，重复步骤1~步骤8直到洋葱地址数量Z和隐藏服务内容数量W达到设定的数量；If X>Z or Y>W, the number of onion addresses Z or the number of hidden service content items W obtained so far has not reached the set target; repeat step 1 to step 8 until the number of onion addresses Z and the number of hidden service content items W reach the set quantities;

如果同时满足X≤Z和Y≤W，则表明目前获得的洋葱地址数量Z和隐藏服务内容数量W已经达到了设定的数量，则停止数据搜集，获得最终搜索出的洋葱地址数量和隐藏服务内容数量。If both X≤Z and Y≤W are satisfied, the number of onion addresses Z and the number of hidden service content items W obtained so far have reached the set quantities; data collection is stopped, and the final numbers of collected onion addresses and hidden service content items are obtained.
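The stopping test in step 9 is simply the conjunction of the two target checks; a one-line sketch:

```python
def targets_reached(Z, W, X, Y):
    """True once both the onion-address target X and the hidden service
    content target Y are met by the collected totals Z and W."""
    return X <= Z and Y <= W
```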

以上描述仅是本发明的一个具体实例，并未构成对本发明的任何限制，显然对于本领域的专业人员来说，在了解了本发明内容和原理后，都可能在不背离本发明原理、结构的情况下，进行形式和细节上的各种修改和改变，但是这些基于本发明思想的修正和改变仍在本发明的权利要求保护范围之内。The above description is only a specific example of the present invention and does not constitute any limitation on it. Obviously, after understanding the content and principles of the present invention, those skilled in the art may make various modifications and changes in form and detail without departing from the principles and structure of the invention, but such modifications and changes based on the idea of the present invention still fall within the protection scope of the claims of the present invention.

Claims (8)

1. A method for collecting onion addresses and hidden service contents based on Docker is characterized by comprising the following steps:
(1) obtaining an onion address:
1a) respectively searching specific keywords by using a deep web search engine and a surface web search engine, and extracting the onion addresses in the search results to obtain onion addresses Z1 which are related to the keywords and indexed by the search engines;
1b) arranging a relay node in Tor to enable the relay node to become a hidden service directory server, and performing, by modifying the source code, the hash calculation and encoding required for generating an onion address on the hidden service public key in the server, to obtain onion addresses Z2 corresponding to the public keys;
Storing the onion addresses obtained by the two methods as files;
(2) collecting hidden service home pages corresponding to all the onion addresses obtained by the two methods in the step (1) to obtain a set of hidden service home pages to be classified;
(3) sequentially carrying out file duplicate removal, data cleaning, text word segmentation and vectorization on a dark web address text data set DUTA to construct a multi-classification text category classifier:
(3a) sequentially carrying out preprocessing of file duplicate removal, data cleaning, text word segmentation and vectorization on a dark web address text data set DUTA which belongs to C categories and has category marks to obtain a training set word vector;
(3b) setting parameters of a deep neural network sequentially consisting of an embedding layer, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a full-link layer and a softmax layer, inputting training set word vectors obtained after the DUTA data set is preprocessed in the step (3a) as training data into the deep neural network for training to obtain a multi-classification text class classifier;
(4) sequentially carrying out file duplication removal, data cleaning, text word segmentation and vectorization preprocessing on each unknown hidden service home page in the hidden service home page set to be classified in the step (2) to obtain a test set word vector, taking the test set word vector as the input of the trained multi-classification text class classifier, obtaining with the classifier the probability Pi, 1 ≤ i ≤ C, that each preprocessed hidden service home page belongs to the i-th of the C classes, and counting the number N of onion addresses in each home page;
(5) setting the threshold T of the number of onion addresses embedded in a hidden service home page to 100, and setting the threshold K of the number of onion addresses contained in all pages of the website corresponding to a hidden service home page to 500; comparing the number N of onion addresses in each hidden service home page with the threshold T, and examining the probabilities Pi:
if the number N of onion addresses in a certain hidden service home page is greater than or equal to T, or the probabilities Pi, 1 ≤ i ≤ C, that the home page belongs to each category are close to one another, the home page possibly comprises contents of a plurality of categories and the website corresponding to the home page is of a directory type; the number M of onion addresses contained in all pages of the website corresponding to the home page is judged to be greater than K, and the step (6) is executed;
if the number N of onion addresses in a certain hidden service home page is less than T, or the probability Pj that the home page belongs to the j-th of the C categories is far greater than the probability that it belongs to any category other than the j-th, the home page possibly belongs only to the j-th category; the number M of onion addresses contained in all pages of the website corresponding to the home page is judged to be smaller than the threshold K, and the step (7) is executed;
(6) setting D=50 Docker images for hidden service content collection, starting the Docker images to generate 50 containers, running the constructed hidden service content collection code in the containers, and collecting the hidden service contents corresponding to the onion addresses embedded in all pages of the website corresponding to each home page;
(7) setting D=20 Docker images for hidden service content collection, starting the Docker images to generate 20 containers, running the constructed hidden service content collection code in the containers, and collecting the hidden service contents corresponding to the onion addresses embedded in all pages of the website corresponding to each home page;
(8) counting the total number W of the hidden service contents collected in (6) and (7), analyzing and extracting all the embedded onion addresses Z3 from the hidden service contents, and adding them to the onion addresses obtained in (1a) and (1b) to obtain the total number of onion addresses: Z = Z1 + Z2 + Z3;
(9) Setting the number of onion target addresses to be obtained as X and the number of hidden service content targets as Y according to actual needs, and comparing the sizes of X and Z and Y and W:
if X is larger than Z or Y is larger than W, the number Z of the onion addresses obtained at present or the number W of the hidden service contents does not reach the set target, and repeating the steps (1) - (8) until the number Z of the onion addresses and the number W of the hidden service contents reach the set number;
if X is less than or equal to Z and Y is less than or equal to W, the number Z of the onion addresses and the number W of the hidden service contents which are obtained at present are indicated to reach the set number, and then data collection is stopped.
2. The method of claim 1, wherein the hidden service home pages corresponding to the onion addresses obtained in the step (1) are collected in the step (2) by using the Docker technique, implemented as follows:
(2a) compiling hidden service home page collection codes;
(2b) respectively acquiring a Ubuntu operating system image and a Tor image, configuring the Tor image and setting a network agent;
(2c) integrating and packaging a hidden service home page collecting code and the two images of the Ubuntu operating system image and the Tor image to form an item image for collecting the hidden service home page;
(2d) creating a Docker configuration file, and specifying operations to be completed in the process of starting the project image to generate the container: copying the file containing the onion address obtained in the step (1) to the container, and operating a hidden service home page acquisition code;
(2e) starting the project mirror image, reading the established Docker configuration file, and finishing the set operation;
(2f) and (3) operating the hidden service home page acquisition code, and collecting the hidden service home page corresponding to the onion address in the onion address file obtained in the step (1) to obtain a hidden service home page set to be classified.
3. The method according to claim 1, wherein in the step (3a), the dark web address text data set DUTA is preprocessed by document de-duplication, data cleaning, text word segmentation and vectorization in sequence, and the method is implemented as follows:
(3a1) file deduplication:
sequentially calculating the Message-Digest Algorithm 5 (MD5) value of each hidden service content file in the DUTA data set and storing the values in a set; if the MD5 value of a certain file already exists in the set, the content of the file is completely the same as that of a file whose MD5 value has already been calculated; only one file storing the page content is reserved for each distinct content, and redundant files are deleted to obtain the deduplicated file set;
(3a2) data cleaning:
removing useless modification content in each hidden service content file of the duplicate removal file set by using a regular expression to obtain a data-cleaned file set;
(3a3) text word segmentation
Using a natural language processing tool NLTK to perform text segmentation on the hidden service contents of each file in the file set after data cleaning, namely segmenting each sentence in the hidden service contents to obtain all words forming the sentence, removing the adverbs, prepositions, conjunctions and auxiliary verbs among the words, and keeping the content words, namely the nouns, verbs, adjectives and quantifiers;
(3a4) vectorization
and using the vector generator word2Vec to encode the content words obtained in (3a3) to obtain vectorized words and form the training set word vectors.
4. The method of claim 1, wherein the setting of the structural parameters of the deep neural network in (3b) is as follows:
embedding layer: an input dimension of 200 and an output dimension of 128;
first convolutional layer: the number of convolution kernels is 64, the size of the convolution kernels is 7, the sliding step length is 2, and the activation function is set to relu;
the first max pooling layer: the pooling window size is 7, and the pooling step size is 2;
a second convolutional layer: the number of convolution kernels is 64, the size of the convolution kernels is 3, the sliding step length is 2, and the activation function is set to relu;
the second max pooling layer: the pooling window size is 20, and the pooling step size is 1;
full connection layer: the activation function is set to relu;
softmax layer: the number of output classes is 8 classes.
5. The method of claim 1, wherein the deep neural network is trained in (3b) as follows:
(3b1) setting the loss function of the deep neural network as the cross entropy function Ceem, and setting the target output threshold data Ri;
(3b2) each layer in the deep neural network model operates on the input training data to obtain output data Ei, 1 ≤ i ≤ C;
(3b3) computing, through the Ceem function, the deviation between the output data Ei and the target output data Ri given by the DUTA training set, and updating the parameters of the deep neural network by using the Adam optimization algorithm to continuously reduce the degree of deviation between Ei and Ri;
(3b4) repeating the steps (3b2)-(3b3) for 1000 rounds of training, completing the optimization of the deep neural network parameters, and obtaining the multi-classification text category classifier.
6. The method of claim 1, wherein in (4) the probability Pi that a hidden service home page belongs to each of the C classes is calculated using the multi-classification text class classifier:
(4.1) carrying out the same preprocessing as in (3a) on the hidden service home pages obtained in the step (2) to obtain test set word vectors;
(4.2) using the test set word vectors obtained in (4.1) as the input of the multi-classification text category classifier, and obtaining the i-th output result zi through the operation of the classifier, wherein i represents the text category number, 1 ≤ i ≤ C, and C represents the number of text categories;
(4.3) calculating, according to the operation result zi of the classifier, the probability Pi that the hidden service home page belongs to the i-th of the C categories:
Pi = exp(zi) / Σ_{j=1..C} exp(zj)
7. The method of claim 1, wherein the hidden service contents corresponding to all page embedded onion addresses in the website corresponding to each home page are collected in step (6) as follows:
(6a) compiling codes, and collecting hidden service contents corresponding to the onion addresses embedded in all pages in the website corresponding to the hidden service home page;
(6b) respectively acquiring a Ubuntu operating system image and a Tor image, configuring the Tor image and setting a network agent;
(6c) integrating and packaging the code written in the step (6a) and the Ubuntu operating system mirror image and the Tor mirror image to form a project mirror image;
(6d) creating a Docker configuration file, and specifying operations to be completed in the process of starting the project image to generate the container: running (6a) the written code;
(6e) starting the project mirror image, reading the established Docker configuration file, and finishing the set operation;
(6f) and (3) running the code compiled in the step (6a), and collecting hidden service contents corresponding to all page embedded onion addresses in the website corresponding to each home page obtained in the step (2) to obtain a hidden service content set.
8. The method of claim 1, wherein (8) parsing extracts all the embedded onion addresses in the hidden service content as follows:
(8a) reading hidden service content;
(8b) according to the format characteristics of the onion address, namely that the prefix is http or https, the suffix is .onion, and between the prefix and the suffix is a string of 16 letters or digits, performing regular matching on the read hidden service content by using the regular expression https*://\w+\.onion, to obtain all the embedded onion addresses in the hidden service content.
CN202110085622.9A 2021-01-22 2021-01-22 Docker-based onion address and hidden service content collection method Active CN112764882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110085622.9A CN112764882B (en) 2021-01-22 2021-01-22 Docker-based onion address and hidden service content collection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110085622.9A CN112764882B (en) 2021-01-22 2021-01-22 Docker-based onion address and hidden service content collection method

Publications (2)

Publication Number Publication Date
CN112764882A CN112764882A (en) 2021-05-07
CN112764882B true CN112764882B (en) 2022-09-23

Family

ID=75702647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110085622.9A Active CN112764882B (en) 2021-01-22 2021-01-22 Docker-based onion address and hidden service content collection method

Country Status (1)

Country Link
CN (1) CN112764882B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378030B (en) * 2021-05-18 2022-09-20 上海德衡数据科技有限公司 Search method of search engine, search engine architecture, device and storage medium
CN114157713B (en) * 2021-10-09 2023-06-16 北京邮电大学 Method and system for capturing hidden service traffic
CN115001987B (en) * 2022-07-19 2022-12-09 中国电子科技集团公司第三十研究所 Domain name scale evaluation method and system for Tor network
CN115002045B (en) * 2022-07-19 2022-12-09 中国电子科技集团公司第三十研究所 Twin network-based dark website session identification method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108494769A (en) * 2018-03-21 2018-09-04 Guangzhou University Source tracing method for hidden services in Tor anonymous networks
CN110825950A (en) * 2019-09-25 2020-02-21 Institute of Information Engineering, Chinese Academy of Sciences Hidden service discovery method based on meta search
CN110889133A (en) * 2019-11-07 2020-03-17 Institute of Information Engineering, Chinese Academy of Sciences Anti-network-tracking privacy protection method and system based on identity behavior confusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017115272A1 (en) * 2015-12-28 2017-07-06 Sixgill Ltd. Dark web monitoring, analysis and alert system and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Siti Hajar Aminah Ali et al., "A neural network model for detecting DDoS attacks using darknet traffic features", 2016 International Joint Conference on Neural Networks (IJCNN), 3 Nov. 2016, pp. 2979-2985 *
Ding Xiang, "Research and Implementation of Tor Darknet Content Discovery and Analysis Technology", China Master's Theses Full-text Database, Information Science and Technology Series, No. 06, 15 Jun. 2020, pp. I139-55 *
Yu Zhiwei et al., "Construction of a Scrapy-based Distributed Darknet Detection Crawler", Computer Era, No. 04, 15 Apr. 2020, full text *

Also Published As

Publication number Publication date
CN112764882A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN112764882B (en) Docker-based onion address and hidden service content collection method
JP4668567B2 (en) System and method for client-based web crawling
CN107590169B Preprocessing method and system for operator gateway data
CN1684464B (en) Method and system for updating objects between a local device and a remote device
CN103593476B (en) Multi-keyword plaintext and ciphertext retrieving method and device oriented to cloud storage
US7818303B2 (en) Web graph compression through scalable pattern mining
CN112231700B (en) Behavior recognition method and apparatus, storage medium, and electronic device
EP2336908A1 (en) Search device, search method and search program using open search engine
CN110737663B (en) Data storage method, device, equipment and storage medium
CN101154228A (en) A segmented pattern matching method and device thereof
CN111026858A (en) Project information processing method and device based on project recommendation model
US8037081B2 (en) Methods and systems for detecting fragments in electronic documents
CN112231481B (en) Method, device, computer equipment and storage medium for classifying website
CN111552797A (en) Name prediction model training method and device, electronic equipment and storage medium
CN112580331A (en) Method and system for establishing knowledge graph of policy text
CN109492355A Anti-software-analysis method and system based on deep learning
CN101984620B (en) Codebook generating method and convert communication system
CN111460277A (en) A personalized recommendation method based on mobile social network tree-like transmission path
WO2025189631A1 (en) Training method for domain-name generation model, phishing website detection method, and related apparatus
KR100714504B1 (en) Content retrieval system and method of personal terminal using wired / wireless internet
CN103927325B (en) Method and device for classifying URLs
CN113065312B (en) A method and device for extracting text labels
CN115730589A (en) News propagation path generation method based on word vector and related device
CN100524300C (en) Content-oriented indexing and searching method and system
CN112257073A (en) A Web Page Deduplication Method Based on Improved DBSCAN Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant