CN106202077B - Task distribution method and device

Info

Publication number: CN106202077B
Application number: CN201510217232.7A
Authority: CN
Inventors: 左啸冰; 罗纯杰
Original assignee: Huawei Technologies Co Ltd; Institute of Computing Technology of CAS
Current assignee: Huawei Technologies Co Ltd; Institute of Computing Technology of CAS
Priority date: 2015-04-30
Filing date: 2015-04-30
Publication date: 2020-01-21
Anticipated expiration: 2035-04-30
Also published as: CN106202077A

Abstract

The embodiments of the invention disclose a task distribution method and device for improving the crawling efficiency of a web crawler. The method comprises: parsing the Uniform Resource Locator (URL) of a first page and the hash value of the parent page of that URL into a task list; determining whether the URL of the first page has been crawled; when it has not been crawled, determining whether the first page and a second page are adjacent pages according to the subdomain name in the URL of the first page and the hash value of the parent page of that URL; when the two pages are not adjacent, allocating the first page to a new thread; when they are adjacent, allocating the first page to the thread where the second page is located; and executing download tasks on the allocated threads under timing control.

Description

A task distribution method and device

Technical field

The present invention relates to the field of communication technologies, and in particular to a task distribution method and device.

Background

In web crawler technology, a major difficulty is how to fetch the content of a target set of URL (Uniform Resource Locator) web pages in bulk without the server hosting those pages mistaking the crawler for a malicious one. To crawl efficiently, a web crawler must open multiple download threads that send access requests concurrently for the pages in the URL task list. The target website's server, however, inspects the sequence and frequency of URL access requests issued from a single IP (Internet Protocol) address to decide whether they come from a malicious web crawler. If the set of pages in the URL access request sequence issued by an IP address within a short period matches the website's definition of adjacent pages, the server concludes that the IP address is running a malicious web crawler. Various techniques and strategies are used to avoid this.

The prior art provides a distributed web crawler technique that sorts the task list by priority points. When a distributed web crawler crawls web pages, each node is given one fixed IP address, the maximum number of threads each node runs concurrently is set to 3 or fewer, and timing control is applied to the URL requests executed on each thread. In addition, a data item named "points" is added to the task list; the points are determined by analyzing the URL or content of each web page, and the task queue is re-sorted in descending order of points. This approach greatly reduces the number of IP addresses required. Moreover, because the task list is reordered, the access order of adjacent pages that share a parent page may be disrupted; each IP address runs at most 3 concurrent crawling threads and therefore issues at most 3 concurrent access requests, and timing control keeps those requests from being too dense, so the crawler is not easily mistaken for a malicious one.

With this method, however, each node runs too few concurrent threads, so the crawling efficiency (the number of web pages crawled per unit time) is very low. Adjacent pages may also still be accessed intensively within the same period, so the crawler may still be identified as a malicious web crawler.

Summary of the invention

The present invention provides a task distribution method and device for improving the crawling efficiency of a web crawler.

A first aspect of the present invention provides a task distribution method, comprising:

parsing the uniform resource locator (URL) of a first page and the hash value of the parent page of the URL of the first page into a task list;

determining whether the URL of the first page has been crawled;

when the URL of the first page has not been crawled, determining whether the first page and a second page are adjacent pages according to the subdomain name in the URL of the first page and the hash value of the parent page of the URL of the first page;

when it is determined that the first page and the second page are not adjacent pages, allocating the first page to a new thread;

when it is determined that the first page and the second page are adjacent pages, allocating the first page to the thread where the second page is located; and

executing download tasks on the allocated threads under timing control.

With reference to the first aspect of the present invention, in a first implementation of the first aspect, determining whether the URL of the first page has been crawled specifically includes:

determining, according to a URL list of crawled pages, whether the URL of the first page has been crawled, and if so, deleting the URL of the first page from the task list.

With reference to the first aspect of the present invention, in a second implementation of the first aspect, determining whether the first page and the second page are adjacent pages according to the subdomain name in the URL of the first page and the hash value of the parent page of the URL of the first page specifically includes:

determining whether the subdomain name of the URL of the first page is the same as the subdomain name of the URL of the second page, and determining whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;

when the subdomain names are the same or the hash values are equal, determining that the first page and the second page are adjacent pages;

when the subdomain names are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages.
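
As an illustration only, the adjacency test of this second implementation might be sketched in Python as follows. The sketch assumes that the "subdomain name" is the host name of the URL and that the parent-page hash is already available as a string; `Page`, `subdomain_of`, and `is_adjacent` are hypothetical names that do not appear in the original text.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class Page(NamedTuple):
    url: str
    parent_hash: str  # hash value of the page that linked to this URL


def subdomain_of(url: str) -> str:
    # Assumption: the host name stands in for the "subdomain name".
    return urlsplit(url).hostname or ""


def is_adjacent(first: Page, second: Page) -> bool:
    # Adjacent when the subdomain names match OR the parent-page hashes match;
    # not adjacent only when both differ.
    return (subdomain_of(first.url) == subdomain_of(second.url)
            or first.parent_hash == second.parent_hash)
```

Under these assumptions, two pages on the same host are treated as adjacent even when their parent pages differ, and two pages linked from the same parent page are treated as adjacent even when they sit on different hosts.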

With reference to the first aspect of the present invention, or the first implementation of the first aspect, or the second implementation of the first aspect, in a third implementation of the first aspect, allocating the first page to a new thread specifically includes:

allocating the first page to a new thread, and writing the thread number corresponding to the new thread into the thread number item of the URL of the first page, wherein the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

deleting the URL of the first page from the task list and writing it into the URL list of crawled pages.

With reference to the first aspect of the present invention, or the first implementation of the first aspect, or the second implementation of the first aspect, in a fourth implementation of the first aspect, allocating the first page to the thread where the second page is located specifically includes:

allocating the first page to the thread where the second page is located, and writing the thread number corresponding to the thread where the second page is located into the thread number item of the URL of the first page, wherein the thread where the second page is located saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page.
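
The bookkeeping described in the third and fourth implementations could look roughly like the following Python sketch. It is not the patented implementation: `TaskEntry`, `Scheduler`, and `assign` are hypothetical names, thread numbers are assumed to be consecutive integers, and moving the URL to the crawled list in the existing-thread case is an assumption taken from the device's second deleting and writing module rather than from the fourth method implementation itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TaskEntry:
    url: str
    parent_hash: str
    thread_no: Optional[int] = None  # the "thread number item" of the URL


@dataclass
class Scheduler:
    task_list: List[TaskEntry] = field(default_factory=list)
    crawled_urls: List[str] = field(default_factory=list)
    # thread number -> (parent hash, URL) pairs saved by that thread
    threads: Dict[int, List[Tuple[str, str]]] = field(default_factory=dict)

    def assign(self, task: TaskEntry, existing_thread_no: Optional[int]) -> int:
        if existing_thread_no is None:
            # No adjacent page is running: open a new thread (assumed numbering).
            thread_no = max(self.threads, default=-1) + 1
            self.threads[thread_no] = []
        else:
            # An adjacent page is running: reuse its thread.
            thread_no = existing_thread_no
        task.thread_no = thread_no
        self.threads[thread_no].append((task.parent_hash, task.url))
        # Move the URL from the task list to the crawled-page URL list.
        self.task_list.remove(task)
        self.crawled_urls.append(task.url)
        return thread_no
```

Passing `None` as `existing_thread_no` models the not-adjacent case; passing a thread number models the adjacent case.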

A second aspect of the present invention provides a task distribution device, comprising:

a parsing unit, configured to parse the uniform resource locator (URL) of a first page and the hash value of the parent page of the URL of the first page into a task list;

a judging unit, configured to determine whether the URL of the first page has been crawled;

a determining unit, configured to determine, when the URL of the first page has not been crawled, whether the first page and a second page are adjacent pages according to the subdomain name in the URL of the first page and the hash value of the parent page of the URL of the first page;

a first allocation unit, configured to allocate the first page to a new thread after it is determined that the first page and the second page are not adjacent pages;

a second allocation unit, configured to allocate the first page to the thread where the second page is located after it is determined that the first page and the second page are adjacent pages;

an execution unit, configured to execute download tasks on the allocated threads under timing control.

With reference to the second aspect of the present invention, in a first implementation of the second aspect, the judging unit is specifically configured to determine, according to a URL list of crawled pages, whether the URL of the first page has been crawled, and if so, to delete the URL of the first page from the task list.

With reference to the second aspect of the present invention, in a second implementation of the second aspect, the determining unit specifically includes:

a judging module, configured to determine, when the URL of the first page has not been crawled, whether the subdomain name of the URL of the first page is the same as the subdomain name of the URL of the second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;

a first determining module, configured to determine that the first page and the second page are adjacent pages when the subdomain names are the same or the hash values are equal;

a second determining module, configured to determine that the first page and the second page are not adjacent pages when the subdomain names are different and the hash values are not equal.

With reference to the second aspect of the present invention, or the first implementation of the second aspect, or the second implementation of the second aspect, in a third implementation of the second aspect, the first allocation unit specifically includes:

a first distribution module, configured to allocate the first page to a new thread and write the thread number corresponding to the new thread into the thread number item of the URL of the first page, wherein the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

a first deleting and writing module, configured to delete the URL of the first page from the task list and write it into the URL list of crawled pages.

With reference to the second aspect of the present invention, or the first implementation of the second aspect, or the second implementation of the second aspect, in a fourth implementation of the second aspect, the second allocation unit specifically includes:

a second distribution module, configured to allocate the first page to the thread where the second page is located and write the thread number corresponding to the thread where the second page is located into the thread number item of the URL of the first page, wherein the thread where the second page is located saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

a second deleting and writing module, configured to delete the URL of the first page from the task list and write it into the URL list of crawled pages.

It can be seen from the above technical solutions that the embodiments of the present invention have the following advantages. The URL of a first page and the hash value of the parent page of that URL are parsed into a task list, and it is determined whether the URL of the first page has been crawled. When it has not been crawled, whether the first page and a second page are adjacent pages is determined according to the subdomain name in the URL of the first page and the hash value of the parent page of that URL. When the two pages are not adjacent, the first page is allocated to a new thread and the download task is executed under timing control; when they are adjacent, the first page is allocated to the thread where the second page is located and the download task is executed under timing control. Because the invention allocates the URL tasks of a distributed web crawler in this way, adjacent pages are assigned to the same thread, so that the adjacent pages handled from one IP address are downloaded serially, while non-adjacent pages are assigned to different threads and downloaded concurrently. Therefore, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses is effectively prevented from being identified as a malicious web crawler.
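
The serial-within-a-thread, concurrent-across-threads behavior described above can be illustrated with a short Python sketch. It uses the standard `concurrent.futures.ThreadPoolExecutor`; the function `run_groups`, the `groups` mapping from thread number to URL list, and the injected `download` callable are hypothetical, and timing control between requests is deliberately left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_groups(groups: Dict[int, List[str]],
               download: Callable[[str], str]) -> Dict[int, List[str]]:
    """Download each group's URLs one after another, with groups running in parallel."""

    def run_one(urls: List[str]) -> List[str]:
        # Serial loop: adjacent pages in one group are never requested at the same time.
        return [download(url) for url in urls]

    if not groups:
        return {}
    # One worker per group, so non-adjacent groups download concurrently.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        futures = {no: pool.submit(run_one, urls) for no, urls in groups.items()}
        return {no: fut.result() for no, fut in futures.items()}
```

Each group here plays the role of one thread holding mutually adjacent pages; the result keeps the per-group download order.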

Description of the drawings

FIG. 1 is a schematic flowchart of an embodiment of the task distribution method provided by the present invention;

FIG. 2 is a schematic flowchart of another embodiment of the task distribution method provided by the present invention;

FIG. 3 is a schematic flowchart of another embodiment of the task distribution method provided by the present invention;

FIG. 4 is a schematic flowchart of another embodiment of the task distribution method provided by the present invention;

FIG. 5 is a schematic structural diagram of an embodiment of the task distribution device provided by the present invention;

FIG. 6 is a schematic structural diagram of another embodiment of the task distribution device provided by the present invention;

FIG. 7 is a schematic structural diagram of another embodiment of the task distribution device provided by the present invention;

FIG. 8 is a schematic structural diagram of another embodiment of the task distribution device provided by the present invention;

FIG. 9 is a schematic structural diagram of another embodiment of the task distribution device provided by the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

It should be understood that although the terms first, second, and so on may be used in the embodiments of the present invention to describe users or terminals, the users or terminals are not limited by these terms. The terms are used only to distinguish users or terminals from one another. For example, without departing from the scope of the embodiments of the present invention, a first user may also be called a second user, and a second user may likewise be called a first user or a third user; the embodiments of the present invention place no limitation on this.

First, some abbreviations and key terms used in the present invention are defined:

Web crawler: a program or script that automatically fetches information from the World Wide Web according to certain rules.

URL: on the WWW, every information resource has a uniform address that is unique on the network; this address is called a URL (Uniform Resource Locator). It is the WWW's uniform resource locator, that is, a network address.

Adjacent pages: websites define adjacent pages differently, but the definition usually covers the different pages pointed to by the URL links contained in one web page, as well as pages with the same subdomain name (for example, http://www.very.com/entries/793336/ and http://www.very.com/entries/792442/).

In-thread timing control of the crawler: setting the time interval between the URL access requests that a crawler download thread sends to the target website. The interval is greater than the interval a person needs to click through web pages, and it must be a random value.

IP address: a 32-bit address (as in IPv4) assigned to each host connected to the Internet.
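
The in-thread timing control defined above can be sketched as a random delay that always exceeds a human click interval. The patent gives no concrete numbers, so the constants `HUMAN_CLICK_INTERVAL_S` and `MAX_EXTRA_DELAY_S` and the function `next_request_delay` below are assumptions chosen only for illustration.

```python
import random

HUMAN_CLICK_INTERVAL_S = 2.0  # assumed time a person needs between clicks
MAX_EXTRA_DELAY_S = 3.0       # assumed upper bound on the random extra wait


def next_request_delay(rng: random.Random) -> float:
    """Return a random delay strictly greater than the human click interval."""
    # uniform(0.1, MAX) is at least 0.1, so the result always exceeds the base interval.
    return HUMAN_CLICK_INTERVAL_S + rng.uniform(0.1, MAX_EXTRA_DELAY_S)
```

A download thread would sleep for `next_request_delay(rng)` seconds (for example with `time.sleep`) before sending its next URL access request.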

An embodiment of the present invention provides a task distribution method, performed mainly by a task distribution device. Referring to FIG. 1, one embodiment of the task distribution method provided by the present invention includes:

101. Parse the uniform resource locator (URL) of a first page and the hash value of the parent page of the URL of the first page into a task list;

It should be noted that the parser is responsible for passing data to the task list; specifically, it parses the obtained URL of the first page and the hash value of the parent page of that URL into the task list. The data passed is therefore no longer just the parsed URL string but a two-element array that pairs the URL string with the hash value of the URL's parent page. Accordingly, the format of the task list changes from listing only URL strings to listing each URL string together with the hash value that identifies the URL's parent page.

102. Determine whether the URL of the first page has been crawled;

It should be noted that when a new URL task in the task list enters the task waiting queue, deduplication is performed first; it is therefore first necessary to determine whether the URL of the first page has been crawled.

103. When the URL of the first page has not been crawled, determine whether the first page and a second page are adjacent pages according to the subdomain name in the URL of the first page and the hash value of the parent page of the URL of the first page;

The second page is a page, different from the first page, that is in a running thread.

104. When it is determined that the first page and the second page are not adjacent pages, allocate the first page to a new thread;

105. When it is determined that the first page and the second page are adjacent pages, allocate the first page to the thread where the second page is located;

106. Execute download tasks on the allocated threads under timing control.

In this embodiment of the present invention, the URL of the first page and the hash value of the parent page of that URL are parsed into the task list, and it is determined whether the URL of the first page has been crawled. When it has not been crawled, whether the first page and the second page are adjacent pages is determined according to the subdomain name in the URL of the first page and the hash value of the parent page of that URL. When the two pages are not adjacent, the first page is allocated to a new thread; when they are adjacent, the first page is allocated to the thread where the second page is located; download tasks are then executed on the allocated threads under timing control. Because the invention allocates the URL tasks of a distributed web crawler in this way, adjacent pages are assigned to the same thread, so that the adjacent pages handled from one IP address are downloaded serially, while non-adjacent pages are assigned to different threads and downloaded concurrently. Therefore, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses is effectively prevented from being identified as a malicious web crawler.
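
The two-element task-list entry described for step 101 might be represented as in the following Python sketch. The patent does not say what input the hash is computed over or which hash function is used; hashing the parent page's URL with SHA-256 is an assumption made here, and `Task`, `parent_hash_of`, and `parse_links` are hypothetical names.

```python
import hashlib
from typing import Iterable, List, NamedTuple


class Task(NamedTuple):
    url: str          # the parsed URL string
    parent_hash: str  # hash value identifying the URL's parent page


def parent_hash_of(parent_url: str) -> str:
    # Assumption: the parent page is identified by a SHA-256 digest of its URL.
    return hashlib.sha256(parent_url.encode("utf-8")).hexdigest()


def parse_links(parent_url: str, links: Iterable[str]) -> List[Task]:
    """Pair every link found on a parent page with that page's hash value."""
    digest = parent_hash_of(parent_url)
    return [Task(link, digest) for link in links]
```

All links parsed from one parent page then carry the same `parent_hash`, which is what the later adjacency check compares.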

Referring to FIG. 2, another embodiment of the task distribution method provided by the present invention includes:

201. Parse the uniform resource locator (URL) of a first page and the hash value of the parent page of the URL of the first page into a task list;

202. Determine, according to a URL list of crawled pages, whether the URL of the first page has been crawled, and if so, delete the URL of the first page from the task list;

It should be noted that the check is made against the URL list of crawled pages: if the URL of the first page has been crawled before, it is deleted from the task list.

203. When the URL of the first page has not been crawled, determine whether the first page and a second page are adjacent pages according to the subdomain name in the URL of the first page and the hash value of the parent page of the URL of the first page;

204. When it is determined that the first page and the second page are not adjacent pages, allocate the first page to a new thread;

205. When it is determined that the first page and the second page are adjacent pages, allocate the first page to the thread where the second page is located;

It should be noted that for the specific processes of steps 201 and 203 to 205, reference may be made to steps 101 and 103 to 105, respectively, in the embodiment shown in FIG. 1; details are not repeated here.

206. Execute download tasks on the allocated threads under timing control.

It should be noted that the crawler download threads execute the download tasks under timing control and parse the content of each downloaded page; each newly parsed URL and the hash value of its parent page are sent to the URL task list, and after deduplication a new URL task waiting queue is generated.

In this embodiment of the present invention, whether the URL of the first page has been crawled is determined according to the URL list of crawled pages, and if so, the URL of the first page is deleted from the task list, which improves the crawling efficiency of the web crawler.
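
The deduplication that feeds newly parsed URLs back into the waiting queue could be sketched as below. The original text only requires checking against the crawled-page URL list; also skipping URLs that are already waiting is an added assumption, and `enqueue_new_tasks` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple

Task = Tuple[str, str]  # (URL, hash value of its parent page)


def enqueue_new_tasks(parsed: Iterable[Task],
                      crawled_urls: Set[str],
                      waiting_queue: List[Task]) -> None:
    """Append parsed tasks to the waiting queue, skipping crawled or queued URLs."""
    queued = {url for url, _ in waiting_queue}
    for url, parent_hash in parsed:
        if url in crawled_urls or url in queued:
            continue  # already crawled or already waiting: drop the duplicate
        waiting_queue.append((url, parent_hash))
        queued.add(url)
```

Keeping the crawled URLs in a set makes each membership check constant time on average, which matters when the crawled list grows large.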

请参阅图3，本发明所提供的任务分发方法另一实施例包括：Referring to FIG. 3, another embodiment of the task distribution method provided by the present invention includes:

301、将第一页面的统一资源定位符URL和该第一页面的URL的父页面的哈希hash值解析到任务列表中；301. Parse the uniform resource locator URL of the first page and the hash value of the parent page of the URL of the first page into a task list;

302、根据已爬取页的URL列表判断该第一页面的URL是否已被爬取，若是，则将该第一页面的URL从该任务列表中删去；302. Determine whether the URL of the first page has been crawled according to the URL list of the crawled page, and if so, delete the URL of the first page from the task list;

303、当该第一页面的URL未被爬取时，判断该第一页面的URL的子域名与该第二页面的URL的子域名是否相同，以及判断该第一页面的URL的父页面的哈希hash值与该第二页面的URL的父页面的哈希hash值是否相同；303. When the URL of the first page is not crawled, determine whether the subdomain name of the URL of the first page is the same as the subdomain name of the URL of the second page, and determine whether the parent page of the URL of the first page is the same. Whether the hash value is the same as the hash value of the parent page of the URL of the second page;

需要说明的是，当所述第一页面的URL未被爬取时，判断该第一页面的URL的子域名与该第二页面的URL的子域名是否相同，以及判断该第一页面的URL的父页面的hash值与该第二页面的URL的父页面的哈希hash值是否相同；其中，该第二页面为与该第一页面不同的在运行线程中的页面，将该第一页面的URL的子域名和该第一页面的URL的父页面的hash值分别与所有正在运行线程中的其他页面进行比较，并判断是否相同。It should be noted that when the URL of the first page is not crawled, it is determined whether the subdomain name of the URL of the first page is the same as the subdomain name of the URL of the second page, and the URL of the first page is determined. Whether the hash value of the parent page of the second page is the same as the hash value of the parent page of the URL of the second page; wherein, the second page is a page in a running thread different from the first page, and the first page The subdomain name of the URL and the hash value of the parent page of the URL of the first page are respectively compared with other pages in all running threads, and it is judged whether they are the same.

304. When the subdomains are the same or the hash values are equal, determine that the first page and the second page are adjacent pages;

It should be noted that, if step 303 determines that the subdomain of the URL of the first page is the same as the subdomain of the URL of the second page, or that the hash value of the parent page of the URL of the first page is equal to the hash value of the parent page of the URL of the second page, the first page and the second page are determined to be adjacent pages.

305. When the subdomains are different and the hash values are not equal, determine that the first page and the second page are not adjacent pages;

It should be noted that, if step 303 determines that the subdomain of the URL of the first page is different from the subdomain of the URL of the second page, and that the hash value of the parent page of the URL of the first page is not equal to the hash value of the parent page of the URL of the second page, the first page and the second page are determined not to be adjacent pages.

306. When it is determined that the first page and the second page are not adjacent pages, allocate the first page to a new thread;

307. When it is determined that the first page and the second page are adjacent pages, allocate the first page to the thread where the second page is located;

308. Execute the download tasks on the allocated threads under timing control.

It should be noted that, for the specific processes of steps 301, 302, and 306 to 308, reference may be made to steps 201, 202, and 204 to 206, respectively, in the embodiment shown in FIG. 2; details are not repeated here.

In this embodiment of the present invention, when the URL of the first page has not been crawled, whether the first page and the second page are adjacent pages is determined by checking whether the subdomain of the URL of the first page is the same as the subdomain of the URL of the second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page. When the subdomains are the same or the hash values are equal, the first page and the second page are determined to be adjacent pages; when the subdomains are different and the hash values are not equal, the first page and the second page are determined not to be adjacent pages. The URL tasks of the distributed web crawler are thereby allocated reasonably, which improves the crawling efficiency of the web crawler.
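
The adjacency rule of steps 303 to 305 can be illustrated with a minimal Python sketch. The names `Page`, `subdomain`, and `is_adjacent` are hypothetical, and the sketch assumes that the "subdomain" is the full host name of the URL and that the parent-page hash value is already available as a string; the text does not fix either choice.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Page:
    url: str
    parent_hash: str  # hash value of the parent page (how it is computed is an assumption left open)


def subdomain(url: str) -> str:
    # Assumption: treat the full host name of the URL as its "subdomain".
    return (urlsplit(url).hostname or "").lower()


def is_adjacent(first: Page, second: Page) -> bool:
    # Steps 304/305: adjacent if the subdomains match OR the parent hashes match;
    # not adjacent only when both differ.
    same_subdomain = subdomain(first.url) == subdomain(second.url)
    same_parent = first.parent_hash == second.parent_hash
    return same_subdomain or same_parent
```

In this sketch, two pages on different hosts still count as adjacent when they were discovered from the same parent page, which matches the "or" condition of step 304.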

Referring to FIG. 4, another embodiment of the task distribution method provided by the present invention includes:

401. Parse the uniform resource locator (URL) of a first page and the hash value of the parent page of the URL of the first page into a task list;

402. Determine, according to the URL list of crawled pages, whether the URL of the first page has already been crawled, and if so, delete the URL of the first page from the task list;

403. When the URL of the first page has not been crawled, determine whether the subdomain of the URL of the first page is the same as the subdomain of the URL of a second page, and determine whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;

404. When the subdomains are the same or the hash values are equal, determine that the first page and the second page are adjacent pages;

405. When the subdomains are different and the hash values are not equal, determine that the first page and the second page are not adjacent pages;

406. When it is determined that the first page and the second page are not adjacent pages, allocate the first page to a new thread, and write the thread number of the new thread into the thread-number field of the URL of the first page; delete the URL of the first page from the task list and write it into the URL list of crawled pages; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

407. When it is determined that the first page and the second page are adjacent pages, allocate the first page to the thread where the second page is located, and write the thread number of that thread into the thread-number field of the URL of the first page; delete the URL of the first page from the task list and write it into the URL list of crawled pages; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

408. Execute the download tasks on the allocated threads under timing control.

It should be noted that, for the specific processes of steps 401 to 405 and 408, reference may be made to steps 301 to 305 and 308, respectively, in the embodiment shown in FIG. 3; details are not repeated here.

In this embodiment of the present invention, when it is determined that the first page and the second page are not adjacent pages, the first page is allocated to a new thread, and the thread number of the new thread is written into the thread-number field of the URL of the first page. When it is determined that the first page and the second page are adjacent pages, the first page is allocated to the thread where the second page is located, and the thread number of that thread is written into the thread-number field of the URL of the first page. In both cases, the URL of the first page is deleted from the task list and written into the URL list of crawled pages. Adjacent web pages are thus allocated to the same thread, which ensures that the adjacent pages assigned to the same IP address are downloaded serially, while non-adjacent pages are allocated to different threads and downloaded concurrently. Therefore, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses is effectively prevented from being identified as a malicious web crawler.
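
The bookkeeping of steps 406 and 407 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `Task`, `Dispatcher`, and `dispatch` are hypothetical names, a "thread" is modelled simply as a numbered list of tasks, and the subdomain is taken to be the host name of the URL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set
from urllib.parse import urlsplit


@dataclass
class Task:
    url: str
    parent_hash: str
    thread_no: Optional[int] = None  # the "thread-number field" of the URL


@dataclass
class Dispatcher:
    task_list: List[Task] = field(default_factory=list)
    crawled: Set[str] = field(default_factory=set)
    threads: Dict[int, List[Task]] = field(default_factory=dict)  # thread number -> tasks held

    @staticmethod
    def _adjacent(a: Task, b: Task) -> bool:
        # Assumption: host name stands in for the subdomain.
        same_host = urlsplit(a.url).hostname == urlsplit(b.url).hostname
        return same_host or a.parent_hash == b.parent_hash

    def dispatch(self, task: Task) -> int:
        # Step 407: reuse the thread that already holds an adjacent page.
        for number, held in self.threads.items():
            if any(self._adjacent(task, other) for other in held):
                task.thread_no = number
                break
        else:
            # Step 406: no adjacent page found, so open a new thread.
            task.thread_no = len(self.threads)
            self.threads[task.thread_no] = []
        # The thread keeps the task (URL name and parent hash).
        self.threads[task.thread_no].append(task)
        # Move the URL from the task list to the crawled list.
        if task in self.task_list:
            self.task_list.remove(task)
        self.crawled.add(task.url)
        return task.thread_no
```

With this sketch, two tasks on the same host receive the same thread number, while a task on an unrelated host with a different parent hash receives a new one.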

For ease of understanding, the task distribution method in the embodiments of the present invention is described in detail below using a specific application scenario:

A request to obtain a first page is sent to the target web server, and the uniform resource locator (URL) of the first page and the hash value of the parent page of the URL of the first page are parsed into a task list;

The URL of the first page in the task list is deduplicated. Specifically, whether the URL of the first page in the task list has already been crawled is determined according to the URL list of crawled pages; if the URL of the first page has been crawled before, it is deleted from the task list;

When the URL of the first page has not been crawled, the first page is allocated according to the following conditions: whether the subdomain of the URL of the first page is the same as the subdomain of the URL of a second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page. This can be understood as comparing the subdomain of the URL of the first page and the hash value of the parent page of the URL of the first page with those of the other pages in all running threads, and determining whether they are the same;

The result of the above determination is handled as follows. If the subdomain of the URL of the first page is the same as the subdomain of the URL of the second page, or the hash value of the parent page of the URL of the first page is equal to the hash value of the parent page of the URL of the second page, the first page and the second page are determined to be adjacent pages. In this case, the first page is allocated to the thread where the second page is located, and the thread number of that thread is written into the thread-number field of the URL of the first page; the URL of the first page is deleted from the task list and written into the URL list of crawled pages; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

If the subdomain of the URL of the first page is different from the subdomain of the URL of the second page, and the hash value of the parent page of the URL of the first page is not equal to the hash value of the parent page of the URL of the second page, the first page and the second page are determined not to be adjacent pages. In this case, the first page is allocated to a new thread, and the thread number of the new thread is written into the thread-number field of the URL of the first page; the URL of the first page is deleted from the task list and written into the URL list of crawled pages; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

After the adjacent and non-adjacent pages have been allocated as described above, the download threads of the web crawler execute the download tasks under timing control and parse the content of the downloaded pages. Each newly parsed URL, together with the hash value of its parent page, is sent to the URL task list, and after deduplication a new waiting queue of URL tasks is generated.
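
The download stage of this scenario can be sketched in Python as shown below. The text does not define the timing control in detail, so this sketch assumes it takes the form of a fixed pause between consecutive downloads within one thread; `run_downloads`, `fetch`, and `delay_s` are hypothetical names, and `fetch` is supplied by the caller so that the sketch itself makes no network calls.

```python
import threading
import time
from typing import Callable, Dict, List


def run_downloads(
    assignments: Dict[int, List[str]],
    fetch: Callable[[str], None],
    delay_s: float = 0.0,
) -> None:
    """Download each thread's URLs serially; run different threads concurrently."""

    def worker(urls: List[str]) -> None:
        for index, url in enumerate(urls):
            if index > 0:
                # Assumed form of timing control: pause between adjacent pages.
                time.sleep(delay_s)
            fetch(url)

    workers = [
        threading.Thread(target=worker, args=(urls,))
        for urls in assignments.values()
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Adjacent pages held by the same thread are requested one after another in their allocated order, while pages held by different threads may be requested at the same time.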

The task distribution method has been described above. The following description is given from the perspective of a task distribution device. The task distribution device may specifically be integrated in a chip, and the chip may be installed in a terminal. Referring to FIG. 5, the device includes:

a parsing unit 501, configured to parse the uniform resource locator (URL) of a first page and the hash value of the parent page of the URL of the first page into a task list;

a judging unit 502, configured to determine whether the URL of the first page has been crawled;

a determining unit 503, configured to determine, when the URL of the first page has not been crawled, whether the first page and a second page are adjacent pages according to the subdomain in the URL of the first page and the hash value of the parent page of the URL of the first page;

a first allocation unit 504, configured to allocate the first page to a new thread after it is determined that the first page and the second page are not adjacent pages;

a second allocation unit 505, configured to allocate the first page to the thread where the second page is located after it is determined that the first page and the second page are adjacent pages;

an execution unit 506, configured to execute the download tasks on the allocated threads under timing control.

In this embodiment of the present invention, the parsing unit 501 parses the uniform resource locator (URL) of the first page and the hash value of the parent page of the URL of the first page into the task list; the judging unit 502 determines whether the URL of the first page has been crawled; when the URL of the first page has not been crawled, the determining unit 503 determines whether the first page and the second page are adjacent pages according to the subdomain in the URL of the first page and the hash value of the parent page of the URL of the first page; when it is determined that the first page and the second page are not adjacent pages, the first allocation unit 504 allocates the first page to a new thread; when it is determined that the first page and the second page are adjacent pages, the second allocation unit 505 allocates the first page to the thread where the second page is located; and the execution unit 506 executes the download tasks on the allocated threads under timing control. Because the present invention allocates the URL tasks of the distributed web crawler reasonably, adjacent web pages are allocated to the same thread, which ensures that the adjacent pages assigned to the same IP address are downloaded serially, while non-adjacent pages are allocated to different threads and downloaded concurrently. Therefore, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses is effectively prevented from being identified as a malicious web crawler.
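
The role of the parsing unit 501 can be illustrated with a short Python sketch that turns a downloaded parent page into task-list entries. The text does not say how the parent-page hash value is computed, so the sketch assumes a SHA-256 digest of the parent URL; `parse_into_tasks` and `_LinkCollector` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_into_tasks(parent_url: str, html: str) -> List[Tuple[str, str]]:
    # Assumption: the parent-page hash is a SHA-256 digest of the parent URL.
    parent_hash = hashlib.sha256(parent_url.encode("utf-8")).hexdigest()
    collector = _LinkCollector()
    collector.feed(html)
    # Each task-list entry pairs an absolute child URL with its parent hash.
    return [(urljoin(parent_url, href), parent_hash) for href in collector.hrefs]
```

Every URL extracted from the same parent page carries the same parent hash, which is what later allows such pages to be recognised as adjacent.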

Based on the task distribution device in the above embodiment, optionally, the judging unit 502 is specifically configured to determine, according to the URL list of crawled pages, whether the URL of the first page has been crawled, and if so, to delete the URL of the first page from the task list, so as to improve the crawling efficiency of the web crawler.

Based on the task distribution device in the above embodiment, optionally, as shown in FIG. 6, the determining unit 503 specifically includes:

a judging module 601, configured to determine, when the URL of the first page has not been crawled, whether the subdomain of the URL of the first page is the same as the subdomain of the URL of the second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;

a first determining module 602, configured to determine that the first page and the second page are adjacent pages when the subdomains are the same or the hash values are equal;

a second determining module 603, configured to determine that the first page and the second page are not adjacent pages when the subdomains are different and the hash values are not equal.
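
The deduplication performed by the judging unit 502 can be sketched as follows. This is a minimal illustration; `remove_crawled` is a hypothetical name, and each task is assumed to be a `(url, parent_hash)` tuple.

```python
from typing import List, Set, Tuple


def remove_crawled(task_list: List[Tuple[str, str]], crawled_urls: Set[str]) -> None:
    # Delete, in place, every task whose URL already appears in the crawled list.
    task_list[:] = [task for task in task_list if task[0] not in crawled_urls]
```

Keeping the crawled URLs in a set makes each membership check a constant-time lookup on average, which matters when the crawled list grows large.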

In this embodiment of the present invention, when the URL of the first page has not been crawled, the judging module 601 determines whether the subdomain of the URL of the first page is the same as the subdomain of the URL of the second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page, in order to determine whether the first page and the second page are adjacent pages; the first determining module 602 determines that the first page and the second page are adjacent pages when the subdomains are the same or the hash values are equal; the second determining module 603 determines that the first page and the second page are not adjacent pages when the subdomains are different and the hash values are not equal. The URL tasks of the distributed web crawler are thereby allocated reasonably, which improves the crawling efficiency of the web crawler.

Based on the task distribution device in the above embodiment, optionally, as shown in FIG. 7, the first allocation unit 504 specifically includes:

a first distribution module 701, configured to allocate the first page to a new thread and write the thread number of the new thread into the thread-number field of the URL of the first page; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

a first delete-and-write module 702, configured to delete the URL of the first page from the task list and write it into the URL list of crawled pages.

In this embodiment of the present invention, when it is determined that the first page and the second page are not adjacent pages, the first distribution module 701 allocates the first page to a new thread and writes the thread number of the new thread into the thread-number field of the URL of the first page; the first delete-and-write module 702 deletes the URL of the first page from the task list and writes it into the URL list of crawled pages. Non-adjacent pages are thus allocated to different threads and downloaded concurrently. Therefore, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses is effectively prevented from being identified as a malicious web crawler.

Based on the task distribution device in the above embodiment, optionally, as shown in FIG. 8, the second allocation unit 505 specifically includes:

a second distribution module 801, configured to allocate the first page to the thread where the second page is located and write the thread number of that thread into the thread-number field of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

a second delete-and-write module 802, configured to delete the URL of the first page from the task list and write it into the URL list of crawled pages.

In this embodiment of the present invention, when it is determined that the first page and the second page are adjacent pages, the second distribution module 801 allocates the first page to the thread where the second page is located and writes the thread number of that thread into the thread-number field of the URL of the first page; the second delete-and-write module 802 deletes the URL of the first page from the task list and writes it into the URL list of crawled pages. Adjacent web pages are thus allocated to the same thread, which ensures that the adjacent pages assigned to the same IP address are downloaded serially. Therefore, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses is effectively prevented from being identified as a malicious web crawler.

The embodiments shown in FIG. 5 to FIG. 8 describe the specific structure of the task distribution device from the perspective of functional units. The specific structure of the task distribution device is described below from the hardware perspective with reference to the embodiment shown in FIG. 9:

As shown in FIG. 9, the task distribution device includes a receiver 901, a transmitter 902, a processor 903, and a memory 904.

The task distribution device in this embodiment of the present invention may have more or fewer components than those shown in FIG. 9, may combine two or more components, or may have a different configuration or arrangement of components. Each component may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.

The processor 903 is configured to read the instructions stored in the memory 904 to perform the following operations:

parsing the uniform resource locator (URL) of a first page and the hash value of the parent page of the URL of the first page into a task list;

determining whether the URL of the first page has been crawled;

when the URL of the first page has not been crawled, determining whether the first page and a second page are adjacent pages according to the subdomain in the URL of the first page and the hash value of the parent page of the URL of the first page;

when it is determined that the first page and the second page are not adjacent pages, allocating the first page to a new thread;

when it is determined that the first page and the second page are adjacent pages, allocating the first page to the thread where the second page is located;

The processor 903 is specifically configured to perform the following operations:

determining, according to the URL list of crawled pages, whether the URL of the first page has been crawled, and if so, deleting the URL of the first page from the task list.

determining whether the subdomain of the URL of the first page is the same as the subdomain of the URL of the second page, and determining whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;

when the subdomains are the same or the hash values are equal, determining that the first page and the second page are adjacent pages;

when the subdomains are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages.

allocating the first page to a new thread, and writing the thread number of the new thread into the thread-number field of the URL of the first page; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

deleting the URL of the first page from the task list and writing it into the URL list of crawled pages.

allocating the first page to the thread where the second page is located, and writing the thread number of that thread into the thread-number field of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

In this embodiment of the present invention, the processor 903 parses the uniform resource locator (URL) of the first page and the hash value of the parent page of the URL of the first page into the task list, deduplicates the URL of the first page, and determines whether the first page and the second page are adjacent pages according to the subdomain in the URL of the first page and the hash value of the parent page of the URL of the first page. When it is determined that the first page and the second page are not adjacent pages, the first page is allocated to a new thread and the download task is executed under timing control; when it is determined that the first page and the second page are adjacent pages, the first page is allocated to the thread where the second page is located and the download task is executed under timing control. Because the present invention allocates the URL tasks of the distributed web crawler reasonably, adjacent web pages are allocated to the same thread, which ensures that the adjacent pages assigned to the same IP address are downloaded serially, while non-adjacent pages are allocated to different threads and downloaded concurrently. Therefore, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses is effectively prevented from being identified as a malicious web crawler.

所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统，装置和单元的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

在本申请所提供的几个实施例中，应该理解到，所揭露的系统，装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or Can be integrated into another system, or some features can be ignored, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

另外，在本发明各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-OnlyMemory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention is essentially or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes: U disk, removable hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A task distribution method, characterized by comprising:

parsing a URL of a first page and a hash value of a parent page of the URL of the first page into a task list;

determining whether the URL of the first page has been crawled;

when the URL of the first page has not been crawled, determining, according to a subdomain name in the URL of the first page and the hash value of the parent page of the URL of the first page, whether the first page and a second page are adjacent pages, wherein the first page and the second page are pages allocated at the same IP address;

when it is determined that the first page and the second page are not adjacent pages, assigning the first page to a new thread;

when it is determined that the first page and the second page are adjacent pages, allocating the first page to the thread where the second page is located; and

controlling execution of the download tasks on the allocated threads according to a time sequence.
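The allocation steps of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names `host` and `distribute`, the use of the full host name as the "subdomain name", and the modelling of each thread as a plain queue are assumptions made for illustration, and the timing-controlled download step is left out.

```python
from urllib.parse import urlsplit


def host(url):
    # Assumption: the full host name stands in for the "subdomain name".
    return urlsplit(url).hostname or ""


def distribute(task_list, crawled):
    """Assign each uncrawled (url, parent_hash) task to a thread queue.

    Returns {thread_no: [url, ...]}; `crawled` is updated in place.
    """
    threads = {}  # thread number -> list of (url, parent_hash)
    for url, parent_hash in task_list:
        if url in crawled:
            continue  # already crawled: skip this task
        target = None
        for thread_no, pages in threads.items():
            # Adjacent if the subdomain or the parent-page hash matches
            # any page already placed on this thread.
            if any(host(url) == host(u) or parent_hash == h for u, h in pages):
                target = thread_no
                break
        if target is None:
            # No adjacent page found: open a new thread queue.
            target = len(threads)
            threads[target] = []
        threads[target].append((url, parent_hash))
        crawled.add(url)
    return {n: [u for u, _ in pages] for n, pages in threads.items()}
```

In this sketch, pages that share a subdomain or a parent page end up on the same queue, so requests to them can be issued one after another rather than concurrently.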

2. The task distribution method according to claim 1, wherein determining whether the URL of the first page has been crawled specifically comprises:

determining, according to a URL list of crawled pages, whether the URL of the first page has been crawled, and if so, deleting the URL of the first page from the task list.
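As a hedged illustration of this deduplication step, the following Python sketch removes already-crawled URLs from a task list. The function name `filter_uncrawled` and the representation of both lists as plain Python containers are assumptions, not part of the claim.

```python
def filter_uncrawled(task_list, crawled_urls):
    """Delete already-crawled URLs from task_list in place.

    Returns the list of URLs that were deleted.
    """
    removed = [url for url in task_list if url in crawled_urls]
    # Slice assignment keeps the caller's list object and drops crawled URLs.
    task_list[:] = [url for url in task_list if url not in crawled_urls]
    return removed
```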

3. The task distribution method according to claim 1, wherein determining whether the first page and the second page are adjacent pages specifically comprises:

determining whether the subdomain name of the URL of the first page is the same as the subdomain name of the URL of the second page, and determining whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;

when the subdomain names are the same or the hash values are the same, determining that the first page and the second page are adjacent pages; and

when the subdomain names are not the same and the hash values are not the same, determining that the first page and the second page are not adjacent pages.
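The adjacency test of claim 3 reduces to a logical OR of two equality checks. The Python sketch below shows one possible reading; the `Page` record, the `subdomain` helper, and the choice to treat the full host name as the subdomain name are assumptions, since the claim does not say how the subdomain name is extracted.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Page:
    url: str
    parent_hash: str  # hash value of the parent page of this URL


def subdomain(url: str) -> str:
    # Assumption: the full host name is used as the "subdomain name".
    return urlsplit(url).hostname or ""


def are_adjacent(first: Page, second: Page) -> bool:
    same_subdomain = subdomain(first.url) == subdomain(second.url)
    same_parent = first.parent_hash == second.parent_hash
    # Adjacent when either condition holds; not adjacent only when both fail.
    return same_subdomain or same_parent
```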

4. The task distribution method according to any one of claims 1 to 3, wherein allocating the first page to a new thread specifically comprises:

allocating the first page to a new thread, and writing the thread number corresponding to the new thread into the thread number item of the URL of the first page, wherein the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page; and

deleting the URL of the first page from the task list and writing it into the URL list of crawled pages.
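The bookkeeping in claim 4 can be sketched as follows. The `Task` and `Worker` records, the function `assign_to_new_thread`, and the choice of the next list index as the thread number are hypothetical names and conventions introduced for illustration only; no real threads are started here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    url: str
    parent_hash: str
    thread_no: Optional[int] = None  # the "thread number item" of the URL


@dataclass
class Worker:
    thread_no: int
    # (parent_hash, url) pairs saved by the thread for later adjacency checks
    saved: list = field(default_factory=list)


def assign_to_new_thread(task, workers, task_list, crawled_urls):
    # Assumption: the new thread number is the next free index.
    worker = Worker(thread_no=len(workers))
    workers.append(worker)
    task.thread_no = worker.thread_no                   # write the thread number item
    worker.saved.append((task.parent_hash, task.url))   # thread saves hash and URL
    task_list.remove(task)                              # delete from the task list
    crawled_urls.append(task.url)                       # record in the crawled-page list
    return worker
```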

5. The task distribution method according to any one of claims 1 to 3, wherein allocating the first page to the thread where the second page is located specifically comprises:

allocating the first page to the thread where the second page is located, and writing the thread number corresponding to the thread where the second page is located into the thread number item of the URL of the first page, wherein the thread where the second page is located saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;

6. A task distribution device, comprising:

a parsing unit, configured to parse a URL of a first page and a hash value of a parent page of the URL of the first page into a task list;

a judging unit, configured to judge whether the URL of the first page has been crawled;

a determining unit, configured to determine, when the URL of the first page has not been crawled, whether the first page and a second page are adjacent pages according to a subdomain name in the URL of the first page and the hash value of the parent page of the URL of the first page, wherein the first page and the second page are pages allocated at the same IP address;

a first allocation unit, configured to allocate the first page to a new thread after determining that the first page and the second page are not adjacent pages;

a second allocation unit, configured to allocate the first page to the thread where the second page is located after determining that the first page and the second page are adjacent pages;

an execution unit, configured to control execution of the download tasks on the allocated threads according to a time sequence.

7. The task distribution device according to claim 6, wherein the judging unit is specifically configured to judge, according to a URL list of crawled pages, whether the URL of the first page has been crawled, and if so, to delete the URL of the first page from the task list.

8. The task distribution device according to claim 6, wherein the determining unit specifically comprises:

a judging module, configured to judge, when the URL of the first page has not been crawled, whether the subdomain name of the URL of the first page is the same as the subdomain name of the URL of the second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;

a first determining module, configured to determine that the first page and the second page are adjacent pages when the subdomain names are the same or the hash values are the same; and

a second determining module, configured to determine that the first page and the second page are not adjacent pages when the subdomain names are not the same and the hash values are not the same.

9. The task distribution device according to any one of claims 6 to 8, wherein the first allocation unit specifically comprises:

a first allocation module, configured to allocate the first page to a new thread and write the thread number corresponding to the new thread into the thread number item of the URL of the first page, wherein the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page; and

a first deletion and writing module, configured to delete the URL of the first page from the task list and write it into the URL list of crawled pages.

10. The task distribution device according to any one of claims 6 to 8, wherein the second allocation unit specifically comprises:

a second allocation module, configured to allocate the first page to the thread where the second page is located and write the thread number corresponding to the thread where the second page is located into the thread number item of the URL of the first page, wherein the thread where the second page is located saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page; and

a second deletion and writing module, configured to delete the URL of the first page from the task list and write it into the URL list of crawled pages.