CN104516956B - Method for incremental crawling of website information

Info

Publication number: CN104516956B
Application number: CN201410783643.8A
Authority: CN
Inventors: 刘学; 脱立恒; 董微; 刘照邻
Original assignee: Institute of Acoustics CAS; Shanghai 3Ntv Network Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Shanghai 3Ntv Network Technology Co Ltd
Priority date: 2014-12-16
Filing date: 2014-12-16
Publication date: 2017-12-01
Anticipated expiration: 2034-12-16
Also published as: CN104516956A

Abstract

The invention discloses a method for incremental crawling of website information. The method comprises: crawling data of a set length in the order in which the website presents it, and placing the data into a data queue in that same order, a comparison window being provided at the end of the data queue; checking the duplication rate between the data in the comparison window and previously crawled data; stopping the crawl when the duplication rate reaches a preset value; and otherwise repeating the above process until the duplication rate between the data in the comparison window and previously crawled data reaches the preset value, at which point the crawl stops. For incremental crawling of website information that is not strictly sorted by time, the invention reduces crawl overhead while keeping the miss rate within an allowable range. During operation, the set crawl length and the data queue length can be adjusted dynamically, which improves the efficiency of the algorithm and accommodates different miss-rate and crawl-overhead requirements.

Description

A Method of Incremental Crawling of Website Information

Technical Field

The invention relates to the technical field of network information crawling, and in particular to a method for incremental crawling of website information that is suitable for websites whose information is not strictly sorted by time.

Background

The global Internet has expanded rapidly since it entered commercial use in the 1990s and has become an important information infrastructure driving economic development and social progress. Although Internet development in China started later than it did internationally, it has been equally rapid since the beginning of the new century. On the service side, the Internet has penetrated every field, with especially fast growth in information search, communication, business transactions, and mobile wireless applications. On the user side, as of December 2013 China had 618 million Internet users, making it the country with the most Internet users in the world.

As Internet technology and services have developed, the volume of information on the Internet has become enormous. Network aggregation services emerged to help users obtain content of interest from this mass of information more conveniently. Network aggregation means providing users with useful, more targeted information on the basis of manual or machine selection, analysis, and classification of the massive information and resources on the Internet (such as blogs, forums, film and television, music, supply and demand information, and files).

Network aggregation must first solve the problem of obtaining information from target websites. On one class of websites, the information is not strictly sorted by time. When data is crawled incrementally from such websites, it is difficult to tell which information has already been crawled and which is new, and verifying every newly crawled item one by one against previously crawled data incurs a large crawl overhead. Analysis of such websites shows that the information nearer the front of the listing, that is, popular or newer information, is not strictly sorted by time, whereas the information further back, that is, less popular or older information, largely follows chronological order. Take a video content website as an example: the first page of each column usually contains both newly launched metadata and metadata recommended by an operator or by the system, so the page as a whole is not in chronological order, while the metadata on the later pages of the column is largely in chronological order. Incremental crawling of such websites therefore calls for a method that minimizes crawl overhead while keeping the miss rate within an allowable range.

Summary of the Invention

The object of the present invention is to solve the technical problem in the prior art of how to reduce crawl overhead, within an allowable miss rate, when incrementally crawling website information that is not strictly sorted by time, and thereby to provide a method for incremental crawling of website information.

To achieve the above object, the present invention provides a method for incremental crawling of website information. The method comprises: (a) crawling data of a set length in the presentation order of the website data, starting from the first item presented on the target website; (b) placing the crawled data of the set length into a data queue in the presentation order of the website data, a comparison window being provided at the end of the data queue; (c) calculating the duplication rate between the data of the set length in the comparison window and previously crawled data; and (d) stopping the crawl or performing the next crawl according to the calculated duplication rate; that is, when the duplication rate reaches a preset value, the crawl is stopped, and when the duplication rate does not reach the preset value, the next crawl is performed and steps (b) to (d) are then executed.

Preferably, when the duplication rate reaches the preset value and the preset value is less than 1, the data in the data queue that does not duplicate previously crawled data is saved in the database, and the entire crawling process is then stopped.

Preferably, when the duplication rate reaches the preset value and the preset value is 1, the entire crawling process is stopped.

Preferably, performing the next crawl specifically comprises: when the duplication rate does not reach the preset value, saving the data in the data queue that does not duplicate previously crawled data in the database and clearing the data queue; and then, in the presentation order of the website data, continuing to crawl data of the set length from the position where the previous crawl ended.

Preferably, the set crawl length is less than or equal to the data queue length.

Preferably, when the duplication rate does not reach the preset value, the set crawl length and the data queue length are adjusted dynamically in the course of performing the next crawl.

For incremental crawling of website information that is not strictly sorted by time, the present invention reduces crawl overhead while keeping the miss rate within an allowable range. During operation, the set crawl length and the data queue length can be adjusted dynamically, which improves the efficiency of the algorithm and accommodates different miss-rate and crawl-overhead requirements.

Brief Description of the Drawings

Fig. 1 is a flowchart of a method for incremental crawling of website information according to an embodiment of the present invention;

Fig. 1A is a schematic diagram of how the crawled data is processed when the duplication rate in Fig. 1 reaches the preset value;

Fig. 1B is a schematic diagram of performing the next crawl when the duplication rate in Fig. 1 does not reach the preset value;

Fig. 2 is a schematic diagram of adjusting the set crawl length and the data queue during incremental crawling of website information according to an embodiment of the present invention.

Detailed Description

To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features, and advantages of the embodiments clearer, the technical solutions of the present invention are described in further detail below with reference to the drawings and embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.

Fig. 1 is a flowchart of the method for incremental crawling of website information of the present invention.

As shown in Fig. 1, the method for incremental crawling of website information according to the embodiment of the present invention comprises the following steps:

In step 101, data of a set length is crawled in the presentation order of the website data, starting from the first item presented on the target website. For example, if a video website has 100 items of information, the set crawl length may be 10 items.

In step 102, the crawled data of the set length, namely items 1 to 10 of the website information from step 101, is placed into a data queue in the presentation order of the website data, a comparison window being provided at the end of the data queue. Preferably, the set crawl length is less than or equal to the data queue length.
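
As an illustration only, the data queue and comparison window of step 102 might be sketched in Python as follows. The disclosure does not fix the window size or the data structure; the function name `comparison_window`, the parameter `window_size`, the queue length of 10, and the item identifiers are all assumptions made for this sketch.

```python
from collections import deque


def comparison_window(queue, window_size):
    """Return the last `window_size` items of the queue, in presentation order.

    Hypothetical helper: the "comparison window" is taken here to be the tail
    of the data queue, which is one reading of "provided at the end of the
    data queue".
    """
    if window_size <= 0:
        return []
    return list(queue)[-window_size:]


# Assumed queue length of 10; the set crawl length (10 items) does not exceed it.
queue = deque(maxlen=10)
queue.extend(f"item-{i}" for i in range(1, 11))  # items 1 to 10 in site order

window = comparison_window(queue, 4)  # the four items nearest the queue's end
```

Because the set crawl length is kept at or below the queue length, a batch never overflows the queue in this sketch, provided the queue is cleared between batches.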

In step 103, the duplication rate between the data of the set length in the comparison window, namely items 1 to 10 from step 101, and previously crawled data is calculated. Alternatively, several of items 1 to 10 may be sampled and the duplication rate calculated on the sample. Here, if items 1 to 10 are taken to be the first data crawled today, the previously crawled data may be the information crawled from the same video website yesterday.
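
One possible way to compute the duplication rate of step 103, including the optional sampling, is sketched below. It assumes that each item can be reduced to a hashable identifier and that previously crawled identifiers are held in a set; the names `duplication_rate`, `seen_ids`, and `sample_size` are hypothetical, and treating the rate as a simple fraction is an assumption consistent with a preset value of at most 1.

```python
import random


def duplication_rate(window_items, seen_ids, sample_size=None, rng=None):
    """Fraction of window items whose identifier was already crawled.

    If `sample_size` is given and smaller than the window, only a random
    sample of the window is checked (the "sampling" variant of step 103).
    """
    items = list(window_items)
    if not items:
        return 0.0
    if sample_size is not None and 0 < sample_size < len(items):
        rng = rng or random.Random()
        items = rng.sample(items, sample_size)
    duplicates = sum(1 for item in items if item in seen_ids)
    return duplicates / len(items)


# Yesterday's crawl (assumed identifiers) versus today's first batch.
yesterday = {"v3", "v4", "v5", "v6"}
today_window = ["v1", "v2", "v3", "v4"]
rate = duplication_rate(today_window, yesterday)  # 2 of 4 items already seen
```

Sampling lowers the number of lookups per batch at the cost of a noisier estimate, which is one of the ways the trade-off between miss rate and crawl overhead can be tuned.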

In step 104, the crawl is stopped or the next crawl is performed according to the calculated duplication rate. The decision rule is as follows: when the duplication rate reaches the preset value, the crawl is stopped (see step 104a); when the duplication rate does not reach the preset value, the next crawl is performed (see step 104b), and steps 102 to 104 are then executed.

Fig. 1A is a schematic diagram of how the crawled data is processed when the duplication rate in Fig. 1 reaches the preset value.

As shown in Fig. 1A, when the duplication rate calculated in step 104 of Fig. 1 reaches the preset value, step 105 is executed (that is, it is determined whether the preset value is less than 1). If the preset value is less than 1, step 106 is executed (that is, the data in the data queue that does not duplicate previously crawled data is saved in the database), and then step 104a of Fig. 1 is executed (that is, the entire crawling process is stopped). If the preset value is 1, step 104a of Fig. 1 is executed directly (that is, the entire crawling process is stopped).

Fig. 1B is a schematic diagram of performing the next crawl when the duplication rate in Fig. 1 does not reach the preset value.

As shown in Fig. 1B, when the duplication rate calculated in step 104 of Fig. 1 does not reach the preset value, step 104b is executed (that is, the next crawl is performed). Performing the next crawl specifically comprises: when the duplication rate does not reach the preset value, executing step 107 (that is, saving the data in the data queue that does not duplicate previously crawled data in the database and clearing the data queue); and then executing step 108 (that is, in the presentation order of the website data, continuing to crawl data of the set length from the position where the previous crawl ended, the set length being the same as that set in step 101 of Fig. 1). Step 102 of Fig. 1 is then executed, and the loop repeats until the duplication rate between the data in the comparison window and previously crawled data reaches the preset value, whereupon the crawl stops.
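
The loop formed by steps 101 to 108 can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `fetch(offset, count)` is a hypothetical callable standing in for the actual page requests, items are represented by identifiers, the database is a plain list, and the parameter names and default values are illustrative.

```python
def incremental_crawl(fetch, seen_ids, batch_len=10, window_size=10, threshold=0.8):
    """Crawl new items from a listing that is only roughly time-ordered.

    fetch(offset, count): hypothetical callable returning up to `count` item
    identifiers starting at position `offset` in presentation order.
    seen_ids: set of identifiers saved by earlier crawls (not modified here).
    Returns the identifiers that would be saved to the database.
    """
    if batch_len < 1 or window_size < 1:
        raise ValueError("batch_len and window_size must be at least 1")
    saved = []
    offset = 0
    while True:
        queue = fetch(offset, batch_len)      # steps 101/108, then 102
        if not queue:                         # listing exhausted (assumed stop)
            break
        window = queue[-window_size:]         # comparison window at queue end
        rate = sum(1 for x in window if x in seen_ids) / len(window)  # step 103
        reached = rate >= threshold           # step 104
        if not (reached and threshold >= 1):
            # Step 106 (stop branch, preset < 1) or step 107 (continue branch):
            # keep only the items that were not crawled before.
            saved.extend(x for x in queue if x not in seen_ids)
        if reached:
            break                             # step 104a: stop the whole crawl
        offset += len(queue)                  # step 108: resume where we ended
    return saved


# New items n1..n3 mixed with a recommended old item r7, then old items in order.
listing = ["n1", "r7", "n2", "n3", "o9", "o8", "o7", "o6", "o5", "o4"]
seen = {"r7", "o9", "o8", "o7", "o6", "o5", "o4"}
calls = []


def fetch(offset, count):
    calls.append(offset)
    return listing[offset:offset + count]


new_items = incremental_crawl(fetch, seen, batch_len=4, window_size=4, threshold=0.75)
```

In this example the first batch has a duplication rate of 0.25, so its three new items are saved and the crawl continues; the second batch is entirely old, the rate reaches the preset value, and the crawl stops without requesting the last two items. The caller would then merge `new_items` into `seen` before the next scheduled crawl.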

As shown in Fig. 2, the set crawl length described in Fig. 1 can be adjusted dynamically. When the duplication rate of the data of the set length obtained in the next crawl does not reach the preset value, step 201 is further executed (that is, it is determined whether the duplication rate of the next crawl is greater than that of the previous crawl). If so, step 202 is executed (that is, the currently set crawl length is adjusted; for example, the crawl length may be shortened dynamically to reduce crawl overhead), and then step 204 is executed (that is, the subsequent crawl is performed with the newly adjusted crawl length). If not, step 203 is executed (that is, the subsequent crawl is performed with the originally set crawl length).
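
The adjustment of steps 201 to 204 might look like the sketch below. The disclosure only says the crawl length "may be shortened"; the halving factor `shrink`, the lower bound `min_len`, and the function name `next_batch_len` are assumptions chosen for illustration.

```python
def next_batch_len(current_len, prev_rate, curr_rate, shrink=0.5, min_len=1):
    """Choose the crawl length for the subsequent batch (steps 201 to 204).

    If the duplication rate rose compared with the previous batch (step 201),
    shorten the crawl length (step 202); otherwise keep it (step 203).
    `shrink` and `min_len` are illustrative values, not from the source.
    """
    if curr_rate > prev_rate:
        return max(min_len, int(current_len * shrink))
    return current_len


# Duplication rose from 0.2 to 0.5 without reaching the preset value,
# so the following batch is shortened from 10 items to 5.
shorter = next_batch_len(10, prev_rate=0.2, curr_rate=0.5)
unchanged = next_batch_len(10, prev_rate=0.5, curr_rate=0.5)
```

A rising duplication rate suggests the crawl is approaching the chronologically ordered part of the listing, so smaller batches limit how many already-known items are fetched before the stop condition is met.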

In summary, the beneficial effects of the present invention are as follows. For incremental crawling of website information that is not strictly sorted by time, crawl overhead is reduced while the miss rate is kept within an allowable range. The set crawl length and the data queue length can be adjusted dynamically, which improves the efficiency of the algorithm and accommodates different miss-rate and crawl-overhead requirements.

It should be understood that the present invention is not limited to any specific type of service, type of data, or type of target website, and that variations of the above also fall within the scope of protection of the present invention.

The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims

1. A method for incremental crawling of website information, characterized in that it comprises the following steps:

(a) crawling data of a set length in the presentation order of the website data, starting from the first item presented on the target website;

(b) placing the crawled data of the set length into a data queue in the presentation order of the website data, a comparison window being provided at the end of the data queue;

(c) calculating the duplication rate between the data of the set length in the comparison window and previously crawled data;

(d) stopping the crawl or performing the next crawl according to the calculated duplication rate;

that is, when the duplication rate reaches a preset value, the crawl is stopped; when the duplication rate does not reach the preset value, the next crawl is performed, the set crawl length and the data queue length are adjusted dynamically, and steps (b) to (d) are then executed.

2. The method for incremental crawling of website information according to claim 1, characterized in that, when the duplication rate reaches the preset value and the preset value is less than 1, the data in the data queue that does not duplicate previously crawled data is saved in the database, and the entire crawling process is then stopped.

3. The method for incremental crawling of website information according to claim 1, characterized in that, when the duplication rate reaches the preset value and the preset value is 1, the entire crawling process is stopped.

4. The method for incremental crawling of website information according to claim 1, characterized in that performing the next crawl specifically comprises:

when the duplication rate does not reach the preset value, saving the data in the data queue that does not duplicate previously crawled data in the database, and clearing the data queue;

in the presentation order of the website data, continuing to crawl data of the set length from the position where the previous crawl ended.

5. The method for incremental crawling of website information according to claim 1, characterized in that the set crawl length is less than or equal to the data queue length.