CN111444407A

CN111444407A - Automatic extraction method and system for page list information of web crawler

Info

Publication number: CN111444407A
Application number: CN202010222132.4A
Authority: CN
Inventors: 姜建武; 李景文; 陆妍玲
Original assignee: Guilin University of Technology
Current assignee: Guilin University of Technology
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2020-07-24
Anticipated expiration: 2040-03-26
Also published as: CN111444407B

Abstract

The invention relates to a method and system for automatically extracting page list information for a web crawler. The method includes: acquiring the hypertext markup language (HTML) document of a page to be extracted; determining a set of HTML objects from the elements in the HTML document; traversing the set of HTML objects to determine a document object model (DOM); determining the web page structure of the page to be extracted according to the document object model, the web page structure including list items and list item attributes; determining an extraction mode according to the web page structure of the page to be extracted; and extracting the page list information with the extraction mode. The method and system thereby allow a web crawler to crawl page list information automatically.

Description

A method and system for automatically extracting page list information for a web crawler

Technical Field

The invention relates to the field of web crawlers, and in particular to a method and system for automatically extracting page list information for a web crawler.

Background

The rapid development of information technology has produced massive amounts of data. Extracting useful information from this data quickly, efficiently and accurately places higher demands on information acquisition technology. Against this background, web crawler technology has developed rapidly and been widely applied, because it obtains information conveniently, supports a variety of acquisition methods, and is semi-automatic. However, traditional web crawler technology requires crawler scripts to be written specifically for the characteristics of each web page. The Internet is full of information publishing platforms, systems and websites, and the ways and formats in which pages display information vary endlessly, so crawlers are costly to develop. When a web page is revised, the crawler program must be updated accordingly, which also affects the stability of crawling. Moreover, manual intervention is required before pages can be crawled automatically. The prior art therefore cannot crawl page list information automatically.

Summary of the Invention

The purpose of the present invention is to provide a method and system for automatically extracting page list information for a web crawler, so that a web crawler can crawl page list information automatically.

To achieve the above purpose, the present invention provides the following scheme:

A method for automatically extracting page list information for a web crawler, comprising:

obtaining the hypertext markup language document of the page to be extracted;

determining a set of hypertext markup language objects according to the elements in the hypertext markup language document, the elements including the tags, attributes and text of the hypertext markup language document;

traversing the set of hypertext markup language objects to determine a document object model;

determining the web page structure of the page to be extracted according to the document object model, the web page structure including list items and list item attributes;

determining an extraction mode according to the web page structure of the page to be extracted; and

extracting the page list information to be extracted by using the extraction mode.

Optionally, before obtaining the hypertext markup language document of the page to be extracted, the method further includes:

judging whether the page to be extracted has been revised, to obtain a first judgment result;

if the first judgment result indicates that the page to be extracted has been revised, performing the step of obtaining the hypertext markup language document of the page to be extracted;

if the first judgment result indicates that the page to be extracted has not been revised, extracting the page list information directly according to the extraction mode that corresponds to the unrevised page.
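
The text does not say how a revision is detected. As one possible illustration only, the sketch below assumes that a page counts as revised when the set of distinct tag and class combinations it uses has changed; the names `fingerprint` and `is_revised` and the hashing approach are hypothetical and are not taken from the patent.

```python
import hashlib
from html.parser import HTMLParser


class _ShapeCollector(HTMLParser):
    """Collects the distinct tag/class combinations used in a page."""

    def __init__(self):
        super().__init__()
        self.shape = set()

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Sort class names so that "main list" and "list main" compare equal.
        self.shape.add(tag + "." + ".".join(sorted(classes)))


def fingerprint(html):
    """Hash of the page's structural shape; text and item counts are ignored."""
    collector = _ShapeCollector()
    collector.feed(html)
    collector.close()
    joined = "\n".join(sorted(collector.shape))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def is_revised(html, cached_fingerprint):
    """Assumed criterion: the page is revised if its structural hash changed."""
    return fingerprint(html) != cached_fingerprint
```

Under this assumption a crawler would store the fingerprint next to the extraction mode, reuse the stored mode while `is_revised` returns False, and repeat the structure analysis otherwise. Because the fingerprint is built from a set, adding or removing list entries does not count as a revision.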

Optionally, traversing the set of hypertext markup language objects to determine the document object model specifically includes:

traversing the set of hypertext markup language objects to determine the relationships among all objects, the relationships including the sibling, containment, parent and child relationships between objects, and the height and depth of the hierarchy;

determining the document object model according to the relationships among all the objects.

Optionally, determining the web page structure of the page to be extracted according to the document object model specifically includes:

determining, according to the document object model, the number of occurrences of each tag and attribute in the page to be extracted and the total number of all tags and attributes;

determining the proportion of occurrences of each tag and attribute in the page to be extracted, to obtain a single proportion set;

determining the proportion of each tag attribute combination in the page to be extracted, to obtain a combination proportion set;

sorting the single proportion set and the combination proportion set in descending order of the frequency with which the tag attributes occur, to obtain the list item, the attributes of the list item being those tag attributes whose individual proportions are all equal to the proportion of their combination in the combination proportion set;

determining all document object chains in the page to be extracted according to the document object model;

truncating each document object chain so that it starts at the tag attribute combination corresponding to the list item;

calculating the frequency of occurrence of each truncated document object chain;

sorting the truncated document object chains in descending order of frequency, and determining the list item attributes.

Optionally, before truncating each document object chain so that it starts at the tag attribute combination corresponding to the list item, the method further includes:

judging whether the document object chain contains the tag attribute combination corresponding to the list item, to obtain a second judgment result;

if the second judgment result indicates that the document object chain contains the tag attribute combination corresponding to the list item, retaining the document object chain and truncating it so that it starts at that tag attribute combination;

if the second judgment result indicates that the document object chain does not contain the tag attribute combination corresponding to the list item, discarding the document object chain.

A system for automatically extracting page list information for a web crawler, comprising:

a hypertext markup language document acquisition module, configured to obtain the hypertext markup language document of the page to be extracted;

a hypertext markup language object set determination module, configured to determine a set of hypertext markup language objects according to the elements in the hypertext markup language document, the elements including the tags, attributes and text of the hypertext markup language document;

a document object model determination module, configured to traverse the set of hypertext markup language objects to determine a document object model;

a web page structure determination module, configured to determine the web page structure of the page to be extracted according to the document object model, the web page structure including list items and list item attributes;

an extraction mode determination module, configured to determine an extraction mode according to the web page structure of the page to be extracted;

a page list information extraction module, configured to extract the page list information to be extracted by using the extraction mode.

Optionally, the system further includes:

a first judgment module, configured to judge whether the page to be extracted has been revised, to obtain a first judgment result;

an execution module, configured to perform the step of obtaining the hypertext markup language document of the page to be extracted if the first judgment result indicates that the page to be extracted has been revised;

an unrevised-page extraction module, configured to extract the page list information directly according to the extraction mode that corresponds to the unrevised page if the first judgment result indicates that the page to be extracted has not been revised.

Optionally, the document object model determination module specifically includes:

an object relationship determination unit, configured to traverse the set of hypertext markup language objects to determine the relationships among all objects, the relationships including the sibling, containment, parent and child relationships between objects, and the height and depth of the hierarchy;

a document object model determination unit, configured to determine the document object model according to the relationships among all the objects.

Optionally, the web page structure determination module specifically includes:

a tag attribute determination unit, configured to determine, according to the document object model, the number of occurrences of each tag and attribute in the page to be extracted and the total number of all tags and attributes;

a single proportion set determination unit, configured to determine the proportion of occurrences of each tag and attribute in the page to be extracted, to obtain a single proportion set;

a combination proportion set determination unit, configured to determine the proportion of each tag attribute combination in the page to be extracted, to obtain a combination proportion set;

a list item determination unit, configured to sort the single proportion set and the combination proportion set in descending order of the frequency with which the tag attributes occur, to obtain the list item, the attributes of the list item being those tag attributes whose individual proportions are all equal to the proportion of their combination in the combination proportion set;

a document object chain determination unit, configured to determine all document object chains in the page to be extracted according to the document object model;

a document object chain truncation unit, configured to truncate each document object chain so that it starts at the tag attribute combination corresponding to the list item;

a frequency calculation unit, configured to calculate the frequency of occurrence of each truncated document object chain;

a list item attribute determination unit, configured to sort the truncated document object chains in descending order of frequency and determine the list item attributes.

Optionally, the web page structure determination module further includes:

a first judgment unit, configured to judge whether the document object chain contains the tag attribute combination corresponding to the list item, to obtain a second judgment result;

a document object chain retention unit, configured to retain the document object chain and truncate it so that it starts at the tag attribute combination corresponding to the list item, if the second judgment result indicates that the document object chain contains that tag attribute combination;

a document object chain elimination unit, configured to discard the document object chain if the second judgment result indicates that the document object chain does not contain the tag attribute combination corresponding to the list item.

According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:

The method and system provided by the present invention obtain the hypertext markup language document of the page to be extracted, determine the set of hypertext markup language objects, determine the document object model from that set, and then determine the web page structure of the page to be extracted from the document object model. In other words, the structure of the page to be extracted is identified automatically. This addresses the problem that pages display information in endlessly varying ways and formats, avoids manual intervention, enables page list information to be crawled automatically, and keeps the web crawler stable.

Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the method for automatically extracting page list information for a web crawler provided by the present invention;

FIG. 2 is a schematic structural diagram of the system for automatically extracting page list information for a web crawler provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

To make the above objects, features and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a schematic flowchart of the method for automatically extracting page list information for a web crawler provided by the present invention. As shown in FIG. 1, the method includes:

S101: obtain the hypertext markup language document of the page to be extracted.

Before S101, the method further includes:

judging whether the page to be extracted has been revised, to obtain a first judgment result.

If the first judgment result indicates that the page to be extracted has been revised, the step of obtaining the hypertext markup language document of the page to be extracted is performed.

S102: determine a set of hypertext markup language objects according to the elements in the hypertext markup language document; the elements include the tags, attributes and text of the hypertext markup language document. Every element in the hypertext markup language document is treated as an object, and the resulting set of hypertext markup language objects is denoted HTML, where $HTML=\{Obj_1, Obj_2, Obj_3, \ldots, Obj_n\}$.
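
As a rough illustration of S102, the sketch below flattens a document into a list of objects with Python's standard `html.parser` module. The class name `ObjectCollector`, the function `collect_objects` and the `(kind, value)` tuple layout are assumptions made for this example; the patent does not prescribe a data layout.

```python
from html.parser import HTMLParser


class ObjectCollector(HTMLParser):
    """Records every tag, attribute and text node as one object."""

    def __init__(self):
        super().__init__()
        self.objects = []  # Obj_1 ... Obj_n in document order

    def handle_starttag(self, tag, attrs):
        self.objects.append(("tag", tag))
        for name, value in attrs:
            # Attributes without a value (e.g. "disabled") arrive as None.
            text = name if value is None else f"{name}={value}"
            self.objects.append(("attr", text))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.objects.append(("text", text))


def collect_objects(html):
    collector = ObjectCollector()
    collector.feed(html)
    collector.close()
    return collector.objects
```

For example, `collect_objects('<p class="main"><a href="/1">T</a></p>')` yields the tag `p`, its `class` attribute, the tag `a`, its `href` attribute, and the text `T`, in that order.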

S103: traverse the set of hypertext markup language objects to determine a document object model.

The set of hypertext markup language objects is traversed to determine the relationships among all objects. These relationships include the sibling, containment, parent and child relationships between objects, and the height and depth of the hierarchy. The height of an object is the length of the longest path from it down to a leaf, and its depth is the length of the path from it up to the root.

The document object model (DOM) is determined according to the relationships among all the objects.
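
The relationships named above can be sketched as a small tree structure. This is an illustrative sketch that assumes well-formed HTML in which every non-void element has a closing tag; `Node`, `TreeBuilder` and `build_dom` are hypothetical names, not part of the patent.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent      # parent relationship
        self.children = []        # child / containment relationship
        if parent is not None:
            parent.children.append(self)

    @property
    def depth(self):
        """Length of the path from this node up to the root."""
        return 0 if self.parent is None else self.parent.depth + 1

    @property
    def height(self):
        """Length of the longest path from this node down to a leaf."""
        if not self.children:
            return 0
        return 1 + max(child.height for child in self.children)

    def siblings(self):
        """Sibling relationship: other children of the same parent."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1] if self.stack else None)
        if self.root is None:
            self.root = node
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].tag == tag:
            self.stack.pop()


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Computing `depth` and `height` on demand keeps the sketch short; a crawler processing large pages would more likely cache both values during the traversal.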

S104: determine the web page structure of the page to be extracted according to the document object model; the web page structure includes list items and list item attributes. The document object model makes it possible to quickly count all elements and attributes and to determine how they are related. Once these statistics and relationships are available, they are analyzed to extract the list items, list item attributes and irrelevant information from the document structure, and the irrelevant information is removed. The web page structure of all pages follows these rules:

① The entries in the content list of a list page all have the same Class or ID attribute, or the same tag element.

② The title of each list item is placed separately and is not repeated; it is usually an a tag, or has an event attribute.

③ The auxiliary information of a list item, such as time, author and summary, is placed in specific tags.

④ The title of a list item and its auxiliary information are always grouped together, nested, or arranged side by side in order.

According to the document object model, the number of occurrences of each tag and attribute in the page to be extracted and the total number of all tags and attributes are determined. Let $m_i$ be the number of occurrences of tag or attribute $x_i$ in the page, and let $w$ be the total number of all tags and attributes in the page:

$w=\sum_i m_i$

The proportion of occurrences of each tag and attribute in the page to be extracted is determined, to obtain a single proportion set. That is, the proportion $p_i$ of the occurrences of a single tag or attribute in the whole page is calculated, and these proportions form the single proportion set $p=\{p_1,p_2,p_3,p_4,\ldots,p_w\}$, where

$p_i=\dfrac{m_i}{w}$
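
The counting and the single proportions can be sketched as follows. Several points are assumptions of this example rather than statements of the patent: only `class` and `id` values are counted as attributes, class names are prefixed with `.` and ids with `#` to keep them apart from tag names, tags nested inside `head` are skipped while `head` itself is counted, and the sample page is a hypothetical one built to reproduce the totals of the worked example (25 tags, 11 attributes).

```python
from collections import Counter
from fractions import Fraction
from html.parser import HTMLParser


class TagAttrCounter(HTMLParser):
    """Counts m_i for every tag and every class/id value outside <head>."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if self._in_head:
            return  # tags nested inside <head> take no part in extraction
        if tag == "head":
            self._in_head = True
        self.counts[tag] += 1
        for name, value in attrs:
            if name == "class":
                for class_name in (value or "").split():
                    self.counts["." + class_name] += 1
            elif name == "id" and value:
                self.counts["#" + value] += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


def single_proportions(html):
    """Returns (w, {x_i: p_i}) with p_i = m_i / w as exact fractions."""
    counter = TagAttrCounter()
    counter.feed(html)
    counter.close()
    w = sum(counter.counts.values())
    return w, {x: Fraction(m, w) for x, m in counter.counts.items()}


# Hypothetical sample page: five list entries inside one container.
ITEM = '<p class="main list"><a href="#">t</a><span>d</span><span>a</span></p>'
PAGE = ('<html><head><title>x</title></head><body><h3>h</h3>'
        '<div class="content">' + ITEM * 5 + '</div></body></html>')

w, p = single_proportions(PAGE)
print(w, p["span"], p[".main"])
```

`Fraction` is used so that proportions can later be compared for exact equality, which floating-point division would not guarantee. Note that `Fraction` reduces its values, so 10/36 is shown as 5/18.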

The proportion of each tag attribute combination in the page to be extracted is determined, to obtain the combination proportion set $k$. An element in a page sometimes carries a combination of more than one attribute, and different combinations express different structural content:

$k=\{p_{(1,2,\ldots)},p_{(1,3,\ldots)},p_{(2,3,\ldots)},\ldots,p_{(i,j,\ldots)}\}$, with $p_{(i,j,\ldots)}=\dfrac{n_{(i,j,\ldots)}}{w}$

where $i$, $j$ denote the tag attributes that make up a combination, only combinations that actually exist in the page are included, and $n_{(i,j,\ldots)}$ is the number of times the tag attribute combination appears in the page.

The single proportion set and the combination proportion set are sorted in descending order of the frequency with which the tag attributes occur, to obtain the list item. The attributes of the list item are those tag attributes whose individual proportions are all equal to the proportion of their combination in the combination proportion set, that is, $k_{(i,j,\ldots)}=p_i=p_j=\cdots$, where $k_{(i,j,\ldots)}$ denotes the element $p_{(i,j,\ldots)}$ of $k$.
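
The equality test can be sketched on pre-computed counts. The function name `find_list_item` and the input layout are hypothetical. The sketch also accepts combinations with a single attribute and resolves several matches by taking the most frequent one; both choices are assumptions of this example.

```python
from fractions import Fraction


def find_list_item(single_counts, combo_counts, w):
    """Pick the tag/attribute combination whose proportion equals the
    proportion of each of its attributes taken individually.

    single_counts: {attribute: m_i}
    combo_counts:  {(tag, (attribute, ...)): n_(i,j,...)}
    w:             total number of tags and attributes in the page
    """
    candidates = []
    for (tag, attrs), n in combo_counts.items():
        p_combo = Fraction(n, w)
        if attrs and all(Fraction(single_counts[a], w) == p_combo for a in attrs):
            candidates.append((p_combo, tag, attrs))
    # Descending by proportion: the most frequent matching combination wins.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[0] if candidates else None


# Counts taken from the worked example: w = 36.
single = {"main": 5, "list": 5, "content": 1}
combos = {("p", ("main", "list")): 5, ("div", ("content",)): 1}
best = find_list_item(single, combos, 36)
print(best)
```

With these counts the container `div` with attribute `content` also passes the equality test, so it is the descending sort that selects `p` with `main` and `list`.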

All document object chains (DOM chains) in the page to be extracted are determined according to the document object model. Hierarchical DOM chains are denoted by M, and parallel DOM chains are denoted by N.

Each document object chain is truncated so that it starts at the tag attribute combination corresponding to the list item.

The frequency of occurrence of each truncated document object chain is calculated.
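
The sketch below counts chains that start at the list item. It follows the notation of the worked example, in which a chain is written as the list item followed by its direct children joined with `|`. The nested `(label, children)` tuples and the names `collect_chains` and `chain_frequencies` are assumptions of this example. Chains that never reach the list item produce no output, which corresponds to discarding them.

```python
from collections import Counter


def collect_chains(node, item_label, out):
    """Walk the tree; emit one chain per occurrence of the list item."""
    label, children = node
    if label == item_label:
        out.append(label + "→" + "|".join(child[0] for child in children))
    for child in children:
        collect_chains(child, item_label, out)


def chain_frequencies(root, item_label):
    """Return (chain, frequency) pairs sorted by descending frequency."""
    chains = []
    collect_chains(root, item_label, chains)
    return Counter(chains).most_common()


# Hypothetical tree matching the worked example: five identical entries.
entry = ("p.main.list", [("a", []), ("span", []), ("span", [])])
tree = ("html", [("head", []),
                 ("body", [("h3", []), ("div.content", [entry] * 5)])])

ranked = chain_frequencies(tree, "p.main.list")
print(ranked)
```

`Counter.most_common()` already returns the chains in descending order of frequency, so no separate sorting step is needed.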

S105: determine an extraction mode according to the web page structure of the page to be extracted.

S106: extract the page list information to be extracted by using the extraction mode.
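
The patent does not define a concrete format for the extraction mode. The sketch below assumes it is a small dictionary naming the list item tag, its classes, the title tag and the attribute tag, and it assumes that list items are not nested. `ListExtractor`, `extract` and the dictionary keys are hypothetical.

```python
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Applies an extraction mode to a page and collects one record per item."""

    def __init__(self, mode):
        super().__init__()
        self.mode = mode
        self.records = []
        self.current = None  # record being filled, None outside a list item
        self.field = None    # "title", "attrs" or None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        if self.current is None:
            if tag == self.mode["item_tag"] and self.mode["item_classes"] <= classes:
                self.current = {"title": "", "link": attr_map.get("href"), "attrs": []}
                self.current["link"] = None
        elif tag == self.mode["title_tag"]:
            self.field = "title"
            self.current["link"] = attr_map.get("href")
        elif tag == self.mode["attr_tag"]:
            self.field = "attrs"
            self.current["attrs"].append("")

    def handle_data(self, data):
        if self.current is None or self.field is None:
            return
        if self.field == "title":
            self.current["title"] += data.strip()
        else:
            self.current["attrs"][-1] += data.strip()

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag == self.mode["item_tag"]:
            self.records.append(self.current)
            self.current = None
            self.field = None
        elif tag in (self.mode["title_tag"], self.mode["attr_tag"]):
            self.field = None


def extract(html, mode):
    extractor = ListExtractor(mode)
    extractor.feed(html)
    extractor.close()
    return extractor.records


# Extraction mode matching the result of the worked example.
MODE = {"item_tag": "p", "item_classes": {"main", "list"},
        "title_tag": "a", "attr_tag": "span"}
```

Calling `extract(page_html, MODE)` on a page shaped like the worked example would return one dictionary per list entry, holding the title text, the link target and the texts of the span tags.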

Before each document object chain is truncated so that it starts at the tag attribute combination corresponding to the list item, the method further includes:

judging whether the document object chain contains the tag attribute combination corresponding to the list item, to obtain a second judgment result.

If the second judgment result indicates that the document object chain contains the tag attribute combination corresponding to the list item, the document object chain is retained and truncated so that it starts at that tag attribute combination.

In a specific embodiment, HTML = {html, head, body, h3, div, p, a, span}. HTML is traversed, and the relationships among all objects (sibling, containment, parent, child and so on) are recorded together with the height (Height) and depth (Depth) of the hierarchy. The height of an object is the length of the longest path from it down to a leaf, and its depth is the length of the path from it up to the root. The document object model, that is, the DOM model, is determined from the traversal result.

1) The number of tags in the page is 25 and the number of attributes is 11. Tags nested inside head take no part in data extraction, so they are removed as useless and are not counted.

2) The total number of all tags and attributes in the page is 25 + 11 = 36.

3) The proportion p of the occurrences of each single tag or attribute in the whole page is calculated, as shown in Table 1:

Table 1

Tag/attribute    p        Tag/attribute    p
html             1/36     a                5/36
head             1/36     span             10/36
body             1/36     main             5/36
div              1/36     list             5/36
p                5/36     content          1/36

In the table above, html, head, body, div, p, a and span are tags; main, list and content are attributes.

4) The proportion of each tag attribute combination in the page is calculated. An element in a page sometimes carries a combination of more than one attribute, and different combinations express different structural content. The set of combined tag attributes is denoted k. In this example the p tag has two class attributes, main and list:

$k=\{p_{(main,list)}\}$

5) The sets p and k are sorted in descending order of the frequency with which the tag attributes occur. The tag attributes sought are those for which the proportion of each of several single elements equals the proportion of their combination in the combination set. Comparison shows that:

$p_{(main,list)}=p_{main}=p_{list}$

Therefore the tag P with attributes main and list is the list item, and the tags html, head, body and div and the attribute content are irrelevant tags and an irrelevant attribute.

② Algorithm for determining list item attributes

Analysis shows that in this example the list item contains the list item attributes, which is the hierarchical form M, and the list item attributes are parallel to one another, which is the parallel form N. The list item attributes are determined as follows:

1) List all DOM chains in the page, where hierarchical DOM chains are denoted by M and parallel DOM chains are denoted by N. In this example there is only one level of containment: p contains a and span.

M = {P→a|span|span}

2) Truncate the above DOM chains so that each starts at the list item tag attribute (combination), discard any DOM chain that has no list item tag attribute, calculate the frequency of each DOM chain, and build the set R from the results. In this example there is only one chain, {P→a|span|span}, which occurs 5 times, so R contains this single chain with frequency 5.

3) Sort the DOM chains in descending order of frequency, denoted R'. The chains with higher frequency give the candidate list item attributes. According to the web page rules given above, span can then be determined to be the list item attribute.

Result: the list item is P, with attributes main and list; the list item content is the a tag, and the list item attribute is the span tag.
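
The final step can be sketched as splitting a chain into title and attributes. The sketch assumes, following rule ②, that the title is the `a` tag among the children and that every other child tag is a list item attribute. The function name `split_chain` is hypothetical, and the event-attribute case of rule ② is not handled.

```python
def split_chain(chain):
    """Split 'item→child|child|...' into (item, title tag, attribute tags)."""
    item, _, tail = chain.partition("→")
    children = tail.split("|") if tail else []
    # Assumption from rule ②: the title is usually an <a> tag.
    title = "a" if "a" in children else None
    attributes = sorted({child for child in children if child != title})
    return item, title, attributes


result = split_chain("P→a|span|span")
print(result)
```

For the chain of the worked example this gives the list item P, the content tag a and the attribute tag span, in line with the stated result.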

FIG. 2 is a schematic structural diagram of the system for automatically extracting page list information of a web crawler provided by the present invention. As shown in FIG. 2, the system includes: a hypertext markup language document acquisition module 201, a hypertext markup language object set determination module 202, a document object model determination module 203, a web page structure determination module 204, an extraction mode determination module 205, and a page list information extraction module 206.

The hypertext markup language document acquisition module 201 is configured to acquire the hypertext markup language document of the page to be extracted.

The hypertext markup language object set determination module 202 is configured to determine a set of hypertext markup language objects according to elements in the hypertext markup language document; the elements include the tags, attributes, and text of the hypertext markup language document.

The document object model determination module 203 is configured to traverse the set of hypertext markup language objects to determine the document object model.

The web page structure determination module 204 is configured to determine the web page structure of the page to be extracted according to the document object model; the web page structure includes a list item and a list item attribute.

The extraction mode determination module 205 is configured to determine the extraction mode according to the web page structure of the page to be extracted.

The page list information extraction module 206 is configured to extract the list information of the page to be extracted by using the extraction mode.
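As a rough illustration of what module 206 might do once an extraction mode is available, the sketch below applies a mode to an HTML string using Python's standard `html.parser`. The representation of the extraction mode as a dictionary, the class name `ListExtractor`, and the sample page are all assumptions made for this example; the patent does not prescribe a data format. The sketch also assumes well-formed markup with no void elements (such as img or br) inside a list item.

```python
from html.parser import HTMLParser

class ListExtractor(HTMLParser):
    """Collect one record per list item, given a (hypothetical) extraction mode."""

    def __init__(self, mode):
        super().__init__()
        self.mode = mode
        self.records = []
        self.depth = 0      # nesting depth inside the current list item (0 = outside)
        self.field = None   # record field currently receiving text

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self.depth == 0:
            # A list item starts when both the tag and its class attributes match.
            if tag == self.mode["item_tag"] and self.mode["item_classes"] <= classes:
                self.records.append({"content": "", "attribute": ""})
                self.depth = 1
            return
        self.depth += 1
        if tag == self.mode["content_tag"]:
            self.field = "content"
        elif tag == self.mode["attribute_tag"]:
            self.field = "attribute"

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            self.field = None

    def handle_data(self, data):
        if self.depth > 0 and self.field:
            self.records[-1][self.field] += data.strip()

html = """
<div>
  <p class="main list"><a href="/n/1">First notice</a><span>2020-03-01</span></p>
  <p class="main list"><a href="/n/2">Second notice</a><span>2020-03-02</span></p>
  <p class="footer">About us</p>
</div>
"""
mode = {"item_tag": "p", "item_classes": {"main", "list"},
        "content_tag": "a", "attribute_tag": "span"}

extractor = ListExtractor(mode)
extractor.feed(html)
print(extractor.records)
```

The mode used here mirrors the worked result in the description (list item p with attributes main and list, content in the a tag, attribute in the span tag).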

The system for automatically extracting page list information of a web crawler provided by the present invention further includes: a first judgment module, an execution module, and a module for extracting according to the extraction mode corresponding to the unrevised page.

The first judgment module is configured to judge whether the page to be extracted has been revised, to obtain a first judgment result.

The execution module is configured to execute the step of acquiring the hypertext markup language document of the page to be extracted if the first judgment result indicates that the page to be extracted has been revised.

The module for extracting according to the extraction mode corresponding to the unrevised page is configured to, if the first judgment result indicates that the page to be extracted has not been revised, directly extract the list information of the page to be extracted according to the extraction mode that corresponds to the unrevised page.
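The branch on the first judgment result can be sketched as a cache of extraction modes. The text does not say how a revision is detected, so the check used here (the cached mode no longer yields any records) is purely an assumption, as are the names `extract_with_cache`, `analyse`, and `extract`. The two helper functions are toy stand-ins for the structure analysis and the extraction step, not the patented procedures.

```python
import re

def extract_with_cache(url, html, cache, analyse, extract):
    """Reuse the cached extraction mode unless the page appears to have been revised."""
    mode = cache.get(url)
    if mode is not None:
        records = extract(html, mode)
        if records:                 # assumption: a non-empty result means "not revised"
            return records, False
    # First visit or revised page: re-run the structure analysis and refresh the cache.
    mode = analyse(html)
    cache[url] = mode
    return extract(html, mode), True

def analyse(html):
    # Toy stand-in for structure analysis: pick the most common tag name.
    tags = re.findall(r"<(\w+)", html)
    return max(set(tags), key=tags.count)

def extract(html, mode):
    # Toy stand-in for extraction: text of every element with the tag named by `mode`.
    return re.findall(rf"<{mode}[^>]*>(.*?)</{mode}>", html)

cache = {}
old_page = "<ul><li>a</li><li>b</li></ul>"
new_page = "<div><p>a</p><p>b</p><p>c</p></div>"

first = extract_with_cache("u", old_page, cache, analyse, extract)
second = extract_with_cache("u", old_page, cache, analyse, extract)
third = extract_with_cache("u", new_page, cache, analyse, extract)
print(first, second, third)
```

The second element of each returned pair reports whether the structure analysis had to be re-run, which corresponds to the "revised" branch of the first judgment.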

The document object model determination module 203 specifically includes: a unit for determining the relationships among all objects and a document object model determination unit.

The unit for determining the relationships among all objects is configured to traverse the set of hypertext markup language objects and determine the relationships among all objects; these relationships include the sibling relationship, containment relationship, parent relationship, and child relationship between objects, and the height and depth of the hierarchical relationship.

The document object model determination unit is configured to determine the document object model according to the relationships among all the objects.
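The relationships listed above (parent, child, sibling, depth, and height) can be illustrated with a small tree built by Python's standard `html.parser`. This is a sketch under simplifying assumptions: the classes `Node` and `TreeBuilder` are hypothetical, only element tags are recorded, and the input is assumed to be well formed with no void elements.

```python
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        # Depth counts the steps from the document root down to this node.
        self.depth = 0 if parent is None else parent.depth + 1

    def siblings(self):
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]

    def height(self):
        # Height counts the steps from this node down to its deepest descendant.
        return 0 if not self.children else 1 + max(c.height() for c in self.children)

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

builder = TreeBuilder()
builder.feed("<div><p><a>t</a><span>d</span></p><p><a>u</a></p></div>")

div = builder.root.children[0]
first_p = div.children[0]
print(first_p.parent.tag, [c.tag for c in first_p.children],
      [s.tag for s in first_p.siblings()], first_p.depth, div.height())
```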

The web page structure determination module 204 specifically includes: a tag attribute determination unit, a single proportion set determination unit, a combined proportion set determination unit, a list item determination unit, a document object chain determination unit, a document object chain interception unit, a frequency calculation unit, and a list item attribute determination unit.

The tag attribute determination unit is configured to determine, according to the document object model, the number of tags for each attribute and the number of tags for all attributes in the page to be extracted.

The single proportion set determination unit is configured to determine the proportion of occurrences of each tag of each attribute in the page to be extracted, to obtain a single proportion set.

The combined proportion set determination unit is configured to determine the proportion of tag attribute combinations in the page to be extracted, to obtain a combined proportion set.

The list item determination unit is configured to sort the single proportion set and the combined proportion set in descending order of tag attribute frequency to obtain the list item; the attributes of the list item are those tag attributes for which the proportions of the multiple single tag attributes each equal the proportion of the combined tag attribute in the combined proportion set.
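One possible reading of the proportion test is sketched below. It assumes that each element is reduced to a pair of tag name and attribute values, that a proportion is a count divided by the number of tags carrying attributes, and that the list item is the most frequent multi-attribute combination whose single proportions all equal its combined proportion. The function name `find_list_item` and the sample data are hypothetical, and exact fractions are used so that the equality check is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

def find_list_item(elements):
    """elements: iterable of (tag, attribute_values) pairs taken from the DOM."""
    single, combined = Counter(), Counter()
    for tag, values in elements:
        if not values:
            continue                      # tags without attributes are not counted
        for value in values:
            single[(tag, value)] += 1
        combined[(tag, tuple(sorted(values)))] += 1

    total = sum(combined.values())        # number of tags that carry attributes
    single_share = {k: Fraction(n, total) for k, n in single.items()}
    combined_share = {k: Fraction(n, total) for k, n in combined.items()}

    # Walk the combinations in descending order of proportion.
    ordered = sorted(combined_share.items(), key=lambda kv: kv[1], reverse=True)
    for (tag, values), share in ordered:
        if len(values) > 1 and all(single_share[(tag, v)] == share for v in values):
            return tag, values, share
    return None

# Hypothetical page: five list rows plus a few structural tags.
elements = [("div", ("header",)), ("ul", ("nav",)), ("div", ("footer",))]
elements += [("p", ("main", "list"))] * 5

result = find_list_item(elements)
print(result)
```

Under these assumptions the p tag with attributes main and list is selected, because main and list each occur exactly as often as the combination of the two.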

The document object chain determination unit is configured to determine all document object chains in the page to be extracted according to the document object model.

The document object chain interception unit is configured to intercept each document object chain starting from the tag attribute combination corresponding to the list item.

The frequency calculation unit is configured to calculate the frequency of occurrence of each intercepted document object chain.

The list item attribute determination unit is configured to sort the intercepted document object chains in descending order of frequency of occurrence to determine the list item attribute.

The web page structure determination module 204 further includes: a first judgment unit, a document object chain retention unit, and a document object chain elimination unit.

The first judgment unit is configured to judge whether the document object chain contains the tag attribute combination corresponding to the list item, to obtain a second judgment result.

The document object chain retention unit is configured to, if the second judgment result indicates that the document object chain contains the tag attribute combination corresponding to the list item, retain the document object chain and intercept each document object chain starting from the tag attribute combination corresponding to the list item.

The document object chain elimination unit is configured to eliminate the document object chain if the second judgment result indicates that the document object chain does not contain the tag attribute combination corresponding to the list item.
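The retention, elimination, and interception steps can be sketched together. This example assumes that a document object chain is a tuple of node labels from the root to a leaf and that the list item's tag attribute combination is written as a single label such as "p.main.list"; both conventions, the function name `intercept_chains`, and the sample chains are hypothetical.

```python
def intercept_chains(chains, item):
    """Keep chains that contain `item` and cut each so that it starts at `item`."""
    kept = []
    for chain in chains:
        if item in chain:
            # Retained: drop everything above the list item.
            kept.append(chain[chain.index(item):])
        # Chains without the list item are eliminated.
    return kept

chains = [
    ("html", "body", "div.wrap", "p.main.list", "a"),
    ("html", "body", "div.wrap", "p.main.list", "span"),
    ("html", "body", "div.footer", "a"),
]

result = intercept_chains(chains, "p.main.list")
print(result)
```

Cutting the chains at the list item removes the page-specific ancestors, so that rows of the same list produce identical chains whose frequencies can then be counted.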

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

Specific examples are used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for automatically extracting page list information of a web crawler, characterized by comprising:

acquiring the hypertext markup language document of the page to be extracted;

determining a set of hypertext markup language objects according to elements in the hypertext markup language document; the elements include the tags, attributes, and text of the hypertext markup language document;

traversing the set of hypertext markup language objects to determine a document object model;

determining the web page structure of the page to be extracted according to the document object model; the web page structure includes a list item and a list item attribute;

determining an extraction mode according to the web page structure of the page to be extracted;

extracting the list information of the page to be extracted by using the extraction mode.

2. The method for automatically extracting page list information of a web crawler according to claim 1, wherein, before the acquiring of the hypertext markup language document of the page to be extracted, the method further comprises:

judging whether the page to be extracted has been revised, to obtain a first judgment result;

if the first judgment result indicates that the page to be extracted has been revised, performing the step of acquiring the hypertext markup language document of the page to be extracted;

if the first judgment result indicates that the page to be extracted has not been revised, directly extracting the list information of the page to be extracted according to the extraction mode that corresponds to the unrevised page.

3. The method for automatically extracting page list information of a web crawler according to claim 1, wherein the traversing of the set of hypertext markup language objects to determine a document object model specifically comprises:

traversing the set of hypertext markup language objects to determine the relationships among all objects; the relationships among all the objects include the sibling relationship, containment relationship, parent relationship, and child relationship between objects, and the height and depth of the hierarchical relationship;

determining the document object model according to the relationships among all the objects.

4. The method for automatically extracting page list information of a web crawler according to claim 1, wherein the determining of the web page structure of the page to be extracted according to the document object model specifically comprises:

determining, according to the document object model, the number of tags for each attribute and the number of tags for all attributes in the page to be extracted;

determining the proportion of occurrences of each tag of each attribute in the page to be extracted, to obtain a single proportion set;

determining the proportion of tag attribute combinations in the page to be extracted, to obtain a combined proportion set;

sorting the single proportion set and the combined proportion set in descending order of tag attribute frequency to obtain the list item; the attributes of the list item are those tag attributes for which the proportions of the multiple single tag attributes each equal the proportion of the combined tag attribute in the combined proportion set;

determining all document object chains in the page to be extracted according to the document object model;

intercepting each document object chain starting from the tag attribute combination corresponding to the list item;

calculating the frequency of occurrence of each intercepted document object chain;

sorting the intercepted document object chains in descending order of frequency of occurrence to determine the list item attribute.

5. The method for automatically extracting page list information of a web crawler according to claim 4, wherein, before the intercepting of each document object chain starting from the tag attribute combination corresponding to the list item, the method further comprises:

judging whether the document object chain contains the tag attribute combination corresponding to the list item, to obtain a second judgment result;

if the second judgment result indicates that the document object chain contains the tag attribute combination corresponding to the list item, retaining the document object chain and intercepting each document object chain starting from the tag attribute combination corresponding to the list item;

if the second judgment result indicates that the document object chain does not contain the tag attribute combination corresponding to the list item, eliminating the document object chain.

6. A system for automatically extracting page list information of a web crawler, characterized by comprising:

a hypertext markup language document acquisition module, configured to acquire the hypertext markup language document of the page to be extracted;

a hypertext markup language object set determination module, configured to determine a set of hypertext markup language objects according to elements in the hypertext markup language document; the elements include the tags, attributes, and text of the hypertext markup language document;

a document object model determination module, configured to traverse the set of hypertext markup language objects to determine a document object model;

a web page structure determination module, configured to determine the web page structure of the page to be extracted according to the document object model; the web page structure includes a list item and a list item attribute;

an extraction mode determination module, configured to determine an extraction mode according to the web page structure of the page to be extracted;

a page list information extraction module, configured to extract the list information of the page to be extracted by using the extraction mode.

7. The system for automatically extracting page list information of a web crawler according to claim 6, characterized by further comprising:

a first judgment module, configured to judge whether the page to be extracted has been revised, to obtain a first judgment result;

an execution module, configured to execute the step of acquiring the hypertext markup language document of the page to be extracted if the first judgment result indicates that the page to be extracted has been revised;

a module for extracting according to the extraction mode corresponding to the unrevised page, configured to, if the first judgment result indicates that the page to be extracted has not been revised, directly extract the list information of the page to be extracted according to the extraction mode that corresponds to the unrevised page.

8. The system for automatically extracting page list information of a web crawler according to claim 6, wherein the document object model determination module specifically comprises:

a unit for determining the relationships among all objects, configured to traverse the set of hypertext markup language objects to determine the relationships among all objects; the relationships among all the objects include the sibling relationship, containment relationship, parent relationship, and child relationship between objects, and the height and depth of the hierarchical relationship;

a document object model determination unit, configured to determine the document object model according to the relationships among all the objects.

9. The system for automatically extracting page list information of a web crawler according to claim 6, wherein the web page structure determination module specifically comprises:

a tag attribute determination unit, configured to determine, according to the document object model, the number of tags for each attribute and the number of tags for all attributes in the page to be extracted;

a single proportion set determination unit, configured to determine the proportion of occurrences of each tag of each attribute in the page to be extracted, to obtain a single proportion set;

a combined proportion set determination unit, configured to determine the proportion of tag attribute combinations in the page to be extracted, to obtain a combined proportion set;

a list item determination unit, configured to sort the single proportion set and the combined proportion set in descending order of tag attribute frequency to obtain the list item; the attributes of the list item are those tag attributes for which the proportions of the multiple single tag attributes each equal the proportion of the combined tag attribute in the combined proportion set;

a document object chain determination unit, configured to determine all document object chains in the page to be extracted according to the document object model;

a document object chain interception unit, configured to intercept each document object chain starting from the tag attribute combination corresponding to the list item;

a frequency calculation unit, configured to calculate the frequency of occurrence of each intercepted document object chain;

a list item attribute determination unit, configured to sort the intercepted document object chains in descending order of frequency of occurrence to determine the list item attribute.

10. The system for automatically extracting page list information of a web crawler according to claim 9, wherein the web page structure determination module further comprises:

a first judgment unit, configured to judge whether the document object chain contains the tag attribute combination corresponding to the list item, to obtain a second judgment result;

a document object chain retention unit, configured to, if the second judgment result indicates that the document object chain contains the tag attribute combination corresponding to the list item, retain the document object chain and intercept each document object chain starting from the tag attribute combination corresponding to the list item;

a document object chain elimination unit, configured to eliminate the document object chain if the second judgment result indicates that the document object chain does not contain the tag attribute combination corresponding to the list item.