CN103970749B - Method and system for calculating block importance in a web page - Google Patents
- Publication number: CN103970749B (also listed: CN201310029651.9A)
- Authority: CN (China)
- Prior art keywords: webpage, region, area, importance, block
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for calculating block importance in a web page, comprising the following steps: providing a first web page that includes a plurality of area blocks; parsing the first web page to obtain a plurality of specific areas with different importance levels; performing semantic analysis on the specific areas and on the area blocks; classifying the area blocks according to the semantic similarity between each area block and each specific area; and obtaining, from the importance level of each specific area, the importance levels of the area blocks in the corresponding classification result. According to embodiments of the invention, block importance levels are calculated with high precision and accuracy. The invention also provides a system for calculating block importance in a web page.
Description
Technical Field
The invention relates to the field of Internet technology, and in particular to a method and system for calculating block importance in a web page.
Background
An Internet web page can be divided into several areas according to how readers perceive it visually or how the page designer intends to present it. These areas contribute differently to expressing the main content of the page and attract different degrees of visual attention from readers. For example, the upper part of a page is usually shared across the site; it guides users through the overall structure of the site and contributes little to expressing the page's main content. The central part of a page, by contrast, is generally where the main content is presented and is the area readers mainly read. For search engine retrieval it is therefore necessary to calculate how much each area of a page contributes to expressing the page's main content, that is, the importance value of each block, which plays an important role in guiding page segmentation and retrieval matching.
It is generally held that an area expressing the main content of the page has the highest importance, an area unrelated to the main content has the lowest importance, and an area expressing content related to the main content has intermediate importance.
Ruihua Song et al. proposed a method for calculating block importance. The method assumes that areas of a page sharing the same topic can be grouped into independent blocks. The procedure is: first segment the web page into blocks, then feed the feature values of each block into an importance-calculation algorithm to obtain the importance value of each block. The training objective is to minimize the squared difference between the algorithm's output and the block importance values labeled by users. The method mainly uses the spatial position features and content features of the blocks of a single page. Spatial position features are the absolute position of a block in the whole page or its position relative to the whole page; content features are the page content contained in a single block, such as images, links, text, and user comment submission areas.
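The training objective described above can be written compactly. The symbols are illustrative assumptions rather than notation from the cited work: $x_i$ is the feature vector of block $i$, $y_i$ is its user-labeled importance, $f$ is the learned scoring function, and $n$ is the number of labeled blocks.

```latex
\min_{f} \; \sum_{i=1}^{n} \bigl( f(x_i) - y_i \bigr)^{2}
```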
Shian-Hua et al. proposed a method for identifying news article blocks based on classifying table tags. The method first uses table tags to segment the page into blocks. It then computes the features of each block, checks how those features occur on all other pages of the site, and computes the information entropy of each feature. The entropy of a block is the average entropy of its features; if the block entropy is below a threshold the block is judged to be an article block, otherwise it is not. The pages from which this method gathers multi-page information must come from the same site.
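The entropy test described above might be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the input format (per-page occurrence counts for each feature), the use of Shannon entropy over those counts, and all function names are assumptions.

```python
import math

def feature_entropy(page_counts):
    """Shannon entropy (in bits) of one feature's distribution over site pages.

    page_counts[i] is how often the feature occurs on page i
    (hypothetical input format).
    """
    total = sum(page_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in page_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def block_entropy(features, site_counts):
    """Average entropy of the features contained in one block."""
    values = [feature_entropy(site_counts[f]) for f in features]
    return sum(values) / len(values) if values else 0.0

def is_article_block(features, site_counts, threshold):
    """A block counts as an article block when its entropy is below the threshold."""
    return block_entropy(features, site_counts) < threshold
```

A feature spread evenly over every page of a site (for example a navigation link) gets high entropy, while a feature that appears on one page only gets zero entropy, so blocks built from page-specific features fall below the threshold.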
Lan Yi et al. regard areas such as page copyright notices and advertisements as web page noise, since they are unrelated to the main content of the page. Assuming that the noise areas of pages on the same site have similar content and presentation, they proposed a method for eliminating web page noise. The method likewise assumes that pages of the same site come from the same kind of page template. On this basis it defines a data structure called the Site Style Tree (SST), which captures the presentation styles and content shared by pages within a site. For each node of the SST, the number of its child nodes and their distribution over all pages are computed. The more child nodes a node has and the more varied their distribution, the higher the node's score. A node whose score is below a threshold is judged to be a noise node; otherwise it is a meaningful node.
In summary, the prior art has the following disadvantages:
1. Only the absolute position of a block within the page is used; the relationship between a block and specific areas of the page is not exploited.
2. Most methods use information from a single page only.
3. Where multi-page information is used, it is assumed to come from the same site, and it does not include information about relationships with specific areas.
4. The scope of application is narrow, and the range of problems solved is limited.
Summary of the Invention
The present invention aims to solve at least one of the technical deficiencies described above.
To this end, one object of the invention is to propose a method for calculating block importance in a web page that calculates block importance levels with high precision and accuracy.
Another object of the invention is to propose a system for calculating block importance in a web page.
To achieve the above objects, an embodiment of the first aspect of the invention discloses a method for calculating block importance in a web page, comprising the following steps: providing a first web page, the first web page including a plurality of area blocks; parsing the first web page to obtain a plurality of specific areas with different importance levels; performing semantic analysis on the plurality of specific areas and on the plurality of area blocks; classifying the plurality of area blocks according to the semantic similarity between each area block and each specific area; and obtaining, from the importance level of each specific area, the importance levels of the area blocks in the corresponding classification result.
In the method according to embodiments of the invention, the web page is parsed to obtain a plurality of specific areas with different importance levels, the area blocks of the page are classified by their semantic relationship to the specific areas, and the importance of each area block is obtained from the importance of the specific area corresponding to its classification result. Because the importance of an area block is derived from, among other things, the content relationship between specific areas and area blocks, the recall and precision of area block classification can be improved significantly, so that block importance is calculated with high precision and accuracy.
In addition, the method according to the above embodiment of the invention may have the following additional technical features:
In some examples, the method further includes: correcting the importance levels of the plurality of specific areas and the plurality of area blocks according to same-cluster web pages related to the first web page.
In some examples, the step of correcting the importance levels of the plurality of specific areas and the plurality of area blocks according to same-cluster web pages related to the first web page further includes: obtaining the same-cluster web pages related to the first web page; obtaining, for each web page in the same cluster, its plurality of specific areas and plurality of classification results; calculating, for each web page in the same cluster, the distribution information of each specific area and of each area block in each classification result within the corresponding web page, as well as the relationship information between the area blocks in each classification result and the corresponding specific area; calculating statistics over all of the distribution information; correcting the corresponding specific areas of the first web page and the area blocks in each classification result according to the statistics of the distribution information and the relationship information; and obtaining, from the corrected importance of each specific area of the first web page, the importance of the area blocks in the corresponding classification result.
In some examples, the same-cluster web pages are web pages whose DOM tree structure is similar to that of the first web page.
In some examples, the distribution information includes: coordinates, occupied area, and word distribution information.
In some examples, the plurality of specific areas include: a page path navigation area, a title area of the page content, and a page copyright notice area.
In some examples, the importance level of the page copyright notice area is lower than that of the page path navigation area, and the importance level of the page path navigation area is lower than that of the title area of the page content.
In some examples, after the importance levels of the area blocks are obtained, the method further includes: evaluating the content of the area blocks according to their importance levels.
In some examples, the method further includes: monitoring the content of the area block with the highest importance level.
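The cluster-based correction listed among the features above relies on statistics of distribution information (coordinates, occupied area, word distribution) gathered across same-cluster pages. The text does not say which statistics are used or how the correction is applied, so the following is only a sketch under assumptions: it covers the geometric fields only, uses mean and population standard deviation, and flags a block as atypical when any field deviates by more than `k` standard deviations. The names `Distribution`, `summarize`, and `is_typical` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class Distribution:
    x: float     # left coordinate of the area or block
    y: float     # top coordinate of the area or block
    area: float  # occupied area

FIELDS = ("x", "y", "area")

def summarize(samples):
    """Mean and population standard deviation of each field across cluster pages."""
    stats = {}
    for field in FIELDS:
        values = [getattr(s, field) for s in samples]
        stats[field] = (mean(values), pstdev(values))
    return stats

def is_typical(sample, stats, k=2.0):
    """True if every field lies within k standard deviations of the cluster mean."""
    return all(
        abs(getattr(sample, f) - stats[f][0]) <= k * stats[f][1]
        for f in FIELDS
    )
```

Under this assumption, a block labeled as a title area on the first web page but positioned far from where title areas sit on the other pages of the cluster would be a candidate for correction.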
An embodiment of the second aspect of the invention discloses a system for calculating block importance in a web page, including: an acquisition module for acquiring a first web page, the first web page including a plurality of area blocks; a parsing module for parsing the first web page to obtain a plurality of specific areas with different importance levels; an analysis module for performing semantic analysis on the plurality of specific areas and the plurality of area blocks; a classification module for classifying the plurality of area blocks according to the semantic similarity between each area block and each specific area; and a calculation module for obtaining, from the importance level of each specific area, the importance levels of the area blocks in the corresponding classification result.
In the system according to embodiments of the invention, the web page is parsed to obtain a plurality of specific areas with different importance levels, the area blocks of the page are classified by their semantic relationship to the specific areas, and the importance of each area block is obtained from the importance of the specific area corresponding to its classification result. Because the importance of an area block is derived from, among other things, the content relationship between specific areas and area blocks, the recall and precision of area block classification can be improved significantly, so that block importance is calculated with high precision and accuracy.
In addition, the system according to the above embodiment of the invention may have the following additional technical features:
In some examples, the acquisition module is further configured to acquire same-cluster web pages related to the first web page.
In some examples, the system further includes: a correction module for correcting the importance levels of the plurality of specific areas and the plurality of area blocks according to same-cluster web pages related to the first web page.
In some examples, the correction module is configured to: after obtaining the plurality of specific areas and plurality of classification results of each web page in the same cluster, calculate, for each such web page, the distribution information of each specific area and of each area block in each classification result within the corresponding web page, as well as the relationship information between the area blocks in each classification result and the corresponding specific area; calculate statistics over all of the distribution information; and correct the corresponding specific areas of the first web page and the area blocks in each classification result according to the statistics of the distribution information and the relationship information.
In some examples, the same-cluster web pages are web pages whose DOM tree structure is similar to that of the first web page.
In some examples, the distribution information includes: coordinates, occupied area, and word distribution information.
In some examples, the plurality of specific areas include: a page path navigation area, a title area of the page content, and a page copyright notice area.
In some examples, the importance level of the page copyright notice area is lower than that of the page path navigation area, and the importance level of the page path navigation area is lower than that of the title area of the page content.
In some examples, the system further includes: an evaluation module for evaluating the content of the area blocks according to their importance levels.
In some examples, the system further includes: a monitoring module for monitoring the content of the area block with the highest importance level.
Additional aspects and advantages of the invention will be set forth in part in the description that follows; in part they will become apparent from that description or be learned through practice of the invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for calculating block importance in a web page according to one embodiment of the invention;
Figs. 2A-2C are schematic diagrams of a plurality of specific areas in the method for calculating block importance in a web page according to one embodiment of the invention;
Fig. 3 is a flowchart of a method for calculating block importance in a web page according to another embodiment of the invention;
Figs. 4A and 4B are schematic diagrams of specific area blocks before correction in the method for calculating block importance in a web page according to one embodiment of the invention;
Figs. 5A and 5B are schematic diagrams of area blocks after correction in the method for calculating block importance in a web page according to one embodiment of the invention; and
Fig. 6 is a structural diagram of a system for calculating block importance in a web page according to one embodiment of the invention.
Detailed Description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.
In the description of the invention, it should be understood that orientation or position terms such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positions shown in the drawings. They are used only to simplify the description; they do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore are not to be construed as limiting the invention.
In the description of the invention, it should be noted that, unless otherwise specified and limited, the terms "mounted", "coupled", and "connected" are to be understood broadly. A connection may, for example, be mechanical or electrical, may be an internal communication between two elements, and may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms according to the circumstances.
The method and system for calculating block importance in a web page according to embodiments of the invention are described below with reference to the drawings.
Fig. 1 is a flowchart of a method for calculating block importance in a web page according to one embodiment of the invention. As shown in Fig. 1, the method according to this embodiment comprises the following steps:
Step S101: Provide a first web page that includes a plurality of area blocks. The first web page is the page whose block importance is currently to be calculated; most ordinary web pages qualify.
In one embodiment of the invention, the first web page may be a page of any website. An area block is a block of the smallest granularity in the page, that is, an independent block that cannot be divided further, such as a single title or a single link in the page; the importance of the content within each smallest-granularity block is uniform. In embodiments of the invention, calculating block importance in a web page means predicting the importance of the smallest-granularity blocks of the page.
Step S102: Parse the first web page to obtain a plurality of specific areas with different importance levels. In one embodiment of the invention, the specific areas include, but are not limited to, a page path navigation area, a title area of the page content, and a page copyright notice area. Further, the importance level of the page copyright notice area is lower than that of the page path navigation area, and the importance level of the page path navigation area is lower than that of the title area of the page content.
Here, a specific area is an area with a specific, well-defined meaning. Such areas exist on almost all pages, and their meaning and position are relatively fixed. For example, the area showing the path from the current page to the site's home page (one specific area) may be called the mypos area; the area containing the page title (one specific area) may be called the realtitle area; and the area stating the page's copyright and privacy notices (one specific area) may be called the copyright area. In examples of the invention, the specific areas of a page can be identified by web page parsing. Taking the web page http://finance.sina.com.cn/consume/puguangtai/20121220/031014059223.shtml as an example, three specific areas with different importance levels can be identified, as shown in Figs. 2A, 2B and 2C: Fig. 2A shows the mypos area, Fig. 2B the realtitle area, and Fig. 2C the copyright area.
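The three specific areas and their ordering could be modeled as below. This is a hedged sketch of a possible data model only; the patent does not prescribe one. The class names, the numeric level values, and the placeholder texts are assumptions. Only the area names (mypos, realtitle, copyright) and their relative order come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum

class Importance(IntEnum):
    """Ordering from the text: copyright < mypos < realtitle."""
    LOW = 1     # page copyright notice area
    MEDIUM = 2  # page path navigation (mypos) area
    HIGH = 3    # title of the page content (realtitle) area

@dataclass(frozen=True)
class SpecificArea:
    name: str
    importance: Importance
    text: str

# Placeholder texts for illustration, not content of the example page.
AREAS = [
    SpecificArea("mypos", Importance.MEDIUM, "home > finance > consumer"),
    SpecificArea("realtitle", Importance.HIGH, "example article title"),
    SpecificArea("copyright", Importance.LOW, "copyright example corp"),
]

def by_importance(areas):
    """Return the areas sorted from most to least important."""
    return sorted(areas, key=lambda a: a.importance, reverse=True)
```

Using an `IntEnum` keeps the levels directly comparable, which matches the text's purely ordinal description of importance.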
Step S103: Perform semantic analysis on the plurality of specific areas and on the plurality of area blocks.
Specifically, the semantic analysis determines the meaning of the content expressed by each specific area and each area block. Take again the web page http://finance.sina.com.cn/consume/puguangtai/20121220/031014059223.shtml as an example. Figs. 2A, 2B and 2C show the page's specific areas: the mypos, realtitle, and copyright areas. Semantic analysis of the three yields the following main semantics: for the mypos area, "KFC McDonald's raw-material chicken fast-growth scandal Sina Finance Life main text"; for the realtitle area, "KFC antibiotics concealment"; and for the copyright area, "Sina Corporation copyright". Semantic analysis of the area blocks is similar to that of the specific areas and is not described further here.
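The text does not specify how the main semantics are extracted. One simple stand-in, shown here purely as an assumption, is a bag-of-words term vector with stopword removal. The stopword list and function names are illustrative, and Chinese text such as the example page would additionally need a word segmenter, which this sketch omits.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def term_vector(text):
    """Lowercase the text, split on non-word characters, drop stopwords, count terms."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def main_terms(text, k=3):
    """Return up to k of the most frequent terms as a crude 'main semantics'."""
    return [term for term, _ in term_vector(text).most_common(k)]
```

The resulting term vectors give each specific area and each area block a comparable representation, which is what a similarity-based classification step needs.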
Step S104: Classify the plurality of area blocks according to the semantic similarity between each area block and each specific area.
Specifically, once the semantics of the area blocks and specific areas of the first web page have been obtained, each area block is assessed as follows. If the main semantics of an area block is most similar to the main semantics of the realtitle area, the block can be considered to express the main content of the page; it is a most-important area and has the highest block importance. If it is most similar to the semantics of the mypos area, the block can be considered to express content related to the page's main content: it supplements the main content but is not the most specific content of the page. It is a secondary area, and its importance is lower than that of blocks most similar to the realtitle area. If it is most similar to the semantics of the copyright area, the block's content can be considered to have no direct relationship to the page's main content; it is an unrelated area, and its importance is lower than that of blocks most similar to the mypos area. From this analysis, the area blocks can be classified. For the web page http://finance.sina.com.cn/consume/puguangtai/20121220/031014059223.shtml, for example, the area blocks fall into three classes. The first class consists of area blocks that are highly similar to the main semantics of the realtitle area and weakly similar to the semantics of the blocks in the second and third classes. The second class consists of area blocks that are highly similar to the semantics of the mypos area and weakly similar to the semantics of the blocks in the first and third classes. The third class consists of area blocks that are highly similar to the semantics of the copyright area and weakly similar to the semantics of the blocks in the first and second classes.
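The classification rule above (assign each block to the specific area it is most similar to) can be sketched as follows. The patent does not name a similarity measure; cosine similarity over term-count vectors is an assumption made here for illustration, as are the function names and the input format.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(blocks, areas):
    """Group block names under the specific area they are most similar to.

    blocks and areas both map a name to a Counter of terms.
    """
    result = {name: [] for name in areas}
    for block_name, vec in blocks.items():
        best = max(areas, key=lambda name: cosine(vec, areas[name]))
        result[best].append(block_name)
    return result
```

Note that a block with zero similarity to every area still lands in the first area listed; a practical system would probably need a minimum-similarity threshold or an explicit "unclassified" class, which the text does not discuss.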
Step S105: Obtain, from the importance level of each specific area, the importance levels of the area blocks in the corresponding classification result. The area blocks within one classification result share the same importance level, and area blocks in different classification results have different importance levels. For example, the area blocks of the first class (one classification result) have the highest importance level, those of the second class have a lower importance level than the first class, and those of the third class have a lower importance level than the second class.
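Step S105 amounts to letting each block inherit the level of the specific area its class corresponds to. A minimal sketch, with a hypothetical function name and with integer levels chosen only to preserve the ordering given in the text:

```python
def block_importance(classification, area_levels):
    """Give every block the importance level of the area it was classified under.

    classification maps an area name to a list of block names;
    area_levels maps an area name to its importance level (higher is more important).
    """
    return {
        block: area_levels[area]
        for area, blocks in classification.items()
        for block in blocks
    }
```

For instance, with levels `{"realtitle": 3, "mypos": 2, "copyright": 1}`, every block classified under realtitle receives level 3.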
In the method for calculating block importance in a webpage according to this embodiment of the present invention, the webpage is parsed to obtain multiple specific areas with different importance levels, the area blocks in the webpage are classified according to their semantic relationship to the specific areas, and the importance of each area block is derived from the importance of the specific area corresponding to its classification result. Because the embodiment derives block importance from the content relationship between specific areas and area blocks, it can significantly improve the recall and precision of area block classification, so the block importance calculation is highly accurate.
To further improve the accuracy of the calculated area block importance, the method in a further embodiment of the present invention also includes: correcting the importance levels of the multiple specific areas and the multiple area blocks according to same-cluster webpages related to the first webpage.
Specifically, as shown in FIG. 3, correcting the importance levels of the multiple specific areas and the multiple area blocks according to the same-cluster webpages related to the first webpage includes the following steps:
Step S301: Obtain the same-cluster webpages related to the first webpage. In one embodiment of the present invention, a same-cluster webpage is a webpage whose DOM tree structure is similar to that of the first webpage. Here, DOM stands for the HTML Document Object Model, that is, the document object model applicable to HTML/XHTML. A DOM tree is the tree structure obtained by parsing a webpage's HTML source, which lets a program traverse and access any content in the page.
Step S302: Obtain the multiple specific areas and the multiple classification results of each same-cluster webpage. These are obtained in the same way as the specific areas and classification results of the first webpage in the embodiment above.
Step S303: For each same-cluster webpage, calculate the distribution information of each specific area and of each area block in each classification result within that webpage, and the relationship information between the area blocks in each classification result and the corresponding specific area. The distribution information includes the coordinates of each specific area and each area block in its webpage, the area it occupies, and word distribution information. More specifically, it covers the text length within the area block, the semantic distribution of text terms, the number of links, the number of images, the image area, and similar information.
Step S304: Calculate statistics over all of the distribution information, for example statistical parameters such as the mean and variance. The more widely the distribution information of an area block is spread, the richer the content the block expresses. Such a block is also the part that changes frequently from page to page, and it may be the area block with the highest importance level.
Step S305: Correct the corresponding specific areas in the first webpage and the area blocks in each classification result according to the statistics of the distribution information and the relationship information. The relationship information is the distribution of the relationship between an area block and a specific area. Within a webpage, for example, it includes: the mean and variance of the positional relationship between the area block and the specific area; the mean and variance of the share of page area; the mean and variance of the text length; the text area; the text signature; the proportion of repeated links; the mean and variance of the number of links; the mean and variance of the number of images; the mean and variance of the image area; the mean and variance of the ratio of image area to block area; the mean and variance of the area of the user comment region; and the mean and variance of the ratio of user comment region area to block area. For example, if an area block is close to the realtitle area in its spatial and content relationship, the probability that the block is of high importance increases; if its content relationship is close to the copyright area or the mypos area, the probability that the block is of high importance decreases.
Consider the first webpage together with its same-cluster webpages, and suppose there are five same-cluster webpages. Suppose steps S101 to S105 place area block 1 of the first webpage in the classification result corresponding to the realtitle area, whereas in each of the five same-cluster webpages the area block at the position corresponding to area block 1 is placed in the classification result corresponding to the mypos area. From the probability distribution of the block's importance level, area block 1 should belong to the classification result corresponding to the mypos area. Area block 1 in the first webpage therefore needs to be corrected, that is, moved from the classification result corresponding to the realtitle area to the one corresponding to the mypos area.
As a specific example, multi-page information is collected for the area at the same xpath in all of the same-cluster webpages, where the xpath is the path from the root node of the DOM tree to the current node. In this example, multi-page information means the distribution information and relationship information described above. Consider again the page http://finance.sina.com.cn/consume/puguangtai/20121220/031014059223.shtml as the first webpage. If the specific areas shown in FIG. 2A and FIG. 2B are calculated from this single page alone, the analysis above shows that the result may be inaccurate or even wrong. Same-cluster webpages such as http://finance.sina.com.cn/consume/20121225/085714106842.shtml have a very similar DOM tree structure, so the specific areas of all similar pages (the same-cluster webpages) can be calculated and then used to correct possible problems in the specific areas of the single page (the first webpage). Taking the calculation of the realtitle area of the first webpage as an example, the steps are as follows:
Step 1: Find the same-cluster webpages of the first webpage according to its url. For example, the url of the first webpage is: http://finance.sina.com.cn/consume/puguangtai/20121220/031014059223.shtml
The url of another webpage is:
http://finance.sina.com.cn/consume/20121225/085714106842.shtml
Because the url of the first webpage is similar to the url of this other webpage, their DOM tree structures can be assumed to be similar as well.
Step 2: Calculate the realtitle area for each of the two urls. The results may be as shown in FIG. 4A and FIG. 4B, both of which contain errors: the realtitle area shown in FIG. 4A is incomplete, while the realtitle area shown in FIG. 4B includes unnecessary information, namely the text appearing at the bottom of FIG. 4B.
Step 3: Calculate statistics for the two realtitle areas, such as their coordinates, mean area, and word distribution, and use them to correct the realtitle area shown in FIG. 4A. The corrected result is shown in FIG. 2B.
Step 4: Repeat steps 2 and 3 to obtain the corrected results for the other specific areas of the first webpage.
Step 5: Suppose the corrected result for a certain area block or area of the first webpage is as shown in FIG. 5A. For each page, calculate the number of words in that area block or area that also appear in the realtitle area, and the ratio of that number to the number of words in the realtitle area. FIG. 5B shows the corrected related-content area, that is, the area containing information such as topics related to the main content of the webpage.
Step 6: Calculate the mean of the ratios from step 5 to obtain page-level information on the word co-occurrence ratio between the realtitle area and the area block or area. The area block can then be corrected according to this co-occurrence ratio.
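Steps 5 and 6 can be illustrated with a short sketch. It assumes each page has already been tokenized into word lists and counts distinct words, which is one possible reading of the ratio described above. The function names are hypothetical.

```python
from statistics import mean


def cooccurrence_ratio(title_words: list, block_words: list) -> float:
    """Share of distinct realtitle words that also occur in the block."""
    title = set(title_words)
    if not title:
        return 0.0
    return len(title & set(block_words)) / len(title)


def mean_cooccurrence(pages: list) -> float:
    """Average the per-page ratio over the first page and its
    same-cluster pages.

    pages is a list of (title_words, block_words) pairs, one per page.
    """
    return mean([cooccurrence_ratio(title, block) for title, block in pages])
```

A high mean ratio suggests that the block at this position regularly shares vocabulary with the title across the cluster, which is the kind of evidence the text uses to treat a block as related content.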
Step S306: Based on the corrected importance of each specific area of the first webpage, obtain the importance of the area blocks in the corresponding classification result.
The method for calculating block importance in a webpage according to this embodiment of the present invention uses the relationship between area blocks and specific areas in the same-cluster webpages to correct the area blocks and specific areas of the first webpage, including any that were identified incorrectly. The specific areas, the area blocks, and the classification results to which the blocks belong are thus obtained more accurately, which further improves the recall and precision of area block classification and makes the block importance results more reliable.
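The correction relies on same-cluster webpages, which the text defines as pages with a similar DOM tree structure, without specifying how similarity is measured. One plausible measure, shown here purely as an assumption, is the Jaccard similarity of the sets of root-to-node tag paths (xpath-like strings without indices). The sketch uses only Python's standard `html.parser`; `TagPathCollector`, `tag_paths`, and `dom_similarity` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the root-to-node tag path of every element in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html: str) -> set:
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def dom_similarity(html_a: str, html_b: str) -> float:
    """Jaccard similarity of the two pages' tag-path sets (0.0 to 1.0)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Pages whose similarity exceeds a chosen threshold could then be treated as one cluster. The threshold is a tuning choice that the source does not specify.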
In one embodiment of the present invention, after the importance levels of the area blocks are obtained, the method further includes: evaluating the content of the area blocks according to their importance levels. For example, a search engine needs to score the relationships between webpage links, as in pagerank. The present invention can identify the main content area, the related-content area, and the unrelated-content area of a page, that is, the areas corresponding to the area blocks in the classification results. Links in the unrelated-content area are generally less important, and links in the main content area and the related-content area are more important. In practice, a search engine can therefore score links differently depending on the area of the page in which they appear. For instance, FIG. 2C shows an unrelated area and FIG. 5B shows a related-content area, and the link shown in FIG. 2A receives a lower score for its relationship to this page than the link shown in FIG. 5B.
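As a rough illustration of region-aware link scoring, the sketch below weights each outgoing link by the type of area in which it appears. The numeric weights are invented for the example. The text states only that links in unrelated-content areas matter less than links in main or related-content areas. `REGION_WEIGHT` and `weight_links` are hypothetical names.

```python
# Hypothetical weights per area type; only the ordering comes from the text.
REGION_WEIGHT = {"main": 1.0, "related": 0.8, "unrelated": 0.1}


def weight_links(links: list) -> dict:
    """Map each (url, area_type) pair to a link weight.

    If the same url appears in several areas, keep its highest weight.
    Unknown area types get weight 0.0.
    """
    weights = {}
    for url, area_type in links:
        w = REGION_WEIGHT.get(area_type, 0.0)
        weights[url] = max(weights.get(url, 0.0), w)
    return weights
```

Such weights could feed a link-analysis score as per-edge multipliers, although the source does not describe how a search engine would combine them.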
In one embodiment of the present invention, the method further includes: monitoring the content of the area block with the highest importance level. In practice, many pages on the Internet engage in cheating, for example by embedding unrelated advertising in the main content of the page. By monitoring the content of the area block with the highest importance level, embodiments of the present invention locate the important area of the page, so that a search engine can analyze the content inside that area, such as its term semantic distribution and the topical relevance of its sentences, and judge whether cheating is present.
The method for calculating block importance in a webpage according to the embodiments of the present invention has the advantage of calculating block importance levels with high accuracy.
FIG. 6 is a structural diagram of a system for calculating block importance in a webpage according to one embodiment of the present invention. As shown in FIG. 6, the system 600 includes: an acquisition module 610, a parsing module 620, an analysis module 630, a classification module 640, and a calculation module 650.
The acquisition module 610 acquires the first webpage, which includes multiple area blocks. The multiple specific areas include a webpage path guidance area, a title area of the webpage content, and a webpage copyright notice area. The copyright notice area has a lower importance level than the path guidance area, and the path guidance area has a lower importance level than the title area. The parsing module 620 parses the first webpage to obtain the multiple specific areas with different importance levels. The analysis module 630 performs semantic analysis on the specific areas and the area blocks. The classification module 640 classifies the area blocks according to the semantic similarity between each area block and each specific area. The calculation module 650 obtains, from the importance level of each specific area, the importance levels of the area blocks in the corresponding classification result.
In the system for calculating block importance in a webpage according to this embodiment of the present invention, the webpage is parsed to obtain multiple specific areas with different importance levels, the area blocks in the webpage are classified according to their semantic relationship to the specific areas, and the importance of each area block is derived from the importance of the specific area corresponding to its classification result. Because the embodiment derives block importance from the content relationship between specific areas and area blocks, it can significantly improve the recall and precision of area block classification, so the block importance calculation is highly accurate.
In one embodiment of the present invention, the acquisition module 610 is further configured to acquire same-cluster webpages related to the first webpage, where a same-cluster webpage is a webpage whose DOM tree structure is similar to that of the first webpage. The system 600 further includes a correction module 660, which corrects the importance levels of the multiple specific areas and the multiple area blocks according to the same-cluster webpages. Specifically, after the specific areas and classification results of each same-cluster webpage have been obtained, the correction module 660 calculates, for each same-cluster webpage, the distribution information of each specific area and of each area block in each classification result within that webpage, together with the relationship information between the area blocks in each classification result and the corresponding specific area. It then calculates statistics over all of the distribution information, and corrects the corresponding specific areas in the first webpage and the area blocks in each classification result according to those statistics and the relationship information. In this example, the distribution information includes: coordinates (the coordinates of the specific area or area block in its webpage), occupied area (the area that the specific area or area block occupies in its webpage), and word distribution information.
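The statistics over distribution information can be sketched as a mean and variance per feature, computed across the same-cluster webpages for the area at one xpath. The `RegionBox` type and its pixel fields are assumptions made for illustration. The source names coordinates and occupied area but does not define a data layout.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class RegionBox:
    """Position and size of one area on one page (pixels, illustrative)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def distribution_stats(boxes: list) -> dict:
    """Mean and population variance of each feature across pages."""
    features = {
        "x": [b.x for b in boxes],
        "y": [b.y for b in boxes],
        "area": [b.area for b in boxes],
    }
    return {name: (mean(vals), pvariance(vals))
            for name, vals in features.items()}
```

A feature with near-zero variance across the cluster (for instance the x coordinate of a title) gives a stable reference against which an outlying value on the first webpage can be detected.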
The system for calculating block importance in a webpage according to this embodiment of the present invention uses the relationship between area blocks and specific areas in the same-cluster webpages to correct the area blocks and specific areas of the first webpage, including any that were identified incorrectly. The specific areas, the area blocks, and the classification results to which the blocks belong are thus obtained more accurately, which further improves the recall and precision of area block classification and makes the block importance results more reliable.
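The class correction performed with same-cluster webpages can be approximated by a majority vote over the labels assigned to the block at the same xpath on each page. This is a simplification chosen for illustration. The source speaks of a probability distribution of importance levels and does not prescribe a voting rule. `correct_label` and the `min_share` threshold are hypothetical.

```python
from collections import Counter


def correct_label(first_page_label: str, cluster_labels: list,
                  min_share: float = 0.5) -> str:
    """Return the corrected class for a block on the first webpage.

    cluster_labels holds the class given to the block at the same xpath on
    each same-cluster page. The first page's label is replaced only when a
    different label holds more than min_share of all votes.
    """
    votes = Counter(cluster_labels)
    votes[first_page_label] += 1  # the first page votes too
    label, count = votes.most_common(1)[0]
    if label != first_page_label and count / sum(votes.values()) > min_share:
        return label
    return first_page_label
```

With one realtitle vote from the first webpage and five mypos votes from the cluster, the function returns mypos, matching the area block 1 example in the method description.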
In one embodiment of the present invention, the system 600 further includes an evaluation module 670, which evaluates the content of the area blocks according to their importance levels. For example, a search engine needs to score the relationships between webpage links, as in pagerank. The present invention can identify the main content area, the related-content area, and the unrelated-content area of a page, that is, the areas corresponding to the area blocks in the classification results. Links in the unrelated-content area are generally less important, and links in the main content area and the related-content area are more important. In practice, a search engine can therefore score links differently depending on the area of the page in which they appear. For instance, FIG. 2C shows an unrelated area and FIG. 5B shows a related-content area, and the link shown in FIG. 2A receives a lower score for its relationship to this page than the link shown in FIG. 5B.
In one embodiment of the present invention, the system 600 further includes a monitoring module 680, which monitors the content of the area block with the highest importance level. In practice, many pages on the Internet engage in cheating, for example by embedding unrelated advertising in the main content of the page. By monitoring the content of the area block with the highest importance level, embodiments of the present invention locate the important area of the page, so that a search engine can analyze the content inside that area, such as its term semantic distribution and the topical relevance of its sentences, and judge whether cheating is present.
The system for calculating block importance in a webpage according to the embodiments of the present invention has the advantage of calculating block importance levels with high accuracy.
In this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention. The scope of the present invention is defined by the appended claims and their equivalents.
Claims (17)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310029651.9A CN103970749B (en) | 2013-01-25 | 2013-01-25 | Block importance computational methods and system in a kind of webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310029651.9A CN103970749B (en) | 2013-01-25 | 2013-01-25 | Block importance computational methods and system in a kind of webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103970749A (en) | 2014-08-06 |
CN103970749B (en) | 2017-08-25 |
Family
ID=51240264
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310029651.9A Active CN103970749B (en) | 2013-01-25 | 2013-01-25 | Block importance computational methods and system in a kind of webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103970749B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1702654A (en) * | 2004-04-29 | 2005-11-30 | 微软公司 | Method and system for calculating importance of a block within a display page |
CN101944104A (en) * | 2010-08-19 | 2011-01-12 | 百度在线网络技术(北京)有限公司 | Evaluation method and equipment for importance of webpage sub-blocks |
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090265611A1 (en) * | 2008-04-18 | 2009-10-22 | Yahoo ! Inc. | Web page layout optimization using section importance |
Also Published As
Publication number | Publication date |
---|---|
CN103970749A (en) | 2014-08-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |