CN102955810B - Web page classification method and device

Info

Publication number: CN102955810B
Application number: CN201110249270.2A
Authority: CN
Inventors: 徐萌; 何洪凌; 胡珉; 罗治国; 孙少陵; 陶涛; 陈婷; 张新访; 李成华
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: Chellona Mobile Communications Corp Cmcc; China Mobile Communication Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2011-08-26
Filing date: 2011-08-26
Publication date: 2015-12-02
Anticipated expiration: 2031-08-26
Also published as: CN102955810A

Abstract

The invention discloses a method and device for classifying webpages. The method uses the records in an existing URL category library to create virtual hierarchical URLs and to predict the category of each hierarchical URL. When a webpage needs to be classified, the URL category library is queried with the URL of that webpage; if no matching URL is found, the library is queried with an upper-level URL of that URL, and when a matching URL is found, the category of the webpage is determined from the predicted category of the matched URL. The invention improves the efficiency and success rate of web page classification.

Description

Method and device for classifying web pages

Technical field

The invention relates to the field of Internet technology, and in particular to a web page classification method and device.

Background

With the rapid development of mobile Internet technology, the number of mobile Internet users keeps growing, and behavior analysis of mobile Internet users has become a research hotspot.

In the prior art, user behavior is usually analyzed from the access logs of mobile Internet users. Specifically, the access logs are stored in the WAP (Wireless Application Protocol) gateway and record the URL (Uniform Resource Locator) of each webpage a user visits. By querying a URL category library, the category of each visited webpage can be obtained, and from it the behavior preferences of the corresponding user.

The existing web page classification method may include the following steps:

1. A crawler fetches the content of the webpage;

2. The webpage content is parsed to obtain the corresponding text;

3. The text is analyzed to extract keywords;

4. An algorithm model, for example a text classification model such as naive Bayes or SVM, performs the classification; the model is usually trained in advance on a training set.

With the above method, the webpages visited by users (or the URLs corresponding to those webpages) can be classified, and a URL category library can then be built. The URL category library in the prior art may be as shown in Table 1.

Table 1

In the course of realizing the present invention, the inventors found at least the following problems in the prior art:

In the prior art, the URL category library is a simple flat data table with no relationship between entries. To look up the category of a visited webpage accurately, a large amount of data must be stored and the library must be updated in real time. Because the Internet grows quickly and new webpages appear at a very high rate, the library cannot hold the classification of every webpage even if it is updated once a day. The fallback is to crawl and predict in real time, but predicting the category of a single webpage may take on the order of tens of minutes, and batch prediction, although it can be parallelized, still takes a long time, at least on the order of hours.

Summary of the invention

Embodiments of the present invention provide a web page classification method and device to improve the efficiency and success rate of determining webpage categories.

To achieve this purpose, an embodiment of the present invention provides a web page classification method applied to a web page classification process based on a URL category library. The URL category library records URLs of each level and the predicted category of each URL, where, for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. The method includes:

querying the URL category library with the URL of the webpage to be classified;

if no matching URL is found, querying the URL category library with an upper-level URL of that URL and, when a matching URL is found, determining the category of the webpage to be classified from the predicted category of the matched URL.

An embodiment of the present invention also provides a web page classification device applied to a web page classification process based on a Uniform Resource Locator (URL) category library. The URL category library records URLs of each level and the predicted category of each URL, where, for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. The device includes:

an upper-level URL generation module, configured to generate an upper-level URL of the URL of the webpage to be classified;

a query module, configured to query the URL category library with the URL of the webpage to be classified and, if no matching URL is found, to query the URL category library with the upper-level URL of that URL;

a determination module, configured to determine the category of the webpage to be classified from the predicted category of the matched URL when the query module finds a matching URL.

Compared with the prior art, the embodiments of the present invention divide URLs into levels, record the URLs of each level in the URL category library, and record the predicted category of each URL alongside it. When the category of a webpage must be determined, the URL of that webpage is obtained and the URL category library is checked for it; when the library holds no identical URL, the predicted category of an upper-level URL of that URL is taken as the category of the webpage. This improves the efficiency and success rate of determining webpage categories.

Description of the drawings

FIG. 1 is a schematic flow chart of generating the URL category library according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of the web page classification method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of the web page classification device according to an embodiment of the present invention.

Detailed description

To address the defects of the prior art, the embodiments of the present invention propose a technical solution for classifying webpages. In this solution, URLs are divided into levels by truncation: for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. Records for upper-level URLs are added to the existing URL category library (that is, the URL category library in the embodiments records each URL, its predicted category, and its immediate upper-level URL), together with the predicted category of each upper-level URL. When a webpage needs to be classified, the URL category library is queried with the URL of the webpage; if no matching URL is found, the library is queried with an upper-level URL of that URL, and when a matching URL is found, the category of the webpage is determined from the predicted category of the matched URL. In other words, when the library has no record of the webpage's URL, the record of an upper-level URL is looked up instead and its predicted category is used as the predicted category of the webpage, which improves the efficiency and success rate of determining webpage categories.

Dividing URLs into levels by truncation may be implemented as follows:

The URL is divided into levels at the delimiter "/". The "/" characters are located one by one from the end of the URL toward the front, and the part of the URL that ends at the preset-number-th (for example, the first) "/" counted from the end is taken as the adjacent upper-level URL of that URL (that is, its immediate upper-level URL).

For example, for the URL http://3g.sina.com.cn:80/3g/static/sina.gif?t1=1252192802, the URL itself is the first level, http://3g.sina.com.cn:80/3g/static/ is the second level, and http://3g.sina.com.cn:80/3g/ is the third level. http://3g.sina.com.cn:80/3g/static/ is the immediate upper-level URL of the original URL, and http://3g.sina.com.cn:80/3g/ is the immediate upper-level URL of http://3g.sina.com.cn:80/3g/static/.

It should be recognized that the way of determining the immediate upper-level URL in the technical solution of the embodiments is not limited to the above and may take other forms.
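The truncation rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `upper_level_url` and `url_levels` are hypothetical, the preset number of "/" is fixed at 1, and the host root (for example http://3g.sina.com.cn:80/) is assumed to be the highest level, which the text does not state explicitly.

```python
from typing import List, Optional


def upper_level_url(url: str) -> Optional[str]:
    """Return the immediate upper-level URL, or None at the highest level."""
    scheme, sep, rest = url.partition("://")
    if not sep:  # no scheme present: treat the whole string as host + path
        scheme, rest = "", url
    # Ignore a trailing "/" so the cut happens at the previous separator.
    trimmed = rest[:-1] if rest.endswith("/") else rest
    cut = trimmed.rfind("/")
    if cut == -1:  # only the host is left: nothing above this level
        return None
    return f"{scheme}{sep}{trimmed[:cut + 1]}"


def url_levels(url: str) -> List[str]:
    """List the URL itself followed by each successively higher level."""
    levels: List[str] = []
    current: Optional[str] = url
    while current is not None:
        levels.append(current)
        current = upper_level_url(current)
    return levels


if __name__ == "__main__":
    for level in url_levels(
        "http://3g.sina.com.cn:80/3g/static/sina.gif?t1=1252192802"
    ):
        print(level)
```

Because the rule only looks for "/" from the end of the string, a "/" inside a query string would also be treated as a level boundary; a stricter variant could parse the path separately.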

The technical solutions of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.

FIG. 1 is a schematic diagram of the process for building the URL category library according to an embodiment of the present invention. For ease of description, the URL category library is assumed to store URL information as a data table in which each URL corresponds to one entry. The process may include the following steps:

Step 101. Record the entries for the lowest-level URLs in the URL category library. The entry for a URL records the URL, its predicted category, and its immediate upper-level URL.

Specifically, the URLs of the webpages that users visited over a past period (for example, one month) may be used as the lowest-level URLs of the URL category library, with the predicted category of each URL obtained by the existing web page classification method. Alternatively, the URLs of certain well-known websites may be used as seeds from which a crawler collects a certain number of URLs; the collected URLs are used as the lowest-level URLs, and their predicted categories are again obtained by the existing web page classification method. Once the lowest-level URLs and their predicted categories are available, the immediate upper-level URL of each lowest-level URL is obtained, and the corresponding information (predicted category, immediate upper-level URL) is recorded against each URL in the URL category library.

Step 102. Select an entry from the URL category library and obtain the immediate upper-level URL of the URL recorded in that entry.

Specifically, the entries in the URL category library are traversed and selected in order, and the immediate upper-level URL in each selected entry is obtained.

Step 103. Determine whether the URL category library already stores an entry for that immediate upper-level URL. If so, go to step 102; otherwise, go to step 104.

Specifically, when the library already stores an entry for the immediate upper-level URL, another entry is selected; when it does not, an entry for the immediate upper-level URL must be created.

Step 104. Determine the predicted category of the immediate upper-level URL and its own immediate upper-level URL, and record them in the URL category library.

Specifically, the entries in the URL category library are traversed to collect those that share this immediate upper-level URL, and its predicted category is determined from the predicted categories of the URLs in the collected entries.
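Steps 101 to 104 can be sketched as a single pass over a work queue. The sketch below is an assumption-laden illustration, not the patented implementation: the library is modeled as an in-memory dictionary, `Entry`, `upper_level_url` and `build_library` are hypothetical names, and the category of a new upper-level URL is taken as the most frequent category among the entries that already point to it.

```python
from collections import Counter, deque
from typing import Dict, NamedTuple, Optional


class Entry(NamedTuple):
    category: str          # predicted category
    upper: Optional[str]   # immediate upper-level URL, None at the top


def upper_level_url(url: str) -> Optional[str]:
    """Cut the URL at the last "/" before its final segment."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        scheme, rest = "", url
    trimmed = rest[:-1] if rest.endswith("/") else rest
    cut = trimmed.rfind("/")
    return None if cut == -1 else f"{scheme}{sep}{trimmed[:cut + 1]}"


def build_library(lowest_level: Dict[str, str]) -> Dict[str, Entry]:
    # Step 101: record every lowest-level URL with its category and upper URL.
    library = {u: Entry(c, upper_level_url(u)) for u, c in lowest_level.items()}
    queue = deque(library)  # entries still to be visited, in insertion order
    while queue:
        url = queue.popleft()                    # Step 102: select an entry
        upper = library[url].upper
        if upper is None or upper in library:    # Step 103: already recorded
            continue
        # Step 104: derive the category from entries sharing this upper URL.
        votes = Counter(e.category for e in library.values() if e.upper == upper)
        library[upper] = Entry(votes.most_common(1)[0][0], upper_level_url(upper))
        queue.append(upper)  # the new entry may itself need an upper-level entry
    return library
```

As in the described flow, an upper-level entry is created the first time it is found missing, so its category reflects only the entries present at that moment; a deepest-level-first traversal would be one way to make the result independent of ordering.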

The predicted category of an immediate upper-level URL may be determined as follows:

Obtain from the URL category library all URLs whose immediate upper-level URL is the URL whose category is to be predicted; count the URLs of each predicted category among them; and take the predicted category with the most URLs as the predicted category of that URL.

For example, consider the following 4 URLs:

http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464 Predicted category: history

http://www.chinaweekly.cn/bencandy.php?fid=48&id=5463 Predicted category: history

http://www.chinaweekly.cn/bencandy.php?fid=48&id=5344 Predicted category: history

http://www.chinaweekly.cn/bencandy.php?fid=49&id=5449 Predicted category: current affairs commentary

The four URLs share the same immediate upper-level URL, http://www.chinaweekly.cn/. Of its lower-level URLs in the adjacent level, 3 have the predicted category history and 1 has the predicted category current affairs commentary, so the predicted category of this upper-level URL is history.
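The counting rule amounts to a majority vote over the categories of the lower-level URLs. A minimal sketch follows, assuming the categories have already been collected into a list; the function name `majority_category` is hypothetical, and ties, which the text does not address, fall to whichever category was seen first.

```python
from collections import Counter
from typing import Iterable


def majority_category(lower_level_categories: Iterable[str]) -> str:
    """Return the category held by the most lower-level URLs."""
    counts = Counter(lower_level_categories)
    if not counts:
        raise ValueError("at least one lower-level URL is required")
    # most_common(1) yields the (category, count) pair with the highest count.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Categories of the four chinaweekly.cn URLs from the example.
    categories = ["history", "history", "history", "current affairs commentary"]
    print(majority_category(categories))  # prints: history
```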

Note that, in the technical solution of the embodiments, the URL category library may also record a predicted probability for each URL. In that case the entry for a URL contains the URL, its predicted category, its predicted probability, and its immediate upper-level URL. For a lowest-level URL, the predicted category and predicted probability are determined by the existing web page classification method; for URLs at the other levels, they are determined from the predicted categories and predicted probabilities of the URLs one level below.

Specifically, the predicted category and predicted probability of an upper-level URL may be determined from those of its lower-level URLs as follows:

Obtain from the URL category library all URLs whose immediate upper-level URL is the URL whose category and probability are to be predicted; for each predicted category, compute the weighted average of the predicted probabilities of the URLs in that category; take the category with the highest weighted average as the predicted category of the URL to be predicted, and take the average of the predicted probabilities of the URLs in that category as its predicted probability.

Taking the 4 URLs above as an example again, suppose their predicted probabilities are 80%, 79%, 81% and 80% in that order. In this example the weighted average for a category is the sum of the predicted probabilities of its URLs divided by the total number of URLs obtained. Among the 4 URLs, the weighted average for history is therefore 60% ((80%+79%+81%)/(3+1)), and the weighted average for current affairs commentary is 20% (80%/(3+1)). The predicted category of the upper-level URL adjacent to these 4 URLs is thus history, with a predicted probability of 60%.
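The arithmetic of the worked example can be checked with a short sketch. It follows the example's numbers, in which each category's probability sum is divided by the total number of lower-level URLs and the winning value (60%) is also reported as the upper-level URL's predicted probability; the function name `aggregate` and the (category, probability) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(lower_level: Iterable[Tuple[str, float]]) -> Tuple[str, float]:
    """Return (predicted category, predicted probability) for an upper URL.

    Each item is (category, probability) of one lower-level URL.
    """
    items = list(lower_level)
    if not items:
        raise ValueError("at least one lower-level URL is required")
    sums: Dict[str, float] = defaultdict(float)
    for category, probability in items:
        sums[category] += probability
    # Weighted average as in the example: category sum / total URL count.
    weighted = {c: s / len(items) for c, s in sums.items()}
    best = max(weighted, key=lambda c: weighted[c])
    return best, weighted[best]


if __name__ == "__main__":
    example = [
        ("history", 0.80),
        ("history", 0.79),
        ("history", 0.81),
        ("current affairs commentary", 0.80),
    ]
    category, probability = aggregate(example)
    print(category, round(probability, 2))  # prints: history 0.6
```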

The above process can be implemented by a computer program, or the URL category library can be configured manually according to the same principles.

It should be recognized that, in the technical solution of the embodiments, when the URL category library has no record of the URL of the webpage to be classified, the category is not necessarily determined by querying the upper-level URLs one level at a time; it may also be determined by directly querying the predicted category of the URL two levels up, or of any other upper-level URL of that URL. Likewise, the method of determining the predicted category of an immediate upper-level URL is not limited to the one described in the above process and may take other forms.

Through the above process, the upper-level URLs of the URLs recorded in the existing URL category library can be determined and their entries stored in the library, so that the stored entries form a multi-level structure. The data structure of the URL information in the updated URL category library may be as shown in Table 2:

Table 2

Name         Note
url          URL
url_label    Predicted category
prediction   Predicted probability
faurlevel    Immediate upper-level URL

The fields have the following meanings:

url: URL of the webpage (String, UTF-8)

url_label: predicted category of the URL (String, UTF-8)

prediction: predicted probability of the URL (Double)

faurlevel: immediate upper-level URL (String, UTF-8)
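One possible in-memory representation of such an entry is sketched below. The field names come from Table 2; the class name `UrlEntry`, the use of a Python dataclass, and keying the library by URL are assumptions made only for illustration. The sample values are taken from the chinaweekly.cn example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UrlEntry:
    url: str                   # URL of the webpage
    url_label: str             # predicted category of the URL
    prediction: float          # predicted probability of the URL
    faurlevel: Optional[str]   # immediate upper-level URL (None if absent)


# Keying the library by URL makes the exact-match query a dictionary lookup.
library: Dict[str, UrlEntry] = {}
entry = UrlEntry(
    url="http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464",
    url_label="history",
    prediction=0.80,
    faurlevel="http://www.chinaweekly.cn/",
)
library[entry.url] = entry
```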

Based on the above URL category library, an embodiment of the present invention provides a web page classification method. FIG. 2 is a schematic flow chart of the method, which may include the following steps:

Step 201. Obtain the URL of the webpage to be classified and check whether the URL category library has a record of that URL.

Step 202. If the URL category library has a record of the same URL, go to step 204; otherwise, go to step 203.

Step 203. Generate the immediate upper-level URL of the URL, check whether the URL category library has a record of it, and go to step 202.

Step 204. Determine the predicted category of the matched URL as the category of the webpage to be classified.

Specifically, the prior-art solution performs an exact-match query on the URL in the URL category library: if a corresponding entry is found, the predicted category of the URL is returned; if not, a null value is returned.

In the technical solution of the embodiments, by contrast, URLs are divided into levels and the entries for upper-level URLs are stored in the URL category library. When a webpage needs to be classified, an exact match on its URL is attempted first; if the library stores no entry for that URL, the immediate upper-level URL is generated and looked up in the library, and the predicted category of the matched upper-level URL is used as the predicted category of the webpage's URL.

For example, suppose the URL of the webpage to be classified is http://sports.sina.com.cn/k/2011-05-18/09415581512.shtml and the current URL category library has no record of it. The immediate upper-level URL, http://sports.sina.com.cn/k/2011-05-18/, is then generated and looked up. If the library stores an entry for that upper-level URL, its predicted category (for example, sports) is obtained from the library and used as the predicted category of the webpage's URL.

Note that if the highest-level URL corresponding to the URL of the webpage to be classified has been reached and the URL category library still has no matching record, a query-failure response is returned.
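The lookup of steps 201 to 204, including the failure case, can be sketched as a loop that climbs one level per miss. The sketch assumes the library is a dictionary from URL to predicted category and represents the query-failure response as `None`; `classify` and `upper_level_url` are hypothetical names.

```python
from typing import Dict, Optional


def upper_level_url(url: str) -> Optional[str]:
    """Cut the URL at the last "/" before its final segment."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        scheme, rest = "", url
    trimmed = rest[:-1] if rest.endswith("/") else rest
    cut = trimmed.rfind("/")
    return None if cut == -1 else f"{scheme}{sep}{trimmed[:cut + 1]}"


def classify(url: str, library: Dict[str, str]) -> Optional[str]:
    """Return the predicted category, or None when the query fails."""
    current: Optional[str] = url               # Step 201: start from the page URL
    while current is not None:
        if current in library:                 # Step 202: exact match found
            return library[current]            # Step 204: use its category
        current = upper_level_url(current)     # Step 203: climb one level
    return None                                # highest level reached, no match


if __name__ == "__main__":
    library = {"http://sports.sina.com.cn/k/2011-05-18/": "sports"}
    page = "http://sports.sina.com.cn/k/2011-05-18/09415581512.shtml"
    print(classify(page, library))  # prints: sports
```

Each miss costs one dictionary lookup, so the worst case is bounded by the number of "/" separators in the URL rather than by the size of the library.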

In the embodiments of the present invention, when a new lowest-level URL is added to the URL category library, the categories in the library may be updated by event trigger, manual trigger, or similar means. Specifically, the lowest-level URLs stored in the library may be traversed again and divided into levels, and the corresponding upper-level URLs and their predicted categories obtained afresh. Alternatively, only the predicted categories of the upper-level URLs related to the newly added lowest-level URL may be updated. The specific implementation is not described further here.
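Since the text leaves the update unspecified, the following is only one possible sketch of the second option, in which just the upper-level URLs of the new URL are refreshed. It assumes an in-memory dictionary, a majority vote for each affected upper-level URL, and hypothetical names (`Entry`, `upper_level_url`, `add_lowest_level_url`).

```python
from collections import Counter
from typing import Dict, NamedTuple, Optional


class Entry(NamedTuple):
    category: str          # predicted category
    upper: Optional[str]   # immediate upper-level URL, None at the top


def upper_level_url(url: str) -> Optional[str]:
    """Cut the URL at the last "/" before its final segment."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        scheme, rest = "", url
    trimmed = rest[:-1] if rest.endswith("/") else rest
    cut = trimmed.rfind("/")
    return None if cut == -1 else f"{scheme}{sep}{trimmed[:cut + 1]}"


def add_lowest_level_url(library: Dict[str, Entry], url: str, category: str) -> None:
    """Record a new lowest-level URL and refresh only its upper-level URLs."""
    library[url] = Entry(category, upper_level_url(url))
    upper = library[url].upper
    while upper is not None:
        # Re-run the vote over every entry directly below this upper URL.
        votes = Counter(e.category for e in library.values() if e.upper == upper)
        # Create the upper-level entry if missing, otherwise overwrite it.
        library[upper] = Entry(votes.most_common(1)[0][0], upper_level_url(upper))
        upper = library[upper].upper
```

The scan over `library.values()` is linear in the library size for each level; a production version would presumably index entries by their upper-level URL, and would need a policy for upper-level URLs that were themselves classified directly.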

Based on the same technical concept as the above web page classification method, an embodiment of the present invention also provides a web page classification device that can be applied to the above method based on the URL category library. The URL category library records URLs of each level, where, for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, and a predicted category is recorded for each URL.

FIG. 3 is a schematic structural diagram of the web page classification device according to an embodiment of the present invention. The device may include:

an upper-level URL generation module 31, configured to generate an upper-level URL of the URL of the webpage to be classified;

a query module 32, configured to query the URL category library with the URL of the webpage to be classified and, if no matching URL is found, to query the URL category library with the upper-level URL of that URL;

a determination module 33, configured to determine the category of the webpage to be classified from the predicted category of the matched URL when the query module 32 finds a matching URL.

The upper-level URL generation module 31 is specifically configured to generate the immediate upper-level URL of the URL when the query module 32 finds no matching URL;

the query module 32 specifically queries the predicted category of an upper-level URL of the URL of the webpage to be classified through the following process:

Step A. Obtain the immediate upper-level URL of the URL and check whether the URL category library has a record of it;

Step B. If the URL category library has a record of the same URL, go to step C; otherwise, go to step A;

Step C. Obtain the predicted category of the matched URL;

the determination module 33 is specifically configured to determine the predicted category of the URL found by the query module 32 as the category of the webpage to be classified.

The determination module 33 is further configured to return a query-failure response when the query module 32 has reached the highest-level URL corresponding to the URL of the webpage to be classified and still finds no matching record in the URL category library.
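The division of work among modules 31, 32 and 33 might be mirrored in software as three small classes. This is a sketch only: the class and method names are hypothetical, the library is assumed to be a dictionary from URL to predicted category, and the query-failure response is modeled as a `LookupError`.

```python
from typing import Dict, Optional


class UpperUrlGenerationModule:
    """Counterpart of module 31: produces the immediate upper-level URL."""

    def generate(self, url: str) -> Optional[str]:
        scheme, sep, rest = url.partition("://")
        if not sep:
            scheme, rest = "", url
        trimmed = rest[:-1] if rest.endswith("/") else rest
        cut = trimmed.rfind("/")
        return None if cut == -1 else f"{scheme}{sep}{trimmed[:cut + 1]}"


class QueryModule:
    """Counterpart of module 32: finds the nearest recorded URL."""

    def __init__(self, library: Dict[str, str], generator: UpperUrlGenerationModule):
        self.library = library
        self.generator = generator

    def query(self, url: str) -> Optional[str]:
        current: Optional[str] = url
        # Steps A and B: climb one level at a time until a record matches.
        while current is not None and current not in self.library:
            current = self.generator.generate(current)
        return current  # the matched URL, or None past the highest level


class DeterminationModule:
    """Counterpart of module 33: turns a match into a category."""

    def __init__(self, library: Dict[str, str], query_module: QueryModule):
        self.library = library
        self.query_module = query_module

    def determine(self, url: str) -> str:
        matched = self.query_module.query(url)
        if matched is None:
            raise LookupError(f"query failed: no record for {url} or any upper-level URL")
        return self.library[matched]  # Step C: category of the matched URL
```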

其中，所述网页分类设备还包括：URL类别库维护模块34；Wherein, the webpage classification device also includes: URL category library maintenance module 34;

上层URL生成模块31具体用于，遍历所述URL类别库中的URL，并当遍历到一个URL时，从所述URL类别库中选择该URL，并根据选择出的URL生成该URL的上一层级URL；The upper-level URL generation module 31 is specifically used to traverse the URLs in the URL category library, and when traversing to a URL, select the URL from the URL category library, and generate the last URL of the URL according to the selected URL. Hierarchy URL;

查询模块32具体用于，根据上层URL生成模块31生成的上一层级URL查询URL类别库；The query module 32 is specifically used to query the URL category library according to the upper-level URL generated by the upper-level URL generation module 31;

URL类别维护模块34用于，当查询模块32未查询到匹配的URL时，确定该上一层级URL的预测类别，并将该上一层级URL及其预测类别记录在所述URL类别库中。The URL category maintenance module 34 is configured to, when the query module 32 finds no matching URL, determine the predicted category of the upper-level URL, and record the upper-level URL and its predicted category in the URL category library.

其中，URL类别库维护模块34具体用于，根据URL的下一层级URL的预测类别确定除最低层级以外其余层级的URL的预测类别。Wherein, the URL category library maintenance module 34 is specifically configured to determine the predicted categories of URLs in other levels except the lowest level according to the predicted categories of URLs in the next level of URLs.

其中，URL类别库维护模块34具体用于，从所述URL类别库中获取其上一层级URL为待预测类别的URL的所有URL；确定获取到的URL中各预测类别的URL的数量；将其中URL数量最多的预测类别确定为该待预测类别的URL的预测类别。Wherein, the URL category library maintenance module 34 is specifically used to obtain all URLs whose upper-level URLs are URLs of the category to be predicted from the URL category library; determine the quantity of URLs of each predicted category in the obtained URLs; The predicted category with the largest number of URLs is determined as the predicted category of URLs of the category to be predicted.

Each URL in the URL category library also has a corresponding predicted probability.

The URL category library maintenance module 34 is specifically configured to determine the predicted category and predicted probability of each URL above the lowest level according to the predicted categories and predicted probabilities of the URLs one level below it.

The URL category library maintenance module 34 is specifically configured to: obtain from the URL category library all URLs whose upper-level URL is the URL whose category and probability are to be predicted; for each predicted category, calculate the weighted average of the predicted probabilities of the obtained URLs in that category; and take the predicted category with the highest weighted average as the predicted category of the URL to be predicted, and the average of the predicted probabilities of the URLs in that category as its predicted probability.
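A hedged Python sketch of this probability-based rule follows. The text does not say where the weights come from, so the example assumes each lower-level URL arrives with a caller-supplied positive weight; the function name `predict_with_probability` and the tuple layout are likewise hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple

# (predicted category, predicted probability, weight); the weight is an assumption.
Child = Tuple[str, float, float]


def predict_with_probability(children: Iterable[Child]) -> Optional[Tuple[str, float]]:
    """Pick the category with the highest weighted-average probability.

    The returned probability is the plain (unweighted) mean of the
    probabilities in the winning category. Weights must be positive.
    """
    probs = defaultdict(list)
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for category, prob, weight in children:
        probs[category].append(prob)
        weighted_sum[category] += weight * prob
        weight_total[category] += weight
    if not probs:
        return None
    best = max(probs, key=lambda c: weighted_sum[c] / weight_total[c])
    return best, sum(probs[best]) / len(probs[best])


# Hypothetical lower-level URLs of one upper-level URL.
children = [("news", 0.75, 3.0), ("news", 0.5, 1.0), ("sports", 0.625, 1.0)]
result = predict_with_probability(children)
# "news" has weighted average 0.6875, "sports" has 0.625, so "news" is chosen;
# its plain mean (0.75 + 0.5) / 2 = 0.625 becomes the predicted probability.
```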

When a new URL is added to the URL category library:

The upper-level URL generation module 31 is further configured to generate the upper-level URL of the new URL;

The query module 32 is specifically configured to query the URL category library according to the upper-level URL of that URL;

The URL category library maintenance module 34 is specifically configured to update the predicted category of the upper-level URL if the query module 32 finds a matching URL, and to record the upper-level URL and its corresponding predicted category in the URL category library if the query module 32 finds no matching URL.
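The incremental maintenance just described might be sketched in Python as below. This is an illustration under assumptions: the names `add_url` and `_majority`, the dict-based library, the `/` delimiter, the majority-count prediction rule, and the choice to propagate the update through every higher level are not specified by the text at this point.

```python
from collections import Counter
from typing import Dict, List, Tuple


def _majority(parent: str, library: Dict[str, str]) -> str:
    """Most frequent category among the direct children of `parent`."""
    prefix = parent + "/"
    votes = Counter(
        category
        for url, category in library.items()
        if url.startswith(prefix) and "/" not in url[len(prefix):]
    )
    return votes.most_common(1)[0][0]


def add_url(library: Dict[str, str], url: str, category: str) -> Tuple[List[str], List[str]]:
    """Add `url`, then refresh or create each of its upper-level URLs.

    Returns (updated, created): upper-level URLs that already existed and had
    their predicted category recomputed, and those newly recorded.
    """
    library[url] = category
    updated: List[str] = []
    created: List[str] = []
    parent = url.rpartition("/")[0]
    while parent:
        (updated if parent in library else created).append(parent)
        library[parent] = _majority(parent, library)
        parent = parent.rpartition("/")[0]
    return updated, created
```

Because each upper-level URL always has at least one direct child by the time it is visited, `_majority` never sees an empty count in this sketch.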

The upper-level URL generation module 31 is specifically configured to divide a URL into levels according to the delimiters in the URL, and to take the fields preceding a preset number of delimiters, counted backward from the end of the URL, as the upper-level URL of that URL.
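A minimal Python sketch of this delimiter-based truncation is given below. The function name `upper_level_url`, the default `/` delimiter, the parameter name `steps` for the preset number of delimiters, and the assumption of scheme-less URLs (no `http://` prefix) are all illustrative choices rather than details from the text.

```python
from typing import Optional


def upper_level_url(url: str, delimiter: str = "/", steps: int = 1) -> Optional[str]:
    """Return the part of `url` before the `steps`-th delimiter counted from the end.

    Returns None when the URL contains fewer than `steps` delimiters, i.e. it
    is already at the highest level under this scheme.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    parts = url.rsplit(delimiter, steps)
    if len(parts) <= steps:
        return None
    return parts[0]


one_up = upper_level_url("wap.example.com/news/sports/1.html")           # "wap.example.com/news/sports"
two_up = upper_level_url("wap.example.com/news/sports/1.html", steps=2)  # "wap.example.com/news"
top = upper_level_url("wap.example.com")                                 # None
```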

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

Those skilled in the art will understand that the drawings are only schematic diagrams of a preferred embodiment, and that the modules or processes shown in the drawings are not necessarily required for implementing the present invention.

Those skilled in the art will understand that the modules of the device in an embodiment may be distributed in the device as described in that embodiment, or may, with corresponding changes, be located in one or more devices different from that embodiment. The modules of the above embodiments may be combined into a single module or further split into multiple sub-modules.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

The above discloses only a few specific embodiments of the present invention. The present invention is not limited to them, however, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims

1. A webpage classification method, characterized in that it is applied to a webpage classification process implemented on the basis of a Uniform Resource Locator (URL) category library, the URL category library recording URLs of each level and the predicted category of each URL, wherein, of two URLs at adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, the method comprising:

Query the URL category library according to the URL of the webpage to be classified;

If no matching URL is found, querying the URL category library according to an upper-level URL of the URL, and, when a matching URL is found, determining the category of the webpage to be classified according to the predicted category of the matched URL;

Wherein, the generating process of the URL category library includes:

Traversing the URLs in the URL category library, and when traversing to a URL, selecting the URL from the URL category library, and generating the upper-level URL of the URL according to the selected URL;

Judging whether the generated upper-level URL already exists in the URL category library, and, when it does not exist in the URL category library, determining the predicted category of the upper-level URL and recording the upper-level URL and its predicted category in the URL category library.

2. The method according to claim 1, wherein said querying the URL category library according to the upper-level URL of the URL comprises:

Step A, generating the upper-level URL of the URL, and querying whether the upper-level URL is recorded in the URL category library;

Step B. If the same URL is recorded in the URL category library, go to step C; otherwise, go to step A;

Step C, obtaining the predicted category of the queried URL.

3. The method according to any one of claims 1-2, characterized in that the predicted category of each URL other than those at the lowest level is determined according to the predicted categories of the URLs one level below that URL.

4. The method according to claim 3, wherein determining the predicted category of an upper-level URL according to the predicted categories of the URLs one level below it is specifically:

Obtaining from the URL category library all URLs whose upper-level URL is the URL whose category is to be predicted;

Determining the number of obtained URLs in each predicted category;

Taking the predicted category with the largest number of URLs as the predicted category of the URL to be predicted.

5. The method according to claim 3, characterized in that each URL in the URL category library also has a corresponding predicted probability;

Determining the predicted category and predicted probability of an upper-level URL according to the predicted categories and predicted probabilities of the URLs one level below it is specifically:

Obtaining from the URL category library all URLs whose upper-level URL is the URL whose category and probability are to be predicted;

For each predicted category, calculating the weighted average of the predicted probabilities of the obtained URLs in that category;

Taking the predicted category with the highest weighted average as the predicted category of the URL to be predicted, and taking the average of the predicted probabilities of the URLs in that category as the predicted probability of the URL to be predicted.

6. The method according to claim 1, wherein, when a new URL is added to the URL category library, the upper-level URL of the URL is generated and the URL category library is queried according to that upper-level URL; if a matching URL is found, the predicted category of the upper-level URL is updated; if no matching URL is found, the upper-level URL and its corresponding predicted category are recorded in the URL category library.

7. The method according to claim 1, characterized in that determining the upper-level URL of a URL is specifically:

Dividing the URL into levels according to the delimiters in the URL, and taking the fields preceding a preset number of delimiters, counted backward from the end of the URL, as the upper-level URL of the URL.

8. A webpage classification device, characterized in that it is applied to a webpage classification process implemented on the basis of a Uniform Resource Locator (URL) category library, the URL category library recording URLs of each level and the predicted category of each URL, wherein, of two URLs at adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, the device comprising:

The upper-level URL generation module is used to generate the upper-level URL of the URL according to the URL of the webpage to be classified;

The query module is used to query the URL category library according to the URL of the webpage to be classified and, if no matching URL is found, to query the URL category library according to the upper-level URL of the URL;

A determining module, configured to determine the category of the webpage to be classified according to the predicted category of the matched URL when the query module finds a matching URL;

Wherein the device further comprises a URL category library maintenance module;

The upper-level URL generation module is specifically configured to traverse the URLs in the URL category library and, when a URL is reached in the traversal, to select that URL from the URL category library and generate the upper-level URL of the URL according to the selected URL;

The query module is specifically used to query the URL category library according to the upper-level URL generated by the upper-level URL generation module;

The URL category library maintenance module is used to determine the predicted category of the upper-level URL when the query module does not find a matching URL, and to record the upper-level URL and its predicted category in the URL category library.

9. The device according to claim 8, wherein

The upper-level URL generation module is specifically used to generate an upper-level URL of the URL when the query module does not find a matching URL;

The query module specifically queries the predicted category of the upper-level URL of the URL of the webpage to be classified by the following process:

Step A. Obtain the upper-level URL of the URL, and check whether the upper-level URL is recorded in the URL category library;

Step C, obtaining the predicted category of the queried URL;

The determining module is specifically configured to take the predicted category of the URL found by the query module as the category of the webpage to be classified.

10. The device according to any one of claims 8-9, wherein the URL category library maintenance module is specifically configured to determine the predicted category of each URL other than those at the lowest level according to the predicted categories of the URLs one level below that URL.

11. The device according to claim 10, wherein the URL category library maintenance module is specifically configured to: obtain from the URL category library all URLs whose upper-level URL is the URL whose category is to be predicted; determine the number of obtained URLs in each predicted category; and take the predicted category with the largest number of URLs as the predicted category of the URL to be predicted.

12. The device according to claim 10, characterized in that each URL in the URL category library also has a corresponding predicted probability;

The URL category library maintenance module is specifically used to: obtain from the URL category library all URLs whose upper-level URL is the URL whose category and probability are to be predicted; for each predicted category, calculate the weighted average of the predicted probabilities of the obtained URLs in that category; and take the predicted category with the highest weighted average as the predicted category of the URL to be predicted, and the average of the predicted probabilities of the URLs in that category as the predicted probability of the URL to be predicted.

13. The device according to claim 10, wherein, when a new URL is added to the URL category library,

The upper-layer URL generation module is also used to generate an upper-layer URL of the URL;

The query module is specifically used to query the URL category library according to the upper URL of the URL;

The URL category library maintenance module is specifically used to update the predicted category of the upper-level URL if the query module finds a matching URL, and to record the upper-level URL and its corresponding predicted category in the URL category library if the query module does not find a matching URL.

14. The device according to claim 8, wherein the upper-level URL generation module is specifically configured to divide the URL into levels according to the delimiters in the URL, and to take the fields preceding a preset number of delimiters, counted backward from the end of the URL, as the upper-level URL of the URL.