CN106611008A

CN106611008A - Method and device for managing internet content labels

Info

Publication number: CN106611008A
Application number: CN201510703822.0A
Authority: CN
Inventors: 赵耀红; 高丹; 熊龙; 邓超
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2015-10-26
Filing date: 2015-10-26
Publication date: 2017-05-03
Anticipated expiration: 2035-10-26
Also published as: CN106611008B

Abstract

The invention discloses a method for managing Internet content tags. The method includes creating a content tag library, and further includes: acquiring first content tag trees corresponding to different websites, and determining, for each first content tag tree, the category tag of the category to which it belongs; and matching the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and updating the content tag library according to the matching result. The invention also discloses a device for managing Internet content tags.

Description

Method and device for managing Internet content tags

Technical field

The invention relates to the technical field of the Internet, and in particular to a method and device for managing Internet content tags.

Background

With the rapid development of the Internet, more and more content providers keep emerging, and every field has several of them. Each content provider has its own content tag system. These tag systems differ from one another, but some of their tags are identical or highly similar. Building user profiles from the Internet content that users access and from their behavior, and pushing personalized, precisely targeted content to customers, both urgently need a unified and complete Internet content tag system that makes personalized content push based on user access or user behavior convenient and flexible. At present, no scheme exists for merging different content tag systems.

Summary of the invention

In view of this, embodiments of the present invention aim to provide a method and device for managing Internet content tags that can form a unified Internet content tag system and make personalized content push based on user access or user behavior convenient and flexible.

To achieve the above object, the technical solution of the embodiments of the present invention is implemented as follows:

An embodiment of the present invention provides a method for managing Internet content tags, including creating a content tag library. The method further includes:

acquiring first content tag trees corresponding to different websites, and determining, for each first content tag tree, the category tag of the category to which it belongs;

matching the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and updating the content tag library according to the matching result.

In the above solution, acquiring the first content tag trees corresponding to different websites includes:

acquiring the website domain names of the different websites and at least one uniform resource locator (URL) under each website domain name, and, based on the website domain name and the URL, using each website's content tag rules to determine the first content tag tree corresponding to that website.

In the above solution, determining the category tag of the category to which each first content tag tree belongs includes:

reading the website domain name of the website corresponding to each first content tag tree, and determining the category tag of the category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library.

In the above solution, matching the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag includes:

acquiring, from the content tag library, the second content tag tree whose root node content tag is the same as the category tag, and, in combination with semantic analysis, matching the content tags of the first content tag tree level by level against the content tags at each level of the second content tag tree, in left-to-right or top-down order.

In the above solution, updating the content tag library according to the matching result includes:

when it is determined that the second content tag tree contains no content tag identical or similar to the content tag, adding the content tag at the corresponding level of the second content tag tree;

when it is determined that the second content tag tree contains a content tag similar to the content tag, updating the name of that similar content tag; here, a content tag similar to the content tag is one whose name differs from that of the content tag but whose parent tag and child tags are the same.

In the above solution, after the content tag library is updated according to the matching result, the method further includes:

starting from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, matching the content tags in the second content tag tree against the content tags at the corresponding level of the first content tag tree, in left-to-right or top-down order; if a content tag in the second content tag tree does not exist among the content tags at the corresponding level of the first content tag tree, deleting that content tag from the second content tag tree.

An embodiment of the present invention also provides a device for managing Internet content tags. The device includes a creation module, an acquisition module, and an update module, wherein:

the creation module is configured to create a content tag library;

the acquisition module is configured to acquire first content tag trees corresponding to different websites, and to determine, for each first content tag tree, the category tag of the category to which it belongs;

the update module is configured to match the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and to update the content tag library according to the matching result.

In the above solution, the acquisition module is specifically configured to acquire the website domain names of the different websites and at least one uniform resource locator (URL) under each website domain name, and, based on the website domain name and the URL, to use each website's content tag rules to determine the first content tag tree corresponding to that website.

In the above solution, the acquisition module is specifically configured to read the website domain name of the website corresponding to each first content tag tree, and to determine the category tag of the category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library.

In the above solution, the update module is specifically configured to acquire, from the content tag library, the second content tag tree whose root node content tag is the same as the category tag, and, in combination with semantic analysis, to match the content tags of the first content tag tree level by level against the content tags at each level of the second content tag tree, in left-to-right or top-down order.

In the above solution, the update module is specifically configured to add the content tag at the corresponding level of the second content tag tree when it is determined that the second content tag tree contains no content tag identical or similar to the content tag;

In the above solution, the update module is further configured to start from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree and to match the content tags in the second content tag tree against the content tags at the corresponding level of the first content tag tree, in left-to-right or top-down order; if a content tag in the second content tag tree does not exist among the content tags at the corresponding level of the first content tag tree, the update module deletes that content tag from the second content tag tree.

The method and device for managing Internet content tags provided by the embodiments of the present invention create a content tag library, acquire first content tag trees corresponding to different websites, determine for each first content tag tree the category tag of the category to which it belongs, match the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and update the content tag library according to the matching result. In this way, the different content tag systems used by different Internet content providers can be merged into a unified Internet content tag system. This facilitates later analysis of users' access history, the construction of user profiles, and personalized recommendation and real-time marketing services based on user preference tags; it improves the experience of users accessing Internet content and makes personalized content recommendation based on user access or user behavior convenient and flexible.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for managing Internet content tags according to Embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of the first content tag tree according to Embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of the second content tag tree in the content tag library according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the first content tag tree according to Embodiment 2 of the present invention;

Fig. 5 is a schematic flowchart of the method for managing Internet content tags according to Embodiment 2 of the present invention;

Fig. 6 is a schematic diagram of the structure of the device for managing Internet content tags according to an embodiment of the present invention.

Detailed description

In the embodiments of the present invention, a content tag library is created, and the method further includes: acquiring first content tag trees corresponding to different websites, and determining, for each first content tag tree, the category tag of the category to which it belongs; and matching the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and updating the content tag library according to the matching result.

Embodiment 1

Fig. 1 is a schematic flowchart of the method for managing Internet content tags according to this embodiment. As shown in Fig. 1, the method includes:

Step 100: Create a content tag library;

Note that this step only needs to be performed the first time the method for managing Internet content tags is executed; afterwards the content tag library can be used directly.

Step 101: Acquire first content tag trees corresponding to different websites, and determine, for each first content tag tree, the category tag of the category to which it belongs;

Here, the acquisition may be periodic, and the period may be set according to actual needs, for example two weeks.

In one embodiment, acquiring the first content tag trees corresponding to different websites includes:

acquiring the website domain names (HOST) of the different websites and at least one uniform resource locator (URL) under each website domain name, and, based on the website domain name and the URL, using each website's content tag rules to determine the first content tag tree corresponding to that website;

Here, the website domain names (HOST) of the different websites and at least one URL under each website domain name can be obtained with an Internet crawler tool or through an existing tag data interface of another external system;

The website of each Internet content provider has its own content tag system, that is, its own content tag rules, which as a whole form a tree structure. For example, Baidu includes Baidu News, Baidu Zhidao, Baidu Video, Baidu Maps, and so on, and Baidu News in turn includes technology, entertainment, society, military, and so on. Based on the website domain name of a website and at least one URL under that website, the website's content tag rules are used to determine the first content tag tree corresponding to the website. The first content tag tree as a whole is a tree structure in which all content tags are arranged in levels and content tags at adjacent levels are in a membership relationship: the content tag nearer the root node is called the parent tag, and the content tag farther from the root node is called the child tag. Fig. 2 is a schematic diagram of the first content tag tree according to Embodiment 1 of the present invention.
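As an illustration of this step, the following is a minimal Python sketch of how a first content tag tree might be assembled from a website domain name and its URLs. The patent does not specify the form of the content tag rules; the sketch assumes a hypothetical rule table that maps a URL path prefix to a chain of tags. The names `TagNode`, `build_tag_tree`, `rules`, the placeholder root named after the domain, and the `example.com` URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class TagNode:
    """One content tag; `children` maps child tag names to their nodes."""
    name: str
    children: dict = field(default_factory=dict)

    def child(self, name):
        # Return the child tag with this name, creating it if needed.
        return self.children.setdefault(name, TagNode(name))


def build_tag_tree(host, urls, rules):
    """Build a tag tree for one website.

    `rules` is a hypothetical content tag rule table: URL path prefix ->
    chain of tags (parent first) that the prefix stands for.
    The root is a placeholder named after the domain (an assumption).
    """
    root = TagNode(host)
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc != host:
            continue  # URL belongs to a different website
        for prefix, tag_path in rules.items():
            if parsed.path.startswith(prefix):
                node = root
                for tag in tag_path:
                    node = node.child(tag)
    return root


# Illustrative data only; not taken from any real website.
rules = {
    "/news/tech": ["News", "Technology"],
    "/news/ent": ["News", "Entertainment"],
}
tree = build_tag_tree(
    "example.com",
    [
        "http://example.com/news/tech/1.html",
        "http://example.com/news/ent/2.html",
        "http://other.example/news/tech/3.html",
    ],
    rules,
)
```

In this sketch the third URL is ignored because it lies under a different domain, so the resulting tree holds a single "News" tag with two child tags.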

In one embodiment, determining the category tag of the category to which each first content tag tree belongs includes:

reading the website domain name of the website corresponding to each first content tag tree, and determining the category tag of the category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library;

Here, the website domain name classification library may be built in advance or may be an existing one. It records the website content category corresponding to each website domain name. For example, the website content category of the website whose domain name is xxsy.com is "Books"; that is, if the website domain name corresponding to the first content tag tree is xxsy.com, the category tag of the first content tag tree is "Books".
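The lookup described above can be sketched as a simple dictionary query. This is only an illustration: the dictionary `DOMAIN_CATEGORIES` stands in for the website domain name classification library, and falling back from a subdomain to its parent domain is an assumption of this sketch that the text does not describe.

```python
# Hypothetical stand-in for the website domain name classification library.
DOMAIN_CATEGORIES = {"xxsy.com": "Books"}


def category_tag_for(host, library=DOMAIN_CATEGORIES):
    """Return the category tag for a domain, or None if it is unclassified."""
    parts = host.lower().split(".")
    # Try the host itself, then each parent domain
    # (for example www.xxsy.com, then xxsy.com). Assumption of this sketch.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in library:
            return library[candidate]
    return None
```

With this table, `category_tag_for("xxsy.com")` yields "Books", matching the example in the text, while a domain missing from the library yields `None` and would need to be classified some other way.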

Step 102: Match the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and update the content tag library according to the matching result;

In one embodiment, matching the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag includes:

acquiring, from the content tag library, the second content tag tree whose root node content tag is the same as the category tag, and, in combination with semantic analysis, matching the content tags of the first content tag tree level by level against the content tags at each level of the second content tag tree, in left-to-right or top-down order;

Here, shortly after the content tag library has been created, it may not yet contain a second content tag tree whose root node content tag is the same as the category tag; in that case, a second content tag tree whose root node content tag is the same as the category tag is created;

The semantic analysis is a semantic analysis of content tag names: if two content tags have the same name and the same parent tag and child tags, they are judged to be the same; if two content tags have different names but the same parent tag and child tags, they are judged to be similar; if two content tags have different names and their parent tags or child tags differ, they are judged to be different;
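The three rules above can be written as a small classification function. The sketch below is illustrative only: the tuple representation `(name, parent_name, child_names)` and the names `Match` and `compare_tags` are assumptions. The text does not say how to classify two tags with equal names but different parent or child tags; this sketch treats them as different, which is also an assumption.

```python
from enum import Enum


class Match(Enum):
    SAME = "same"
    SIMILAR = "similar"
    DIFFERENT = "different"


def compare_tags(a, b):
    """Classify two tags given as (name, parent_name, frozenset_of_child_names)."""
    name_a, parent_a, children_a = a
    name_b, parent_b, children_b = b
    same_context = parent_a == parent_b and children_a == children_b
    if same_context:
        # Same parent and child tags: the name decides same vs. similar.
        return Match.SAME if name_a == name_b else Match.SIMILAR
    # Different names with a different parent or child tags are "different"
    # per the text; equal names with a different context also land here
    # (an assumption of this sketch).
    return Match.DIFFERENT


horror = ("Horror", "Novel", frozenset())
thriller = ("Thriller", "Novel", frozenset())
history = ("History", "Books", frozenset())
```

Applied literally, the structural rule makes any two leaf tags under the same parent "similar", so a practical implementation would presumably also compare the meanings of the names, which is what the phrase "in combination with semantic analysis" suggests.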

In one embodiment, if the category tag is "Books", acquiring from the content tag library the second content tag tree whose root node content tag is the same as the category tag means acquiring the second content tag tree in the content tag library whose root node is "Books".

The left-to-right or top-down order is the order from parent node to child node.

In one embodiment, matching the content tags of the first content tag tree level by level against the content tags at each level of the second content tag tree, in combination with semantic analysis and in left-to-right or top-down order, includes:

starting from the root node content tag of the first content tag tree and proceeding in left-to-right or top-down order, first matching the root node content tag level by level against the content tags at each level of the second content tag tree; if the second content tag tree contains a content tag that is the same as the root node content tag, continuing by matching each child tag of the root node content tag level by level against the content tags at each level of the second content tag tree; if the second content tag tree contains a content tag that is the same as a child tag, continuing by matching each child tag of that child tag in the same way; and so on, until all content tags in the first content tag tree have been matched.
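The traversal just described can be sketched as a level-order search combined with a recursive descent. This is a simplified illustration under stated assumptions: trees are nested dictionaries of the form `{"name": ..., "children": [...]}`, tags are compared by name only (the semantic analysis is left out), each child tag is searched within the subtree of its matched parent, and the function names are hypothetical. The example trees are simplified from the tags mentioned in the text.

```python
from collections import deque


def find_level_order(tree, name):
    """Search a tag tree level by level, left to right, for a tag name."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if node["name"] == name:
            return node
        queue.extend(node["children"])
    return None


def match_trees(first, second_root):
    """Return (matched, unmatched) tag names of `first` against `second_root`."""
    matched, unmatched = [], []

    def walk(src, scope):
        hit = find_level_order(scope, src["name"])
        if hit is None:
            # Descendants of an unmatched tag are not visited separately.
            unmatched.append(src["name"])
            return
        matched.append(src["name"])
        for child in src["children"]:
            walk(child, hit)  # parent first, then its child tags

    walk(first, second_root)
    return matched, unmatched


def tag(name, *children):
    return {"name": name, "children": list(children)}


second = tag(
    "Books",
    tag("Novel", tag("Thriller"), tag("Supernatural")),
    tag("History"),
    tag("Science and technology"),
)
first = tag("Novel", tag("Horror"))
matched, unmatched = match_trees(first, second)
```

Here "Novel" is found in the library tree, while "Horror" has no identically named counterpart and is left for the similarity check and update step.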

In one embodiment, updating the content tag library according to the matching result includes:

when it is determined that the second content tag tree contains a content tag similar to the content tag, updating the name of that similar content tag; here, a content tag similar to the content tag is one whose name differs from that of the content tag but whose parent tag and child tags are the same;

when it is determined that the second content tag tree contains a content tag that is the same as the content tag, keeping that content tag in the second content tag tree unchanged and continuing to match the other content tags of the first content tag tree;

Here, Fig. 3 is a schematic diagram of the second content tag tree in the content tag library according to an embodiment of the present invention. As shown in Fig. 2 and Fig. 3, the category tag of the first content tag tree is "Books". Starting from the root node content tag of the first content tag tree, namely "Agriculture and forestry", and proceeding in left-to-right or top-down order, "Agriculture and forestry" is matched level by level against the content tags at each level of the second content tag tree, that is, first against the child tags "Novel", "History", and "Science and technology" of "Books". Semantic analysis finds no content tag that is the same as "Agriculture and forestry", so "Agriculture and forestry" is added as a child tag of "Books";

Fig. 4 is a schematic diagram of the first content tag tree according to Embodiment 2 of the present invention. As shown in Fig. 3 and Fig. 4, the category tag of the first content tag tree is "Books". Starting from the root node content tag of the first content tag tree, namely "Novel", the tag is matched level by level against the content tags at each level of the second content tag tree, and semantic analysis finds the same tag "Novel" in the second content tag tree. The child tag "Horror" of "Novel" is then matched level by level against the content tags at each level of the second content tag tree, and semantic analysis finds the similar content tag "Thriller"; the content tag "Thriller" in the second content tag tree is therefore updated to "Horror".
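The two worked examples above (adding "Agriculture and forestry" and renaming "Thriller" to "Horror") can be reproduced with a short sketch. It is not the patent's implementation: trees are nested dictionaries, tags are compared among the child tags of the already matched parent, and the `SYNONYMS` table is a hypothetical stand-in for the semantic analysis, whose mechanism the text does not specify. The tree shapes are simplified from the tags named in the text.

```python
# Hypothetical stand-in for semantic analysis of tag names.
SYNONYMS = [{"Horror", "Thriller"}]


def names_similar(a, b):
    return any(a in group and b in group for group in SYNONYMS)


def child_names(node):
    return {c["name"] for c in node["children"]}


def merge_tag(parent, src):
    """Merge tag `src` into the child tags of library node `parent`."""
    siblings = parent["children"]
    # Same: an identically named tag at this level is kept unchanged.
    target = next((t for t in siblings if t["name"] == src["name"]), None)
    if target is None:
        # Similar: different name, same parent and same child tags.
        target = next(
            (t for t in siblings
             if names_similar(t["name"], src["name"])
             and child_names(t) == child_names(src)),
            None,
        )
        if target is not None:
            target["name"] = src["name"]  # take the website's name
        else:
            target = {"name": src["name"], "children": []}
            siblings.append(target)  # absent: add at this level
    for child in src["children"]:
        merge_tag(target, child)


def tag(name, *children):
    return {"name": name, "children": list(children)}


library = tag(
    "Books",
    tag("Novel", tag("Thriller"), tag("Supernatural")),
    tag("History"),
    tag("Science and technology"),
)
merge_tag(library, tag("Agriculture and forestry"))
merge_tag(library, tag("Novel", tag("Horror")))
top_level = [c["name"] for c in library["children"]]
novel_children = [c["name"] for c in library["children"][0]["children"]]
```

After both merges, "Books" has gained the child tag "Agriculture and forestry", and the child tags of "Novel" are "Horror" and "Supernatural".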

In one embodiment, after the content tag library is updated according to the matching result, the method further includes:

starting from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, matching the content tags in the second content tag tree against the content tags at the corresponding level of the first content tag tree, in left-to-right or top-down order; if a content tag in the second content tag tree does not exist among the content tags at the corresponding level of the first content tag tree, deleting that content tag from the second content tag tree;

Here, as shown in Fig. 3 and Fig. 4, the matching starts from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, namely the content tag "Novel" in the second content tag tree. The content tags in the second content tag tree are matched against the content tags at the corresponding level of the first content tag tree in left-to-right or top-down order. The child tag "Supernatural" of "Novel" in the second content tag tree is found not to exist among the child tags of "Novel" in the first content tag tree, so the content tag "Supernatural" is deleted from the second content tag tree.
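The deletion pass can be sketched as a recursive prune that walks the library subtree and the website's tree in parallel. As before, this is an illustration under assumptions: trees are nested dictionaries, tags are compared by name at the same level, and the function name `prune` is hypothetical.

```python
def prune(library_node, site_node):
    """Drop child tags of `library_node` that `site_node` lacks at the same level."""
    site_children = {c["name"]: c for c in site_node["children"]}
    kept = []
    for child in library_node["children"]:
        if child["name"] in site_children:
            # Tag still exists on the website: keep it and check its children.
            prune(child, site_children[child["name"]])
            kept.append(child)
    library_node["children"] = kept


def tag(name, *children):
    return {"name": name, "children": list(children)}


# State of the library subtree after "Thriller" was renamed to "Horror".
library_novel = tag("Novel", tag("Horror"), tag("Supernatural"))
site_novel = tag("Novel", tag("Horror"))
prune(library_novel, site_novel)
remaining = [c["name"] for c in library_novel["children"]]
```

In this example only "Horror" remains under "Novel". The text compares the library subtree against a single first content tag tree; how tags contributed by other websites would be protected from deletion is not described, so a real system would need to settle that separately.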

Embodiment 2

Fig. 5 is a schematic flowchart of the method for managing Internet content tags according to this embodiment. As shown in Fig. 5, the method includes:

Step 501: Acquire first content tag trees corresponding to different websites, and determine the category tag of the category to which each first content tag tree belongs;

Before this step, the method further includes creating a content tag library. Note that the content tag library only needs to be created the first time the method for managing Internet content tags is executed; afterwards it can be used directly.

The acquisition may be periodic, and the period may be set according to actual needs, for example two weeks.

The website domain names of the different websites and at least one URL under each website domain name are obtained with an Internet crawler tool or through an existing tag data interface of another external system, and, based on the website domain name and the URL, each website's content tag rules are used to determine the first content tag tree corresponding to that website;

Here, the website of each Internet content provider has its own content tag system, which as a whole forms a tree structure. For example, Baidu includes Baidu News, Baidu Zhidao, Baidu Video, Baidu Maps, and so on, and Baidu News in turn includes technology, entertainment, society, military, and so on. The content tag system has corresponding content tag rules, that is, each content tag corresponds to a URL under the website domain name. Based on the website domain name of a website and at least one URL under that website, the website's content tag rules are used to determine the first content tag tree corresponding to the website.

In one embodiment, determining the category tag of the category to which each first content tag tree belongs includes:

Here, the website domain name classification library may be built in advance or may be an existing one. It records the website content category corresponding to each website domain name. For example, the website content category of the website whose domain name is youku.com is "Video"; that is, if the website domain name corresponding to the first content tag tree is youku.com, the category tag of the first content tag tree is "Video".

Step 502: Match the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and judge whether the content tag library contains a content tag that is the same as the content tag in the first content tag tree; if so, go to step 503; if not, go to step 505;

In this embodiment of the present invention, matching the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag includes:

Here, the semantic analysis is a semantic analysis of content tag names: if two content tags have the same name and the same parent tag and child tags, they are judged to be the same; if two content tags have different names but the same parent tag and child tags, they are judged to be similar; if two content tags have different names and their parent tags or child tags differ, they are judged to be different;

步骤503：从所述第二内容标签树中与所述第一内容标签树的根节点内容标签相同的内容标签开始，按照从左到右或从顶向下的顺序将所述第二内容标签树中的内容标签与所述第一内容标签树中相应层级的内容标签进行匹配，判断所述第二内容标签树中的内容标签在所述第一内容标签树中是否存在，如果不存在，执行步骤504；如果存在，执行步骤508；Step 503: Starting from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, sort the second content tag in order from left to right or from top to bottom The content tags in the tree are matched with the content tags of the corresponding level in the first content tag tree, and it is judged whether the content tag in the second content tag tree exists in the first content tag tree, if not, Execute step 504; if exist, execute step 508;

Here, because the content tags corresponding to the content offered by content providers change dynamically, some useless content tags accumulate in the content tag library. To keep the content tags in the content tag library synchronized with the content tags on the Internet, the useless tags in the content tag library need to be deleted periodically. As shown in Figures 3 and 4, matching starts from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, that is, from the content tag "novel" (小说) in the second content tag tree. The content tags in the second content tag tree are matched, in left-to-right or top-to-bottom order, against the content tags at the corresponding level of the first content tag tree. The sub-tag "supernatural" (神怪) of "novel" in the second content tag tree is found not to exist among the sub-tags of "novel" in the first content tag tree, so the content tag "supernatural" is deleted from the second content tag tree.

Step 504: Delete the content tags in the second content tag tree that do not exist in the first content tag tree, and go to step 508.
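Steps 503 and 504 amount to pruning the library tree against the website tree. The sketch below shows one possible reading in Python. It assumes, purely for illustration, that each tag tree is a nested dict mapping a tag name to a dict of its sub-tags; the function name `prune_stale` and all tag names other than "novel" and "supernatural" (神怪) are hypothetical, and name equality stands in for the semantic matching described earlier.

```python
def prune_stale(second: dict, first: dict) -> list:
    """Delete tags from `second` that are absent at the corresponding level
    of `first`, and return the paths of the deleted tags."""
    removed = []

    def walk(lib_children: dict, site_children: dict, path: tuple) -> None:
        # list(...) takes a snapshot so entries can be deleted while looping.
        for name in list(lib_children):
            if name not in site_children:
                del lib_children[name]                      # step 504
                removed.append(path + (name,))
            else:
                walk(lib_children[name], site_children[name], path + (name,))

    # Start from the library tag that equals the root of the website tree.
    for root in second:
        if root in first:
            walk(second[root], first[root], (root,))
    return removed


# First tree (website) and second tree (library); sub-tags other than
# "supernatural" are made up for the example.
first_tree = {"novel": {"romance": {}, "martial arts": {}}}
second_tree = {"novel": {"romance": {}, "martial arts": {}, "supernatural": {}}}

print(prune_stale(second_tree, first_tree))  # [('novel', 'supernatural')]
print(second_tree)  # {'novel': {'romance': {}, 'martial arts': {}}}
```

Returning the deleted paths is a design choice of this sketch, made so that a periodic cleanup job could log what was removed.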

Step 505: Determine whether the content tag library contains a content tag similar to the content tag in the first content tag tree. If it does, go to step 506; if it does not, go to step 507.

Step 506: Update the name of the similar content tag in the content tag library to the name of the content tag in the first content tag tree, and go to step 508.

Step 507: Add the content tag at the corresponding level of the second content tag tree.

Step 508: End the current processing flow.
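The branch taken when no identical tag is found (steps 505 to 507) can be sketched as follows. The nested-dict tree representation and the function name `apply_unmatched_tag` are assumptions made for this example. Because the siblings at one level already share a parent, "similar" is reduced here to "different name, same set of sub-tag names"; the additional requirement that the sub-tag set be non-empty is also an assumption of this sketch, added so that two unrelated leaf tags are not treated as similar.

```python
def apply_unmatched_tag(name: str, site_children: dict, library_level: dict) -> str:
    """Handle a website tag that has no identical tag in `library_level`.

    `library_level` holds the sibling tags of the second tree at the
    corresponding level. Returns 'renamed' (step 506) or 'added' (step 507).
    """
    # Step 505: look for a similar tag (different name, same sub-tags).
    if site_children:
        for lib_name in list(library_level):
            if lib_name != name and set(library_level[lib_name]) == set(site_children):
                # Step 506: rename the similar tag to the website's name.
                library_level[name] = library_level.pop(lib_name)
                return "renamed"
    # Step 507: no similar tag, so add the tag at this level.
    library_level[name] = {}
    return "added"


# Hypothetical sub-tags of one library tag.
level = {"science fiction": {"space": {}, "robots": {}}, "romance": {}}

print(apply_unmatched_tag("sci-fi", {"space": {}, "robots": {}}, level))  # renamed
print(apply_unmatched_tag("history", {}, level))                           # added
print(sorted(level))  # ['history', 'romance', 'sci-fi']
```

The added tag starts with no sub-tags; its sub-tags would be added when they are processed in turn at the next level.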

Embodiment Three

Figure 6 is a schematic diagram of the structure of the apparatus for managing Internet content tags according to an embodiment of the present invention. As shown in Figure 6, the apparatus includes a creation module 61, an obtaining module 62, and an update module 63, wherein:

The creation module 61 is configured to create a content tag library;

The obtaining module 62 is configured to obtain the first content tag trees corresponding to different websites, and to determine, for each first content tag tree, the category tag of the category to which it belongs;

The update module 63 is configured to match the content tags of the first content tag tree against the content tags in the content tag library according to preset matching rules based on the category tag, and to update the content tag library according to the matching result.

In one embodiment, the obtaining module 62 is specifically configured to obtain the website domain names corresponding to different websites and at least one URL under each website domain name, and, based on the website domain names and the URLs, to determine the first content tag trees corresponding to the different websites by using the content tag rules of each website.

In one embodiment, the obtaining module 62 is specifically configured to read the website domain name of the website corresponding to each first content tag tree, and to determine the category tag of the category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library.
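The lookup of a category tag from a website domain name can be illustrated with a small Python sketch. The contents of `DOMAIN_CATEGORIES`, the function name `category_tag_for`, and the fallback from a sub-domain to its parent domains are all assumptions made for this example; the patent only states that a preset website domain name classification library is consulted.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical preset website domain name classification library.
DOMAIN_CATEGORIES = {
    "books.example.com": "novel",
    "video.example.com": "video",
}


def category_tag_for(url_or_domain: str, classification=DOMAIN_CATEGORIES) -> Optional[str]:
    """Return the category tag for a URL or bare domain, or None if unknown."""
    # Prefix bare domains with '//' so urlsplit parses them as a network location.
    target = url_or_domain if "//" in url_or_domain else "//" + url_or_domain
    host = urlsplit(target).hostname
    if not host:
        return None
    labels = host.split(".")
    # Try the full host first, then each parent domain (assumed fallback).
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in classification:
            return classification[candidate]
    return None


print(category_tag_for("http://books.example.com/list/1"))  # novel
print(category_tag_for("m.video.example.com"))              # video
print(category_tag_for("unknown.example.org"))              # None
```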

In one embodiment, the update module 63 is specifically configured to obtain the second content tag tree in the content tag library whose root node content tag is the same as the category tag and, in combination with semantic analysis, to match the content tags of the first content tag tree level by level against the content tags at each level of the second content tag tree, in left-to-right or top-to-bottom order.

Here, semantic analysis means semantic analysis of content tag names. If two content tags have the same name and their corresponding parent tags and child tags are all the same, the two content tags are judged to be the same. If the two content tags have different names but their corresponding parent tags and child tags are all the same, the two content tags are judged to be similar. If the two content tags have different names and their corresponding parent tags or child tags differ, the two content tags are judged to be different.

In one embodiment, the update module 63 is specifically configured to start from the root node content tag of the first content tag tree and proceed in left-to-right or top-to-bottom order. The root node content tag is first matched level by level against the content tags at each level of the second content tag tree. If the second content tag tree contains a content tag that is the same as the root node content tag, each sub-tag of the root node content tag is then matched level by level against the content tags at each level of the second content tag tree. If the second content tag tree contains a content tag that is the same as a sub-tag, each sub-tag of that sub-tag is matched in the same way, and so on, until all content tags in the first content tag tree have been matched.
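The level-by-level traversal can be sketched as a breadth-first walk. The example below is a simplification under stated assumptions: trees are nested dicts mapping a tag name to its sub-tags, name equality replaces semantic analysis, and each tag is compared only with the sub-tags of its already matched parent rather than with every level of the second tree. The function name `match_level_by_level` is hypothetical.

```python
from collections import deque


def match_level_by_level(first: dict, second: dict):
    """Walk `first` top to bottom and left to right.

    Returns (matched, unmatched) lists of tag paths. Sub-tags of an
    unmatched tag are not visited, since the walk only continues below
    tags that were found in `second`.
    """
    matched, unmatched = [], []
    queue = deque(((name,), subtree, second) for name, subtree in first.items())
    while queue:
        path, site_children, lib_level = queue.popleft()
        name = path[-1]
        if name in lib_level:
            matched.append(path)
            # Continue with the sub-tags, compared against the matched tag's sub-tags.
            for child, subtree in site_children.items():
                queue.append((path + (child,), subtree, lib_level[name]))
        else:
            unmatched.append(path)
    return matched, unmatched


# Hypothetical website tree and library tree.
first_tree = {"novel": {"romance": {"modern": {}}, "sci-fi": {}}}
second_tree = {"novel": {"romance": {}, "supernatural": {}}}

matched, unmatched = match_level_by_level(first_tree, second_tree)
print(matched)    # [('novel',), ('novel', 'romance')]
print(unmatched)  # [('novel', 'sci-fi'), ('novel', 'romance', 'modern')]
```

The unmatched paths are the candidates for the rename or add decision described for the update module.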

In one embodiment, the update module 63 is specifically configured to add the content tag at the corresponding level of the second content tag tree when it determines that the second content tag tree contains no content tag that is the same as or similar to the content tag;

and, when it determines that the second content tag tree contains a content tag that is the same as the content tag, to keep that content tag in the second content tag tree unchanged and continue matching the other content tags in the first content tag tree.

In one embodiment, the update module 63 is further configured to start from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree and to match the content tags in the second content tag tree, in left-to-right or top-to-bottom order, against the content tags at the corresponding level of the first content tag tree. If a content tag in the second content tag tree does not exist among the content tags at the corresponding level of the first content tag tree, that content tag is deleted from the second content tag tree.

It should be noted that the above description of the apparatus is similar to the description of the method, and its beneficial effects are the same as those of the method, so they are not repeated here. For technical details not disclosed in the apparatus embodiments of the present invention, refer to the description of the method embodiments of the present invention.

In the embodiments of the present invention, the creation module 61, the obtaining module 62, and the update module 63 may each be implemented by a central processing unit (CPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC) in a terminal or server.

The above descriptions are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention.

Claims

1. A method for managing internet content tags, wherein a content tag library is created, the method further comprising:

acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong;

and matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags, and updating the content tag library according to a matching result.

2. The method of claim 1, wherein the acquiring first content tag trees corresponding to different websites comprises:

acquiring website domain names corresponding to different websites and at least one Uniform Resource Locator (URL) under the website domain names, and determining first content tag trees corresponding to the different websites by using content tag rules of the websites based on the website domain names and the URLs.

3. The method according to claim 1 or 2, wherein the determining category tags of categories to which the first content tag trees belong comprises:

respectively reading the website domain names of the websites corresponding to the first content tag trees, and determining the category tags of the categories to which the first content tag trees belong according to the website domain names and a preset website domain name classification library.

4. The method according to claim 1 or 2, wherein the matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags comprises:

acquiring a second content tag tree in the content tag library whose root node content tag is identical to the category tag, and, in combination with semantic analysis, matching the content tags of the first content tag tree with the content tags of each level in the second content tag tree level by level in left-to-right or top-to-bottom order.

5. The method of claim 4, wherein the updating the content tag library according to the matching result comprises:

if no content tag identical or similar to the content tag exists in the second content tag tree, adding the content tag at the corresponding level of the second content tag tree;

if a content tag similar to the content tag exists in the second content tag tree, updating the name of the similar content tag; wherein a content tag similar to the content tag is a content tag that has a different name but the same corresponding parent tag and child tags.

6. The method of claim 4, wherein after the updating the content tag library according to the matching result, the method further comprises:

starting from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, matching the content tags in the second content tag tree with the content tags of the corresponding level in the first content tag tree in left-to-right or top-to-bottom order, and deleting a content tag in the second content tag tree if that content tag does not exist among the content tags of the corresponding level in the first content tag tree.

7. An apparatus for managing Internet content tags, the apparatus comprising: a creating module, an obtaining module, and an updating module; wherein,

the creating module is used for creating a content tag library;

the obtaining module is used for acquiring first content tag trees corresponding to different websites and respectively determining category tags of categories to which the first content tag trees belong;

and the updating module is used for matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags and updating the content tag library according to a matching result.

8. The apparatus according to claim 7, wherein the obtaining module is specifically configured to obtain website domain names corresponding to different websites and at least one uniform resource locator URL under the website domain names, and determine, based on the website domain names and the URLs, first content tag trees corresponding to the different websites by using content tag rules of the websites.

9. The apparatus according to claim 7 or 8, wherein the obtaining module is specifically configured to read website domain names of websites corresponding to the first content tag trees, and determine category tags of categories to which the first content tag trees belong according to the website domain names and a preset website domain name classification library.

10. The apparatus according to claim 7 or 8, wherein the updating module is specifically configured to obtain a second content tag tree in the content tag library whose root node content tag is the same as the category tag, and, in combination with semantic analysis, to match the content tags of the first content tag tree with the content tags of each level in the second content tag tree level by level in left-to-right or top-to-bottom order.

11. The apparatus of claim 10, wherein the updating module is specifically configured to determine that no content tag identical or similar to the content tag exists in the second content tag tree, and add the content tag at a corresponding level of the second content tag tree;

12. The apparatus of claim 10, wherein the updating module is further configured to match content tags in the second content tag tree with content tags in corresponding levels of the first content tag tree in a left-to-right or top-to-bottom order starting from a content tag in the second content tag tree that is the same as a content tag in a root node of the first content tag tree, and delete a content tag in the second content tag tree if the content tag in the second content tag tree does not exist in the content tag in the corresponding level of the first content tag tree.