
CN103425660B - Method and device for acquiring an entry - Google Patents


Info

Publication number
CN103425660B
CN103425660B CN201210151282.6A CN201210151282A CN103425660B CN 103425660 B CN103425660 B CN 103425660B CN 201210151282 A CN201210151282 A CN 201210151282A CN 103425660 B CN103425660 B CN 103425660B
Authority
CN
China
Prior art keywords
anchor text
entry
extracted
existing
anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210151282.6A
Other languages
Chinese (zh)
Other versions
CN103425660A (en
Inventor
李永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210151282.6A
Publication of CN103425660A
Application granted
Publication of CN103425660B
Legal status: Active

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and a device for acquiring entries. The method includes: obtaining a set of existing entries belonging to the same category in an entry library; searching with the obtained set of existing entries to find anchor texts that contain the existing entries, and recording the web page location of each such anchor text; and, according to the recorded web page locations, extracting at those locations the anchor texts whose context distance from the anchor text of an existing entry meets a preset requirement. The method and device use an existing entry library to mine entity entries, can guide users in creating new entries, address the insufficient coverage of entity entries in encyclopedia databases, and facilitate more effective knowledge search.

Description

A method and device for acquiring an entry

【Technical Field】

The present invention relates to the technical field of Internet information processing, and in particular to a method and device for acquiring entries.

【Background】

With the continuous development of information and network technology, people increasingly search the Internet for knowledge and information. An encyclopedia website, such as Baidu Baike (百度百科), Wikipedia, or Hudong Baike (互动百科), is a platform on which all Internet users can equally browse, create, and improve content. It lets users find comprehensive, accurate, and objective definitional information, which other users can in turn query and browse on similar topics for knowledge or reference.

An entry is the basic unit into which the content of an encyclopedia website is divided. An entry has one or more single topics and sets out knowledge about a thing, a person, or a combination with a specific theme, for example “故宫” (the Forbidden City), “刘德华” (Andy Lau), or “2008年北京奥运会” (the 2008 Beijing Olympic Games). An encyclopedia website contains a very large number of entries covering many industries, topics, and fields of knowledge. For a search engine, these encyclopedia entries can greatly improve retrieval accuracy and coverage, and they help extract structured data from web pages for vertical search, yielding more precise information.

As information spreads widely and the range of what people communicate keeps expanding, new entries emerge constantly. At present, new entries and their corresponding knowledge content are created manually, and entries that pass manual review are then added to the encyclopedia website for users to search. For an entry that has not yet been created, such as a new song, film, or person, the system does not actively discover it on the Internet. As a result, some new entries cannot be created and updated in time, which affects the retrieval speed of search engines and can even affect retrieval precision and recall.

【Summary of the Invention】

In view of this, the present invention provides a method and device for acquiring entries that use an existing entry library to mine entity entries, can guide users in creating new entries, address the insufficient coverage of entity entries in encyclopedia databases, and facilitate more effective knowledge search.

The specific technical solution is as follows:

A method for acquiring an entry, the method including the following steps:

S1. Obtain a set of existing entries of the same category in an entry library;

S2. Search with the obtained set of existing entries to obtain anchor texts containing the existing entries, and record the web page location of the anchor text of each existing entry;

S3. According to the recorded web page locations, extract at the corresponding locations the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

According to a preferred embodiment of the present invention, after step S3 the method further includes:

S4. Compute the weight of each extracted anchor text according to its context distance from the anchor text of the existing entry, count how often the extracted anchor text occurs in the current category, and identify as new entries the anchor texts whose frequency or weight meets a preset requirement.

According to a preferred embodiment of the present invention, the web page location of an anchor text includes:

the web page containing the anchor text, the page block containing the anchor text, and the position of the anchor text within that page block.

According to a preferred embodiment of the present invention, the context distance meeting the preset requirement includes:

the page block containing the extracted anchor text being the same as the page block containing the anchor text of the existing entry.

According to a preferred embodiment of the present invention, the context distance meeting the requirement further includes:

the separation between the extracted anchor text and the anchor text of the existing entry being smaller than a preset distance threshold.

According to a preferred embodiment of the present invention, computing the weight of the extracted anchor text according to its context distance from the anchor text of the existing entry specifically includes:

within the same page block, determining the context distance between the extracted anchor text and the anchor text of the existing entry;

using the determined context distance to compute the weight of the extracted anchor text in the corresponding page block;

over the entire current category, summing the weights computed for the extracted anchor text in each page block to obtain the weight of the extracted anchor text.

According to a preferred embodiment of the present invention, determining, within the same page block, the context distance between the extracted anchor text and the anchor text of the existing entry specifically includes:

determining the anchor texts of existing entries contained in the page block in which the extracted anchor text is located;

computing the distance between the extracted anchor text and each of those anchor texts of existing entries;

taking the minimum of these distances as the context distance from the existing entries.

According to a preferred embodiment of the present invention, after step S3 the method further includes:

comparing the extracted anchor texts with the entry library to obtain the unrecorded anchor texts;

performing step S4 only on the unrecorded anchor texts.

According to a preferred embodiment of the present invention, after step S3 the method further includes:

filtering out, from the extracted anchor texts, those that do not contain a specified part of speech;

performing step S4 only on the anchor texts that remain after filtering.

A device for acquiring an entry, the device including:

an existing-entry acquisition module, configured to obtain a set of existing entries of the same category in an entry library;

a search module, configured to search with the set of existing entries obtained by the existing-entry acquisition module, obtain anchor texts containing the existing entries, and record the web page location of the anchor text of each existing entry;

an extraction module, configured to extract, according to the web page locations recorded by the search module and at the corresponding locations, the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

According to a preferred embodiment of the present invention, the device further includes:

a new-entry recognition module, configured to compute the weight of each anchor text extracted by the extraction module according to its context distance from the anchor text of the existing entry, count how often the extracted anchor text occurs in the current category, and identify as new entries the anchor texts whose frequency or weight meets a preset requirement.

According to a preferred embodiment of the present invention, the web page location of an anchor text includes:

the web page containing the anchor text, the page block containing the anchor text, and the position of the anchor text within that page block.

According to a preferred embodiment of the present invention, the context distance meeting the preset requirement includes:

the page block containing the extracted anchor text being the same as the page block containing the anchor text of the existing entry.

According to a preferred embodiment of the present invention, the context distance meeting the requirement further includes:

the separation between the extracted anchor text and the anchor text of the existing entry being smaller than a preset distance threshold.

According to a preferred embodiment of the present invention, the new-entry recognition module includes:

a distance determination unit, configured to determine, within the same page block, the context distance between the extracted anchor text and the anchor text of the existing entry;

a weight calculation unit, configured to use the context distance determined by the distance determination unit to compute the weight of the extracted anchor text in the corresponding page block;

a weighting unit, configured to sum, over the entire current category, the weights computed for the extracted anchor text in each page block to obtain the weight of the extracted anchor text.

According to a preferred embodiment of the present invention, the distance determination unit is specifically configured to:

determine the anchor texts of existing entries contained in the page block in which the extracted anchor text is located;

compute the distance between the extracted anchor text and each of those anchor texts of existing entries;

take the minimum of these distances as the context distance from the existing entries.

According to a preferred embodiment of the present invention, the device further includes:

an existing-entry filtering module, configured to compare the anchor texts extracted by the extraction module with the entry library to obtain the unrecorded anchor texts,

and to provide the unrecorded anchor texts to the new-entry recognition module.

According to a preferred embodiment of the present invention, the device further includes:

a part-of-speech filtering module, configured to filter out, from the anchor texts extracted by the extraction module, those that do not contain a specified part of speech,

and to provide the anchor texts that remain after filtering to the new-entry recognition module.

As the above technical solutions show, the method and device provided by the present invention use an existing entry library to mine entity entries and supply new entries that have not yet been created. They can guide users in creating the knowledge content for those new entries, address the insufficient coverage of entity entries in encyclopedia databases, help complete structured data, and facilitate more effective knowledge search.

【Brief Description of the Drawings】

FIG. 1 is a flowchart of the entry acquisition method provided by Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a web page and the page blocks it contains;

FIG. 3 is a schematic diagram of a page block found by searching with the existing entry “因为爱情” (“Because of Love”);

FIG. 4 is a flowchart of the entry acquisition method provided by Embodiment 2 of the present invention;

FIG. 5 is a schematic diagram of the entry acquisition device provided by Embodiment 3 of the present invention;

FIG. 6 is a schematic diagram of the entry acquisition device provided by Embodiment 4 of the present invention.

【Detailed Description】

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.

Embodiment 1

FIG. 1 is a flowchart of the entry acquisition method provided by this embodiment. As shown in FIG. 1, the method includes:

Step S101. Obtain a set of existing entries of the same category in the entry library.

The entry library may be any categorized entry library, such as an encyclopedia entry library or an input-method entry library. The present invention uses an encyclopedia entry library as the example.

The categories may be those already defined in the categorized entry library, including songs, films, people, nature, culture, geography, history, life, society, art, economy, science and technology, and sports. Alternatively, they may be categories obtained by dividing the existing entries with an existing classification or clustering method (such as Bayesian classification, decision trees, or support vector machines (SVM)).

A set of existing entries of the same category is obtained from the entry library, and steps S102 and S103 are performed on the existing entries of each category in the entry library in turn.

Step S102. Search with the obtained set of existing entries to obtain anchor texts containing the existing entries, and record the web page location of the anchor text of each existing entry.

Internet web pages are searched using the obtained set of existing entries to find anchor texts containing the existing entries, and these anchor texts and their web page locations are recorded.

The web page location of an anchor text may include: the web page containing the anchor text, the page block containing the anchor text, and the position of the anchor text within that page block. FIG. 2 is a schematic diagram of a web page and the page blocks it contains. As shown in FIG. 2, the web page location of anchor text 1 is the first position in page block A of that web page.

For example, step S101 obtains the existing song category set T1 from the encyclopedia entries. T1 contains tens of thousands of existing entries, e.g. {因为爱情, 爱你痛到不知痛, ...}. Anchor texts containing entries in T1 are then found by searching. For instance, searching with the existing entry “因为爱情” (“Because of Love”) finds the anchor text “因为爱情” on the page http://ting.baidu.com, as shown in FIG. 3, and the page block and page location of that anchor text are recorded.

Alternatively, when searching for anchor texts containing the existing entries, all anchor texts of every web page on the Internet may be obtained first and then matched against the set of existing entries of each category. The anchor texts that match are identified, and the web page, page block, and page location of each are recorded.
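The matching variant just described (collect anchors first, then match them against a category's entry set) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `AnchorLocation` record, its field names, and the use of exact string equality as the matching rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorLocation:
    """Where one anchor text was seen (field names are illustrative)."""
    page: str       # URL of the web page
    block: str      # identifier of the page block on that page
    position: int   # index of the anchor within the block (0 = first)
    text: str       # the anchor text itself


def match_existing_entries(anchors, existing_entries):
    """Return the locations whose anchor text equals an existing entry.

    Assumption: exact string equality stands in for "matching".
    """
    entries = set(existing_entries)
    return [a for a in anchors if a.text in entries]
```

Calling `match_existing_entries` with the anchors of a crawled page and the song set T1 would return only the recorded locations of anchors such as “因为爱情”, which is the information step S103 needs.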

Step S103. According to the recorded web page locations, extract at the corresponding locations the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

For each recorded web page location of an existing entry's anchor text, the anchor texts whose context distance from that location meets the requirement are extracted as entries.

The context distance meeting the preset requirement may include:

The page block containing the extracted anchor text is the same as the page block containing the anchor text of the existing entry. In FIG. 2, anchor text 1 and anchor text 3 are in the same page block, whereas anchor text 1 and anchor text 5 are in different page blocks. If anchor text 1 is the anchor text of an existing entry, anchor text 2 and anchor text 3 can be extracted as anchor texts meeting the requirement.

Specifically, the page block containing an anchor text can be determined from page layout tags, such as “<div></div>” and “<table></table>”, which indicate whether two anchor texts are in the same page block. Alternatively, page blocks may be determined by visual segmentation of the web page or similar techniques.
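As a rough illustration of block detection by layout tags, the sketch below assigns each anchor to its nearest enclosing `<div>` or `<table>` using Python's standard `html.parser`. The class name, the numbering of blocks, and the choice of "nearest enclosing tag" as the block rule are assumptions for this example; real pages with malformed or deeply nested markup would need a more robust segmentation.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # layout tags treated as block boundaries


class AnchorBlockParser(HTMLParser):
    """Collect (block_id, anchor_text) pairs from an HTML string."""

    def __init__(self):
        super().__init__()
        self.block_stack = [0]    # block 0 is the page root
        self.next_block_id = 1
        self.in_anchor = False
        self.buffer = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            # Entering a new block: give it a fresh identifier.
            self.block_stack.append(self.next_block_id)
            self.next_block_id += 1
        elif tag == "a":
            self.in_anchor = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and len(self.block_stack) > 1:
            self.block_stack.pop()
        elif tag == "a" and self.in_anchor:
            text = "".join(self.buffer).strip()
            self.anchors.append((self.block_stack[-1], text))
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.buffer.append(data)


def anchors_by_block(html):
    """Return anchors in document order, each tagged with its block id."""
    parser = AnchorBlockParser()
    parser.feed(html)
    parser.close()
    return parser.anchors
```

Two anchors are then "in the same page block" exactly when their block ids are equal.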

Alternatively, the requirement may be that the page block containing the extracted anchor text is the same as the page block containing the anchor text of the existing entry, and that the separation between the extracted anchor text and the anchor text of the existing entry is smaller than a preset distance threshold.

For example, FIG. 3 is a schematic diagram of a page block found by searching with the existing entry “因为爱情”. In FIG. 3, anchor texts such as “王菲”, “伤不起”, “王麟”, “最炫民族风”, “凤凰传奇”, “新贵妃醉酒”, and “爱的供养” are in the same page block as the existing entry's anchor text “因为爱情”, and these anchor texts are extracted as entries.

To further improve precision, the extraction of anchor texts whose context distance meets the preset requirement may also limit the separation. If the separation between anchor texts such as “新贵妃醉酒” and “爱的供养” in FIG. 3 and the existing entry's anchor text “因为爱情” exceeds the preset distance threshold, those anchor texts are not extracted.

The preset distance threshold is set according to actual needs, for example within 10 characters.
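The two extraction rules (same block, optionally also within a distance threshold) can be sketched as below. The `Anchor` tuple with character offsets, the `gap` helper, and the strict `<` comparison are assumptions for illustration; the text only says the separation must be smaller than a threshold that is chosen as needed.

```python
from collections import defaultdict
from typing import NamedTuple


class Anchor(NamedTuple):
    block: str   # page block identifier
    start: int   # offset of the first character in the block's text
    end: int     # offset just past the last character
    text: str


def gap(a, b):
    """Number of characters separating two anchors in the same block."""
    first, second = sorted((a, b), key=lambda x: x.start)
    return max(0, second.start - first.end)


def extract_candidates(anchors, seeds, threshold=None):
    """Return anchors sharing a block with a seed (existing-entry) anchor.

    If threshold is given, a candidate must also lie fewer than
    `threshold` characters from at least one seed anchor.
    """
    by_block = defaultdict(list)
    for a in anchors:
        by_block[a.block].append(a)

    result = []
    for block_anchors in by_block.values():
        seed_anchors = [a for a in block_anchors if a.text in seeds]
        if not seed_anchors:
            continue  # no existing entry in this block
        for a in block_anchors:
            if a.text in seeds:
                continue  # the seed itself is not a candidate
            if threshold is None or any(
                gap(a, s) < threshold for s in seed_anchors
            ):
                result.append(a)
    return result
```

With `threshold=None` the function implements the same-block rule alone; passing `threshold=10` corresponds to the "within 10 characters" example.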

Embodiment 2

FIG. 4 is a flowchart of the entry acquisition method provided by this embodiment. As shown in FIG. 4, the method includes:

Step S401. Obtain a set of existing entries of the same category in the entry library.

Step S402. Search with the obtained set of existing entries to obtain anchor texts containing the existing entries, and record the web page location of the anchor text of each existing entry.

Step S403. According to the recorded web page locations, extract at the corresponding locations the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

Steps S401 to S403 are the same as steps S101 to S103 in Embodiment 1 and are not described again here.

Step S404. Compare the extracted anchor texts with the entry library to obtain the unrecorded anchor texts.

Because an extracted anchor text may well be an existing entry, the extracted anchor texts are filtered to remove existing entries, which improves efficiency by ensuring that only unrecorded anchor texts are processed afterwards. If “牵手” and “背叛情歌” in FIG. 3 are existing entries, they are filtered out.

An anchor text extracted under one category may belong to another category; for example, people such as “王菲” (Faye Wong) and “王麟” (Wang Lin) can be extracted in FIG. 3. The extracted anchor texts are therefore compared with the entire entry library, and those already present in the library are removed, leaving the unrecorded anchor texts. If an unrecorded anchor text belongs to the people category or another preset related category, it is also retained, and steps S405 to S406 are then performed. Preset related categories are categories that are associated with one another, set empirically; for example, the song category is associated with the people, film, and entertainment categories.
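A minimal sketch of this filter, assuming the entry library is available as a mapping from category name to a set of entry strings (a representation chosen for the example, not specified in the text):

```python
def drop_recorded(candidates, library):
    """Keep only candidates absent from every category of the library.

    `library` maps a category name to the set of entries in it.
    Duplicates are kept so that later frequency counts stay correct.
    """
    all_entries = set().union(*library.values())
    return [c for c in candidates if c not in all_entries]
```

Checking against the union of all categories, rather than only the current one, is what removes “王菲” here even though the seeds came from the song category.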

It is worth noting that this step may be skipped when processing efficiency is not critical. Alternatively, whether an anchor text is unrecorded may be checked after step S406 has produced its weight or frequency, in order to determine the new entries. In either case, steps S405 to S406 below are performed on all of the extracted anchor texts.

Step S405. Filter out, from the unrecorded anchor texts, those that do not contain a specified part of speech.

The anchor texts obtained in step S404 are processed with word segmentation and part-of-speech tagging, and those that do not contain a specified part of speech are filtered out, for example anchor texts containing no verb, noun, or adjective.

In addition, to obtain well-formed entries, anchor texts may also be filtered by their length and by the punctuation they contain, removing those that do not meet the requirements.
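The filters in step S405 can be sketched as a single predicate. Everything specific here is an assumption for illustration: the `tagger` argument is a hypothetical callable returning `(word, pos)` pairs (any segmenter and tagger could be plugged in), the tag names `n`, `v`, `a` and the length bounds are placeholder values, and "punctuation" is taken to mean Unicode general categories beginning with `P`.

```python
import unicodedata


def has_punctuation(text):
    """True if any character is in a Unicode punctuation category (P*)."""
    return any(unicodedata.category(ch).startswith("P") for ch in text)


def keep_candidate(text, tagger, allowed_pos=frozenset({"n", "v", "a"}),
                   min_len=2, max_len=20):
    """Decide whether an anchor text survives the step S405 filters.

    `tagger` is a hypothetical callable: text -> list of (word, pos).
    """
    if not (min_len <= len(text) <= max_len):
        return False          # too short or too long to be an entry
    if has_punctuation(text):
        return False          # likely navigation or decoration text
    # Keep the text only if at least one word has an allowed part of speech.
    return any(pos in allowed_pos for _, pos in tagger(text))
```

Passing the tagger in as a parameter keeps the sketch independent of any particular segmentation library.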

This step, too, is optional.

Step S406. Compute the weight of each unrecorded anchor text according to its context distance from the anchor text of the existing entry, count how often the unrecorded anchor text occurs in the current category, and identify as new entries the anchor texts whose frequency or weight meets a preset requirement.

Count the frequency (that is, the number of occurrences) in the current category of the anchor texts remaining after the filtering in step S405, and compute their weights. Specifically, computing the weight of an anchor text according to its context distance from the anchor text of the existing entry includes:

Step S406_1. Within the same page block, determine the context distance between the unrecorded anchor text and the anchor text of the existing entry.

Specifically, first determine the anchor texts of existing entries contained in the page block in which the unrecorded anchor text is located.

Then compute the distance between the unrecorded anchor text and each of those anchor texts of existing entries.

The context distance d may be, but is not limited to being, computed as the length of the string separating the unrecorded anchor text from the existing entry, not counting page layout tags, spaces, carriage returns, and similar symbols.

Finally, take the minimum of these distances as the context distance from the existing entries.

For example, suppose one page block contains the anchor texts K1, K2, K3, ..., Kn of several existing entries and several unrecorded anchor texts L1, L2, L3, and so on. For each unrecorded anchor text in the block, the distances to K1 through Kn are computed, and the smallest of them is taken as the context distance between that unrecorded anchor text and the existing entries.
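The minimum-distance rule can be sketched as follows. The block is modelled as an ordered list of `(kind, text)` items with layout tags already stripped, which is a representation assumed for this example. Whether the text of intervening anchors counts toward d is not stated outright; the sketch counts it, which appears consistent with the FIG. 3 example where “王麟” is listed in the separating string.

```python
def visible_length(s):
    """Length of s, not counting spaces, newlines, or other whitespace."""
    return sum(1 for ch in s if not ch.isspace())


def gap(items, i, j):
    """Visible characters strictly between items i and j of one block."""
    lo, hi = sorted((i, j))
    return sum(visible_length(text) for _, text in items[lo + 1:hi])


def context_distance(items, candidate_idx, known):
    """Minimum gap from the candidate to any existing-entry anchor.

    `items` is a list of (kind, text) with kind "anchor" or "text".
    Returns None if the block holds no existing-entry anchor.
    """
    distances = [
        gap(items, candidate_idx, k)
        for k, (kind, text) in enumerate(items)
        if kind == "anchor" and text in known and k != candidate_idx
    ]
    return min(distances) if distances else None
```

For a block laid out as K1, L1, K2, the distance of L1 is the smaller of its gaps to K1 and to K2.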

Step S406_2. Using the determined context distance, compute the weight of the unrecorded anchor text in the corresponding page block.

The context distance between the unrecorded anchor text and the existing entry is used to compute the weight of that anchor text in each page block. The smaller the context distance, the greater the weight.

The weight may be, but is not limited to being, computed with the following formula:

(Formula 1)

In FIG. 3, for example, within that page block the weight of the unrecorded anchor text “伤不起” is computed using the existing entry's anchor text “因为爱情”, specifically:

The context distance is d = 6, the separating string including “2”, “王麟”, and “-”, from which the weight is then obtained using Formula 1.

依次类推,在记录的各个网页分块中,计算在对应分块中的未收录锚文本的权重。By analogy, in each recorded web page segment, the weight of the anchor text not included in the corresponding segment is calculated.

Step S406_3: across the entire current category, sum the weights of the unrecorded anchor text calculated in each extracted web page block to obtain the weight of the unrecorded anchor text.

Across the entire current category, the per-block weights of the unrecorded anchor text calculated in step S406_2 are combined by weighted summation, and the result is taken as the weight of that anchor text.

For example, summing the weights of “伤不起” calculated in step S406_2 over all web page blocks gives a weight of 295.4, which is then compared with the preset weight threshold.

Counting shows that “伤不起” appears 1442 times in the song category; this count is compared with the preset frequency threshold.

If the weight exceeds the preset weight threshold or the frequency exceeds the preset frequency threshold, the anchor text is identified as a new entry. Depending on the application, both conditions may instead be required to hold before the anchor text is identified as a new entry.
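The aggregation and threshold test of steps S406_2 and S406_3 might be sketched as below. Formula 1 is not reproduced in this text, so `placeholder_block_weight` is merely a stand-in with the one stated property (a smaller distance gives a larger weight); it is not the patent's formula. The plain sum, the function names and the threshold values used in the example are likewise assumptions.

```python
def placeholder_block_weight(d):
    """Stand-in for Formula 1 (an assumption): any function decreasing in d."""
    return 1.0 / (1.0 + d)


def total_weight(distances, block_weight=placeholder_block_weight):
    """Sum the per-block weights of one anchor text over the current category.

    `distances` holds the context distance of the anchor text in each block.
    """
    return sum(block_weight(d) for d in distances)


def is_new_entry(weight, frequency, weight_threshold, frequency_threshold,
                 require_both=False):
    """Apply the weight and frequency thresholds.

    By default either condition suffices; `require_both=True` models the
    stricter variant in which both conditions must hold.
    """
    passes_weight = weight > weight_threshold
    passes_frequency = frequency > frequency_threshold
    if require_both:
        return passes_weight and passes_frequency
    return passes_weight or passes_frequency
```

For instance, with hypothetical thresholds of 300 for weight and 1000 for frequency, the figures quoted for “伤不起” (weight 295.4, frequency 1442) pass under the "either condition" rule but fail when both conditions are required.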

Step S407: determine whether all categories in the entry library have been processed. If so, proceed to step S408 and output the identification results for new entries; otherwise, return to step S401 and obtain the existing-entry set of the next category in the entry library, until all categories have been processed and the results are output.

The method provided by the present invention has been described in detail above; the entry acquisition device provided by the present invention is described in detail below.

Embodiment three

Figure 5 is a schematic diagram of the entry acquisition device provided by this embodiment. As shown in Figure 5, the device includes:

An existing-entry acquisition module 501, configured to obtain a set of existing entries of the same category from the entry library.

The entry library may be a categorized entry library such as an encyclopedia entry library or an input-method entry library; the present invention takes an encyclopedia entry library as an example.

The categories may be those already defined in the categorized entry library, including songs, movies, people, nature, culture, geography, history, life, society, art, economy, science and technology, and sports. Alternatively, they may be categories obtained by dividing the existing entries with an existing classification or clustering method (such as Bayesian classification, decision trees, or support vector machines (SVM)).

The set of existing entries of one category is obtained from the entry library, and the existing entries of each category are supplied in turn to the search module 502 and the extraction module 503 for processing.

A search module 502, configured to search using the set of existing entries obtained by the existing-entry acquisition module 501, obtain anchor texts containing the existing entries, and record the web page positions of those anchor texts.

Internet web pages are searched using the obtained set of existing entries to find anchor texts containing existing entries, and those anchor texts and their web page positions are recorded.

The web page position of an anchor text may include the web page in which it appears, the web page block in which it appears, and its position within that block. Figure 2 is a schematic diagram of a web page and the blocks it contains. As shown in Figure 2, the web page position of anchor text 1 is the first position in block A of that web page.

For example, the existing-entry acquisition module 501 obtains the song category set T1 from the encyclopedia entries. T1 contains tens of thousands of existing entries, for example {因为爱情, 爱你痛到不知痛, ...}. Searching then finds anchor texts containing entries in T1: a search with the existing entry “因为爱情” finds the anchor text “因为爱情” on the web page http://ting.baidu.com, as shown in Figure 3, and the web page block and web page position of that anchor text are recorded.

Alternatively, when searching for anchor texts containing the existing entries, all anchor texts of every web page on the Internet may be obtained first and then matched against the existing-entry set of each category; for the anchor texts that match, the web page, web page block and web page position are recorded.
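The alternative just described (collect every anchor text first, then match against the existing entries) could look roughly like the sketch below for a single page. It uses Python's standard `html.parser`; the class and function names are illustrative, and exact string equality is used as a simplifying assumption for what counts as a match.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the visible text of every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # anchor texts found so far
        self._buffer = None   # text pieces of the <a> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.anchors.append(text)
            self._buffer = None


def match_known_anchors(html, known_entries):
    """Return (position, text) for anchors whose text equals an existing entry."""
    collector = AnchorTextCollector()
    collector.feed(html)
    collector.close()
    return [(i, t) for i, t in enumerate(collector.anchors) if t in known_entries]
```

Keeping `known_entries` as a set makes each lookup constant time, which matters when a category holds tens of thousands of entries.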

An extraction module 503, configured to extract, at the web page positions recorded by the search module 502, anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

For each recorded web page position of an existing entry's anchor text, the anchor texts whose context distance from that position meets the requirement are extracted as entries.

The context distance meeting the preset requirement may include:

The web page block containing the extracted anchor text is the same as the block containing the anchor text of the existing entry. In Figure 2, anchor text 1 and anchor text 3 are in the same block, whereas anchor text 1 and anchor text 5 are in different blocks. If anchor text 1 is the anchor text of an existing entry, the anchor texts that meet the requirement are anchor text 2 and anchor text 3.

Specifically, the block containing an anchor text may be determined from page layout tags such as “<div></div>” and “<table></table>”, which indicate whether two anchor texts are in the same block. Alternatively, the block may be determined by visual segmentation of the web page or similar means.
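A minimal sketch of the tag-based variant follows, assuming well-formed HTML and treating the innermost enclosing `<div>` or `<table>` as the block of an anchor. The class name, the numeric block ids and the helper function are illustrative choices, not part of the patent; visual segmentation is not covered.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}   # layout tags treated as block boundaries


class BlockAnchorParser(HTMLParser):
    """Record each anchor text together with the id of its enclosing block."""

    def __init__(self):
        super().__init__()
        self.anchors = []    # (block_id, anchor_text) in document order
        self._stack = [0]    # ids of the enclosing blocks; 0 is the page root
        self._next_id = 1
        self._buffer = None  # text pieces of the <a> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append(self._next_id)
            self._next_id += 1
        elif tag == "a":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and len(self._stack) > 1:
            self._stack.pop()
        elif tag == "a" and self._buffer is not None:
            self.anchors.append((self._stack[-1], "".join(self._buffer).strip()))
            self._buffer = None


def anchors_in_same_block(html, known_anchor):
    """Return the other anchor texts sharing a block with `known_anchor`."""
    parser = BlockAnchorParser()
    parser.feed(html)
    parser.close()
    blocks = {b for b, t in parser.anchors if t == known_anchor}
    return [t for b, t in parser.anchors if b in blocks and t != known_anchor]
```

Applied to a page shaped like Figure 2, a query for anchor text 1 returns anchor texts 2 and 3 and leaves out anchor text 5, which sits in a different block.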

Alternatively, the extracted anchor text is in the same web page block as the anchor text of the existing entry, and the separation between the two is less than a preset distance threshold.

For example, Figure 3 is a schematic diagram of a web page block found by searching with the existing entry “因为爱情”. In Figure 3, anchor texts such as “王菲”, “伤不起”, “王麟”, “最炫民族风”, “凤凰传奇”, “新贵妃醉酒” and “爱的供养” are in the same block as the existing-entry anchor text “因为爱情”, and these anchor texts are extracted as entries.

To further improve precision, the separation may also be limited when extracting anchor texts whose context distance meets the preset requirement. If the separation between anchor texts such as “新贵妃醉酒” and “爱的供养” in Figure 3 and the existing-entry anchor text “因为爱情” exceeds the preset distance threshold, those anchor texts are not extracted.

The preset distance threshold is set according to actual needs, for example within 10 characters.
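The distance limit can be illustrated with a short filter. The sketch assumes that each candidate from the same block already carries its separation from the existing-entry anchor text; the function name, the strict "less than" comparison and the sample distances in the test are assumptions made for illustration.

```python
def within_distance(candidates, threshold=10):
    """Keep candidates whose separation is below the preset threshold.

    `candidates` is a list of (anchor_text, separation_in_characters) pairs
    taken from the same web page block as the existing-entry anchor text.
    """
    return [text for text, separation in candidates if separation < threshold]
```

A smaller threshold trades recall for precision: fewer anchor texts are extracted, but those that remain sit closer to a known entry.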

Embodiment four

Figure 6 is a schematic diagram of the entry acquisition device provided by this embodiment. As shown in Figure 6, the device includes:

An existing-entry acquisition module 601, configured to obtain a set of existing entries of the same category from the entry library.

A search module 602, configured to search using the set of existing entries obtained by the existing-entry acquisition module 601, obtain anchor texts containing the existing entries, and record the web page positions of those anchor texts.

An extraction module 603, configured to extract, at the web page positions recorded by the search module 602, anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

Modules 601 to 603 are configured in the same way as modules 501 to 503 in Embodiment three and are not described again here.

An existing-entry filtering module 604, configured to compare the extracted anchor texts with the entry library to obtain the unrecorded anchor texts.

Because an extracted anchor text may well be an existing entry, the extracted anchor texts are filtered to improve efficiency: existing entries are removed so that only unrecorded anchor texts are processed subsequently. If “牵手” and “背叛情歌” in Figure 3 are existing entries, they are filtered out.

Anchor texts extracted under one category may belong to other categories; for example, people such as “王菲” and “王麟” can be extracted from Figure 3. The extracted anchor texts are therefore compared with the entire entry library, and those already present in the library are removed, leaving the unrecorded anchor texts. An unrecorded anchor text that belongs to the people category or another preset related category is also retained and passed to the part-of-speech filtering module 605 and the new-entry identification module 606 for further processing. The preset related categories are categories that are associated with the current one and are set empirically; for example, the song category is associated with categories such as people, movies and entertainment.
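The comparison against the entire entry library amounts to a membership test over the union of all categories. The sketch below assumes the library is held as a hypothetical mapping from category name to a set of entries; repeated occurrences are kept on purpose, since the frequency count in a later module depends on them.

```python
def unrecorded_anchors(extracted, library_by_category):
    """Drop anchor texts that already exist anywhere in the entry library.

    `library_by_category` maps a category name to the set of its entries
    (an assumed in-memory layout). Order and duplicates are preserved.
    """
    all_entries = set().union(*library_by_category.values())
    return [text for text in extracted if text not in all_entries]
```

Checking against every category, not just the current one, is what removes a person such as “王菲” when that name is already recorded under the people category.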

It should be noted that when processing efficiency is not critical, this module may be omitted. Alternatively, this module may be applied after the new-entry identification module 606 has obtained the weight or frequency of an anchor text, to check whether the anchor text is unrecorded and thereby determine the new entries. In that case, the part-of-speech filtering module 605 and the new-entry identification module 606 operate on the extracted anchor texts.

A part-of-speech filtering module 605, configured to filter out, from the unrecorded anchor texts, those that do not contain a specified part of speech.

For the anchor texts obtained by the existing-entry filtering module 604, word segmentation and part-of-speech tagging are used to filter out anchor texts that do not contain a specified part of speech, for example those containing no verb, noun or adjective.

In addition, to obtain well-formed entries, anchor texts may also be filtered by length and by the punctuation they contain, removing those that do not meet the requirements.
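The length and punctuation check can be sketched with the standard `unicodedata` module. The length bounds and the rule of rejecting any anchor text that contains a punctuation character are assumptions chosen for illustration; the text does not state the actual requirements. Part-of-speech filtering needs a word segmenter and tagger and is not shown.

```python
import unicodedata


def is_well_formed(text, min_len=2, max_len=20):
    """Return True if `text` has an acceptable length and no punctuation.

    The bounds are hypothetical. A character counts as punctuation when its
    Unicode general category starts with "P" (e.g. Po, Ps, Pe).
    """
    if not (min_len <= len(text) <= max_len):
        return False
    return not any(unicodedata.category(ch).startswith("P") for ch in text)
```

Because the test relies on Unicode categories, it treats full-width marks such as “，” the same way as ASCII punctuation.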

Of course, this module is also optional.

A new-entry identification module 606, configured to calculate the weight of the unrecorded anchor text from its context distance to the anchor text of the existing entry, count the frequency with which the unrecorded anchor text appears in the current category, and identify anchor texts whose frequency or weight meets a preset requirement as new entries.

The module counts the frequency, that is, the number of occurrences, in the current category of the anchor texts remaining after filtering by the part-of-speech filtering module 605, and calculates the weights of those anchor texts. Specifically, to calculate the weight of an anchor text from its context distance to the anchor text of the existing entry, the module includes:

A distance determination unit, configured to determine, within the same web page block, the context distance between the unrecorded anchor text and the anchor text of the existing entry.

Specifically, the distance determination unit first determines the anchor texts of existing entries contained in the web page block in which the unrecorded anchor text appears, and then calculates the distance between the unrecorded anchor text and each of those anchor texts.

The context distance d may be computed as, but is not limited to, the length of the character string separating the unrecorded anchor text from the existing entry, not counting page layout tags, spaces, carriage returns and similar symbols.

Finally, the distance determination unit takes the minimum of these distances as the context distance to the existing entries.

For example, suppose the same web page block contains anchor texts K1, K2, K3, …, Kn of several existing entries and several unrecorded anchor texts L1, L2, L3, and so on. For each unrecorded anchor text in the block, the distances to K1 through Kn are computed, and the smallest of them is taken as the context distance between that unrecorded anchor text and the existing entries.

A weight calculation unit, configured to use the context distance determined by the distance determination unit to calculate the weight of the unrecorded anchor text in the corresponding web page block.

The weight calculation unit uses the context distance between the unrecorded anchor text and the existing entries to calculate the weight of that anchor text in each web page block. The smaller the context distance, the greater the weight.

The weight may be calculated with, but is not limited to, Formula 1.

In the web page block shown in Figure 3, the weight of the unrecorded anchor text “伤不起” is calculated from the existing-entry anchor text “因为爱情” as follows:

The context distance is d = 6, the intervening string including “2”, “王麟” and “-”, and the weight then follows from Formula 1.

In the same way, the weight of the unrecorded anchor text is calculated in each recorded web page block in which it appears.

A weighting unit, configured to sum, across the entire current category, the weights of the unrecorded anchor text calculated in each extracted web page block to obtain the weight of the unrecorded anchor text.

Across the entire current category, the per-block weights of the unrecorded anchor text calculated by the weight calculation unit are combined by weighted summation, and the result is taken as the weight of that anchor text.

For example, summing the weights of “伤不起” calculated by the weight calculation unit over all web page blocks gives a weight of 295.4, which is then compared with the preset weight threshold.

The new-entry identification module 606 counts that “伤不起” appears 1442 times in the song category; this count is compared with the preset frequency threshold.

If the weight exceeds the preset weight threshold or the frequency exceeds the preset frequency threshold, the anchor text is identified as a new entry. Depending on the application, both conditions may instead be required to hold before the anchor text is identified as a new entry.

A judgment module 607, configured to determine whether all categories in the entry library have been processed. If so, control passes to the result output module 608, which outputs the identification results for new entries; otherwise, control returns to the existing-entry acquisition module 601, which obtains the existing-entry set of the next category in the entry library, until all categories have been processed and the results are output.

The entry acquisition method and device provided by the present invention mine entity entries using an existing entry library and supply new entries that have not yet been created. This can guide users in creating the knowledge corresponding to the new entries, addresses the insufficient coverage of entity entries in encyclopedia databases, helps complete structured data (entity entry, attribute name, attribute value), and facilitates more effective knowledge search.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

1. A method for acquiring entries, characterized by comprising:
S1. obtaining a set of existing entries of the same category from an entry library;
S2. searching using the obtained set of existing entries to obtain anchor texts containing the existing entries, and recording the web page positions of the anchor texts of the existing entries;
S3. according to the recorded web page positions, extracting at the corresponding positions anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

2. The method according to claim 1, characterized by further comprising, after step S3:
S4. calculating the weight of the extracted anchor text according to its context distance from the anchor text of the existing entry, counting the frequency with which the extracted anchor text appears in the current category, and identifying anchor texts whose frequency or weight meets a preset requirement as new entries.

3. The method according to claim 1 or 2, characterized in that the web page position of the anchor text comprises:
the web page in which the anchor text appears, the web page block in which the anchor text appears, and the position of the anchor text within the web page block.

4. The method according to claim 3, characterized in that the context distance meeting the preset requirement comprises:
the web page block containing the extracted anchor text being the same as the web page block containing the anchor text of the existing entry.

5. The method according to claim 4, characterized in that the context distance meeting the requirement further comprises:
the separation between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.

6. The method according to claim 2, characterized in that calculating the weight of the extracted anchor text according to its context distance from the anchor text of the existing entry specifically comprises:
determining, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
using the determined context distance to calculate the weight of the extracted anchor text in the corresponding web page block;
across the entire current category, summing the weights of the extracted anchor text calculated in each extracted web page block to obtain the weight of the extracted anchor text.

7. The method according to claim 6, characterized in that determining, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry specifically comprises:
determining the anchor texts of existing entries contained in the web page block in which the extracted anchor text appears;
calculating the distance between the extracted anchor text and each of the obtained anchor texts of existing entries;
taking the minimum of the distances as the context distance to the existing entries.

8. The method according to claim 6, characterized by further comprising, after step S3:
comparing the extracted anchor texts with the entry library to obtain unrecorded anchor texts;
performing step S4 only on the unrecorded anchor texts.

9. The method according to claim 2, characterized by further comprising, after step S3:
filtering out, from the extracted anchor texts, those that do not contain a specified part of speech;
performing step S4 only on the anchor texts remaining after filtering.

10. A device for acquiring entries, characterized by comprising:
an existing-entry acquisition module, configured to obtain a set of existing entries of the same category from an entry library;
a search module, configured to search using the set of existing entries obtained by the existing-entry acquisition module, obtain anchor texts containing the existing entries, and record the web page positions of the anchor texts of the existing entries;
an extraction module, configured to extract, at the web page positions recorded by the search module, anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.

11. The device according to claim 10, characterized in that the device further comprises:
a new-entry identification module, configured to calculate the weight of the anchor text extracted by the extraction module according to its context distance from the anchor text of the existing entry, count the frequency with which the extracted anchor text appears in the current category, and identify anchor texts whose frequency or weight meets a preset requirement as new entries.

12. The device according to claim 10 or 11, characterized in that the web page position of the anchor text comprises:
the web page in which the anchor text appears, the web page block in which the anchor text appears, and the position of the anchor text within the web page block.

13. The device according to claim 12, characterized in that the context distance meeting the preset requirement comprises:
the web page block containing the extracted anchor text being the same as the web page block containing the anchor text of the existing entry.

14. The device according to claim 13, characterized in that the context distance meeting the requirement further comprises:
the separation between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.

15. The device according to claim 11, characterized in that the new-entry identification module comprises:
a distance determination unit, configured to determine, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
a weight calculation unit, configured to use the context distance determined by the distance determination unit to calculate the weight of the extracted anchor text in the corresponding web page block;
a weighting unit, configured to sum, across the entire current category, the weights of the extracted anchor text calculated in each extracted web page block to obtain the weight of the extracted anchor text.

16. The device according to claim 15, characterized in that the distance determination unit is specifically configured to:
determine the anchor texts of existing entries contained in the web page block in which the extracted anchor text appears;
calculate the distance between the extracted anchor text and each of the obtained anchor texts of existing entries;
take the minimum of the distances as the context distance to the existing entries.

17. The device according to claim 15, characterized in that the device further comprises:
an existing-entry filtering module, configured to compare the anchor texts extracted by the extraction module with the entry library to obtain unrecorded anchor texts, and to provide the unrecorded anchor texts to the new-entry identification module.

18. The device according to claim 11, characterized in that the device further comprises:
a part-of-speech filtering module, configured to filter out, from the anchor texts extracted by the extraction module, those that do not contain a specified part of speech, and to provide the anchor texts remaining after filtering to the new-entry identification module.
CN201210151282.6A 2012-05-15 2012-05-15 The acquisition methods and device of a kind of entry Active CN103425660B (en)

Publications (2)

Publication Number Publication Date
CN103425660A CN103425660A (en) 2013-12-04
CN103425660B true CN103425660B (en) 2017-10-17

CN109446399A (en) A kind of video display entity search method
KR101473239B1 (en) Category and Sentiment Analysis System using Word pattern.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant