CN102567409A - Method and device for providing retrieval associated word

Info

Publication number: CN102567409A
Application number: CN2010106185605A
Authority: CN
Inventors: 吴周强; 蔡勇; 王彪; 倪玉华; 吴悠; 彭德琦
Original assignee: ZHUHAI BORUI SCIENCE AND TECHNOLOGY Co Ltd; Beijing Normal University
Current assignee: ZHUHAI BORUI SCIENCE AND TECHNOLOGY Co Ltd; Beijing Normal University
Priority date: 2010-12-31
Filing date: 2010-12-31
Publication date: 2012-07-11

Abstract

The present invention relates to retrieval technology and discloses a method and device for providing retrieval associated words, intended to improve the accuracy of retrieval associated words and the retrieval efficiency of search engines. In the method, retrieval associated words for a search keyword entered by the user are not obtained by fuzzy matching. Instead, target word segments are extracted from the text data of the web pages that contain the search keyword, the relevance between each target word segment and the search keyword is calculated, and the target word segments whose relevance reaches a set threshold are presented as retrieval associated words. In this way, based on the content of the web pages containing the search keyword, other retrieval associated words that are logically related to the search keyword can be obtained, whether or not they also have a fuzzy matching relationship with it. This improves the accuracy of the retrieval associated words, avoids omitting some of them, and effectively improves the retrieval efficiency of the search engine.

Description

A method and device for providing retrieval associated words

Technical field

The invention relates to retrieval technology, and in particular to a method and device for providing retrieval associated words.

Background

With the rapid growth of the volume of information on the Internet, search engines need auxiliary means to improve search efficiency and serve users better, and intelligent recommendation of retrieval associated words is one such means. As shown in Figure 1, intelligent recommendation of retrieval associated words means that when a user enters a search keyword, for example "人参" (ginseng), the search page presents a series of words that may be associated with that keyword, called retrieval associated words. In Figure 1 these are "人参果" (ginseng fruit), "人参健脾丸" (ginseng Jianpi pill), "人参娃娃" (ginseng doll) and so on. By clicking a retrieval associated word, the user can search further.

At present, many search engines support intelligent recommendation of retrieval associated words, and the recommendation methods they use usually fall into the categories listed below.

However, different search engines implement this function in different ways. Because general-purpose search engines serve a very broad population of users and cover a very wide range of fields, most of them use simple and efficient recommendation calculations. Common approaches are:

1. Fuzzy matching against system keywords.

After the user enters a search keyword, fuzzy matching is performed against the contents of a preset system keyword library, and the keywords that match are recommended to the user as retrieval associated words.

2. Fuzzy matching against the user's own input keywords.

The search keywords entered by a user are saved in a user input keyword library together with a count of how often each was entered. When that user enters a search keyword, fuzzy matching is performed against the contents of the user input keyword library, and the matching keywords are sorted by accumulated count and recommended to the user as retrieval associated words.

3. Recommendation based on keywords entered by other users.

The system records the complete search keyword input history of every user and saves it in the user input keyword library. After a user enters a search keyword, the system looks up the matching keyword in the user input keyword library and recommends to the user the other search keywords entered by users who previously entered the same keyword.

However, existing search engines do not take the industry characteristics of retrieval associated words into account: the recommended words are generally generic keyword sequences. In addition, existing search engines usually recommend retrieval associated words by fuzzy matching, so many words that have no fuzzy matching relationship with the search keyword but do have a logical relationship with it are never recommended. For example, when the user enters the search keyword "人参" (ginseng), logically related words such as "皂苷" (saponin) and "黄芪" (astragalus) are not recommended. This impairs the accuracy of the retrieval associated words and thereby lowers the retrieval efficiency of the search engine.
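The limitation of fuzzy matching can be illustrated with a minimal Python sketch of prior-art approach 2. It assumes that "fuzzy matching" is a plain substring test and that the keyword library is a counter of past inputs; the function name `fuzzy_recommend` and the sample counts are hypothetical and do not come from the patent.

```python
from collections import Counter


def fuzzy_recommend(query: str, keyword_counts: Counter, limit: int = 5) -> list[str]:
    """Return logged keywords that contain `query`, most frequently entered first."""
    matches = [
        (keyword, count)
        for keyword, count in keyword_counts.items()
        if query in keyword and keyword != query  # assumption: substring test
    ]
    # Sort by descending count, then alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [keyword for keyword, _ in matches[:limit]]


# Hypothetical input log: keyword -> number of times it was entered.
log = Counter({"人参果": 12, "人参健脾丸": 7, "人参娃娃": 3, "皂苷": 9, "黄芪": 15})
suggestions = fuzzy_recommend("人参", log)
```

With this sample log, "皂苷" and "黄芪" never appear in `suggestions`, however often they were entered, because neither contains the string "人参".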

Summary of the invention

Embodiments of the invention disclose a method and device for providing retrieval associated words, intended to improve the accuracy of retrieval associated words and the retrieval efficiency of a search engine.

The specific technical solutions provided by the embodiments of the invention are as follows.

A method for providing retrieval associated words, comprising:

acquiring, according to a search keyword entered by a user, web pages containing the search keyword;

extracting target word segments based on the text data contained in the web pages;

calculating the relevance between each target word segment and the search keyword, based on the density of that target word segment in each web page;

presenting the target word segments whose relevance reaches a set threshold to the user as retrieval associated words.

A device for providing retrieval associated words, comprising:

an acquisition unit, configured to acquire, according to a search keyword entered by a user, web pages containing the search keyword;

an extraction unit, configured to extract target word segments based on the text data contained in the web pages;

a calculation unit, configured to calculate the relevance between each target word segment and the search keyword, based on the density of that target word segment in each web page;

a presenting unit, configured to present the target word segments whose relevance reaches a set threshold to the user as retrieval associated words.

In the embodiments of the invention, retrieval associated words for a search keyword entered by the user are not obtained by fuzzy matching. Instead, target word segments are extracted from the text data of the web pages that contain the search keyword, the relevance between each target word segment and the search keyword is calculated, and the target word segments whose relevance reaches a set threshold are presented as retrieval associated words. In this way, based on the content of the web pages containing the search keyword, other retrieval associated words that are logically related to the search keyword can be obtained, whether or not they also have a fuzzy matching relationship with it. This improves the accuracy of the retrieval associated words, avoids omitting some of them, and effectively improves the retrieval efficiency of the search engine.

Description of drawings

Fig. 1 is a schematic diagram of intelligent recommendation of retrieval associated words in the prior art;

Fig. 2 is a functional structure diagram of the retrieval device in an embodiment of the invention;

Fig. 3 is a flow chart of establishing the recommended thesaurus in an embodiment of the invention;

Fig. 4 is a flow chart of providing retrieval associated words to the user in an embodiment of the invention.

Detailed description

To improve the accuracy of retrieval associated words and the retrieval efficiency of the search engine, the embodiments of the invention no longer obtain retrieval associated words by fuzzy matching. Instead, they are obtained from the text data of the web pages in which the search keyword appears. Specifically: web pages containing the search keyword entered by the user are acquired; target word segments are extracted from the text data of those pages; the relevance between each target word segment and the search keyword is calculated from the density of that segment in each web page; and finally the target word segments whose relevance reaches a set threshold are presented to the user as retrieval associated words.

Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

In the embodiments of the invention, the method for providing retrieval associated words has two stages. The first stage creates and updates the recommended thesaurus of retrieval associated words; the second stage presents a recommendation list of retrieval associated words according to the search keyword entered by the user.

Referring to Figure 2, in an embodiment of the invention the recommended thesaurus is created as follows:

Step 200: Determine whether the search keyword list still contains unread search keywords. If so, go to step 210; otherwise, end the current flow.

In this embodiment, a search keyword list is set up from commonly used search keywords, and the corresponding recommended thesaurus is built from that list. The search keywords recorded in the list may be entered in advance by administrators, or may be learned during use from the search keywords entered by users.

Step 210: Read one search keyword from the search keyword list.

Step 220: Acquire the corresponding web pages according to the search keyword that was read.

In this embodiment, a web crawling tool such as WebSpider, PClawer or hpricot may be used. Given a search keyword, the crawling tool obtains a set of web pages related to that keyword, and the crawled set is saved in storage for later use when calculating relevance.

In addition, the website industry attribute may be restricted when acquiring web pages: once the specified website industry attribute has been determined, the web pages containing the search keyword are acquired only from websites that have that attribute. For example, if the user needs to retrieve information within the traditional Chinese medicine industry, the search engine can be configured to acquire web pages only from search engines dedicated to the traditional Chinese medicine industry or from websites related to that industry.
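The industry restriction can be sketched as a simple host filter. This is only an illustration: the patent does not say how a website's industry attribute is stored, so the mapping `INDUSTRY_SITES`, the attribute name `"tcm"` and the example hosts are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical registry: industry attribute -> hosts known to have that attribute.
INDUSTRY_SITES = {
    "tcm": {"tcm.example.com", "herbs.example.org"},
}


def filter_by_industry(urls: list[str], industry: str) -> list[str]:
    """Keep only the URLs whose host is registered under the given industry attribute."""
    allowed_hosts = INDUSTRY_SITES.get(industry, set())
    return [url for url in urls if urlparse(url).hostname in allowed_hosts]


candidate_urls = [
    "https://tcm.example.com/ginseng",
    "https://news.example.net/ginseng",
]
tcm_urls = filter_by_industry(candidate_urls, "tcm")
```

An unknown industry attribute yields an empty result here. Whether a real system should fall back to unrestricted search in that case is a design choice the patent leaves open.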

Step 230: Extract the text data contained in each web page and segment it into words to obtain the target word segments.

Preferably, the minimum segmentation method is used for word segmentation. The minimum segmentation method segments a whole sentence so that the number of resulting word segments is as small as possible.
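One way to realize minimum segmentation is dynamic programming over a word dictionary. The sketch below is an assumption about how this could be done and is not the patent's implementation: the dictionary `vocab`, the `max_len` limit and the rule that any single character may stand alone as a segment are all illustrative choices.

```python
def min_segment(sentence: str, vocab: set[str], max_len: int = 6) -> list[str]:
    """Split `sentence` into the fewest segments, using dictionary words where possible."""
    n = len(sentence)
    # best[i] is the fewest segments needed to cover sentence[:i].
    best = [0] + [n + 1] * n
    # back[i] is the start index of the last segment in that best split.
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = sentence[start:end]
            # Assumption: a single character is always a valid segment,
            # so every sentence can be segmented.
            is_valid = piece in vocab or end - start == 1
            if is_valid and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = start
    # Walk the back-pointers from the end of the sentence to recover the split.
    segments = []
    pos = n
    while pos > 0:
        segments.append(sentence[back[pos]:pos])
        pos = back[pos]
    return segments[::-1]
```

With a dictionary containing only "人参" and "疗效", the phrase "人参的疗效" is split into three segments. If the dictionary also contains the whole phrase, the result is a single segment, which is the behavior the minimum segmentation method aims for.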

Step 240: Use a clustering algorithm to calculate the density of each target word segment in each web page.

The purpose of the clustering algorithm is to calculate how frequently each target word segment occurs (that is, its density). Preferably, target word segments whose density does not reach a set threshold are deleted, which reduces the complexity of the subsequent processing. For example, suppose that segmenting the text data of the web pages obtained for the search keyword "人参" (ginseng) yields the target word segments "黄芪" (astragalus), "党参" (codonopsis), "人参的治疗" (ginseng treatment), "人参的疗效" (curative effect of ginseng), "人参娃娃" (ginseng doll) and so on. After clustering and filtering, the target word segments whose density reaches the set threshold are "黄芪", "党参" and "人参的疗效".
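The patent does not specify the clustering algorithm or an exact definition of density. The following sketch therefore swaps in plain frequency counting and assumes that density means the share of a page's segments taken up by a given segment. The function names and the threshold value are hypothetical.

```python
from collections import Counter


def segment_density(page_segments: list[str]) -> dict[str, float]:
    """Density of each segment on one page: occurrences divided by total segments."""
    total = len(page_segments)
    if total == 0:
        return {}
    return {segment: count / total for segment, count in Counter(page_segments).items()}


def filter_by_density(densities: dict[str, float], threshold: float) -> dict[str, float]:
    """Drop segments whose density is below the threshold."""
    return {segment: d for segment, d in densities.items() if d >= threshold}


page = ["黄芪", "党参", "黄芪", "人参娃娃"]
densities = segment_density(page)
frequent = filter_by_density(densities, threshold=0.3)
```

Filtering per page, as here, is the simplest reading of the text. Filtering on density aggregated across all pages would be an equally plausible interpretation.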

Before the word segments are clustered, words that do not help in recommending retrieval associated words, such as common modal particles, conjunctions and prepositions, are preferably deleted, for example "吗" (a question particle), "一个" ("a" or "one"), "非常" ("very") and so on, so that more precise target word segments can be extracted.
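A minimal sketch of this pre-filtering step is shown here. It assumes the unhelpful words are kept in a fixed stop-word set; the set contains only the three examples given in the text, and a real list would be far longer.

```python
# Stop words taken from the examples in the text; the constant name is hypothetical.
STOP_WORDS = frozenset({"吗", "一个", "非常"})


def drop_stop_words(segments: list[str], stop_words: frozenset = STOP_WORDS) -> list[str]:
    """Remove segments that carry no value for associated-word recommendation."""
    return [segment for segment in segments if segment not in stop_words]


cleaned = drop_stop_words(["人参", "非常", "黄芪", "吗"])
```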

Step 250: Determine the weight of each acquired web page.

In the embodiments of the invention, the weight of any web page may be determined with reference to one or any combination of the following parameters:

Website authority index, which characterizes the authority of the website to which the web page belongs. For example, the website authority index of a well-known large website is higher than that of an ordinary website. The higher a website's authority index, the higher the reference value of the web pages it contains.

Website specialty index, which characterizes the industry attribute of the website to which the web page belongs. For example, if retrieval is restricted to the traditional Chinese medicine industry, the website specialty index of a dedicated traditional Chinese medicine website is higher than that of a general medical website. When the industry attribute of websites is restricted, the higher a website's specialty index, the higher the reference value of the web pages it contains.

User click rate, which characterizes the number of user views of the web page. The more user views a web page has, the higher its reference value.

Target word segment occurrence count, which characterizes the cumulative number of occurrences of target word segments in the web page. The more target word segments a web page contains, the higher its reference value.

......

In practice, administrators can choose the parameters used to determine web page weights according to the specific application environment. The parameters above are only examples, and the invention is not limited to them.
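The patent lists the parameters but gives no rule for combining them. As one hedged possibility, the sketch below assumes each parameter has already been normalized to the range 0 to 1 and combines them linearly. The class `PageSignals`, the function `page_weight` and the coefficient values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSignals:
    """Per-page parameters, each assumed to be pre-normalized to [0, 1]."""
    authority: float     # website authority index
    specialty: float     # website specialty index
    click_rate: float    # user click rate
    segment_hits: float  # target word segment occurrence count


def page_weight(signals: PageSignals, coeffs: tuple = (0.4, 0.3, 0.2, 0.1)) -> float:
    """Combine the parameters linearly; the coefficients are illustrative only."""
    a, b, c, d = coeffs
    return (
        a * signals.authority
        + b * signals.specialty
        + c * signals.click_rate
        + d * signals.segment_hits
    )


specialist_site = PageSignals(authority=0.5, specialty=1.0, click_rate=0.2, segment_hits=0.4)
weight = page_weight(specialist_site)
```

Setting a coefficient to zero drops that parameter, which matches the text's "one or any combination" wording.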

Step 260: Based on the density of each target word segment and the weight of each web page, calculate the relevance between each target word segment and the search keyword.

In practice, the relevance between any target word segment and the search keyword may be calculated with the following formula:

Relevance = X1*w1 + X2*w2 + ...... + Xn*wn    (Formula 1)

where X1, X2, ...... Xn are the densities of the target word segment in each of the n web pages, and w1, w2, ...... wn are the weights of those web pages.
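Formula 1 is a weighted sum and can be written directly in code. In this sketch the function name `relevance` is hypothetical, and the two lists are assumed to be aligned so that position i in each refers to the same web page.

```python
def relevance(densities: list[float], weights: list[float]) -> float:
    """Formula 1: sum of density times page weight over all web pages."""
    if len(densities) != len(weights):
        raise ValueError("densities and weights must cover the same web pages")
    return sum(x * w for x, w in zip(densities, weights))


# A segment with density 0.5 on a page of weight 1.0 and 0.25 on a page of weight 0.5.
score = relevance([0.5, 0.25], [1.0, 0.5])  # 0.5*1.0 + 0.25*0.5 = 0.625
```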

In practice, the density of the target word segments in each web page and the weight of each web page, both of which are used to calculate relevance, need to be updated regularly.

Step 270: Save the target word segments whose relevance reaches the set threshold as the retrieval associated words corresponding to the search keyword.

In this way, steps 200 to 270 establish the initial recommended thesaurus for subsequent use.
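Steps 200 to 270 can be summarized in a short sketch. Several assumptions apply: page fetching and segmentation are hidden behind a caller-supplied `fetch_pages` function that returns, for each page, its word segments and its weight; density is the share of a page's segments; and the search keyword itself is excluded from its own associated words. None of these names or choices come from the patent.

```python
from collections import Counter
from typing import Callable

Page = tuple[list[str], float]  # (word segments of the page text, page weight)


def build_thesaurus(
    keywords: list[str],
    fetch_pages: Callable[[str], list[Page]],
    threshold: float,
) -> dict[str, list[tuple[str, float]]]:
    """Map each search keyword to its associated words and relevance scores."""
    thesaurus = {}
    for keyword in keywords:                      # steps 200-210: read each keyword
        scores: dict[str, float] = {}
        for segments, weight in fetch_pages(keyword):  # steps 220-230
            if not segments:
                continue
            total = len(segments)
            for segment, count in Counter(segments).items():
                if segment == keyword:            # assumption: skip the keyword itself
                    continue
                # Steps 240-260: add density * page weight (Formula 1).
                scores[segment] = scores.get(segment, 0.0) + (count / total) * weight
        # Step 270: keep segments whose relevance reaches the threshold.
        kept = [(segment, s) for segment, s in scores.items() if s >= threshold]
        kept.sort(key=lambda item: (-item[1], item[0]))
        thesaurus[keyword] = kept
    return thesaurus


def fake_fetch(keyword: str) -> list[Page]:
    """Stand-in for crawling and segmentation, with made-up pages."""
    return [
        (["人参", "黄芪", "皂苷", "黄芪"], 1.0),
        (["人参", "皂苷", "人参娃娃", "党参"], 0.5),
    ]


thesaurus = build_thesaurus(["人参"], fake_fetch, threshold=0.3)
```

With the made-up pages above, "黄芪" scores 0.5 and "皂苷" scores 0.375, so both are stored, while "人参娃娃" and "党参" fall below the threshold.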

Based on the above embodiment and referring to Figure 3, the detailed flow for providing retrieval associated words for a search keyword entered by the user, using the established recommended thesaurus, is as follows:

Step 300: Acquire the search keyword entered by the user.

Step 310: Determine whether the acquired search keyword already exists in the recommended thesaurus. If so, go to step 320; otherwise, go to step 330.

Step 320: Obtain the retrieval associated words corresponding to the search keyword from the recommended thesaurus and present them to the user.

In this embodiment, the retrieval associated words are preferably presented to the user in descending order of their relevance to the search keyword.

Step 330: Acquire the corresponding web pages according to the acquired search keyword.

As in step 220, the website industry attribute may be restricted when acquiring web pages: once the specified website industry attribute has been determined, the web pages containing the search keyword are acquired only from websites that have that attribute. For example, if the user needs to retrieve information within the traditional Chinese medicine industry, the search engine can be configured to acquire web pages only from search engines dedicated to the traditional Chinese medicine industry or from websites related to that industry.

Step 340: Extract the text data contained in each web page and segment it into words to obtain the target word segments.

Step 350: Use a clustering algorithm to calculate the density of each target word segment in each web page.

Preferably, after the density of each target word segment in each web page has been calculated, target word segments whose density does not reach the set threshold are deleted, which reduces the complexity of the subsequent operations.

Step 360: Determine the weight of each web page.

As in step 250, the weight of any web page may be determined with reference to one or any combination of the following parameters: website authority index, website specialty index, user click rate, target word segment occurrence count......

Step 370: Based on the density of each target word segment and the weight of each web page, calculate the relevance between each target word segment and the search keyword entered by the user.

As in step 260, Formula 1 can again be used to calculate the relevance of each target word segment.

Step 380: Present the target word segments whose relevance reaches the set threshold to the user as retrieval associated words.

In this embodiment, when the target word segments are presented to the user as retrieval associated words, they are preferably presented in descending order of their relevance to the search keyword.

At the same time, the search keyword entered by the user and its corresponding retrieval associated words also need to be saved in the recommended thesaurus.
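The branch at step 310 amounts to a cache lookup with a computed fallback. The sketch below assumes the thesaurus is an in-memory dictionary from keyword to (word, relevance) pairs and that steps 330 to 380 are wrapped in a caller-supplied function; `recommend` and `compute_associated` are hypothetical names.

```python
from typing import Callable

Associated = list[tuple[str, float]]  # (retrieval associated word, relevance)


def recommend(
    keyword: str,
    thesaurus: dict[str, Associated],
    compute_associated: Callable[[str], Associated],
) -> list[str]:
    """Return associated words for `keyword`, most relevant first."""
    entries = thesaurus.get(keyword)              # step 310: already in the thesaurus?
    if entries is None:
        entries = compute_associated(keyword)     # steps 330-380: compute from web pages
        thesaurus[keyword] = entries              # save for the next lookup
    ranked = sorted(entries, key=lambda item: -item[1])  # descending relevance
    return [word for word, _ in ranked]


calls = []


def fake_compute(keyword: str) -> Associated:
    """Stand-in for steps 330-380 that records how often it is invoked."""
    calls.append(keyword)
    return [("皂苷", 0.375), ("黄芪", 0.5)]


store: dict[str, Associated] = {}
first = recommend("人参", store, fake_compute)
second = recommend("人参", store, fake_compute)
```

The second call is served from `store` without recomputation, which is the saving the recommended thesaurus is meant to provide.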

In practice, as web pages are continually updated, the density of the target word segments and the weight of each web page used to calculate relevance need to be updated regularly. Correspondingly, the retrieval associated words for each search keyword also keep changing, so they too need to be updated regularly.

Based on the above embodiments and referring to Figure 4, the device for providing retrieval associated words in an embodiment of the invention, called the retrieval device, comprises:

an acquisition unit 40, configured to acquire, according to a search keyword entered by a user, web pages containing the search keyword;

an extraction unit 41, configured to extract target word segments based on the text data contained in the web pages;

a calculation unit 42, configured to calculate the relevance between each target word segment and the search keyword, based on the density of that target word segment in each web page;

a presenting unit 43, configured to present the target word segments whose relevance reaches a set threshold to the user as retrieval associated words.

As shown in Figure 4, the retrieval device is further provided with an update unit 44, configured to update, according to a set period, the retrieval associated words corresponding to each search keyword stored in the recommended thesaurus.
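The behavior of the update unit can be sketched as a pass that recomputes every entry older than the set period. The patent does not describe how update times are tracked, so the `updated_at` dictionary, the `recompute` callback and the use of seconds as the unit are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def refresh_stale(
    thesaurus: dict,
    updated_at: dict,
    recompute: Callable[[str], list],
    period_s: float,
    now: Optional[float] = None,
) -> list:
    """Recompute entries not updated within `period_s` seconds; return their keywords."""
    now = time.time() if now is None else now
    refreshed = []
    for keyword in list(thesaurus):
        # Entries with no recorded update time are treated as stale.
        if now - updated_at.get(keyword, float("-inf")) >= period_s:
            thesaurus[keyword] = recompute(keyword)
            updated_at[keyword] = now
            refreshed.append(keyword)
    return refreshed


entries = {"人参": [("黄芪", 0.5)], "黄芪": [("党参", 0.4)]}
stamps = {"人参": 0.0, "黄芪": 90.0}
done = refresh_stale(entries, stamps, lambda k: [("皂苷", 0.6)], period_s=60.0, now=100.0)
```

In a deployed system such a pass would be triggered by a scheduler. That part is omitted here because the patent only states that updates follow a set period.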

In summary, in the embodiments of the invention, retrieval associated words for a search keyword entered by the user are not obtained by fuzzy matching. Instead, target word segments are extracted from the text data of the web pages that contain the search keyword, the relevance between each target word segment and the search keyword is calculated, and the target word segments whose relevance reaches a set threshold are presented as retrieval associated words. In this way, based on the content of the web pages containing the search keyword, a clustering algorithm is used to obtain other retrieval associated words that are related to the search keyword, whether or not they also have a fuzzy matching relationship with it. This improves the accuracy of the retrieval associated words, avoids omitting some of them, and effectively improves the retrieval efficiency of the search engine.

Furthermore, in the embodiments of the invention, the industry attribute of the websites to be searched can be restricted, so that the retrieval scope can be targeted accurately, which further improves the retrieval efficiency of the search engine.

Obviously, those skilled in the art can make various changes and variations to the invention without departing from its spirit and scope. If these changes and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims

1. A method for providing retrieval associated words, characterized in that it comprises:

acquiring, according to a search keyword entered by a user, web pages containing the search keyword;

extracting target word segments based on the text data contained in the web pages;

calculating the relevance between each target word segment and the search keyword, based on the density of that target word segment in each web page;

presenting the target word segments whose relevance reaches a set threshold to the user as retrieval associated words.

2. The method according to claim 1, wherein acquiring the web pages containing the search keyword comprises:

determining a specified website industry attribute;

acquiring the web pages containing the search keyword from websites having the website industry attribute.

3. The method according to claim 1, wherein a clustering algorithm is used to calculate the density of each target word segment in each web page.

4. The method according to claim 1, wherein, when the relevance between any target word segment and the search keyword is calculated based on the density of the target word segment in each web page, the following formula is adopted:

Relevance = X1*w1 + X2*w2 + ...... + Xn*wn

wherein X1, X2, ...... Xn are the densities of the target word segment in each web page, and w1, w2, ...... wn are the weights of the web pages.

5. The method according to claim 4, wherein the weight of any web page is determined with reference to one or any combination of website authority index, website specialty index, user click rate and target word segment occurrence count.

6. The method according to any one of claims 1-5, wherein the obtained target word segments are saved as the retrieval associated words of the search keyword, and when the user next enters the search keyword, the corresponding retrieval associated words are obtained directly and presented to the user.

7. The method according to claim 6, characterized in that the retrieval associated words corresponding to the search keyword are updated according to a set period.

8. A device for providing retrieval associated words, characterized in that it comprises:

an acquisition unit, configured to acquire, according to a search keyword entered by a user, web pages containing the search keyword;

an extraction unit, configured to extract target word segments based on the text data contained in the web pages;

a calculation unit, configured to calculate the relevance between each target word segment and the search keyword, based on the density of that target word segment in each web page;

a presenting unit, configured to present the target word segments whose relevance reaches a set threshold to the user as retrieval associated words.

9. The device according to claim 8, wherein, when acquiring the web pages containing the search keyword, the acquisition unit determines a specified website industry attribute and acquires the web pages containing the search keyword from websites having the website industry attribute.

10. The device according to claim 8, wherein the calculation unit uses a clustering algorithm to calculate the density of each target word segment in each web page.

11. The device according to claim 8, wherein, when calculating the relevance between any target word segment and the search keyword based on the density of the target word segment in each web page, the calculation unit adopts the following formula:

Relevance = X1*w1 + X2*w2 + ...... + Xn*wn

wherein X1, X2, ...... Xn are the densities of the target word segment in each web page, and w1, w2, ...... wn are the weights of the web pages.

12. The device according to claim 11, wherein, when determining the weight of any web page, the calculation unit refers to one or any combination of website authority index, website specialty index, user click rate and target word segment occurrence count.

13. The device according to any one of claims 8-12, wherein the presenting unit saves the obtained target word segments as the retrieval associated words of the search keyword, and when the user next enters the search keyword, directly obtains the corresponding retrieval associated words and presents them to the user.

14. The apparatus of claim 13, further comprising:

An update unit, configured to update the search related words corresponding to the search keywords according to a set period.