CN107122421A - Information retrieval method and device

Info

Publication number: CN107122421A
Application number: CN201710217499.5A
Authority: CN
Inventors: 杨硕; 邹磊
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2017-04-05
Filing date: 2017-04-05
Publication date: 2017-09-01

Abstract

This application discloses an information retrieval method and device, belonging to the Internet field, for improving the accuracy of the results retrieved for a user's problem to be solved. The method comprises: receiving an input problem to be solved; determining the technical field to which the problem to be solved belongs; determining, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, each knowledge object being selected from a part of a question object; and returning the target document. The application is used for answering problems to be solved.

Description

Information retrieval method and device

Technical field

The present application relates to the Internet field, and in particular to an information retrieval method and device.

Background

With the rapid development of the Internet, users are increasingly inclined to obtain answers by asking questions online. After receiving a user's question, a search engine searches on one or more keywords appearing in the question and returns results that match those keywords.

However, understanding a human question is very difficult for a machine. Results obtained in this way may well not be what the user was asking for, so retrieval accuracy is low.

Summary of the invention

The embodiments of the present application provide an information retrieval method and device to improve the accuracy of the results retrieved for a user's problem to be solved. The technical solution is as follows:

In one aspect, an information retrieval method is provided, the method comprising:

receiving an input problem to be solved;

determining the technical field to which the problem to be solved belongs;

determining, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, the knowledge objects being selected from a part of the question objects; and

returning the target document.

In another aspect, an information retrieval device is provided, the information retrieval device comprising:

an interface module, configured to receive an input problem to be solved; and

a processing module, configured to determine the technical field to which the problem to be solved belongs;

the processing module being further configured to determine, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, the knowledge objects being selected from a part of the question objects;

the interface module being further configured to return the target document.

The beneficial effects of the technical solutions provided by the embodiments of the present application include:

When retrieving on the basis of a user's problem to be solved (that is, the user's question), not only the one or more keywords in the question but also the technical field of the question is taken into account. By considering the technical field of the problem to be solved and using a pre-built, field-specific knowledge base, the accuracy of the results retrieved for the user's problem can be greatly improved.

Brief description of the drawings

FIG. 1 is a schematic diagram of a four-layer knowledge graph in a specific technical field provided by an embodiment of the present application;

FIG. 2 is an exemplary diagram of the relationships among question nodes, knowledge nodes and document nodes provided by an embodiment of the present application;

FIG. 3 is a flowchart of an example information retrieval method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an example information retrieval method provided by an embodiment of the present application;

FIG. 5 is a diagram of the relationships between nodes, showing the random walk probabilities between nodes, provided by an embodiment of the present application;

FIG. 6 is a structural block diagram of an example information retrieval device provided by an embodiment of the present application.

Detailed description

To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings. The "electronic device" referred to herein may include a smartphone, a tablet computer, a smart TV, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, and so on. The "information retrieval device" referred to herein may be one or more servers or the like.

Related information retrieval methods consider only the keywords appearing in a question, and so often fail to understand the user's intent. To understand a question, humans usually draw on their basic knowledge of the technical field. Take the question "When the user tries to send some special forms in the outbox, the program hangs in a waiting state". We first notice "special forms" and "outbox", which are components of the product Outlook, and can therefore infer that the problem arises in Outlook.

This analysis shows that background knowledge of the technical field plays an important role in understanding a question. This application builds a knowledge base for a specific technical field to help a machine understand user questions.

The information retrieval method in this application is based on a pre-built knowledge base. The knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects. The question objects may be individual problems to be solved entered by users, the knowledge objects may be selected from parts of those problems, and the document objects may be individual documents that solve them.
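The structure just described can be sketched as a small in-memory container. This is a minimal illustration only: the class name `KnowledgeBase`, its field names and the `add_question` helper are assumptions made for this sketch, not names from the application, and the knowledge phrases for the sample question are picked by hand.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical container for the knowledge base of one technical field."""
    questions: dict = field(default_factory=dict)               # question id -> text
    knowledge: set = field(default_factory=set)                 # knowledge objects (phrases)
    documents: dict = field(default_factory=dict)               # document id -> title
    question_to_knowledge: dict = field(default_factory=dict)   # question id -> phrases
    knowledge_to_documents: dict = field(default_factory=dict)  # phrase -> document ids

    def add_question(self, qid, text, phrases, doc_id, doc_title):
        """Register a question, its knowledge objects, and the document that solves it."""
        self.questions[qid] = text
        self.documents[doc_id] = doc_title
        for phrase in phrases:
            # A knowledge object is selected from a part of the question itself.
            if phrase.lower() not in text.lower():
                raise ValueError(f"{phrase!r} does not occur in the question")
            self.knowledge.add(phrase)
            self.question_to_knowledge.setdefault(qid, set()).add(phrase)
            self.knowledge_to_documents.setdefault(phrase, set()).add(doc_id)


kb = KnowledgeBase()
kb.add_question(
    "q",
    "some specific Custom forms get stuck in Outbox when users send it",
    ["Custom forms", "Outbox", "stuck"],  # hand-picked for illustration
    "d1",
    "Description of 2007 Microsoft Office Suite SP2",
)
```

Keeping the two correspondences as separate mappings mirrors the two relations named in the text: question to knowledge, and knowledge to document.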

To make the knowledge base easier to understand, the following description presents its parts and their relationships in the form of a knowledge graph.

A technical question usually consists of three parts: a product, a component and event words. In general, the knowledge graph in this application may contain four layers: a concept layer, a product layer, a component layer and an event layer, where:

Concept layer: each node represents a concept. A concept denotes a group of products with similar functions, and a concept is often also a sub-concept of another concept.

Product layer: the product layer contains all products and their attributes. It is the core of the whole knowledge graph; each of its nodes represents a specific product or a product attribute. Several product attributes can be predefined, such as version, language and runtime environment.

Component layer: a technical question is generally about some component of a product. The component layer contains the components of all products.

Event layer: once the product or component has been identified, the specific symptom of the problem still has to be understood. The event layer contains the nouns, verbs, adjectives and so on that describe problem symptoms.

An example of a knowledge graph is shown in FIG. 1, where dotted lines separate the four layers from top to bottom: concept layer, product layer, component layer and event layer.

This application builds the knowledge graph from a technical corpus. The construction method is described below.

Concept layer and product layer

Concepts and products are extracted from product information. In total, for example, 6,052 products belonging to 214 different categories were obtained. Predefined rules are used to extract product attributes; for example, "Office ProWin32IT" indicates that the product name is Office, the version is Pro, the language is Italian, and the product is installed on a 32-bit Windows operating system.
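The predefined rules are not listed in the text. As a hedged sketch, one such rule could be a regular expression that splits a label of the shape seen in the "Office ProWin32IT" example. The pattern, the `LANGUAGES` table and the function name `parse_product` are all assumptions made for illustration.

```python
import re

# Hypothetical rule: "<name> <version>Win<32|64><two-letter language code>".
PRODUCT_RULE = re.compile(
    r"^(?P<name>\w+) (?P<version>[A-Za-z]+?)Win(?P<bits>32|64)(?P<lang>[A-Z]{2})$"
)

# Illustrative subset of language codes, not taken from the application.
LANGUAGES = {"IT": "Italian", "EN": "English"}


def parse_product(label):
    """Return the product attributes encoded in a label, or None if no rule matches."""
    match = PRODUCT_RULE.match(label)
    if match is None:
        return None
    return {
        "name": match["name"],
        "version": match["version"],
        "language": LANGUAGES.get(match["lang"], match["lang"]),
        "platform": f"{match['bits']}-bit Windows",
    }


attributes = parse_product("Office ProWin32IT")
```

A real rule set would need one pattern per naming scheme found in the product data; a label that matches no rule is left unparsed rather than guessed.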

Component layer

Components are extracted from the technical corpus and from users' question logs. First, sequence labeling methods are used to identify the components mentioned in the corpus. The extracted phrases become nodes of the component layer. Whether a component belongs to a product is measured by the PMI (pointwise mutual information) value of the product and the component. PMI is a common measure of how strongly two phrases are associated: if the PMI value of a component c and a product p exceeds a threshold, c is considered a component of p. PMI is defined as follows:

where #(c) denotes the number of occurrences of c, #(p) denotes the number of occurrences of p, and #(p, c) denotes the number of co-occurrences of p and c.
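The formula itself is not reproduced in the text. As a hedged sketch, the snippet below uses the standard definition of pointwise mutual information, PMI(p, c) = log(P(p, c) / (P(p) · P(c))), with each probability estimated as a count divided by a total number of observations. The `total` argument, the natural logarithm and the threshold value are assumptions of this sketch.

```python
import math


def pmi(count_p, count_c, count_pc, total):
    """Standard PMI from occurrence counts #(p), #(c) and co-occurrence count #(p, c)."""
    if count_pc == 0:
        return float("-inf")  # never seen together
    p_p = count_p / total
    p_c = count_c / total
    p_pc = count_pc / total
    return math.log(p_pc / (p_p * p_c))


def is_component(count_p, count_c, count_pc, total, threshold=1.0):
    """Treat c as a component of p when their PMI exceeds the (assumed) threshold."""
    return pmi(count_p, count_c, count_pc, total) > threshold


# Made-up counts: p occurs 100 times, c 50 times, both together 40 times, in 1000 texts.
example_pmi = pmi(100, 50, 40, 1000)
```

A value above zero means the pair co-occurs more often than independence would predict; the threshold decides how much more is required.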

Event layer

The event layer has two kinds of edges, EventWordOf and RelatedTo, which are discussed in turn. An EventWordOf edge connects a product and an action word; such relations are extracted with PMI, in the same way as for the component layer. Users generally describe the symptoms of a problem with verbs, adjectives, adverbs, nouns and so on. Given a large technical corpus, mature part-of-speech tagging (POS tagging) methods are first used to label the part of speech of each word. In addition, it is assumed that two technical questions that can be solved by the same document are semantically very similar. For example, suppose document d solves the following three technical questions:

q2: Outlook 2007 gets frozen.

q9: Outlook sending status remains for hours.

q15: Emails get stuck in outbox.

We can therefore conclude that the words "frozen", "remain" and "stuck" are semantically similar, so the event nodes corresponding to these three words are connected by RelatedTo edges.
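The RelatedTo construction can be sketched as grouping event words by the document that solves their questions and linking every pair inside a group. The function name and the input shapes below are assumptions; the event words per question are given by hand rather than produced by a tagger.

```python
from itertools import combinations


def related_to_edges(solved_by, event_words):
    """Link event words whose questions are solved by the same document.

    solved_by:   question id -> document id
    event_words: question id -> list of event words found in that question
    Returns a set of undirected edges, each stored as a sorted pair.
    """
    words_by_document = {}
    for qid, doc in solved_by.items():
        words_by_document.setdefault(doc, set()).update(event_words.get(qid, []))

    edges = set()
    for words in words_by_document.values():
        edges.update(combinations(sorted(words), 2))
    return edges


solved_by = {"q2": "d", "q9": "d", "q15": "d"}
event_words = {"q2": ["frozen"], "q9": ["remain"], "q15": ["stuck"]}
edges = related_to_edges(solved_by, event_words)
```

For the three questions above this yields the pairs (frozen, remain), (frozen, stuck) and (remain, stuck).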

To return target documents associated with the user's problem to be solved, this application associates knowledge objects with document objects. The document objects can be obtained from technical question logs collected on the Internet.

An example of the connections among question objects, knowledge objects and document objects is shown in FIG. 2. Each node in FIG. 2 represents one object: a question object, a knowledge object or a document object. The problem to be solved q in FIG. 2 is "some specific Custom forms get stuck in Outbox when users send it". Document d1 is "Description of 2007 Microsoft Office Suite SP2".

FIG. 2 contains three types of edges: edges from a question node to a knowledge node, edges between two knowledge nodes, and edges from a knowledge node to a document node. For a given question node, the edges from that node to its knowledge nodes all have the same weight. The weight of an edge between two knowledge nodes can be expressed as a conditional probability; that is, the weight from node x to node y is the probability that y occurs given that x occurs:

w(x → y) = P(y | x) = #(x, y) / #(x)

where #(x, y) denotes the number of co-occurrences of x and y, and #(x) denotes the number of occurrences of x.
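A minimal sketch of this weight, assuming that occurrences and co-occurrences are counted per question (the counting unit is not stated in the text) and that the weight is the co-occurrence count divided by the occurrence count of the source node. The function name and sample data are illustrative.

```python
from collections import Counter
from itertools import permutations


def knowledge_edge_weights(questions):
    """Return {(x, y): P(y | x)} for knowledge nodes that co-occur in a question.

    questions: iterable of collections of knowledge nodes, one per question.
    """
    occurrences = Counter()
    co_occurrences = Counter()
    for nodes in questions:
        nodes = set(nodes)
        occurrences.update(nodes)
        # Ordered pairs, because the weight from x to y differs from y to x.
        co_occurrences.update(permutations(sorted(nodes), 2))
    return {(x, y): n / occurrences[x] for (x, y), n in co_occurrences.items()}


weights = knowledge_edge_weights([
    {"Outlook", "outbox"},
    {"Outlook", "stuck"},
    {"Outlook", "outbox", "stuck"},
])
```

The weights are asymmetric: here "outbox" always appears with "Outlook", while "Outlook" appears with "outbox" in only two of its three questions.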

The weight of an edge from a knowledge node x to a document node d can be expressed as:

w(x → d) = |{q ∈ QL(d) : x ∈ q}| / |QL(d)|

where the numerator is the number of questions that can be solved by d and contain x, the denominator is the number of all questions that can be solved by d, and QL(d) denotes the set of all questions that can be solved by d.
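This ratio can be computed directly from a question log. The sketch below assumes the log is a list of (knowledge nodes of the question, solving document) pairs; that representation and the function name are assumptions.

```python
def knowledge_document_weight(x, d, question_log):
    """Fraction of the questions solved by document d that contain knowledge node x."""
    solved_by_d = [nodes for nodes, doc in question_log if doc == d]  # QL(d)
    if not solved_by_d:
        return 0.0
    containing_x = sum(1 for nodes in solved_by_d if x in nodes)
    return containing_x / len(solved_by_d)


question_log = [
    ({"Outlook", "outbox"}, "d1"),
    ({"Outlook", "frozen"}, "d1"),
    ({"Word"}, "d2"),
]
weight_outbox_d1 = knowledge_document_weight("outbox", "d1", question_log)
```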

Once the knowledge base has been built, information can be retrieved for questions entered by users.

Referring to FIG. 3, an embodiment of the present invention provides an information retrieval method, the method comprising:

Step 31: receive an input problem to be solved.

The input problem to be solved may be one that the user enters through an electronic device.

Step 32: determine the technical field to which the problem to be solved belongs.

In the embodiments of the present application, the technical field can be determined from one or more keywords in the problem to be solved.

Step 33: according to a pre-established knowledge base for the technical field, determine a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, the knowledge objects being selected from a part of the question objects.

Step 34: return the target document.
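Step 32 states only that the field is determined from keywords. One way this could look, as a sketch under the assumption that each field has a hand-made keyword list and the field with the most keyword hits wins (both are assumptions, as are the names used):

```python
def detect_field(question, field_keywords):
    """Pick the technical field whose keywords occur most often in the question.

    field_keywords: field name -> list of keywords (hypothetical, hand-made lists).
    Returns None when no keyword of any field occurs.
    """
    text = question.lower()
    hits = {
        name: sum(keyword.lower() in text for keyword in keywords)
        for name, keywords in field_keywords.items()
    }
    best = max(hits, key=hits.get, default=None)
    if best is None or hits[best] == 0:
        return None
    return best


field_keywords = {
    "Outlook": ["outbox", "inbox", "email"],
    "Excel": ["spreadsheet", "cell", "formula"],
}
detected = detect_field("Emails get stuck in outbox", field_keywords)
```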

In this application, the target document that matches the problem to be solved may be, for example, a document that solves the problem, a document that contains the problem, or a document that contains one or more keywords of the problem.

In this application, returning the target document in step 34 may include returning the name of the target document and/or the content of the target document.

When retrieving on the basis of a user's problem to be solved (that is, the user's question), the embodiments of the present application take into account not only the one or more keywords in the question but also its technical field. By considering the technical field of the problem to be solved and using a pre-built, field-specific knowledge base, the accuracy of the results retrieved for the user's problem can be greatly improved.

In the embodiments of the present application, determining in step 33 the target document in the technical field that matches the problem to be solved may include:

determining questions in the technical field that are similar to the problem to be solved, according to the question objects, the knowledge objects and the correspondences between them in the knowledge base;

determining a similarity score between each similar question and the problem to be solved; and

determining the target document that matches the problem to be solved, based on the similarity scores and the target document corresponding to each similar question.

Note that, given the similarity scores and the target document of each similar question, the embodiments may simply select the target document of the similar question with the highest similarity score as the target document for the problem to be solved. This returns a result to the user as quickly as possible and suits scenarios in which the user has very high speed requirements.
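The direct selection described in the last paragraph can be sketched as follows. The passage does not say how the similarity score between two questions is computed, so the sketch substitutes Jaccard overlap of their knowledge objects as a stand-in; that choice, the function names and the sample log are assumptions.

```python
def jaccard(a, b):
    """Stand-in similarity: shared knowledge objects over all knowledge objects."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_document(query_nodes, question_log):
    """Return the document of the logged question most similar to the query.

    question_log: list of (knowledge nodes of a logged question, its target document).
    Returns None when no logged question shares anything with the query.
    """
    scored = [(jaccard(query_nodes, nodes), doc) for nodes, doc in question_log]
    scored = [(score, doc) for score, doc in scored if score > 0]
    if not scored:
        return None
    return max(scored, key=lambda item: item[0])[1]


question_log = [
    ({"Outlook", "outbox", "stuck"}, "d1"),
    ({"Outlook", "calendar"}, "d5"),
]
answer = best_document({"outbox", "stuck", "forms"}, question_log)
```

Sorting `scored` in descending order instead of taking the maximum would give a ranked list of candidate documents rather than a single answer.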

Alternatively, the target document corresponding to each similar question can be treated as a candidate document. In that case, determining the target document that matches the problem to be solved, based on the similarity scores and the target document corresponding to each similar question, may include:

determining the similarity between the problem to be solved and each candidate document based on the similarity scores; and

selecting one or more candidate documents as target documents matching the problem to be solved, in descending order of their similarity to the problem to be solved;

wherein the similarity between the problem to be solved and each candidate document is determined in the following manner:

where q denotes the problem to be solved, d denotes a candidate document, score(q, d) denotes the similarity between q and d, #(d, C) denotes the total number of times d appears in C, #(d, C₀) denotes the number of times d appears in C₀, (q′_i, d) ∈ C₀ means that d solves question q′_i in C₀, and score(q′_i, q) denotes the similarity score between q′_i and q; C₀ denotes a subset of the question log C, q′ denotes a question similar to q, and

C₀ = {(q′₀, d′₀), (q′₁, d′₁), ..., (q′_m, d′_m)},

where q′_i denotes the i-th question similar to q, m denotes the total number of questions similar to q, and d′ denotes the target document corresponding to q′.

FIG. 4 shows similar questions that belong to the same technical field as the problem to be solved, together with the target document of each similar question. When determining the target document in the technical field that matches the problem to be solved, if question q000 in FIG. 4 is the similar question with the highest similarity score, document d1, which corresponds to q000, can be taken as the target document for the problem to be solved. Alternatively, both d5 and d1 (merely as an example) can be taken as target documents, with d1 ranked before d5 in the returned results.

Optionally, the order of the returned results can also be determined from the similarity between the problem to be solved and each candidate document. Accordingly, after the target documents have been determined in step 33, the information retrieval method provided by the embodiments may further include: calculating, based on a random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base; and re-ranking the multiple target documents based on those similarities.

After the target documents have been re-ranked, they can be returned in the re-ranked order.

In the embodiments of the present application, calculating the similarity between the problem to be solved and each document object in the knowledge base based on the random walk algorithm includes: selecting one or more nodes between the problem to be solved and the document objects and setting an index on them, where the index of a node records the similarity of that node to each document object in the knowledge base; and calculating the similarity between the problem to be solved and each document object in the knowledge base based on the indexes set on those nodes.

One way to select the nodes to index is to index the frequent nodes on the paths, where a frequent node is a node whose in-degree multiplied by its out-degree exceeds a threshold.

The random walk algorithm is a way of measuring the similarity of nodes. In general, if a walk starts at one node and moves at random to other nodes according to the probability on each edge, the probability of reaching another node is the similarity between the starting node and that node. The random walk similarity can be calculated as follows:

s(x, y) = Σ_{x′ ∈ N(x)} T(x, x′) · s(x′, y)

where s(x, y) is the random walk similarity between node x and node y, N(x) denotes all nodes connected to x, and T(x, x′) denotes the probability of walking from node x to node x′.
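As a hedged sketch, this similarity can be evaluated by repeatedly applying the rule that the similarity of a node to a target is the transition-weighted sum of its neighbours' similarities. The boundary conditions used here (the target document has similarity 1 to itself, and any other node without outgoing edges has similarity 0) are assumptions, as are the function name, the fixed iteration count and the toy graph.

```python
def walk_similarity(transitions, target, iterations=100):
    """Similarity of every node to `target` under a random walk.

    transitions: node -> {neighbour: transition probability}.
    Nodes without outgoing edges (such as documents) are treated as end points.
    """
    nodes = set(transitions) | {n for nbrs in transitions.values() for n in nbrs}
    nodes.add(target)
    similarity = {node: 0.0 for node in nodes}
    similarity[target] = 1.0
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            if node == target:
                updated[node] = 1.0  # assumed boundary condition
            else:
                updated[node] = sum(
                    p * similarity[nbr]
                    for nbr, p in transitions.get(node, {}).items()
                )
        similarity = updated
    return similarity


# Toy graph with made-up probabilities: question q, knowledge nodes v1 and v2.
transitions = {
    "q": {"v1": 0.5, "v2": 0.5},
    "v1": {"d1": 1.0},
    "v2": {"d1": 0.4, "d2": 0.6},
}
to_d1 = walk_similarity(transitions, "d1")
```

Here the walk from q reaches d1 either through v1 with probability 0.5 or through v2 with probability 0.5 × 0.4, so the similarity of q to d1 is 0.7.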

In this application, normalized weights are used as transition probabilities. When similarity is calculated with the random walk algorithm, only the edges from knowledge nodes to document nodes are kept, and the edges from question nodes to knowledge nodes are handled in the same way.
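Normalization can be sketched as dividing each outgoing edge weight of a node by the sum of that node's outgoing weights, so that they sum to 1. The text does not spell out the normalization, so this per-node scheme, the function name and the sample weights are assumptions.

```python
def normalize(weights):
    """Turn raw edge weights into transition probabilities.

    weights: node -> {neighbour: raw weight}.
    Nodes whose outgoing weights sum to zero are dropped.
    """
    transitions = {}
    for node, neighbours in weights.items():
        total = sum(neighbours.values())
        if total > 0:
            transitions[node] = {n: w / total for n, w in neighbours.items()}
    return transitions


probabilities = normalize({
    "q": {"v1": 1, "v2": 1},         # equal weights from a question node
    "v1": {"d1": 0.5, "d2": 1.5},
})
```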

The random walk similarity between a user question node q and a document node d can be calculated in different ways. One way is sampling. Starting from the user question node q, the walk moves at random to an adjacent node, using the edge weights as transition probabilities. If N walks are sampled and r of them end at document node d, the similarity between q and d is r/N. Experiments show that roughly 4 million samples are needed before the node similarities converge, so sampling is too slow for online queries, where the system must respond in real time. Another way is to set up a system of linear equations from the definition of random walk similarity and solve it. In FIG. 5, the line between two nodes is labeled with the probability of walking from one node to the adjacent node. The system of linear equations that follows from the values shown in FIG. 5 is:

However, solving a system of linear equations is expensive: Gaussian elimination takes O(n³) time, where n is the number of unknowns. The knowledge graph built in this application has a very large number of nodes, so to speed up the calculation, indexes can be built on some nodes in advance. The index of a node is a sequence of floating-point numbers giving the similarity of that node to every document. For example, the index of node x has the form:

Index(x) = {s(x, d₀), s(x, d₁), ..., s(x, d_m)}

where m is the number of documents. A node on which an index has been built therefore gives its similarity to every document directly.

For example, if indexes are built in advance on nodes v₅, v₈ and v₁₀, then s(v₅, d₁) = 0.701, s(v₈, d₁) = 0.668 and s(v₁₀, d₁) = 0.642 can be read off directly, and the system of linear equations above simplifies to:

As this example shows, building indexes on some nodes in advance greatly reduces the number of unknowns in the equations (from 11 to 3).
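The effect of indexing can be sketched as a recursion that stops as soon as it reaches an indexed node and reads the stored value instead of expanding further. The three index values below are the ones quoted in the text; the transition probabilities from q are made up, because the values of FIG. 5 are not reproduced here. The sketch also assumes an acyclic graph.

```python
def walk_similarity_indexed(transitions, start, target, index):
    """Random walk similarity of `start` to `target`, using precomputed indexes.

    transitions: node -> {neighbour: transition probability} (assumed acyclic).
    index:       node -> {document: precomputed similarity}.
    """
    def similarity(node):
        if node == target:
            return 1.0
        if node in index:
            # Indexed node: look the value up rather than expanding its neighbours.
            return index[node].get(target, 0.0)
        return sum(
            p * similarity(nbr) for nbr, p in transitions.get(node, {}).items()
        )

    return similarity(start)


index = {
    "v5": {"d1": 0.701},
    "v8": {"d1": 0.668},
    "v10": {"d1": 0.642},
}
# Hypothetical transition probabilities out of the question node.
transitions = {"q": {"v5": 0.5, "v8": 0.3, "v10": 0.2}}
q_to_d1 = walk_similarity_indexed(transitions, "q", "d1", index)
```

With these made-up probabilities the result is 0.5 × 0.701 + 0.3 × 0.668 + 0.2 × 0.642 = 0.6793, obtained without visiting anything beyond the three indexed nodes.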

The embodiments of the present application propose a greedy algorithm for selecting the nodes to materialize (index). The algorithm repeatedly selects nodes that appear on many paths, because such frequent nodes are more likely to cover more paths. In-degree × out-degree is used as the measure of frequency: the larger the product, the more frequent the node. In each round, the greedy algorithm picks the node with the highest frequency, adds it to the set of indexed nodes, and then recomputes the frequency of the remaining nodes, until all materialized nodes have been obtained.
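A minimal sketch of such a greedy selection follows. The text does not say how frequencies are recomputed or when the loop stops, so two assumptions are made here: the edges of a selected node are removed before degrees are recounted, and selection stops once no remaining node has a product above the threshold. The function name and toy graph are illustrative.

```python
def select_index_nodes(edges, threshold):
    """Greedily pick nodes to index by in-degree x out-degree.

    edges: set of directed (source, destination) pairs.
    Returns the selected nodes in the order they were picked.
    """
    remaining = set(edges)
    selected = []
    while True:
        in_degree, out_degree = {}, {}
        for source, destination in remaining:
            out_degree[source] = out_degree.get(source, 0) + 1
            in_degree[destination] = in_degree.get(destination, 0) + 1
        # Only nodes with both incoming and outgoing edges have a non-zero product.
        frequency = {
            node: in_degree[node] * out_degree[node]
            for node in in_degree.keys() & out_degree.keys()
        }
        if not frequency:
            break
        best = max(sorted(frequency), key=frequency.get)  # sorted: deterministic ties
        if frequency[best] <= threshold:
            break
        selected.append(best)
        # Assumed recomputation rule: drop the edges of the node just selected.
        remaining = {edge for edge in remaining if best not in edge}
    return selected


edges = {
    ("q1", "a"), ("q2", "a"), ("a", "d1"), ("a", "d2"),
    ("q1", "b"), ("b", "d1"),
}
chosen = select_index_nodes(edges, threshold=1)
```

In the toy graph, node a has in-degree 2 and out-degree 2 (product 4) and is selected; node b has product 1, which does not exceed the threshold, so it is left unindexed.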

本申请实施例提供的基于索引的计算相似度的方式，可以极大地降低计算量，提高计算效率。同时，基于频繁度来选择设置索引的节点，可以选择部分节点设置索引，而不用对所有节点设置索引，进一步降低了计算量。The index-based similarity calculation method provided by the embodiment of the present application can greatly reduce the calculation amount and improve the calculation efficiency. At the same time, by selecting the nodes for indexing based on frequency, some nodes can be selected for indexing instead of indexing for all nodes, which further reduces the amount of calculation.

FIG. 6 is a structural block diagram of an information retrieval device provided by an embodiment of the present application. Referring to FIG. 6, the information retrieval device 600 provided by this embodiment includes an interface module 601 and a processing module 602, wherein:

the interface module 601 is configured to receive an input problem to be solved;

the processing module 602 is configured to determine the technical field to which the problem to be solved belongs;

the processing module 602 is further configured to determine, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, each knowledge object being selected from a part of a question object;

the interface module 601 is further configured to return the target document.

When retrieving on the basis of a user's problem to be solved (that is, the user's question), the information retrieval device provided by this embodiment of the present application considers not only the one or more keywords in the question but also the technical field of the question. By taking the technical field of the problem into account and using a pre-built, field-specific knowledge base, the accuracy of the results retrieved for the user's problem can be greatly improved.

Optionally, the target document that matches the problem to be solved is a target document that solves the problem to be solved.

The interface module is specifically configured to return the name of the target document and/or return content from the target document.
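The division into an interface module and a processing module, together with the knowledge base structure described above, can be sketched as follows. This is an illustrative sketch only: the class names, the whitespace-based matching of knowledge objects against the question, and the `classify_domain` callable are all assumptions made for the example, and the application does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # question object -> knowledge objects selected from it (not used by this lookup)
    question_to_knowledge: dict = field(default_factory=dict)
    # knowledge object -> document objects associated with it
    knowledge_to_documents: dict = field(default_factory=dict)

    def documents_for(self, question):
        """Collect documents linked to the knowledge objects found in `question`."""
        words = set(question.lower().split())
        docs = set()
        for knowledge, documents in self.knowledge_to_documents.items():
            if knowledge in words:  # the knowledge object is a part of the question
                docs |= set(documents)
        return sorted(docs)


class ProcessingModule:
    def __init__(self, knowledge_bases, classify_domain):
        self.knowledge_bases = knowledge_bases  # technical field -> KnowledgeBase
        self.classify_domain = classify_domain  # hypothetical field classifier

    def find_target_documents(self, question):
        domain = self.classify_domain(question)
        return self.knowledge_bases[domain].documents_for(question)


class InterfaceModule:
    def __init__(self, processing):
        self.processing = processing

    def ask(self, question):
        """Receive the problem to be solved and return the matching documents."""
        return self.processing.find_target_documents(question)


# Hypothetical data for a single technical field.
kb = KnowledgeBase(
    question_to_knowledge={"how to fix a deadlock": {"deadlock"}},
    knowledge_to_documents={"deadlock": {"locking-guide"}, "index": {"indexing-guide"}},
)
device = InterfaceModule(ProcessingModule({"database": kb}, lambda question: "database"))
answer = device.ask("Why does my transaction hit a deadlock")
print(answer)
```

Keeping one knowledge base per technical field means that the field classification narrows the lookup before any keyword matching takes place.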

Optionally, the processing module 602 is specifically configured to:

Optionally, the target document corresponding to each similar question serves as a candidate document, and the processing module 602 is specifically configured to:

Optionally, after the target document is determined, the processing module 602 is further configured to:

calculate, based on a random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base;

reorder the plurality of target documents based on the similarity between the problem to be solved and each document object in the knowledge base.

Optionally, when calculating, based on the random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base, the processing module 602 is specifically configured to:

select one or more nodes between the problem to be solved and the document objects on which to set an index, wherein the index of a node represents the similarity between that node and each document object in the knowledge base;

calculate, based on the index set for the one or more nodes, the similarity between the problem to be solved and each document object in the knowledge base.

Optionally, when selecting the nodes on which to set an index, the processing module 602 is specifically configured to:

select frequent nodes on a path on which to set the index, wherein a frequent node is a node whose product of in-degree and out-degree is greater than a threshold.

It should be noted that the information retrieval device provided by the above embodiment is illustrated only by the above division into functional modules. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the information retrieval device may be divided into different functional modules to perform all or part of the functions described above. In addition, the information retrieval device and the information retrieval method embodiments provided above belong to the same concept; for the specific implementation process, see the method embodiments, which are not repeated here.

It should also be understood that the interface module 601 and the processing module 602 may be different modules within the same physical device. Depending on the application, the interface module 601 may instead be one or more physical devices distributed at different locations, and the processing module 602 may likewise be one or more physical devices distributed at different locations.

An embodiment of the present invention also provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the memory in the above embodiments, or it may exist independently without being assembled into a terminal. The computer-readable storage medium stores one or more programs, and the one or more programs are used by one or more processors to execute the above information retrieval method.

Unless otherwise defined, the technical or scientific terms used herein have the ordinary meaning understood by a person of ordinary skill in the field to which this application belongs. The words "first", "second", and similar terms used in the specification and claims of this patent application do not indicate any order, quantity, or importance; they are used only to distinguish different components. Likewise, words such as "a" or "one" do not denote a limitation on quantity but indicate that at least one is present. Words such as "connected" or "coupled" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program that instructs the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are merely exemplary embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims

1. An information retrieval method, characterized in that the method comprises:

receiving an input problem to be solved;

determining the technical field to which the problem to be solved belongs;

determining, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, each knowledge object being selected from a part of a question object;

returning the target document.

2. The method according to claim 1, characterized in that the target document that matches the problem to be solved is a target document that solves the problem to be solved;

returning the target document comprises: returning the name of the target document and/or returning content from the target document.

3. The method according to claim 1, characterized in that determining the technical document in the technical field that matches the problem to be solved comprises:

determining questions in the technical field that are similar to the problem to be solved, according to the question objects, the knowledge objects, and the correspondences between the question objects and the knowledge objects in the knowledge base;

determining a similarity score between each similar question and the problem to be solved;

determining the target document that matches the problem to be solved, based on the similarity scores and the target document corresponding to each similar question.

4. The method according to claim 3, characterized in that the target document corresponding to each similar question serves as a candidate document, and determining the target document that matches the problem to be solved, based on the similarity scores and the target document corresponding to each similar question, comprises:

determining, based on the similarity scores, the similarity between the problem to be solved and each of the candidate documents;

selecting one or more candidate documents, in descending order of similarity between the problem to be solved and the candidate documents, as the target document that matches the problem to be solved;

wherein the similarity between the problem to be solved and each of the candidate documents is determined as follows:

$$\mathrm{score}(q,d)=\log\#(d,C)\times\sum_{(q'_i,d)\in C_0}\frac{\#(d,C_0)}{\#(d,C)\times i}\times\mathrm{score}(q'_i,q);$$

where q denotes the problem to be solved, d denotes a candidate document, score(q, d) denotes the similarity between the problem to be solved q and the candidate document d, #(d, C) denotes the total number of times d occurs in C, #(d, C₀) denotes the number of times d occurs in C₀, (q'_i, d) ∈ C₀ denotes that d can solve the question q'_i in C₀, and score(q'_i, q) denotes the similarity score between q'_i and q; C₀ denotes a subset of the question log C, q' denotes a question similar to the problem to be solved q, and

C₀ = {(q'₀, d'₀), (q'₁, d'₁), ..., (q'_m, d'_m)}, where q'_i denotes the i-th question similar to q, m denotes the total number of questions similar to q, and d' denotes the target document corresponding to q'.
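The scoring formula of claim 4 can be illustrated with a short sketch. Several points are assumptions made only for this example: the function name `score_documents` is hypothetical, the logarithm is taken as the natural logarithm, and the rank i is counted from 1 so that the division by i is defined. The similar questions are assumed to arrive already ranked by decreasing score(q'_i, q).

```python
import math
from collections import Counter


def score_documents(question_log, similar):
    """Score candidate documents for a problem to be solved.

    question_log : list of (question, document) pairs, the log C
    similar      : list of (question, document, similarity) triples, the subset C0,
                   ranked by decreasing similarity to the problem to be solved
    """
    count_in_log = Counter(document for _, document in question_log)     # #(d, C)
    count_in_subset = Counter(document for _, document, _ in similar)    # #(d, C0)

    sums = {}
    for rank, (_, document, similarity) in enumerate(similar, start=1):
        weight = count_in_subset[document] / (count_in_log[document] * rank)
        sums[document] = sums.get(document, 0.0) + weight * similarity

    # Multiply each sum by log #(d, C), as in the formula.
    return {d: math.log(count_in_log[d]) * total for d, total in sums.items()}


# Hypothetical question log: d1 answers two logged questions, d2 answers three.
question_log = [
    ("qa", "d1"), ("qb", "d1"),
    ("qc", "d2"), ("qd", "d2"), ("qe", "d2"),
]
# Two similar questions, ranked by their similarity to the problem to be solved.
similar = [("qa", "d1", 0.9), ("qc", "d2", 0.8)]

scores = score_documents(question_log, similar)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Here d1 ranks ahead of d2: its supporting question is both more similar and ranked first, which outweighs the larger log factor of d2. Under this formula, a document that occurs only once in the log receives a score of zero because log 1 = 0.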

5. The method according to any one of claims 1-4, characterized in that, after the target document is determined, the method further comprises:

calculating, based on a random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base;

reordering the plurality of target documents based on the similarity between the problem to be solved and each document object in the knowledge base.

6. The method according to claim 5, characterized in that calculating, based on the random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base comprises:

selecting one or more nodes between the problem to be solved and the document objects on which to set an index, wherein the index of a node represents the similarity between that node and each document object in the knowledge base;

calculating, based on the index set for the one or more nodes, the similarity between the problem to be solved and each document object in the knowledge base.

7. The method according to claim 6, characterized in that selecting the nodes on which to set an index comprises:

selecting frequent nodes on a path on which to set the index, wherein a frequent node is a node whose product of in-degree and out-degree is greater than a threshold.

8. An information retrieval device, characterized in that the information retrieval device comprises:

an interface module, configured to receive an input problem to be solved;

a processing module, configured to determine the technical field to which the problem to be solved belongs;

the processing module being further configured to determine, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes question objects, knowledge objects, document objects, correspondences between the question objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, each knowledge object being selected from a part of a question object;

the interface module being further configured to return the target document.

9. The information retrieval device according to claim 8, characterized in that the target document that matches the problem to be solved is a target document that solves the problem to be solved;

the interface module is specifically configured to: return the name of the target document and/or return content from the target document.

10. The information retrieval device according to claim 8, characterized in that the processing module is specifically configured to:

11. The information retrieval device according to claim 10, characterized in that the target document corresponding to each similar question serves as a candidate document, and the processing module is specifically configured to:

12. The information retrieval device according to any one of claims 8-11, characterized in that, after the target document is determined, the processing module is further configured to:

13. The information retrieval device according to claim 12, characterized in that, when calculating, based on a random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base, the processing module is specifically configured to:

14. The information retrieval device according to claim 13, characterized in that, when selecting the nodes on which to set an index, the processing module is specifically configured to: