CN102364469B - Method and device for ranking example sentence retrieval results
- Publication number
- CN102364469B (application number CN201110303380.2A)
- Authority
- CN
- China
- Prior art keywords
- collocation
- query word
- sentence
- words
- example sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a method and device for ranking example sentence retrieval results. The method includes: A. obtaining the user's query words; B. retrieving, from a sentence database, matching example sentences that contain the query words; C. calculating the collocation strength between each matching example sentence and the query words, where the collocation strength is determined by the collocation probabilities among the query words and the collocation probabilities between each query word and the other words in the matching example sentence; D. ranking the matching example sentences by their collocation strength with the query words. In this way, the user's language-learning goals and needs are better met, the user's browsing efficiency is improved, and the number of additional responses the system must make to satisfy the user is reduced.
Description
【Technical Field】
The present invention relates to the technical field of natural language processing, and in particular to a method and device for ranking example sentence retrieval results.
【Background Art】
With the continued development of computer and Internet technology, language learners can now use the computing power of computers to obtain the information they need. An example sentence retrieval system is one such tool: it searches a large-scale sentence database for example sentences that match the user's input and thereby helps the user learn the correct usage of the language in question.
However, when ranking retrieval results, existing example sentence retrieval systems do not consider the relationship between the user's query words and their context within a specific example sentence. As a result, the example sentences ranked at the top of the results are often not the ones the user actually wants.
For example, for the query words “提高” (improve) + “效率” (efficiency) entered by the user, the following two matching example sentences are obtained:
1. 从某种意义上说,生产力的提高可以实现更高的效率。(In a sense, an improvement in productivity can achieve higher efficiency.)
2. 这篇文章详细的解释了如何提高大规模检索系统的效率。(This article explains in detail how to improve the efficiency of large-scale retrieval systems.)
Generally, when a user enters several query words, those words are related, and the user wants to see how they are used together in an example sentence. In example sentence 2, “提高” and “效率” form a collocation and are closely connected. In example sentence 1, “提高” actually collocates with “生产力” (productivity), and its connection to “效率” is weak. Example sentence 2 is clearly what the user wants. Because the prior art cannot distinguish these two cases when ranking example sentence retrieval results, results that are not sufficiently relevant to the user's needs are ranked near the top, which lowers the user's browsing efficiency and increases the number of system responses.
【Summary of the Invention】
The technical problem to be solved by the present invention is to provide a method and device for ranking example sentence retrieval results, so as to overcome the defects of existing example sentence retrieval systems, which reduce the user's browsing efficiency and increase the number of system responses.
The technical solution adopted by the present invention is a method for ranking example sentence retrieval results, including: A. obtaining the user's query words; B. retrieving, from a sentence database, matching example sentences that contain the query words; C. calculating the collocation strength between each matching example sentence and the query words, where the collocation strength is determined by the collocation probabilities among the query words and the collocation probabilities between each query word and the other words in the matching example sentence, and the collocation probability between words is the likelihood that the words form a collocation; D. ranking the matching example sentences by their collocation strength with the query words.
According to a preferred embodiment of the present invention, the sentence database is a monolingual sentence database or a bilingual sentence database.
According to a preferred embodiment of the present invention, the collocation strength between a matching example sentence and the query words equals one of the following: (1) the ratio of the maximum collocation probability among the query words to the maximum collocation probability between the query words and the other words in the matching example sentence; (2) the difference between those two maxima; (3) the ratio of the average collocation probability among the query words to the average collocation probability between the query words and the other words in the matching example sentence; (4) the difference between those two averages; or (5) the ratio of the sum of the collocation probabilities among the query words to the sum of the collocation probabilities between the query words and the other words in the matching example sentence, multiplied by a length correction factor, where the length correction factor is a function of the length of the matching example sentence.
According to a preferred embodiment of the present invention, the method further includes: if the sentence database is a bilingual sentence database, displaying, together with each matching example sentence, the example sentence in the other language that is its translation in the bilingual sentence database.
According to a preferred embodiment of the present invention, the method further includes: when displaying the matching example sentences, determining and displaying the collocation strength level between each matching example sentence and the query words.
The present invention also provides a device for ranking example sentence retrieval results, including: a receiving unit for obtaining the user's query words; a retrieval unit for retrieving, from a sentence database, matching example sentences that contain the query words; a calculation unit for calculating the collocation strength between each matching example sentence and the query words, where the collocation strength is determined by the collocation probabilities among the query words and the collocation probabilities between each query word and the other words in the matching example sentence, and the collocation probability between words is the likelihood that the words form a collocation; and a ranking unit for ranking the matching example sentences by their collocation strength with the query words.
According to a preferred embodiment of the present invention, the sentence database is a monolingual sentence database or a bilingual sentence database.
According to a preferred embodiment of the present invention, the collocation strength between a matching example sentence and the query words equals one of the following: (1) the ratio of the maximum collocation probability among the query words to the maximum collocation probability between the query words and the other words in the matching example sentence; (2) the difference between those two maxima; (3) the ratio of the average collocation probability among the query words to the average collocation probability between the query words and the other words in the matching example sentence; (4) the difference between those two averages; or (5) the ratio of the sum of the collocation probabilities among the query words to the sum of the collocation probabilities between the query words and the other words in the matching example sentence, multiplied by a length correction factor, where the length correction factor is a function of the length of the matching example sentence.
According to a preferred embodiment of the present invention, the device further includes a display unit. If the sentence database is a bilingual sentence database, the display unit displays, together with each matching example sentence, the example sentence in the other language that is its translation in the bilingual sentence database.
According to a preferred embodiment of the present invention, the device further includes a determining unit for determining, when the matching example sentences are displayed, the collocation strength level between each matching example sentence and the query words.
As the above technical solutions show, calculating the collocation strength of the query words in each matching example sentence, and ranking and displaying the matching example sentences by that strength, better meets the user's language-learning goals and needs, improves the user's browsing efficiency, and reduces the number of additional responses the system must make to satisfy the user.
【Brief Description of the Drawings】
Fig. 1 is a schematic flowchart of an embodiment of the method for ranking example sentence retrieval results in the present invention;
Fig. 2 is a schematic diagram of Embodiment 1 of the example sentence retrieval result display interface in the present invention;
Fig. 3 is a schematic diagram of Embodiment 2 of the example sentence retrieval result display interface in the present invention;
Fig. 4 is a schematic structural block diagram of an embodiment of the device for ranking example sentence retrieval results in the present invention.
【Detailed Description】
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
Refer to Fig. 1, a schematic flowchart of an embodiment of the method for ranking example sentence retrieval results in the present invention. As shown in Fig. 1, the method includes:
Step 101: Obtain the user's query words.
Step 102: Retrieve, from the sentence database, the matching example sentences that contain every query word.
Step 103: Calculate the collocation strength between each matching example sentence and the query words.
Step 104: Rank the matching example sentences by their collocation strength with the query words.
These steps are described in detail below.
A language learner usually has a different purpose when querying one word than when querying several. When querying a single word, the user wants example sentences containing that word in order to learn how it is used in a sentence. When querying several words, those words usually collocate with one another in use; the user wants example sentences containing all of them and wants to see how the collocation among them is reflected in the example sentences. The embodiments of the present invention consider only the case of collocation among two or more query words in an example sentence, so the user query words obtained in step 101 are multiple query words.
In step 102, matching example sentences containing the query words are retrieved from an existing sentence database, which may be a monolingual sentence database or a bilingual sentence database. A monolingual sentence database consists of sentences in one language. A bilingual sentence database consists of bilingual sentence pairs, each made up of two sentences in different languages that are translations of each other. The sentence database can be built offline using existing techniques: for example, a monolingual sentence database can be derived from a large-scale corpus in one language, and a bilingual sentence database can be extracted from a large-scale bilingual corpus. If the sentence database is bilingual, the corresponding target-language example sentence is obtained along with each matching source-language example sentence.
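The two kinds of sentence database in step 102 can be illustrated with a minimal Python sketch. The names `Entry` and `retrieve` are hypothetical, and the sketch assumes that sentences are already segmented into words; the patent does not prescribe a storage layout or a segmentation method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    tokens: List[str]                  # source-language sentence, already segmented (assumption)
    translation: Optional[str] = None  # target-language sentence; None in a monolingual database


def retrieve(query: List[str], database: List[Entry]) -> List[Entry]:
    """Step 102: keep the entries whose source sentence contains every query word."""
    return [e for e in database if all(q in e.tokens for q in query)]


database = [
    Entry(["如何", "提高", "系统", "的", "效率"], "How to improve the efficiency of the system"),
    Entry(["提高", "生活", "水平"], "Improve living standards"),
]
matches = retrieve(["提高", "效率"], database)
```

With a bilingual database, each retrieved entry carries its translation, so the target-language sentence is available without a second lookup.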
In step 103, the collocation strength between a matching example sentence and the query words is determined by the collocation probabilities among the query words and the collocation probabilities between each query word and the other words in the matching example sentence. The collocation probability is the likelihood that words form a collocation. For example, “提高” (improve) is often used together with “效率” (efficiency), so the collocation probability between them is relatively high, whereas “提高” and “面积” (area) are rarely used together, so the collocation probability between them is very low. Collocation probabilities can be obtained with existing techniques: for example, collecting word co-occurrence statistics offline over a large-scale corpus yields a language model that contains the collocation probabilities between words. Because computing multi-word co-occurrence probabilities is a very mature technique in natural language processing, its details are not repeated here.
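The patent leaves the estimator to existing techniques. As one possible illustration only, the sketch below estimates a collocation probability as the fraction of corpus sentences in which two words co-occur. The function name `build_collocation_table`, the sentence-level co-occurrence window, and the toy corpus are all assumptions, not the patent's method.

```python
from collections import Counter
from itertools import combinations


def build_collocation_table(corpus):
    """Count sentence-level co-occurrences and return a lookup function p(a, b)."""
    pair_counts = Counter()
    for tokens in corpus:
        # Each unordered pair of distinct words is counted once per sentence.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
    total = len(corpus)

    def p(a, b):
        if a == b or total == 0:
            return 0.0
        key = (a, b) if a < b else (b, a)
        return pair_counts[key] / total

    return p


# Toy, pre-segmented corpus (illustrative data only).
corpus = [
    ["提高", "效率"],
    ["提高", "工作", "效率"],
    ["提高", "面积"],
    ["扩大", "面积"],
]
p = build_collocation_table(corpus)
```

In this toy corpus `p("提高", "效率")` is higher than `p("提高", "面积")`, matching the intuition in the text. A production system would more likely use a windowed or association-based measure over a far larger corpus.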
The collocation strength between a matching example sentence and the query words measures how tightly the query words are bound together within that sentence. It distinguishes different uses of the same query words in different matching example sentences, so that the sentences in which the query words are closely connected can be returned to the user. These are usually the sentences the user actually wants.
Collocation strength can be implemented in several ways on the basis of the collocation probabilities between words. In one implementation, the collocation strength equals the ratio of the maximum collocation probability among the query words to the maximum collocation probability between the query words and the other words in the matching example sentence, or the difference between those two maxima. Expressed as formulas:
$$M(Q,E)=\frac{\max_{w_i,w_j\in Q,\ i\neq j} p(w_i,w_j)}{\max_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)}$$
or
$$M(Q,E)=\max_{w_i,w_j\in Q,\ i\neq j} p(w_i,w_j)-\max_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)$$
where $Q$ is the user's query words, $E$ is the matching example sentence, $M(Q,E)$ is the collocation strength between the matching example sentence and the user's query words, $w_i$ and $w_j$ are each one of the query words, $w_k$ is a word in the matching example sentence other than the query words, and $p(w_i,w_j)$ or $p(w_i,w_k)$ is the collocation probability between two words.
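The maximum-based variants can be sketched in Python as follows. The function names, the toy probability table, and the handling of a zero denominator are assumptions made for illustration. The sketch also treats the "other words" as the distinct non-query words of the sentence and expects at least two query words, as in the embodiment.

```python
from itertools import combinations


def pair_probabilities(query, sentence, p):
    """Split collocation probabilities into query-query and query-other groups."""
    others = sorted(set(sentence) - set(query))
    inner = [p(a, b) for a, b in combinations(query, 2)]
    outer = [p(q, w) for q in query for w in others]
    return inner, outer


def strength_max_ratio(query, sentence, p):
    inner, outer = pair_probabilities(query, sentence, p)
    denominator = max(outer, default=0.0)
    # Assumption: a sentence with no competing collocation gets the top score.
    return float("inf") if denominator == 0 else max(inner) / denominator


def strength_max_difference(query, sentence, p):
    inner, outer = pair_probabilities(query, sentence, p)
    return max(inner) - max(outer, default=0.0)


# Toy collocation probabilities (illustrative values only).
TABLE = {
    frozenset(("提高", "效率")): 0.5,
    frozenset(("提高", "生产力")): 0.8,
    frozenset(("效率", "系统")): 0.2,
}


def p(a, b):
    return TABLE.get(frozenset((a, b)), 0.0)


query = ["提高", "效率"]
s1 = ["生产力", "的", "提高", "可以", "实现", "更", "高", "的", "效率"]
s2 = ["如何", "提高", "大规模", "检索", "系统", "的", "效率"]
```

The numerator is the same for both sentences because it depends only on the query words. The sentences are separated by the denominator: in `s1`, “提高” collocates strongly with “生产力”, which lowers the score, so `s2` ranks higher under both variants.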
The collocation strength can also be calculated as the ratio of the average collocation probability among the query words to the average collocation probability between the query words and the other words in the matching example sentence, or as the difference between those two averages. Expressed as formulas:
$$M(Q,E)=\frac{\dfrac{2}{|Q|(|Q|-1)}\sum_{w_i,w_j\in Q,\ i<j} p(w_i,w_j)}{\dfrac{1}{|Q|(|E|-|Q|)}\sum_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)}$$
or
$$M(Q,E)=\frac{2}{|Q|(|Q|-1)}\sum_{w_i,w_j\in Q,\ i<j} p(w_i,w_j)-\frac{1}{|Q|(|E|-|Q|)}\sum_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)$$
where $Q$ is the user's query words, $E$ is the matching example sentence, $M(Q,E)$ is the collocation strength between the matching example sentence and the user's query words, $w_i$ and $w_j$ are each one of the query words, $w_k$ is a word in the matching example sentence other than the query words, $p(w_i,w_j)$ or $p(w_i,w_k)$ is the collocation probability between two words, $|Q|$ is the number of the user's query words, and $|E|$ is the number of words in the matching example sentence.
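A minimal Python sketch of the average-based variants follows. It averages over all query-query pairs and all query-other pairs, counting each distinct non-query word once; that normalization, the function names, and the toy table are assumptions for illustration.

```python
from itertools import combinations
from statistics import fmean


def mean_pair_probabilities(query, sentence, p):
    """Average collocation probability within the query and between query and other words."""
    others = sorted(set(sentence) - set(query))
    inner = [p(a, b) for a, b in combinations(query, 2)]
    outer = [p(q, w) for q in query for w in others]
    return fmean(inner), (fmean(outer) if outer else 0.0)


def strength_mean_ratio(query, sentence, p):
    inner_mean, outer_mean = mean_pair_probabilities(query, sentence, p)
    # Assumption: a zero denominator is treated as the strongest possible match.
    return float("inf") if outer_mean == 0 else inner_mean / outer_mean


def strength_mean_difference(query, sentence, p):
    inner_mean, outer_mean = mean_pair_probabilities(query, sentence, p)
    return inner_mean - outer_mean


# Toy collocation probabilities (illustrative values only).
TABLE = {
    frozenset(("提高", "效率")): 0.5,
    frozenset(("提高", "生产力")): 0.8,
    frozenset(("效率", "系统")): 0.2,
}


def p(a, b):
    return TABLE.get(frozenset((a, b)), 0.0)


query = ["提高", "效率"]
s1 = ["生产力", "的", "提高", "可以", "实现", "更", "高", "的", "效率"]
s2 = ["如何", "提高", "大规模", "检索", "系统", "的", "效率"]
```

Compared with the maximum-based variants, averaging dilutes a single strong competing collocation across all query-other pairs, so the gap between `s1` and `s2` depends on sentence length as well as on the competing pair.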
In addition, the collocation strength can be calculated as the ratio of the sum of the collocation probabilities among the query words to the sum of the collocation probabilities between the query words and the other words in the matching example sentence, multiplied by a length correction factor, where the length correction factor is a function of the length of the matching example sentence. Expressed as a formula:
$$M(Q,E)=L(E)\cdot\frac{\sum_{w_i,w_j\in Q,\ i<j} p(w_i,w_j)}{\sum_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)}$$
where $Q$ is the user's query words, $E$ is the matching example sentence, $M(Q,E)$ is the collocation strength between the matching example sentence and the user's query words, $w_i$ and $w_j$ are each one of the query words, $w_k$ is a word in the matching example sentence other than the query words, $p(w_i,w_j)$ or $p(w_i,w_k)$ is the collocation probability between two words, and $L(E)$ is a function of the length of the matching example sentence, used to keep sentences that are too long or too short from distorting the collocation strength.
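The patent does not specify the shape of $L(E)$. The sketch below assumes, purely for illustration, a factor that equals 1 for sentences between `lo` and `hi` tokens and shrinks linearly outside that range. The names `length_factor` and `strength_sum_ratio`, the bounds 5 and 30, and the toy table are hypothetical.

```python
from itertools import combinations


def length_factor(n_tokens, lo=5, hi=30):
    """Hypothetical L(E): 1.0 inside [lo, hi], smaller for shorter or longer sentences."""
    if n_tokens < lo:
        return n_tokens / lo
    if n_tokens > hi:
        return hi / n_tokens
    return 1.0


def strength_sum_ratio(query, sentence, p):
    others = sorted(set(sentence) - set(query))
    inner = sum(p(a, b) for a, b in combinations(query, 2))
    outer = sum(p(q, w) for q in query for w in others)
    # Assumption: a zero denominator is treated as the strongest possible match.
    ratio = float("inf") if outer == 0 else inner / outer
    return length_factor(len(sentence)) * ratio


# Toy collocation probabilities (illustrative values only).
TABLE = {
    frozenset(("提高", "效率")): 0.5,
    frozenset(("提高", "生产力")): 0.8,
    frozenset(("效率", "系统")): 0.2,
}


def p(a, b):
    return TABLE.get(frozenset((a, b)), 0.0)


query = ["提高", "效率"]
s1 = ["生产力", "的", "提高", "可以", "实现", "更", "高", "的", "效率"]
s2 = ["如何", "提高", "大规模", "检索", "系统", "的", "效率"]
```

Because the denominator is a sum, it tends to grow with sentence length; the correction factor is one way to keep that effect, and very short fragments, from dominating the ranking.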
The collocation strength can also be calculated in other ways, which are not enumerated here.
In step 104, the matching example sentences are ranked by the collocation strength calculated in step 103. The ranked matching example sentences can then be passed on to other systems or applications.
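Steps 101 to 104 can be put together as in the following Python sketch. `rank_examples` is a hypothetical name, and `toy_strength` is a stand-in scorer with made-up values, used only so that the example is self-contained; any of the collocation strength definitions above could be passed in its place.

```python
def rank_examples(query, database, strength):
    """Retrieve sentences containing every query word, then sort by strength, highest first."""
    matches = [s for s in database if all(q in s for q in query)]            # step 102
    return sorted(matches, key=lambda s: strength(query, s), reverse=True)  # steps 103 and 104


def toy_strength(query, sentence):
    # Stand-in scorer (assumption): a lower score when a query word has a stronger
    # collocate elsewhere in the sentence, as "提高" does with "生产力".
    return 0.625 if "生产力" in sentence else 2.5


database = [
    ["生产力", "的", "提高", "可以", "实现", "更", "高", "的", "效率"],
    ["如何", "提高", "大规模", "检索", "系统", "的", "效率"],
    ["提高", "生活", "水平"],
]
ranked = rank_examples(["提高", "效率"], database, toy_strength)
```

The third sentence is dropped at retrieval because it lacks “效率”; the remaining two are returned with the sentence in which the query words collocate directly placed first.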
In another embodiment of the present invention, step 104 may be followed by displaying the ranked matching example sentences. When the example sentences are displayed, the collocation strength level between each matching example sentence and the user's query words can also be determined and shown beside it. Specifically, the collocation strength of the query words in each matching example sentence is graded according to preset threshold intervals, and the resulting level is displayed along with the ranked matching example sentences. For example, collocation strength may be divided into three levels: strong when the collocation strength is greater than 0.8, medium when it is greater than 0.3 and less than 0.8, and weak when it is less than 0.3. The strong, medium, and weak levels are marked on the corresponding matching example sentences with five, four, and three five-pointed stars respectively.
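The grading in this example can be sketched in Python as follows. The thresholds 0.8 and 0.3 and the star counts come from the text; the function name `grade` and the treatment of values exactly equal to a threshold (assigned to the medium level here) are assumptions, since the text does not say where boundary values fall.

```python
def grade(strength, strong_above=0.8, weak_below=0.3):
    """Map a collocation strength to a (level, star count) pair."""
    if strength > strong_above:
        return "strong", 5
    if strength < weak_below:
        return "weak", 3
    # Assumption: values exactly at a threshold are treated as medium.
    return "medium", 4


def stars(strength):
    """Render the level as a row of five-pointed stars for display."""
    return "★" * grade(strength)[1]
```

For instance, `grade(0.9)` gives the strong level with five stars, and `stars(0.1)` renders three stars.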
Refer to Fig. 2, a schematic diagram of Embodiment 1 of the example sentence retrieval result display interface in the present invention. As shown in Fig. 2, for the query formed by the query words “提高” + “效率”, example sentences 1 to 3 are three matching example sentences arranged from high to low by the collocation strength calculated in step 103. To the right of each example sentence are several five-pointed stars that indicate the collocation strength level between the matching example sentence and the user's query words; more stars indicate a higher collocation strength. Marking matching example sentences by collocation strength is not limited to the form shown in Fig. 2: any symbol, number, text, or graphic that can express magnitude or quantity falls within the scope of the present invention.
Refer to Fig. 3, a schematic diagram of Embodiment 2 of the example sentence retrieval result display interface in the present invention. As shown in Fig. 3, if the sentence database used to retrieve matching example sentences in step 102 is a bilingual sentence database, each ranked matching example sentence is displayed together with the example sentence in the other language that is its translation in the bilingual sentence database.
Compared with Fig. 2, in Fig. 3 each of example sentences 1 to 3 has its corresponding English translation displayed below it. The translations can also be placed above the example sentences or in any other position that shows the correspondence.
Refer to Fig. 4, a schematic structural block diagram of an embodiment of the device for ranking example sentence retrieval results in the present invention. As shown in Fig. 4, the device includes a receiving unit 201, a retrieval unit 202, a calculation unit 203, a ranking unit 204, a display unit 205, and a determining unit 206.
其中接收单元201,用于获取用户的查询词。Wherein the receiving unit 201 is used to obtain the user's query words.
用户在进行语言学习时,在查询一个词或多个词时的目的通常是不一样的,在查询一个词的时候,用户希望获得包含该词语的例句,以了解查询词在句子中的用法,而用户在查询多个词时,通常这多个词在使用时是有搭配关系的,用户希望获得包含这几个查询词的例句,同时希望了解这几个查询词之间的搭配关系是如何体现在例句中的。在本发明实施例中将只考虑两个或两个以上的查询词在例句中搭配关系的状况,因此接收单元201获取的用户查询词为多个查询词。When a user is learning a language, the purpose of querying one word or multiple words is usually different. When querying a word, the user hopes to obtain an example sentence containing the word in order to understand the usage of the query word in the sentence. When a user queries multiple words, usually these multiple words have a collocation relationship when they are used. The user hopes to obtain example sentences containing these query words, and at the same time wants to know how the collocation relationship between these query words is. reflected in the example sentences. In the embodiment of the present invention, only the collocation relationship of two or more query words in the example sentence will be considered, so the user query words acquired by the receiving unit 201 are multiple query words.
检索单元202,用于从句库中检索包含各查询词的匹配例句。The retrieval unit 202 is configured to retrieve matching example sentences containing each query word from the sentence database.
句库可以是单语句库或双语句库。单语句库是由一种语言的句子形成的句库,双语句库是由双语句对形成的句库,该句对由两种不同语言的句子构成,并且这两个句子互为对方的译文。句库可以通过现有技术在线下生成,例如单语句库可以从一种语言的大规模语料中得来,而双语句库可以从大规模双语语料中提取得来。如果句库为双语句库,在检索得到源语言的匹配例句时,其对应的目标语言例句也可以相应得到。The sentence base can be a single sentence base or a double sentence base. A single sentence base is a sentence base formed by sentences in one language, and a bilingual sentence base is a sentence base formed by a pair of double sentences, which are composed of sentences in two different languages, and these two sentences are translations of each other . Sentence bases can be generated offline through existing technologies. For example, a single sentence base can be obtained from a large-scale corpus of one language, while a bilingual sentence base can be extracted from a large-scale bilingual corpus. If the sentence database is a bilingual sentence database, when matching example sentences in the source language are retrieved, the corresponding example sentences in the target language can also be obtained accordingly.
计算单元203,用于计算各个匹配例句与查询词之间的搭配强度。Calculation unit 203, configured to calculate the collocation strength between each matching example sentence and the query word.
匹配例句与查询词之间的搭配强度由各查询词之间的搭配概率及各查询词与匹配例句中除各查询词之外的其他词之间的搭配概率来确定。搭配概率是指词语之间形成搭配关系的可能性。例如“提高”常和“效率”一起使用,那么“提高”和“效率”之间的搭配概率就较高,而“提高”和“面积”很少会在一起使用,那么“提高”和“面积”之间的搭配概率就很小。搭配概率可以通过现有技术获得,例如通过线下的大规模语料库进行词与词之间的共现概率的统计,就可以得到包含词和词之间的搭配概率的语言模型。由于在自然语言处理中,计算词和词之间的多元共现概率是非常成熟的技术,因此在本发明中将不再赘述其具体内容。The collocation strength between the matching example sentence and the query word is determined by the collocation probability between each query word and the collocation probability between each query word and other words in the matching example sentence except each query word. Collocation probability refers to the possibility of forming a collocation relationship between words. For example, "improvement" is often used together with "efficiency", then the probability of collocation between "enhancement" and "efficiency" is relatively high, while "increase" and "area" are rarely used together, then "increase" and " The probability of collocation between "area" is very small. The collocation probability can be obtained through existing technologies, for example, the statistics of the co-occurrence probability between words can be obtained through a large-scale offline corpus, and a language model including the collocation probability between words can be obtained. Since it is a very mature technology to calculate the multivariate co-occurrence probability between words and words in natural language processing, its specific content will not be repeated in the present invention.
The collocation strength between a matching example sentence and the query words measures how tightly the query words are bound to one another within that sentence. It makes it possible to distinguish between different uses of the same query words in different matching example sentences, so that the sentences in which the query words are closely tied to each other can be returned to the user. These are usually the sentences the user actually wants.
Given the collocation probabilities between words, the collocation strength can be implemented in several ways. In one implementation, the collocation strength equals the ratio of the maximum collocation probability among the query words to the maximum collocation probability between a query word and a non-query word in the matching example sentence, or alternatively the difference between these two maxima. As formulas:
$$M(Q,E)=\frac{\max_{w_i,w_j\in Q} p(w_i,w_j)}{\max_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)}$$
or
$$M(Q,E)=\max_{w_i,w_j\in Q} p(w_i,w_j)-\max_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)$$
where $Q$ is the user's query words, $E$ is the matching example sentence, $M(Q,E)$ is the collocation strength between the matching example sentence and the user's query words, $w_i$ and $w_j$ are each one of the query words, $w_k$ is a word in the matching example sentence other than the query words, and $p(w_i,w_j)$ or $p(w_i,w_k)$ is the collocation probability between two words.
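The maximum-based collocation strength can be sketched as below. The probability table is invented for illustration, `lookup` stands in for whatever language model supplies collocation probabilities, and the handling of a zero denominator is an assumption, since the text does not address it.

```python
from itertools import combinations


def max_based_strength(query_words, sentence_tokens, p, mode="ratio"):
    """Max query-query collocation probability versus max query-other probability."""
    query = list(dict.fromkeys(query_words))  # de-duplicate, keep order
    others = [w for w in dict.fromkeys(sentence_tokens) if w not in query]
    inner = max((p(a, b) for a, b in combinations(query, 2)), default=0.0)
    outer = max((p(q, w) for q in query for w in others), default=0.0)
    if mode == "difference":
        return inner - outer
    if outer == 0:
        # Assumption: with no competing collocation the ratio is unbounded.
        return float("inf") if inner > 0 else 0.0
    return inner / outer


# Hypothetical collocation probabilities, for illustration only.
TABLE = {
    frozenset(("提高", "效率")): 0.6,
    frozenset(("提高", "工作")): 0.2,
    frozenset(("效率", "工作")): 0.3,
}


def lookup(a, b):
    """Symmetric lookup into the hypothetical table."""
    return TABLE.get(frozenset((a, b)), 0.0)


sentence = ["提高", "工作", "效率"]
ratio = max_based_strength(["提高", "效率"], sentence, lookup)
difference = max_based_strength(["提高", "效率"], sentence, lookup, mode="difference")
```

Here the query words collocate with each other more strongly (0.6) than either does with "工作" (at most 0.3), so both variants score the sentence favorably.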
The collocation strength can also be calculated as the ratio of the average collocation probability among the query words to the average collocation probability between a query word and a non-query word in the matching example sentence, or alternatively as the difference between these two averages. As formulas:
$$M(Q,E)=\frac{\operatorname{avg}_{w_i,w_j\in Q}\, p(w_i,w_j)}{\operatorname{avg}_{w_i\in Q,\ w_k\in E\setminus Q}\, p(w_i,w_k)}$$
or
$$M(Q,E)=\operatorname{avg}_{w_i,w_j\in Q}\, p(w_i,w_j)-\operatorname{avg}_{w_i\in Q,\ w_k\in E\setminus Q}\, p(w_i,w_k)$$
where $Q$ is the user's query words, $E$ is the matching example sentence, $M(Q,E)$ is the collocation strength between the matching example sentence and the user's query words, $w_i$ and $w_j$ are each one of the query words, $w_k$ is a word in the matching example sentence other than the query words, and $p(w_i,w_j)$ or $p(w_i,w_k)$ is the collocation probability between two words. $|Q|$ is the number of the user's query words and $|E|$ is the number of words in the matching example sentence; these counts determine how many word pairs each average is taken over.
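A sketch of the average-based variant follows. It assumes each average is taken over distinct word pairs (query-query pairs, and query-other pairs), which is one reading of the description; the probability table and the zero-denominator handling are likewise illustrative assumptions.

```python
from itertools import combinations
from statistics import mean


def average_based_strength(query_words, sentence_tokens, p, mode="ratio"):
    """Average query-query collocation probability versus average query-other probability."""
    query = list(dict.fromkeys(query_words))
    others = [w for w in dict.fromkeys(sentence_tokens) if w not in query]
    inner_values = [p(a, b) for a, b in combinations(query, 2)]
    outer_values = [p(q, w) for q in query for w in others]
    inner = mean(inner_values) if inner_values else 0.0
    outer = mean(outer_values) if outer_values else 0.0
    if mode == "difference":
        return inner - outer
    if outer == 0:
        # Assumption: treat a zero denominator as an unbounded ratio.
        return float("inf") if inner > 0 else 0.0
    return inner / outer


# Hypothetical collocation probabilities, for illustration only.
TABLE = {
    frozenset(("提高", "效率")): 0.6,
    frozenset(("提高", "工作")): 0.2,
    frozenset(("效率", "工作")): 0.3,
}


def lookup(a, b):
    """Symmetric lookup into the hypothetical table."""
    return TABLE.get(frozenset((a, b)), 0.0)


sentence = ["提高", "工作", "效率"]
avg_ratio = average_based_strength(["提高", "效率"], sentence, lookup)
avg_difference = average_based_strength(["提高", "效率"], sentence, lookup, mode="difference")
```

Compared with the maximum-based variant, averaging makes the score less sensitive to a single strongly collocating non-query word.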
In addition, the collocation strength can be calculated as the ratio of the sum of the collocation probabilities among the query words to the sum of the collocation probabilities between each query word and the non-query words in the matching example sentence, multiplied by a length correction factor, where the length correction factor is a function of the length of the matching example sentence. As a formula:
$$M(Q,E)=L(E)\cdot\frac{\sum_{w_i,w_j\in Q} p(w_i,w_j)}{\sum_{w_i\in Q,\ w_k\in E\setminus Q} p(w_i,w_k)}$$
where $Q$ is the user's query words, $E$ is the matching example sentence, $M(Q,E)$ is the collocation strength between the matching example sentence and the user's query words, $w_i$ and $w_j$ are each one of the query words, $w_k$ is a word in the matching example sentence other than the query words, $p(w_i,w_j)$ or $p(w_i,w_k)$ is the collocation probability between two words, and $L(E)$ is a function of the length of the matching example sentence that keeps overly long or overly short sentences from distorting the collocation strength.
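The sum-based variant with a length correction can be sketched as follows. The patent does not give a concrete form for L(E), so `length_factor` below is purely hypothetical: it returns 1.0 for sentences whose token count lies in an assumed preferred range and shrinks outside it. The probability table and the zero-denominator handling are also assumptions.

```python
from itertools import combinations


def length_factor(n_tokens, low=3, high=20):
    """Hypothetical L(E): 1.0 for lengths in [low, high], smaller outside."""
    if n_tokens < low:
        return n_tokens / low
    if n_tokens > high:
        return high / n_tokens
    return 1.0


def sum_based_strength(query_words, sentence_tokens, p):
    """L(E) times (sum of query-query probabilities / sum of query-other probabilities)."""
    query = list(dict.fromkeys(query_words))
    others = [w for w in dict.fromkeys(sentence_tokens) if w not in query]
    inner = sum(p(a, b) for a, b in combinations(query, 2))
    outer = sum(p(q, w) for q in query for w in others)
    if outer == 0:
        # Assumption: treat a zero denominator as an unbounded ratio.
        return float("inf") if inner > 0 else 0.0
    return length_factor(len(sentence_tokens)) * inner / outer


# Hypothetical collocation probabilities, for illustration only.
TABLE = {
    frozenset(("提高", "效率")): 0.6,
    frozenset(("提高", "工作")): 0.2,
    frozenset(("效率", "工作")): 0.3,
}


def lookup(a, b):
    """Symmetric lookup into the hypothetical table."""
    return TABLE.get(frozenset((a, b)), 0.0)


strength = sum_based_strength(["提高", "效率"], ["提高", "工作", "效率"], lookup)
```

Without the length factor, a long sentence accumulates many query-other terms in the denominator, so some correction of this kind is needed to compare sentences of different lengths.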
The collocation strength can also be calculated in other ways, which are not enumerated exhaustively here.
Sorting unit 204 is configured to sort the matching example sentences by the collocation strength between each matching example sentence and the query words. The sorted matching example sentences can then be passed on to other systems or applications.
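The sorting step itself is straightforward. The sketch below assumes the collocation strengths have already been computed; the sentence labels and scores are invented for illustration.

```python
def rank_examples(examples, strength):
    """Sort matching example sentences by descending collocation strength."""
    return sorted(examples, key=strength, reverse=True)


# Hypothetical precomputed strengths for three matching example sentences.
scores = {"例句A": 0.35, "例句B": 0.92, "例句C": 0.10}
ranked = rank_examples(list(scores), scores.get)
```

Python's `sorted` is stable, so sentences with equal strength keep their retrieval order.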
Display unit 205 is configured to display the sorted matching example sentences. If the sentence base searched by retrieval unit 202 is bilingual, display unit 205 also displays, alongside each sorted matching example sentence, the sentence in the other language that is its translation in the bilingual sentence base.
Determining unit 206 is configured to determine the collocation strength level between each matching example sentence and the user's query words. Specifically, the collocation strength of each matching example sentence is graded according to preset threshold intervals, and display unit 205 can additionally show this level when displaying the sorted matching example sentences. For example, the collocation strength may be divided into three levels: strong when the collocation strength is greater than 0.8, medium when it is greater than 0.3 and less than 0.8, and weak when it is less than 0.3. The strong, medium, and weak levels are marked on the corresponding matching example sentences with five, four, and three five-pointed stars respectively.
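The grading example can be sketched as below, using the thresholds 0.8 and 0.3 and the star counts from the text. The text does not say how values exactly equal to a threshold are graded; assigning them to the lower level is an assumption of this sketch, as are the function names.

```python
STARS = {"strong": 5, "medium": 4, "weak": 3}  # star counts from the example


def strength_level(strength, strong_above=0.8, medium_above=0.3):
    """Map a collocation strength to a level using preset thresholds."""
    if strength > strong_above:
        return "strong"
    if strength > medium_above:
        # Assumption: a value exactly equal to 0.8 falls here.
        return "medium"
    # Assumption: a value exactly equal to 0.3 falls here.
    return "weak"


def star_label(strength):
    """Render the level as a row of five-pointed stars."""
    return "★" * STARS[strength_level(strength)]


labels = [star_label(s) for s in (0.92, 0.35, 0.10)]
```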
Refer to Figure 2, a schematic diagram of Embodiment 1 of the display interface for example sentence retrieval results in the present invention. As shown in Figure 2, for the query formed by the query words "提高" ("improve") + "效率" ("efficiency"), suppose that example sentences 1 to 3 are three matching example sentences arranged from high to low collocation strength. To the right of each example sentence are several five-pointed stars indicating the collocation strength level between that matching example sentence and the user's query words; more stars indicate a higher collocation strength. Indicating the collocation strength of each matching example sentence is not limited to the form shown in Figure 2: any symbol, number, text, or graphic that can express magnitude or quantity falls within the scope of the present invention.
Refer to Figure 3, a schematic diagram of Embodiment 2 of the display interface for example sentence retrieval results in the present invention. As shown in Figure 3, if the sentence base searched by retrieval unit 202 is bilingual, each sorted matching example sentence is displayed together with the sentence in the other language that is its translation in the bilingual sentence base.
Compared with Figure 2, Figure 3 shows an English translation below each of example sentences 1 to 3. These translations could equally be placed above the example sentences or in any other position that makes the correspondence clear.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201110303380.2A CN102364469B (en) | 2011-10-09 | 2011-10-09 | A kind of method and device that illustrative sentence retrieval result is ranked up |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN102364469A (en) | 2012-02-29 |
| CN102364469B (en) | 2016-08-03 |
Family
ID=45691035
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201110303380.2A Active CN102364469B (en) | 2011-10-09 | 2011-10-09 | A kind of method and device that illustrative sentence retrieval result is ranked up |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN102364469B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103699672A (en) * | 2013-12-30 | 2014-04-02 | 北京百度网讯科技有限公司 | Method and device for retrieving example sentences |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070067274A1 (en) * | 2005-09-16 | 2007-03-22 | International Business Machines Corporation | Hybrid push-down/pull-up of unions with expensive operations in a federated query processor |
| CN101933017A (en) * | 2009-03-24 | 2010-12-29 | 三菱电机信息系统株式会社 | Document search device, document search system, document search program, and document search method |
| CN102023989A (en) * | 2009-09-23 | 2011-04-20 | 阿里巴巴集团控股有限公司 | Information retrieval method and system thereof |
| CN102207973A (en) * | 2011-06-22 | 2011-10-05 | 上海互联网软件有限公司 | Fuzzy search system and search method |
Non-Patent Citations (1)
| Title |
|---|
| 基于统计的常用词搭配(Collocation)的发现方法;孙健等;《情报学报(2002年)》;20020228;第21卷(第1期);12-16 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN102364469A (en) | 2012-02-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant |