CN110909217A

CN110909217A - Method and device for realizing search, electronic equipment and storage medium

Info

Publication number: CN110909217A
Application number: CN201811061039.9A
Authority: CN
Inventors: 王浩; 庞旭林; 张晨
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2018-09-12
Filing date: 2018-09-12
Publication date: 2020-03-24

Abstract

The invention discloses a method and a device for realizing search, electronic equipment and a storage medium. The method comprises the following steps: acquiring a search query statement; extracting a plurality of candidate words from the search query sentence, and generating a plurality of candidate words according to the search query sentence; further generating a rewriting sentence according to the extracted candidate words and the generated candidate words; and searching and inquiring according to the rewriting sentences to obtain a search result. The technical scheme combines two modes of extraction and generation to generate candidate words and further generates rewriting sentences, so that reasonable rewriting of search query sentences input by a user is realized according to semantics and scenes, and the results returned by search are further close to the actual requirements of the user.

Description

Search implementation method, device, electronic device and storage medium

技术领域technical field

本发明涉及搜索技术领域，具体涉及搜索的实现方法、装置、电子设备和存储介质。The present invention relates to the technical field of search, and in particular to a method, apparatus, electronic device and storage medium for realizing search.

背景技术Background technique

通常，搜索引擎更适用于输入由精准关键词组成的查询，由自然语言描述的查询会导致较差的返回结果。例如，图1示出了不同查询语句对应的查询结果的示意图，如图1中所示，用户在搜索时可能会输入“我想知道一个iPhone X要多少钱”这样的自然语言，尤其是在语言搜索的场景下更是如此。但是显然，这样得到的搜索结果不尽人意，而如果根据语义更换搜索词，如“iPhone X价格”，则搜索结果相对更符合用户需求。因此，如何对搜索词进行更换是需要解决的问题。Generally, search engines are more suitable for entering queries consisting of precise keywords, and queries described by natural language will result in poor returned results. For example, Figure 1 shows a schematic diagram of query results corresponding to different query sentences. As shown in Figure 1, a user may input natural language such as "I want to know how much an iPhone X costs" when searching, especially when This is especially true in the context of language search. But obviously, the search results obtained in this way are unsatisfactory, and if the search terms are changed according to semantics, such as "iPhone X price", the search results are relatively more in line with user needs. Therefore, how to replace the search term is a problem that needs to be solved.

发明内容SUMMARY OF THE INVENTION

鉴于上述问题，提出了本发明以便提供一种克服上述问题或者至少部分地解决上述问题的搜索的实现方法、装置、电子设备和存储介质。In view of the above problems, the present invention is proposed to provide a search implementation method, apparatus, electronic device and storage medium that overcome the above problems or at least partially solve the above problems.

依据本发明的一个方面，提供了一种搜索的实现方法，包括：获取搜索查询语句；从所述搜索查询语句中抽取出若干个候选词，以及根据所述搜索查询语句生成若干个候选词；根据抽取的候选词和生成的候选词，进一步生成改写语句；根据所述改写语句进行搜索查询，得到搜索结果。According to one aspect of the present invention, a method for realizing search is provided, including: obtaining a search query statement; extracting several candidate words from the search query statement, and generating several candidate words according to the search query statement; According to the extracted candidate words and the generated candidate words, a rewritten sentence is further generated; a search query is performed according to the rewritten sentence, and a search result is obtained.

可选地，所述搜索查询语句是根据用户输入的语音生成的。Optionally, the search query sentence is generated according to the voice input by the user.

可选地，所述从所述搜索查询语句中抽取出若干个候选词，以及根据所述搜索查询语句生成若干个候选词包括：对所述搜索查询语句进行编码，得到编码数据；以抽取模式对所述编码数据进行解码，输出第一候选词表，以及以生成模式对所述编码数据进行解码，输出第二候选词表。Optionally, extracting several candidate words from the search query statement, and generating several candidate words according to the search query statement includes: encoding the search query statement to obtain encoded data; The encoded data is decoded to output a first candidate vocabulary, and the encoded data is decoded in a generative mode to output a second candidate vocabulary.

可选地，所述对所述搜索查询语句进行编码，得到编码数据包括：对所述搜索查询语句进行词嵌入处理，得到所述搜索查询语句中包含的各词对应的词向量；根据各词向量进行编码，得到输入隐向量。Optionally, the encoding the search query statement to obtain the encoded data includes: performing word embedding processing on the search query statement to obtain a word vector corresponding to each word included in the search query statement; The vector is encoded to obtain the input hidden vector.

可选地，所述根据各词向量进行编码，得到隐向量包括：基于一层双向长短期记忆网络LSTM进行所述编码。Optionally, the encoding according to each word vector to obtain a latent vector includes: performing the encoding based on a layer of bidirectional long short-term memory network LSTM.

可选地，所述以抽取模式对所述编码数据进行解码，输出第一候选词表包括：根据所述输入隐向量计算注意力权重a^t；基于公式(1)和(2)计算所述搜索查询语句中各词的抽取权重：Optionally, the decoding the encoded data in the extraction mode, and outputting the first candidate vocabulary includes: calculating an attention weight at according to the input ^latent vector; calculating the The extraction weight of each word in the search query:

其中，P_extract(w)为目标词w的抽取权重，p_w为调节因子，f_w为目标词w在所述搜索查询语句中出现的次数，N是语料中所有查询的次数，|w|是语料中包含目标词w的查询个数，t为t时刻；所述第一候选词表包括一个或多个词及其对应的抽取权重。Among them, P _extract (w) is the extraction weight of the target word w, p _w is the adjustment factor, f _w is the number of times the target word w appears in the search query statement, N is the number of all queries in the corpus, |w| is the number of queries containing the target word w in the corpus, and t is time t; the first candidate word list includes one or more words and their corresponding extraction weights.

可选地，所述以生成模式对所述编码数据进行解码，输出第二候选词表包括：根据所述输入隐向量计算注意力权重a^t；根据所述注意力权重a^t和所述输入隐向量计算上下文权重C^t；根据所述注意力权重a^t、所述上下文权重C^t和当前时刻目标隐向量h_t计算所述第二候选词表的分布概率P_vocab。Optionally, the decoding the encoded data in the generation mode, and outputting the second candidate vocabulary includes: calculating an attention weight at according to the input ^latent vector; according to the attention weight ^at and the input The hidden vector calculates the context weight C ^t ; calculates the distribution probability P _vocab of the second candidate vocabulary according to the attention weight at , the context weight C ^t and the target hidden vector h ^t at the current moment _.

可选地，所述根据所述输入隐向量计算注意力权重a^t包括：基于公式(3)和(4)计算注意力权重a^t：Optionally, the calculating the attention weight at according to the input ^latent vector includes: calculating the attention weight at based on ^formulas (3) and (4):

a^t＝softmax(e^t) (4)；a ^t =softmax(e ^t ) (4);

其中，函数score用来比较目标隐向量h_t和输入隐向量

的相似程度，

为t时刻的覆盖向量，v、W₁、W₂、W_c和b_atten为预置参数；

为输入隐向量，h_t为输出隐向量。Among them, the function score is used to compare the target hidden vector h _t with the input hidden vector

the degree of similarity,

is the coverage vector at time t, and v, W ₁ , W ₂ , W _c and _batten are preset parameters;

is the input hidden vector, and h _t is the output hidden vector.

可选地，所述根据所述注意力权重a^t和所述输入隐向量计算上下文权重C^t包括：基于公式(5)和(6)计算上下文权重C^t：Optionally, calculating the context weight C ^t according to the attention weight at and the input latent vector includes: calculating the context weight C ^t based on formulas (5) and (6 ⁾ :

其中，cov^t为t时刻的覆盖矩阵。where cov ^t is the coverage matrix at time t.

可选地，所述根据所述注意力权重a^t、所述上下文权重C^t和当前时刻目标隐向量h_t计算所述第二候选词表的分布概率P_vocab包括：基于公式(7)计算P_vocab：Optionally, calculating the distribution probability P _vocab of the second candidate vocabulary according to the attention weight a ^t , the context weight C ^t and the target latent vector h _t at the current moment includes: calculating based on formula (7) P _vocab :

P_vocab＝f(C^t,h_t)＝softmax(V'(V[h_t,C^t]+b)+b') (7)；P _vocab =f(C ^t ,h _t )=softmax(V'(V[h _t ,C ^t ]+b)+b') (7);

其中，V、b和V'、b'为两步线性变换参数矩阵和偏置向量。Among them, V, b and V', b' are two-step linear transformation parameter matrix and bias vector.

可选地，所述以抽取模式对所述编码数据进行解码，输出第一候选词表，以及以生成模式对所述编码数据进行解码，输出第二候选词表包括：基于一层单向LSTM实现解码。Optionally, the decoding of the encoded data in the extraction mode, the output of the first candidate vocabulary, and the decoding of the encoded data in the generation mode, and the output of the second candidate vocabulary include: based on a layer of one-way LSTM implement decoding.

可选地，所述根据抽取的候选词和生成的候选词，进一步生成改写语句包括：根据第一候选词表的抽取权重P_extract和第二候选词表的分布概率P_vocab以及调节因子p_gen确定第三候选词表，根据第三候选词表生成改写语句。Optionally, further generating the rewritten statement according to the extracted candidate words and the generated candidate words includes: according to the extraction weight P _extract of the first candidate vocabulary and the distribution probability P _vocab of the second candidate vocabulary and the adjustment factor p _gen A third candidate vocabulary is determined, and a rewritten sentence is generated according to the third candidate vocabulary.

可选地，所述根据第一候选词表的抽取权重P_extract和第二候选词表的分布概率P_vocab以及调节因子p_gen确定第三候选词表包括：基于公式(8)计算调节因子p_gen：Optionally, determining the third candidate vocabulary according to the extraction weight P _extract of the first candidate vocabulary, the distribution probability P _vocab of the second candidate vocabulary, and the adjustment factor p _gen includes: calculating the adjustment factor p based on formula (8). _gen :

其中，w_h、w_s、w_x和b为预设参数，x_t为输入的搜索查询语句，σ是sigmoid函数；Wherein, w _h , w _s , w _x and b are preset parameters, x _t is the input search query statement, and σ is the sigmoid function;

基于公式(9)计算第三候选词表中各候选词的概率：Calculate the probability of each candidate word in the third candidate word list based on formula (9):

P(w)＝p_genP_vocab(w)+(1-p_gen)P_extract(w) (9)。P(w)=p _gen P _vocab (w)+(1-p _gen )P _extract (w) (9).

依据本发明的另一方面，提供了一种搜索的实现装置，包括：获取单元，适于获取搜索查询语句；改写单元，适于从所述搜索查询语句中抽取出若干个候选词，以及根据所述搜索查询语句生成若干个候选词；根据抽取的候选词和生成的候选词，进一步生成改写语句；搜索单元，适于根据所述改写语句进行搜索查询，得到搜索结果。According to another aspect of the present invention, a search implementation device is provided, comprising: an obtaining unit, adapted to obtain a search query; a rewriting unit, adapted to extract several candidate words from the search query; The search query statement generates several candidate words; according to the extracted candidate words and the generated candidate words, a rewritten statement is further generated; the search unit is adapted to perform a search query according to the rewritten statement to obtain a search result.

可选地，所述改写单元，适于对所述搜索查询语句进行编码，得到编码数据；以抽取模式对所述编码数据进行解码，输出第一候选词表，以及以生成模式对所述编码数据进行解码，输出第二候选词表。Optionally, the rewriting unit is adapted to encode the search query statement to obtain encoded data; decode the encoded data in an extraction mode, output a first candidate vocabulary, and encode the encoded data in a generation mode. The data is decoded and the second candidate vocabulary is output.

可选地，所述改写单元，适于对所述搜索查询语句进行词嵌入处理，得到所述搜索查询语句中包含的各词对应的词向量；根据各词向量进行编码，得到输入隐向量。Optionally, the rewriting unit is adapted to perform word embedding processing on the search query sentence to obtain word vectors corresponding to each word included in the search query sentence; and to obtain an input latent vector by encoding each word vector.

可选地，所述改写单元，适于基于一层双向长短期记忆网络LSTM进行所述编码。Optionally, the rewriting unit is adapted to perform the encoding based on a one-layer bidirectional long short-term memory network LSTM.

可选地，所述改写单元，适于根据所述输入隐向量计算注意力权重a^t；基于公式(1)和(2)计算所述搜索查询语句中各词的抽取权重：Optionally, the rewriting unit is adapted to calculate the attention weight at according to the input ^latent vector; calculate the extraction weight of each word in the search query sentence based on formulas (1) and (2):

可选地，所述改写单元，适于根据所述输入隐向量计算注意力权重a^t；根据所述注意力权重a^t和所述输入隐向量计算上下文权重C^t；根据所述注意力权重a^t、所述上下文权重C^t和当前时刻目标隐向量h_t计算所述第二候选词表的分布概率P_vocab。Optionally, the rewriting unit ^{is adapted to calculate an attention weight at according to the input latent vector; calculate a context weight C t} ^according to the attention weight ^at and the input latent vector; according to the attention weight at ^t , the context weight C ^t and the target latent vector h _t at the current moment to calculate the distribution probability P _vocab of the second candidate vocabulary.

可选地，所述改写单元，适于基于公式(3)和(4)计算注意力权重a^t：Optionally, the rewriting unit is adapted to calculate the attention weight at based on ^formulas (3) and (4):

a^t＝softmax(e^t) (4)；a ^t =softmax(e ^t ) (4);

其中，函数score用来比较目标隐向量h_t和输入隐向量

的相似程度，

为t时刻的覆盖向量，v、W₁、W₂、W_c和b_atten为预置参数；

the degree of similarity,

is the input hidden vector, and h _t is the output hidden vector.

可选地，所述改写单元，适于基于公式(5)和(6)计算上下文权重C^t：Optionally, the rewriting unit is adapted to calculate the context weight C ^t based on formulas (5) and (6):

可选地，所述改写单元，适于基于公式(7)计算P_vocab：Optionally, the rewriting unit is adapted to calculate P _vocab based on formula (7):

可选地，所述改写单元，适于基于一层单向LSTM实现解码。Optionally, the rewriting unit is adapted to implement decoding based on a layer of unidirectional LSTM.

可选地，所述改写单元，适于根据第一候选词表的抽取权重P_extract和第二候选词表的分布概率P_vocab以及调节因子p_gen确定第三候选词表，根据第三候选词表生成改写语句。Optionally, the rewriting unit is adapted to determine the third candidate vocabulary according to the extraction weight P _extract of the first candidate vocabulary, the distribution probability P _vocab of the second candidate vocabulary and the adjustment factor p _gen , and according to the third candidate vocabulary Table generation rewrite statement.

可选地，所述改写单元，适于基于公式(8)计算调节因子p_gen：Optionally, the rewriting unit is adapted to calculate the adjustment factor p _gen based on formula (8):

其中，w_h、w_s、w_x和b为预设参数，x_t为输入的搜索查询语句，σ是sigmoid函数；基于公式(9)计算第三候选词表中各候选词的概率：Among them, w _h , w _s , w _x and b are preset parameters, x _t is the input search query statement, and σ is the sigmoid function; the probability of each candidate word in the third candidate word list is calculated based on formula (9):

依据本发明的又一方面，提供了一种电子设备，包括：处理器；以及被安排成存储计算机可执行指令的存储器，所述可执行指令在被执行时使所述处理器执行如上述任一所述的方法。According to yet another aspect of the present invention, there is provided an electronic device comprising: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform any of the above a described method.

依据本发明的再一方面，提供了一种计算机可读存储介质，其中，所述计算机可读存储介质存储一个或多个程序，所述一个或多个程序当被处理器执行时，实现如上述任一所述的方法。According to yet another aspect of the present invention, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores one or more programs, the one or more programs, when executed by a processor, implements the following: Any of the methods described above.

由上述可知，本发明的技术方案，在获取到搜索查询语句后，从搜索查询语句中抽取出若干个候选词，以及根据搜索查询语句生成若干个候选词，根据抽取的候选词和生成的候选词，进一步生成改写语句，根据改写语句进行搜索查询，得到搜索结果。该技术方案结合了抽取和生成两种方式生成候选词，并进一步生成改写语句，实现了根据语义和场景对用户输入的搜索查询语句进行合理的改写，进一步使得搜索返回的结果贴近用户的实际需求。It can be seen from the above that the technical scheme of the present invention extracts several candidate words from the search query statement after acquiring the search query statement, generates several candidate words according to the search query statement, and generates several candidate words according to the extracted candidate words and the generated candidate words. word, further generate a rewritten sentence, perform a search query according to the rewritten sentence, and obtain a search result. The technical solution combines two methods of extraction and generation to generate candidate words, and further generates rewriting sentences, which realizes reasonable rewriting of search query sentences input by users according to semantics and scenarios, and further makes the results returned by the search closer to the actual needs of users. .

上述说明仅是本发明技术方案的概述，为了能够更清楚了解本发明的技术手段，而可依照说明书的内容予以实施，并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂，以下特举本发明的具体实施方式。The above description is only an overview of the technical solutions of the present invention, in order to be able to understand the technical means of the present invention more clearly, it can be implemented according to the content of the description, and in order to make the above and other purposes, features and advantages of the present invention more obvious and easy to understand , the following specific embodiments of the present invention are given.

附图说明Description of drawings

通过阅读下文优选实施方式的详细描述，各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的，而并不认为是对本发明的限制。而且在整个附图中，用相同的参考符号表示相同的部件。在附图中：Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are for the purpose of illustrating preferred embodiments only and are not to be considered limiting of the invention. Also, the same components are denoted by the same reference numerals throughout the drawings. In the attached image:

图1示出了不同查询语句对应的查询结果的示意图；1 shows a schematic diagram of query results corresponding to different query statements;

图2示出了根据本发明一个实施例的一种搜索的实现方法的流程示意图；2 shows a schematic flowchart of a method for implementing a search according to an embodiment of the present invention;

图3示出了根据本发明一个实施例的查询改写模型的结构示意图；3 shows a schematic structural diagram of a query rewriting model according to an embodiment of the present invention;

图4示出了根据本发明一个实施例的一种搜索的实现装置的结构示意图；4 shows a schematic structural diagram of an apparatus for implementing a search according to an embodiment of the present invention;

图5示出了根据本发明一个实施例的电子设备的结构示意图；FIG. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention;

图6示出了根据本发明一个实施例的计算机可读存储介质的结构示意图。FIG. 6 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

具体实施方式Detailed ways

下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例，然而应当理解，可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反，提供这些实施例是为了能够更透彻地理解本公开，并且能够将本公开的范围完整的传达给本领域的技术人员。Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thoroughly understood, and will fully convey the scope of the present disclosure to those skilled in the art.

图2示出了根据本发明一个实施例的一种搜索的实现方法的流程示意图。如图2所示，该方法包括：FIG. 2 shows a schematic flowchart of a method for implementing a search according to an embodiment of the present invention. As shown in Figure 2, the method includes:

步骤S210，获取搜索查询语句。Step S210, obtaining a search query statement.

步骤S220，从搜索查询语句中抽取出若干个候选词，以及根据搜索查询语句生成若干个候选词。Step S220, extracting several candidate words from the search query statement, and generating several candidate words according to the search query statement.

抽取式改写，可以使用特定的计算规则计算用户输入的搜索查询语句中各词的权重，选取权重大的词作为关键词。这种方法简单便捷，但是所有的关键词限制在输入的词集合中，而且倾向于抽取词频高的词，在某些场合下效果较差。Extractive rewriting can use a specific calculation rule to calculate the weight of each word in the search query sentence input by the user, and select a word with a significant weight as a keyword. This method is simple and convenient, but all keywords are limited to the input word set, and words with high frequency tend to be extracted, which is less effective in some cases.

生成式改写，可以“理解”用户的输入，然后基于用户的意图生成一些关键词。这种方法能生成新的词语，但是生成过程往往不可控，也会生成一些完全错误的词语。Generative rewriting can "understand" the user's input and then generate some keywords based on the user's intent. This method can generate new words, but the generation process is often uncontrollable, and it also generates some completely wrong words.

以图1示出的搜索场景为例，抽取式的方法会抽取出“iPhone X”和“多少”，这两个词全部来自于用户输入的搜索查询语句，不足以概括其意图。而生成模式的方法会依据训练语料产生不同的结果。例如会生成“iPhone 8”和“价钱”。虽然能生成新的词语，但是模型在生成词语时会根据训练语料制作的词表来计算概率，如果“iPhone X”不在训练语料中则只能用一个错误的近义词替代。这样的结果会导致搜索页面错误。Taking the search scenario shown in Figure 1 as an example, the extraction method will extract "iPhone X" and "how many", both of which come from the search query sentences entered by the user and are insufficient to summarize their intent. The method of generating patterns will produce different results depending on the training corpus. For example, "iPhone 8" and "Price" are generated. Although new words can be generated, the model will calculate the probability based on the vocabulary made by the training corpus when generating the words. If "iPhone X" is not in the training corpus, it can only be replaced by a wrong synonym. Such results can lead to search page errors.

因此，上述两种方法各有利弊，仅用一种方法来改写查询往往得不到更好地结果。而在本发明的实施例中，将这两种方法结合起来，对于查询改写有着重要的意义，最终生成的改写语句也更加准确。Therefore, the above two methods have their own advantages and disadvantages, and only one method to rewrite the query often does not get better results. In the embodiment of the present invention, the combination of these two methods is of great significance for query rewriting, and the finally generated rewriting statement is also more accurate.

步骤S230，根据抽取的候选词和生成的候选词，进一步生成改写语句。Step S230, further generating a rewritten sentence according to the extracted candidate words and the generated candidate words.

步骤S240，根据改写语句进行搜索查询，得到搜索结果。In step S240, a search query is performed according to the rewritten sentence, and a search result is obtained.

可见，图2所示的方法，在获取到搜索查询语句后，从搜索查询语句中抽取出若干个候选词，以及根据搜索查询语句生成若干个候选词，根据抽取的候选词和生成的候选词，进一步生成改写语句，根据改写语句进行搜索查询，得到搜索结果。该技术方案结合了抽取和生成两种方式生成候选词，并进一步生成改写语句，实现了根据语义和场景对用户输入的搜索查询语句进行合理的改写，进一步使得搜索返回的结果贴近用户的实际需求。It can be seen that in the method shown in FIG. 2, after the search query statement is obtained, several candidate words are extracted from the search query statement, and several candidate words are generated according to the search query statement, and according to the extracted candidate words and the generated candidate words , further generate a rewritten statement, perform a search query according to the rewritten statement, and obtain a search result. The technical solution combines two methods of extraction and generation to generate candidate words, and further generates rewriting sentences, which realizes reasonable rewriting of search query sentences input by users according to semantics and scenarios, and further makes the results returned by the search closer to the actual needs of users. .

在本发明的一个实施例中，上述方法中，搜索查询语句是根据用户输入的语音生成的。In an embodiment of the present invention, in the above method, the search query sentence is generated according to the voice input by the user.

语音搜索已经被很多搜索引擎所支持，实际上可以通过将输入的语音转成文字，利用现有搜索技术进行实现。在这种情况下，搜索查询语句更为口语化，也就更不容易根据搜索查询语句得到用户想要的结果。Voice search has been supported by many search engines. In fact, it can be implemented by using existing search technology by converting input voice into text. In this case, the search query statement is more colloquial, and it is more difficult to obtain the result desired by the user according to the search query statement.

在本发明的一个实施例中，上述方法中，从搜索查询语句中抽取出若干个候选词，以及根据搜索查询语句生成若干个候选词包括：对搜索查询语句进行编码，得到编码数据；以抽取模式对编码数据进行解码，输出第一候选词表，以及以生成模式对编码数据进行解码，输出第二候选词表。In an embodiment of the present invention, in the above method, extracting several candidate words from the search query statement, and generating several candidate words according to the search query statement includes: encoding the search query statement to obtain encoded data; extracting The mode decodes the encoded data and outputs the first candidate vocabulary, and the generation mode decodes the encoded data and outputs the second candidate vocabulary.

这里可以使用查询改写模型来实现，具体地，该查询改写模型可以是参考sequence-to-sequence模型(seq2seq)实现的。seq2seq是一个“编码器-解码器”(Encoder-Decoder)结构的网络，它的输入是一个序列，输出也是一个序列，编码器将一个可变长度的信号序列变为固定长度的向量表达，解码器将这个固定长度的向量变成可变长度的目标的信号序列。Here, a query rewriting model can be used to implement, specifically, the query rewriting model can be implemented with reference to the sequence-to-sequence model (seq2seq). seq2seq is a network with an "Encoder-Decoder" structure. Its input is a sequence, and its output is also a sequence. The encoder converts a variable-length signal sequence into a fixed-length vector representation, decoding The processor turns this fixed-length vector into a variable-length target signal sequence.

seq2seq的模型结构中，输出每一个词的时候对于输入中词语的关注程度不一致，其权重根据特定规则计算得到。这样可以使得生成的序列更加合理，且能保留输入中的大部分信息，这也称之为注意力机制。一般在自然语言处理应用里把注意力模型看作是输出语句中某个单词和输入语句每个单词的对齐模型。In the model structure of seq2seq, when each word is output, the attention to the words in the input is inconsistent, and its weight is calculated according to specific rules. This can make the generated sequence more reasonable and retain most of the information in the input, which is also called the attention mechanism. Generally, in natural language processing applications, the attention model is regarded as the alignment model of a word in the output sentence and each word of the input sentence.

seq2seq模型下，当用户输入一个查询x＝{x₁,...,x_n}(x_i表示输入句子的第i个词语)后，模型的目标是将此查询转换为语义相似的关键词查询y＝{y₁,...,y_m}(y_i表示输出的第i个词语)。在该模型中，查询的每一个词依次被送入“编码器”中，然后“解码器”接收先前生成的词语{y₁,...,y_t-1}和一个上下文向量C来预测下一词y_t。公式如下：Under the seq2seq model, when the user inputs a query x={x ₁ ,...,x _n } (x _i represents the ith word of the input sentence), the goal of the model is to convert this query into semantically similar keywords Query y={y ₁ ,...,y _m } (y _i represents the ith word of the output). In this model, each word of the query is fed into the "encoder" in turn, and then the "decoder" receives the previously generated words {y ₁ ,...,y _t-1 } and a context vector C to predict The next word y _t . The formula is as follows:

其中p(y_t|{y₁,...,y_t-1},C)表示在已知先前生成的词语{y₁,...,y_t-1}和一个上下文向量C的前提下，y_t的条件概率。上下文向量C可以通过注意力机制计算得到，由一个score方法来衡量输入中每个词语的权重并计算其加权和。where p(y _t |{y ₁ ,...,y _t-1 },C) represents the premise that the previously generated words {y ₁ ,...,y _t-1 } and a context vector C are known below, the conditional probability of y _t . The context vector C can be calculated by the attention mechanism, and a score method is used to measure the weight of each word in the input and calculate its weighted sum.

当计算注意力权重时，本发明还使用了覆盖(coverage)机制来防止生成结果重复问题。下面介绍本发明模型的编码器和解码器的具体实现方式。图3示出了根据本发明一个实施例的查询改写模型的结构示意图。可见，该查询改写模型的解码器分为两个模式，通过一个调节因子来实现输出的词的分布的确定。图3中的示例是以查询语句为“Tell meiPhone X cost”，即告诉我iPhone X多少钱作为示例，其中还示出了两个生成的候选词“iPhone X”和“price(价格)”。When calculating attention weights, the present invention also uses a coverage mechanism to prevent the problem of duplication of generated results. The specific implementations of the encoder and decoder of the model of the present invention are described below. FIG. 3 shows a schematic structural diagram of a query rewriting model according to an embodiment of the present invention. It can be seen that the decoder of the query rewriting model is divided into two modes, and the distribution of the output words is determined by an adjustment factor. The example in Fig. 3 is that the query sentence is "Tell meiPhone X cost", that is, tell me how much the iPhone X is, and two generated candidate words "iPhone X" and "price (price)" are also shown.

参见图3，第一候选词表实际示出了输入语句中各个词的分布，也就是抽取权重(在图中标记为输入分布)；第二候选词表实际示出了生成的各个词的分布，也就是第二候选词表的分布概率(在图中标记为词表分布)。在本发明的实施例中，查询改写模型的解码器可以包括抽取模式和生成模式。Referring to Figure 3, the first candidate vocabulary actually shows the distribution of each word in the input sentence, that is, the extraction weight (marked as input distribution in the figure); the second candidate vocabulary actually shows the generated distribution of each word , that is, the distribution probability of the second candidate vocabulary (marked as vocabulary distribution in the figure). In an embodiment of the present invention, the decoder of the query rewriting model may include an extraction mode and a generation mode.

在本发明的一个实施例中，上述方法中，对搜索查询语句进行编码，得到编码数据包括：对搜索查询语句进行词嵌入处理，得到搜索查询语句中包含的各词对应的词向量；根据各词向量进行编码，得到输入隐向量。In an embodiment of the present invention, in the above method, encoding the search query statement to obtain encoded data includes: performing word embedding processing on the search query statement to obtain a word vector corresponding to each word contained in the search query statement; The word vector is encoded to obtain the input latent vector.

词嵌入(embedding)可以形成语句的词向量表示，即将语句中的词分别表示为一个向量。将各个词逐个送入编码器中可以产生一个隐向量h_s。隐向量作为输入语句的一个高级表示，在解码阶段用于新的序列的生成。在本发明的一个实施例中，上述方法中，根据各词向量进行编码，得到隐向量包括：基于一层双向长短期记忆网络LSTM进行编码。LSTM(Long Short-Term Memory，长短期记忆网络)是一种时间递归神经网络，适合于处理和预测时间序列中间隔和延迟相对较长的重要事件。Word embedding can form the word vector representation of the sentence, that is, the words in the sentence are represented as a vector. Feeding each word into the encoder one by one can generate a hidden vector h _s . The latent vector serves as a high-level representation of the input sentence and is used in the decoding stage to generate new sequences. In an embodiment of the present invention, in the above method, encoding according to each word vector to obtain a latent vector includes: encoding based on a layer of bidirectional long short-term memory network LSTM. LSTM (Long Short-Term Memory) is a temporal recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in time series.

在本发明的一个实施例中，上述方法中，以抽取模式对编码数据进行解码，输出第一候选词表包括：根据输入隐向量计算注意力权重a^t；基于公式(1)和(2)计算搜索查询语句中各词的抽取权重：In an embodiment of the present invention, in the above method, decoding the encoded data in the extraction mode, and outputting the first candidate vocabulary includes: calculating the attention weight at according to the input ^latent vector; based on formulas (1) and (2) Calculate the extraction weight of each word in the search query:

其中，P_extract(w)为目标词w的抽取权重，p_w为调节因子，f_w为目标词w在搜索查询语句中出现的次数，N是语料中所有查询的次数，|w|是语料中包含目标词w的查询个数，t为t时刻；第一候选词表包括一个或多个词及其对应的抽取权重。Among them, P _extract (w) is the extraction weight of the target word w, p _w is the adjustment factor, f _w is the number of times the target word w appears in the search query sentence, N is the number of all queries in the corpus, and |w| is the corpus contains the number of queries of the target word w, and t is time t; the first candidate word list includes one or more words and their corresponding extraction weights.

TF-IDF是两个统计量的乘积，词频率tf(w)和逆文件频率idf(w)。TF-IDF高是由词频高且该词在整个语料中出现频率低共同决定的，因此该方法可以用于排除常用术语。对于自然语言查询，这种方法可以有效地去除一些常见的口语描述，如“如何”、“什么”，并保留重要信息。TF-IDF is the product of two statistics, word frequency tf(w) and inverse document frequency idf(w). The high TF-IDF is jointly determined by the high frequency of the word and the low frequency of the word in the whole corpus, so this method can be used to exclude common terms. For natural language queries, this method can effectively remove some common colloquial descriptions, such as "how", "what", and retain important information.

TF-IDF值和注意力权重在衡量单词重要性时有不同的侧重点。注意力权重关注输入和输出的语义匹配，使用隐藏状态来计算其相似度值。通过这种方式，它关注的是单词的“含义”。TF-IDF关注单词的统计特征，它统计了整个语料库中该单词的重要性，这两种值从不同的角度描述了输入词的重要性。通过将它们与权重因子相结合，可以从输入中提取更佳的关键词。TF-IDF values and attention weights have different emphases in measuring word importance. Attention weights focus on the semantic matching of input and output, using the hidden state to calculate their similarity value. In this way, it focuses on the "meaning" of the word. TF-IDF focuses on the statistical features of words, it counts the importance of the word in the entire corpus, and these two values describe the importance of the input word from different perspectives. By combining them with weighting factors, better keywords can be extracted from the input.

在本发明的一个实施例中，上述方法中，以生成模式对编码数据进行解码，输出第二候选词表包括：根据输入隐向量计算注意力权重a^t；根据注意力权重a^t和输入隐向量计算上下文权重C^t；根据注意力权重a^t、上下文权重C^t和当前时刻目标隐向量h_t计算第二候选词表的分布概率P_vocab。In an embodiment of the present invention, in the above method, decoding the encoded data in the generation mode, and outputting the second candidate vocabulary includes: calculating the attention weight at according to the input ^latent vector; according to the attention weight ^at and the input hidden vector The vector calculates the context weight C ^t ; calculates the distribution probability P _vocab of the second candidate vocabulary according to the attention weight a ^t , the context weight C ^t and the target latent vector h _t at the current moment.

具体地，在本发明的一个实施例中，上述方法中，根据输入隐向量计算注意力权重a^t包括：基于公式(3)和(4)计算注意力权重a^t：Specifically, in an embodiment of the present invention, in the above method, calculating the attention weight at according to the input ^latent vector includes: calculating the attention weight at based on ^formulas (3) and (4):

a^t＝softmax(e^t) (4)；a ^t =softmax(e ^t ) (4);

其中，函数score用来比较目标隐向量h_t和输入隐向量

的相似程度，cov_i ^t为t时刻的覆盖向量，v、W₁、W₂、W_c和b_atten为预置参数；

为输入隐向量，h_t为输出隐向量。其中，cov⁰是一个全零矩阵。还需要说明的是Softmax函数的意义是将K维实数向量z映射到一个新的K维实数向量σ(z)，使得向量的每一个元素值都在0-1之间，且所有元素和为1。Among them, the function score is used to compare the target hidden vector h _t with the input hidden vector

The similarity degree of , cov _i ^t is the coverage vector at time t, v, W ₁ , W ₂ , W _c and _batten are preset parameters;

is the input hidden vector, and h _t is the output hidden vector. where cov ⁰ is an all-zero matrix. It should also be noted that the meaning of the Softmax function is to map the K-dimensional real vector z to a new K-dimensional real vector σ(z), so that the value of each element of the vector is between 0-1, and the sum of all elements is 1.

在本发明的一个实施例中，上述方法中，根据注意力权重a^t和输入隐向量计算上下文权重C^t包括：基于公式(5)和(6)计算上下文权重C^t：In an embodiment of the present invention, in the above method, calculating the context weight C ^t according to the attention weight at and the input latent vector includes: calculating the context weight C ^t based on formulas (5) and (6 ⁾ :

其中，cov^t为t时刻的覆盖矩阵。即在t时刻，维护一个coverage矩阵cov^t来记录输入中词语的覆盖程度。它是之前所有时刻的注意力分布的和，上下文向量C通过注意力权重a^t对输入隐向量进行加权求和得到。where cov ^t is the coverage matrix at time t. That is, at time t, a coverage matrix cov ^t is maintained to record the coverage of the words in the input. It is the sum of the attention distributions of all previous moments, and the context vector C is obtained by the weighted summation of the input latent vector through the attention weight at ^t .

在本发明的一个实施例中，上述方法中，根据注意力权重a^t、上下文权重C^t和当前时刻目标隐向量h_t计算第二候选词表的分布概率P_vocab包括：基于公式(7)计算P_vocab：In an embodiment of the present invention, in the above method, calculating the distribution probability P _vocab of the second candidate vocabulary according to the attention weight a ^t , the context weight C ^t and the target latent vector h _t at the current moment includes: based on formula (7) Calculate P _vocab :

即获得上下文向量C后，将其与当前时刻目标隐向量h_t结合通过两层全连接层得到词表的分布概率P_vocab。That is, after obtaining the context vector C, combine it with the target latent vector h _t at the current moment to obtain the distribution probability P _vocab of the vocabulary through two fully connected layers.

在本发明的一个实施例中，上述方法中，以抽取模式对编码数据进行解码，输出第一候选词表，以及以生成模式对编码数据进行解码，输出第二候选词表包括：基于一层单向LSTM实现解码。In an embodiment of the present invention, in the above method, decoding the encoded data in the extraction mode, outputting the first candidate vocabulary, and decoding the encoded data in the generation mode, and outputting the second candidate vocabulary includes: based on a layer of One-way LSTM implements decoding.

总结地说，解码器接收输入的词向量表示和解码器的隐向量h_t，并通过注意力机制计算词表中每个词语的概率，选取概率最高的词语作为输出，这种方式对应于生成模式；通过注意力矩阵和抽取式的方法计算输入句子中每个词的权重，选取权重大的词语作为输出，这种方式对应于抽取模式。To sum up, the decoder receives the input word vector representation and the decoder's hidden vector h _t , and calculates the probability of each word in the vocabulary through the attention mechanism, and selects the word with the highest probability as the output. This method corresponds to generating Mode; the weight of each word in the input sentence is calculated by the attention matrix and the extraction method, and the word with the most weight is selected as the output, which corresponds to the extraction mode.

在本发明的一个实施例中，上述方法中，根据抽取的候选词和生成的候选词，进一步生成改写语句包括：根据第一候选词表的抽取权重P_extract和第二候选词表的分布概率P_vocab以及调节因子p_gen确定第三候选词表，根据第三候选词表生成改写语句。In an embodiment of the present invention, in the above method, further generating a rewritten sentence according to the extracted candidate words and the generated candidate words includes: according to the extraction weight P _extract of the first candidate vocabulary and the distribution probability of the second candidate vocabulary P _vocab and the adjustment factor p _gen determine a third candidate vocabulary, and generate a rewritten sentence according to the third candidate vocabulary.

在本发明的一个实施例中，上述方法中，根据第一候选词表的抽取权重P_extract和第二候选词表的分布概率P_vocab以及调节因子p_gen确定第三候选词表包括：基于公式(8)计算调节因子p_gen：In an embodiment of the present invention, in the above method, determining the third candidate vocabulary according to the extraction weight P _extract of the first candidate vocabulary, the distribution probability P _vocab of the second candidate vocabulary, and the adjustment factor p _gen includes: based on the formula (8) Calculate the adjustment factor p _gen :

这样就得到了图3中示出的最终分布。This results in the final distribution shown in Figure 3.

下面对本发明实施例中使用的查询改写模型的训练过程进行简单的介绍。The following briefly introduces the training process of the query rewriting model used in the embodiment of the present invention.

查询改写模型的构建方法可以包括如下步骤：The construction method of the query rewrite model can include the following steps:

根据搜索点击数据生成训练数据。这里，可以将商业搜索引擎中的搜索点击数据作为数据源，并优选地选择高质量用户的搜索记录作为初始训练语料。对初始训练语料进行清洗等处理后可以得到训练数据。Generate training data based on search click data. Here, the search click data in the commercial search engine can be used as the data source, and the search records of high-quality users are preferably selected as the initial training corpus. The training data can be obtained after cleaning the initial training corpus.

根据训练数据对目标模型进行训练，得到中间模型；目标模型包括编码器和解码器，解码器包括抽取模式和生成模式。抽取模式和生成模式在前面已有介绍，具体的实现过程可以参照前述实施例。The target model is trained according to the training data to obtain an intermediate model; the target model includes an encoder and a decoder, and the decoder includes an extraction mode and a generation mode. The extraction mode and the generation mode have been introduced above, and the specific implementation process may refer to the foregoing embodiments.

判断中间模型是否满足预设条件，是则将中间模型作为查询改写模型并停止训练，否则将中间模型作为目标模型进行迭代训练。It is judged whether the intermediate model satisfies the preset conditions. If yes, the intermediate model is used as the query to rewrite the model and the training is stopped. Otherwise, the intermediate model is used as the target model for iterative training.

在本发明的一个实施例中，上述方法中，根据搜索点击数据生成训练数据包括：从搜索点击数据中提取出若干组句对；句对包括搜索查询语句和对应点击搜索结果的标题语句。In an embodiment of the present invention, in the above method, generating training data according to the search click data includes: extracting several groups of sentence pairs from the search click data; the sentence pairs include search query sentences and headline sentences corresponding to the click search results.

句对(query-title)描述了用户输入的搜索查询语句，即表现出来的需求，以及描述了其实际点击的内容，即实际需求。以图1为例，当输入的搜索查询语句为“iPhone X价格”为例，当用户实际点击了第一项搜索结果时，则句对为“iPhone X价格-苹果iPhoneX全网通报价参数图片论坛中关村在线”。The sentence pair (query-title) describes the search query sentence input by the user, that is, the displayed demand, and describes the content of the actual click, that is, the actual demand. Taking Figure 1 as an example, when the input search query is "iPhone X price" as an example, when the user actually clicks the first search result, the sentence pair is "iPhone X price-Apple iPhoneX full Netcom quotation parameter picture forum Zhongguancun Online".

实际上这里的搜索查询语句和标题语句并不一定是读起来通顺的完整语句，可能仅包含若干个孤立的词，在本发明中为便于介绍，统一描述为“语句”。In fact, the search query sentence and the title sentence here are not necessarily complete sentences that read fluently, but may only contain several isolated words, which are collectively described as "sentences" in the present invention for the convenience of introduction.

在本发明的一个实施例中，上述方法还包括：从提取出的句对中去除噪声。这些噪声主要是因为用户的误操作或者恰巧对某个页面感兴趣造成的。具体来说，一般表现在句对中的语句在语义上不相符，这些噪声会严重影响模型的训练过程。In an embodiment of the present invention, the above method further includes: removing noise from the extracted sentence pairs. These noises are mainly caused by the user's misoperation or by coincidentally being interested in a certain page. Specifically, sentences that generally appear in sentence pairs do not match semantically, and these noises can seriously affect the training process of the model.

在本发明的一个实施例中，上述方法中，从提取出的句对中去除噪声包括：计算各句对中搜索查询语句和标题语句的主题相似度，和/或，计算各句对中搜索查询语句和标题语句的语义相似度；根据预设的相似度阈值去除噪声。在具体实施中可以参照下述的两个实施例，但应当理解不限于下面示出的这量种相似度计算方式：在本发明的一个实施例中，上述方法中，计算各句对中搜索查询语句和标题语句的主题相似度包括：对搜索查询语句和标题语句进行语义表示，训练潜在狄利克雷分布LDA主题模型并计算搜索查询语句的主题分布和标题语句的主题分布，基于JS散度计算各句对中搜索查询语句和标题语句的主题分布的分布相似度。在本发明的一个实施例中，上述方法中，计算各句对中搜索查询语句和标题语句的语义相似度包括：确定搜索查询语句和标题语句中词语的词向量，将搜索查询语句和标题语句分别表示为词语词向量的均值，基于余弦相似度计算各句对中搜索查询语句和标题语句的相似度。In an embodiment of the present invention, in the above method, removing noise from the extracted sentence pairs includes: calculating the subject similarity between the search query sentence and the title sentence in each sentence pair, and/or calculating the search query in each sentence pair Semantic similarity between query sentences and title sentences; remove noise according to a preset similarity threshold. In the specific implementation, reference may be made to the following two embodiments, but it should be understood that it is not limited to the following calculation methods of similarity: The topic similarity of query sentences and headline sentences includes: semantic representation of search query sentences and headline sentences, training a latent Dirichlet distribution LDA topic model and calculating the topic distribution of search query sentences and the topic distribution of headline sentences, based on JS divergence The distribution similarity of the topic distribution of the search query sentence and the title sentence in each sentence pair is calculated. In an embodiment of the present invention, in the above method, calculating the semantic similarity between the search query statement and the title sentence in each sentence pair includes: determining the word vector of the words in the search query statement and the title sentence, and combining the search query sentence and the title sentence They are respectively expressed as the mean of the word vectors, and the similarity between the search query sentence and the title sentence in each sentence pair is calculated based on the cosine similarity.

其中，主题相似度从语句的主题分布入手，计算分布之间的相似度。首先对语句进行语义表示，训练LDA模型并计算一个语句的主题分布。然后利用JS(Jensen-Shannon)散度计算两个分布之间的相似性。语义相似度从语句中词语的词向量入手，将一个语句表示成语句中词语的词向量的均值，然后利用余弦相似度计算两个语句的相似度。通过设定合理的阈值，达到去除噪声的目的。LDA(Latent Dirichlet Allocation)是一种文档主题生成模型，也称为一个三层贝叶斯概率模型，包含词、主题和文档三层结构。所谓生成模型，就是说，我们认为一篇文章的每个词都是通过“以一定概率选择了某个主题，并从这个主题中以一定概率选择某个词语”这样一个过程得到。文档到主题服从多项式分布，主题到词服从多项式分布。Among them, topic similarity starts from the topic distribution of sentences, and calculates the similarity between distributions. The sentence is first semantically represented, the LDA model is trained and the topic distribution of a sentence is calculated. The similarity between the two distributions is then calculated using JS (Jensen-Shannon) divergence. Semantic similarity starts with the word vector of the words in the sentence, expresses a sentence as the mean of the word vectors of the words in the sentence, and then uses the cosine similarity to calculate the similarity of the two sentences. By setting a reasonable threshold, the purpose of removing noise is achieved. LDA (Latent Dirichlet Allocation) is a document topic generation model, also known as a three-layer Bayesian probability model, which includes a three-layer structure of words, topics and documents. The so-called generative model means that we think that each word of an article is obtained through a process of "selecting a topic with a certain probability, and selecting a word from this topic with a certain probability". Documents to topics follow a multinomial distribution, and topics to words follow a multinomial distribution.

在本发明的一个实施例中，上述方法中，根据搜索点击数据生成训练数据进一步包括：对句对中的搜索查询语句和标题语句分别进行分词；从分词结果中划分出第一比例的数据作为验证集，划分出第二比例的数据作为训练数据集；基于训练数据集生成训练词汇表。In an embodiment of the present invention, in the above method, generating training data according to the search click data further includes: performing word segmentation on the search query sentence and the title sentence in the sentence pair respectively; dividing the first proportion of data from the word segmentation result as The validation set is divided into a second proportion of data as a training data set; a training vocabulary is generated based on the training data set.

例如，使用jieba分词工具对语句进行分词，将语句按词分隔。在一个具体的场景下，划分20％的数据作为验证集，剩余80％数据作为训练数据集，根据训练数据集制作训练词汇表。这样就做好了训练数据的准备。For example, use the jieba word segmentation tool to tokenize sentences and separate sentences by words. In a specific scenario, 20% of the data is divided as the validation set, the remaining 80% of the data is used as the training data set, and a training vocabulary is made according to the training data set. This prepares the training data.

在本发明的一个实施例中，上述方法中，根据训练数据对目标模型进行训练，得到中间模型包括：将训练数据集中的数据划分为多组训练样本数据；取一组训练样本数据，根据训练词汇表对该组训练样本数据进行编号，将选择其中的标题语句作为训练输入数据，选择相应的搜索查询语句作为训练输出数据。In an embodiment of the present invention, in the above method, training the target model according to the training data, and obtaining the intermediate model includes: dividing the data in the training data set into multiple sets of training sample data; The vocabulary lists the number of the training sample data, selects the title sentence as the training input data, and selects the corresponding search query sentence as the training output data.

例如，将训练数据进行随机打乱，并平均分成S组，设置s＝0(各组对应的序号为0，1，2……s-1)；取第s份训练样本数据，按照构造的训练词汇表对选取的训练样本数据中，各语句的词进行编号，送入目标模型中进行训练。如果训练完毕后，得到的中间模型满足预设条件，则结束训练；如果不满足预设条件，则令s＝s+1，重复训练至得到的中间模型满足预设条件为止。For example, the training data is randomly scrambled, and divided into S groups on average, set s=0 (the corresponding serial numbers of each group are 0, 1, 2...s-1); take the s-th training sample data, according to the constructed The training vocabulary lists the words of each sentence in the selected training sample data, and sends them to the target model for training. If the obtained intermediate model satisfies the preset condition after the training is completed, the training is ended; if the preset condition is not met, s=s+1 is set, and the training is repeated until the obtained intermediate model satisfies the preset condition.

在本发明的一个实施例中，上述方法中，判断中间模型是否满足预设条件包括：在训练过程中，根据公式(10)计算t时刻的损失loss_t：In an embodiment of the present invention, in the above method, judging whether the intermediate model satisfies the preset condition includes: in the training process, calculating the loss loss _t at time t according to formula (10):

其中，

为目标词语，

为注意力权重，

为覆盖向量；in,

is the target word,

is the attention weight,

is the coverage vector;

根据公式(11)计算整个语句的损失loss：Calculate the loss of the entire sentence according to formula (11):

根据中间模型计算验证集的损失，若损失增大则满足预设条件。The loss of the validation set is calculated according to the intermediate model, and if the loss increases, the preset conditions are met.

上述实施例中，各预置参数、两步线性变换参数矩阵和偏置向量可以是根据模型训练得到的。In the above embodiment, each preset parameter, the two-step linear transformation parameter matrix and the bias vector may be obtained according to model training.

图4示出了根据本发明一个实施例的一种搜索的实现装置的结构示意图。如图4所示，搜索的实现装置400包括：Fig. 4 shows a schematic structural diagram of an apparatus for implementing a search according to an embodiment of the present invention. As shown in FIG. 4 , the apparatus 400 for realizing the search includes:

获取单元410，适于获取搜索查询语句。The obtaining unit 410 is adapted to obtain a search query statement.

改写单元420，适于从搜索查询语句中抽取出若干个候选词，以及根据搜索查询语句生成若干个候选词；根据抽取的候选词和生成的候选词，进一步生成改写语句。The rewriting unit 420 is adapted to extract several candidate words from the search query sentence, and generate several candidate words according to the search query sentence; further generate a rewriting sentence according to the extracted candidate words and the generated candidate words.

搜索单元430，适于根据改写语句进行搜索查询，得到搜索结果。The search unit 430 is adapted to perform a search query according to the rewritten sentence to obtain a search result.

可见，图4所示的装置，在获取到搜索查询语句后，从搜索查询语句中抽取出若干个候选词，以及根据搜索查询语句生成若干个候选词，根据抽取的候选词和生成的候选词，进一步生成改写语句，根据改写语句进行搜索查询，得到搜索结果。该技术方案结合了抽取和生成两种方式生成候选词，并进一步生成改写语句，实现了根据语义和场景对用户输入的搜索查询语句进行合理的改写，进一步使得搜索返回的结果贴近用户的实际需求。It can be seen that the device shown in FIG. 4, after acquiring the search query statement, extracts several candidate words from the search query statement, and generates several candidate words according to the search query statement, and generates several candidate words according to the extracted candidate words and the generated candidate words. , further generate a rewritten statement, perform a search query according to the rewritten statement, and obtain a search result. The technical solution combines two methods of extraction and generation to generate candidate words, and further generates rewriting sentences, which realizes reasonable rewriting of search query sentences input by users according to semantics and scenarios, and further makes the results returned by the search closer to the actual needs of users. .

在本发明的一个实施例中，上述装置中，搜索查询语句是根据用户输入的语音生成的。In an embodiment of the present invention, in the above device, the search query sentence is generated according to the voice input by the user.

在本发明的一个实施例中，上述装置中，改写单元420，适于对搜索查询语句进行编码，得到编码数据；以抽取模式对编码数据进行解码，输出第一候选词表，以及以生成模式对编码数据进行解码，输出第二候选词表。In an embodiment of the present invention, in the above-mentioned apparatus, the rewriting unit 420 is adapted to encode the search query sentence to obtain encoded data; decode the encoded data in the extraction mode, output the first candidate vocabulary, and use the generation mode to decode the encoded data The encoded data is decoded, and the second candidate vocabulary is output.

在本发明的一个实施例中，上述装置中，改写单元420，适于对搜索查询语句进行词嵌入处理，得到搜索查询语句中包含的各词对应的词向量；根据各词向量进行编码，得到输入隐向量。In an embodiment of the present invention, in the above device, the rewriting unit 420 is adapted to perform word embedding processing on the search query sentence to obtain word vectors corresponding to each word contained in the search query sentence; encoding according to each word vector, obtaining Input latent vector.

在本发明的一个实施例中，上述装置中，改写单元420，适于基于一层双向长短期记忆网络LSTM进行编码。In an embodiment of the present invention, in the above apparatus, the rewriting unit 420 is adapted to perform coding based on a one-layer bidirectional long short-term memory network LSTM.

在本发明的一个实施例中，上述装置中，改写单元420，适于根据输入隐向量计算注意力权重a^t；基于公式(1)和(2)计算搜索查询语句中各词的抽取权重：In an embodiment of the present invention, in the above device, the rewriting unit 420 is adapted to calculate the attention weight at according to the input ^latent vector; calculate the extraction weight of each word in the search query sentence based on formulas (1) and (2):

在本发明的一个实施例中，上述装置中，改写单元420，适于根据输入隐向量计算注意力权重a^t；根据注意力权重a^t和输入隐向量计算上下文权重C^t；根据注意力权重a^t、上下文权重C^t和当前时刻目标隐向量h_t计算第二候选词表的分布概率P_vocab。In an embodiment of the present invention, in the above device, the rewriting unit 420 ^is adapted to calculate the attention weight at according to the input latent vector; calculate the context weight C ^t according to the attention weight ^at and the input latent vector; according to the attention weight at ^t , the context weight C ^t and the target latent vector h _t at the current moment to calculate the distribution probability P _vocab of the second candidate vocabulary.

在本发明的一个实施例中，上述装置中，改写单元420，适于基于公式(3)和(4)计算注意力权重a^t：In an embodiment of the present invention, in the above apparatus, the rewriting unit 420 is adapted to calculate the attention weight at based on ^formulas (3) and (4):

a^t＝softmax(e^t)(4)；a ^t =softmax( ^et )(4);

其中，函数score用来比较目标隐向量h_t和输入隐向量

is the input hidden vector, and h _t is the output hidden vector.

在本发明的一个实施例中，上述装置中，改写单元420，适于基于公式(5)和(6)计算上下文权重C^t：In an embodiment of the present invention, in the above apparatus, the rewriting unit 420 is adapted to calculate the context weight C ^t based on formulas (5) and (6):

在本发明的一个实施例中，上述装置中，改写单元420，适于基于公式(7)计算P_vocab：In an embodiment of the present invention, in the above device, the rewriting unit 420 is adapted to calculate P _vocab based on formula (7):

在本发明的一个实施例中，上述装置中，改写单元420，适于基于一层单向LSTM实现解码。In an embodiment of the present invention, in the above apparatus, the rewriting unit 420 is adapted to implement decoding based on a layer of unidirectional LSTM.

在本发明的一个实施例中，上述装置中，改写单元420，适于根据第一候选词表的抽取权重P_extract和第二候选词表的分布概率P_vocab以及调节因子p_gen确定第三候选词表，根据第三候选词表生成改写语句。In an embodiment of the present invention, in the above device, the rewriting unit 420 is adapted to determine the third candidate according to the extraction weight P _extract of the first candidate vocabulary, the distribution probability P _vocab of the second candidate vocabulary, and the adjustment factor p _gen The vocabulary list is used to generate a rephrased sentence according to the third candidate vocabulary list.

在本发明的一个实施例中，上述装置中，改写单元420，适于基于公式(8)计算调节因子p_gen：In an embodiment of the present invention, in the above apparatus, the rewriting unit 420 is adapted to calculate the adjustment factor p _gen based on formula (8):

需要说明的是，上述各装置实施例的具体实施方式可以参照前述对应方法实施例的具体实施方式进行，在此不再赘述。It should be noted that, the specific implementations of the foregoing apparatus embodiments may be performed with reference to the specific implementations of the foregoing corresponding method embodiments, and details are not described herein again.

综上所述，本发明的技术方案，在获取到搜索查询语句后，从搜索查询语句中抽取出若干个候选词，以及根据搜索查询语句生成若干个候选词，根据抽取的候选词和生成的候选词，进一步生成改写语句，根据改写语句进行搜索查询，得到搜索结果。该技术方案结合了抽取和生成两种方式生成候选词，并进一步生成改写语句，实现了根据语义和场景对用户输入的搜索查询语句进行合理的改写，进一步使得搜索返回的结果贴近用户的实际需求。To sum up, the technical solution of the present invention, after obtaining the search query sentence, extracts several candidate words from the search query sentence, and generates several candidate words according to the search query sentence, according to the extracted candidate words and the generated candidate words, further generate a rewritten sentence, perform a search query according to the rewritten sentence, and obtain a search result. The technical solution combines two methods of extraction and generation to generate candidate words, and further generates rewriting sentences, which realizes reasonable rewriting of search query sentences input by users according to semantics and scenarios, and further makes the results returned by the search closer to the actual needs of users. .

需要说明的是：It should be noted:

在此提供的算法和显示不与任何特定计算机、虚拟装置或者其它设备固有相关。各种通用装置也可以与基于在此的示教一起使用。根据上面的描述，构造这类装置所要求的结构是显而易见的。此外，本发明也不针对任何特定编程语言。应当明白，可以利用各种编程语言实现在此描述的本发明的内容，并且上面对特定语言所做的描述是为了披露本发明的最佳实施方式。The algorithms and displays provided herein are not inherently related to any particular computer, virtual appliance, or other device. Various general-purpose devices can also be used with the teachings based on this. The structure required to construct such a device is apparent from the above description. Furthermore, the present invention is not directed to any particular programming language. It is to be understood that various programming languages may be used to implement the inventions described herein, and that the descriptions of specific languages above are intended to disclose the best mode for carrying out the invention.

在此处所提供的说明书中，说明了大量具体细节。然而，能够理解，本发明的实施例可以在没有这些具体细节的情况下实践。在一些实例中，并未详细示出公知的方法、结构和技术，以便不模糊对本说明书的理解。In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

类似地，应当理解，为了精简本公开并帮助理解各个发明方面中的一个或多个，在上面对本发明的示例性实施例的描述中，本发明的各个特征有时被一起分组到单个实施例、图、或者对其的描述中。然而，并不应将该公开的方法解释成反映如下意图：即所要求保护的本发明要求比在每个权利要求中所明确记载的特征更多的特征。更确切地说，如下面的权利要求书所反映的那样，发明方面在于少于前面公开的单个实施例的所有特征。因此，遵循具体实施方式的权利要求书由此明确地并入该具体实施方式，其中每个权利要求本身都作为本发明的单独实施例。Similarly, it is to be understood that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together into a single embodiment, figure, or its description. This disclosure, however, should not be construed as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.

本领域那些技术人员可以理解，可以对实施例中的设备中的模块进行自适应性地改变并且把它们设置在与该实施例不同的一个或多个设备中。可以把实施例中的模块或单元或组件组合成一个模块或单元或组件，以及此外可以把它们分成多个子模块或子单元或子组件。除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外，可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述，本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。Those skilled in the art will understand that the modules in the device in the embodiment can be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and further they may be divided into multiple sub-modules or sub-units or sub-assemblies. All features disclosed in this specification (including accompanying claims, abstract and drawings) and any method so disclosed may be employed in any combination, unless at least some of such features and/or procedures or elements are mutually exclusive. All processes or units of equipment are combined. Each feature disclosed in this specification (including accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

此外，本领域的技术人员能够理解，尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征，但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施例。例如，在下面的权利要求书中，所要求保护的实施例的任意之一都可以以任意的组合方式来使用。Furthermore, those skilled in the art will appreciate that although some of the embodiments described herein include certain features, but not others, included in other embodiments, that combinations of features of different embodiments are intended to be within the scope of the invention within and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

本发明的各个部件实施例可以以硬件实现，或者以在一个或者多个处理器上运行的软件模块实现，或者以它们的组合实现。本领域的技术人员应当理解，可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本发明实施例的搜索的实现装置中的一些或者全部部件的一些或者全部功能。本发明还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序(例如，计算机程序和计算机程序产品)。这样的实现本发明的程序可以存储在计算机可读介质上，或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到，或者在载体信号上提供，或者以任何其他形式提供。Various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the apparatus for implementing the search according to the embodiment of the present invention. The present invention can also be implemented as apparatus or apparatus programs (eg, computer programs and computer program products) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from Internet sites, or provided on carrier signals, or in any other form.

例如，图5示出了根据本发明一个实施例的电子设备的结构示意图。该电子设备包括处理器510和被安排成存储计算机可执行指令(计算机可读程序代码)的存储器520。存储器520可以是诸如闪存、EEPROM(电可擦除可编程只读存储器)、EPROM、硬盘或者ROM之类的电子存储器。存储器520具有存储用于执行上述方法中的任何方法步骤的计算机可读程序代码531的存储空间530。例如，用于存储计算机可读程序代码的存储空间530可以包括分别用于实现上面的方法中的各种步骤的各个计算机可读程序代码531。计算机可读程序代码531可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。这些计算机程序产品包括诸如硬盘，紧致盘(CD)、存储卡或者软盘之类的程序代码载体。这样的计算机程序产品通常为例如图6所述的计算机可读存储介质。图6示出了根据本发明一个实施例的一种计算机可读存储介质的结构示意图。该计算机可读存储介质600存储有用于执行根据本发明的方法步骤的计算机可读程序代码531，可以被电子设备500的处理器510读取，当计算机可读程序代码531由电子设备500运行时，导致该电子设备500执行上面所描述的方法中的各个步骤，具体来说，该计算机可读存储介质存储的计算机可读程序代码531可以执行上述任一实施例中示出的方法。计算机可读程序代码531可以以适当形式进行压缩。For example, FIG. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device includes a processor 510 and a memory 520 arranged to store computer-executable instructions (computer-readable program code). The memory 520 may be electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM. The memory 520 has storage space 530 for storing computer readable program code 531 for performing any of the method steps in the above-described methods. For example, the storage space 530 for storing computer-readable program code may include various computer-readable program codes 531 for implementing various steps in the above methods, respectively. Computer readable program code 531 can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a computer-readable storage medium as described in FIG. 6 . FIG. 6 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention. The computer-readable storage medium 600 stores computer-readable program code 531 for performing the method steps according to the present invention, which can be read by the processor 510 of the electronic device 500 when the computer-readable program code 531 is executed by the electronic device 500 , causing the electronic device 500 to execute each step in the above-described method. Specifically, the computer-readable program code 531 stored in the computer-readable storage medium can execute the method shown in any of the above-described embodiments. The computer readable program code 531 may be compressed in a suitable form.

应该注意的是上述实施例对本发明进行说明而不是对本发明进行限制，并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中，不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本发明可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中，这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。It should be noted that the above-described embodiments illustrate rather than limit the invention, and that alternative embodiments may be devised by those skilled in the art without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third, etc. do not denote any order. These words can be interpreted as names.

Claims

1. A method for realizing search, comprising:

Get the search query;

Extracting several candidate words from the search query statement, and generating several candidate words according to the search query statement;

According to the extracted candidate words and the generated candidate words, further generate a rewritten sentence;

A search query is performed according to the rewritten statement, and a search result is obtained.

2. The method of claim 1, wherein the search query statement is generated based on a user's input speech.

3. The method according to claim 1 or 2, wherein the extracting several candidate words from the search query statement, and generating several candidate words according to the search query statement comprises:

Encoding the search query statement to obtain encoded data;

The encoded data is decoded in extraction mode to output a first candidate vocabulary, and the encoded data is decoded in generation mode to output a second candidate vocabulary.

4. The method according to any one of claims 1-3, wherein the encoding the search query statement to obtain encoded data comprises:

performing word embedding processing on the search query statement to obtain word vectors corresponding to each word contained in the search query statement;

According to the encoding of each word vector, the input latent vector is obtained.

5. A device for realizing search, comprising:

an obtaining unit, adapted to obtain a search query statement;

A rewriting unit, adapted to extract several candidate words from the search query statement, and generate several candidate words according to the search query statement; further generate a rewriting statement according to the extracted candidate words and the generated candidate words;

The search unit is adapted to perform a search query according to the rewritten sentence to obtain a search result.

6. The apparatus of claim 5, wherein the search query sentence is generated according to a voice input by a user.

7. The apparatus of claim 5 or 6, wherein,

The rewriting unit is adapted to encode the search query statement to obtain encoded data; decode the encoded data in an extraction mode, output a first candidate vocabulary, and decode the encoded data in a generation mode, Output the second candidate vocabulary.

8. The device of any one of claims 5-7, wherein,

The rewriting unit is adapted to perform word embedding processing on the search query sentence to obtain a word vector corresponding to each word contained in the search query sentence; encode according to each word vector to obtain an input latent vector.

9. An electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform as claimed in claims 1-4 The method of any of the above.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs that, when executed by a processor, implement any one of claims 1-4 method described in item.