
CN115511073A - A training method for a semantic matching model and a text matching method - Google Patents

A training method for a semantic matching model and a text matching method

Info

Publication number
CN115511073A
CN115511073A CN202210991280.1A CN202210991280A
Authority
CN
China
Prior art keywords
text
training
sample
matched
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210991280.1A
Other languages
Chinese (zh)
Inventor
程学旗
郭嘉丰
范意兴
张儒清
刘嘉铭
樊润泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210991280.1A
Publication of CN115511073A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a training method for a semantic matching model, comprising: A1, obtaining a training set comprising a plurality of samples, each sample containing a predetermined text and two texts to be matched corresponding to that predetermined text, wherein each sample is associated with a weak label and a weight, the weak label indicates which of the two texts to be matched in the sample is more relevant to the predetermined text, the value of the initial weight is related to a difficulty index indicating the difficulty of the sample, and more difficult samples are given smaller initial weights; and A2, iteratively training the semantic matching model with the samples in the training set so that it outputs a relevance score for each text pair formed by the predetermined text and one of the texts to be matched, and determining a weighted loss value from the relevance scores, the weak label and the weight to update the semantic matching model, wherein the weight of each sample is dynamically adjusted according to the number of training rounds completed so far.

Description

A training method for a semantic matching model and a text matching method

Technical Field

The present invention relates to the field of neural networks, and in particular to the field of text matching. More specifically, it relates to a training method for a semantic matching model and to a text matching method.

Background

In information retrieval, a user typically enters a query text and the system returns relevant web-page texts (also called documents in some literature) as the query result. When retrieving results for the query text entered by the user, the system usually goes through two matching stages. The first stage is the recall stage, which retrieves a small number of candidate texts from a massive text database to form a candidate text set (for example, the cosine distance between the semantic vector of the query text and the semantic vectors of the texts in a database of millions or tens of millions of web pages is computed, and a few dozen, a few hundred or a few thousand candidate texts are selected in order of increasing cosine distance). The second stage is the re-ranking stage (also called the matching stage in some literature), which further computes the similarity between the query text and each candidate text and then outputs the corresponding candidate texts, ranked by similarity, as the query result.

The re-ranking stage generally models the matching score of the query text and a candidate text by training a neural network model. However, training such a neural network model (hereinafter called the semantic matching model) on manually annotated data is very expensive. Researchers have therefore proposed training the semantic matching model with weakly supervised labels: a traditional BM25 (Best Matching 25) model or a query likelihood model is used as a weak annotator, the relevance scores it outputs for pairs of texts are used to construct weak labels for the samples, and the semantic matching model is trained on the weakly labeled training set. The trained semantic matching model outperforms the original BM25 model and the query likelihood model and can effectively improve retrieval performance.

Training the semantic matching model on a large number of weakly labeled samples does not require manually annotated data and reduces the time cost of the annotation process. However, the inventors found during training that, when a semantic matching model is trained on a large number of weakly labeled samples, treating all samples equally from the beginning of training to the end makes it hard for the model to learn the correct knowledge, and the accuracy of the model is low. To guarantee the reliability of the trained model, this training approach requires a great deal of manual tuning. Although the cost of manual annotation is reduced, the time cost of training and debugging increases. The prior art therefore needs to be improved.

发明内容Contents of the invention

因此,本发明的目的在于克服上述现有技术的缺陷,提供一种语义匹配模型的训练方法以及文本匹配方法。Therefore, the object of the present invention is to overcome the above-mentioned defects in the prior art, and provide a training method of a semantic matching model and a text matching method.

本发明的目的是通过以下技术方案实现的:The purpose of the present invention is achieved through the following technical solutions:

根据本发明的第一方面,提供一种语义匹配模型的训练方法,包括: A1、获取训练集,其包括多个样本,每个样本包含预定文本以及每个预定文本对应的两个待匹配文本,其中,每个样本分别对应有弱标签和权重,弱标签指示对应样本所含两个待匹配文本中的哪一个待匹配文本与预定文本更具相关性,初始权重的数值与指示对应样本的难度的难度指标相关,难度相对越大的样本赋予相对越小的初始权重;A2、利用所述训练集中的样本对语义匹配模型进行多次迭代训练,使其根据预定文本分别和每个待匹配文本形成的文本对输出两者的相关性得分,根据相关性得分、弱标签以及权重确定加权损失值以更新语义匹配模型,其中样本的权重根据当前已完成训练的次数进行动态调整。According to a first aspect of the present invention, a training method for a semantic matching model is provided, including: A1. Obtain a training set, which includes a plurality of samples, each sample includes predetermined text and two texts to be matched corresponding to each predetermined text , where each sample corresponds to a weak label and a weight, the weak label indicates which of the two texts to be matched contained in the corresponding sample is more relevant to the predetermined text, and the value of the initial weight is the same as that of the corresponding sample Difficulty is related to the difficulty index, and the relatively more difficult samples are given relatively smaller initial weights; A2. Using the samples in the training set to perform multiple iterations on the semantic matching model, so that it can be matched with each to be matched according to the predetermined text The text pair formed by the text outputs the correlation score between the two, and the weighted loss value is determined according to the correlation score, weak label and weight to update the semantic matching model, where the weight of the sample is dynamically adjusted according to the number of times the training has been completed.

在本发明的一些实施例中,在所述步骤A2中,每训练预定训练次数后增大相应样本的权重,其中,每次增大样本的权重时,初始权重相对越小的样本的权重增大的数值相对越大。In some embodiments of the present invention, in the step A2, the weight of the corresponding sample is increased after each predetermined training times, wherein, each time the weight of the sample is increased, the weight of the sample whose initial weight is relatively smaller increases. Larger values are relatively larger.

在本发明的一些实施例中,在所述步骤A2中,在当前已完成训练的次数达到预定次数阈值后,将所有样本的权重设为相同的数值对语义匹配模型进行多次训练。In some embodiments of the present invention, in the step A2, after the number of currently completed trainings reaches a predetermined threshold, set the weights of all samples to the same value to train the semantic matching model for multiple times.

在本发明的一些实施例中,单个样本的损失值按照以下方式计算:In some embodiments of the present invention, the loss value of a single sample is calculated as follows:

L = max(0, ε - y × (S1 - S2)) × w(τ, i);

where max(·) is the function that takes the maximum of its arguments and whose output is the loss value of the sample; ε is an interval (margin) parameter; y is the weak label, y ∈ {-1, +1}, with y = +1 indicating that the first text to be matched is more relevant to the predetermined text and y = -1 indicating that the second text to be matched is more relevant to the predetermined text; S1 is the relevance score of the first text to be matched with the predetermined text; S2 is the relevance score of the second text to be matched with the predetermined text; and w(τ, i) is the weight of sample τ after adjustment once i training rounds have been completed.
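As an illustration only, a minimal PyTorch-style sketch of this weighted pairwise hinge loss is given below; the function and argument names are ours and not part of the patent.

```python
import torch

def weighted_pair_hinge_loss(s1, s2, y, w, margin=1.0):
    """Weighted pairwise hinge loss for a batch of triplet samples.

    s1, s2 : relevance scores of the two texts to be matched against the predetermined text
    y      : weak labels in {-1, +1}
    w      : per-sample weights w(tau, i) for the current training round
    margin : the interval parameter epsilon
    """
    # max(0, epsilon - y * (S1 - S2)) per sample, then multiply by the sample weight
    per_sample = torch.clamp(margin - y * (s1 - s2), min=0.0) * w
    # averaging over the batch is one option; the sum could be used instead
    return per_sample.mean()
```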

In some embodiments of the present invention, the weight of each sample is increased after every predetermined number of training rounds, and the weight of a sample is dynamically adjusted according to the number of training rounds completed so far and the difficulty index of the sample.

In some embodiments of the present invention, the weight of a sample is determined as follows:

[Formula images in the original: the definition of the sample weight w(τ, i) and of the difficulty function D(τ).]

where i is the number of training rounds completed so far, c is the predetermined threshold on that number, D(τ) is the difficulty function, |aτ| is the difficulty index of sample τ, |an| is the difficulty index of an arbitrary sample in the set A, the set A is either the set of all samples of the current training batch or the training set, and max{|an| ∈ A} is the maximum difficulty index over all samples of the current training batch or of the training set.

In some embodiments of the present invention, the difficulty index of a sample is the absolute value of the difference between the BM25 score generated by the BM25 model for the relevance of the first text to be matched to the predetermined text and the BM25 score generated by the BM25 model for the relevance of the second text to be matched to the predetermined text.

In some embodiments of the present invention, the semantic matching model comprises a BERT model and a linear prediction layer, wherein the BERT model forms a text pair from the predetermined text of a sample and each text to be matched and outputs a semantic vector containing semantic similarity information, and the linear prediction layer outputs the relevance score of the text to be matched and the predetermined text from that semantic vector.

In some embodiments of the present invention, the method further comprises: before the semantic matching model is trained on the training set for the first time, performing histogram equalization on the difficulty indices of the samples in the training set so that they are uniformly distributed over the difficulty space.

According to a second aspect of the present invention, a text matching method is provided, comprising: B1, obtaining a query text and a plurality of candidate texts and forming a text pair from the query text and each candidate text; B2, feeding each text pair into a semantic matching model trained by the training method of claims 1-8 to obtain the relevance score of the query text and each candidate text; B3, ranking the candidate texts by the relevance score of the query text and each candidate text, and outputting the corresponding candidate texts according to the ranking result.

In some embodiments of the present invention, the candidate texts output according to the ranking result are at least one of the following: the candidate text with the highest relevance score; a predetermined number of candidate texts selected in descending order of relevance score; candidate texts whose relevance score is greater than or equal to a predetermined score threshold.

According to a third aspect of the present invention, an electronic device comprises one or more processors and a memory, wherein the memory stores executable instructions, and the one or more processors are configured to execute the executable instructions to implement the steps of the method of the first aspect and/or the second aspect.

Compared with the prior art, the present invention has the following advantages:

The present invention adds a difficulty index to each weakly labeled sample and, at the start of training, gives more difficult samples smaller initial weights; during training, the weight of each sample is dynamically adjusted on the basis of the difficulty index as the number of training rounds increases. The semantic matching model therefore pays less attention to the harder samples at the beginning and learns progressively from easy to hard, which improves the accuracy of the model and reduces the cost of manual tuning.

Brief Description of the Drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of computing the BM25 scores of a predetermined text and texts to be matched with the BM25 model according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a training method for a semantic matching model according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of dynamically adjusting sample weights according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

As mentioned in the background section, when a semantic matching model is trained on a large number of weakly labeled samples, treating all samples equally from the beginning of training to the end makes it hard for the model to learn the correct knowledge, and the accuracy of the model is low. When the inventors used the BM25 model to construct weakly labeled data, they found that some training pairs consist of the first returned document (the top-1 document, i.e. the text most relevant to the predetermined text) and a randomly sampled document, while other training pairs consist of the first and the second returned documents (the top-1 and top-2 documents). The former are relatively easy for the model to distinguish, while the latter are harder, because in the former case the top-1 document contains many more words associated with the predetermined text than a randomly sampled document, whereas in the latter case both the top-1 and the top-2 documents may contain many query-related words.

To solve this problem, the present invention adds a difficulty index to each weakly labeled sample and, at the start of training, gives more difficult samples smaller initial weights; during training, the weight of each sample is dynamically adjusted on the basis of the difficulty index as the number of training rounds increases. The semantic matching model therefore pays less attention to the harder samples at the beginning and learns progressively from easy to hard, which improves the accuracy of the model and reduces the cost of manual tuning.

The present invention uses an existing algorithm or model that can directly score the relevance of two texts (such as the BM25 model or a query likelihood model) to construct a sample data set with difficulty indices and weak labels for training the semantic matching model, so as to improve the accuracy of the semantic matching model. For ease of understanding, the overall implementation flow of the present invention is briefly introduced here using the BM25 model as an example; it mainly comprises the following parts:

(1) Prepare a plurality of predetermined texts and a plurality of texts to be matched, and score the relevance of each predetermined text to each text to be matched with the BM25 model to obtain the BM25 score of every predetermined text with every text to be matched;

(2) Combine a predetermined text with two texts to be matched whose relevance differs to form a sample, i.e. a triplet (q, d1, d2) consisting of one predetermined text q and two texts to be matched d1 and d2; since there are multiple predetermined texts and multiple texts to be matched, a large number of samples with weak labels and difficulty indices can quickly be formed in this way to constitute the sample data set;

(3) Construct the weak label of each sample from the BM25 scores of the predetermined text with each text to be matched; for example, if the BM25 score of the predetermined text q with d1 is BM25(q, d1) and that with d2 is BM25(q, d2), set the weak label to +1 if BM25(q, d1) > BM25(q, d2) and to -1 if BM25(q, d1) < BM25(q, d2);

(4) Construct the difficulty index of each sample from the BM25 scores of the predetermined text with each text to be matched; for example, use the absolute value of the difference between the BM25 score of the predetermined text q with d1 and that with d2, |BM25(q, d1) - BM25(q, d2)|, as the difficulty index of the sample;

(5) Select some of the samples of the data set formed in the above steps as the training set and train the semantic matching model according to the training method of the present invention;

(6) Use the trained semantic matching model to score the relevance of the query text entered by the user to a plurality of candidate texts and output the corresponding candidate texts.
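As an illustration of steps (1) to (4) above, a minimal Python sketch of building weakly labeled triplets with a difficulty index from pairwise relevance scores could look as follows; `bm25_score` is a placeholder for whatever weak annotator is used (BM25, query likelihood, etc.), and the class and field names are ours.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Triplet:
    query: str
    doc1: str
    doc2: str
    weak_label: int    # +1 if doc1 is more relevant to query, -1 otherwise
    difficulty: float  # |score(q, doc1) - score(q, doc2)|; smaller means harder

def build_triplets(queries, docs, bm25_score):
    """Build weakly labeled triplet samples from a weak annotator's pairwise scores."""
    samples = []
    for q in queries:
        for d1, d2 in combinations(docs, 2):
            s1, s2 = bm25_score(q, d1), bm25_score(q, d2)
            if s1 == s2:
                # the weak annotator cannot order this pair, so no weak label is assigned
                continue
            samples.append(Triplet(q, d1, d2,
                                   weak_label=+1 if s1 > s2 else -1,
                                   difficulty=abs(s1 - s2)))
    return samples
```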

To describe the technical solution of the present invention in more detail, it is presented below in four parts: the model used, the training samples, the training process, and the application scenarios.

1. Model Used

According to an embodiment of the present invention, the semantic matching model comprises a BERT model and a linear prediction layer, wherein the BERT model forms a text pair from the predetermined text of a sample and each text to be matched and outputs a semantic vector containing semantic similarity information. The semantic matching model can use Google's pre-trained BERT model with a fully connected layer as the linear prediction layer that converts the semantic vector output by the BERT model into a similarity score. The two texts are fed to the BERT model as a sentence pair; for a predetermined text q and a text to be matched dn, the input takes the form: [CLS] predetermined text q [SEP] text to be matched dn, where [CLS] is the start token of the BERT model and [SEP] is the separator between the two sentences. After processing by the BERT model, the semantic vector corresponding to the start token [CLS] is taken as the semantic vector containing the semantic similarity information and is fed to the linear prediction layer to obtain the similarity score, denoted BERT(q, dn). Preferably, the semantic matching model can adopt an existing model structure, for example the structure of the BERT-Rank model.
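As an illustration, a minimal sketch of the described architecture (a pre-trained BERT encoder whose [CLS] vector is passed to a single linear layer that outputs the relevance score) could look as follows using the HuggingFace transformers library; the checkpoint name and truncation length are illustrative choices, not values prescribed by the patent.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class BertMatcher(nn.Module):
    """BERT encoder followed by a single linear prediction layer producing a relevance score."""

    def __init__(self, checkpoint="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.score = nn.Linear(self.bert.config.hidden_size, 1)  # linear prediction layer

    def forward(self, **encoded):
        # the [CLS] vector of the "[CLS] q [SEP] d [SEP]" input carries the similarity information
        cls_vec = self.bert(**encoded).last_hidden_state[:, 0]
        return self.score(cls_vec).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertMatcher()
enc = tokenizer("International Organized Crime",
                "A parliamentary commission accused AAA today of ...",
                return_tensors="pt", truncation=True, max_length=512)
relevance = model(**enc)  # scalar tensor, the score BERT(q, d)
```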

2. Training Samples

According to an embodiment of the present invention, the sample data set of the present invention comprises a plurality of samples together with a weak label and a difficulty index for each sample, each sample being a triplet (q, d1, d2) consisting of a predetermined text q and two texts to be matched d1 and d2. A weak label of +1 indicates that the text to be matched d1 is more relevant to the predetermined text q, and a weak label of -1 indicates that the text to be matched d2 is more relevant to the predetermined text q. The difficulty index indicates the difficulty of the sample: the closer the relevance of the two texts to be matched to the predetermined text, the more difficult the sample. The difficulty index may be the absolute value of the difference between the BM25 score of the predetermined text q with d1 and that with d2, |BM25(q, d1) - BM25(q, d2)|. Usually several predetermined texts are set, each simulating a query text entered by a user; in other words, the predetermined texts are sample queries constructed in advance.

Referring to Fig. 1, the setting of the weak label and the difficulty index is illustrated below with an example of scoring English texts with the BM25 model:

Suppose a predetermined text is: International Organized Crime

Suppose some texts to be matched are:

Text to be matched 1:

A parliamentary commission accused AAA today of doing little to stop some kind of crime and BBB international networks from pumping billions of dollars through companies of CCC. It said DDD were too soft on organized crime and even failed to probe charges that criminal EEE was using cocaine profits to buy military hardware for FFF. "The threat posed by organized crime has been recognized too late by the GGG office," the special investigative commission report said.


Text to be matched 2:

The precondition for taking over the task of combating crime is an increase in personnel as well as an amendment to the Federal Constitutional Protection Law. Werthebach sees potential work for the BfV in observing organized crime structures that might exert influence over the media. It is urgently necessary to harmonize the methods and legal bases of intelligence services all over Europe.


Text to be matched 3:

Steven Anreder, Drexel's spokesman, said Sher's promotion had nothing to do with the SEC settlement. He said it was based simply on a decision by the firm's board to give Sher additional recognition. Securities industry analysts noted that during Joseph's term as chief executive a variety of executives in addition to Joseph have been vice chairmen, and they said Sher's promotion doesn't appear to change Joseph's status within the firm.


Taking the predetermined text "International Organized Crime" as the query, the BM25 model gives BM25 scores of 21.3, 18.7 and 0 for the query against the three texts to be matched, respectively.

Schematically, as shown in Fig. 1, searching for "international organized crime" with the BM25 model returns a number of results with different BM25 scores. For the semantic matching model, deciding whether the query matches text to be matched 1 (BM25 score of 21) or text to be matched 2 (BM25 score of 18) is clearly more difficult than deciding whether it matches text to be matched 1 (BM25 score of 21) or text to be matched 3 (BM25 score of 0). A weakly labeled sample with a difficulty can therefore be constructed, for example (predetermined text, text to be matched 1, text to be matched 2), whose weak label is +1 and whose difficulty index is |21.3 - 18.7| = 2.6.

Of course, both the weak label and the difficulty index can be constructed not only from BM25 scores but also from the relevance scores given by a query likelihood model or output by another semantic matching model; the present invention places no restriction on this.

Further, in order to prevent the model from learning samples of different difficulties in an unbalanced way, according to an embodiment of the present invention, samples of several different difficulty levels are generated when the sample data set is constructed, and the numbers of samples with the various difficulty indices are balanced. For example, the difficulty indices of the samples are divided into levels at certain intervals, and the sample data set is built so that it contains the same number of samples at each difficulty level. When the training set is split off, the same number of samples is taken from each difficulty level, so that the numbers of samples with the various difficulty indices in the training set are balanced.

Because the distribution of relevance between the many documents to be matched and the predetermined texts is often unbalanced, enforcing balanced numbers of samples over the difficulty indices when building the sample data set as in the above embodiment would leave many samples unused. The sample data set can therefore be built in the usual way and the difficulty distribution of the samples transformed afterwards to guarantee the accuracy of the model. According to an embodiment of the present invention, the method further comprises: performing histogram equalization on the difficulty indices of the samples in the sample data set or the training set, so that the difficulty indices are uniformly distributed over the difficulty space. Taking the BM25 score as an example, the difficulty indices of the constructed samples do not follow a uniform distribution, which may leave the model with disproportionately many hard samples or many easy samples; histogram equalization makes the distribution of the samples as balanced as possible. A transformation T is defined so that the samples are uniformly distributed over the difficulty space. Let r = |BM25(q, d1) - BM25(q, d2)| be the absolute BM25 difference of a triplet τ = (q, d1, d2), and define a transformation s = T(r) that maps r into a new space. Let pr(r) denote the probability distribution of the original BM25 differences, ps(s) the transformed distribution, and let r take values in [0, R]. If ps(s) is a uniform distribution on [0, R], then by standard histogram equalization s = T(r) = R · ∫ pr(w) dw, where the integral runs from 0 to r. The difficulty index is then mapped to the interval [0, 1] by normalization, so that the difficulty indices of the samples are evenly distributed over [0, 1].

In order to let the model learn from easy to hard, each sample has a corresponding weight, the value of the initial weight is related to the difficulty index indicating the difficulty of the sample, and more difficult samples are given smaller initial weights.

3. Training Process

Before training, the data set obtained by the method of the above embodiments is split into a training set and a test set in a certain ratio, and training is then carried out. Referring to Fig. 2, according to an embodiment of the present invention a training method for a semantic matching model is provided, comprising: A1, obtaining a training set comprising a plurality of samples, each sample containing a predetermined text and two texts to be matched corresponding to that predetermined text, wherein each sample is associated with a weak label and a weight, the weak label indicates which of the two texts to be matched in the sample is more relevant to the predetermined text, the value of the initial weight is related to a difficulty index indicating the difficulty of the sample, and more difficult samples are given smaller initial weights; A2, iteratively training the semantic matching model with the samples in the training set so that it outputs a relevance score for each text pair formed by the predetermined text and one of the texts to be matched, and determining a weighted loss value from the relevance scores, the weak label and the weight to update the semantic matching model, wherein the weight of each sample is dynamically adjusted according to the number of training rounds completed so far, and the weights of the more difficult samples are gradually increased as the number of training rounds grows. In this way the present invention achieves at least the following beneficial technical effects: a difficulty index is added to each weakly labeled sample, more difficult samples are given smaller initial weights at the start of training, and the weight of each sample is dynamically adjusted on the basis of the difficulty index as the number of training rounds increases; the semantic matching model therefore pays less attention to the harder samples at the beginning and learns progressively from easy to hard, which guides and accelerates the learning process and ultimately improves the accuracy of the model and reduces the cost of manual tuning.

According to an embodiment of the present invention, the weights in the loss could be adjusted after every training epoch, but this is computationally expensive and may also prevent the model from learning knowledge stably. According to an embodiment of the present invention, the method therefore further comprises: increasing the weight of each sample after every predetermined number of training epochs, wherein each time the weights are increased, samples with smaller initial weights receive larger increments. Preferably, the predetermined number of training epochs is 50 to 1000. For example, if the initial weight of sample 1 is 0.05, the initial weight of sample 2 is 0.1 and the predetermined number of training epochs is 100, then during epochs 0-99 the weighted loss value is the loss value of the sample multiplied by its initial weight; when the weights are then increased, the weight of sample 1 increases by 0.095 to 0.145 and the weight of sample 2 by 0.09 to 0.19, and during epochs 100-199 the loss value of each sample is multiplied by these new weights, and so on. For the simplest samples, the weight can be set to a fixed value that remains unchanged throughout training, for example a weight that is always 1.

After training has reached a certain number of epochs, the model has already gone through a good easy-to-hard learning process and can be allowed to learn all samples equally to improve its accuracy. According to an embodiment of the present invention, after the number of completed training epochs reaches a predetermined threshold, the weights applied to the loss values of all samples are set to the same value, and the semantic matching model is then trained for further epochs on the training set while this value is kept unchanged. For example, the predetermined threshold is set to an integer between 10000 and 50000 epochs; taking 20000 as an example, once 20000 training epochs have been completed, the weights of all samples are set to 1 and a weight of 1 is used for every sample in all subsequent epochs.

Whenever the sample weights need to be adjusted, the weight of each sample can be determined from the number of training epochs completed so far and its difficulty index. According to an embodiment of the present invention, the weight of each sample is increased after every predetermined number of training epochs, and the weight of a sample is dynamically adjusted according to the number of training epochs completed so far and the difficulty index of the sample. Preferably, the weight of a sample is determined as follows:

[Formula images in the original: the definition of the sample weight w(τ, i) and of the difficulty function D(τ).]

where i is the number of training epochs completed so far, c is the predetermined threshold on that number, D(τ) is the difficulty function, |aτ| is the difficulty index of sample τ, |an| is the difficulty index of an arbitrary sample in the set A, the set A is either the set of all samples of the current training batch or the training set, and max{|an| ∈ A} is the maximum difficulty index over all samples of the current training batch or of the training set. It should be understood that this is only a preferred embodiment; the increment applied at each weight adjustment can also be set separately for samples of different difficulties, as long as samples with smaller initial weights receive larger increments. Alternatively, other embodiments can easily be obtained by adjusting the above formula, for example by replacing one sub-expression of the formula (shown as an image in the original) with an alternative expression (also shown as an image), leaving the rest unchanged.
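The exact weighting formula is given only as images in the original and is not reproduced here. The sketch below is one possible instantiation, an assumption on our part, that satisfies the stated constraints: harder samples (smaller |aτ|) start with smaller weights, weights grow as the epoch count i increases, samples with smaller initial weights receive larger increments, and all weights become equal once i reaches the threshold c.

```python
def sample_weight(difficulty, max_difficulty, epoch, c):
    """One possible form of w(tau, i); an assumed instantiation, not the patent's exact formula.

    difficulty     : |a_tau|, the difficulty index of sample tau (smaller means harder)
    max_difficulty : max{|a_n| : a_n in A} over the current batch or the training set
    epoch          : i, the number of training epochs completed so far
    c              : the predetermined epoch threshold after which all weights are equal
    """
    if epoch >= c:
        return 1.0                           # after the threshold, every sample gets the same weight
    d = difficulty / max_difficulty          # assumed difficulty function D(tau) in [0, 1]
    return d + (1.0 - d) * (epoch / c)       # grows from D(tau) at epoch 0 to 1 at epoch c
```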

According to an embodiment of the present invention, the weighted loss value of a single sample is calculated as follows:

L = max(0, ε - y × (S1 - S2)) × w(τ, i);

where max(·) is the function that takes the maximum of its arguments and whose output is the loss value of the sample; ε is an interval parameter; y is the weak label, y ∈ {-1, +1}, with y = +1 indicating that the first text to be matched is more relevant to the predetermined text and y = -1 indicating that the second text to be matched is more relevant to the predetermined text; S1 is the relevance score of the first text to be matched with the predetermined text; S2 is the relevance score of the second text to be matched with the predetermined text; and w(τ, i) is the weight of sample τ after adjustment once i training epochs have been completed. ε is also called the margin; its value can be chosen by the implementer according to the application scenario, generally in the range [0.5, 1], and is usually 0.5 or 1. It should be understood that when the predetermined number of training epochs between weight adjustments is not 1, for example 5, the weight of the sample at epoch i+2 is still w(τ, i), because the condition for the next weight adjustment has not yet been reached. In addition, if the model is trained with stochastic gradient descent, its parameters are updated each time from the weighted loss value of a single sample; if it is trained with batch gradient descent, its parameters are updated each time from the sum or the average of the weighted loss values of all samples in a batch. Each time a sample (q, d1, d2) is fed to the model, two relevance matches are performed: first the relevance score of the first text to be matched with the predetermined text, S1 = BERT(q, d1), and then the relevance score of the second text to be matched with the predetermined text, S2 = BERT(q, d2); the loss value is then computed from the two relevance scores with the above loss function and the model is updated. It should also be understood that these two relevance scores are different from the relevance scores (such as the BM25 scores) used to form the sample data set.
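Putting the pieces together, one training epoch with batch gradient descent might be sketched as follows; this assumes the illustrative BertMatcher, weighted_pair_hinge_loss and sample_weight helpers sketched earlier, and the batch size and other details are simplified.

```python
import torch

def chunked(seq, size):
    """Yield consecutive mini-batches of the given size."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def train_one_epoch(model, tokenizer, optimizer, triplets, epoch, c, max_difficulty, margin=1.0):
    model.train()
    for batch in chunked(triplets, 16):
        enc1 = tokenizer([t.query for t in batch], [t.doc1 for t in batch],
                         return_tensors="pt", padding=True, truncation=True)
        enc2 = tokenizer([t.query for t in batch], [t.doc2 for t in batch],
                         return_tensors="pt", padding=True, truncation=True)
        s1, s2 = model(**enc1), model(**enc2)   # two relevance passes per triplet: S1 and S2
        y = torch.tensor([t.weak_label for t in batch], dtype=torch.float)
        w = torch.tensor([sample_weight(t.difficulty, max_difficulty, epoch, c) for t in batch])
        loss = weighted_pair_hinge_loss(s1, s2, y, w, margin)   # weighted loss over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```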

Referring to Fig. 3, schematically, suppose the text to be matched that is more relevant to the predetermined text q is d+ and the other text to be matched is d-, with the BM25 score of each text to be matched with q marked below it. In the hardest sample, the BM25 score of d+ with q is 20 and that of d- with q is 19; the absolute value of the BM25 difference is smaller than in the neighbouring samples, so its initial weight w(τ, 0) is smaller. The sample weights are dynamically adjusted every 1000 epochs, i.e. the adjusted weights are first used at the 1000th epoch, and so on. With the predetermined threshold set to 10000 epochs, after epochs 0-9999 have been completed and the threshold has been reached, the weights of all samples are set to 1 at the 10000th epoch and kept unchanged for multiple further rounds of training.

4. Application Scenarios

According to an embodiment of the present invention, a text matching method is provided, comprising: B1, obtaining a query text and a plurality of candidate texts and forming a text pair from the query text and each candidate text; B2, feeding each text pair into a semantic matching model trained by the training method of the foregoing embodiments to obtain the relevance score of the query text and each candidate text; B3, ranking the candidate texts by the relevance score of the query text and each candidate text, and outputting the corresponding candidate texts according to the ranking result. As mentioned in the background section, the candidate texts may be the small number of candidate texts related to the query text that are recalled from a massive text database in the recall stage. According to an embodiment of the present invention, the candidate texts output according to the ranking result are at least one of the following: the candidate text with the highest relevance score; a predetermined number of candidate texts selected in descending order of relevance score; candidate texts whose relevance score is greater than or equal to a predetermined score threshold.
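As an illustration of steps B1 to B3, a minimal sketch of scoring a query against the recalled candidate texts with the trained model and returning them in score order could look as follows; the cutoff parameters are illustrative.

```python
import torch

def rank_candidates(model, tokenizer, query, candidates, top_k=None, score_threshold=None):
    """Score a query against recalled candidate texts and return them in descending score order."""
    model.eval()
    with torch.no_grad():
        enc = tokenizer([query] * len(candidates), candidates,
                        return_tensors="pt", padding=True, truncation=True)
        scores = model(**enc).tolist()
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        ranked = [(c, s) for c, s in ranked if s >= score_threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked
```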

To verify the effect of the present invention, the inventors carried out experiments. The experiments show that when the difficulty distribution of the samples in the training set is unbalanced, training the semantic matching model with the training method of the present invention improves the MAP (mean average precision) of the resulting ranking by about 2-2.5% and the NDCG (Normalized Discounted Cumulative Gain) by about 2-3% compared with existing training methods. Moreover, when the difficulty distribution of the samples in the training set is balanced, the training effect improves further: MAP improves by about 2-2.5% and NDCG by about 3-4% compared with existing training methods.

It should be noted that although the steps are described above in a specific order, this does not mean that they must be performed in that order; in fact some of these steps can be performed concurrently or even in a different order, as long as the required functions can be achieved.

The present invention may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.

The computer-readable storage medium may be a tangible device that holds and stores instructions for use by an instruction-execution device. The computer-readable storage medium may include, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination of the above.

The embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical applications or their technical improvements over what is available on the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A training method for a semantic matching model, characterized by comprising the following steps:
A1, obtaining a training set, wherein the training set comprises a plurality of samples, each sample comprises a predetermined text and two texts to be matched corresponding to the predetermined text, each sample corresponds to a weak label and a weight, the weak label indicates which of the two texts to be matched contained in the corresponding sample is more relevant to the predetermined text, the value of the initial weight is related to a difficulty index indicating the difficulty of the corresponding sample, and a sample with greater difficulty is assigned a smaller initial weight;
and A2, carrying out repeated iterative training on the semantic matching model by using the samples in the training set, so that the semantic matching model outputs a relevance score for the text pair formed by the predetermined text and each text to be matched, and determining a weighted loss value according to the relevance scores, the weak label and the weight to update the semantic matching model, wherein the weight of each sample is dynamically adjusted according to the number of training iterations completed so far.
2. The training method according to claim 1, wherein in step A2 the weights of the samples are increased after every predetermined number of training iterations, and each time the weights are increased, a sample with a smaller initial weight receives a larger increase.
3. The training method according to claim 2, wherein in step A2, after the number of completed training iterations reaches a predetermined threshold, the semantic matching model is further trained for a plurality of iterations with the weights of all samples set to the same value.
4. The training method according to claim 3, wherein the loss value of a single sample is calculated as follows:
L = max(0, ε - y × (S1 - S2)) × w(τ, i);
wherein L is the loss value of the sample; max(·) represents taking the maximum of its arguments; ε represents a margin parameter; y represents the weak label, y ∈ {-1, +1}, where y = +1 indicates that the first text to be matched is more relevant to the predetermined text and y = -1 indicates that the second text to be matched is more relevant to the predetermined text; S1 represents the relevance score of the first text to be matched with the predetermined text; S2 represents the relevance score of the second text to be matched with the predetermined text; and w(τ, i) represents the weight of the sample τ after adjustment over i training iterations.
5. The training method according to claim 1, wherein the weights of the samples are increased after every predetermined number of training iterations, and the weight of each sample is dynamically adjusted according to the number of completed training iterations and the difficulty index of the sample.
6. The training method according to claim 5, wherein the weights of the samples are determined in the following way:
[The formulas for the weight w(τ, i) and the difficulty function D(τ) are given as images FDA0003804052680000021 and FDA0003804052680000022 in the original publication and are not reproduced in this text; an illustrative, non-authoritative sketch of such a weight schedule is given after the claims.]
where i represents the number of training iterations completed so far, c represents a predetermined threshold on the number of iterations, D(τ) represents a difficulty function, |a_τ| represents the difficulty index of the sample τ, |a_n| represents the difficulty index of any sample in a set A, the set A represents the set of all samples in the current training batch or the training set, and max{|a_n|, n ∈ A} represents the maximum difficulty index over all samples in the current training batch or in the training set.
7. The training method according to any one of claims 1 to 5, wherein the difficulty index of a sample is the absolute value of the difference between the BM25 score produced by a BM25 model for the relevance of the first text to be matched to the predetermined text in the sample and the BM25 score produced by the BM25 model for the relevance of the second text to be matched to the predetermined text in the sample.
8. The training method according to any one of claims 1-5, wherein the semantic matching model comprises a BERT model and a linear prediction layer, wherein the BERT model is used for forming a text pair from the predetermined text in each sample and each text to be matched and outputting a semantic vector containing semantic similarity information, and the linear prediction layer is used for outputting a relevance score of the text to be matched and the predetermined text according to the semantic vector.
9. The training method according to any one of claims 1-5, wherein the method further comprises: before the semantic matching model is trained with the training set for the first time, performing histogram equalization on the difficulty indexes of the samples in the training set, so that the difficulty indexes of the samples in the training set are uniformly distributed over the difficulty space.
10. A method of text matching, comprising:
B1, acquiring a query text and a plurality of candidate texts, and forming a text pair from the query text and each candidate text;
B2, inputting each text pair into the semantic matching model obtained by training according to the training method of a semantic matching model of any one of claims 1-8, to obtain a relevance score between the query text and each candidate text;
and B3, ranking the candidate texts according to the relevance score between the query text and each candidate text, and outputting the corresponding candidate texts according to the ranking result.
11. The text matching method according to claim 10, wherein the candidate texts output according to the ranking result are at least one of the following:
the candidate text with the highest relevance score;
a predetermined number of candidate texts selected in descending order of relevance score;
candidate texts with a relevance score greater than or equal to a predetermined score threshold.
12. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of claims 1-9 and 10-11.
13. An electronic device, comprising:
one or more processors; and
a memory, wherein the memory is configured to store executable instructions;
the one or more processors are configured to implement the steps of the method of any of claims 1-9 and 10-11 via execution of the executable instructions.
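
The claims above specify a weighted pairwise training objective (claim 4), a BM25-based difficulty index (claim 7) and a weight schedule whose exact formulas (claim 6) are only available as images in the original publication. The following is a minimal, non-authoritative sketch of how such an objective could be implemented; the mapping from BM25 score gap to initial weight, the linear ramp toward a common weight, the margin value and all numeric inputs are assumptions introduced for illustration and are not taken from the patent text.

import torch

def initial_weight(bm25_gap: float, max_gap: float) -> float:
    # Assumption: a smaller BM25 score gap |a_tau| is treated as a harder,
    # noisier sample and therefore receives a smaller initial weight (claim 1).
    # The patent's actual difficulty function D(tau) is given only as an image.
    return bm25_gap / max_gap if max_gap > 0 else 1.0

def dynamic_weight(w0: float, i: int, c: int) -> float:
    # Assumed schedule for w(tau, i): the weight rises from its initial value
    # toward 1 as training proceeds, so samples with smaller initial weights
    # receive larger increases (claim 2); once i reaches the threshold c, all
    # weights coincide at the same value (claim 3). Here the weight is
    # recomputed every iteration; the claims allow updating it only after
    # every predetermined number of iterations.
    return w0 + (1.0 - w0) * min(i / c, 1.0)

def weighted_pairwise_loss(s1, s2, y, w, epsilon=1.0):
    # Weighted margin loss of claim 4, averaged over the batch:
    # L = max(0, epsilon - y * (S1 - S2)) * w(tau, i),
    # where y in {-1, +1} is the weak label and S1, S2 are the relevance
    # scores of the two texts to be matched.
    per_sample = torch.clamp(epsilon - y * (s1 - s2), min=0.0) * w
    return per_sample.mean()

# Toy batch with hypothetical values.
s1 = torch.tensor([2.3, 0.7, 1.1], requires_grad=True)   # scores of first texts
s2 = torch.tensor([1.9, 1.5, 0.2], requires_grad=True)   # scores of second texts
y = torch.tensor([1.0, -1.0, 1.0])                        # weak labels from BM25
bm25_gaps = [4.2, 0.6, 2.1]                               # |a_tau| per sample
max_gap = max(bm25_gaps)
i, c = 3, 10                                              # completed iterations, threshold
w = torch.tensor([dynamic_weight(initial_weight(g, max_gap), i, c) for g in bm25_gaps])

loss = weighted_pairwise_loss(s1, s2, y, w)
loss.backward()   # gradients flow only through the model scores s1 and s2
print(float(loss))

In a full setup, s1 and s2 would come from the BERT model with a linear prediction layer of claim 8 applied to each text pair, and the BM25 gaps would be computed over the document collection as in claim 7; at inference time the same scorer would rank the candidate texts of a query as in claims 10-11.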
CN202210991280.1A 2022-08-18 2022-08-18 A training method for a semantic matching model and a text matching method Pending CN115511073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210991280.1A CN115511073A (en) 2022-08-18 2022-08-18 A training method for a semantic matching model and a text matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210991280.1A CN115511073A (en) 2022-08-18 2022-08-18 A training method for a semantic matching model and a text matching method

Publications (1)

Publication Number Publication Date
CN115511073A true CN115511073A (en) 2022-12-23

Family

ID=84502741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210991280.1A Pending CN115511073A (en) 2022-08-18 2022-08-18 A training method for a semantic matching model and a text matching method

Country Status (1)

Country Link
CN (1) CN115511073A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119336797A (en) * 2024-12-19 2025-01-21 成都国恒空间技术工程股份有限公司 A fast retrieval method and system for large-scale intelligence data based on large models
CN119336797B (en) * 2024-12-19 2025-04-15 成都国恒空间技术工程股份有限公司 Large-scale information data quick retrieval method and system based on large model

Similar Documents

Publication Publication Date Title
CN110298028B (en) A method and device for extracting key sentences from text paragraphs
CN108073568B (en) Keyword extraction method and device
CN111797214A (en) Question screening method, device, computer equipment and medium based on FAQ database
CN109947902B (en) Data query method and device and readable medium
WO2023035330A1 (en) Long text event extraction method and apparatus, and computer device and storage medium
CN109101620A (en) Similarity calculating method, clustering method, device, storage medium and electronic equipment
CN109299245B (en) Method and device for recalling knowledge points
CN107992633A (en) Electronic document automatic classification method and system based on keyword feature
CN106202153A (en) The spelling error correction method of a kind of ES search engine and system
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN110858269A (en) Criminal charge prediction method and device
CN109255000B (en) Dimension management method and device for label data
CN111199474A (en) Risk prediction method and device based on network diagram data of two parties and electronic equipment
CN115600602B (en) Method, system and terminal device for extracting key elements of long text
CN112562736B (en) Voice data set quality assessment method and device
CN118133221A (en) A privacy data classification and grading method
CN112836010A (en) Patent retrieval method, storage medium and device
CN115809887A (en) Method and device for determining main business range of enterprise based on invoice data
CN108681564A (en) The determination method, apparatus and computer readable storage medium of keyword and answer
CN114663067A (en) Job matching method, system, equipment and medium
CN119046432A (en) Data generation method and device based on artificial intelligence, computer equipment and medium
CN112613321A (en) Method and system for extracting entity attribute information in text
CN115577080A (en) Question reply matching method, system, server and storage medium
CN115511073A (en) A training method for a semantic matching model and a text matching method
CN111126073A (en) Semantic retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination