
CN114398867B - A two-stage similarity calculation method for long texts - Google Patents


Info

Publication number
CN114398867B
Authority
CN
China
Prior art keywords
sentence
similarity
text
similar
long
Prior art date
Legal status
Active
Application number
CN202210298133.6A
Other languages
Chinese (zh)
Other versions
CN114398867A (en)
Inventor
段思宇
苏祺
王军
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN202210298133.6A
Publication of CN114398867A
Application granted
Publication of CN114398867B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a two-stage method for computing the similarity of long texts. In the first stage, similar-sentence detection, a sentence-vector extraction model is built on a deep learning model and used to convert text into sentence vectors; similar sentence pairs of several similarity types are then detected between each pair of long texts. In the second stage, graph-structure computation, a basic similarity is computed, and the similar sentence pairs and basic similarities are expressed as a similar-sentence relation graph in which each node represents one long text. Operations on the graph yield high-level node representations that fuse group information, and the node features are updated so that the value of each dimension of a node's feature vector is the text similarity between the corresponding long texts, i.e., the text similarity between the long texts is obtained. The method makes long-text similarity strongly interpretable and improves the effectiveness and precision of text processing.

Description

A Two-Stage Similarity Calculation Method for Long Texts

Technical Field

The invention relates to text similarity calculation methods, and in particular to a two-stage long-text similarity calculation method based on a deep learning model and a graph algorithm.

Background

Text similarity calculation is an important task in natural language processing; the related techniques aim to measure, by technical means, how similar texts are to one another. Texts of different lengths require different similarity calculation methods. Computing the similarity of long texts requires extracting, compressing, and matching a large amount of textual information, which has important applications in news recommendation, article recommendation, citation recommendation, and document clustering.

Most existing techniques are based on keyword extraction: a few keywords are extracted to represent a long text and then used in further similarity calculation. Because the result depends on only a few keywords, a large amount of semantic information is lost and robustness is poor.

Methods based on deep learning models encode the full text with a deep learning model and then compute similarity on the encodings. However, existing deep learning models achieve good encodings only for text sequences of up to a few hundred words, whereas book-like long texts often run to tens or even hundreds of thousands of words, which existing models cannot encode well. Moreover, because the similarity is computed in a latent space, interpretability is poor.

In addition, both kinds of techniques consider only the information of the long texts being compared; the calculation is relatively isolated and makes no use of group information.

Summary of the Invention

The invention provides a two-stage long-text similarity calculation method based on a deep learning model and a graph algorithm. Using the semantic information of the texts themselves together with group information, it computes the similarity of book-level long texts in two stages.

The principle of the invention is as follows: for a group of long texts, in the first stage, several detection methods are used to find similar sentence pairs between each pair of long texts; in the second stage, the similar sentence pairs are merged and aggregated according to the long texts they come from, each long text is abstractly represented as a node on a graph, and inference and interaction operations on the graph let information pass between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between long texts is obtained from the node features.

The technical scheme provided by the invention is as follows:

A two-stage long-text similarity calculation method comprises the following steps.

The first stage, similar-sentence detection, includes:

building a sentence-vector extraction model based on a deep learning model, the sentence-vector extraction model comprising a semantic-similarity detection model and a paraphrase-similarity detection model;

converting text into sentence vectors with the sentence-vector extraction model;

detecting, with several detection methods, similar sentence pairs of several similarity types between each pair of long texts.

The second stage, graph-structure computation, includes:

computing a basic similarity;

representing, based on a graph algorithm, the long-text similar sentence pairs and the basic similarities as a similar-sentence relation graph, in which each node represents one long text;

obtaining, through inference and interaction operations on the similar-sentence relation graph, high-level node representations that fuse group information;

updating the node feature information, where the value of each dimension of a node's feature vector is the text similarity between the corresponding long texts;

obtaining the text similarity between long texts from the node features.

Further, before the similar-sentence detection stage, each long text is first split into sentences; a pre-trained language representation model (a BERT or RoBERTa model) is fine-tuned by contrastive learning to obtain the sentence-vector extraction model; the semantic-similarity detection model and the paraphrase-similarity detection model included in the sentence-vector extraction model extract sentence vectors for the sentences and the clauses of the long texts, respectively, thereby converting the long texts into sentence vectors.

Further, the sentence-vector extraction model is obtained through the following steps:

11) Fine-tune the BERT model by contrastive learning on sentence semantic similarity to obtain the semantic-similarity detection model, including:

applying dropout to the extracted sentence vector to construct the positive example for contrastive learning;

using the other sentence vectors in the same training batch as negative examples for contrastive learning;

training with a loss function computed from the sentence vector and the constructed positive and negative examples;

naming the trained model the semantic-similarity detection model.

12) Fine-tune the BERT model by contrastive learning on sentence paraphrase similarity to obtain the paraphrase-similarity detection model, including:

extracting a sentence vector from the sentence text;

splitting each sentence into clauses at commas, then randomly selecting and shuffling clauses within the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; using the vectors extracted from the other sentence texts in the same training batch as negative examples for contrastive learning.

The loss function used to fine-tune the BERT model contains two terms, L_1 and L_2. L_1 is the same loss function as in step 11); L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples. The final loss function L is L = L_1 + λ·L_2, where λ is a hyperparameter that must be set and adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference.

The resulting model is named the paraphrase-similarity detection model.

Further, the several detection methods of the first stage include detection methods for three types of similar sentence pairs: semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.

A. To detect semantically similar sentence pairs, the following operations are performed:

A1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

A2. Extract the feature vectors of all sentences with the semantic-similarity detection model, denoted V_sem;

A3. Deduplicate the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_sem;

A4. Compute the t-th percentile of the vector distances in P_sem and use it as the similarity threshold θ_sem;

A5. Filter out the sentence pairs in V_sem whose feature-vector distance is smaller than θ_sem; these are the semantically similar sentence pairs.

B. To detect paraphrase-similar sentence pairs, the following operations are performed:

B1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

B2. Extract the feature vectors of all sentences with the paraphrase-similarity detection model, denoted V_par;

B3. Deduplicate the sentence feature vectors V_par to obtain V'_par; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_par;

B4. Compute the t-th percentile of the vector distances in P_par and use it as the similarity threshold θ_par;

B5. Filter out the sentence pairs in V_par whose feature-vector distance is smaller than θ_par; these are the paraphrase-similar sentence pairs.

C. To detect locally similar sentence pairs, the following operations are performed:

C1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries, then split each sentence into clauses at commas;

C2. Extract the feature vectors of all clauses with the semantic-similarity detection model, denoted V_loc;

C3. Deduplicate the clause feature vectors V_loc to obtain V'_loc; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_loc;

C4. Compute the t-th percentile of the vector distances in P_loc and use it as the similarity threshold θ_loc;

C5. Filter out the clause pairs in V_loc whose feature-vector distance is smaller than θ_loc;

C6. Trace each successfully matched clause pair back to its corresponding sentence pair; these are the locally similar sentence pairs.

Further, after the detection results for the three types of similar sentence pairs are merged and aggregated, the counts are normalized by the total length of the texts to obtain the basic similarity of the long texts.

Further, the basic similarity is computed as follows: given two long texts T_i and T_j, suppose m_ij sentences of T_i and T_j are detected as similar, and let n_i and n_j be the total numbers of sentences in the two long texts. The basic similarity s_ij of the two long texts is obtained by normalizing the matched-sentence count m_ij by the sentence counts n_i and n_j.

Further, the long texts and their basic similarities are represented as a similar-sentence relation graph G. Each node T_i of the relation graph represents one long text, i = 1, ..., N; the node feature is a one-hot vector whose dimension is the total number of long texts N, so long text T_i has the feature vector h_i, in which the i-th component is 1 and all other components are 0. If similar sentences exist between two long texts T_i and T_j, there is an edge between the corresponding nodes, and the weight of that edge is the basic similarity s_ij.

Further, two rounds of information propagation and aggregation are performed on the relation graph to obtain new node feature information, which is used to update the nodes. Here h_i^(0) and h_j^(0) are the initial feature vectors of nodes T_i and T_j on graph G; α_1 and α_2 are user-defined weights for the first and second rounds, respectively, which adjust the proportion of information aggregated on the graph in each round; h_i^(1) and h_j^(1) are the feature vectors of nodes T_i and T_j after the first update. The final node feature vector is h_i^(2), whose j-th component h_i^(2)[j] represents the text similarity between long text T_i and long text T_j.

Compared with the prior art, the invention has the following beneficial effects:

With the technical scheme provided by the invention, when computing long-text similarity, the long texts are split into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts themselves. The long texts are also abstracted into nodes on a graph, and information propagation and aggregation on the graph let the node representations fuse group information. At the same time, because the similar sentences can be inspected directly, the long-text similarity produced by the method is highly interpretable, which improves the effectiveness and precision of text processing.

Brief Description of the Drawings

FIG. 1 is a flow chart of the two-stage long-text similarity calculation provided by the invention.

FIG. 2 is a flow chart of the similar-sentence detection stage of the method.

FIG. 3 is a flow chart of the graph-structure calculation stage of the method.

Detailed Description

The invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.

The invention provides a two-stage long-text similarity calculation method based on a deep learning model and a graph algorithm, which uses the semantic information of the texts themselves together with group information to compute the similarity of book-level long texts in two stages. For a group of long texts, in the first stage, multiple detection paths are used to detect similar sentence pairs between each pair of long texts; in the second stage, the matched sentence pairs are aggregated into a graph according to their sources, each long text is abstractly represented as a node on the graph, and inference and interaction operations on the graph let information pass between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between long texts is obtained from the node features.

FIG. 1 shows the flow of the two-stage long-text similarity calculation based on a deep learning model and a graph algorithm. It includes the following steps.

The first stage is the similar-sentence detection stage:

1) Build a sentence-vector extraction model based on a deep learning model (a BERT or RoBERTa model can be used) and extract sentence vectors for the sentences and clauses in the long texts.

2) Detect several types of similar sentence pairs according to the similarity of the sentence vectors.

The second stage is the graph-structure calculation stage:

3) Build the similar sentence pairs into a graph structure according to their sources (the long texts they belong to).

4) Perform information propagation and aggregation operations on the graph and update the node feature information.

The value of each dimension of a node's feature vector is the text similarity between the corresponding long texts.

A specific implementation of the method includes the following steps.

1) Split each long text into sentences at the punctuation marks that indicate sentence boundaries, and split each sentence into clauses at commas. Use the sentence-vector extraction models fine-tuned by contrastive learning (the semantic-similarity detection model and the paraphrase-similarity detection model) to extract sentence vectors for the sentences and clauses, respectively.
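The following minimal sketch, not part of the patent text, illustrates one way to implement this splitting step in Python; the punctuation sets used for sentence and clause boundaries are assumptions.

```python
import re

# Assumed punctuation marking sentence boundaries (Chinese and Western forms).
SENTENCE_END = r"[。！？!?；;]"
CLAUSE_SEP = r"[，,]"

def split_sentences(text: str) -> list:
    """Split a long text into sentences at sentence-final punctuation."""
    parts = re.split("(" + SENTENCE_END + ")", text)
    sentences = []
    # Re-attach each delimiter to the chunk it terminates.
    for i in range(0, len(parts) - 1, 2):
        sent = (parts[i] + parts[i + 1]).strip()
        if sent:
            sentences.append(sent)
    if len(parts) % 2 == 1 and parts[-1].strip():
        sentences.append(parts[-1].strip())
    return sentences

def split_clauses(sentence: str) -> list:
    """Split a sentence into clauses at commas."""
    return [c.strip() for c in re.split(CLAUSE_SEP, sentence) if c.strip()]

# Example: sentences and clauses of one long text
sentences = split_sentences("第一句话。第二句有两个子句，用逗号隔开！")
clauses = [split_clauses(s) for s in sentences]
```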

2) For the three sentence-level similarity patterns, semantic similarity, paraphrase similarity, and local similarity, detect similar sentence vectors according to the distances between sentence vectors and obtain the corresponding similar sentence pairs.

3) Merge the detection results of the three types of similar sentence pairs and normalize the statistics by the number of sentences in the long texts. Aggregate the similar sentence pairs into a graph according to their sources: each long text corresponds to one node of the graph, and the weight of an edge is determined by the number of similar sentence pairs between the two long texts.

4) Perform two rounds of information propagation and aggregation on the graph, and update the node features after fusing the group information.

The value of each dimension of a node's feature vector is the text similarity between the corresponding long texts.

The invention is further illustrated below by an example.

Example 1

Consider N electronic books in text format, treated as N long texts T_1, ..., T_N; the proposed method is used to compute the pairwise text similarity between the long texts. The method has two stages, the similar-sentence detection stage and the graph-structure calculation stage (as shown in FIG. 1).

1) Before similar-sentence detection, a sentence-vector extraction model must first be built to convert text into sentence vectors. First, all long texts are split into sentences; then a pre-trained language representation model, BERT (Bidirectional Encoder Representations from Transformers) or RoBERTa, is fine-tuned by contrastive learning to obtain the sentence-vector extraction model; the sentence-vector extraction model then converts text sentences into sentence vectors.

11) Fine-tune the BERT model by contrastive learning on sentence semantic similarity to obtain the semantic-similarity detection model.

For each segmented sentence, a sentence vector v is first extracted from the sentence. The positive example for contrastive learning of the sentence is constructed by applying dropout to the sentence vector v, and the vectors extracted from the other sentence texts in the same training batch are used as negative examples. The training loss is computed from the sentence vector and the constructed positive and negative examples, with the same design as SimCSE (Simple Contrastive Learning of Sentence Embeddings). The trained model is named the semantic-similarity detection model and is denoted M_sem.
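As an illustration (not part of the patent), the following sketch shows a SimCSE-style unsupervised contrastive loss of the kind described above: the same batch of sentences is encoded twice so that the two dropout masks yield the positive pairs, and the other sentences in the batch serve as negatives. The checkpoint name, [CLS] pooling, and temperature are assumed choices.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")
encoder.train()  # keep dropout active so the two passes differ

def encode(sentences):
    """Encode a batch of sentences into vectors (here: the [CLS] representation)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def simcse_loss(sentences, temperature=0.05):
    """Unsupervised SimCSE-style loss: two dropout views of each sentence form the
    positive pair; all other sentences in the batch are negatives."""
    z1 = encode(sentences)  # first pass, one dropout mask
    z2 = encode(sentences)  # second pass, a different dropout mask
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))  # diagonal entries are the positives
    return F.cross_entropy(sim, labels)
```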

12) Fine-tune the BERT model by contrastive learning on sentence paraphrase similarity to obtain the paraphrase-similarity detection model.

For each sentence, a sentence vector v is first extracted from the sentence. The loss function for fine-tuning the BERT model contains two parts, L_1 and L_2. L_1 has the same form as the loss function of M_sem. To compute L_2, each sentence is split into clauses at commas, and clauses are randomly selected and shuffled within the sentence text to obtain a new sentence text. Dropout is applied to the sentence vector v⁺ extracted from the new sentence text to construct the positive example for contrastive learning of the sentence, and the vectors v⁻ extracted from the other sentence texts in the same training batch are used as negative examples. L_2 is the loss function computed from the sentence vector v, the constructed positive example v⁺, and the negative examples v⁻, with the same design as SimCSE. The final training loss function is L = L_1 + λ·L_2, where λ is a hyperparameter that must be set; it adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference. The resulting model is named the paraphrase-similarity detection model and is denoted M_par.
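The sketch below, again only an illustration, shows the clause-shuffling augmentation and the combined objective L = L_1 + λ·L_2 described above; it reuses encode and simcse_loss from the previous sketch. How many clauses are kept and how they are reordered is not specified in the patent, so the selection rule here is an assumption.

```python
import random
import re

import torch
import torch.nn.functional as F

def shuffle_clauses(sentence: str) -> str:
    """Build a structurally reorganized view: split at commas, randomly keep and reorder clauses."""
    clauses = [c for c in re.split(r"[，,]", sentence) if c.strip()]
    if len(clauses) <= 1:
        return sentence
    k = random.randint(max(1, len(clauses) - 1), len(clauses))  # assumed selection rule
    kept = random.sample(clauses, k)
    random.shuffle(kept)
    return "，".join(kept)

def paraphrase_loss(sentences, lam=0.1, temperature=0.05):
    """L = L_1 + lam * L_2: L_1 is the plain SimCSE loss; for L_2 each sentence is paired
    with its clause-shuffled version as the positive example."""
    l1 = simcse_loss(sentences, temperature)
    shuffled = [shuffle_clauses(s) for s in sentences]
    z = encode(sentences)
    z_pos = encode(shuffled)  # dropout stays active, as in the positive-example construction
    sim = F.cosine_similarity(z.unsqueeze(1), z_pos.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))
    return l1 + lam * F.cross_entropy(sim, labels)
```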

2) Detect similar sentence pairs between the long texts T (as shown in FIG. 2); in this implementation, three types of similar sentence pairs are detected with three corresponding detection methods.

A. To detect semantically similar sentence pairs, the following operations are performed:

A1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

A2. Extract the feature vectors of all sentences with the semantic-similarity detection model, denoted V_sem;

A3. Deduplicate the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_sem;

A4. Compute the t-th percentile of the vector distances in P_sem and use it as the similarity threshold θ_sem;

A5. Filter out the sentence pairs in V_sem whose feature-vector distance is smaller than θ_sem; these are the semantically similar sentence pairs.

B. To detect paraphrase-similar sentence pairs, the following operations are performed:

B1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

B2. Extract the feature vectors of all sentences with the paraphrase-similarity detection model, denoted V_par;

B3. Deduplicate the sentence feature vectors V_par to obtain V'_par; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_par;

B4. Compute the t-th percentile of the vector distances in P_par and use it as the similarity threshold θ_par;

B5. Filter out the sentence pairs in V_par whose feature-vector distance is smaller than θ_par; these are the paraphrase-similar sentence pairs.

C. To detect locally similar sentence pairs, the following operations are performed:

C1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries, then split each sentence into clauses at commas;

C2. Extract the feature vectors of all clauses with the semantic-similarity detection model, denoted V_loc;

C3. Deduplicate the clause feature vectors V_loc to obtain V'_loc; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_loc;

C4. Compute the t-th percentile of the vector distances in P_loc and use it as the similarity threshold θ_loc;

C5. Filter out the clause pairs in V_loc whose feature-vector distance is smaller than θ_loc;

C6. Trace each successfully matched clause pair back to its corresponding sentence pair; these are the locally similar sentence pairs.
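All three procedures A, B, and C share the same retrieval-and-threshold pattern; the sketch below (an illustration, not the patent's reference implementation) embeds the units, deduplicates them, retrieves the top-K nearest neighbours of each unit, sets the threshold at the t-th percentile of the retrieved distances, and keeps cross-text pairs below it. Cosine distance and brute-force search are assumed implementation choices.

```python
import numpy as np

def detect_similar_pairs(vectors, owners, top_k=5, percentile=10.0):
    """vectors: (n, d) feature vectors of sentences or clauses;
    owners: list of (text_id, unit_id) recording which long text each unit came from.
    Returns unit pairs from different long texts whose distance is below the threshold."""
    vecs = np.asarray(vectors, dtype=np.float32)
    vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12)
    uniq, inverse = np.unique(vecs, axis=0, return_inverse=True)   # deduplicate vectors
    dist = 1.0 - uniq @ uniq.T                                     # cosine distance
    np.fill_diagonal(dist, np.inf)
    topk_idx = np.argsort(dist, axis=1)[:, :top_k]                 # top-K neighbours
    topk_dist = np.take_along_axis(dist, topk_idx, axis=1)
    threshold = np.percentile(topk_dist, percentile)               # t-th percentile
    pairs = []
    for i in range(len(uniq)):
        for j in topk_idx[i]:
            if dist[i, j] < threshold:
                # map deduplicated vectors back to every original unit they stand for
                for a in np.where(inverse == i)[0]:
                    for b in np.where(inverse == j)[0]:
                        if owners[a][0] != owners[b][0]:           # keep cross-text pairs only
                            pairs.append((owners[a], owners[b]))
    return pairs
```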

After the detected similar sentence pairs are obtained, the graph-structure calculation stage begins (as shown in FIG. 3).

3) Merge the detection results of the three types of similar sentence pairs, then normalize the counts by the total length of the texts to obtain the basic similarity of the long texts. Specifically, suppose there are two long texts T_i and T_j, that m_ij sentences of T_i and T_j are detected as similar (covering all three similarity types), and that the total numbers of sentences in the two long texts are n_i and n_j. The basic similarity s_ij of the two long texts is obtained by normalizing the matched-sentence count m_ij by the sentence counts n_i and n_j.
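The normalization formula itself appears only as an embedded image in the published text, so the sketch below uses one plausible reconstruction, a Dice-style normalization that divides twice the number of matched sentence pairs by the combined sentence count; treat the exact formula as an assumption.

```python
def basic_similarity(num_similar: int, n_i: int, n_j: int) -> float:
    """Basic similarity s_ij between long texts T_i and T_j.
    num_similar: sentence pairs of T_i and T_j detected as similar (all three types combined);
    n_i, n_j: total sentence counts of the two texts.
    The Dice-style normalization below is an assumed reconstruction, not the patent's formula."""
    if n_i == 0 or n_j == 0:
        return 0.0
    return 2.0 * num_similar / (n_i + n_j)

# e.g. 12 similar sentence pairs between a 300-sentence and a 500-sentence book:
# basic_similarity(12, 300, 500) -> 0.03
```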

4) Abstractly represent the long texts and their basic similarities as a similar-sentence relation graph G. Each node T_i of the relation graph represents one long text, i = 1, ..., N; the node feature is a one-hot vector whose dimension is the total number of long texts N, so long text T_i has the feature vector h_i, in which the i-th component is 1 and all other components are 0. If similar sentences exist between two long texts T_i and T_j, there is an edge between the corresponding nodes, and the weight of the edge is the basic similarity s_ij computed in the previous step.

5) Perform two rounds of information propagation and aggregation on the relation graph to obtain new node feature information and update the nodes. Here α_1 and α_2 are user-defined weights for the first and second rounds, respectively, which adjust the proportion of information aggregated on the graph in each round. The final node feature vector is h_i^(2), whose j-th component h_i^(2)[j] represents the text similarity between long text T_i and long text T_j.
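The update equations are likewise rendered as images in the published text; the sketch below therefore shows one plausible realization of the described procedure: one-hot node features, edges weighted by the basic similarity, and two rounds of weighted neighbour aggregation scaled by α_1 and α_2. The residual form of the update is an assumption.

```python
import numpy as np

def graph_similarity(basic_sim, alpha1=0.5, alpha2=0.25):
    """basic_sim: (N, N) symmetric matrix of basic similarities s_ij
    (zero where no similar sentences were found between two texts).
    Returns H, where H[i, j] is the final text similarity between T_i and T_j."""
    w = np.asarray(basic_sim, dtype=np.float64).copy()
    np.fill_diagonal(w, 0.0)          # edges connect distinct long texts only
    n = w.shape[0]
    h0 = np.eye(n)                    # one-hot node features of dimension N
    h1 = h0 + alpha1 * w @ h0         # first round of propagation and aggregation (assumed form)
    h2 = h1 + alpha2 * w @ h1         # second round
    return h2

# Usage with three long texts whose basic similarities are known:
# sims = graph_similarity(np.array([[0.0, 0.2, 0.0], [0.2, 0.0, 0.1], [0.0, 0.1, 0.0]]))
# sims[0, 1] is the final similarity between the first and second text.
```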

Computing long-text similarity with the method of the invention splits the long texts into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts themselves; it also abstracts the long texts into nodes on a graph, and through information propagation and aggregation on the graph the node representations fuse group information. Because the similar sentences can be inspected directly, the long-text similarity is highly interpretable, which improves the effectiveness and precision of text processing.

It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the embodiments, and the scope of protection of the invention is defined by the claims.

Claims (9)

1. A two-stage long-text similarity calculation method, characterized in that:

the first stage, similar-sentence detection, includes:

11) building a sentence-vector extraction model based on a deep learning model, the sentence-vector extraction model comprising a semantic-similarity detection model and a paraphrase-similarity detection model;

12) converting text into sentence vectors with the sentence-vector extraction model, and then detecting, with several detection methods, similar sentence pairs of several similarity types between each pair of long texts, including semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs;

the second stage, graph-structure computation, includes:

21) computing a basic similarity;

22) building a similar-sentence relation graph structure from the long-text similar sentence pairs and the basic similarities, where each node of the relation graph represents one long text and an edge between two nodes indicates that similar sentences exist between the two corresponding long texts;

23) performing, through operations on the similar-sentence relation graph, two rounds of information propagation and aggregation to obtain high-level node representations that fuse group information, thereby obtaining new node feature information and updating the nodes;

the value of each dimension of a node's feature vector is the text similarity between the corresponding long texts; the text similarity between long texts is obtained from the node features.

2. The two-stage long-text similarity calculation method of claim 1, characterized in that, before the similar-sentence detection stage, each long text is first split into sentences; a pre-trained language representation model, BERT or RoBERTa, is fine-tuned by contrastive learning to obtain the sentence-vector extraction model; the semantic-similarity detection model and the paraphrase-similarity detection model included in the sentence-vector extraction model extract sentence vectors for the sentences and the clauses of the long texts, respectively, thereby converting the long texts into sentence vectors.

3. The two-stage long-text similarity calculation method of claim 2, characterized in that the sentence-vector extraction model is further obtained through the following steps:

11) fine-tuning the BERT model by contrastive learning on sentence semantic similarity to obtain the semantic-similarity detection model, including: applying dropout to the extracted sentence vector to construct the positive example for contrastive learning; using the other sentence vectors in the same training batch as negative examples; training with a loss function computed from the sentence vector and the constructed positive and negative examples; and naming the trained model the semantic-similarity detection model;

12) fine-tuning the BERT model by contrastive learning on sentence paraphrase similarity to obtain the paraphrase-similarity detection model, including: extracting a sentence vector from the sentence text; splitting each sentence into clauses at commas, then randomly selecting and shuffling clauses within the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; and using the vectors extracted from the other sentence texts in the same training batch as negative examples;

the loss function for fine-tuning the BERT model contains two terms, L_1 and L_2; L_1 is the same loss function as in step 11), and L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples; the final loss function L is L = L_1 + λ·L_2, where λ is a hyperparameter that must be set and adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference; the resulting model is named the paraphrase-similarity detection model.

4. The two-stage long-text similarity calculation method of claim 1, characterized in that the several detection methods of the first stage further include detection methods for three types of similar sentence pairs, detecting semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.

5. The two-stage long-text similarity calculation method of claim 4, characterized in that, further:

A. to detect semantically similar sentence pairs, the following operations are performed:

A1. split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

A2. extract the feature vectors of all sentences with the semantic-similarity detection model, denoted V_sem;

A3. deduplicate the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_sem;

A4. compute the t-th percentile of the vector distances in P_sem as the similarity threshold θ_sem;

A5. filter out the sentence pairs in V_sem whose feature-vector distance is smaller than θ_sem; these are the semantically similar sentence pairs;

B. to detect paraphrase-similar sentence pairs, the following operations are performed:

B1. split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

B2. extract the feature vectors of all sentences with the paraphrase-similarity detection model, denoted V_par;

B3. deduplicate the sentence feature vectors V_par to obtain V'_par; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_par;

B4. compute the t-th percentile of the vector distances in P_par as the similarity threshold θ_par;

B5. filter out the sentence pairs in V_par whose feature-vector distance is smaller than θ_par; these are the paraphrase-similar sentence pairs;

C. to detect locally similar sentence pairs, the following operations are performed:

C1. split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries, then split each sentence into clauses at commas;

C2. extract the feature vectors of all clauses with the semantic-similarity detection model, denoted V_loc;

C3. deduplicate the clause feature vectors V_loc to obtain V'_loc; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_loc;

C4. compute the t-th percentile of the vector distances in P_loc as the similarity threshold θ_loc;

C5. filter out the clause pairs in V_loc whose feature-vector distance is smaller than θ_loc;

C6. trace each successfully matched clause pair back to its corresponding sentence pair; these are the locally similar sentence pairs.

6. The two-stage long-text similarity calculation method of claim 5, characterized in that, after the detection results for the three types of similar sentence pairs are merged and aggregated, the counts are normalized by the total length of the texts to obtain the basic similarity of the long texts.

7. The two-stage long-text similarity calculation method of claim 6, characterized in that, further, the basic similarity is computed as follows: given two long texts T_i and T_j, suppose m_ij sentences of T_i and T_j are detected as similar, and let n_i and n_j be the total numbers of sentences in the two long texts; the basic similarity s_ij of the two long texts is obtained by normalizing the matched-sentence count m_ij by the sentence counts n_i and n_j.

8. The two-stage long-text similarity calculation method of claim 7, characterized in that, further, the long texts and their basic similarities are represented as a similar-sentence relation graph G; each node T_i of the relation graph represents one long text, i = 1, ..., N; the node feature is a one-hot vector whose dimension is the total number of long texts N, so long text T_i has the feature vector h_i, in which the i-th component is 1 and all other components are 0; if similar sentences exist between two long texts T_i and T_j, there is an edge between the corresponding nodes, and the weight of the edge is the basic similarity s_ij.

9. The two-stage long-text similarity calculation method of claim 8, characterized in that, further, two rounds of information propagation and aggregation are performed on the relation graph to obtain new node feature information, which is used to update the nodes; α_1 and α_2 are user-defined weights for the first and second rounds, respectively, which adjust the proportion of information aggregated on the graph in each round; h_i^(1) and h_j^(1) are the feature vectors of nodes T_i and T_j on graph G after the first update; the final node feature vector is h_i^(2), whose j-th component h_i^(2)[j] represents the text similarity between long text T_i and long text T_j.
CN202210298133.6A 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts Active CN114398867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts

Publications (2)

Publication Number Publication Date
CN114398867A CN114398867A (en) 2022-04-26
CN114398867B true CN114398867B (en) 2022-06-28

Family

ID=81234598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210298133.6A Active CN114398867B (en) 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts

Country Status (1)

Country Link
CN (1) CN114398867B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970767B (en) * 2022-06-29 2025-07-04 阳光保险集团股份有限公司 A training method, device, equipment and medium for text similarity model
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division
CN120542434A (en) * 2025-05-22 2025-08-26 北京瑞泊控股(集团)有限公司 Long text representation acceleration system and method based on contrastive learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9892111B2 (en) * 2006-10-10 2018-02-13 Abbyy Production Llc Method and device to estimate similarity between documents having multiple segments

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Miguel Feria et al. Constructing a Word Similarity Graph from Vector based Word Representation for Named Entity Recognition. arXiv, 2018, pp. 1-6. *
王帅 et al. TP-AS: A Two-Stage Automatic Summarization Method for Long Texts. Journal of Chinese Information Processing, 2018, 32(6): 71-79. *

Also Published As

Publication number Publication date
CN114398867A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN114398867B (en) A two-stage similarity calculation method for long texts
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN112487203B (en) Relation extraction system integrated with dynamic word vector
CN109543183B (en) Multi-label entity-relation combined extraction method based on deep neural network and labeling strategy
CN108874937B (en) A sentiment classification method based on part-of-speech combination and feature selection
CN111737496A (en) A method for constructing fault knowledge graph of power equipment
CN111125367B (en) Multi-character relation extraction method based on multi-level attention mechanism
CN111680488B (en) Cross-lingual entity alignment method based on multi-view information of knowledge graph
CN109815336B (en) Text aggregation method and system
CN105955951B (en) A kind of method and device of message screening
CN108363816A (en) Open entity relation extraction method based on sentence justice structural model
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN113051399B (en) Small sample fine-grained entity classification method based on relational graph convolutional network
CN110532328A (en) A kind of text concept figure building method
CN109670039A (en) Semi-supervised E-commerce Review Sentiment Analysis Method Based on Tripartite Graph and Cluster Analysis
CN104008166A (en) Dialogue short text clustering method based on form and semantic similarity
CN106372061A (en) Short text similarity calculation method based on semantics
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA model
CN114332519A (en) An image description generation method based on external triples and abstract relations
CN106844328A (en) A kind of new extensive document subject matter semantic analysis and system
CN110851176A (en) A clone code detection method that automatically constructs and utilizes pseudo-clone corpus
CN115238040A (en) A method and system for constructing knowledge graph of steel materials science
CN116992040A (en) Knowledge graph completion method and system based on conceptual diagram
CN116070620A (en) Information processing method and system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant