
CN114398867B - A two-stage similarity calculation method for long texts - Google Patents


Info

Publication number
CN114398867B
Authority
CN
China
Prior art keywords
sentence
similarity
text
similar
long
Prior art date
Legal status
Active
Application number
CN202210298133.6A
Other languages
Chinese (zh)
Other versions
CN114398867A (en)
Inventor
段思宇
苏祺
王军
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN202210298133.6A
Publication of CN114398867A
Application granted
Publication of CN114398867B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a two-stage method for computing the similarity of long texts. In the first stage, similar-sentence detection, a sentence-vector extraction model is built on a deep learning model and used to convert text into sentence vectors; similar sentence pairs of several similarity types are then detected between each pair of long texts. In the second stage, graph-structure computation, a basic similarity is computed, and the similar sentence pairs and basic similarities are expressed as a similar-sentence relation graph in which each node represents one long text. Operations on the graph yield high-level node representations that fuse group information, and the node features are updated so that the value of each dimension of a node's feature vector is the text similarity between the corresponding long texts, i.e., the text similarity between the long texts is obtained. The method makes long-text similarity strongly interpretable and improves the effectiveness and precision of text processing.

Description

A Two-Stage Similarity Calculation Method for Long Texts

Technical Field

The invention relates to text similarity calculation methods, and in particular to a two-stage long-text similarity calculation method based on a deep learning model and a graph algorithm.

Background

Text similarity calculation is an important task in natural language processing; the related techniques aim to measure, by technical means, how similar texts are to one another. Texts of different lengths require different similarity calculation methods. Computing the similarity of long texts requires extracting, compressing, and matching a large amount of textual information, which has important applications in news recommendation, article recommendation, citation recommendation, and document clustering.

Most existing techniques are based on keyword extraction: a few keywords are extracted to represent a long text and then used in further similarity calculation. Because the result depends on only a few keywords, a large amount of semantic information is lost and robustness is poor.

Methods based on deep learning models encode the full text with a deep learning model and then compute similarity on the encodings. However, existing deep learning models achieve good encodings only for text sequences of up to a few hundred words, whereas book-like long texts often run to tens or even hundreds of thousands of words, which existing models cannot encode well. Moreover, because the similarity is computed in a latent space, interpretability is poor.

In addition, both kinds of techniques consider only the information of the long texts being compared; the calculation is relatively isolated and makes no use of group information.

Summary of the Invention

The invention provides a two-stage long-text similarity calculation method based on a deep learning model and a graph algorithm. Using the semantic information of the texts themselves together with group information, it computes the similarity of book-level long texts in two stages.

The principle of the invention is as follows: for a group of long texts, in the first stage, several detection methods are used to find similar sentence pairs between each pair of long texts; in the second stage, the similar sentence pairs are merged and aggregated according to the long texts they come from, each long text is abstractly represented as a node on a graph, and inference and interaction operations on the graph let information pass between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between long texts is obtained from the node features.

The technical scheme provided by the invention is as follows:

A two-stage long-text similarity calculation method comprises the following steps.

The first stage, similar-sentence detection, includes:

building a sentence-vector extraction model based on a deep learning model, the sentence-vector extraction model comprising a semantic-similarity detection model and a paraphrase-similarity detection model;

converting text into sentence vectors with the sentence-vector extraction model;

detecting, with several detection methods, similar sentence pairs of several similarity types between each pair of long texts.

The second stage, graph-structure computation, includes:

computing a basic similarity;

representing, based on a graph algorithm, the long-text similar sentence pairs and the basic similarities as a similar-sentence relation graph, in which each node represents one long text;

obtaining, through inference and interaction operations on the similar-sentence relation graph, high-level node representations that fuse group information;

updating the node feature information, where the value of each dimension of a node's feature vector is the text similarity between the corresponding long texts;

obtaining the text similarity between long texts from the node features.

Further, before the similar-sentence detection stage, each long text is first split into sentences; a pre-trained language representation model (a BERT or RoBERTa model) is fine-tuned by contrastive learning to obtain the sentence-vector extraction model; the semantic-similarity detection model and the paraphrase-similarity detection model included in the sentence-vector extraction model extract sentence vectors for the sentences and the clauses of the long texts, respectively, thereby converting the long texts into sentence vectors.

Further, the sentence-vector extraction model is obtained through the following steps:

11) Fine-tune the BERT model by contrastive learning on sentence semantic similarity to obtain the semantic-similarity detection model, including:

applying dropout to the extracted sentence vector to construct the positive example for contrastive learning;

using the other sentence vectors in the same training batch as negative examples for contrastive learning;

training with a loss function computed from the sentence vector and the constructed positive and negative examples;

naming the trained model the semantic-similarity detection model.

12) Fine-tune the BERT model by contrastive learning on sentence paraphrase similarity to obtain the paraphrase-similarity detection model, including:

extracting a sentence vector from the sentence text;

splitting each sentence into clauses at commas, then randomly selecting and shuffling clauses within the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; using the vectors extracted from the other sentence texts in the same training batch as negative examples for contrastive learning.

The loss function used to fine-tune the BERT model contains two terms, L_1 and L_2. L_1 is the same loss function as in step 11); L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples. The final loss function L is L = L_1 + λ·L_2, where λ is a hyperparameter that must be set and adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference.

The resulting model is named the paraphrase-similarity detection model.

Further, the several detection methods of the first stage include detection methods for three types of similar sentence pairs: semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.

A. To detect semantically similar sentence pairs, the following operations are performed:

A1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

A2. Extract the feature vectors of all sentences with the semantic-similarity detection model, denoted V_sem;

A3. Deduplicate the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_sem;

A4. Compute the t-th percentile of the vector distances in P_sem and use it as the similarity threshold θ_sem;

A5. Filter out the sentence pairs in V_sem whose feature-vector distance is smaller than θ_sem; these are the semantically similar sentence pairs.

B. To detect paraphrase-similar sentence pairs, the following operations are performed:

B1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

B2. Extract the feature vectors of all sentences with the paraphrase-similarity detection model, denoted V_par;

B3. Deduplicate the sentence feature vectors V_par to obtain V'_par; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_par;

B4. Compute the t-th percentile of the vector distances in P_par and use it as the similarity threshold θ_par;

B5. Filter out the sentence pairs in V_par whose feature-vector distance is smaller than θ_par; these are the paraphrase-similar sentence pairs.

C. To detect locally similar sentence pairs, the following operations are performed:

C1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries, then split each sentence into clauses at commas;

C2. Extract the feature vectors of all clauses with the semantic-similarity detection model, denoted V_loc;

C3. Deduplicate the clause feature vectors V_loc to obtain V'_loc; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_loc;

C4. Compute the t-th percentile of the vector distances in P_loc and use it as the similarity threshold θ_loc;

C5. Filter out the clause pairs in V_loc whose feature-vector distance is smaller than θ_loc;

C6. Trace each successfully matched clause pair back to its corresponding sentence pair; these are the locally similar sentence pairs.

Further, after the detection results for the three types of similar sentence pairs are merged and aggregated, the counts are normalized by the total length of the texts to obtain the basic similarity of the long texts.

Further, the basic similarity is computed as follows: given two long texts T_i and T_j, suppose m_ij sentences of T_i and T_j are detected as similar, and let n_i and n_j be the total numbers of sentences in the two long texts. The basic similarity s_ij of the two long texts is obtained by normalizing the matched-sentence count m_ij by the sentence counts n_i and n_j.

Further, the long texts and their basic similarities are represented as a similar-sentence relation graph G. Each node T_i of the relation graph represents one long text, i = 1, ..., N; the node feature is a one-hot vector whose dimension is the total number of long texts N, so long text T_i has the feature vector h_i, in which the i-th component is 1 and all other components are 0. If similar sentences exist between two long texts T_i and T_j, there is an edge between the corresponding nodes, and the weight of that edge is the basic similarity s_ij.

Further, two rounds of information propagation and aggregation are performed on the relation graph to obtain new node feature information, which is used to update the nodes. Here h_i^(0) and h_j^(0) are the initial feature vectors of nodes T_i and T_j on graph G; α_1 and α_2 are user-defined weights for the first and second rounds, respectively, which adjust the proportion of information aggregated on the graph in each round; h_i^(1) and h_j^(1) are the feature vectors of nodes T_i and T_j after the first update. The final node feature vector is h_i^(2), whose j-th component h_i^(2)[j] represents the text similarity between long text T_i and long text T_j.

Compared with the prior art, the invention has the following beneficial effects:

With the technical scheme provided by the invention, when computing long-text similarity, the long texts are split into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts themselves. The long texts are also abstracted into nodes on a graph, and information propagation and aggregation on the graph let the node representations fuse group information. At the same time, because the similar sentences can be inspected directly, the long-text similarity produced by the method is highly interpretable, which improves the effectiveness and precision of text processing.

Brief Description of the Drawings

FIG. 1 is a flow chart of the two-stage long-text similarity calculation provided by the invention.

FIG. 2 is a flow chart of the similar-sentence detection stage of the method.

FIG. 3 is a flow chart of the graph-structure calculation stage of the method.

Detailed Description

The invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.

The invention provides a two-stage long-text similarity calculation method based on a deep learning model and a graph algorithm, which uses the semantic information of the texts themselves together with group information to compute the similarity of book-level long texts in two stages. For a group of long texts, in the first stage, multiple detection paths are used to detect similar sentence pairs between each pair of long texts; in the second stage, the matched sentence pairs are aggregated into a graph according to their sources, each long text is abstractly represented as a node on the graph, and inference and interaction operations on the graph let information pass between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between long texts is obtained from the node features.

FIG. 1 shows the flow of the two-stage long-text similarity calculation based on a deep learning model and a graph algorithm. It includes the following steps.

The first stage is the similar-sentence detection stage:

1) Build a sentence-vector extraction model based on a deep learning model (a BERT or RoBERTa model can be used) and extract sentence vectors for the sentences and clauses in the long texts.

2) Detect several types of similar sentence pairs according to the similarity of the sentence vectors.

The second stage is the graph-structure calculation stage:

3) Build the similar sentence pairs into a graph structure according to their sources (the long texts they belong to).

4) Perform information propagation and aggregation operations on the graph and update the node feature information.

The value of each dimension of a node's feature vector is the text similarity between the corresponding long texts.

A specific implementation of the method includes the following steps.

1) Split each long text into sentences at the punctuation marks that indicate sentence boundaries, and split each sentence into clauses at commas. Use the sentence-vector extraction models fine-tuned by contrastive learning (the semantic-similarity detection model and the paraphrase-similarity detection model) to extract sentence vectors for the sentences and clauses, respectively.
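The following minimal sketch, not part of the patent text, illustrates one way to implement this splitting step in Python; the punctuation sets used for sentence and clause boundaries are assumptions.

```python
import re

# Assumed punctuation marking sentence boundaries (Chinese and Western forms).
SENTENCE_END = r"[。！？!?；;]"
CLAUSE_SEP = r"[，,]"

def split_sentences(text: str) -> list:
    """Split a long text into sentences at sentence-final punctuation."""
    parts = re.split("(" + SENTENCE_END + ")", text)
    sentences = []
    # Re-attach each delimiter to the chunk it terminates.
    for i in range(0, len(parts) - 1, 2):
        sent = (parts[i] + parts[i + 1]).strip()
        if sent:
            sentences.append(sent)
    if len(parts) % 2 == 1 and parts[-1].strip():
        sentences.append(parts[-1].strip())
    return sentences

def split_clauses(sentence: str) -> list:
    """Split a sentence into clauses at commas."""
    return [c.strip() for c in re.split(CLAUSE_SEP, sentence) if c.strip()]

# Example: sentences and clauses of one long text
sentences = split_sentences("第一句话。第二句有两个子句，用逗号隔开！")
clauses = [split_clauses(s) for s in sentences]
```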

2) For the three sentence-level similarity patterns, semantic similarity, paraphrase similarity, and local similarity, detect similar sentence vectors according to the distances between sentence vectors and obtain the corresponding similar sentence pairs.

3) Merge the detection results of the three types of similar sentence pairs and normalize the statistics by the number of sentences in the long texts. Aggregate the similar sentence pairs into a graph according to their sources: each long text corresponds to one node of the graph, and the weight of an edge is determined by the number of similar sentence pairs between the two long texts.

4) Perform two rounds of information propagation and aggregation on the graph, and update the node features after fusing the group information.

The value of each dimension of a node's feature vector is the text similarity between the corresponding long texts.

The invention is further illustrated below by an example.

Example 1

Consider N electronic books in text format, treated as N long texts T_1, ..., T_N; the proposed method is used to compute the pairwise text similarity between the long texts. The method has two stages, the similar-sentence detection stage and the graph-structure calculation stage (as shown in FIG. 1).

1) Before similar-sentence detection, a sentence-vector extraction model must first be built to convert text into sentence vectors. First, all long texts are split into sentences; then a pre-trained language representation model, BERT (Bidirectional Encoder Representations from Transformers) or RoBERTa, is fine-tuned by contrastive learning to obtain the sentence-vector extraction model; the sentence-vector extraction model then converts text sentences into sentence vectors.

11) Fine-tune the BERT model by contrastive learning on sentence semantic similarity to obtain the semantic-similarity detection model.

For each segmented sentence, a sentence vector v is first extracted from the sentence. The positive example for contrastive learning of the sentence is constructed by applying dropout to the sentence vector v, and the vectors extracted from the other sentence texts in the same training batch are used as negative examples. The training loss is computed from the sentence vector and the constructed positive and negative examples, with the same design as SimCSE (Simple Contrastive Learning of Sentence Embeddings). The trained model is named the semantic-similarity detection model and is denoted M_sem.
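As an illustration (not part of the patent), the following sketch shows a SimCSE-style unsupervised contrastive loss of the kind described above: the same batch of sentences is encoded twice so that the two dropout masks yield the positive pairs, and the other sentences in the batch serve as negatives. The checkpoint name, [CLS] pooling, and temperature are assumed choices.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")
encoder.train()  # keep dropout active so the two passes differ

def encode(sentences):
    """Encode a batch of sentences into vectors (here: the [CLS] representation)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def simcse_loss(sentences, temperature=0.05):
    """Unsupervised SimCSE-style loss: two dropout views of each sentence form the
    positive pair; all other sentences in the batch are negatives."""
    z1 = encode(sentences)  # first pass, one dropout mask
    z2 = encode(sentences)  # second pass, a different dropout mask
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))  # diagonal entries are the positives
    return F.cross_entropy(sim, labels)
```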

12) Fine-tune the BERT model by contrastive learning on sentence paraphrase similarity to obtain the paraphrase-similarity detection model.

For each sentence, a sentence vector v is first extracted from the sentence. The loss function for fine-tuning the BERT model contains two parts, L_1 and L_2. L_1 has the same form as the loss function of M_sem. To compute L_2, each sentence is split into clauses at commas, and clauses are randomly selected and shuffled within the sentence text to obtain a new sentence text. Dropout is applied to the sentence vector v⁺ extracted from the new sentence text to construct the positive example for contrastive learning of the sentence, and the vectors v⁻ extracted from the other sentence texts in the same training batch are used as negative examples. L_2 is the loss function computed from the sentence vector v, the constructed positive example v⁺, and the negative examples v⁻, with the same design as SimCSE. The final training loss function is L = L_1 + λ·L_2, where λ is a hyperparameter that must be set; it adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference. The resulting model is named the paraphrase-similarity detection model and is denoted M_par.
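The sketch below, again only an illustration, shows the clause-shuffling augmentation and the combined objective L = L_1 + λ·L_2 described above; it reuses encode and simcse_loss from the previous sketch. How many clauses are kept and how they are reordered is not specified in the patent, so the selection rule here is an assumption.

```python
import random
import re

import torch
import torch.nn.functional as F

def shuffle_clauses(sentence: str) -> str:
    """Build a structurally reorganized view: split at commas, randomly keep and reorder clauses."""
    clauses = [c for c in re.split(r"[，,]", sentence) if c.strip()]
    if len(clauses) <= 1:
        return sentence
    k = random.randint(max(1, len(clauses) - 1), len(clauses))  # assumed selection rule
    kept = random.sample(clauses, k)
    random.shuffle(kept)
    return "，".join(kept)

def paraphrase_loss(sentences, lam=0.1, temperature=0.05):
    """L = L_1 + lam * L_2: L_1 is the plain SimCSE loss; for L_2 each sentence is paired
    with its clause-shuffled version as the positive example."""
    l1 = simcse_loss(sentences, temperature)
    shuffled = [shuffle_clauses(s) for s in sentences]
    z = encode(sentences)
    z_pos = encode(shuffled)  # dropout stays active, as in the positive-example construction
    sim = F.cosine_similarity(z.unsqueeze(1), z_pos.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))
    return l1 + lam * F.cross_entropy(sim, labels)
```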

2) Detect similar sentence pairs between the long texts T (as shown in FIG. 2); in this implementation, three types of similar sentence pairs are detected with three corresponding detection methods.

A. To detect semantically similar sentence pairs, the following operations are performed:

A1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

A2. Extract the feature vectors of all sentences with the semantic-similarity detection model, denoted V_sem;

A3. Deduplicate the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_sem;

A4. Compute the t-th percentile of the vector distances in P_sem and use it as the similarity threshold θ_sem;

A5. Filter out the sentence pairs in V_sem whose feature-vector distance is smaller than θ_sem; these are the semantically similar sentence pairs.

B. To detect paraphrase-similar sentence pairs, the following operations are performed:

B1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

B2. Extract the feature vectors of all sentences with the paraphrase-similarity detection model, denoted V_par;

B3. Deduplicate the sentence feature vectors V_par to obtain V'_par; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_par;

B4. Compute the t-th percentile of the vector distances in P_par and use it as the similarity threshold θ_par;

B5. Filter out the sentence pairs in V_par whose feature-vector distance is smaller than θ_par; these are the paraphrase-similar sentence pairs.

C. To detect locally similar sentence pairs, the following operations are performed:

C1. Split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries, then split each sentence into clauses at commas;

C2. Extract the feature vectors of all clauses with the semantic-similarity detection model, denoted V_loc;

C3. Deduplicate the clause feature vectors V_loc to obtain V'_loc; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_loc;

C4. Compute the t-th percentile of the vector distances in P_loc and use it as the similarity threshold θ_loc;

C5. Filter out the clause pairs in V_loc whose feature-vector distance is smaller than θ_loc;

C6. Trace each successfully matched clause pair back to its corresponding sentence pair; these are the locally similar sentence pairs.
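All three procedures A, B, and C share the same retrieval-and-threshold pattern; the sketch below (an illustration, not the patent's reference implementation) embeds the units, deduplicates them, retrieves the top-K nearest neighbours of each unit, sets the threshold at the t-th percentile of the retrieved distances, and keeps cross-text pairs below it. Cosine distance and brute-force search are assumed implementation choices.

```python
import numpy as np

def detect_similar_pairs(vectors, owners, top_k=5, percentile=10.0):
    """vectors: (n, d) feature vectors of sentences or clauses;
    owners: list of (text_id, unit_id) recording which long text each unit came from.
    Returns unit pairs from different long texts whose distance is below the threshold."""
    vecs = np.asarray(vectors, dtype=np.float32)
    vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12)
    uniq, inverse = np.unique(vecs, axis=0, return_inverse=True)   # deduplicate vectors
    dist = 1.0 - uniq @ uniq.T                                     # cosine distance
    np.fill_diagonal(dist, np.inf)
    topk_idx = np.argsort(dist, axis=1)[:, :top_k]                 # top-K neighbours
    topk_dist = np.take_along_axis(dist, topk_idx, axis=1)
    threshold = np.percentile(topk_dist, percentile)               # t-th percentile
    pairs = []
    for i in range(len(uniq)):
        for j in topk_idx[i]:
            if dist[i, j] < threshold:
                # map deduplicated vectors back to every original unit they stand for
                for a in np.where(inverse == i)[0]:
                    for b in np.where(inverse == j)[0]:
                        if owners[a][0] != owners[b][0]:           # keep cross-text pairs only
                            pairs.append((owners[a], owners[b]))
    return pairs
```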

After the detected similar sentence pairs are obtained, the graph-structure calculation stage begins (as shown in FIG. 3).

3) Merge the detection results of the three types of similar sentence pairs, then normalize the counts by the total length of the texts to obtain the basic similarity of the long texts. Specifically, suppose there are two long texts T_i and T_j, that m_ij sentences of T_i and T_j are detected as similar (covering all three similarity types), and that the total numbers of sentences in the two long texts are n_i and n_j. The basic similarity s_ij of the two long texts is obtained by normalizing the matched-sentence count m_ij by the sentence counts n_i and n_j.
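The normalization formula itself appears only as an embedded image in the published text, so the sketch below uses one plausible reconstruction, a Dice-style normalization that divides twice the number of matched sentence pairs by the combined sentence count; treat the exact formula as an assumption.

```python
def basic_similarity(num_similar: int, n_i: int, n_j: int) -> float:
    """Basic similarity s_ij between long texts T_i and T_j.
    num_similar: sentence pairs of T_i and T_j detected as similar (all three types combined);
    n_i, n_j: total sentence counts of the two texts.
    The Dice-style normalization below is an assumed reconstruction, not the patent's formula."""
    if n_i == 0 or n_j == 0:
        return 0.0
    return 2.0 * num_similar / (n_i + n_j)

# e.g. 12 similar sentence pairs between a 300-sentence and a 500-sentence book:
# basic_similarity(12, 300, 500) -> 0.03
```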

4) Abstractly represent the long texts and their basic similarities as a similar-sentence relation graph G. Each node T_i of the relation graph represents one long text, i = 1, ..., N; the node feature is a one-hot vector whose dimension is the total number of long texts N, so long text T_i has the feature vector h_i, in which the i-th component is 1 and all other components are 0. If similar sentences exist between two long texts T_i and T_j, there is an edge between the corresponding nodes, and the weight of the edge is the basic similarity s_ij computed in the previous step.

5) Perform two rounds of information propagation and aggregation on the relation graph to obtain new node feature information and update the nodes. Here α_1 and α_2 are user-defined weights for the first and second rounds, respectively, which adjust the proportion of information aggregated on the graph in each round. The final node feature vector is h_i^(2), whose j-th component h_i^(2)[j] represents the text similarity between long text T_i and long text T_j.
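The update equations are likewise rendered as images in the published text; the sketch below therefore shows one plausible realization of the described procedure: one-hot node features, edges weighted by the basic similarity, and two rounds of weighted neighbour aggregation scaled by α_1 and α_2. The residual form of the update is an assumption.

```python
import numpy as np

def graph_similarity(basic_sim, alpha1=0.5, alpha2=0.25):
    """basic_sim: (N, N) symmetric matrix of basic similarities s_ij
    (zero where no similar sentences were found between two texts).
    Returns H, where H[i, j] is the final text similarity between T_i and T_j."""
    w = np.asarray(basic_sim, dtype=np.float64).copy()
    np.fill_diagonal(w, 0.0)          # edges connect distinct long texts only
    n = w.shape[0]
    h0 = np.eye(n)                    # one-hot node features of dimension N
    h1 = h0 + alpha1 * w @ h0         # first round of propagation and aggregation (assumed form)
    h2 = h1 + alpha2 * w @ h1         # second round
    return h2

# Usage with three long texts whose basic similarities are known:
# sims = graph_similarity(np.array([[0.0, 0.2, 0.0], [0.2, 0.0, 0.1], [0.0, 0.1, 0.0]]))
# sims[0, 1] is the final similarity between the first and second text.
```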

Computing long-text similarity with the method of the invention splits the long texts into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts themselves; it also abstracts the long texts into nodes on a graph, and through information propagation and aggregation on the graph the node representations fuse group information. Because the similar sentences can be inspected directly, the long-text similarity is highly interpretable, which improves the effectiveness and precision of text processing.

It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the embodiments, and the scope of protection of the invention is defined by the claims.

Claims (9)

1. A two-stage long-text similarity calculation method, characterized in that:

the first stage, similar-sentence detection, includes:

11) building a sentence-vector extraction model based on a deep learning model, the sentence-vector extraction model comprising a semantic-similarity detection model and a paraphrase-similarity detection model;

12) converting text into sentence vectors with the sentence-vector extraction model, and then detecting, with several detection methods, similar sentence pairs of several similarity types between each pair of long texts, including semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs;

the second stage, graph-structure computation, includes:

21) computing a basic similarity;

22) building a similar-sentence relation graph structure from the long-text similar sentence pairs and the basic similarities, where each node of the relation graph represents one long text and an edge between two nodes indicates that similar sentences exist between the two corresponding long texts;

23) performing, through operations on the similar-sentence relation graph, two rounds of information propagation and aggregation to obtain high-level node representations that fuse group information, thereby obtaining new node feature information and updating the nodes;

the value of each dimension of a node's feature vector is the text similarity between the corresponding long texts; the text similarity between long texts is obtained from the node features.

2. The two-stage long-text similarity calculation method of claim 1, characterized in that, before the similar-sentence detection stage, each long text is first split into sentences; a pre-trained language representation model, BERT or RoBERTa, is fine-tuned by contrastive learning to obtain the sentence-vector extraction model; the semantic-similarity detection model and the paraphrase-similarity detection model included in the sentence-vector extraction model extract sentence vectors for the sentences and the clauses of the long texts, respectively, thereby converting the long texts into sentence vectors.

3. The two-stage long-text similarity calculation method of claim 2, characterized in that the sentence-vector extraction model is further obtained through the following steps:

11) fine-tuning the BERT model by contrastive learning on sentence semantic similarity to obtain the semantic-similarity detection model, including: applying dropout to the extracted sentence vector to construct the positive example for contrastive learning; using the other sentence vectors in the same training batch as negative examples; training with a loss function computed from the sentence vector and the constructed positive and negative examples; and naming the trained model the semantic-similarity detection model;

12) fine-tuning the BERT model by contrastive learning on sentence paraphrase similarity to obtain the paraphrase-similarity detection model, including: extracting a sentence vector from the sentence text; splitting each sentence into clauses at commas, then randomly selecting and shuffling clauses within the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; and using the vectors extracted from the other sentence texts in the same training batch as negative examples;

the loss function for fine-tuning the BERT model contains two terms, L_1 and L_2; L_1 is the same loss function as in step 11), and L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples; the final loss function L is L = L_1 + λ·L_2, where λ is a hyperparameter that must be set and adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference; the resulting model is named the paraphrase-similarity detection model.

4. The two-stage long-text similarity calculation method of claim 1, characterized in that the several detection methods of the first stage further include detection methods for three types of similar sentence pairs, detecting semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.

5. The two-stage long-text similarity calculation method of claim 4, characterized in that, further:

A. to detect semantically similar sentence pairs, the following operations are performed:

A1. split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

A2. extract the feature vectors of all sentences with the semantic-similarity detection model, denoted V_sem;

A3. deduplicate the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_sem;

A4. compute the t-th percentile of the vector distances in P_sem as the similarity threshold θ_sem;

A5. filter out the sentence pairs in V_sem whose feature-vector distance is smaller than θ_sem; these are the semantically similar sentence pairs;

B. to detect paraphrase-similar sentence pairs, the following operations are performed:

B1. split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries;

B2. extract the feature vectors of all sentences with the paraphrase-similarity detection model, denoted V_par;

B3. deduplicate the sentence feature vectors V_par to obtain V'_par; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_par;

B4. compute the t-th percentile of the vector distances in P_par as the similarity threshold θ_par;

B5. filter out the sentence pairs in V_par whose feature-vector distance is smaller than θ_par; these are the paraphrase-similar sentence pairs;

C. to detect locally similar sentence pairs, the following operations are performed:

C1. split each long text T_i into sentences at the punctuation marks that indicate sentence boundaries, then split each sentence into clauses at commas;

C2. extract the feature vectors of all clauses with the semantic-similarity detection model, denoted V_loc;

C3. deduplicate the clause feature vectors V_loc to obtain V'_loc; for each feature vector, find its top-K most similar vectors, and record all vector pairs obtained as P_loc;

C4. compute the t-th percentile of the vector distances in P_loc as the similarity threshold θ_loc;

C5. filter out the clause pairs in V_loc whose feature-vector distance is smaller than θ_loc;

C6. trace each successfully matched clause pair back to its corresponding sentence pair; these are the locally similar sentence pairs.

6. The two-stage long-text similarity calculation method of claim 5, characterized in that, after the detection results for the three types of similar sentence pairs are merged and aggregated, the counts are normalized by the total length of the texts to obtain the basic similarity of the long texts.

7. The two-stage long-text similarity calculation method of claim 6, characterized in that, further, the basic similarity is computed as follows: given two long texts T_i and T_j, suppose m_ij sentences of T_i and T_j are detected as similar, and let n_i and n_j be the total numbers of sentences in the two long texts; the basic similarity s_ij of the two long texts is obtained by normalizing the matched-sentence count m_ij by the sentence counts n_i and n_j.

8. The two-stage long-text similarity calculation method of claim 7, characterized in that, further, the long texts and their basic similarities are represented as a similar-sentence relation graph G; each node T_i of the relation graph represents one long text, i = 1, ..., N; the node feature is a one-hot vector whose dimension is the total number of long texts N, so long text T_i has the feature vector h_i, in which the i-th component is 1 and all other components are 0; if similar sentences exist between two long texts T_i and T_j, there is an edge between the corresponding nodes, and the weight of the edge is the basic similarity s_ij.

9. The two-stage long-text similarity calculation method of claim 8, characterized in that, further, two rounds of information propagation and aggregation are performed on the relation graph to obtain new node feature information, which is used to update the nodes; α_1 and α_2 are user-defined weights for the first and second rounds, respectively, which adjust the proportion of information aggregated on the graph in each round; h_i^(1) and h_j^(1) are the feature vectors of nodes T_i and T_j on graph G after the first update; the final node feature vector is h_i^(2), whose j-th component h_i^(2)[j] represents the text similarity between long text T_i and long text T_j.
CN202210298133.6A 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts Active CN114398867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts

Publications (2)

Publication Number Publication Date
CN114398867A CN114398867A (en) 2022-04-26
CN114398867B true CN114398867B (en) 2022-06-28

Family

ID=81234598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210298133.6A Active CN114398867B (en) 2022-03-25 2022-03-25 A two-stage similarity calculation method for long texts

Country Status (1)

Country Link
CN (1) CN114398867B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970767B (en) * 2022-06-29 2025-07-04 阳光保险集团股份有限公司 A training method, device, equipment and medium for text similarity model
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division
CN120542434A (en) * 2025-05-22 2025-08-26 北京瑞泊控股(集团)有限公司 Long text representation acceleration system and method based on contrastive learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9892111B2 (en) * 2006-10-10 2018-02-13 Abbyy Production Llc Method and device to estimate similarity between documents having multiple segments

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Miguel Feria et al. Constructing a Word Similarity Graph from Vector based Word Representation for Named Entity Recognition. arXiv, 2018, pp. 1-6. *
王帅 et al. TP-AS: A Two-Stage Automatic Summarization Method for Long Texts. Journal of Chinese Information Processing, 2018, 32(6): 71-79. *

Also Published As

Publication number Publication date
CN114398867A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN114398867B (en) A two-stage similarity calculation method for long texts
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN112487203B (en) Relation extraction system integrated with dynamic word vector
CN109543183B (en) Multi-label entity-relation combined extraction method based on deep neural network and labeling strategy
CN108874937B (en) A sentiment classification method based on part-of-speech combination and feature selection
CN111737496A (en) A method for constructing fault knowledge graph of power equipment
CN111125367B (en) Multi-character relation extraction method based on multi-level attention mechanism
CN111680488B (en) Cross-lingual entity alignment method based on multi-view information of knowledge graph
CN109815336B (en) Text aggregation method and system
CN105955951B (en) A kind of method and device of message screening
CN108363816A (en) Open entity relation extraction method based on sentence justice structural model
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN113051399B (en) Small sample fine-grained entity classification method based on relational graph convolutional network
CN110532328A (en) A kind of text concept figure building method
CN109670039A (en) Semi-supervised E-commerce Review Sentiment Analysis Method Based on Tripartite Graph and Cluster Analysis
CN104008166A (en) Dialogue short text clustering method based on form and semantic similarity
CN106372061A (en) Short text similarity calculation method based on semantics
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA model
CN114332519A (en) An image description generation method based on external triples and abstract relations
CN106844328A (en) A kind of new extensive document subject matter semantic analysis and system
CN110851176A (en) A clone code detection method that automatically constructs and utilizes pseudo-clone corpus
CN115238040A (en) A method and system for constructing knowledge graph of steel materials science
CN116992040A (en) Knowledge graph completion method and system based on conceptual diagram
CN116070620A (en) Information processing method and system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant