CN118193726A - Visual patent retrieval method based on pre-training language model - Google Patents
- Publication number
- CN118193726A (application CN202410348493.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- keyword
- searched
- keywords
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of natural language processing, and specifically to a visual patent retrieval method based on a pre-trained language model. In a keyword-based patent search scenario, the method uses a pre-trained language model to extract keywords from patent texts and represent them as word vectors, reduces the dimensionality of the patent features returned by a search, and visualizes the results as a scatter plot. Compared with a list-style display of search results, this is more intuitive and conveys richer information, such as the degree of similarity between patents and how the patents cluster.
Description
Technical Field
The present invention relates to the field of natural language processing, and in particular to a visual patent retrieval method based on a pre-trained language model.
Background Art
The primary task of patent retrieval is to find relevant patent texts in massive patent databases. At present, the most common retrieval strategy is keyword search: a query expression is built from the keywords to be searched, and matching patent texts are retrieved from the database.
Keyword search is prone to missed hits: it can easily overlook a significant number of important patent texts with high technical similarity, making high-precision retrieval difficult. To address this problem, patent CN112000783A discloses a patent recommendation method, apparatus, device, and storage medium based on text-similarity analysis. The method obtains a subject keyword and descriptive keywords. Using the subject keyword together with all descriptive keywords as search terms, it retrieves a base set of similar texts; using the subject keyword with each descriptive keyword individually, it retrieves an extended set of similar texts. It then traverses the extended set and, for each extended text, computes its similarity to the texts in the base set from the extended text's feature words and its associated search terms; whenever that similarity to any base text exceeds a predetermined threshold, the extended text is moved into the base set. This method improves the recall of similar texts and reduces the miss rate.
Although the prior art above can improve recall to some extent, the following problems remain in practice:
1. The results are presented mainly as a rough numeric ranking of similarity scores, which makes it hard to see the differences between scores at a glance. Attaching a similarity value to every entry in a long result list is also not intuitive; the user must page through the list to find each value.
2. The result list does not include similarities between the retrieved patent texts themselves, so the searcher cannot see how the retrieved patents relate to one another.
In summary, there is still considerable room for improvement in how patent search results are displayed.
Summary of the Invention
To overcome the technical problems in the prior art, the present invention provides a visual patent retrieval method based on a pre-trained language model. The invention clearly and intuitively displays the similarity between the search keywords and the retrieved patent texts, as well as the similarities among the retrieved patent texts themselves.
To achieve the above object, the present invention provides the following technical solution:
A visual patent retrieval method based on a pre-trained language model comprises the following steps:
S1. Train a RoBERTa + Bi-LSTM model on patent texts annotated with keywords, to be used for extracting keywords from patents.
S2. Feed each patent text in the database, in a specified input format, to the RoBERTa + Bi-LSTM model from S1; extract its keywords and their word vectors, and sum the word vectors to form the patent text's high-dimensional vector representation.
S3. Feed the keywords to be searched, in a specified input format, to the RoBERTa model from S1; obtain each keyword's word vector and sum the vectors to form the query text's high-dimensional vector representation.
S4. Compute, in turn, the cosine similarity between the query vector from S3 and the high-dimensional vector of every patent text produced in S2; select the patent texts whose cosine distance to the query is below a set threshold (i.e., whose cosine similarity is sufficiently high) as retrieval candidates.
S5. Feed the query vector from S3 and the candidate patent vectors obtained in S4 into the manifold dimensionality-reduction model Barnes-Hut t-SNE, reducing all of them to 2 dimensions.
S6. Plot the reduced query vector and candidate patent vectors as points in a two-dimensional plane, forming a scatter plot in which the distances between points visualize the similarity relationships among the search results.
As a further aspect of the invention: before a patent text from the database is fed to the RoBERTa model, it is preprocessed into the input format T1:
T1 = ([CLS], TITLE, [SEP], ABSTRACT, [SEP], IPC_TEXT, [SEP], MAIN_TEXT);
where [CLS] is a placeholder marking the start of the text; [SEP] is a separator; TITLE is the position, in the flattened text sequence, of the patent title; ABSTRACT is the position of the abstract of the specification; IPC_TEXT here is the position of the patent's IPC classification text; and MAIN_TEXT is the position of the description of the invention. This input format makes full use of the structured information in a patent text.
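The T1 layout can be sketched as a small formatting helper. This is an illustrative reconstruction, not the patent's actual code: the function name, the argument names, and the plain string-concatenation scheme are assumptions based on the description above.

```python
def build_t1(title: str, abstract: str, ipc_text: str, main_text: str) -> str:
    """Assemble the T1 input sequence: [CLS] marks the start of the text,
    and [SEP] separates the four structured fields (title, abstract,
    IPC classification text, main description of the invention)."""
    return "[CLS]" + "[SEP]".join([title, abstract, ipc_text, main_text])

# Hypothetical field values, for illustration only.
t1 = build_t1("一种专利检索方法", "本发明公开了一种检索方法。", "G06F16/30", "具体实施方式如下。")
```

In a real pipeline the string would then be tokenized, with [CLS] and [SEP] mapped to the model's special tokens rather than kept as literal text.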
As a further aspect of the invention: there are one or more keywords to be searched; they are arranged in order into a keyword sequence, which is fed to the RoBERTa model. Before that, the sequence is preprocessed into the input format T2 = ([CLS], IPC_TEXT, [SEP], KEYWORD_1, [SEP], KEYWORD_2, ..., [SEP], KEYWORD_N), where IPC_TEXT here is the position of the patent IPC classification text in the flattened sequence, and KEYWORD_1 through KEYWORD_N are the positions of the first through Nth keywords to be searched.
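A minimal sketch of the T2 query layout, with the optional IPC field (the invention later notes that IPC_TEXT may be kept or dropped depending on the search). The helper name and joining scheme are assumptions, not the patent's code.

```python
from typing import List, Optional

def build_t2(keywords: List[str], ipc_text: Optional[str] = None) -> str:
    """Assemble the T2 query sequence: [CLS], the optional IPC
    classification text, then each keyword, all separated by [SEP]."""
    fields = ([ipc_text] if ipc_text else []) + list(keywords)
    return "[CLS]" + "[SEP]".join(fields)

q_with_ipc = build_t2(["图像识别", "卷积网络"], ipc_text="G06F16/30")
q_plain = build_t2(["图像识别", "卷积网络"])  # IPC restriction omitted
```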
As a further aspect of the invention: when the RoBERTa model is used to produce vectors for the keywords in a patent text, or for the keywords in the query sequence, the output vectors of the model's last 4 hidden layers are concatenated to obtain the corresponding word vector.
As a further aspect of the invention: the Barnes-Hut t-SNE algorithm is used for dimensionality reduction.
As a further aspect of the invention: the word vectors of the keywords in the sequence state vector, and of the keywords to be searched, all have dimension 3072; after Barnes-Hut t-SNE dimensionality reduction, the dimension is reduced from 3072 to 2.
As a further aspect of the invention: the IPC_TEXT field of the query input format may be kept or omitted depending on the search requirements; it is kept when the IPC classification must be restricted, and omitted otherwise.
Compared with the prior art, the present invention has the following beneficial effects:
1. In a keyword-based patent search scenario, the invention extracts keywords from patent texts and represents them as word vectors using a pre-trained language model, reduces the dimensionality of the patent features returned by a search, and visualizes the results as a scatter plot. Compared with a list display, this is more intuitive and conveys richer information, such as the degree of similarity between patents and how the patents cluster.
2. After extracting a patent's keywords with the pre-trained model, the invention uses the keywords' word vectors as the patent's feature representation, rather than storing the keywords themselves or vectorizing the full patent text; the query keywords are handled the same way. On the one hand this provides the basis for dimensionality-reduction visualization; on the other, it lets users control keyword weights and thereby adjust their search preferences.
3. For the high-dimensional word vectors, the invention uses the Barnes-Hut t-SNE algorithm for dimensionality reduction. Unlike the common PCA approach, Barnes-Hut t-SNE does not rely on a linear transformation of the features, and it produces a better visualization after the high-dimensional vectors are reduced to 2 dimensions.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the main workflow of the present invention.
FIG. 2 is a flowchart of keyword word-vector generation in the present invention.
FIG. 3 is a flowchart of similarity computation in the present invention.
FIG. 4 is a scatter plot produced by the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to FIG. 1, an embodiment of the present invention provides a visual patent retrieval method based on a pre-trained language model. The invention aims to visualize patent search results and is mainly intended for keyword search scenarios. Besides the keywords supplied with a patent record itself, keywords can also be extracted from the patent text. This can be done with a pre-trained language model, as is now common in natural language processing; the invention uses RoBERTa, which has been trained on a Chinese corpus and is therefore better suited to Chinese-language scenarios. Keyword extraction can be treated as a sequence-labeling problem: a BiLSTM layer is appended to the RoBERTa model to process RoBERTa's output sequence and decide which positions in the sequence are keywords.
Since patent texts are usually standardized, with sections such as the title, background, abstract, and description of the invention, the original RoBERTa input format is adjusted to use this structured information as additional context. Furthermore, because the same keyword can have different meanings in different fields, the Chinese text corresponding to the patent's classification code is also added to the context. The final input format is T1 = ([CLS], TITLE, [SEP], ABSTRACT, [SEP], IPC_TEXT, [SEP], MAIN_TEXT), where [CLS] is a placeholder marking the start of the text, [SEP] is a separator, and TITLE, ABSTRACT, IPC_TEXT, and MAIN_TEXT are the positions of the corresponding sections in the flattened patent text.
Patent texts with annotated keyword positions are converted into this input format and used to fine-tune the RoBERTa + Bi-LSTM model, yielding a keyword-extraction model suited to patent texts.
Once the RoBERTa + BiLSTM keyword-extraction model is trained, a patent text is fed in and its keywords are extracted. Each keyword must also be represented as a word vector, which can be obtained directly from the RoBERTa model. As shown in FIG. 2, the invention concatenates the hidden-layer vectors of RoBERTa's last 4 layers as the word vector of the keyword at the corresponding position in the sequence. Each layer's hidden vector has dimension 768, so each keyword's word vector has dimension 768 × 4 = 3072, which is very high. The model is applied to every patent text in the database to extract keywords, which are stored as word vectors and serve as the patent's keyword features.
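The last-4-layer concatenation and the per-patent summation can be sketched as follows. This is a shape-level illustration with random arrays standing in for a RoBERTa encoder's per-layer hidden states; the function names are assumptions, not the patent's code.

```python
import numpy as np

def keyword_vector(hidden_states, position):
    """Concatenate the last 4 layers' hidden vectors at `position`.
    `hidden_states` is a list of (seq_len, 768) arrays, one per encoder
    layer, ordered from first to last layer."""
    return np.concatenate([layer[position] for layer in hidden_states[-4:]])

def patent_feature(keyword_vectors):
    """A patent's feature vector is the sum of its keyword word vectors."""
    return np.sum(keyword_vectors, axis=0)

rng = np.random.default_rng(0)
# Stand-in for 12 encoder layers over a 32-token sequence.
states = [rng.normal(size=(32, 768)) for _ in range(12)]
vec = keyword_vector(states, position=5)      # one keyword, 768 × 4 = 3072 dims
feat = patent_feature([vec, keyword_vector(states, position=9)])
```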
As shown in FIG. 3, at search time the one or more keywords entered by the user are assembled into a query text in the format ([CLS], IPC_TEXT, [SEP], KEYWORD_1, [SEP], KEYWORD_2, ..., [SEP], KEYWORD_N) and fed to the RoBERTa model; the BiLSTM layer is not needed here. IPC_TEXT is optional: if the user specifies a patent classification field for the query, the IPC text is added to the context, matching the model's training setup and providing fuller context information. The hidden vectors of RoBERTa's last 4 layers are then extracted, as above, as the word-vector representation of each search keyword.
Word vectors obtained from such a pre-trained model are context-dependent: the word "apple", for example, is represented differently in "apples at the fruit stand" and in "Apple Inc.". To some extent, addition of word vectors also behaves semantically; for instance, the vector of "crown" plus "male" lies close to the vector of "king". Building on these properties, the invention takes a weighted sum of the vectors of the user's keywords. The weights are equal by default, and the user can adjust the weight of each keyword to tune their search preferences.
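The weighted summation of query keyword vectors can be sketched as below; equal weights are the default, mirroring the description. The function name and signature are illustrative assumptions.

```python
import numpy as np

def query_vector(keyword_vecs, weights=None):
    """Weighted sum of the query keywords' word vectors.
    Weights default to equal; the user can raise a keyword's weight
    to bias retrieval toward that keyword."""
    vecs = np.asarray(keyword_vecs, dtype=float)
    if weights is None:
        weights = np.ones(len(vecs))
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * vecs).sum(axis=0)
```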
Next, the patent database is first pre-filtered to the patents that contain the user's keywords (when there are several keywords, containing any one of them suffices). As shown in FIG. 3, the keyword vectors of each candidate patent are summed to form the patent's feature vector. The query keyword vector is then compared with each candidate's feature vector by cosine distance, and the K nearest patents are returned as the search results.
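The nearest-K selection by cosine distance can be sketched in a few lines of NumPy. This is a minimal illustration of the ranking step described above, not the patent's implementation; it assumes all vectors are nonzero.

```python
import numpy as np

def top_k_by_cosine(query, patent_vecs, k):
    """Return the indices of the k patents whose feature vectors have
    the smallest cosine distance (largest cosine similarity) to the query."""
    q = query / np.linalg.norm(query)
    p = patent_vecs / np.linalg.norm(patent_vecs, axis=1, keepdims=True)
    sims = p @ q                  # cosine similarity for every patent
    return np.argsort(-sims)[:k]  # nearest = most similar

patents = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx = top_k_by_cosine(np.array([1.0, 0.0]), patents, k=2)
```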
After the K patents returned by the keyword search are obtained, the results are visualized. Since the query word vector and the patent feature vectors all have dimension 3072, which is far too high to plot, dimensionality reduction is needed. The invention uses the Barnes-Hut t-SNE algorithm to reduce these high-dimensional vectors to 2 dimensions. For high-dimensional samples, Barnes-Hut t-SNE preserves local structure after reduction: samples that are close in the high-dimensional space stay close, and distant samples stay far apart. It also lowers the computational complexity and runs faster than exact t-SNE, at the cost of being limited to 2- or 3-dimensional outputs, which is exactly what visualization requires. After the query vector and candidate feature vectors are reduced to 2 dimensions, they are displayed as a scatter plot. In the resulting plot, shown in FIG. 4, patents whose points lie closer to the query point are more similar to the query, and the distances between patent points directly reflect their mutual similarity. The returned patents may also form clusters, with some patents closer to a given topic than others, giving the searcher richer information.
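The patent names Barnes-Hut t-SNE but gives no implementation. A minimal sketch, assuming scikit-learn's `TSNE` (whose `method="barnes_hut"` variant is its default) and random stand-in vectors in place of the real 3072-dimensional query and patent features:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Stand-in for the query vector (row 0) plus K = 50 candidate patent vectors.
vectors = rng.normal(size=(51, 3072))

tsne = TSNE(n_components=2, method="barnes_hut", perplexity=15, random_state=42)
points = tsne.fit_transform(vectors)   # (51, 2): one 2-D point per vector

query_xy, patent_xy = points[0], points[1:]  # ready to draw as a scatter plot
```

Note that Barnes-Hut t-SNE requires `perplexity` to be smaller than the number of samples, and scikit-learn restricts `method="barnes_hut"` to at most 3 output dimensions, which matches the invention's use case.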
The above is only a preferred embodiment of the present invention, and the scope of protection is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed herein, according to the technical solution and inventive concept of the present invention, shall fall within the scope of protection of the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410348493.1A CN118193726B (en) | 2024-03-26 | 2024-03-26 | A visual patent retrieval method based on pre-trained language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118193726A true CN118193726A (en) | 2024-06-14 |
CN118193726B CN118193726B (en) | 2025-09-26 |
Family
ID=91404962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410348493.1A Active CN118193726B (en) | 2024-03-26 | 2024-03-26 | A visual patent retrieval method based on pre-trained language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118193726B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020016787A1 (en) * | 2000-06-28 | 2002-02-07 | Matsushita Electric Industrial Co., Ltd. | Apparatus for retrieving similar documents and apparatus for extracting relevant keywords |
JP2005227994A (en) * | 2004-02-12 | 2005-08-25 | Shiyoufuu:Kk | Patent/utility model document search system and program |
CN106897392A (en) * | 2017-02-04 | 2017-06-27 | 同济大学 | Technology competition and patent prewarning analysis method that a kind of knowledge based finds |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN110929019A (en) * | 2018-08-30 | 2020-03-27 | 深圳市蓝灯鱼智能科技有限公司 | Information display method and device, storage medium and electronic device |
CN111259110A (en) * | 2020-01-13 | 2020-06-09 | 武汉大学 | College patent personalized recommendation system |
CN113934899A (en) * | 2021-09-17 | 2022-01-14 | 杭州电子科技大学 | Visual retrieval method of multivariate graph database based on attribute-enhanced representation learning |
WO2023060795A1 (en) * | 2021-10-12 | 2023-04-20 | 平安科技(深圳)有限公司 | Automatic keyword extraction method and apparatus, and device and storage medium |
KR20230103791A (en) * | 2021-12-30 | 2023-07-07 | 국민대학교산학협력단 | Semantic visualization method and apparatus of dynamic topic modeling |
CN116701431A (en) * | 2023-05-25 | 2023-09-05 | 东云睿连(武汉)计算技术有限公司 | Data retrieval method and system based on large language model |
- 2024-03-26: application CN202410348493.1A granted as patent CN118193726B (active)
Also Published As
Publication number | Publication date |
---|---|
CN118193726B (en) | 2025-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109829104B (en) | Semantic similarity based pseudo-correlation feedback model information retrieval method and system | |
CN109885672B (en) | Question-answering type intelligent retrieval system and method for online education | |
CN110442777B (en) | BERT-based pseudo-correlation feedback model information retrieval method and system | |
WO2022141878A1 (en) | End-to-end language model pretraining method and system, and device and storage medium | |
CN105005578A (en) | Multimedia target information visual analysis system | |
EP4147142A1 (en) | Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model | |
CN119441281A (en) | An intelligent visualization and text association method for multimodal knowledge graphs | |
CN117332113A (en) | Appearance patent image retrieval method, device, equipment and medium | |
US20240095268A1 (en) | Productivity improvements in document comprehension | |
CN118485151A (en) | Methods for explaining AI semantic analysis results | |
CN118093792A (en) | Method, device, computer equipment and storage medium for searching object | |
WO2025194913A1 (en) | Data query method and apparatus, computer device, and storage medium | |
CN118820803A (en) | A multimodal retrieval system based on similarity of image and text feature vectors | |
CN120086307A (en) | Data reordering and retrieval method and system based on RAG | |
CN111581326B (en) | A method for extracting answer information based on heterogeneous external knowledge source graph structure | |
CN111309926B (en) | Entity linking method and device and electronic equipment | |
CN114969310B (en) | Multi-dimensional data-oriented sectional search ordering system design method | |
CN119646201A (en) | Large Model Knowledge Base Retrieval Method Based on Multi-granularity Retrieval | |
CN119378552A (en) | A named entity recognition method combining large models and dynamic examples | |
Ronghui et al. | Application of Improved Convolutional Neural Network in Text Classification. | |
CN118193726A (en) | Visual patent retrieval method based on pre-training language model | |
CN118503821A (en) | Innovative cross-domain self-adaptive prompt learning method | |
Paul et al. | NrityaManch: An annotation and retrieval system for Bharatanatyam dance | |
CN114942972B (en) | Object search method and device | |
CN116431768A (en) | An information retrieval method and system based on user's implicit retrieval intention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |