
CN119577115A - Intelligent patent retrieval method and system based on large language model re-ranking technology - Google Patents

Intelligent patent retrieval method and system based on large language model re-ranking technology

Info

Publication number
CN119577115A
Authority
CN
China
Prior art keywords
text
model
vector
retrieval
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510112691.2A
Other languages
Chinese (zh)
Inventor
金玉赫
吴晨帆
徐青伟
裴非
严长春
范娥媚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinghe Zhiyuan Information Technology Co ltd
Beijing Xinghe Zhiyuan Technology Co ltd
Original Assignee
Beijing Xinghe Zhiyuan Information Technology Co ltd
Beijing Xinghe Zhiyuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinghe Zhiyuan Information Technology Co ltd, Beijing Xinghe Zhiyuan Technology Co ltd filed Critical Beijing Xinghe Zhiyuan Information Technology Co ltd
Priority to CN202510112691.2A priority Critical patent/CN119577115A/en
Publication of CN119577115A publication Critical patent/CN119577115A/en
Pending legal-status Critical Current


Classifications

    • G06F16/3344: Query execution using natural language analysis
    • G06F16/338: Presentation of query results
    • G06F18/22: Matching criteria, e.g. proximity measures (pattern recognition)
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30: Semantic analysis
    • G06Q50/184: Intellectual property management
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Tourism & Hospitality (AREA)
  • Technology Law (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present application discloses an intelligent patent retrieval method and system based on large language model re-ranking technology. The method handles multiple types of input data, including patent publication numbers, free text, and text files; it extracts key information to construct a query text, performs retrieval recall combining sparse and dense vector retrieval, and fuses the results with a min-sum merge algorithm. A denoising step uses a sliding-window strategy and vector encoding to extract the text fragments relevant to the query. Finally, a re-ranking model is trained and the recommendation list is fine-tuned to optimize the ranking of the retrieval results, improving retrieval accuracy and relevance. The intelligent retrieval method of the present invention improves the efficiency and precision of patent retrieval.

Description

Intelligent patent retrieval method and system based on large language model reordering technology
Technical Field
The present application relates to the field of computer technology, in particular to natural language processing and deep learning, and more specifically to an intelligent patent retrieval method and system based on a large language model reordering technology.
Background
In the current information age, the pace of technological innovation continues to accelerate, and the number of patent documents has grown dramatically. This growth poses an unprecedented challenge: how to efficiently and accurately retrieve valuable information from vast amounts of patent data has become a major problem in patent analysis and research and development. Traditional patent retrieval systems often rely on simple keyword matching and static database queries; such methods handle complex queries poorly, fail to capture deep text semantics, and struggle to meet users' demands for efficient and accurate retrieval.
Disclosure of Invention
The present application provides an intelligent patent retrieval method and system based on a large language model reordering technology, which can accept multiple input types, including an existing patent publication number, free text, or a text file, and which achieves deep semantic understanding and efficient retrieval of patent documents through advanced text processing and vectorization techniques.
In a first aspect, an intelligent patent retrieval method based on a large language model reordering technology is provided, the method comprising:
performing retrieval recall using a query text, specifically comprising sparse vector retrieval based on a bag-of-words model and dense vector retrieval based on a vector model;
denoising, namely denoising each candidate patent using a chunking and sliding-window strategy based on the retrieval recall sequence;
training a reordering model on the denoised results, using a general-purpose large language model as the base; first, a data set containing positive and negative example pairs is constructed, where positive examples come from patent citation data and negative examples are obtained through different mechanisms; then a prompt is constructed to elicit the latent capability of the large language model, and the prompt is concatenated with the query text and the text of the document to be compared to form the model input;
inputting the denoised text pairs into the trained large language model to obtain a set of similarity scores, sorting the scores to obtain an initial recommendation sequence, and fine-tuning the initial recommendation list to obtain a final recommendation list, where a text pair comprises the query text and the text of a document to be compared.
Optionally, before the retrieval recall in step one, multiple types of input data are acquired and subjected to data preprocessing and parsing, where the multiple types of input data comprise patent publication numbers, free text, and text files;
and a corresponding text segment is selected as the query text according to a preset field parameter in the preprocessing and parsing result: if a patent publication number is input, the independent claims are used by default; if free text is input, the preprocessed content is the query text; and if a text file is input, the recognized and checked content is used as the query text for subsequent vector encoding and similarity matching.
Optionally, when the input data is a patent publication number, the method specifically comprises: retrieving the corresponding patent document from a patent database by the patent publication number; parsing the acquired patent document with preset rules and extracting key fields, where the key fields at least comprise the title, abstract, claims, description, inventor information, and IPC classification numbers; splitting the claim text into individual claims, distinguishing independent claims from dependent claims based on preset rule matching, and constructing a relation tree between the independent and dependent claims;
when the input data is free text, the method specifically comprises: performing punctuation checking and automatic typo correction on the input free text to ensure that the text quality meets the requirements of subsequent processing;
when the input data is a text file, the method specifically comprises: selecting a corresponding parsing strategy according to the file type, where the file type at least comprises TXT, Word document, text PDF, or image PDF; for TXT, Word documents, or text PDF, parsing and reading the text directly; for an image PDF file, invoking OCR for text recognition, followed by further punctuation checking and automatic typo correction of the OCR output.
Optionally, the retrieval recall performs semantic retrieval recall using the query text, divided into sparse vector retrieval based on a bag-of-words model and dense vector retrieval based on a vector model, specifically comprising:
constructing sparse vectors with the TF-IDF and BM25 algorithms, generating vectors for the title, abstract, claims, and description of each patent;
performing dense vector retrieval with a BGE vector model, whose ability to distinguish different input texts is improved through contrastive-loss optimization;
and fusing the two retrieval results with a min-sum merge algorithm to generate a sequence of patent publication numbers.
Optionally, in denoising, each candidate patent is denoised using a chunking and sliding-window strategy based on the retrieval recall sequence, specifically comprising:
chunking the designated paragraphs of the document to be compared so that semantic integrity is preserved;
and vector-encoding each chunk and the query text with a vector model, computing their similarities, computing the mean similarity between the query text and the chunks within a window of preset maximum length N, plotting the similarity curve, and locating its peaks.
Optionally, fine-tuning the initial recommendation list specifically comprises:
extracting the "technical problem to be solved" fields of the present application and of each document to be compared with a "technical problem to be solved" extraction model, and vector-encoding them;
computing the vector similarity between the present application and each document to be compared, and scaling the values proportionally;
acquiring the IPC classification number of each document to be compared and weighting according to the matching level;
and sorting by the fully weighted scores to obtain the final recommendation list.
In a second aspect, an intelligent patent retrieval system based on a large language model reordering technology is provided, the system comprising:
a retrieval recall module for performing semantic retrieval recall using a query text, divided into sparse vector retrieval based on a bag-of-words model and dense vector retrieval based on a vector model;
a denoising module for denoising each candidate patent using a chunking and sliding-window strategy based on the retrieval recall sequence;
a reordering model training module for training a reordering model on the denoised results, using a general-purpose large language model as the base; first constructing a data set containing positive and negative example pairs, where positive examples come from patent citation data and negative examples are obtained through different mechanisms; then constructing a prompt to elicit the latent capability of the large language model, and concatenating the prompt with the query text and the text of the document to be compared to form the model input;
and a fine-tuning module for inputting the denoised text pairs into the trained large language model to obtain a set of similarity scores, sorting the scores to obtain an initial recommendation sequence, and fine-tuning the initial recommendation list to obtain a final recommendation list, where a text pair comprises the query text and the text of a document to be compared.
In a third aspect, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, where the processor, when executing the computer program, implements the intelligent patent retrieval method based on the large language model reordering technology according to any one of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the intelligent patent retrieval method based on the large language model reordering technology according to any one of the first aspect.
In a fifth aspect, a computer program product is provided, comprising a computer program/instructions which, when executed by a processor, implement the intelligent patent retrieval method based on the large language model reordering technology according to any one of the first aspect.
The technical solutions provided by the embodiments of the present application have at least the following beneficial effects:
(1) Sliding-window denoising strategy: by introducing a sliding-window strategy to denoise the documents to be compared, the system can dynamically evaluate the relevance of text chunks and accurately extract the fragments relevant to the query text. This strategy effectively reduces interference from noisy data and improves the accuracy of text similarity computation.
(2) Two-way retrieval: the technique adopts a two-way retrieval strategy of sparse vectors (based on the TF-IDF and BM25 algorithms) and dense vectors (based on the BGE vector model), and fuses the results with a min-sum merge algorithm. This combination exploits the strengths of both retrieval modes and enhances the system's retrieval capability and accuracy on patent documents.
(3) Large language model reordering: with a pre-trained large language model as the base, combined with a customized vector layer, a reordering model is trained and applied. This method optimizes the ranking of retrieval results according to the deep semantic relations of the text, improving the performance of the patent retrieval system and user satisfaction.
(4) Multi-dimensional weighting: the recommendation list is weighted not only by the "technical problem to be solved" but also by the matching degree of IPC classification numbers and other bibliographic information. Through such multi-dimensional weighting, patent documents meeting the user's needs can be screened and recommended more accurately, effectively improving the relevance and accuracy of the recommendations.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below. It will be apparent to those of ordinary skill in the art that the following drawings are exemplary only, and that other implementations can be derived from them without inventive effort.
FIG. 1 is a flow chart of steps provided in an embodiment of the present application;
FIG. 2 is a flowchart of a denoising process according to an embodiment of the present application;
FIG. 3 is a block diagram of an intelligent patent retrieval system provided by the present application;
Fig. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the description of the present application, the terms "comprises," "comprising," and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus, or steps or elements added through further optimization of the inventive concept.
In the current information age, the pace of technological innovation continues to accelerate, and the number of patent documents has grown dramatically. This growth poses an unprecedented challenge: how to efficiently and accurately retrieve valuable information from vast amounts of patent data has become a major problem in patent analysis and research and development. Traditional patent retrieval systems often rely on simple keyword matching and static database queries; such methods handle complex queries poorly, fail to capture deep text semantics, and struggle to meet users' demands for efficient and accurate retrieval.
To address these problems, this work proposes an intelligent patent retrieval method based on a large language model reordering technology. The system can accept various input forms, including an existing patent publication number, free text, or a text file, and achieves deep semantic understanding and efficient retrieval of patent documents through advanced text processing and vectorization techniques. The system first extracts the relevant patent text from a patent database, including the title, abstract, claims, and description; splits and vector-encodes this content according to preset rules; and then performs similarity retrieval in the constructed vector database, achieving fast and accurate retrieval and recommendation. In addition, the system introduces a large-model reordering technique, training the model on documents cited during patent examination, which further improves the relevance and accuracy of the retrieval results. This reordering-based retrieval method allows the system to continuously adapt to new data and new retrieval demands, and offers strong accuracy, practicality, and extensibility.
The notable advantages of this model are its high flexibility and intelligence: it can cope effectively with large-scale, dynamically changing patent data environments, and it addresses the weak semantic understanding, excessive noise, and inaccurate recommendations of traditional patent retrieval techniques. It is of significance for accelerating technological innovation, improving research and development efficiency, and promoting the sound use and protection of intellectual property.
In one embodiment, as shown in fig. 1, an intelligent patent retrieval method based on a large language model reordering technology is provided. The method can be applied to a server and further comprises data preprocessing before the retrieval recall:
The method can process multiple types of input data, including patent publication numbers, free text, and text files. The specific operations for each input type are as follows:
treatment of patent publication No.:
database query, firstly, the system searches and acquires corresponding patent documents in the patent database through the patent publication number.
And (3) analyzing the fields, namely analyzing the acquired patent document by using preset rules, and extracting key fields such as titles, abstracts, claims, specifications, inventor information, IPC classification numbers and the like.
Claim parsing, namely claim splitting the claim text, distinguishing independent claim groups from dependent claim groups based on preset rule matching, and constructing a relation tree between the independent claim groups and the dependent claim groups.
The description is analyzed, namely, the description is divided into a technical field, a background technology, an invention content, a drawing description, a specific embodiment and the like, so that the information of each part is ensured to be accurately captured.
Processing free text:
Text checking: punctuation checking and automatic typo correction are performed on the input free text to ensure that the text quality meets the requirements of subsequent processing.
Processing a text file:
File type identification: an appropriate parsing strategy is selected according to the file type (e.g., TXT, Word document, text PDF, or image PDF).
Text parsing: TXT, Word documents, and text PDFs are parsed and read directly.
Image recognition: for an image PDF file, Optical Character Recognition (OCR) is invoked to recognize the text.
Text checking: further punctuation checking and automatic typo correction are performed on the OCR output.
After the above steps, the system selects the corresponding text segment as the query text (query_text) according to the preset field parameters: if the input is a patent publication number, the independent claims are used by default; if the input is free text, the preprocessed content is the query text; and if the input is a text file, the recognized and checked content is used as the query text. The query text is then used for subsequent vector encoding and similarity matching.
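For illustration, a minimal sketch of this query-text selection rule follows; the record layout and the normalize() helper are hypothetical placeholders, not part of the original disclosure.

```python
# Hypothetical sketch of the query-text selection rule; the record layout and
# the normalize() stand-in are illustrative, not from the disclosure.

def normalize(text: str) -> str:
    # Stand-in for punctuation checking and automatic typo correction.
    return " ".join(text.split())

def build_query_text(kind: str, payload) -> str:
    if kind == "publication_number":
        # payload: a parsed patent record; independent claims by default.
        return normalize(payload["independent_claims"])
    if kind == "free_text":
        return normalize(payload)   # preprocessed free text is the query
    if kind == "text_file":
        return normalize(payload)   # text already extracted (OCR if image PDF)
    raise ValueError(f"unsupported input type: {kind}")

record = {"independent_claims": "1. A method comprising: receiving a query ..."}
print(build_query_text("publication_number", record))
```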
Step one, retrieval recall:
Semantic retrieval is performed with the query_text along two paths: one retrieves sparse vectors based on a bag-of-words model, the other retrieves dense vectors based on a vector model. The two result sequences are finally fused with a fusion algorithm, namely the min-sum merge algorithm commonly used in industry: a rank-based sequence merging algorithm whose inputs are two sequences (the result sequence from sparse retrieval and the result sequence from dense retrieval) and whose output is a single publication-number sequence (the merged result).
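The disclosure gives no pseudocode for this merge; the following is a minimal sketch of one common rank-based min-sum merge, assuming that a document absent from one list receives a default rank one past that list's end.

```python
# Hypothetical sketch of a rank-based min-sum merge of two recall lists.
# Documents missing from one list get a penalty rank (list length + 1).

def min_sum_merge(sparse_ids: list[str], dense_ids: list[str]) -> list[str]:
    default_s = len(sparse_ids) + 1
    default_d = len(dense_ids) + 1
    rank_s = {doc: i + 1 for i, doc in enumerate(sparse_ids)}
    rank_d = {doc: i + 1 for i, doc in enumerate(dense_ids)}
    candidates = set(sparse_ids) | set(dense_ids)
    # Sort by the sum of the two ranks; a smaller total rank comes first.
    return sorted(candidates,
                  key=lambda doc: rank_s.get(doc, default_s)
                                + rank_d.get(doc, default_d))

merged = min_sum_merge(["CN-A", "CN-B", "CN-C"], ["CN-B", "CN-D"])
print(merged)  # CN-B ranks first: present and highly ranked in both lists
```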
Sparse retrieval constructs sparse vectors with the TF-IDF and BM25 algorithms. Four sparse vectors, for the title, abstract, claims, and description (four vectors per patent), are built for the whole patent corpus in advance; at query time, input parameters control which bag or bags of words are searched, and the results are merged and returned.
Dense retrieval uses a BGE vector model: a two-tower model trained on a pre-constructed data set comprising positive example pairs and several kinds of negative example pairs, organized at a fixed ratio. The training objective is to improve the vector model's ability to distinguish different input texts, so that similar patents receive higher similarity scores and dissimilar patents lower ones, where "similar" here means similar enough to create a novelty or inventiveness conflict rather than merely textually similar.
Generating a sparse vector:
1. Vocabulary construction: all documents are segmented into words, stop words are removed, and all unique terms are extracted to form the vocabulary. This vocabulary defines the dimensionality of the subsequent sparse vectors.
2. The TF-IDF and BM25 weights of each word in the document are calculated:
TF-IDF algorithm:

$$\mathrm{tfidf}(t,d)=\mathrm{tf}(t,d)\cdot\log\frac{N}{\mathrm{df}(t)}$$

where t is a term, d is a document (a patent title or another field), tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing term t; tfidf(t, d) is the TF-IDF weight of term t in document d.
BM25 algorithm:

$$\mathrm{BM25}(t,d)=\mathrm{IDF}(t)\cdot\frac{\mathrm{tf}(t,d)\,(k_1+1)}{\mathrm{tf}(t,d)+k_1\Big(1-b+b\,\frac{|d|}{\mathrm{avgdl}}\Big)}$$

where $k_1$ and b are tuning parameters, generally set to $k_1=1.2$ and $b=0.75$; |d| is the number of words in document d (a patent title or another field), and avgdl is the average number of words over all documents. This gives the BM25 weight of term t in document d.
3. Generating the sparse vector:
An initial vector is created for the query_text whose length equals the vocabulary size.
Each term of the query_text is traversed: if the term exists in the vocabulary, its TF-IDF or BM25 weight is written to the corresponding position of the vector; positions whose terms do not occur in the query_text are filled with 0.
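As a concrete illustration of steps 1 to 3, here is a small self-contained sketch of TF-IDF sparse-vector construction (BM25 would differ only in the weighting formula); the toy corpus and whitespace tokenizer are stand-ins.

```python
import math
from collections import Counter

# Toy corpus; a real system would index patent titles/abstracts/claims/descriptions.
docs = ["patent retrieval with language models",
        "dense vector retrieval of patent text",
        "image segmentation with neural networks"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
index = {t: i for i, t in enumerate(vocab)}
N = len(docs)
df = Counter(t for toks in tokenized for t in set(toks))

def tfidf_vector(text: str) -> list[float]:
    vec = [0.0] * len(vocab)              # one slot per vocabulary term
    for term, freq in Counter(text.split()).items():
        if term in index:                 # out-of-vocabulary terms contribute 0
            vec[index[term]] = freq * math.log(N / df[term])
    return vec

q = tfidf_vector("patent retrieval")
print([(vocab[i], round(w, 3)) for i, w in enumerate(q) if w])
# [('patent', 0.405), ('retrieval', 0.405)]
```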
Generating a dense vector:
1. Data set construction and preprocessing: patent data pairs are collected, comprising positive example pairs and negative example pairs constructed by several different methods. Positive examples are typically published citation pairs, while negative examples are similar but unrelated patent pairs constructed in various ways.
2. Model training: a BGE model is constructed. The BGE model is a two-tower model in which each tower typically consists of several transformer layers; the task of each tower is to convert a text (e.g., a patent abstract or claims) into a fixed-length vector.
Contrastive loss computation:
Model parameters are optimized with a contrastive loss (Contrastive Loss), whose goal is to pull the vectors of positive pairs closer together and push the vectors of negative pairs farther apart. The contrastive loss is computed as

$$L=\sum_{(i,j)\in P}\Big[d(v_i,v_j)^2+\sum_{k}\max\big(0,\ m-d(v_i,v_k)\big)^2\Big]$$

where P is the set of positive example pairs, i and j index the two vectors of a positive pair, k indexes the negative example vectors paired with i, m is the boundary value (margin), typically a positive number, and $d(\cdot,\cdot)$ is usually the Euclidean or cosine distance.
The loss function is minimized by gradient descent, updating the model weights.
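A minimal sketch of such two-tower contrastive training is shown below, assuming PyTorch; the tiny mean-pooled encoder is a stand-in for the actual BGE architecture, and both inputs share one tower.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in encoder: a mean-pooled embedding bag, not real BGE.
class TinyTower(torch.nn.Module):
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(vocab_size, dim)  # mean pooling by default

    def forward(self, token_ids):
        return F.normalize(self.emb(token_ids), dim=-1)    # unit-length vectors

def contrastive_loss(anchor, positive, negatives, margin=0.5):
    # Pull positive pairs together; push negatives beyond the margin.
    d_pos = F.pairwise_distance(anchor, positive)          # (batch,)
    d_neg = torch.cdist(anchor, negatives)                 # (batch, n_neg)
    return (d_pos.pow(2) + F.relu(margin - d_neg).pow(2).sum(dim=1)).mean()

tower = TinyTower()
opt = torch.optim.AdamW(tower.parameters(), lr=1e-4)
q = torch.randint(0, 30000, (8, 64))    # toy token ids: 8 queries
p = torch.randint(0, 30000, (8, 64))    # their cited (positive) patents
n = torch.randint(0, 30000, (24, 64))   # shared pool of negatives

loss = contrastive_loss(tower(q), tower(p), tower(n))
loss.backward()
opt.step()
```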
3. Vector generation: the query_text is input into the trained dense vector model to obtain a fixed-dimension vector.
Step two, denoising:
Using the top-N recall list obtained in step one, each candidate patent in the list is denoised with a chunking and sliding-window strategy.
In this method, the text segment of the present application (by default the independent claims) and a text segment of the document to be compared must ultimately be fed into the pre-designed large language model to obtain a similarity score. Repeating this for every document to be compared yields a similarity score against the application's text segment for each, hence a list of similarity scores; sorting that list finally gives the recommendation sequence.
Because the application-side text segment is short (by default the independent claims) while the document-side text segment is long (by default the detailed embodiments), the two lengths are mismatched and the document-side text carries excessive noise (a large amount of text irrelevant to the application's segment), making the model's prediction inaccurate. The denoising method described below therefore condenses the document-side input text, and this condensed segment, rather than the original text, is fed into the model for prediction.
A flowchart of the denoising process is given in fig. 2. Specifically:
1. Chunking: the designated paragraphs of the document to be compared are chunked with a chunking algorithm. The chunking principles are that semantic integrity must not be broken, in particular hard truncation (splitting at positions other than sentence-ending punctuation) is avoided as far as possible, while each chunk is kept as long as possible without exceeding a preset maximum length M.
Given a document D, segmentation positions and costs are defined as follows.
Identifying candidate breakpoints:

$$P=\{(i,\,c_i)\}$$

where P is the set of all candidate breakpoints, each comprising a position i and an associated breakpoint cost $c_i$. If the last position in P is not the end of the document, the document end is appended as an additional breakpoint.
Inserting the start position:

$$P\leftarrow\{(0,\,0)\}\cup P,\qquad 0\le i\le n$$

where n is the length of the document.
Additional breakpoints are added to satisfy the maximum chunk length M: if the distance between two consecutive breakpoints exceeds M, new breakpoints are inserted between them at intervals of M.
Dynamic programming search for the optimal breakpoints:
A dynamic programming array best is maintained, where best[i] stores the minimum cost of segmenting the text up to the i-th breakpoint, together with the path achieving it.
Initialization:

$$\mathrm{best}[0]=0$$

Recurrence:

$$\mathrm{best}[i]=\min_{j<i,\ \mathrm{dist}(j,i)\le M}\big(\mathrm{best}[j]+c_i+\mathrm{sen\_cost}\big)$$

where dist(j, i) is the distance between positions j and i, $c_i$ is the cost of breakpoint i, and sen_cost is a constant c representing the base cost of each chunk (so that, other things being equal, fewer and longer chunks are preferred).
Generating chunks from the optimal breakpoints:
Backtracking the optimal path: starting from the end of the best array, all optimal breakpoints are recovered by backtracking, producing the sentence-break intervals.
The document is segmented according to these intervals to obtain the final chunking result.
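Under the assumptions above, the chunking can be sketched as follows; the breakpoint costs, the base chunk cost, and the forced-breakpoint cost are illustrative values, not from the disclosure.

```python
# Hypothetical sketch of cost-minimizing chunking with a max chunk length M.
# Natural breakpoints sit after sentence-ending punctuation (cost 0); forced
# breakpoints every M characters carry a higher cost.

def chunk(text: str, M: int = 120, forced_cost: float = 5.0,
          sen_cost: float = 1.0) -> list[str]:
    n = len(text)
    points = {0: 0.0}
    for i, ch in enumerate(text):
        if ch in "。.!?！？":
            points[i + 1] = 0.0                    # natural breakpoint
    points[n] = points.get(n, 0.0)                 # document end is a breakpoint
    pos = sorted(points)
    for a, b in zip(pos, pos[1:]):                 # enforce max length M
        for p in range(a + M, b, M):
            points.setdefault(p, forced_cost)
    pos = sorted(points)

    best = {0: (0.0, None)}                        # position -> (cost, prev)
    for i in pos[1:]:
        cands = [(best[j][0] + points[i] + sen_cost, j)
                 for j in pos if j < i and j in best and i - j <= M]
        best[i] = min(cands) if cands else (float("inf"), None)

    cuts, p = [], n                                # backtrack the optimal path
    while p is not None:
        cuts.append(p)
        p = best[p][1]
    cuts = cuts[::-1]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

# Shows both a merged natural chunk and a forced split at the M limit.
for c in chunk("First sentence. Second one! A third, much longer sentence?", M=30):
    print(repr(c))
```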
2. Vector encoding: each chunk text and the query text are vector-encoded with a vector model.
Each chunk $S_i$ and the query text query_text are encoded with a pre-trained vector model:

$$v_i=\mathrm{Encoder}(S_i),\qquad v_q=\mathrm{Encoder}(\mathrm{query\_text})$$

where $v_i$ is the vector representation of the i-th chunk and $v_q$ is the vector representation of the query text.
3. The vector similarity between each chunk and the query_text is computed one by one.
The similarity of each chunk vector $v_i$ to the query vector $v_q$ is computed as

$$\mathrm{sim}_i=\cos(v_i,v_q)=\frac{v_i\cdot v_q}{\lVert v_i\rVert\,\lVert v_q\rVert}$$

Cosine similarity is used here as the similarity measure; $\mathrm{sim}_i$ denotes the cosine similarity of $v_i$ and $v_q$.
4. A maximum window length N is preset, typically an integer multiple of the maximum chunk length, i.e., an integer multiple of M.
5. The window is first placed at the start, covering chunks 1 through N/M; the mean similarity between the chunks inside the window and the query_text is computed; the window is then moved backward one chunk at a time, recomputing the mean at each new position, until it reaches the last chunk.
With window length N, the mean chunk similarity of each window is

$$\overline{\mathrm{sim}}_k=\frac{1}{N/M}\sum_{i=k}^{k+N/M-1}\mathrm{sim}_i$$

where k is the index of the window's starting chunk.
6. All the window means from step 5 are plotted as a curve of mean similarity against window position, and the peak positions of the curve are located.
Let $\{\overline{\mathrm{sim}}_k\}$ denote the set of window similarities obtained in the previous step. It is processed as follows:
First, the window with the highest similarity is identified, and its similarity is taken as the base reference value.
Then, a similarity threshold is set from this base value, e.g. as a fixed proportion of it,

$$\tau=\alpha\cdot\max_k\,\overline{\mathrm{sim}}_k$$

with α a preset coefficient.
Based on this threshold, a new set C is created containing all windows whose similarity exceeds τ.
Then, by analyzing the positions of these windows in the original text, windows whose mutual distance is at most 3 are grouped into clusters.
Finally, the longest cluster among all formed clusters is identified; it represents a peak of the curve.
7. If there is only one peak, the text inside the window at the peak position is taken as the denoised text.
8. If there are X peaks (X > 1), the original window is split into X equal-length sub-windows, one per peak, and the text inside each peak's sub-window is extracted.
9. The texts of all sub-windows are spliced in their order of appearance in the original text, giving the denoised text for the multi-peak case.
In short, the text of the window at the peak position is selected as the denoised text; if multiple peaks exist, the corresponding sub-windows are selected and their texts spliced:

$$D'=\mathrm{concat}\big(W_{p_1},\,W_{p_2},\,\dots,\,W_{p_X}\big)$$

where $D'$ is the final denoised text and $W_{p_x}$ is the text of the window corresponding to the x-th detected peak.
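A compact sketch of steps 4 to 9 under the stated assumptions (windows of N/M chunks, a threshold at a fixed fraction of the maximum window mean, clusters of positions at most 3 apart, one kept window per cluster) follows; the 0.9 coefficient is illustrative.

```python
# Hypothetical sketch of sliding-window denoising over chunk similarities.
# sims[i] is the cosine similarity of chunk i to the query text.

def denoise(chunks: list[str], sims: list[float], win: int = 4,
            alpha: float = 0.9) -> str:
    # Mean similarity of each window of `win` consecutive chunks.
    means = [sum(sims[k:k + win]) / win
             for k in range(len(chunks) - win + 1)]
    tau = alpha * max(means)                       # threshold from the maximum
    hot = [k for k, m in enumerate(means) if m >= tau]

    clusters, cur = [], [hot[0]]                   # group positions <= 3 apart
    for k in hot[1:]:
        if k - cur[-1] <= 3:
            cur.append(k)
        else:
            clusters.append(cur)
            cur = [k]
    clusters.append(cur)

    # One peak per cluster: keep the window text at each cluster's best position.
    pieces = []
    for cl in clusters:
        peak = max(cl, key=lambda k: means[k])
        pieces.append((peak, "".join(chunks[peak:peak + win])))
    return "".join(text for _, text in sorted(pieces))

sims = [0.1, 0.2, 0.8, 0.9, 0.85, 0.2, 0.1, 0.8, 0.9, 0.8, 0.1, 0.1]
chunks = [f"[c{i}]" for i in range(len(sims))]
print(denoise(chunks, sims))   # keeps the two high-similarity regions, in order
```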
Step three, training the reordering model:
A reordering model is trained using a general-purpose large language model as the base.
1. Data set construction: positive and negative example pairs are constructed according to a fixed strategy, where positive examples represent related patents and negative examples represent unrelated documents; each positive example is paired with several negative examples, assembled at a fixed negative-example ratio to form the training data. Positive examples are drawn from patent citation data, while negative examples come from the following mechanisms:
BM25-recalled negatives: documents among the top-N results recalled by the BM25 algorithm that are not positive examples; these introduce hard, highly relevant negatives.
IPC-level negatives: patents whose IPC classification matches that of the query text only up to a given level (e.g., the deepest shared level is a main group or a subgroup); these introduce documents from similar fields that are unrelated to the specific technical idea.
Random negatives: documents sampled at random from the whole data set, with no obvious or direct relation to the query text; introducing them lowers the overall difficulty.
Negative sampling ratio:
The sampling proportions of the three negative example types are set to $r_1$, $r_2$, and $r_3$, one per mechanism. The whole training data set then consists of elements of the form

$$\big(q,\ d^{+},\ \{d^{-}_{\mathrm{bm25}}\},\ \{d^{-}_{\mathrm{ipc}}\},\ \{d^{-}_{\mathrm{rand}}\}\big)$$

where $d^{+}$ is a positive example document related to the query text q, the $d^{-}$ are negative example documents sampled by the different strategies, and $r_1$, $r_2$, $r_3$ specify the sampling proportions of the respective negative types.
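A small sketch of assembling one training example under these assumptions follows; the 2:1:1 ratio and pool names are illustrative.

```python
import random

# Hypothetical assembly of one training example with a 2:1:1 negative ratio.
R1, R2, R3 = 2, 1, 1   # BM25-recalled : IPC-level : random negatives

def build_example(query: str, positive: str,
                  bm25_pool: list[str], ipc_pool: list[str],
                  corpus: list[str]) -> dict:
    return {
        "query": query,
        "positive": positive,                       # from citation data
        "negatives": (random.sample(bm25_pool, R1)  # hard negatives
                      + random.sample(ipc_pool, R2) # same-field negatives
                      + random.sample(corpus, R3)), # easy random negatives
    }

ex = build_example("query claims ...", "cited patent ...",
                   ["bm25 doc 1", "bm25 doc 2", "bm25 doc 3"],
                   ["ipc doc 1", "ipc doc 2"],
                   ["any doc 1", "any doc 2", "any doc 3"])
print(len(ex["negatives"]))  # 4 negatives per positive
```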
2. Constructing the large language model prompt: a specially constructed prompt is used to elicit the latent capability of the large language model so that it can perform this task. The prompt is concatenated with the query text, and with the text of the document to be compared, to form the final model input:
Concatenating the query text with the prompt:

$$\mathrm{input\_query}=\mathrm{concat}(\mathrm{special\_tokens},\ \mathrm{query\_text})$$

where input_query is the sequence obtained by concatenating the query text query_text with the specially constructed prompt special_tokens, to be fed into the large language model.
Concatenating the document text with the prompt:

$$\mathrm{input\_doc}=\mathrm{concat}(\mathrm{special\_tokens},\ D')$$

where input_doc is the sequence formed by concatenating the denoised text D' of the document to be compared with the same prompt special_tokens.
Final model input:

$$\mathrm{final\_input}=\mathrm{concat}(\mathrm{input\_query},\ \mathrm{input\_doc})$$

where final_input is the final model input, the concatenation of the prompt-enhanced query text and the text to be compared.
3. Adjusting the original model structure: a customized vector layer is attached after the model's output layer to reduce the dimensionality of the model output from a two-dimensional matrix (batch_size, vocab_size) to a one-dimensional vector (batch_size). The customized vector layer is a trainable parameter.
This trainable vector layer follows the model's output layer, converting the output from a two-dimensional matrix to a one-dimensional vector for more accurate similarity computation:

$$v=f(x;\,w)$$

where x is the original model output (the predicted-token distribution over the dictionary), w is the trainable parameter of the vector layer, f is the layer's transformation, and v is the resulting one-dimensional vector.
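One way to realize such a head is sketched below, assuming the scored output has shape (batch_size, vocab_size); the disclosure does not fix this detail, so the linear projection is a plausible reading, not the definitive design.

```python
import torch

# Hypothetical scoring head: project (batch, vocab) LLM output to (batch,) scores.
class ScoreHead(torch.nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        self.proj = torch.nn.Linear(vocab_size, 1)  # the trainable vector layer

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (batch_size, vocab_size) -> one similarity score per example
        return self.proj(logits).squeeze(-1)

head = ScoreHead(vocab_size=32000)
fake_logits = torch.randn(4, 32000)     # stand-in for the LLM's output
print(head(fake_logits).shape)          # torch.Size([4])
```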
4. Final model input and training: the inputs are fed into the model, a loss function is constructed from the prediction error, and the model is trained by back-propagation.
Loss function: the loss is built on a contrastive objective so that similar (positive) samples are drawn closer together and dissimilar (negative) samples are pushed farther apart, enabling the model to distinguish positive from negative samples effectively.
Contrastive loss function:

$$L=\sum_{i}\Big[y_i\,d(a_i,b_i)^2+(1-y_i)\,\max\big(0,\ m-d(a_i,b_i)\big)^2\Big]$$

where $a_i$ and $b_i$ are the vector representations of the two texts of the i-th pair within the same group, $y_i$ is the label (1 for a positive pair, 0 for a negative pair), $d(\cdot,\cdot)$ is the distance between the two vectors (typically Euclidean or cosine distance), and m is the boundary value (margin) defining when a negative pair counts as "far enough" apart.
The constructed large language model is trained with this contrastive loss and the back-propagation algorithm, yielding the trained large language model.
Concretely, the denoised text pairs are fed into the trained large language model (input: the query text q and a document text d to be compared; output: a similarity value).
Each document recalled in step one is paired with q and fed into the trained large language model one by one, producing a set of similarity scores; sorting these scores yields the initial recommendation sequence P.
Step four, fine-tuning the initial recommendation list:
The purpose of fine-tuning the initial recommendation list is to move patents with clearly promising characteristics up the ranking, thereby improving recommendation accuracy. Two signals are used: weighting by high relevance of the "technical problem to be solved", and weighting by the level at which the main IPC classification numbers match.
1. Using a "technical problem to be solved" extraction model, the technical field, background, and summary (first and last paragraphs) of the present application and of each document to be compared are input to the model, which generates the "technical problem to be solved" field for the application and for each compared document respectively:

$$t_d=M_{\mathrm{extract}}\big(F(d)\big)$$

where $M_{\mathrm{extract}}$ is the model that extracts the "technical problem to be solved" from the relevant fields of a patent document d, F(d) denotes the relevant fields extracted from d (technical field, background, and summary), and $t_d$ is the extracted "technical problem to be solved" text.
2. The "technical problem to be solved" field of the present application is encoded with a vector model:

$$v_{\mathrm{app}}=\mathrm{Encoder}\big(t_{D_{\mathrm{app}}}\big)$$

where $D_{\mathrm{app}}$ is the patent document of the present application, $t_{D_{\mathrm{app}}}$ is its "technical problem to be solved" field, and $v_{\mathrm{app}}$ is the resulting vector.
3. The "technical problem to be solved" field of each document to be compared in the recall list is encoded with the vector model:

$$v_i=\mathrm{Encoder}\big(t_{D_i}\big)$$

where $D_i$ is the i-th document to be compared, $t_{D_i}$ is its "technical problem to be solved" field, and $v_i$ is the resulting vector.
4. The similarity between the vector of each document to be compared and the vector of the present application is computed, yielding a similarity list:

$$\mathrm{sim}_i=\cos\big(v_{\mathrm{app}},\,v_i\big)$$

Cosine similarity is used here to measure the similarity between the application's vector $v_{\mathrm{app}}$ and the compared document's vector $v_i$.
5. The similarity list is processed by proportional scaling: the lowest similarity in the list is rescaled to 1 and all remaining values are scaled by the same factor, giving the scaled similarity list

$$S_i=\frac{\mathrm{sim}_i}{\min_j\,\mathrm{sim}_j}$$

Each document's similarity in the initial recommendation sequence P is then multiplied by its scaled similarity, and the products are sorted:

$$P'=\mathrm{Sort}\big(\{s_i\cdot S_i\}\big)$$

where Sort is the ranking function and $s_i$ is the i-th document's similarity in the initial recommendation list. This yields the recommendation sequence weighted by the "technical problem to be solved".
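A short sketch of this re-weighting step, under the assumptions above:

```python
# Hypothetical re-weighting: scale problem-similarities so the minimum is 1,
# multiply into the LLM scores, and re-sort.

def reweight(doc_ids: list[str], llm_scores: list[float],
             problem_sims: list[float]) -> list[str]:
    lo = min(problem_sims)
    scaled = [s / lo for s in problem_sims]        # lowest value becomes 1
    products = [l * s for l, s in zip(llm_scores, scaled)]
    order = sorted(range(len(doc_ids)), key=lambda i: -products[i])
    return [doc_ids[i] for i in order]

print(reweight(["CN-A", "CN-B", "CN-C"],
               [0.90, 0.80, 0.70],
               [0.20, 0.60, 0.40]))
# CN-B overtakes CN-A because its technical problem matches much better.
```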
6. The IPC classification number of each document to be compared is acquired and decomposed by IPC level (section, class, subclass, main group, subgroup). Five preset weighting values a1, a2, a3, a4, and a5 are defined, one per level, as reward weights for matches at the corresponding IPC level. The IPC of each document in the list is compared with the IPC of the present application, including a lookup against pre-collected IPC similarity groups (groups of distinct but strongly correlated IPC classes obtained from large-scale statistics; if a compared document's IPC class falls within the similarity group of the application's IPC, it counts as a hit). The recommendation list already weighted by the "technical problem to be solved" is then further weighted by the value corresponding to the matching level, yielding the recommendation list weighted by IPC classification number.
Description of the steps:
Acquiring and decomposing IPC classification numbers: the IPC classification numbers of each document to be compared and of the present application are acquired and decomposed by level (section, class, subclass, main group, subgroup).
Setting preset weighting values: weighting values a1, a2, a3, a4, and a5 are set according to the degree of IPC-level matching, one value per level.
IPC matching and weighting:
The IPC of each document to be compared is compared with the IPC of the present application.
The corresponding weighting value is applied according to the matching level.
Weighted similarity computation:
The previously computed similarities are re-weighted according to the IPC matching level, boosting the weight of patents highly relevant to the present application:

$$\tilde{s}_i=s_i\cdot a_{\ell(i)}$$

where $\tilde{s}_i$ is the weighted similarity of the i-th document to be compared, $s_i$ is the similarity of the i-th document computed in the previous step, and $a_{\ell(i)}$ is the weighting value determined by the matching level $\ell(i)$ between the i-th document's IPC classification and that of the present application.
7. The fully weighted recommendation scores $\{\tilde{s}_i\}$ are sorted to produce the final recommendation list.
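A sketch of the IPC match-level weighting under these assumptions; the weights and the level decomposition are illustrative (a real IPC code such as G06F16/3344 decomposes into section G, class G06, subclass G06F, main group G06F16, and subgroup G06F16/3344).

```python
# Hypothetical IPC match-level weighting. Levels: section, class, subclass,
# main group, subgroup; deeper matches earn larger illustrative weights.
WEIGHTS = [1.02, 1.05, 1.10, 1.20, 1.35]   # a1..a5, illustrative values

def ipc_levels(code: str) -> list[str]:
    main, _, _group = code.partition("/")
    return [main[:1], main[:3], main[:4], main, code]

def ipc_weight(app_ipc: str, doc_ipc: str) -> float:
    a, b = ipc_levels(app_ipc), ipc_levels(doc_ipc)
    level = 0
    for x, y in zip(a, b):
        if x != y:
            break
        level += 1
    return WEIGHTS[level - 1] if level else 1.0     # no match: no boost

print(ipc_weight("G06F16/3344", "G06F16/338"))  # match to main group: 1.2
print(ipc_weight("G06F16/3344", "H04L9/32"))    # different section: 1.0
```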
In one embodiment, as shown in fig. 3, an intelligent patent retrieval system based on a large language model reordering technology is provided, specifically comprising:
a retrieval recall module for performing semantic retrieval recall using a query text, divided into sparse vector retrieval based on a bag-of-words model and dense vector retrieval based on a vector model;
a denoising module for denoising each candidate patent using a chunking and sliding-window strategy based on the retrieval recall sequence;
a reordering model training module for training a reordering model on the denoised results, using a general-purpose large language model as the base; first constructing a data set containing positive and negative example pairs, where positive examples come from patent citation data and negative examples are obtained through different mechanisms; then constructing a prompt to elicit the latent capability of the large language model, and concatenating the prompt with the query text and the text of the document to be compared to form the model input;
and a fine-tuning module for inputting the denoised text pairs into the trained large language model to obtain a set of similarity scores, sorting the scores to obtain an initial recommendation sequence, and fine-tuning the initial recommendation list to obtain a final recommendation list, where a text pair comprises the query text and the text of a document to be compared.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in fig. 4. The computer device comprises a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capability, the network interface communicates with external terminals over a network, and the computer device implements the above intelligent patent retrieval method by loading and running the computer program.
Those skilled in the art will appreciate that the structure shown in fig. 4 is merely a block diagram of part of the structure relevant to the present solution and does not limit the computer devices to which the present solution may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In an embodiment, a computer readable storage medium is also provided, on which a computer program is stored, involving all or part of the flow of the method of the above embodiment.
In an embodiment, a computer program product is also provided, comprising computer programs/instructions, relating to all or part of the flow in the method of the above embodiments.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the flows of the method embodiments above. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction among the combinations, they should be considered within the scope of this description.

Claims (10)

1. An intelligent patent retrieval method based on a large language model reordering technology is characterized by comprising the following steps:
performing retrieval recall using a query text, specifically comprising sparse vector retrieval based on a bag-of-words model and dense vector retrieval based on a vector model;
denoising, namely denoising each candidate patent using a chunking and sliding-window strategy based on the retrieval recall sequence;
training a reordering model on the denoised results, using a general-purpose large language model as the base; first constructing a data set containing positive and negative example pairs, wherein positive examples come from patent citation data and negative examples are obtained through different mechanisms; then constructing a prompt to elicit the latent capability of the large language model, and concatenating the prompt with the query text and the text of the document to be compared to form the model input;
and inputting the denoised text pairs into the trained large language model to obtain a set of similarity scores, sorting the scores to obtain an initial recommendation sequence, and fine-tuning the initial recommendation list to obtain a final recommendation list, wherein a text pair comprises the query text and the text of a document to be compared.
2. The method of claim 1, further comprising, prior to the retrieval recall in step one, acquiring multiple types of input data and performing data preprocessing and parsing, wherein the multiple types of input data comprise patent publication numbers, free text, and text files;
and selecting a corresponding text segment as the query text according to a preset field parameter in the preprocessing and parsing result: if a patent publication number is input, the independent claims are used by default; if free text is input, the preprocessed content is the query text; and if a text file is input, the recognized and checked content is used as the query text for subsequent vector encoding and similarity matching.
3. The method of claim 2, wherein, when the input data is a patent publication number, the method specifically comprises: retrieving the corresponding patent document from a patent database by the patent publication number; parsing the acquired patent document with preset rules and extracting key fields, wherein the key fields at least comprise the title, abstract, claims, description, inventor information, and IPC classification numbers; splitting the claim text into individual claims, distinguishing independent claims from dependent claims based on preset rule matching, and constructing a relation tree between the independent and dependent claims;
wherein, when the input data is free text, the method specifically comprises: performing punctuation checking and automatic typo correction on the input free text to ensure that the text quality meets the requirements of subsequent processing;
and wherein, when the input data is a text file, the method specifically comprises: selecting a corresponding parsing strategy according to the file type, wherein the file type at least comprises TXT, Word document, text PDF, or image PDF; for TXT, Word documents, or text PDF, parsing and reading the text directly; for an image PDF file, invoking OCR for text recognition, followed by further punctuation checking and automatic typo correction of the OCR output.
4. The method according to claim 1, wherein the retrieval recall performs semantic retrieval recall using the query text, divided into sparse vector retrieval based on a bag-of-words model and dense vector retrieval based on a vector model, specifically comprising:
constructing sparse vectors with the TF-IDF and BM25 algorithms, generating vectors for the title, abstract, claims, and description of each patent;
performing dense vector retrieval with a BGE vector model, whose ability to distinguish different input texts is improved through contrastive-loss optimization;
and fusing the two retrieval results with a min-sum merge algorithm to generate a sequence of patent publication numbers.
5. The method according to claim 1, wherein, in denoising, each candidate patent is denoised using a chunking and sliding-window strategy based on the retrieval recall sequence, specifically comprising:
chunking the designated paragraphs of the document to be compared so that semantic integrity is preserved;
and vector-encoding each chunk and the query text with a vector model, computing their similarities, computing the mean similarity between the query text and the chunks within a window of preset maximum length N, plotting the similarity curve, and locating its peaks.
6. The method of claim 1, wherein fine-tuning the initial recommendation list specifically comprises:
extracting the technical-problem-to-be-solved field from the application and from each document to be compared using a technical-problem extraction model, and performing vector encoding;
calculating the vector similarity between the application and each document to be compared, and scaling it proportionally;
acquiring the IPC classification number of each document to be compared, and weighting it according to the matching level;
and re-sorting the recommendation list using all the weighted scores to obtain the final recommendation list.
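A sketch of this re-scoring pass follows. The scaling range, the IPC bonus per matched level, and the match-level definition are all assumptions; claim 6 fixes none of these constants.

```python
# Sketch of the list fine-tuning in claim 6: scale the technical-problem
# similarity proportionally, weight by IPC match level, and re-sort.
# All numeric constants below are assumptions, not values from the patent.

def ipc_match_level(ipc_a: str, ipc_b: str) -> int:
    """0-4: agreement on section, class, subclass, main group (e.g. G06F16/33)."""
    prefixes = [1, 3, 4, len(ipc_a.split("/")[0])]
    return sum(ipc_a[:p] == ipc_b[:p] for p in prefixes)

def rescore(base: float, problem_sim: float, level: int) -> float:
    scaled = 0.5 + 0.5 * problem_sim   # assumed proportional scaling into [0.5, 1]
    weight = 1.0 + 0.05 * level        # assumed bonus per matched IPC level
    return base * scaled * weight

candidates = [
    {"id": "CN1A", "base": 0.81, "problem_sim": 0.40, "ipc": "G06F16/33"},
    {"id": "CN2A", "base": 0.78, "problem_sim": 0.90, "ipc": "G06F16/33"},
]
query_ipc = "G06F16/33"
ranked = sorted(
    candidates,
    key=lambda c: rescore(c["base"], c["problem_sim"],
                          ipc_match_level(query_ipc, c["ipc"])),
    reverse=True,
)
print([c["id"] for c in ranked])   # final recommendation list, best first
```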
7. An intelligent patent retrieval system based on a large language model re-ranking technology, characterized in that the system comprises:
a retrieval recall module for performing semantic retrieval recall using the query text, specifically divided into sparse vector retrieval based on a bag-of-words model and dense vector retrieval based on a vector model;
a denoising module for denoising each candidate patent using the chunking and sliding-window strategy on the basis of the retrieval recall sequence;
a re-ranking model training module for training a re-ranking model with a general-purpose large language model as the base, on the basis of the denoising result; first, a dataset containing positive and negative example pairs is constructed, the positive examples coming from patent citation data and the negative examples obtained through different mechanisms; then, prompt words are constructed to elicit the latent capability of the large language model and are concatenated with the query text and the text of the document to be compared to form the model input;
and a fine-tuning module for inputting the denoised text pairs into the trained large language model to obtain a similarity set, sorting the similarities to obtain an initial recommendation sequence, and fine-tuning the initial recommendation list to obtain the final recommendation list, wherein a text pair comprises the query text and the text of a document to be compared.
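One plausible reading of the prompt-and-score step is pointwise yes/no scoring with a causal LLM, sketched below. The prompt wording, the yes/no-logit scoring trick, and the model checkpoint are assumptions; claim 7 specifies only that the prompt, the query text, and the candidate text are concatenated to form the model input.

```python
# Sketch of the re-ranking module in claim 7: concatenate prompt + query +
# candidate into one model input and read a relevance score from the LLM.
# The scoring scheme and checkpoint are assumptions, not the patent's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder general-purpose LLM
tok = AutoTokenizer.from_pretrained(MODEL)
llm = AutoModelForCausalLM.from_pretrained(MODEL)

PROMPT = ("Judge whether the candidate patent text is relevant to the query "
          "patent text. Answer yes or no.\nQuery: {q}\nCandidate: {d}\nAnswer:")

def relevance(query: str, doc: str) -> float:
    inputs = tok(PROMPT.format(q=query, d=doc), return_tensors="pt")
    with torch.no_grad():
        logits = llm(**inputs).logits[0, -1]                 # next-token logits
    yes = tok.encode("yes", add_special_tokens=False)[0]
    no = tok.encode("no", add_special_tokens=False)[0]
    return torch.softmax(logits[[yes, no]], dim=0)[0].item() # P("yes")

scores = {d: relevance("query claim text ...", d)
          for d in ["candidate text A ...", "candidate text B ..."]}
ranking = sorted(scores, key=scores.get, reverse=True)       # initial recommendation
```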
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 6.
CN202510112691.2A 2025-01-24 2025-01-24 Intelligent patent retrieval method and system based on large language model re-ranking technology Pending CN119577115A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510112691.2A CN119577115A (en) 2025-01-24 2025-01-24 Intelligent patent retrieval method and system based on large language model re-ranking technology

Publications (1)

Publication Number Publication Date
CN119577115A true CN119577115A (en) 2025-03-07

Family

ID=94809094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510112691.2A Pending CN119577115A (en) 2025-01-24 2025-01-24 Intelligent patent retrieval method and system based on large language model re-ranking technology

Country Status (1)

Country Link
CN (1) CN119577115A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591681A (en) * 2023-11-20 2024-02-23 中国科学院深圳先进技术研究院 Patent retrieval method and system based on knowledge graph
CN117725183A (en) * 2023-12-27 2024-03-19 上海网达软件股份有限公司 Reordering method and device for improving retrieval performance of AI large language model
CN118113810A (en) * 2024-03-12 2024-05-31 重庆邮电大学 Patent retrieval system combining patent image and text semantics
CN118964589A (en) * 2024-06-21 2024-11-15 数野科技(深圳)有限公司 A method for constructing an intelligent search engine based on a large language model
CN118520074A (en) * 2024-07-23 2024-08-20 浙江省北大信息技术高等研究院 Real-time retrieval enhancement generation method and device based on industrial brain

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119988386A (en) * 2025-04-14 2025-05-13 上海爱可生信息技术股份有限公司 Joint optimization method, system and readable storage medium for index and representation model

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN110750640B (en) Text data classification method and device based on neural network model and storage medium
CN104199965B (en) Semantic information retrieval method
CN113688954A (en) Method, system, equipment and storage medium for calculating text similarity
CN111753550A (en) Semantic parsing method for natural language
CN111291188B (en) Intelligent information extraction method and system
CN112395875A (en) Keyword extraction method, device, terminal and storage medium
CN106294736A (en) Text feature based on key word frequency
CN108875065B (en) A content-based recommendation method for Indonesian news pages
CN118278365A (en) Automatic generation method and device for scientific literature review
CN119577115A (en) Intelligent patent retrieval method and system based on large language model re-ranking technology
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN106294733A (en) Page detection method based on text analyzing
Jayady et al. Theme identification using machine learning techniques
CN118839008A (en) Military question-answering method and system based on language big model
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN119829764A (en) Text dividing method and device and electronic equipment
CN118013020B (en) Patent query method and system for generating joint training based on retrieval
CN119474409A (en) A method and system for retrieving similar cases in the judicial field based on a large model
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
CN118964554A (en) A steel supply chain knowledge recovery method and system based on RAG technology
CN106294295A (en) Article similarity recognition method based on word frequency
CN117668234A (en) Text label dividing method, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination