CN120407878A - A RAG method and system for compliance analysis of multimodal financial documents

Info

Publication number: CN120407878A
Application number: CN202510490735.5A
Authority: CN
Inventors: 王新宇; 周灵; 池纪君; 邰正晗; 李铸洪; 何海林; 华雨晨; 郭桐深; 李木之; 卢鹏; 王苏羽晨; 黄杰瑞; 吴益洪; 韩博喻; 李维恩; 李泽宇; 马力恒; 崔军
Original assignee: Chengguang Phantom Suzhou Technology Co ltd
Current assignee: Chengguang Phantom Suzhou Technology Co ltd
Priority date: 2025-04-18
Filing date: 2025-04-18
Publication date: 2025-08-01

Abstract

The invention relates to the technical field of large financial models and provides a RAG method and system for compliance analysis of multimodal financial documents. The RAG method comprises: preprocessing an input multimodal financial document, generating corresponding text blocks from the multimodal data in the document, and constructing a corresponding vector database; responding to a target query request and entering a multi-path retrieval process according to the target query statement to generate bundles for responding to the sub-query statements; based on the bundles generated by the multi-path retrieval module, calculating the semantic alignment between each bundle and the sub-query statement, and calculating a corresponding time reward score according to a time reward mechanism to obtain a matching score; and performing ranking optimization on the matching scores through a direct preference optimization mechanism to obtain a preferred text block set, which is used to generate the corresponding answer. By integrating preprocessing, retrieval, and reordering modules for financial documents, the method addresses the shortcomings of traditional methods in processing heterogeneous financial data and in improving retrieval recall and relevance.

Description

RAG method and system for multi-mode financial document compliance analysis

Technical Field

The invention relates to the technical field of large financial models, in particular to a RAG method and system for multi-mode financial document compliance analysis.

Background

In the financial industry, as regulations evolve, financial institutions face increasingly complex compliance requirements. To address these challenges, more and more financial institutions seek question-answering (QA) systems that can efficiently retrieve and analyze compliance information. These systems typically combine an external knowledge base or database with retrieval-augmented generation (RAG) techniques to enhance the capabilities of the language model and thereby improve the accuracy of decision support. Existing RAG systems mostly depend on a single text retrieval method, such as dense retrieval or sparse lexical matching, to analyze and extract the text content of financial documents. However, financial documents often contain multimodal data, such as unstructured text, semi-structured tables, and images, and existing RAG methods are limited in their ability to process such multimodal information.

At present, conventional RAG systems handle multimodal data poorly. In particular, when faced with the complex structured and unstructured data in financial documents, they cannot effectively integrate the different data types, so context information is fragmented or lost. In addition, existing retrieval methods rely mainly on semantic similarity for matching and ignore the implicit regulatory relationships and domain standards specific to the financial field, which reduces the accuracy and compliance of the retrieval results. Meanwhile, conventional ranking methods do not adequately account for the importance of compliance, so key legal or financial information is easily missed, which further reduces the reliability and effectiveness of retrieval-augmented generation.

In order to solve the above-mentioned problems, a technology capable of effectively processing a multi-modal financial document and accurately recognizing compliance requirements is needed.

Disclosure of Invention

The application provides a RAG method and system for multimodal financial document compliance analysis that overcome the limitations of the prior art. By combining a multimodal preprocessing component, a flexible multi-path retrieval module, and a domain-specific document reordering module, the application can comprehensively process multimodal information such as text, tables, and images in financial documents, thereby improving the accuracy and efficiency of information retrieval. In addition, the domain-specific reordering technique ensures high accuracy in compliance analysis and provides financial institutions with an extensible, practical compliance analysis solution.

The invention provides a RAG method for multimodal financial document compliance analysis, which comprises the following steps:

Step S1, preprocessing an input multimodal financial document, generating corresponding text blocks from the multimodal data in the document through a file preprocessing module, and vectorizing and encoding all preprocessed text blocks and their metadata with a BGE-M3 dense encoder to construct a corresponding vector database;

Step S2, responding to a target query request;

Step S3, entering a multi-path retrieval process according to the target query statement in the target query request, wherein the multi-path retrieval process decomposes the target query statement into one or more sub-query statements, retrieves text blocks matching the sub-query statements from the vector database through a multi-path retrieval module, and dynamically bundles, expands, and merges the text blocks according to a similarity threshold to generate bundles for responding to the sub-query statements;

Step S4, based on the bundles generated by the multi-path retrieval module, calculating the semantic alignment between each bundle and the sub-query statement through a domain-specific document reordering module, calculating a corresponding time reward score according to a time reward mechanism to obtain matching scores, and performing ranking optimization on the matching scores through a direct preference optimization mechanism to obtain a corresponding preferred text block set, which is used to respond to the target query statement and generate the corresponding answer.

Further, in step S1, preprocessing the input multimodal financial document, generating corresponding text blocks from the multimodal data in the document through the file preprocessing module, and vectorizing and encoding all preprocessed text blocks and their metadata with the BGE-M3 dense encoder to construct the corresponding vector database includes:

Step S11, acquiring a financial document for processing a target query request;

Step S12, decomposing the multimodal data of the financial document into multimodal blocks, including text, image, and table blocks, in the original layout order by using the open-source PDF tool MinerU;

Step S13, converting the multimodal blocks into text blocks through a large language model, converting the image blocks into structured text summaries through a visual language model as the text blocks of the images, and converting the table blocks into structurally consistent text descriptions as the text blocks of the tables;

Step S14, processing the text blocks based on a semantic enhancement technique, and outputting a text block set for constructing a vector database.

Further, in step S14, processing the text blocks based on the semantic enhancement technique and outputting a text block set for constructing a vector database includes:

Calculating the cosine similarity between different text blocks from their SBERT sentence embeddings, and merging text blocks whose cosine similarity exceeds a set threshold to remove redundant and duplicate content;

Performing coreference resolution using a large language model: through iterative resolution, pronouns within the same section are identified and replaced using context information, and different mentions of the same entity are grouped into an equivalence set, which resolves referential ambiguity in the text blocks;

Adding structured metadata to each text block using a large language model, the metadata including chapter titles, page locations, and data types;

Performing text vectorization encoding on all preprocessed text blocks and their corresponding metadata through the BGE-M3 dense encoder, and constructing a vector database against which the client's target query statements are run.
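The redundancy removal described above can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are small hand-written vectors standing in for SBERT sentence embeddings, the function names `cosine` and `deduplicate` are hypothetical, the threshold of 0.95 is only an example value, and a near-duplicate block is simply dropped rather than merged.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def deduplicate(blocks, embeddings, threshold=0.95):
    """Keep a block only if it is not a near-duplicate of an already kept block."""
    kept = []  # indices of the blocks retained so far
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [blocks[i] for i in kept]

# Toy vectors stand in for SBERT sentence embeddings (assumption).
blocks = [
    "Revenue rose 15% in 2023.",
    "Revenue increased 15% in 2023.",
    "Operating costs fell 8%.",
]
embeddings = [[0.9, 0.1, 0.0], [0.89, 0.12, 0.01], [0.0, 0.2, 0.95]]
unique_blocks = deduplicate(blocks, embeddings)
```

In this toy input the first two blocks are nearly parallel vectors, so only the first of them and the third block are retained. A real pipeline would obtain the embeddings from a sentence embedding model and might merge the metadata of the duplicates instead of discarding one.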

Further, in step S3, entering the multi-path retrieval process according to the target query statement in the target query request includes:

Step S31, when a client initiates a target query request to query the database, an executor uses a large language model to decompose the target query statement into a plurality of independent sub-query statements, replaces pronouns in the sub-query statements with explicit entities through coreference resolution, and automatically associates context information;

Step S32, the multi-path retrieval module calculates similarity scores between the sub-query statements and the text blocks through a plurality of retrievers, the retrievers comprising a BM25 sparse retriever, a FAISS dense retriever, a metadata retriever, and a HyDE retriever;

Step S33, the similarity score of each retriever is weighted by that retriever's preset weight coefficient and the weighted scores are fused; the text blocks are sorted by fused score, and the K highest-scoring text blocks are selected as candidate text blocks, where K is a preset positive integer;

Step S34, the candidate text blocks are dynamically bundled, expanded, and merged based on the similarity threshold to generate bundles for responding to the sub-query statements.
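The weighted fusion and top-K selection of step S33 can be sketched as follows. The retriever names, weights, and scores are hypothetical example values, the function name `fuse_scores` is an assumption, and the per-retriever scores are assumed to be already normalized to a comparable range; a block missing from a retriever's results contributes nothing for that retriever.

```python
def fuse_scores(scores_by_retriever, weights, k):
    """Weighted sum of per-retriever scores, then the K best (block_id, score) pairs."""
    fused = {}
    for name, scores in scores_by_retriever.items():
        for block_id, score in scores.items():
            fused[block_id] = fused.get(block_id, 0.0) + weights[name] * score
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical normalized scores from the four retrievers named in step S32.
scores_by_retriever = {
    "bm25": {"c1": 0.8, "c2": 0.2, "c3": 0.5},
    "faiss": {"c1": 0.6, "c2": 0.9, "c3": 0.1},
    "metadata": {"c1": 0.1, "c3": 0.7},
    "hyde": {"c2": 0.4, "c3": 0.3},
}
weights = {"bm25": 0.3, "faiss": 0.4, "metadata": 0.1, "hyde": 0.2}  # assumed preset weights
top_k = fuse_scores(scores_by_retriever, weights, k=2)
```

With these example values, block `c2` ranks first and `c1` second, and `c3` is excluded by K = 2.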

Further, in step S34, dynamically bundling, expanding, and merging the candidate text blocks based on the similarity threshold to generate bundles for responding to the sub-query statements includes:

Performing preliminary retrieval with the candidate text blocks as independent units, and calculating the corresponding dense embeddings through a pre-trained text embedding model, the dense embeddings being vector representations of the candidate text blocks;

Calculating the cosine similarity of adjacent candidate text blocks based on the dense embeddings, the cosine similarity being used to judge the content relevance of adjacent candidate text blocks;

Judging whether the cosine similarity of adjacent candidate text blocks reaches the similarity threshold;

Dynamically merging adjacent candidate text blocks into a bundle when their cosine similarity reaches the similarity threshold, and otherwise leaving them unmerged and independent.
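The bundling rule above can be sketched as follows. This is a minimal illustration, assuming the candidate blocks are supplied in document order with precomputed dense embeddings; the toy vectors, the threshold value of 0.8, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bundle_adjacent(blocks, embeddings, threshold=0.8):
    """Merge runs of adjacent blocks whose consecutive similarity reaches the threshold."""
    if not blocks:
        return []
    bundles = [[blocks[0]]]
    for i in range(1, len(blocks)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            bundles[-1].append(blocks[i])  # extend the current bundle
        else:
            bundles.append([blocks[i]])    # start a new, independent bundle
    return bundles

# Toy 2-D vectors stand in for dense embeddings (assumption).
blocks = ["A", "B", "C", "D"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bundles = bundle_adjacent(blocks, embeddings)
```

Here blocks A and B are similar and form one bundle, the similarity between B and C falls below the threshold, and C and D form a second bundle.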

Further, in step S4, based on the bundles generated by the multi-path retrieval module, the semantic alignment between each bundle and the sub-query statement is calculated by the domain-specific document reordering module, and the matching score is obtained by combining the time reward score calculated according to the time reward mechanism, using the following formula:

$$s(q, c) = \sigma\left(W^{\top} \cdot \mathrm{CrossEnc}(q, B) + r_{t} + b\right);$$

Wherein $s(q, c)$ is the matching score between the sub-query statement $q$ and a candidate text block $c$ in the bundle $B$; $\mathrm{CrossEnc}(q, B)$ is the semantic alignment between the bundle and the sub-query statement, calculated by a cross encoder; $\sigma$ is the Sigmoid function, which maps the result to the interval [0, 1]; $W^{\top}$ is the transpose of the weight vector $W$; $r_{t}$ is the time reward score; and $b$ is a bias term.
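The scoring step can be illustrated with a short sketch. It assumes, as one plausible reading of the definitions above, that the weighted cross-encoder alignment, the time reward score, and the bias term are summed before the Sigmoid function is applied; this additive form, the function names, and all numeric values are assumptions for illustration only.

```python
import math

def sigmoid(x):
    """Logistic function, mapping a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def matching_score(alignment, weights, time_reward, bias):
    """sigmoid(W^T * alignment + time_reward + bias); the additive form is an assumption."""
    linear = sum(w * a for w, a in zip(weights, alignment))
    return sigmoid(linear + time_reward + bias)

# Hypothetical alignment features for one (sub-query, bundle) pair.
alignment = [0.7, 0.2]
weights = [1.5, 0.5]
score_recent = matching_score(alignment, weights, time_reward=0.3, bias=-0.5)
score_old = matching_score(alignment, weights, time_reward=0.0, bias=-0.5)
```

Under this assumption, a bundle from a more recent document (larger time reward) receives a higher matching score than an otherwise identical older one.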

Further, in step S4, performing ranking optimization on the matching scores through the direct preference optimization mechanism to obtain the corresponding preferred text block set, which is used to respond to the target query statement and generate the corresponding answer, includes:

Preliminarily calculating the matching score between each text block and the query statement with the BAAI/bge-reranker-v2-Gemma model of the document reordering module;

Constructing positive and negative sample pairs, and using the direct preference optimization mechanism to adjust the weights of the BAAI/bge-reranker-v2-Gemma model by minimizing a cross entropy loss function, thereby optimizing the ordering of the text blocks and forming the corresponding preferred text block set;

Generating a final answer responding to the target query statement according to the preferred text block set.

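Further, the cross entropy loss function is calculated by the formula:

$$\mathcal{L} = -\mathbb{E}_{(s^{+},\, s^{-})}\left[\log \sigma\left(s^{+} - s^{-}\right)\right];$$

Wherein $s^{+}$ is the matching score of a positive sample, $s^{-}$ is the matching score of a negative sample, $\sigma$ is the Sigmoid function, and $\mathbb{E}$ is the expectation over the positive and negative sample pairs.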
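A pairwise loss of this kind can be sketched as follows. The sketch assumes the common pairwise logistic form, in which the loss is the mean of the negative log-Sigmoid of the score margin between the positive and negative sample; it omits the reference-model terms of the full direct preference optimization objective, and the function name and scores are hypothetical.

```python
import math

def pairwise_preference_loss(pairs):
    """Mean of -log(sigmoid(s_pos - s_neg)) over (s_pos, s_neg) score pairs."""
    total = 0.0
    for s_pos, s_neg in pairs:
        margin = s_pos - s_neg
        # -log(sigmoid(m)) equals log(1 + exp(-m)); log1p keeps small values accurate.
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)

# Hypothetical matching scores: positives ranked above negatives, and the reverse.
well_ordered = pairwise_preference_loss([(0.9, 0.1), (0.8, 0.3)])
mis_ordered = pairwise_preference_loss([(0.1, 0.9), (0.3, 0.8)])
```

The loss is smaller when positive samples score higher than negative samples, so minimizing it pushes compliance-relevant blocks above irrelevant ones in the ranking.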

Based on the same inventive concept, the invention provides a RAG system for multimodal financial document compliance analysis, which performs the RAG method for multimodal financial document compliance analysis described above and comprises:

The file preprocessing module is used for preprocessing an input multimodal financial document, generating corresponding text blocks from the multimodal data in the document, and vectorizing and encoding all preprocessed text blocks and their metadata with a BGE-M3 dense encoder to construct a corresponding vector database;

The response module is used for responding to the target query request;

The multi-path retrieval module is used for entering a multi-path retrieval process according to the target query statement in the target query request, wherein the multi-path retrieval process decomposes the target query statement into one or more sub-query statements, retrieves text blocks matching the sub-query statements from the vector database, and dynamically bundles and expands the text blocks according to a similarity threshold to generate bundles for responding to the sub-query statements;

The document reordering module is used for calculating, based on the bundles generated by the multi-path retrieval module, the semantic alignment between each bundle and the sub-query statement, calculating a corresponding time reward score according to a time reward mechanism to obtain matching scores, and performing ranking optimization on the matching scores through a direct preference optimization mechanism to obtain a corresponding preferred text block set, which is used to respond to the target query statement and generate the corresponding answer.

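Further, the calculation formula of the matching score is:

$$s(q, c) = \sigma\left(W^{\top} \cdot \mathrm{CrossEnc}(q, B) + r_{t} + b\right);$$

Wherein $s(q, c)$ is the matching score between the sub-query statement $q$ and a candidate text block $c$ in the bundle $B$; $\mathrm{CrossEnc}(q, B)$ is the semantic alignment calculated by a cross encoder; $\sigma$ is the Sigmoid function; $W^{\top}$ is the transpose of the weight vector $W$; $r_{t}$ is the time reward score; and $b$ is a bias term.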

Compared with the prior art, the invention has at least one of the following beneficial effects:

(1) The invention provides a RAG method and system for multimodal financial document compliance analysis. By integrating preprocessing, retrieval, and reordering modules for financial documents, it addresses the shortcomings of traditional methods in processing heterogeneous financial data and in improving retrieval recall and relevance. The system uses a hybrid retrieval strategy and a domain-specific DPO reordering technique, which improves the accuracy of information extraction, gives priority to compliance-critical content, and ensures high-quality, compliant answer generation. Comprehensive experimental verification shows a significant performance improvement; in particular, the system exceeds existing baseline methods in accuracy and recall on financial compliance tasks.

(2) By integrating the preprocessing, retrieval, and reordering steps for financial documents, the invention provides an efficient and accurate end-to-end financial compliance question-answering method that can process multimodal financial data and ensure accurate extraction of compliance information.

(3) Through the multimodal file preprocessing module, the invention processes heterogeneous data formats such as text, tables, and images in a uniform way and generates a structured vector database, thereby overcoming the inability of traditional methods to process complex financial document data.

(4) The invention introduces a multi-path retrieval module comprising a dense retriever, a sparse retriever, a metadata retriever, and a HyDE retriever, which improves recall and relevance for complex questions and is particularly advantageous when processing cross-modal and cross-document financial data.

(5) The invention fine-tunes the reordering model through the direct preference optimization mechanism, so that key compliance-related information is presented first and irrelevant content is suppressed, ensuring that the generated answers meet regulatory requirements.

Drawings

FIG. 1 is a flow chart of steps of a RAG method for multimodal financial document compliance analysis of the present invention;

FIG. 2 is a schematic diagram of the operation of the RAG method for compliance analysis of multimodal financial documents of the present invention;

FIG. 3 is a schematic diagram of a query flow in an embodiment of the present invention;

FIG. 4 is a diagram of a working frame of a file preprocessing component in an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

First embodiment

Applying existing retrieval-augmented generation (RAG) methods to financial document compliance analysis faces several core challenges. First, financial documents often contain multiple forms of data, including unstructured text (e.g., narrative disclosures), semi-structured data (e.g., tables and images), and structured data. Traditional text-based retrieval methods cannot handle these effectively, which leads to fragmented information and lost context and thereby reduces the comprehensiveness and accuracy of compliance analysis. Second, existing RAG methods generally rely on dense retrieval or sparse matching techniques. Although these perform well in certain general tasks, they do not capture in depth the implicit regulatory relationships and legal terms specific to the financial field, so when facing financial documents the system easily overlooks critical compliance information or makes inference errors, which reduces its effectiveness and reliability. Third, prior-art retrieval ranking is usually based on semantic similarity rather than domain-specific compliance priorities, so critical information is not presented promptly and accurately, which may affect the compliance decisions of financial institutions and increase regulatory risk. Finally, because financial regulations are updated frequently, existing systems lack the dynamic adaptability needed to respond quickly to regulatory changes, which makes them insufficiently stable for long-term use in a regulatory environment.

Specifically, existing retrieval methods mainly match by calculating the semantic similarity between texts, that is, they find the documents or paragraphs closest in meaning to the query. However, this approach ignores the implicit regulatory relationships and domain standards unique to the financial field. The financial field has many specific rules and requirements (e.g., compliance requirements, legal terms, and financial indicators) that are often not expressed through direct semantic similarity but are tied to specific regulatory provisions, regulatory interpretations, or industry practices. Such implicit regulatory relationships and domain standards may not be captured by semantic similarity matching alone. As a result, existing retrieval methods may fail to identify and process this critical, domain-specific content, producing incomplete or inaccurate results. In other words, existing methods focus only on surface-level semantic matching and ignore some of the implicit regulations and standards that must be followed in the financial field, which may reduce the accuracy and reliability of compliance analysis.

Based on the above problems, the inventors propose the FinSage framework to fundamentally address the key difficulties in the compliance analysis of financial documents. The FinSage framework of the present application provides a comprehensive and robust solution by combining multimodal preprocessing, domain-aware retrieval strategies, and a compliance prioritization mechanism. First, to address multimodal data processing, FinSage converts heterogeneous data formats such as text, tables, and images into structured data through a file preprocessing component, which resolves the fragmentation of information in traditional methods and ensures that different types of data are processed efficiently within a unified framework. Second, a multi-path retrieval module combines sparse retrieval, dense retrieval, metadata-aware semantic search, and hypothetical document embedding (HyDE) retrieval to accurately capture the implicit regulatory relationships and legal terms in financial documents, so that key information is not missed. Third, for compliance prioritization, a domain-specific reordering module based on direct preference optimization ensures that compliance-critical content is presented first while irrelevant information is suppressed, improving document processing efficiency and the accuracy of compliance decisions. In summary, the retrieval-augmented generation (RAG) method provided by the application is capable of dynamic adaptation and real-time updating and can keep operating efficiently as financial regulatory standards change, thereby meeting the needs of modern financial institutions for compliance, accuracy, and flexibility. The specific implementation is as follows:

As shown in FIG. 1 and FIG. 2, the invention provides a RAG method for compliance analysis of multimodal financial documents, comprising:

Step S2, responding to a target query request;

The specific process is shown in FIG. 3 and FIG. 4. The file preprocessing module preprocesses an input financial document, which includes text encoding and semantic enhancement. In text encoding, an open-source PDF tool extracts multimodal blocks, including text, images, and tables, and a large language model (LLM) converts these blocks into text representations. Semantic enhancement consists of three parts: (a) removing redundant blocks through similarity comparison; (b) resolving coreferences within subsections; and (c) generating summaries of the subsections as metadata. The enhanced blocks are embedded into a vector database for subsequent processing. The client then receives a target query statement for a database query; for example, the user asks how Lotus Tech performed in 2024 and how it plans to scale up this year.

For the target query statement, the multi-path retrieval module first performs query transformation: query decomposition splits the target query statement into several independent sub-query statements, followed by coreference resolution and context integration. For example, the query is split into "What is the core marketing strategy for growth of Lotus Technology Inc. in 2025?" and "What are the sales data of Lotus Technology Inc. in 2024?". Based on the resulting sub-query statements, the multi-path retrieval module performs matching retrieval through a BM25 sparse retriever, a FAISS dense retriever, a metadata retriever, and a HyDE retriever, dynamically bundles text blocks according to a similarity threshold, and generates bundles for responding to the sub-query statements. The document reordering module then re-ranks the generated bundles, and a final answer responding to the target query statement is generated.

Step S11, acquiring a financial document for processing a target query request;

Step S12, decomposing the multimodal data of the financial document into three kinds of multimodal blocks (text, image, and table) in the original layout order by using the open-source PDF tool MinerU;

Step S13, converting the multimodal blocks into text blocks through a large language model, including: converting image blocks into structured text summaries through a visual language model as the text blocks of the images, and converting table blocks into structurally consistent text descriptions that highlight data trends as the text blocks of the tables;

Step S14, processing the text blocks based on a semantic enhancement technique, and outputting a text block set for constructing a structured vector database.

Further, in step S14, processing the text blocks based on the semantic enhancement technique and outputting a text block set for constructing the vector database includes:

Adding structured metadata to each text block using a large language model, the metadata including chapter titles, page locations, data types, and time information.

It should be specifically noted that financial documents often contain complex and diverse data formats, including unstructured text (e.g., narrative disclosures), semi-structured data (e.g., tables and images), and contextual metadata. These multimodal data tend to be interdependent and closely related and cannot be handled in isolation. For example, descriptive text typically covers detailed corporate financial information and market analysis, while tables and images present data trends and financial indicators; only the combination of the two truly reflects the financial situation of the company. However, conventional retrieval systems typically focus on text data and understand and index non-textual information such as images and tables poorly, which results in fragmented information or lost context during retrieval.

Because of the heterogeneity of these data types, traditional keyword-based retrieval methods (e.g., BM25-based sparse retrieval) may not fully capture the semantic relationships within documents, especially when the user's query involves complex financial questions. For example, keyword matching alone may ignore numerical trends in images or key data in tables, reducing retrieval accuracy. Modern multimodal methods, by contrast, need to process and understand different types of data simultaneously, so that all relevant information is considered comprehensively and extracted efficiently during retrieval.

In order to solve the challenges, the FinSage framework provided by the application uniformly converts information such as texts, tables, images and the like into a structured vector representation by adopting a multi-mode data file preprocessing component, so that different types of data are ensured to be uniformly processed under the same framework. The processing mode avoids the problem of fragmentation of the context information, improves the integration capability of the multi-mode information, and enables the retrieval system to respond to the complex query requirements in the financial field more comprehensively and accurately. The specific implementation process is as follows:

(1) Using the open-source PDF tool MinerU, the document content is parsed in natural reading order into three classes of blocks (chunks): text blocks, image blocks, and table blocks. Together these blocks form the block set of the document. Each block represents an independent portion of the document, which facilitates subsequent processing and analysis.

(2) Converting the multimodal blocks, including: ① Converting the information in an image into a structured text summary using a visual language model. For example, "Figure 1 shows that revenue increased by 15% in 2023" serves as the description of an image block, providing clearer semantic information for subsequent text processing and analysis. ② Converting the data in a table into a structurally consistent text description that highlights the trend of the data. For example, a table titled "Company operating costs" is converted into the description "Operating costs decreased by 8% compared with the previous year", which serves as the text description of the table block and makes the trends and changes behind the data easier to extract and understand.

(3) Semantic enhancement, including: ① Calculating the cosine similarity between different text blocks using SBERT (Sentence-BERT) sentence embeddings. When the similarity of two text blocks reaches or exceeds a preset threshold (e.g., 0.95), they are considered duplicates and are merged into one block, which removes redundant content and optimizes information storage and extraction. ② Adding structured metadata to each text block. The metadata includes the chapter title, page location, data type, and time information (e.g., "income statement - page 5 - table - September 1, 2024"), which helps to quickly locate and identify the source of the block and related information during subsequent retrieval. ③ Text normalization, including: (a) expanding abbreviations in the text to their full form, for example expanding "EBITDA" to "earnings before interest, taxes, depreciation, and amortization", to ensure that the text is normalized; (b) unifying numbers written in different formats, for example converting "$1.2M" into "1.2 million US dollars", to ensure that values are handled in a standardized way; (c) replacing legal terms with standardized forms, for example unifying the different expressions of "non-compliance" into a single standard term, to improve the readability and consistency of legal text.

(4) Outputting the processed text block set, wherein each text block consists of five components: the enhanced text, the metadata, the dense text embedding, the BM25 sparse vector, and the dense metadata embedding. This ensures that the multidimensional information of the document is represented in the structured vector database.
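The text normalization in item (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the abbreviation table, the suffix multipliers, the output format `USD 1,200,000`, and the function names are all hypothetical choices, not the normalization rules actually used by the framework.

```python
import re

# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {
    "EBITDA": "earnings before interest, taxes, depreciation, and amortization",
}
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def expand_abbreviations(text):
    """Replace each known abbreviation (as a whole word) with its full form."""
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return text

def normalize_amounts(text):
    """Rewrite amounts such as '$1.2M' in a single format, e.g. 'USD 1,200,000'."""
    def repl(match):
        value = float(match.group(1)) * MULTIPLIERS[match.group(2)]
        return f"USD {round(value):,}"
    return re.sub(r"\$(\d+(?:\.\d+)?)([KMB])\b", repl, text)

normalized = normalize_amounts(expand_abbreviations("EBITDA rose to $1.2M."))
```

A production system would need a much larger terminology table and would also handle currencies, percentages, and date formats; the sketch only shows the shape of the two substitutions.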

It should be specifically noted that, in general, the conventional RAG system mainly relies on dense search (such as semantic matching based on deep learning) or sparse lexical matching (such as BM25 algorithm), and these methods can provide better effects in general scenarios, but have the following limitations in application in the financial field:

① Conventional RAG systems often do not undergo specialized domain adaptation, and therefore, when processing financial documents, terms, regulatory requirements, or legal relationships of a particular domain may not be effectively captured, resulting in insufficient accuracy in answering complex compliance questions.

② Often, financial documents contain implicit or indirect regulatory requirements, and traditional search methods based on semantic similarity or keyword matching may not accurately capture these implicit relationships, thereby affecting the answers to compliance questions.

In order to solve the above problems, the present application introduces the following innovative steps:

① Combining sparse retrieval (e.g., BM25) with dense retrieval (e.g., semantic matching based on FAISS), and enhancing the understanding of legal terms, policies and regulations, and industry standards in financial documents through domain-specific fine-tuning. For example, the retrieval model is domain-adapted on custom data sets to identify regulatory keywords such as "major defects" and "non-compliance".

② In the retrieval process, the query is weighted using the metadata of the financial documents (such as chapter titles, table labels, and context summaries), so that retrieval can better understand the structural information of the documents and effectively capture the explicit and implicit regulatory relationships in them.

③ A large language model is used to generate hypothetical document paragraphs, extending the semantic scope of the query. By translating the question into a hypothetical scenario, this process helps the retrieval system identify and capture more implicit relationships. For example, for the query "liquidity risk factors", the system generates the hypothetical paragraph "a decline in the liquidity coverage ratio may affect solvency", which directs the retrieval system to mine the relevant regulatory document content.

④ By combining the results of multiple retrieval paths (dense retrieval, sparse retrieval, and metadata retrieval), the recall rate is improved and more relevant retrieval for complex financial questions is ensured. The scores of the paths are fused by weighting, which effectively improves the retrieval precision for relevant documents and ultimately helps ensure that the document content meets the compliance requirements.

Through the above steps, the system effectively overcomes the shortcomings of conventional RAG systems in the financial field, particularly in capturing regulatory requirements and implicit legal relationships, and improves its performance on financial compliance tasks. The specific implementation process is as follows:

(1) After receiving the user's original target query statement, query expansion is first carried out through the HyDE (Hypothetical Document Embeddings) method: hypothetical documents are generated to expand the query semantics, so that a wider range of relevant information is captured. For example, for the query "financial risk factors", the expanded query generated by HyDE may be "the query 'financial risk factors' may involve liquidity ratios or debt terms", which broadens the semantic scope of the original query.

(2) Based on the query expansion, the system improves the recall and relevance of retrieval through the following retrievers:

FAISS dense retriever: calculates the cosine similarity between the query and the text blocks based on dense text embeddings produced by the BGE-M3 model.

BM25 sparse retriever: matches keywords using the BM25 algorithm while weighting the metadata fields (e.g., chapter title weight ×2).

Metadata retriever: retrieves the chapter summaries related to the query through metadata embeddings, adding context information to the retrieved results. The chapter title and summary of each text block are concatenated into a unified metadata embedding, ensuring that all blocks in the same semantic section share the same metadata representation. When a single metadata instance is retrieved, all blocks in the same chapter are automatically associated, which strengthens cross-block context association and reduces retrieval ambiguity in multi-document scenarios.

The results are scored comprehensively through weighted score fusion (the weights are optimized through grid search), and the Top-50 candidate text blocks are selected. The retrieved candidate text blocks are then bundled and expanded: each candidate block is concatenated with its adjacent blocks to form a context-coherent bundle, which solves the fragmentation problem.
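The weighted score fusion over the retrieval paths can be illustrated with a short sketch. It rests on assumptions that the description does not state: the function names `min_max` and `fuse` are hypothetical, each retriever's scores are min-max normalized before weighting so that BM25 scores and cosine similarities are comparable, and the weights shown in the test data are placeholders rather than grid-searched values.

```python
def min_max(scores):
    """Rescale one retriever's scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {block_id: 1.0 for block_id in scores}
    return {block_id: (s - lo) / (hi - lo) for block_id, s in scores.items()}


def fuse(per_retriever, weights, top_k=50):
    """Weighted sum of normalized scores; returns the top_k (block_id, score)."""
    fused = {}
    for name, scores in per_retriever.items():
        if not scores:
            continue  # a retriever that returned nothing contributes nothing
        for block_id, s in min_max(scores).items():
            fused[block_id] = fused.get(block_id, 0.0) + weights[name] * s
    # Sort by descending fused score, breaking ties by block id for stability.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

A block that is missing from one retriever's result list simply receives no contribution from that path, which is one simple way to reward blocks found by several paths at once.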

In addition, the initial stage of the multi-path retrieval method further includes a query translation process, which optimizes the input query using a large language model and includes:

Query decomposition: splitting a complex query into separate sub-queries. For example, "2023 revenue growth and cost control measures" is decomposed into "2023 revenue growth rate" and "2023 cost control strategy".

Coreference resolution: replacing pronouns with specific entities. For example, "its financial risk" is converted to "XX company's financial risk".

Context integration: if the query involves a dialogue history, the context information is automatically associated.

Further, in step S34, dynamically bundling and expanding the candidate text blocks based on the similarity threshold to generate the bundle package for responding to the sub-query statement includes:

Calculating, based on dense embeddings, the cosine similarity of the text blocks adjacent to the candidate text blocks, which is used to judge the content relevance of those adjacent text blocks;

Judging whether the cosine similarity of the adjacent text blocks reaches the similarity threshold;

When the cosine similarity of the adjacent text blocks reaches the similarity threshold, adding the adjacent text blocks to the candidate block set and dynamically merging them with the candidate text blocks into a bundle; otherwise, they are not merged and remain independent.

It should be specifically noted that, to solve the problem of cross-block distribution of key information, a dynamic binding strategy is designed, including:

1. Initial retrieval: at the start of the multi-path retrieval stage, the document is broken into a plurality of independent text blocks, each of which participates in retrieval as a separate retrieval object. Each text block may be a paragraph, table, image, or other unit of information from the document. The system first evaluates the relevance of the text blocks to the query, and each retrieved text block is assigned a matching score for subsequent processing.

2. Adjacent expansion: the candidate set is dynamically expanded according to the similarity between the text blocks adjacent to the candidate text blocks and the user query, which solves the problem that key information is scattered across different blocks. For a text block that has been retrieved, the system checks the similarity between its adjacent text blocks (i.e., the preceding and following blocks) and the user query. The cosine similarity is calculated mainly using dense embeddings. When the cosine similarity between an adjacent text block and the user query is greater than the similarity threshold (for example, 0.85), the adjacent text block is considered to have sufficient semantic relevance. The similarity threshold is typically tuned to the document type and domain requirements to ensure that only text blocks highly relevant to the user query are merged. Through this expansion, semantically related adjacent text blocks are gathered together, avoiding the loss of fragmented information. As a binding example, the candidate text block "2023 revenue" is found while retrieving for the user query "2023 company net profit", and its adjacent "cost analysis" text block also has a high similarity to the user query (e.g., the similarity is 0.86, exceeding the similarity threshold of 0.85). The two text blocks are combined into a bundle, named "2023 financial performance", to form a semantically coherent, complete context unit. Text blocks in the bundle are not necessarily simply concatenated, but are combined according to their semantics and context. This binding helps provide a more consistent and comprehensive answer to the target query statement, especially when dealing with long text or fragmented information.

Through the above two steps, the problem of scattered and fragmented key information is effectively solved. First, text blocks are obtained through initial retrieval; then semantically related blocks are dynamically identified and merged through adjacent expansion; finally, a coherent context is generated by binding adjacent blocks, ensuring the integrity of the information and the accuracy of the answers.
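The adjacent expansion step can be sketched as follows. This is an illustration under stated assumptions: `cosine` and `bundle` are hypothetical names, the embeddings are assumed to be precomputed dense vectors, only the immediately preceding and following blocks are examined, and "reaches the threshold" is read as greater than or equal to.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bundle(candidate_idx, block_vecs, query_vec, threshold=0.85):
    """Return the sorted block indices forming the bundle around one candidate."""
    members = [candidate_idx]
    for neighbor in (candidate_idx - 1, candidate_idx + 1):
        if 0 <= neighbor < len(block_vecs):  # skip positions outside the document
            # Merge the neighbor only if it is itself relevant to the query.
            if cosine(block_vecs[neighbor], query_vec) >= threshold:
                members.append(neighbor)
    return sorted(members)
```

Returning indices rather than concatenated text keeps the original layout order available, so the caller can decide how the bundled blocks are combined.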

Further, the document reordering module calculates the semantic alignment between each bundle and the sub-query statement through a cross encoder, and combines it with the time reward score calculated by the time reward mechanism to obtain the matching score. The calculation formula is as follows:

;

Further, ranking and optimizing the matching scores via a direct preference optimization mechanism to obtain a corresponding set of preferred text blocks to respond to the target query statement and generate a corresponding answer includes:

Constructing positive and negative sample pairs, and using the direct preference optimization mechanism to adjust the BAAI/bge-reranker-v2-Gemma model by minimizing a cross entropy loss function, so as to optimize the ordering of the text blocks and form a corresponding set of preferred text blocks;

Further, the calculation formula of the cross entropy loss function is:

;

Specifically, the document reordering module uses BAAI/bge-reranker-v2-Gemma model as a base model, which is a multi-language and high-performance reordering model. The model can process text data from different languages and fields, and is suitable for cross-language and cross-field document ordering tasks.

The cross encoder (CrossEncoder) is used to calculate the relevance score between the query q and a text block. Its purpose is to evaluate the semantic alignment between the query and the candidate text blocks, ensuring that the returned text blocks are highly relevant to the query.

The time reward mechanism introduced by the application adjusts the score of a text block based on its release time. The time reward function dynamically adjusts the score according to the release time, so that recently released text blocks receive a higher reward (e.g., a reward coefficient of ×1.2 for text blocks released within 1 year). This mechanism ensures that the system favors more recent, more relevant content in the ranking.
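A matching score of this kind might be computed as in the sketch below. Several points are assumptions rather than statements of the application: the cross-encoder score and the sigmoid-mapped time reward are simply added, the weight and bias are scalars, the time reward is a step function (1.2 within one year, 1.0 otherwise, following the example above), and the names `time_reward`, `sigmoid` and `matching_score` are hypothetical.

```python
import math
from datetime import date


def time_reward(published, today, recent_days=365, boost=1.2):
    """Step-shaped reward: boost for blocks released within recent_days."""
    age_days = (today - published).days
    return boost if 0 <= age_days <= recent_days else 1.0


def sigmoid(x):
    """Map a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matching_score(ce_score, published, today, w=1.0, b=0.0):
    """Assumed form: cross-encoder score plus sigmoid(w * time_reward + b)."""
    return ce_score + sigmoid(w * time_reward(published, today) + b)
```

With equal cross-encoder scores, a block published within the last year ranks above an older one; in practice `w` and `b` would be learned or tuned rather than fixed.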

The direct preference optimization (DPO) in the application generates positive and negative sample pairs from the multi-path retrieval results. Positive samples are text blocks containing compliance keywords such as "major litigation"; negative samples are text blocks that are semantically related but not related to compliance, such as "corporate social responsibility". Through manual labeling and screening, the system learns to correctly distinguish compliance-related content from content that is not compliance-related.
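A pairwise preference loss over such positive and negative samples can be sketched as follows. The form used here, the mean of -log sigmoid(s_pos - s_neg), is a common pairwise logistic loss and is an assumption about what the application intends; it omits the reference-model term and the temperature that appear in the standard DPO objective for generative models. The names `softplus` and `pairwise_preference_loss` are hypothetical.

```python
import math


def softplus(x):
    """log(1 + exp(x)), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_preference_loss(pairs):
    """Mean of -log(sigmoid(s_pos - s_neg)) over (s_pos, s_neg) score pairs."""
    if not pairs:
        raise ValueError("at least one (positive, negative) pair is required")
    # -log(sigmoid(m)) equals softplus(-m), with m = s_pos - s_neg.
    return sum(softplus(s_neg - s_pos) for s_pos, s_neg in pairs) / len(pairs)
```

The loss shrinks toward zero as positive blocks are scored well above their paired negatives, and grows roughly linearly when the ordering is reversed.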

The iterative optimization flow of the DPO comprises the following steps:

Retrieval and labeling: candidate documents are obtained through a vector search engine such as FAISS. These candidate documents are then manually annotated to generate a preference data set comprising positive and negative sample pairs. In this way, high-quality training data is provided to the model, enabling it to accurately distinguish between compliance-related and unrelated content in subsequent learning.

Model evaluation: the performance of the model on the ranking task is measured using evaluation metrics such as NDCG (normalized discounted cumulative gain) and MRR (mean reciprocal rank). NDCG evaluates how well the ranking places relevant documents near the top, while MRR evaluates how early the first correct document appears in the results.

Dynamic adjustment: when the model performs poorly on a new test set, the data needs to be re-annotated and the parameters of the model need to be fine-tuned. In this way, the model can adapt to new data and scenarios, continually optimizing its performance.
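The two evaluation metrics named in the model evaluation step can be computed as in the following sketch. The function names are hypothetical, relevance labels are assumed to be given in ranked order, and the DCG variant shown uses the plain gain rel / log2(rank + 1).

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the top-k ranking divided by the DCG of the ideal ordering."""
    top = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)[:len(top)]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0


def mrr(ranked_hits):
    """Mean reciprocal rank; each item is a list of 0/1 hits in ranked order."""
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank
                break  # only the first correct result counts
    return total / len(ranked_hits)
```

A drop in either metric on a new test set is the kind of signal that would trigger the re-annotation and fine-tuning described in the dynamic adjustment step.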

Second embodiment

Based on the same inventive concept, the invention also provides a RAG system for multi-modal financial document compliance analysis, which executes the RAG method for multi-modal financial document compliance analysis, comprising:

The response module is used for responding to the target query request;

Further, the calculation formula of the matching score is:

;

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A RAG method for compliance analysis of multimodal financial documents, comprising:

Step S1: Preprocessing the input multimodal financial document, generating corresponding text blocks from the multimodal data in the multimodal financial document using a file preprocessing module, and vectorizing all the preprocessed text blocks and their metadata using a BGE-M3 dense encoder to construct a corresponding vector database;

Step S2: responding to the target query request;

Step S3: Entering a multi-path retrieval process based on the target query statement in the target query request; the multi-path retrieval process is a process of decomposing the target query statement into at least one or more sub-query statements, retrieving the text blocks matching the sub-query statements from the vector database through a multi-path retrieval module, and dynamically bundling and expanding the text blocks based on a similarity threshold to generate a bundle package for responding to the sub-query statements;

Step S4: Based on the bundles generated by the multi-path retrieval module, the semantic alignment between each bundle and the sub-query statement is calculated through the domain-specific document reranking module, and the corresponding time reward score is calculated in combination with the time reward mechanism to obtain the matching score. The matching score is ranked and optimized through the direct preference optimization mechanism to obtain the corresponding preference text block set to respond to the target query statement and generate the corresponding answer.

2. The RAG method for compliance analysis of multimodal financial documents according to claim 1, characterized in that, in step S1, the input multimodal financial document is preprocessed, and corresponding text blocks are generated from the multimodal data in the multimodal financial document using a file preprocessing module. All preprocessed text blocks and their metadata are vectorized using a BGE-M3 dense encoder to construct a corresponding vector database, including:

Step S11: obtaining the financial document for processing the target query request;

Step S12: using the open source PDF tool MinerU to decompose the multimodal blocks including text, images, and tables according to the original layout order of the multimodal data of the financial document;

Step S13: converting the multimodal block into the text block using a large language model, converting the image block into a structured text summary as the text block of the image using a visual language model, and converting the table block into a text description with consistent structure as the text block of the table;

Step S14: Processing the text blocks based on semantic enhancement technology, and outputting a text block set for constructing the vector database.

3. The RAG method for compliance analysis of multimodal financial documents according to claim 2, characterized in that, in step S14, the text blocks are processed based on semantic enhancement technology to output a set of text blocks for constructing the vector database, including:

Calculate the cosine similarity between different text blocks through SBERT sentence embedding, and when the cosine similarity exceeds a set threshold, merge the text blocks to complete redundancy removal;

Using the large language model to perform coreference resolution, pronouns in the same section are iteratively parsed, and replaced with clear entities based on contextual information. Different references representing the same entity are grouped into an equivalent set to resolve the problem of ambiguous references in the text block.

Adding structured metadata to each of the text blocks using the large language model, the metadata including chapter title, page location, and data type;

All the pre-processed text blocks and the metadata corresponding to the text blocks are subjected to text vectorization encoding through the BGE-M3 dense encoder and a vector database is constructed for querying the target query statement of the client.

4. The RAG method for compliance analysis of multimodal financial documents according to claim 3, characterized in that in step S3, according to the target query statement in the target query request, a multi-path retrieval process is entered, comprising:

Step S31: When the client initiates a target query request to query the database, the executor uses the large language model and natural language processing technology to decompose the target query statement into multiple independent sub-query statements, replaces the pronouns in the independent sub-query statements with the explicit entities through coreference resolution, and automatically associates the context information;

Step S32: the multi-path retrieval module calculates the similarity scores between the sub-query statement and the document block respectively through multiple search engines, wherein the search engines include a BM25 sparse search engine, a FAISS dense search engine, a metadata search engine, and a HyDE search engine;

Step S33: performing weighted fusion based on the similarity scores of each of the retrievers and the preset weight coefficients of the retrievers, sorting the text blocks according to the weighted fusion scores, and selecting the top K text blocks with the highest to lowest scores as candidate text blocks, where K is a preset natural number greater than 0;

Step S34: dynamically bundling and expanding the candidate text blocks based on the similarity threshold to generate the bundle package for responding to the sub-query statement.

5. The RAG method for compliance analysis of multimodal financial documents according to claim 4, characterized in that in step S34, the candidate text blocks are dynamically bundled and extended based on the similarity threshold to generate the bundle package for responding to the sub-query statement, comprising:

Performing preliminary retrieval on the candidate text blocks as independent units, and calculating corresponding dense embeddings using a pre-selected and trained text embedding model, wherein the dense embeddings are vector representations of the candidate text blocks;

Calculating the cosine similarity of adjacent candidate text blocks based on the dense embeddings to determine the content relevance of the adjacent candidate text blocks;

Determining whether the cosine similarity of the adjacent candidate text blocks reaches the similarity threshold;

When the cosine similarity of the adjacent candidate text blocks reaches the similarity threshold, the adjacent candidate text blocks are dynamically merged into a bundle; otherwise, they are not merged and remain independent.

6. The RAG method for compliance analysis of multimodal financial documents according to claim 5, characterized in that in step S4, based on the bundles generated by the multi-path retrieval module, the domain-specific document re-ranking module calculates the semantic alignment between each bundle and the sub-query statement, and the time reward mechanism is used to calculate the corresponding time reward points to obtain the matching score. The calculation formula is:

;

where the terms of the formula are, in order: the matching score between the sub-query statement and the candidate text block in the bundle; the semantic alignment between the bundle and the sub-query statement, calculated through a cross encoder; the Sigmoid function, which is used to map the calculation results to the interval [0, 1]; the transpose of the weight vector W; the time reward score; and the bias term.

7. The RAG method for compliance analysis of multimodal financial documents according to claim 6, characterized in that, in step S4, the matching scores are ranked and optimized using a direct preference optimization mechanism to obtain a corresponding set of preferred text blocks to respond to the target query and generate a corresponding answer, including:

Preliminarily calculating the matching score between the text block and the query statement in the BAAI/bge-reranker-v2-Gemma model of the document reranking module;

Constructing positive and negative sample pairs, and using the direct preference optimization mechanism to adjust the BAAI/bge-reranker-v2-Gemma model weights by minimizing the cross-entropy loss function to optimize the order of the text blocks and form a corresponding set of preferred text blocks;

A final answer is generated in response to the target query statement based on the preferred text block set.

8. The RAG method for compliance analysis of multimodal financial documents according to claim 7, wherein the calculation formula of the cross entropy loss function is:

;

where the terms of the formula are, in order: the matching score of the positive sample, the matching score of the negative sample, and the expected value over the positive and negative samples.

9. A RAG system for compliance analysis of multimodal financial documents, executing the RAG method for compliance analysis of multimodal financial documents according to any one of claims 1 to 8, characterized by comprising:

a file preprocessing module for preprocessing an input multimodal financial document, generating corresponding text blocks from the multimodal data in the multimodal financial document, vectorizing all preprocessed text blocks and their metadata using a BGE-M3 dense encoder, and constructing a corresponding vector database;

A response module, used to respond to target query requests;

A multi-path retrieval module is configured to enter a multi-path retrieval process based on a target query statement in a target query request; the multi-path retrieval process is a process of decomposing the target query statement into at least one or more sub-query statements, retrieving the text blocks matching the sub-query statements from the vector database, and dynamically bundling and expanding the text blocks based on a similarity threshold to generate a bundle for responding to the sub-query statements;

A document re-ranking module, based on the bundles generated by the multi-path retrieval module, calculates the semantic alignment of each bundle with the sub-query statement through a domain-specific document re-ranking module, calculates the corresponding time reward score in combination with a time reward mechanism to obtain a matching score, and optimizes the ranking of the matching score through a direct preference optimization mechanism to obtain a corresponding set of preferred text blocks to respond to the target query statement and generate a corresponding answer.

10. The RAG system for compliance analysis of multimodal financial documents according to claim 9, wherein the calculation formula for the matching score is:

;