
CN120875891A - Payment compliance checking method and device and electronic equipment - Google Patents

Payment compliance checking method and device and electronic equipment

Info

Publication number
CN120875891A
CN120875891A (application CN202510992359.XA)
Authority
CN
China
Prior art keywords
information
node
entities
compliance
payment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510992359.XA
Other languages
Chinese (zh)
Inventor
赵璐
王振东
周婷
龙腾凤
邓对义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Financial Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Financial Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Financial Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202510992359.XA priority Critical patent/CN120875891A/en
Publication of CN120875891A publication Critical patent/CN120875891A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00Payment architectures, schemes or protocols
    • G06Q20/38Payment protocols; Details thereof
    • G06Q20/42Confirmation, e.g. check or permission by the legal debtor of payment
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


This invention provides a payment compliance review method, apparatus, and electronic device, belonging to the technical field of natural language processing. The method includes: acquiring information to be reviewed in a business scenario of payment compliance review; performing information extraction on the information to be reviewed to obtain information to be processed, which includes target keywords of the information to be reviewed; retrieving the target keywords based on a pre-constructed graph database with a three-level structure to obtain a first embedding vector representation of the retrieval information related to the target keywords in each structural layer, the three levels being a structural layer of text blocks in the payment-related information, a graph structural layer representing the relationships between entities in the text blocks, and a graph structural layer representing the relationships between communities of the entities; and performing generation enhancement processing based on the first embedding vector representation to obtain the compliance review result of the information to be reviewed.

Description

Payment compliance checking method and device and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to a payment compliance checking method, a payment compliance checking device and electronic equipment.
Background
With the rapid development of the payment industry, particularly the rise of third-party payment institutions, payment services are growing in both scale and complexity. To ensure the compliance of payment services, regulatory authorities have formulated strict laws and regulations, such as the measures governing bank card acquiring business and the online payment business of non-bank payment institutions. These laws and regulations aim to ensure that the operation and management of payment institutions meet national regulations, avoid violations, protect consumer rights and interests, and maintain the stability of the financial system.
Payment compliance screening is an important component of the payment business: payment institutions are required to periodically review and audit their transaction data, account management, merchant cooperation, and so on to ensure compliance. However, because of the complexity of the payment service itself, the sheer volume of data, and the diversity of regulatory requirements, payment institutions face many challenges when performing compliance reviews.
Related payment compliance checking methods rely mainly on rule matching and threshold judgment over structured data: a rule base is constructed, payment transaction data, account information, and the like are collected periodically or in real time, and simple rule matching and judgment are performed. Such techniques often fail to process unstructured data effectively and show significant limitations when dealing with complex multimodal data (e.g., free text, regulatory provisions, contracts), so potential compliance issues are easily missed.
Disclosure of Invention
The embodiment of the invention provides a payment compliance checking method, a device, and electronic equipment, which are used to solve the technical problem in the prior art that compliance issues are easily missed in a business scenario of payment compliance checking.
In a first aspect, an embodiment of the present invention provides a payment compliance checking method, the method including:
acquiring information to be checked in a business scenario of payment compliance checking;
performing information extraction based on the information to be checked to obtain information to be processed, wherein the information to be processed includes target keywords of the information to be checked;
searching for the target keywords based on a pre-constructed graph database with a three-level structure to obtain a first embedded vector representation of retrieval information related to the target keywords in each structural layer, wherein the three levels are respectively a structural layer of text blocks in payment-related information, a graph structural layer representing relationships between entities in the text blocks, and a graph structural layer representing relationships between communities of the entities; and
performing generation enhancement processing based on the first embedded vector representation to obtain a compliance review result of the information to be checked.
In a second aspect, an embodiment of the present invention provides a payment compliance checking device, the device comprising:
an acquisition module, configured to acquire information to be checked in a business scenario of payment compliance checking;
an information extraction module, configured to perform information extraction based on the information to be checked to obtain information to be processed, wherein the information to be processed includes target keywords of the information to be checked;
a retrieval module, configured to search for the target keywords based on a pre-constructed graph database with a three-level structure to obtain a first embedded vector representation of retrieval information related to the target keywords in each structural layer, wherein the three levels are respectively a structural layer of text blocks in payment-related information, a graph structural layer representing relationships between entities in the text blocks, and a graph structural layer representing relationships between communities of the entities; and
a generation enhancement processing module, configured to perform generation enhancement processing based on the first embedded vector representation to obtain a compliance review result of the information to be checked.
In a third aspect, an embodiment of the present invention provides an electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program implementing the steps of the payment compliance checking method described above when executed by the processor.
In a fourth aspect, embodiments of the present invention provide a readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the payment compliance checking method described above.
In a fifth aspect, embodiments of the present invention provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the payment compliance checking method described above.
According to the embodiment of the invention, information extraction is performed on the information to be checked to obtain information to be processed, which includes target keywords of the information to be checked. The target keywords are then searched based on a pre-built graph database with a three-level structure to obtain first embedded vector representations of retrieval information related to the target keywords in each structural layer, the three levels being a structural layer of text blocks in the payment-related information, a graph structural layer representing relationships between entities in the text blocks, and a graph structural layer representing relationships between communities of the entities. Finally, generation enhancement processing is performed based on the first embedded vector representations to obtain a compliance review result for the information to be checked. By introducing this multi-level graph structure, key information can be extracted dynamically from both structured and unstructured data, and information can be organized and retrieved efficiently, which is particularly suitable for business scenarios of payment compliance checking. Constructing the multi-level graph structure allows the complex relationships between transaction data and merchants to be audited more comprehensively, so that potential compliance problems can be found effectively, especially in large-scale payment networks.
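The claimed pipeline can be illustrated with a minimal sketch. This is not the patent's implementation: all function names, the toy in-memory "graph database", and the keyword filter are invented stand-ins for the LLM extraction, three-level retrieval, and generation steps described above.

```python
def extract_keywords(text):
    # Stand-in for the LLM-based keyword extraction of step 102: naive word filter.
    return [w for w in text.split() if len(w) > 3]

def graph_db_search(layer, keywords):
    # Stand-in for the per-layer retrieval of step 103 against a toy in-memory store.
    db = {
        "text_block": {"transaction": "text block about transaction limits"},
        "entity_graph": {"transaction": "merchant -[records]-> transaction"},
        "community_graph": {"transaction": "community: limit-management rules"},
    }
    return [db[layer][k] for k in keywords if k in db[layer]]

def compliance_review(info_to_review):
    keywords = extract_keywords(info_to_review)
    retrieved = []
    for layer in ("text_block", "entity_graph", "community_graph"):
        retrieved.extend(graph_db_search(layer, keywords))
    # Stand-in for the generation enhancement of step 104: concatenate the context.
    return "; ".join(retrieved) if retrieved else "no relevant rules found"

print(compliance_review("is this transaction compliant"))
```

In the real method, each of the three stand-ins would be replaced by a large model, the three-level graph database, and a retrieval-augmented generator, respectively.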
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for compliance review of payments provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a three-level structure diagram database according to an embodiment of the present invention;
FIG. 3 is a diagram showing data in a question-and-answer scenario in compliance review provided by an embodiment of the present invention;
FIG. 4 is a block diagram of an implementation in a compliance review questioning and answering scenario provided by an embodiment of the present invention;
FIG. 5 is a diagram illustrating an implementation of GraphRAG data construction provided by an embodiment of the present invention;
FIG. 6 is a data presentation diagram of GraphRAG data construction provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a payment compliance checking device provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related technology, a compliance audit model based on a knowledge graph has been proposed, which identifies audit points by constructing rule and feature knowledge graphs and matching with a cosine similarity algorithm. Although this technique is innovative in handling rule matching, it still relies on manual judgment and auditing, is limited in processing unstructured data, and is particularly weak in complex transaction flows and multimodal data fusion scenarios.
It can be seen that the prior art has the following technical drawbacks:
1. The existing scheme locates audit points through knowledge-graph comparison, but auditors are still required to make manual judgments in combination with the business background and specific circumstances, so fully automatic compliance checking cannot be achieved. This reliance on manual work is inefficient in large-scale audit tasks and cannot meet real-time audit requirements.
2. Unstructured data processing capability is limited: the existing scheme mainly targets structured audit data and has major limitations in processing unstructured data (such as contract text, laws, and regulations). Although a knowledge graph is used, comprehensive and accurate auditing is difficult to achieve when complex text information is involved.
3. The existing scheme is mainly based on structured knowledge-graph data and can hardly process and integrate multiple kinds of data (such as text, transaction records, and merchant information) at the same time, so it struggles to work efficiently in payment compliance checking, especially in scenarios involving multidimensional data.
4. Rule extensibility and flexibility are poor: the existing scheme depends on predefined rules and knowledge patterns, so when emerging payment business compliance requirements arise, its extensibility is insufficient and it is difficult to adapt quickly to new compliance requirements.
Current payment review and compliance review flows rely on a large number of manual operations, especially when processing multiple types of data (including structured data such as transaction records, and unstructured data such as contract text and regulatory clauses). How to automate the processing of such data and conduct compliance reviews remains a technical challenge. In addition, during the payment compliance review process, data in different formats (e.g., adjacency lists, natural language descriptions, code forms, syntax trees, graph embeddings) need to be processed. Moreover, compliance issues in payment transactions involve multiple levels of information, such as fine-grained information about a transaction, associations between the transaction parties, and the global transaction network across merchants. Transferring this information efficiently across multiple layers and performing semantic analysis on it is difficult for the prior art.
The application provides a payment compliance checking method based on Graph Retrieval-Augmented Generation (GraphRAG), which can automatically process complex transaction data and merchant information through graph retrieval over a three-level structure, generate high-precision compliance review results, support dynamic multi-format data processing, and achieve more efficient and automated payment compliance checking.
The business flow of compliance review is as follows:
1. Data processing:
Transaction data, merchant information, account data, etc. are collected from the payment system.
The system cleans and formats the data for subsequent compliance determination.
2. Compliance review:
the system checks, item by item against the existing rule base, whether the transaction data, account management data, merchant information, and so on comply with the relevant regulations.
For the parts that do not meet the rules, the system automatically generates detailed reasons for the non-compliance and cites the relevant regulatory terms as the basis.
3. Correction proposal generation:
the system automatically generates a corresponding rectification proposal for each non-compliant item. The advice may include updating merchant information, adjusting payment limits, supplementing legal documents, and so on.
4. Compliance report generation and output:
the system finally generates a compliance report containing compliance judgments, detailed descriptions of non-compliant items, the regulatory basis, and rectification advice.
Reports may be output in a standardized format for use by an internal inspection department.
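Steps 2 through 4 of this business flow can be sketched as a simple rule loop. The rules, field names, and thresholds below are invented examples for illustration only; the patent's system would draw them from the rule base and graph retrieval described elsewhere.

```python
# Each rule: (name, check function over a record, rectification proposal).
RULES = [
    ("personal account daily limit", lambda r: r["daily_amount"] <= 5000,
     "adjust payment limits"),
    ("merchant legal representative on file", lambda r: bool(r["legal_rep"]),
     "update merchant information"),
]

def review(record):
    # Check the record rule by rule and build report entries (step 2);
    # attach a rectification proposal for each non-compliant item (steps 3-4).
    report = []
    for name, check, proposal in RULES:
        if check(record):
            report.append({"rule": name, "compliant": True})
        else:
            report.append({"rule": name, "compliant": False, "proposal": proposal})
    return report

print(review({"daily_amount": 8000, "legal_rep": ""}))
```

A real system would serialize such entries into the standardized report format mentioned above rather than printing them.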
The present application is mainly directed to data processing and compliance screening, and the payment compliance screening method provided in the embodiments of the present application is first described below.
Fig. 1 is a flow chart of a payment compliance review method according to an embodiment of the present invention, as shown in fig. 1, where the method includes:
Step 101, obtaining information to be checked in a business scene of payment compliance checking;
Step 102, extracting information based on the information to be checked to obtain information to be processed, wherein the information to be processed comprises target keywords of the information to be checked;
Step 103, searching the target keyword based on a pre-constructed three-level structure graph database to obtain a first embedded vector representation of search information related to the target keyword in each structural layer, wherein the three levels of structures are respectively a structural layer of a text block in payment related information, a graph structural layer representing a relationship among entities in the text block and a graph structural layer representing a relationship among communities of the entities;
Step 104, performing generation enhancement processing based on the first embedded vector representation to obtain a compliance review result of the information to be reviewed.
In step 101, the information to be checked may be structured data or unstructured data, and it differs according to the business scenario of payment compliance checking. Payment compliance checking mainly covers the following business scenarios:
1. Merchant information compliance check: whether the merchant updates information such as its legal representative, cooperation agreements, and business address in a timely manner, as required by law.
2. Account function and quota management: whether the function settings and quota management of the payment account comply with the regulations, in particular whether a personal payment account exceeds the allowed quota.
3. Transaction data inspection: whether abnormal transactions exist and whether the transaction flow meets legal and regulatory requirements.
4. Periodic inspection and report generation: periodically and comprehensively checking data on payment accounts, merchants, transactions, and so on, generating compliance reports, listing non-compliant items, and providing correction suggestions.
5. Payment compliance question-and-answer scenario: answering a client-side user question about payment compliance checking and returning the answer.
In some embodiments, payment data, merchant information, payment accounts, and the like from the payment system backend may be obtained as the information to be checked; in other embodiments, question information from the payment system frontend may be obtained as the information to be checked.
In step 102, the information extraction may include keyword extraction and may further include query decomposition: based on the target keywords, a large model may decompose the information to be checked into query sub-questions, and this query decomposition provides query enhancement for the information to be checked.
Keyword extraction is the starting point of the whole review and relies on a large model to understand the input natural language text. For example, the information to be checked may be a question input by the user: "Is there any non-compliance in a certain user's payment account transactions during a certain period of time?"
After keyword extraction, query decomposition is performed using a large model. The purpose of query decomposition is to break a complex natural language question into several independent query sub-questions. Taking a payment check as an example, the decomposed queries may include "account type check", "limit check", "daily limit check", and so on, so that the rules along the payment path can be enhanced by the queries. For example, decomposing the user-entered question "Is there any non-compliance in a user's payment account transactions over a certain period of time?" may yield sub-questions such as "1. What is the user's total transaction amount?" and "2. What is the user's maximum transaction amount?"
In this step, a large language model may be used to automatically extract keywords from natural language text and perform semantic analysis to better understand and process complex user queries; it identifies key entities and trade relationships by intelligently analyzing user-entered questions (e.g., payment compliance questions). Through the semantic analysis capability of the large language model, compliance checking can process user queries more intelligently and improve overall review efficiency, especially when complex unstructured text and regulations are involved.
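The query decomposition step can be mimicked without an LLM by mapping each extracted keyword onto a sub-query. The template table below is an invented illustration; in the patented method an LLM would generate these sub-questions freely.

```python
def decompose(question, keywords):
    # Stand-in for LLM query decomposition: map each extracted keyword to a
    # canned sub-query template (templates are invented examples).
    templates = {
        "account": "What is the user's account type?",
        "amount": "What is the user's total transaction amount?",
        "limit": "What is the user's maximum transaction limit?",
    }
    return [templates[k] for k in keywords if k in templates]

subs = decompose("Is the user's payment account transaction compliant?",
                 ["account", "amount", "limit"])
print(subs)
```

Each resulting sub-query would then be retrieved against the three-level graph database independently, which is what makes the decomposition act as query enhancement.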
In step 103, a multi-level graph structure is introduced, comprising a structural layer of text blocks in the payment-related information, a graph structural layer representing the relationships between entities in the text blocks, and a graph structural layer representing the relationships between communities of the entities. This graph structure is used to dynamically extract key information from structured and unstructured data and to organize and retrieve information efficiently, which makes it particularly suitable for business scenarios of payment compliance checking. By constructing the multi-level graph structure, the complex relationships between transaction data and merchants can be audited more comprehensively, and potential compliance problems can be found effectively, especially in large-scale payment networks.
The three-level structure of the graph database is shown in FIG. 2; the levels are the low level, the middle level, and the high level.
The low level is the structural layer of text blocks in the payment-related information. It contains the original block data: the information of all initial text blocks in the payment-related information sits at this level.
The middle level is the graph structural layer representing the relationships between entities in the text blocks. It is an entity-relationship graph that shows the different entities and the relationships between them, such as the relationships between a merchant and its legal representative, its transaction records, and so on.
The high level is the graph structural layer representing the relationships between communities of the entities. The graph can be divided into a plurality of subgraphs based on the community structure produced by community detection; each subgraph represents a community and can correspond to a specific compliance sub-problem, such as merchant information or transaction limits.
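A minimal in-memory model of this three-level structure might look as follows. The class and field names are invented for illustration; the patent does not specify a storage schema, and a production system would use a real graph database.

```python
from dataclasses import dataclass, field

@dataclass
class ThreeLevelGraph:
    text_blocks: dict = field(default_factory=dict)   # low level: block_id -> raw text
    entity_edges: list = field(default_factory=list)  # middle: (entity, relation, entity, block_id)
    communities: dict = field(default_factory=dict)   # high level: community_id -> set of entities

g = ThreeLevelGraph()
g.text_blocks["b1"] = "Merchant M's legal representative is Z; daily limit 5000."
g.entity_edges.append(("Merchant M", "legal_representative", "Z", "b1"))
g.communities["merchant_info"] = {"Merchant M", "Z"}

# Middle-level edges point back to the low-level block they were extracted from,
# and high-level communities group middle-level entities.
print(g.entity_edges[0][3])
```

The back-references between levels are what allow retrieval at one level to be refined by (or routed to) the adjacent levels, as described for step 103.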
In some embodiments, the graph database may, in combination with a vector library, perform information retrieval on the target keywords to obtain the first embedded vector representation of the retrieval information related to the target keywords in each structural layer, as well as the embedded vectors corresponding to a candidate data set matching the target keywords. Vector retrieval may be performed on the vectors of the target keywords based on a pre-constructed vector library, which contains vectors of the payment-related information, to obtain the candidate data set matching the target keywords.
In some embodiments, information retrieval may be performed in parallel at a lower layer, an intermediate layer, and a higher layer based on a target keyword, resulting in a first embedded vector representation of retrieved information associated with the target keyword in each structural layer.
In some embodiments, information retrieval may be performed serially at a lower level, an intermediate level, and a higher level based on a target keyword, resulting in a first embedded vector representation of retrieved information associated with the target keyword in each structural level. The step 103 specifically includes:
based on the target keywords, searching sequentially along the retrieval route from the structural layer of the text blocks, to the graph structural layer representing the relationships between entities in the text blocks, to the graph structural layer representing the relationships between communities of the entities, so as to sequentially obtain the retrieval information related to the target keywords in each structural layer; and
performing embedded vector characterization in turn on the retrieval information related to the target keywords in each structural layer, to obtain the first embedded vector representation of the retrieval information related to the target keywords in each structural layer.
In some embodiments, information retrieval may be performed at the low, middle, and high levels in order: when retrieval is performed at the middle level, its retrieval information may be determined based on the retrieval information of the low level, and the retrieval information of the high level may in turn be determined based on that of the middle level. In this way, information flows between the low, middle, and high levels through a three-level information routing mechanism that lets each level focus on its particular level of semantic abstraction, ensuring efficient and accurate information processing. Cross-level data processing and delivery can thus be optimized, and in payment compliance review in particular, potential violation risks can be identified at different levels of abstraction.
In some embodiments, a hybrid search over the vector library and the graph database is used: all target keywords are vectorized and a vector search is performed on the vector library with a similarity method, finding the candidate data set matching the target keywords. For all target keywords, a network embedding model such as a graph convolutional network (GCN) may be used to aggregate the neighboring node information of each target keyword into the query node, and local or global retrieval may be performed in the graph database.
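The vector-library half of this hybrid search reduces to ranking stored vectors by similarity to the query vector. The sketch below uses cosine similarity over hand-made three-dimensional vectors; in practice the vectors would come from a pre-trained embedding model, and the rule names are invented.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy vector library: stored text -> its (invented) embedding.
library = {
    "rule: personal account limit": [1.0, 0.0, 0.5],
    "rule: merchant info update":   [0.0, 1.0, 0.2],
}

query_vec = [0.9, 0.1, 0.4]  # stand-in for the embedded target keyword
best = max(library, key=lambda k: cosine(query_vec, library[k]))
print(best)
```

The graph-database half of the hybrid search (GCN-style neighbor aggregation) would then expand these top matches with their graph neighborhood before the results are routed through the three levels.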
In some embodiments, global search may be performed in conjunction with the GCN, and finer-grained entity relationships may be designed for local search in conjunction with the data processing scheme for text blocks, which improves the efficiency of both global and local search. Meanwhile, the high, middle, and low levels of GraphRAG each handle information flows and semantic routing at a different level of abstraction: the low level handles fine-grained information, ensuring that text blocks are used for local compliance analysis; the middle level handles local relationships, ensuring that relationships between entities in the graph, such as the relationship between a merchant and its legal representative, are used for local compliance analysis; and the high level handles global information, ensuring that relationships between communities in the graph are identified and used for global compliance analysis. Referring to FIG. 3, level-1 information (text blocks), level-2 information (relationships between entities), and level-3 information (relationships between communities) can be retrieved based on GraphRAG.
In some embodiments, the sequentially performing embedded vector characterization on the search information related to the target keyword in each structural layer to obtain a first embedded vector representation of the search information related to the target keyword in each structural layer includes at least one of the following:
Calculating the sentence-to-sentence similarity between every two first text blocks based on at least two first text blocks related to the target keyword in the structural layer of the text blocks, constructing a similarity matrix based on the sentence-to-sentence similarity between every two first text blocks, converting the similarity matrix into a distance matrix, determining a first embedded vector representation of the first text blocks based on the distance matrix, and the retrieval information comprises the at least two first text blocks;
Based on the entity relationships among the third nodes related to the target keyword in the graph structure layer representing the relationships between entities in the text blocks, aggregating the entity relationships of the third nodes at different model levels by using a first network embedding model, and outputting the first embedded vector representation of the entities corresponding to the third nodes, wherein the third nodes are related to the first text blocks, and the retrieval information comprises the entities corresponding to the third nodes and the entity relationships among the third nodes;
based on the graph structure layer representing the relationships between the communities of the entities, aggregating the neighbor information connected by the edges of the third nodes and the community information of the third nodes at different model levels by using a second network embedding model, and outputting the first embedded vector representation of the communities corresponding to the third nodes, wherein the retrieval information comprises the neighbor information connected by the edges of the third nodes and the community information of the third nodes.
The process of retrieving the target keyword is described below as an example.
Step one, keyword extraction: the question input by the user is processed in natural language through a large language model (Large Language Model, LLM) to extract core entities and concepts. For example, for the user-entered query "Is the user's payment transaction amount compliant?", the following are extracted:
Entity 1: user
Entity 2: transaction amount
Meanwhile, query decomposition and enhancement are performed on the above question by using the LLM, and query sub-questions can be obtained, as shown in fig. 3, such as "1. What is the user's total transaction amount?".
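The keyword-extraction step above can be sketched as a prompt to an LLM followed by a simple parse. This is a minimal illustration, not the application's implementation: `llm` is any callable mapping a prompt string to a completion string, and `stub_llm` is a canned stand-in so the sketch runs without a model backend.

```python
def extract_keywords(question, llm):
    # Prompt an LLM to pull core entities/concepts out of the user's question,
    # one per line, then parse the completion back into a list.
    prompt = ("Extract the core entities from the question, one per line:\n"
              f"Question: {question}\nEntities:")
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def stub_llm(prompt):
    # Canned completion standing in for a real LLM call (assumption, for illustration).
    return "user\ntransaction amount"

keywords = extract_keywords("Is the user's payment transaction amount compliant?", stub_llm)
```

In the real pipeline the same call pattern would also produce the query sub-questions shown in fig. 3.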
Step two, hybrid retrieval is started (preliminary retrieval).
The extracted target keywords are used for preliminary retrieval. As shown in fig. 3, after all the target keywords are converted into vectors, retrieval is performed in the vector library: a semantic similarity search is executed to find the context data or history records matching the target keywords (corresponding to the rule natural-language descriptions in fig. 3), which form the candidate data set. The vector library converts the target keywords input by the user into embedded vectors through a pre-trained model and performs similarity matching with the embedded vectors in the vector library, according to the following formula:
V(e) = argmax_{d ∈ D} Sim( E(e), E(d) )

where V(e) is the preliminary result set of vector retrieval, i.e., the detailed description of the relevant rule statements; E(e) is the embedded representation of the entity entered by the user; E(d) is the embedded representation of candidate document d, representing a regulation-related document; Sim(·,·) is the similarity function, where cosine similarity can be used to measure the similarity between the input and the document; and argmax denotes finding the d with the largest similarity value in the candidate document set D. The result V(e) is in natural language form, for example: user limit rule 1, the annual cumulative payment limit is not more than 100 yuan; user limit rule 2, each transaction is not more than 50 yuan.
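The cosine-similarity retrieval described above can be sketched as follows. The 3-dimensional "embeddings" are toy values standing in for a pretrained encoder's output, and the rule texts are illustrative, not the application's actual rule base.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, doc_vecs, doc_texts, top_k=1):
    # Rank candidate rule documents by similarity to the query embedding
    # and return the best-matching natural-language rule statements
    # (the argmax over Sim(E(e), E(d)) from the formula above).
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    order = np.argsort(scores)[::-1][:top_k]
    return [doc_texts[i] for i in order]

# Toy embeddings (assumption: a real encoder would produce these).
query = np.array([1.0, 0.2, 0.0])
docs = [np.array([0.9, 0.1, 0.0]),   # close to the query
        np.array([0.0, 1.0, 0.9])]   # unrelated rule
texts = ["annual cumulative limit not more than 100 yuan",
         "per-transaction limit not more than 50 yuan"]
best = retrieve(query, docs, texts)
```

With a larger rule base, `top_k` would return the full candidate data set rather than a single match.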
The extracted target keywords are then processed through the three-level structure (low layer, middle layer, high layer) of GraphRAG, handling local and global information-flow retrieval for more accurate compliance analysis. The query at each layer reflects that layer's focus, and the results of each layer are spliced and converted in the generation enhancement processing step.
Low layer (relationships between statement blocks): the low layer processes the raw data after partitioning, such as the user's transaction records, merchant-related information, and rule-related statement blocks. The operations of this layer serve to capture the direct relationships between these underlying pieces of information.
The calculation process is as follows:
The similarity between sentences is calculated as follows:

Sim( CH_i, CH_j ) = ( v_i · v_j ) / ( ||v_i|| · ||v_j|| )

where Sim(CH_i, CH_j) is the cosine similarity between statement blocks CH_i and CH_j, and v_i, v_j are their embedded vectors.
Sentence relation matrix construction: a similarity matrix R is constructed, where R_ij = Sim(CH_i, CH_j).
Input examples:
Statement block 1: "On January 1, 2024, the user pays 100 yuan to merchant A."
Statement block 2: "On January 2, 2024, the user pays 200 yuan to merchant B."
Statement block 3: "The lifetime cumulative payment transaction amount across all of the user's accounts is less than 100 yuan."
Where sentence block 1, sentence block 2, and sentence block 3 are text blocks in the lower layer that match the target keyword, i.e., first text blocks.
The output example is the similarity matrix R as follows:
Then, the similarity information of the statement blocks is converted into a distance matrix D by using multidimensional scaling (Multidimensional Scaling, MDS); that is, the similarity matrix R is converted into the distance matrix D, for example by the conversion D_ij = 1 - R_ij. The MDS method takes the distance matrix D as input and calculates a low-dimensional embedded vector for each statement block, i.e., the first embedded vector representation of the retrieval information related to the target keyword in the low layer, denoted h_i^low.
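The similarity-to-distance conversion and MDS embedding can be sketched with classical (Torgerson) MDS, which double-centers the squared distance matrix and eigendecomposes it. The similarity values and the conversion D = 1 - R are illustrative assumptions, since the patent's exact conversion formula is not reproduced here.

```python
import numpy as np

# Pairwise cosine similarities between the three statement blocks (toy values).
R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

# Convert similarity to distance; one common convention is D_ij = 1 - R_ij.
D = 1.0 - R

# Classical MDS: double-center the squared distances and eigendecompose.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of the configuration
eigvals, eigvecs = np.linalg.eigh(B)
idx = np.argsort(eigvals)[::-1][:2]          # keep the top-2 components
coords = eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))
# coords[i] is the low-dimensional embedding h_i^low of statement block i.
```

Statement blocks that are more similar (blocks 1 and 2) end up closer together in the embedded space than dissimilar ones (blocks 1 and 3).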
Middle layer (local entity relationships): the middle layer handles the construction of local entity relationships, revealing the links between entities (such as users, merchant users, legal representatives, corporate information, etc.). This layer is typically used to refine the analysis within a local community, ensuring the legitimacy of each subtask. The calculation process is as follows:
The node routing formula of the GCN is:

h_i^(l+1) = σ( Σ_{j ∈ N(v_i)} 1/√(d_i · d_j) · W^(l) · h_j^(l) )

where h_i^(l) is the embedded representation of entity node v_i at layer l; layer l refers to the hierarchy of the network embedding model, used for capturing broader entity relationships at different levels; h_i^(l+1) is the embedded vector of node v_i at layer l+1, updated from the current layer's representations by the neighbor aggregation operation; W^(l) is the middle-layer weight matrix, which captures the relationships among the entities; N(v_i) is the neighbor set of node v_i; d_i, d_j are the degrees of nodes v_i and v_j; and σ is the activation function.
Input examples:
Entity node:
user node user U1
Merchant node merchant M1 and merchant M2
Rule node for accumulating transaction amount
Entity relationship:
User U1 and transaction amount 100 yuan (merchant M1)
User U1 and transaction amount 200 yuan (merchant M2)
User U1 and rule node relationship (less than 100 yuan)
The input entity node is a third node, and is associated with the first text block, and under the condition that the lower layer retrieves the first text block, the entity relationship between the third node and the third node can be obtained.
Output example: each entity's embedded representation is obtained, i.e., the first embedded vector representation of the retrieval information related to the target keyword in the middle layer, which reflects neighbor information and relationships, denoted h_i^mid.
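The middle-layer node update above can be sketched as one GCN propagation step over the example entity graph. The weight matrix is a toy constant (in practice it would be learned), and one-hot features stand in for real entity embeddings.

```python
import numpy as np

def gcn_layer(H, A, W):
    # One GCN step: h_i^(l+1) = relu( sum_j 1/sqrt(d_i d_j) * W * h_j^(l) ),
    # with self-loops added so a node also keeps its own features.
    A_hat = A + np.eye(A.shape[0])            # adjacency with self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Nodes: 0 = user U1, 1 = merchant M1, 2 = merchant M2, 3 = cumulative-limit rule.
A = np.array([[0, 1, 1, 1],                   # U1 transacts with M1, M2; linked to rule
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
H = np.eye(4)                                 # one-hot initial node features
W = np.full((4, 4), 0.25)                     # toy weight matrix (would be learned)
H1 = gcn_layer(H, A, W)                       # middle-layer entity embeddings h_i^mid
```

Stacking several such layers lets each entity's embedding absorb information from increasingly distant neighbors.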
The high layer processes global information and is mainly used for dividing the whole graph structure into a plurality of communities through a community detection algorithm, so that potential compliance problems can be identified at the global level. At the high layer, it may be identified whether there are associated transactions between multiple communities. The calculation process is as follows:
The formula of the GCN is extended as:

h_i^(l+1) = σ( Σ_{j ∈ N(v_i)} 1/√(d_i · d_j) · W^(l) · h_j^(l) + μ · W_c^(l) · h_c^(l) )

where h_i^(l+1) is the embedded vector (i.e., feature vector) of node v_i at layer l+1, merging the neighbor information and community information; σ is the activation function, for example ReLU or Sigmoid, whose role is to introduce nonlinearity so that the network embedding model has stronger expression capability; 1/√(d_i · d_j) is the normalization factor for node degrees, where d_i and d_j are the degrees (i.e., the number of connected edges) of node v_i and node v_j, respectively; W^(l) is the high-level weight matrix used for linear transformation of neighbor information at layer l, learning the influence of neighbor node features on the current node; h_j^(l) is the embedded representation of neighbor node v_j at layer l, providing the feature information of node v_j for updating the features of node v_i; μ is the influence weight coefficient of the global level, with μ ≥ 0, which adjusts the degree of influence of community information and neighbor information on the node update; W_c^(l) is the weight matrix of the community term at layer l; and h_c^(l) is the embedded representation of community c at layer l, typically an aggregation of the node embeddings within the community, providing community-level feature information.
First part (neighbor information aggregation):

Σ_{j ∈ N(v_i)} 1/√(d_i · d_j) · W^(l) · h_j^(l)

This term collects information from the direct neighbor nodes of node v_i, and performs transformation through the weight matrix W^(l) together with degree normalization, obtaining the aggregation of the neighbor information connected by the edges of the third node.
Second part (community information aggregation):

μ · W_c^(l) · h_c^(l)

This term collects information from the community to which node v_i belongs, transforms it through the weight matrix W_c^(l), and multiplies it by the weight coefficient μ, obtaining the aggregation of the community information of the third node.
Then, the neighbor information and the community information are summed with these weights, and through the activation function σ the embedded representation h_i^(l+1) of node v_i at layer l+1 can be obtained, i.e., the first embedded vector representation of the retrieval information related to the target keyword in the high layer, denoted h_i^high.
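The high-layer update with the community term can be sketched as follows, again with toy constant weights and one-hot features. The community assignment and the community embedding (here, the mean of member embeddings) are illustrative assumptions; in the application the communities come from a community detection algorithm.

```python
import numpy as np

def gcn_with_community(H, A, W, W_c, communities, mu=0.5):
    # High-level update: neighbor aggregation plus a community term,
    # h_i^(l+1) = relu( sum_j 1/sqrt(d_i d_j) * W * h_j + mu * W_c * h_c ),
    # where h_c is the mean embedding of node i's community.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    neigh = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W   # neighbor aggregation part
    comm = np.zeros_like(neigh)
    for c in set(communities):
        members = [i for i, ci in enumerate(communities) if ci == c]
        h_c = H[members].mean(axis=0) @ W_c           # transformed community embedding
        for i in members:
            comm[i] = mu * h_c                        # community aggregation part
    return np.maximum(neigh + comm, 0.0)              # ReLU activation

# Two communities {0,1} and {2,3} with a single cross-community edge 1-2.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)
W = np.full((4, 4), 0.25)
W_c = np.full((4, 4), 0.25)
H1 = gcn_with_community(H, A, W, W_c, communities=[0, 0, 1, 1])
```

Setting mu to 0 recovers the plain middle-layer update, which makes the contribution of the community term easy to inspect.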
In this step, a three-layer sub-graph structure of a low layer (basic transaction records), a middle layer (merchants and other related entities) and a high layer (global community structure) is generated through GraphRAG, and the embedded representation of the retrieval information is produced in combination with the GCN, so that the compliance of transactions can be evaluated from different layers during examination. In addition, whereas the related technology finds it difficult to dynamically integrate complex transaction and merchant information in a large-scale payment network, the application can adopt the GraphRAG multi-level graph structure to realize combined global and local analysis of structured and unstructured data, and the information at different abstraction levels can cooperate during examination through the low-layer, middle-layer and high-layer structure.
In addition, the related art often has the problems of low efficiency and easy loss of details when information is transmitted at different levels. The application adopts a routing algorithm such as a routing algorithm based on semantic density, and enables the information of a lower layer, a middle layer and a higher layer to be dynamically transferred in a graph structure according to different importance by a three-layer routing mechanism, thereby ensuring high-efficiency data transfer and processing between different abstraction levels, reducing information loss and ensuring the comprehensiveness and accuracy of compliance examination.
Also, in the related art, compliance review often relies on a single or small number of retrieval mechanisms, making it difficult to capture multidimensional information. According to the application, the vector retrieval of the target keyword is carried out by integrating the pre-training language models such as BERT, GPT and the like, and the vector retrieval of the target keyword, the retrieval of the graph database and the three-layer iterative retrieval of the graph database are combined, namely, complex multidimensional data can be subjected to multi-round analysis by combining a plurality of retrieval and iterative retrieval mechanisms, so that the depth and accuracy of compliance examination are improved. Furthermore, the retrieval strategy can be dynamically adjusted by combining a plurality of semantic retrieval mechanisms, and retrieval results are gradually optimized in the examination process.
In step 104, the generation enhancement process may generate and enhance data, providing a means for converting and enhancing the retrieved graph data during the generation phase.
Compliance censoring results may be presented in the form of compliance reports, which may include compliance judgments, detailed descriptions of non-compliance items, regulatory compliance, and regulatory compliance advice, among others.
In some embodiments, the plurality of first embedded vector representations may be combined to obtain an aggregate vector, and the generating enhancement processing based on the aggregate vector.
In some embodiments, the aggregate vector further comprises at least one of:
The embedded vector corresponding to the candidate data set matched with the target keyword is obtained by carrying out vector retrieval on the vector of the target keyword based on a pre-constructed vector library, and the vector library comprises vectors of payment related information;
the embedded vector corresponding to the query sub-problem of the information to be checked is obtained by utilizing a large model to conduct query decomposition on the information to be checked based on the target keyword.
Thus, more information can be gathered to carry out compliance examination, and accuracy of compliance examination is improved.
In some embodiments, the generation enhancement process may include only one stage, i.e., performing generation enhancement on the aggregate vector to obtain the compliance examination result.
In some embodiments, the step 104 specifically includes:
Vector aggregation is carried out based on the first embedded vector representation of the retrieval information related to the target keyword in each structural layer, so that an aggregation vector is obtained;
Based on the aggregate vector, adopting a recursion structure to generate an inference result of each inference step, and based on the aggregate vector and the generation probability determined by the target inference result, screening candidate examination results of the information to be examined by using a limited decoding strategy, wherein the target inference result comprises an inference result of the current inference step and an inference result obtained before the current inference step;
And carrying out compliance scoring on the candidate examination results based on the global context information characterized by the aggregate vector, and determining the compliance examination result of the information to be examined based on the scores.
The generation enhancement process may include three stages, as shown in fig. 3, namely the following:
Pre-data generation enhancement, i.e., pre-generation enhancement, focuses primarily on optimizing the input data or its representation before passing it to the generator, so as to improve the quality of generation. The graph data retrieved in step 103 may be enhanced by enriching its semantic content so as to integrate it more closely with the query. In order to better handle the audit-related sub-questions, the query sub-questions of the information to be examined are introduced, for example "1. What is the user's total transaction amount?". By introducing the query sub-questions into the graph data optimization process, the graph data can be rewritten by using the large language model and converted into more natural and fluent language. In this way, not only are the naturalness and readability of the graph data improved, but the semantic content of the graph data is also enriched, and the output format can include natural language descriptions, code-like forms, syntax trees, node sequences and the like.
In processing the graph data, the complex graph data can be converted into text or code form by the following methods, so that the large language model can understand and process the data; these methods can include:
a) Adjacency and Edge tables (Adjacency/Edge tables) are common methods of describing graph structures. The adjacency list lists the direct neighbors of each node, which is suitable for representing the sparse graph, and the edge list lists all edges in the graph, which provides a simple and clear linear representation. These tables can linearize the triples in the graph and input them into a large language model for further processing.
B) Natural Language description (Natural Language).
C) Code-Like Forms: because natural language or other one-dimensional sequences are not sufficient to directly represent two-dimensional graph structures, some research has explored the use of code-like formats such as the XML-based graph description language GraphML or the Graph Modelling Language (GML) to represent graph structures. The code format can describe the nodes, edges and their relationships in the graph in a standardized way, and is suited to the code understanding capability of a large language model.
D) Syntax Tree (syncax Tree) converting a graph into a Syntax Tree-like structure may preserve more structural information than directly expanding the graph data. The syntax tree has hierarchical topological order, and can better analyze and understand the inherent attribute of the graph structure. This transformation not only preserves the dynamics of the relationships between the elements in the graph, but also facilitates more complex graph algorithm processing.
E) Node Sequence (Node Sequence) some studies represent graphs as Node sequences. These sequences are generated by predefined rules, which are more compact than natural language descriptions, while preserving the structural information of the graph. The node sequence performs well in certain tasks because it combines structural knowledge and a priori information.
F) Graph Embeddings: although converting graph data into text sequences through a graph language allows large language models to handle graph structures, it tends to result in lengthy contexts and increased computational costs. Graph embedding is another option: the graph can be converted into an embedded representation, thereby avoiding the problem of long text input.
Accordingly, as shown in FIG. 3, the Pre-data generation enhancement may convert the merged graph data into an adjacency list/edge list, a natural language description, a graph code description, and the like. This may support a variety of output formats, such as adjacency lists, natural language descriptions, syntax trees, node sequences, graph embeddings, etc., for generating compliance review reports. These output formats can generate compliance reports according to different requirements and further perform deep analysis. Therefore, the flexibility and readability of compliance report output can be improved, multiple examination scenes are supported, and coverage is achieved from fine-grained transaction examination to global network analysis, so that diversified data generation and enhancement are realized.
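Two of these linearizations, the edge list (A) and the natural language description (B), can be sketched directly from extracted triples. The triples below are illustrative, modeled on the examples used elsewhere in this description.

```python
# Triples extracted from the payment data: (entity 1, relation, entity 2).
triples = [
    ("user1", "pays 100 yuan to", "merchant A"),
    ("user1", "pays 200 yuan to", "merchant B"),
    ("user1", "cumulative limit less than 100 yuan", "limit rule 1"),
]

def to_edge_list(triples):
    # Edge-list linearization: one "head -[relation]-> tail" line per triple,
    # a simple linear representation an LLM can consume.
    return "\n".join(f"{h} -[{r}]-> {t}" for h, r, t in triples)

def to_natural_language(triples):
    # Plain-sentence rendering of the same triples.
    return " ".join(f"{h} {r} {t}." for h, r, t in triples)

edge_list = to_edge_list(triples)
nl = to_natural_language(triples)
```

The other formats (GraphML, syntax trees, node sequences, embeddings) follow the same pattern: a deterministic serialization of the graph chosen to match the generator's strengths.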
Mid-data generation enhancement, i.e., in-generation enhancement, is a technique that adjusts intermediate results or context cues during generation, typically by constraining the generation strategy to reduce generation errors. The generated output space is controlled by introducing constrained decoding, so that the generated content is guaranteed to exist in the graph structure and logic errors are avoided. Multi-step reasoning can be performed by adjusting the prompt of the large language model, so that not only an answer but also a reasoning process is generated. Such enhancement can ensure that the intermediate steps of generation are logical, reducing generation problems caused by erroneous reasoning. As shown in FIG. 3, Mid-data generation enhancement may generate a preliminary compliance determination, such as "the user's total transaction amount this year is 200 yuan, which has exceeded the 100-yuan limit of rule 1; not compliant".
Post-data generation enhancement, i.e., post-generation enhancement, is performed after the initial generation of the response, typically involving the integration of multiple generated responses to obtain a final high quality answer. These responses are scored and the higher scoring responses are integrated into the final prompt to generate the final answer. In addition, the responses generated by different models are combined or selected, and a better generation effect is achieved through cooperation of the different models. As shown in FIG. 3, post-data generation enhancements may generate final compliance judgments, such as compliance censoring results of non-compliance, compliance censoring results may also include judgment bases, trigger rules, improvement suggestions, and the like.
The following is a calculation procedure for generating enhancement processing.
Pre-data generation enhancement: input optimization and format conversion. The main objective of the Pre stage is to integrate the output data of the low layer (relationships between statement blocks), the middle layer (relationship information of entity nodes) and the high layer (community information and relationships), the vector information retrieved from the vector library, and the query sub-questions, so as to ensure that the generator can efficiently process them and generate the required result. At the same time, the data is converted into different structural formats for easy understanding by the generator. The characteristics of the embedded representation of each level are fully utilized: the embedded representations are spliced together, and the final embedded representation, i.e., the aggregate vector, is obtained through a linear mapping and an activation function, as expressed by the following formula:

h_i^final = σ( W · [ h_i^low ; h_i^mid ; h_i^high ; V(e) ; q ] + b )

where h_i^low is the embedded representation of node v_i at the low layer, reflecting the relationships between statement blocks; h_i^mid is the embedded representation of node v_i at the middle layer, reflecting the relationship information of the entity nodes; h_i^high is the embedded representation of node v_i at the high layer, reflecting community information and global relationships; V(e) is the embedded representation related to the candidate data set retrieved from the vector library for the target keyword; and q is the embedded representation of a query sub-question, e.g., "What is the user's total transaction amount?".
The five embedded vectors are spliced (denoted [·;·]) to form a longer vector; W is a weight matrix used for the linear mapping, converting the spliced vector into the required dimension or space; b is a bias vector; and σ is an activation function, e.g., ReLU, Sigmoid or tanh.
The final embedded representation of node v i is the aggregate vector. With the final embedded representation of the node, the data can be converted into the following three formats, respectively:
adjacency list/side list-adjacency relationship of users, merchants and their legal representatives, transaction records.
Natural language description-transaction behavioral description is generated with respect to merchant a, such as "merchant a has associated transactions with multiple unrelated legal representatives".
Class code form-GraphML for standardized data exchange.
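The splice-project-activate aggregation of the Pre stage can be sketched numerically. All five input vectors, the weight matrix and the dimensions are random toy values (in the application they would come from the retrieval layers and a trained projection).

```python
import numpy as np

rng = np.random.default_rng(0)

# The five per-node representations to be fused (toy 4-dimensional vectors):
h_low  = rng.normal(size=4)   # statement-block level
h_mid  = rng.normal(size=4)   # entity-relation level
h_high = rng.normal(size=4)   # community level
v_e    = rng.normal(size=4)   # vector-library retrieval result V(e)
q      = rng.normal(size=4)   # query sub-question embedding

z = np.concatenate([h_low, h_mid, h_high, v_e, q])   # splice into one long vector

W = rng.normal(size=(8, 20)) * 0.1                   # projection to the target dimension
b = np.zeros(8)
h_final = np.tanh(W @ z + b)                         # aggregate vector (tanh as sigma)
```

The resulting `h_final` is the aggregate vector handed to the generator and reused in the Mid and Post stages.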
Mid-data generation enhancement, policy optimization in the generation process. The Mid stage is to adjust the intermediate result and context clues in the generation process, reduce the generation errors through specific generator strategies (such as limited decoding and multi-step reasoning), and ensure the accuracy of the generated answers.
In this stage, the multi-step reasoning formula used is:

y_t = G( y_(t-1), h_i^final )

where y_t is the generation result of step t and y_(t-1) is the inference result generated in the previous step. With this recursive structure, the inference result generated at each step depends on the aggregate vector h_i^final produced in the Pre stage, thereby ensuring the consistency of the generated answers; the answers generated in the reasoning process include the user's multi-step reasoning results.
To reduce errors in generation, the Mid stage may use constrained decoding strategies, such as Top-k/Top-p sampling, which limit the search space of decoding and ensure that the generated content complies with the logical relationships. The constrained decoding formula used at this stage is:

y_t = argmax_{y ∈ V} P( y | y_(<t), h_i^final )

where V is the set of judgment results, such as compliant, non-compliant, suspicious, superscalar transaction, normal transaction, etc., and P( y | y_(<t), h_i^final ) is the generation probability calculated based on the aggregate vector and the target reasoning results, i.e., the generation probability calculated from the embedded representation and the context. Decoding is constrained accordingly, and the best candidate answer is screened according to this probability, i.e., the candidate examination results of the information to be examined are screened.
In each step of reasoning, the generator gradually selects the most suitable description result according to the input data. If the user's transaction exceeds the daily limit, the generator may select "superscalar transaction" as the output of the current step.
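The constrained-decoding step can be sketched as masking the generator's raw scores so that only tokens in the admissible judgment set V survive, then taking the argmax. The logits are made-up values standing in for a real model's output.

```python
import numpy as np

# Admissible judgment vocabulary V for the constrained decoder.
V = ["compliant", "non-compliant", "suspicious",
     "superscalar transaction", "normal transaction"]

def constrained_decode(logits, vocab, allowed):
    # Mask out any token not in the allowed set (-inf logit), renormalize
    # with a softmax, and take the argmax over what remains.
    mask = np.array([0.0 if w in allowed else -np.inf for w in vocab])
    masked = logits + mask
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return vocab[int(np.argmax(probs))]

vocab = V + ["irrelevant token"]
logits = np.array([0.1, 0.4, 0.2, 1.5, 0.3, 9.0])   # toy generator scores
choice = constrained_decode(logits, vocab, set(V))
```

Even though "irrelevant token" has the highest raw score, the mask forces the decoder to choose inside V, which is the whole point of the constraint.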
Post-data generation enhancement: final output of the compliance judgment. The Post stage integrates the various outputs of the generation process, makes the final compliance determination, and gives a high-quality response. By scoring the generated results, the generated answers are ensured to have high accuracy and logical soundness. This stage also further verifies and optimizes the final generated compliance report. The scoring formula is:

S = α · y_r + β · h_i^final

where S is the compliance score, which represents the overall compliance performance and is obtained by weighting the reasoning features of the Mid stage and the embedded information of the Pre stage; α and β are adjustment parameters, respectively controlling the proportion of the contribution of the Mid-stage inference result y_r and the Pre-stage embedded representation h_i^final in the final score; and the Pre-stage embedded representation h_i^final provides global context information reflecting the overall transaction or compliance situation.
The final output integration formula is O = argmax{ S }: the compliance examination result with the highest score is output and displayed in the form of a compliance report.
In Post phase, a final assessment of the user's compliance report is required. Through a scoring mechanism, the quality of each generated result is evaluated, and the best generated result is selected as a final output based on the scoring level. These outputs will be integrated into a compliance review report, which may include:
Compliance judgment: the user's transactions in January 2023 are recorded as exceeding the maximum transaction limit, and are therefore non-compliant.
Compliance scoring: based on the scoring formula, the user's transactions are considered non-compliant and further investigation is needed.
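The Post-stage scoring and argmax selection can be sketched as a weighted blend of the Mid-stage inference signal and a scalar summary of the Pre-stage aggregate vector. The weights, candidate answers and per-candidate inference scores are illustrative assumptions.

```python
import numpy as np

def compliance_score(y_r, h_final, alpha=0.7, beta=0.3):
    # S = alpha * y_r + beta * mean(h_final): blend the Mid-stage inference
    # signal with a scalar summary of the Pre-stage global-context embedding.
    return alpha * y_r + beta * float(np.mean(h_final))

h_final = np.array([0.2, 0.4, 0.1, 0.3])             # Pre-stage aggregate vector (toy)
candidates = {"non-compliant": 0.9,                   # per-candidate inference scores y_r
              "compliant": 0.2,
              "suspicious": 0.5}
scores = {ans: compliance_score(y_r, h_final) for ans, y_r in candidates.items()}
best = max(scores, key=scores.get)                    # O = argmax S
```

Only `best` reaches the compliance report; the losing candidates and their scores can still be logged as the judgment basis.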
According to the application, by designing multiple data generation and enhancement modes, the efficient conversion of the graph data is ensured and the accuracy and consistency of the generated data are enhanced. In addition, whereas compliance report generation for complex graph structures in the related art is mainly limited to single structured data output, the application can support multiple data output formats, including natural language descriptions, syntax trees, adjacency lists, code formats and the like, by generating multiple data formats through the large language model in the Pre stage.
The application can achieve the following technical effects:
1. The automation and real-time performance are improved, namely, based on a large language model and GraphRAG methods, different data formats can be automatically analyzed, processed and analyzed, and a targeted compliance report can be generated.
2. And the conversion and processing of diversified data ensures the efficient conversion of the graph data and enhances the accuracy and consistency of the generated data by designing various data generation and enhancement methods.
3. Enhanced data generation and processing: data (such as legal texts and merchant agreements) can be processed efficiently, and by combining the large model with the pre, mid and post three-stage generation enhancement mechanism, complex text information can be associated with transaction data, so that more accurate compliance judgments can be generated.
4. And processing compliance analysis of multi-level information, namely realizing dynamic transmission and comprehensive judgment of information flow through a three-layer structure generation and routing weight mechanism.
5. And the flexible rule expansibility is that the inspection flow can be dynamically adjusted according to the new rules or regulations by combining graph retrieval and large model generation, so that the accuracy and adaptability of compliance judgment are ensured. Therefore, the method is more flexible in rule expansion and can quickly adapt to new compliance requirements.
The application can provide a comprehensive and efficient intelligent compliance examination solution by combining the multi-level diagram structure, the diagram retrieval enhancement technology and the large language model with the compliance examination flow of payment transaction data based on GraphRAG.
FIG. 4 illustrates a complete intelligent compliance audit structure framework, which relies primarily on the core capabilities of the large model, such as keyword extraction, query decomposition, query enhancement, and information consolidation and clipping, accessing the large model in different ways throughout the process. For example, in single retrieval, one query is performed on a single keyword and preliminary audit information is returned; in multiple retrieval, complex multidimensional data is analyzed repeatedly through multiple queries, particularly during query enhancement and global retrieval; and in iterative retrieval, for data requiring deep analysis, the results are continuously optimized through repeated retrievals until the answer that best complies with the compliance checking rules is found, with global and local retrieval and audit tasks executed in combination with the graph database of the three-layer GraphRAG structure.
Furthermore, through data generation and enhancement in three stages, efficient conversion of graph data can be ensured, diversified data output can be realized, and accuracy and consistency of generated data are enhanced.
Before the step 101, the method includes:
text blocking is carried out on the payment related information to obtain text blocks in the payment related information;
extracting information based on the text block, and de-duplicating the extracted entity to construct a first graph structure based on the relationship between the entity after de-duplication and the entity, wherein the first graph structure represents the relationship between the entities in the text block;
Carrying out community division on the entities based on the first graph structure to construct a second graph structure, wherein the second graph structure represents the relationship among communities of the entities;
and constructing a graph database of the three-layer structure based on the text block, the first graph structure and the second graph structure.
The overall flow of fig. 5 illustrates how key information is dynamically extracted from complex structured and unstructured data and multi-level GraphRAG is generated to provide intelligent support for payment compliance review. Through the core capabilities of a series of large language models, multiple data types can be efficiently processed, entities and relationships identified, and operational compliance reports generated. The method fully utilizes various capabilities of a large language model, including the capabilities of text blocking, entity relation extraction, information summarization generation, same semantic analysis, abstract generation and the like, and comprises the following key steps:
Data input and blocking: multiple types of input data can be accepted, and the text understanding capability of a large language model is used to block the payment related information, including structured data (such as merchant information and transaction records) and unstructured data (such as management methods, contract texts, and regulation clauses). Structured data may first be converted into sentence form and then processed like unstructured data; the information aggregation capability of the large model is also applied here, for example, the record {transaction date: 2024-1-1, amount: 100, merchant: A, user: user1} is converted into a sentence such as "On 1 January 2024, user1 paid merchant A an amount of 100 yuan." To better process sentence-level data, a dynamic adaptive blocking method is adopted. In this way, the data is broken down into manageable text chunks, facilitating subsequent entity identification and relationship extraction, see level 1 in FIG. 6.
Entity extraction and relation extraction: the segmented text is processed by a large language model (such as BERT) to identify the key entities in the text and the relationships between them. These relationships are structured as triples (entity 1, relationship, entity 2); for example, the triple (user1, transfer 100, merchant A) can be extracted from the description "user1 transfers 100 to merchant A". Through this process, if the management method also contains a statement such as "the cumulative transaction amount across all of the user's accounts shall be less than 100", a corresponding relation linking the user to the cumulative-limit rule (together with the statement describing that rule) is added, so that the original statement text is converted into structured information for use in the subsequent generation of the graph structure, see level 1 of FIG. 6.
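As a toy illustration of the triple structure described above — the patent uses a large language model (e.g. BERT) for extraction, whereas this sketch uses a simple regular expression purely to show the (entity 1, relationship, entity 2) output format:

```python
import re

# Toy pattern standing in for the model-based extractor; a real system would
# use an LLM, not regular expressions.
TRANSFER = re.compile(r"(\w+) transfers? (\d+) (?:yuan )?to (\w+)")

def extract_triples(text):
    """Return (entity1, relationship, entity2) triples found in the text."""
    return [(m.group(1), "transfer " + m.group(2), m.group(3))
            for m in TRANSFER.finditer(text)]
```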
Entity summarization and deduplication: after entities and relationships are extracted, deduplication may be performed on identical entities extracted from multiple sources (e.g., "subject" and "legal representative" referring to the same person). Similarity calculation and context analysis ensure that only unique entities are retained in the graph structure, avoiding redundant data. This process not only ensures data consistency but also reduces noise in the graph structure, improving its accuracy, see level 2 of FIG. 6.
Graph structure generation and community detection: a graph structure is generated based on the extracted entities and relationships, where nodes represent entities and edges represent the relationships between them. The graph structure may be further partitioned into communities using a community detection algorithm, such as the improved Leiden algorithm, where each community represents a closely related group of entities, see level 3 of FIG. 6. With this hierarchical structure, information can be better organized and retrieved.
In some embodiments, static blocking methods based on preset rules, or fixed blocking using a rules engine, may be used.
In some embodiments, the blocks may be dynamically resized according to semantic density and partitioned recursively. The text blocking performed on the payment related information to obtain text blocks in the payment related information comprises the following steps:
Calculating the semantic density of each sentence in the text of the payment related information;
based on the text structure of the payment related information, performing first block processing on the payment related information to obtain candidate text blocks;
and performing second block division processing on the candidate text blocks based on the semantic density of sentences in the candidate text blocks to obtain text blocks in the payment related information.
This embodiment can effectively optimize the blocking of unstructured text by combining semantic density (Semantic Density) with recursive blocking (Recursive Splitting). The semantic density measures the key information content of each sentence in the text, and the recursive partitioning ensures that the blocking dynamically adapts to the structures of different types of text and that related semantic segments are kept together. This combination ensures that the text retains semantic integrity while reducing computational load, achieving more efficient blocking.
Semantic density analysis: the semantic density of each sentence in the text can be calculated by pre-trained language models (e.g., BERT, GPT). A sentence with high semantic density contains more key information, such as merchant names and transaction amounts.
Recursive blocking logic: the recursive blocking method splits the text layer by layer according to preset characters, tokens, or semantic boundaries (e.g., sentences, paragraphs). Combined with semantic density, the recursion at each layer can dynamically adjust the block size according to the semantic density.
The partitioning strategy is to preferentially subdivide text paragraphs or sentences with high semantic density, so as to ensure that each key entity and the context thereof are reserved more finely. For parts with low semantic density, the parts can be combined into larger blocks, so that unnecessary waste of computing resources is reduced.
As shown in FIG. 6, the text sentences in the payment related information may be iterated over, their semantic densities calculated, and the text recursively partitioned according to those densities to obtain the text blocks of level 1.
The algorithm flow is as follows:
Semantic density calculation: for a given unstructured text T = {s1, s2, ..., sn}, the semantic density SD(si) of each sentence is calculated.
Recursive blocking initialization: the recursive blocking starts from the structure of the text (such as paragraphs and sentences), i.e., the first blocking processing is performed on the payment related information, dividing the text into larger blocks that are then gradually refined.
Adjustment of the block size: if the semantic density of a certain block satisfies SD(si) > τ1 (above the threshold), the block will be further recursively subdivided; if SD(si) ≤ τ1 (below or equal to the threshold), the block will be merged with adjacent blocks, reducing the number of recursions.
Recursive partitioning combined with semantic density can be described by the following formula:
CHi = RecursiveSplit(si), if SD(si) > τ1; CHi = Merge(si, adjacent blocks), if SD(si) ≤ τ1
where CHi represents the blocking result of the i-th block, SD(si) represents the semantic density of the i-th sentence, τ1 is the semantic density threshold, RecursiveSplit is the recursive blocking function that further refines the block size, and Merge is the operation of merging adjacent blocks, used for handling low-density blocks and thereby reducing the processing complexity.
Examples:
step 1, semantic Density computation
Sentence 1: "User A traded 500,000 yuan with merchant B on 1 January 2023." Semantic density SD(s1) = 0.85; the density is high, as the sentence includes key entities such as the user, the merchant name, and the amount.
Sentence 2: "The transaction does not exceed the single-transaction limit, but no transaction voucher was saved." Semantic density SD(s2) = 0.65, which is lower, as the sentence only describes supplementary information.
Step 2, recursion blocking
For sentence 1, since SD(s1) = 0.85 > τ1, the recursive blocking will further subdivide this sentence, generating multiple small chunks to preserve key information, e.g., text block 1: "User A traded 500,000 yuan with merchant B on 1 January 2023." For sentence 2, since SD(s2) = 0.65 ≤ τ1, the sentence can be kept as a larger block and not subdivided, reducing unnecessary processing costs.
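The threshold rule in this example can be sketched as follows. This is a simplified illustration, assuming `density` is a hypothetical scoring callable (in the patent, a pre-trained language model computes SD); for brevity, high-density sentences are kept as their own fine-grained chunks rather than subdivided further.

```python
def recursive_chunk(sentences, density, tau=0.7):
    """Keep high-density sentences (SD > tau) as fine-grained chunks;
    merge runs of low-density sentences (SD <= tau) into larger chunks."""
    chunks, buffer = [], []
    for s in sentences:
        if density(s) > tau:
            if buffer:                          # flush any merged low-density run
                chunks.append(" ".join(buffer))
                buffer = []
            chunks.append(s)                    # high-density: its own chunk
        else:
            buffer.append(s)                    # low-density: merge with neighbours
    if buffer:
        chunks.append(" ".join(buffer))
    return chunks
```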
In the related art, it is difficult to extract key entities and relationships from unstructured text accurately and efficiently. By combining semantic density with recursive partitioning, computing the semantic density with large language models such as BERT, and dynamically adjusting the block size during recursion, high-density information is processed finely and low-density information is merged reasonably. The blocking strategy can thus be adjusted dynamically according to the semantic importance of the text, reducing processing complexity while preserving the integrity of the text information and the processing efficiency, so that the key entities and relationships of the text can be extracted accurately and efficiently.
The technique dynamically optimizes the partitioning of unstructured text through semantic density and a recursive partitioning algorithm. This chunking approach, combined with the semantic information of the text, can process different types of text more efficiently, improving the efficiency and accuracy of processing unstructured data, ensuring that key information in the compliance inspection process is captured completely, and reducing the risk of information loss.
In some embodiments, entity deduplication may be performed using simple string matching techniques or specific rules that do not involve contextual analysis.
In some embodiments, the deduplicating the extracted entity comprises:
calculating a first similarity between every two entities in the text block;
Under the condition that the first similarity is larger than a first preset threshold, creating a target entity, wherein the target entity is an entity obtained by combining two similar entities;
Calculating second similarity between the target entity and other uncombined entities in the text block;
and performing de-duplication of the target entity and other uncombined entities in the text block based on the second similarity.
Entity deduplication is a key step in the generation of graph structures to ensure that each entity in the graph structure has a unique, accurate representation, avoid redundant data, and merge identical entities of different origin. By using dynamic similarity threshold, semantic density enhanced context analysis, recursively enhanced entity merging and processing methods supporting multi-source heterogeneous data, the accuracy and efficiency of entity deduplication can be improved.
Suppose there are two text blocks:
Text block 1: "Business license number 123456789 belongs to merchant A, and merchant B has the same legal representative; the identity card number is XXX, and the address is XX district, Shanghai." Entities: legal representative, merchant.
Text block 2: "The contract principal, Party A, is Li XX; the identity card number is XXX; the contract transaction limit is 1,000,000 yuan." Entities: subject, merchant.
In some embodiments, the computing a first similarity between each two entities in the text block includes at least one of:
Calculating the weight of each character in the two entities and calculating the similarity between the same characters in the two entities, and weighting the similarity of the same characters based on the weight of each character and the similarity between the same characters to obtain a first similarity between the two entities;
generating a second embedded vector representation based on the text blocks corresponding to each of the two entities, and calculating a first similarity between the two entities based on the second embedded vector representation.
To achieve deduplication of entities in a text block, two different types of similarity need to be distinguished:
1. Direct entity name matching: surface variants of the same name, such as "商户" and "商_户" (both written forms of "merchant"), can be matched directly through string similarity.
2. Implicit semantic similarity, such as "legal representatives" and "subjects", cannot be matched by simple string similarity, and it is necessary to confirm whether the two entities are identical by more advanced semantic analysis (e.g., embedded vectors, contextual similarity, etc.).
The direct entity name matching can adopt a weighted Jaccard similarity calculation. The weighted Jaccard similarity extends the traditional Jaccard similarity by weighting with the semantic similarity of characters; the weighted formula is:
J(ei, ej) = Σx∈S(ei)∩S(ej) w(x) × Sim(x, x) / Σx∈S(ei)∪S(ej) w(x)
where S(ei) and S(ej) are the character sets or word sets of entities ei and ej; w(x) is the weight of the character or word x, which may be determined based on semantic density or contextual importance; and Sim(x, y) is the similarity of characters x and y, which can be computed as the cosine similarity of their semantic embedding vectors, with x and y ranging over the character sets of entities ei and ej.
The two entities to be compared are entity 1: "商_户" and entity 2: "商户" (both meaning "merchant"). The character sets are S(商_户) = {商, _, 户} and S(商户) = {商, 户}.
Sim(x, y) = 1 when the characters 商 and 户 appear in both entities and match exactly; otherwise 0.
The weights given for each character are w(商) = 0.9, w(户) = 0.8, w(_) = 0.2.
Substituting into the formula, the first similarity is (0.9 + 0.8) / (0.9 + 0.8 + 0.2) = 1.7 / 1.9 ≈ 0.89, which indicates that the two entities have a high similarity at the character level.
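The weighted Jaccard calculation above can be sketched as follows, assuming exact character matches (Sim(x, x) = 1) so the numerator reduces to the weights of the intersection. The transliterated tokens "shang", "hu", and "_" stand in for the characters 商, 户, and _ of the example.

```python
def weighted_jaccard(chars_a, chars_b, w, default_w=0.5):
    """Weighted Jaccard over character sets; with Sim(x, x) = 1 for exact
    matches, the score is sum of intersection weights over union weights."""
    inter = chars_a & chars_b
    union = chars_a | chars_b
    num = sum(w.get(c, default_w) for c in inter)
    den = sum(w.get(c, default_w) for c in union)
    return num / den if den else 0.0
```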
Implicit semantic similarity can be calculated based on entities and text blocks.
For entities that cannot be matched by a simple string (e.g., "legal representatives" and "contract principals"), the semantic similarity of two entities can be calculated in conjunction with the semantic embedding of the entities, i.e., the corresponding text blocks. For two entities e 1 and e 2, they can be converted into vectors v (e 1) and v (e 2) by a pre-trained model such as BERT, and then combined with their text blocks (e.g., text block 1 and text block 2) to generate an overall semantic embedding.
The calculation formula of the semantic similarity is:
Sim(e1, e2) = CosSim(v'(e1), v'(e2))
where v'(e) is the overall semantic embedding obtained by combining the entity vector v(e) with the embedded vector v(text block) of its text block, and v(text block) is generated by combining all word embeddings (such as "contract", "legal representative") in the text block. The overall similarity thus yields the semantic similarity between the two entities more accurately.
Suppose two entities:
Entity 1: "legal representative": "Li XX"
Entity 2: "subject": "Li XX"
These entities and their text blocks are represented as vectors by BERT embeddings: v(legal representative) = [0.8, 0.6, 0.9, ...], v(subject) = [0.78, 0.62, 0.88, ...]. Calculating the semantic similarity of the two vectors, i.e., the first similarity, gives CosSim(legal representative, subject) = 0.98, indicating that the two entities have a high similarity.
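Cosine similarity over such embedding vectors can be computed with a short stdlib-only sketch (the three-component vectors below truncate the example's embeddings, so the resulting value is illustrative rather than exactly 0.98):

```python
import math

def cos_sim(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```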
For the entity pairs meeting the similarity condition, i.e., entity pairs whose first similarity is greater than the first preset threshold, the entities may be recursively merged; after each merge, the similarity between the merged entity and the other entities, i.e., the second similarity, is recalculated. The merged new entity keeps all the attributes of the original entities and updates their relationships in the graph structure. The formula of the recursive merging is: e_merged = Merge(ei, ej), if Sim(ei, ej) ≥ τ2.
where e_merged is the merged new entity, i.e., the target entity; ei and ej are the two entities to be merged; Sim(ei, ej) is the first similarity between entities ei and ej; and τ2 is the first preset threshold: when Sim(ei, ej) ≥ τ2, the merge operation is performed.
The specific calculation process is as follows:
1. Initial similarity calculation: the first similarity Sim(ei, ej) is calculated for all possible entity pairs (ei, ej). For directly name-matched entities, the weighted Jaccard similarity is used; for semantically similar entities, the semantic similarity based on embedded vectors is used. In practice, entity pairs are first matched by Jaccard similarity, and if no match is found, they are matched by semantic similarity.
2. Threshold judgment: a similarity threshold τ2 (e.g., 0.8 or 80%) is set, and Sim(ei, ej) is compared with τ2 to decide whether to merge.
3. Entity merging: for each entity pair satisfying Sim(ei, ej) ≥ τ2, a merge operation is performed, creating the target entity e_merged and integrating all the attributes and relationships of ei and ej.
4. Updating the entity set: the target entity e_merged is added to the entity set, and the merged original entities ei and ej are removed.
5. Recursion: the second similarity between the target entity and the other unmerged entities is calculated, and steps 2 to 5 are repeated until no entity pair meets the merging condition.
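The recursive merging loop above can be sketched as follows. This is a simplified illustration: entities are modeled as attribute sets so that Merge is a set union, and `sim` is any pairwise similarity callable (e.g., plain Jaccard below); the patent's version also updates relationships in the graph structure, which is omitted here.

```python
def dedup_entities(entities, sim, tau=0.8):
    """Greedy recursive merging: merge any pair with similarity >= tau into a
    target entity (union of attributes), then re-compare against the rest."""
    pool = [set(e) for e in entities]
    merged = True
    while merged:
        merged = False
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                if sim(pool[i], pool[j]) >= tau:
                    target = pool[i] | pool[j]     # e_merged keeps all attributes
                    pool = [e for k, e in enumerate(pool) if k not in (i, j)]
                    pool.append(target)
                    merged = True
                    break
            if merged:
                break
    return pool
```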
The related art handles multiple representations of the same entity rather coarsely, which easily causes redundancy or mismatches. In the present application, weighted Jaccard similarity, similarity thresholds generated dynamically through contextual analysis and semantic density calculation of the entities, and a recursion-enhanced entity merging technique together ensure that identical entities from different sources are accurately matched and merged, achieving entity deduplication and accurate merging of multi-source heterogeneous data.
Moreover, performing entity deduplication through weighted Jaccard similarity and semantic density analysis ensures that entity representation in the graph structure is unique. The method can be adjusted dynamically according to entity semantic information, avoiding redundant data. This ensures that the graph structure data generated for compliance examination is clear and consistent, greatly reducing noise and data redundancy and improving the accuracy of audit results.
In some embodiments, certain community detection algorithms (e.g., the Louvain algorithm, the Girvan-Newman algorithm, etc.) may be employed to perform community partitioning of the graph structure.
In some embodiments, the performing community division of the entity based on the first graph structure to construct a second graph structure includes:
Calculating an allocation score of a first node in the first graph structure, and performing community allocation on the first node based on the allocation score of the first node, wherein the allocation score indicates a score for allocating the first node to a first community, and the first community comprises a second node in the first graph structure;
Calculating the similarity between different communities, and carrying out community merging judgment based on the similarity between different communities to obtain a plurality of communities;
And calculating global modularity of community division based on semantic density of nodes in the first graph structure and transaction relation strength among different nodes, and carrying out community adjustment on the nodes based on the global modularity to obtain a second graph structure.
In some embodiments, the computing the allocation score for the first node in the first graph structure includes:
calculating the semantic density of the first node based on the transaction amount score, the transaction frequency score and the node type score of the entity corresponding to the first node;
Calculating the transaction relation strength between the first node and the second node based on the transaction amount score and the transaction number score between the first node corresponding entity and the second node corresponding entity;
Calculating the total weight of all the second nodes in the first community;
An allocation score for the first node is determined based on the semantic density of the first node, the strength of the trade relationship between the first node and the second nodes, and the total weight of all second nodes in the first community.
The Leiden algorithm can be adopted for community division. It is a community detection algorithm that optimizes modularity (Modularity), with the goal of finding a community division in which node connections inside communities are dense and node connections between communities are sparse. The basic flow of the Leiden algorithm is: a local movement stage, in which the community structure is locally optimized by moving nodes to adjacent communities; a community merging stage, in which the communities generated by the local movement stage are treated as new nodes to form larger aggregate communities; and a global optimization stage, in which local optimization and merging are repeated until the modularity no longer improves.
In a payment network, considering only the connections between nodes may not be sufficient to capture complex compliance risks, so the community detection effect of the Leiden algorithm can be enhanced by introducing Semantic Density (SD) and Transaction Relationship Strength (TRS).
Improvement of the local movement phase:
in the original Leiden algorithm, nodes are moved to different communities based on the strength of their connections (i.e., edge weights) with neighboring nodes. This embodiment optimizes this process by introducing Semantic Density (SD), which measures the criticality of nodes (e.g., merchants), and Transaction Relationship Strength (TRS), which measures the actual strength of the transactions between merchants. Key nodes, such as high-volume merchants or legal representatives, are given higher priority, and transaction data such as the number of transactions and the amount are used to weight the edges. The formula is improved as follows:
L(i, C) = SD(i) × Σj∈C TRS(i, j) / W(C)
where L(i, C) is the allocation score for assigning node i to community C; SD(i) is the semantic density of node i, representing the importance of that node; TRS(i, j) is the strength of the transaction relationship between node i and node j (which may be determined based on transaction frequency, amount, etc.); and W(C) is the total weight of all nodes in community C.
Node types are set: merchant (M1), user (U1), bank account (B1), address (A1), cumulative limit (L1), and legal representative (F1).
Relationship examples: user U1 transfers 100 yuan to merchant M1; the annual cumulative limit of user U1 does not exceed 100 yuan.
The SD(i) of each node is calculated; the formula may be SD(i) = α × transaction amount score + β × transaction count score + γ × node type score, where α + β + γ = 1 and α, β, γ represent the weights of the respective scores. The weight coefficients are set to α = 0.4, β = 0.3, γ = 0.3.
Transaction amount score of user U1: the annual cumulative transaction amount of user U1 is 100 yuan. Assuming the maximum user transaction amount is 100,000 yuan and the minimum is 0 yuan, the transaction amount score = (U1's amount − minimum) / (maximum − minimum) = (100 − 0) / (100000 − 0) = 0.001.
Transaction count score of user U1: the number of transactions of user U1 is 1. Assuming the maximum number of user transactions is 1000 and the minimum is 0, the transaction count score = (1 − 0) / (1000 − 0) = 0.001.
Node type score: for an ordinary user, the score is set to 0.2 (score range 0 to 1).
SD(U1) is calculated as SD(U1) = 0.4 × 0.001 + 0.3 × 0.001 + 0.3 × 0.2 = 0.0004 + 0.0003 + 0.06 = 0.0607.
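The SD(U1) computation above can be reproduced with a short sketch; the maximum/minimum normalization bounds are those assumed in the example (minimum values taken as 0):

```python
def semantic_density(amount, count, type_score,
                     amount_max=100000.0, count_max=1000.0,
                     alpha=0.4, beta=0.3, gamma=0.3):
    """SD(i) = alpha * amount_score + beta * count_score + gamma * type_score,
    with amount and count min-max normalised (minimums assumed 0)."""
    amount_score = amount / amount_max
    count_score = count / count_max
    return alpha * amount_score + beta * count_score + gamma * type_score
```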
The transaction relation strength TRS (i, j) is calculated.
The transaction relationship strength TRS(i, j) measures how tight the transactions are between node i and node j. The calculation formula is TRS(i, j) = δ × transaction amount score + ε × transaction count score, where δ + ε = 1 and δ, ε represent the weights of the respective scores.
Setting the weight coefficients δ = 0.7 and ε = 0.3, and calculating the transaction amount score and the transaction count score as described above, the transaction relationship strength between user U1 and merchant M1 can be calculated as TRS(U1, M1) = 0.7 × 0.00001 + 0.3 × 0.00001 = 0.000007 + 0.000003 = 0.00001.
The total weight of the community W (C) is calculated.
Initial state, each node is an independent community.
Suppose node U1 attempts to move to community C (which contains node M1). The total weight of community C is W(C) = SD(M1) = 0.150007.
The allocation score L(U1, C) for assigning the node to community C is:
L(U1, C) = (SD(U1) × TRS(U1, M1)) / W(C) = (0.0607 × 0.00001) / 0.150007 ≈ 0.000000607 / 0.150007 ≈ 0.00000405.
Decision: the scores are compared to determine whether to assign node U1 to community C. The current score is L(U1, C) = 0.00000405; since the score is positive and node U1 has a direct transaction relationship with node M1, node U1 is moved to community C.
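The TRS and allocation-score formulas of this stage can be sketched together; the numeric inputs below follow the example above as given in the text:

```python
def trs(amount_score, count_score, delta=0.7, epsilon=0.3):
    """TRS(i, j) = delta * amount_score + epsilon * count_score, delta + epsilon = 1."""
    return delta * amount_score + epsilon * count_score

def allocation_score(sd_i, trs_to_members, w_c):
    """L(i, C) = SD(i) * sum of TRS(i, j) over members j of community C, over W(C)."""
    return sd_i * sum(trs_to_members) / w_c
```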
Improvement of community merging stage.
In the community merging stage of the Leiden algorithm, the modularity is further improved by merging adjacent communities. This embodiment introduces the semantic similarity between communities as an additional merging condition, using the community semantic similarity to judge whether adjacent communities should be merged. The formula is improved as:
Sim(C1, C2) = (Σi∈C1 Σj∈C2 Sim(E(i), E(j))) / (|C1| × |C2|)
where Sim(C1, C2) is the similarity between community C1 and community C2, and Sim(E(i), E(j)) is the similarity of the embedding vectors of node i and node j, which may be calculated based on BERT vector similarity. All nodes i from community C1 and all nodes j from community C2 are traversed, the embedded vector similarity Sim(E(i), E(j)) of each pair of nodes is calculated, and all similarity values are accumulated; |C1| and |C2| are the numbers of nodes in communities C1 and C2. If the calculated Sim(C1, C2) is greater than a preset threshold (e.g., 0.8), the two communities are considered highly similar and suitable for merging.
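A minimal sketch of this community-similarity test, assuming node embeddings are given as plain vectors and pairwise similarity is cosine similarity:

```python
import math

def community_similarity(c1_vecs, c2_vecs):
    """Sim(C1, C2): sum of pairwise embedding similarities between the two
    communities' nodes, normalised by |C1| * |C2|."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    total = sum(cos(u, v) for u in c1_vecs for v in c2_vecs)
    return total / (len(c1_vecs) * len(c2_vecs))
```

Two communities whose similarity exceeds the preset threshold (e.g., 0.8) would be considered candidates for merging.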
Improvement of global optimization stage.
In the global optimization stage, semantic density and transaction relationship strength are introduced to recalculate the global modularity of the community division. The improved global optimization modularity formula is:
Q' = (1 / 2m) × Σi,j [TRS(i, j) × SDij − (ki × kj) / (2m)] × δ(Ci, Cj)
where the improved modularity Q' characterizes the quality of the community division (the larger the value, the better the division); TRS(i, j) is the strength of the transaction relationship between nodes i and j; SDij is the sum of the semantic densities of nodes i and j; ki and kj are the degrees of the nodes, i.e., the number of connections of each node; m is the total number of edges in the graph; and δ(Ci, Cj) indicates whether nodes i and j belong to the same community (1 if in the same community, 0 otherwise).
Suppose the current community division is: community 1 contains nodes A and B, and community 2 contains nodes C and D. By calculating the improved modularity Q', the associated transactions between nodes are identified to determine whether communities need to be repartitioned or merged; assume the current Q' = 0.2896. If the division is changed so that community 1 contains node A and community 2 contains nodes B, C, and D, the new Q' = 0.0684. The modularity decreases, meaning the quality of the community division is worse after this change, so merging node B into the community of nodes C and D is not appropriate.
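The improved modularity above can be sketched directly from the formula. This is an illustrative implementation under stated assumptions: TRS is given as a symmetric lookup table, SDij = SD(i) + SD(j), and the toy graph below is chosen so that a clean two-community partition scores higher than a lopsided one (the example's exact 0.2896/0.0684 values depend on data not given in the text).

```python
def improved_modularity(nodes, edges, trs, sd, community):
    """Q' = (1/2m) * sum over same-community pairs (i, j) of
    [TRS(i, j) * SD_ij - k_i * k_j / (2m)], with SD_ij = SD(i) + SD(j)."""
    m = len(edges)
    deg = {n: 0 for n in nodes}
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    q = 0.0
    for i in nodes:
        for j in nodes:
            if community[i] != community[j]:
                continue                            # delta(Ci, Cj) = 0
            strength = trs.get((i, j), trs.get((j, i), 0.0))
            q += strength * (sd[i] + sd[j]) - deg[i] * deg[j] / (2 * m)
    return q / (2 * m)
```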
The community detection algorithms in the related art have difficulty effectively capturing the associations between complex transaction networks and merchants. The present application improves the Leiden algorithm: based on analysis of transaction relationship strength and semantic density, key factors of the transaction data, such as semantic density, transaction frequency, and amount, are used as weights for community division. This significantly improves community detection on transaction data in complex payment networks and makes the community division more accurate, so that cross-merchant transaction chains and potential compliance problems can be identified and potential abnormal transactions can be found during compliance examination.
Moreover, by improving the Leiden algorithm to take semantic density, transaction relationship strength, and other factors into account, community structures in the transaction network can be detected more accurately. This improvement accurately partitions merchants, accounts, and so on in the complex payment network, facilitating audit. The analysis capability for complex transaction networks is enhanced, and potential cross-network violations can be effectively identified, particularly in multi-level compliance inspection.
The application has very broad market prospects, especially in the payment and financial technology industries. It can remarkably improve the efficiency and accuracy of compliance examination and is applicable to various industry fields. With increasingly strict global compliance regulatory requirements, the market application of the present application will become even more extensive and can continuously create commercial value, for the following reasons:
1. Compliance inspection requirements are strong
With the rapid development of financial technology and the payment industry, global payment compliance requirements are becoming stricter; in particular, compliance pressure regarding anti-money laundering, data privacy protection, and the like is increasing, and the compliance risk faced by payment institutions and financial institutions is also growing. Therefore, technical solutions that enable automated, large-scale payment transaction compliance review have wide market demand. The present application can rapidly process large-scale transaction data and provide intelligent compliance analysis by combining a large language model, the GraphRAG multi-level graph structure, and graph retrieval enhancement technology, and therefore has very strong commercial value in the financial industry.
2. Wide application range
The method is not only suitable for compliance examination in the payment industry, but also can be popularized to other fields with similar compliance requirements. Such as loan inspection in banking, compliance checking in securities trading, even trade compliance monitoring for e-commerce platforms, etc. This versatility means that the market application of the present application is quite extensive and applicable to all kinds of financial and transaction platforms worldwide.
3. Combination of big data and artificial intelligence technology
The application is based on large language models (such as BERT, GPT, etc.) and graph convolutional networks, and is capable of handling compliance review of both unstructured and structured data. This combination of intelligent technologies provides strong learning ability and scalability, allows compliance standards to be updated in real time, and continuously optimizes the examination effect over time. This property gives it very high technical barriers and market competitiveness in the field of big data analysis and compliance management.
4. High efficiency and cost advantage
The traditional payment compliance inspection process generally requires a large amount of manpower and complex compliance inspection steps, with high time cost and low efficiency. The present application can greatly improve efficiency, reduce human error, and lower manual inspection costs through automated intelligent inspection. In the marketplace, such an efficient and low-cost solution will necessarily be favored by a large number of financial institutions and payment platforms.
5. Great market potential
According to market research, the global compliance technology (RegTech) market is growing rapidly and is expected to expand at a double-digit annual growth rate in the coming years. Especially in the payment industry, the constant tightening of regulatory requirements for payment compliance will push the market demand for intelligent compliance technology. The present application is in line with this market trend and has great market potential.
6. Policy and administration support
Regulatory authorities in various countries are continually issuing policies that require financial and payment institutions to strengthen compliance management, which gives intelligent compliance review tools policy support. For example, the General Data Protection Regulation (GDPR), Anti-Money Laundering (AML) legislation, and China's payment industry regulatory policies all further drive the market development of compliance review technology. The application therefore meets global compliance review requirements and has long-term market adaptability.
The following describes a payment compliance checking device provided by an embodiment of the present invention.
Referring to fig. 7, a schematic structural diagram of a payment compliance checking device provided in an embodiment of the present invention is shown, and as shown in fig. 7, a payment compliance checking device 700 includes:
An acquiring module 701, configured to acquire information to be inspected in a business scenario of payment compliance inspection;
the information extraction module 702 is configured to perform information extraction based on the information to be checked to obtain information to be processed, where the information to be processed includes a target keyword of the information to be checked;
The retrieval module 703 is configured to retrieve the target keyword based on a pre-constructed three-level graph database, so as to obtain a first embedded vector representation of retrieval information related to the target keyword in each structural layer, where the three levels are a structural layer of text blocks in payment related information, a graph structure layer representing the relationships between entities in the text blocks, and a graph structure layer representing the relationships between communities of the entities;
and the generating and enhancing processing module 704 is configured to perform generating and enhancing processing based on the first embedded vector representation, so as to obtain a compliance review result of the information to be reviewed.
In some embodiments, the apparatus further comprises:
the text blocking module is used for performing text blocking on the payment related information to obtain text blocks in the payment related information;
The graph construction module is used for extracting information based on the text blocks and de-duplicating the extracted entities, so as to construct a first graph structure based on the relationships between the de-duplicated entities, where the first graph structure represents the relationships between the entities in the text blocks;
the community division module is used for carrying out community division on the entity based on the first graph structure so as to construct and obtain a second graph structure, wherein the second graph structure represents the relationship among communities of the entity;
And the library construction module is used for constructing a graph database of the three-layer structure based on the text block, the first graph structure and the second graph structure.
In some embodiments, the text blocking module is specifically configured to:
Calculating the semantic density of each sentence in the text of the payment related information;
based on the text structure of the payment related information, performing first block processing on the payment related information to obtain candidate text blocks;
and performing second block division processing on the candidate text blocks based on the semantic density of sentences in the candidate text blocks to obtain text blocks in the payment related information.
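As a rough illustration of the two-pass blocking above, the following Python sketch splits a candidate text block by sentence semantic density. The density measure (share of non-stopword tokens), the stopword list, and the split thresholds are all illustrative assumptions — the text does not specify the exact formula.

```python
import re

STOPWORDS = frozenset({"the", "a", "an", "of", "to", "in", "is"})  # assumed list

def semantic_density(sentence):
    # Approximate semantic density as the share of non-stopword tokens.
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(t not in STOPWORDS for t in tokens) / len(tokens)

def second_pass_split(candidate_block, max_sentences=3, min_density=0.5):
    # Second blocking pass: close the current block when it is full or when
    # a low-density sentence arrives, keeping dense sentences together.
    sentences = [s.strip() for s in candidate_block.split(".") if s.strip()]
    blocks, current = [], []
    for s in sentences:
        current.append(s)
        if len(current) >= max_sentences or semantic_density(s) < min_density:
            blocks.append(". ".join(current))
            current = []
    if current:
        blocks.append(". ".join(current))
    return blocks
```

In practice the density would likely come from an embedding model rather than a stopword ratio; the control flow of the second pass is the point here.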
In some embodiments, the graph construction module is specifically configured to:
calculating a first similarity between every two entities in the text block;
Under the condition that the first similarity is larger than a first preset threshold, creating a target entity, wherein the target entity is an entity obtained by combining two similar entities;
Calculating second similarity between the target entity and other uncombined entities in the text block;
and performing de-duplication of the target entity and other uncombined entities in the text block based on the second similarity.
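The threshold-based merge described in this embodiment can be sketched as a greedy loop. The string-ratio similarity and the rule of keeping the longer name as the target entity are illustrative assumptions; the text leaves both the similarity function and the merge rule open.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in for the first/second similarity between two entity names.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def deduplicate_entities(entities, threshold=0.85):
    # Greedy merge: when an entity exceeds the threshold against a kept
    # entity, replace the kept one with a merged "target entity" (here,
    # the longer name) and move on; otherwise keep the entity as new.
    merged = []
    for ent in entities:
        for i, kept in enumerate(merged):
            if similarity(ent, kept) > threshold:
                merged[i] = max(ent, kept, key=len)  # merged target entity
                break
        else:
            merged.append(ent)
    return merged
```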
In some embodiments, the graph construction module is further configured to at least one of:
Calculating the weight of each character in the two entities and calculating the similarity between the same characters in the two entities, and weighting the similarity of the same characters based on the weight of each character and the similarity between the same characters to obtain a first similarity between the two entities;
A second embedded vector representation based on text blocks corresponding to each of the two entities; A first similarity between the two entities is calculated based on the second embedded vector representation.
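A possible reading of the character-weighted similarity above, sketched in Python. The inverse-frequency character weights and the overlap formula are assumptions for illustration only; the text names the ingredients (per-character weights, similarity of identical characters, weighting) but not the exact combination.

```python
from collections import Counter

def char_weighted_similarity(a, b):
    # Weight each character by its inverse frequency within the pair,
    # then score the weighted overlap of characters shared by both names.
    counts = Counter(a) + Counter(b)
    weights = {c: 1.0 / counts[c] for c in counts}
    shared = set(a) & set(b)
    overlap = sum(weights[c] * min(a.count(c), b.count(c)) for c in shared)
    total = sum(weights[c] * counts[c] for c in counts)
    return 2 * overlap / total if total else 0.0
```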
In some embodiments, the community dividing module is specifically configured to:
Calculating an allocation score of a first node in the first graph structure, and performing community allocation on the first node based on the allocation score of the first node, wherein the allocation score indicates a score for whether the first node should be allocated to a first community, and the first community comprises a second node in the first graph structure;
Calculating the similarity between different communities, and carrying out community merging judgment based on the similarity between different communities to obtain a plurality of communities;
And calculating global modularity of community division based on semantic density of nodes in the first graph structure and transaction relation strength among different nodes, and carrying out community adjustment on the nodes based on the global modularity to obtain a second graph structure.
In some embodiments, the community partitioning module is further configured to:
calculating the semantic density of the first node based on the transaction amount score, the transaction frequency score and the node type score of the entity corresponding to the first node;
Calculating the transaction relation strength between the first node and the second node based on the transaction amount score and the transaction number score between the first node corresponding entity and the second node corresponding entity;
Calculating the total weight of all the second nodes in the first community;
An allocation score for the first node is determined based on the semantic density of the first node, the strength of the trade relationship between the first node and the second nodes, and the total weight of all second nodes in the first community.
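The allocation-score computation above might be sketched as follows. The combination weights and the damping by the community's total weight are illustrative guesses — the text names the inputs (transaction amount, frequency, and node-type scores; relation strength; community total weight) but not the formula.

```python
def node_semantic_density(amount_score, frequency_score, type_score,
                          weights=(0.5, 0.3, 0.2)):
    # Weighted combination of the three node scores; the weights are
    # illustrative, not taken from the text.
    wa, wf, wt = weights
    return wa * amount_score + wf * frequency_score + wt * type_score

def allocation_score(density, relation_strengths, community_total_weight):
    # Higher when the node is semantically dense and strongly tied to the
    # community's second nodes, damped by the community's existing total
    # weight (a modularity-style penalty; the exact form is assumed).
    tie = sum(relation_strengths)
    return density * tie / (1.0 + community_total_weight)
```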
In some embodiments, the retrieving module 703 is specifically configured to:
Based on the target keywords, sequentially searching the target keywords according to the search routes of the structural layers of the text blocks, the graph structural layers representing the relation among the entities in the text blocks and the graph structural layers representing the relation among communities of the entities so as to sequentially obtain search information related to the target keywords in each structural layer;
and carrying out embedded vector representation on the retrieval information related to the target keyword in each structural layer in sequence to obtain a first embedded vector representation of the retrieval information related to the target keyword in each structural layer.
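The sequential three-layer retrieval route can be sketched with plain dictionaries standing in for the graph database. Substring matching is an assumption made for illustration; a real system would query the graph store and then embed each layer's hits.

```python
def three_level_retrieve(keyword, text_blocks, entity_graph, community_graph):
    # Layer 1: text blocks containing the target keyword.
    hits = [b for b in text_blocks if keyword in b]
    # Layer 2: entities mentioned in the hit blocks, with their relations.
    entities = {e: rels for e, rels in entity_graph.items()
                if any(e in b for b in hits)}
    # Layer 3: communities containing any of those entities.
    communities = {c: members for c, members in community_graph.items()
                   if entities.keys() & set(members)}
    return hits, entities, communities
```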
In some embodiments, the retrieval module 703 is further configured to at least one of:
Calculating the sentence-to-sentence similarity between every two first text blocks based on at least two first text blocks related to the target keyword in the structural layer of the text blocks, constructing a similarity matrix based on the sentence-to-sentence similarity between every two first text blocks, converting the similarity matrix into a distance matrix, determining a first embedded vector representation of the first text blocks based on the distance matrix, and the retrieval information comprises the at least two first text blocks;
Based on a third node related to the target keyword in the graph structure layer representing the relationships between entities in the text blocks, and on the entity relationships of the third node, aggregating the entity relationships of the third node at different model levels by using a first network embedding model, and outputting a first embedded vector representation of the entity corresponding to the third node, wherein the third node is related to the first text blocks, and the retrieval information comprises the entity corresponding to the third node and its entity relationships;
based on the graph structure layer representing the relation between communities of the entities, the neighbor information connected with the third node side and the community information of the third node are aggregated under different model levels by utilizing a second network embedding model, the first embedding vector representation of the community corresponding to the third node is output, and the search information comprises the neighbor information connected with the third node side and the community information of the third node.
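A minimal stand-in for the network embedding models above: mean-aggregation message passing, where each layer averages a node's vector with its neighbours' vectors. Real graph convolutional networks use learned weight matrices and nonlinearities; this sketch keeps only the layer-wise aggregation structure that the text refers to.

```python
def gcn_embed(node, adjacency, features, layers=2):
    # adjacency: node -> list of neighbour nodes; features: node -> vector.
    # Each layer replaces every node's vector with the mean of its own
    # vector and its neighbours' vectors from the previous layer.
    emb = dict(features)
    for _ in range(layers):
        new_emb = {}
        for n, vec in emb.items():
            neigh = [emb[m] for m in adjacency.get(n, []) if m in emb]
            pooled = [vec] + neigh
            new_emb[n] = [sum(vals) / len(pooled) for vals in zip(*pooled)]
        emb = new_emb
    return emb[node]
```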
In some embodiments, the generating enhancement processing module 704 is specifically configured to:
Vector aggregation is carried out based on the first embedded vector representation of the retrieval information related to the target keyword in each structural layer, so that an aggregation vector is obtained;
Based on the aggregate vector, adopting a recursion structure to generate an inference result of each inference step, and based on the aggregate vector and the generation probability determined by the target inference result, screening candidate examination results of the information to be examined by using a limited decoding strategy, wherein the target inference result comprises an inference result of the current inference step and an inference result obtained before the current inference step;
And carrying out compliance scoring on the candidate review results based on the global context information characterized by the aggregate vector, and determining the compliance review result of the information to be reviewed based on the scores.
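The aggregation, constrained decoding, and compliance scoring steps might be sketched as follows. Representing candidate review results as labelled vectors, using a dot product as the compliance score, and filtering by an allowed label set are all illustrative simplifications of the limited decoding strategy described above.

```python
def aggregate(vectors):
    # Element-wise mean of the per-layer embedded vector representations.
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

def constrained_pick(candidates, agg, allowed):
    # Constrained-decoding stand-in: only candidates in the allowed label
    # set survive; each survivor is compliance-scored against the aggregate
    # vector and the best one is returned with its score.
    def score(label):
        vec = candidates[label]
        return sum(a * v for a, v in zip(agg, vec))  # dot-product score
    legal = [c for c in candidates if c in allowed]
    best = max(legal, key=score)
    return best, score(best)
```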
In some embodiments, the aggregate vector further comprises at least one of:
The embedded vector corresponding to the candidate data set matched with the target keyword is obtained by carrying out vector retrieval on the vector of the target keyword based on a pre-constructed vector library, and the vector library comprises vectors of payment related information;
the embedded vector corresponding to the query sub-problem of the information to be checked is obtained by utilizing a large model to conduct query decomposition on the information to be checked based on the target keyword.
The payment compliance checking device 700 can implement the processes implemented in the above payment compliance checking method embodiments and achieve the same technical effects; to avoid repetition, details are not repeated here.
Referring to fig. 8, a schematic structural diagram of an electronic device according to an embodiment of the present invention is shown. As shown in fig. 8, the electronic device 800 includes a processor 801, a memory 802, a user interface 803, and a bus interface 804.
A processor 801 for reading the program in the memory 802, performs the following procedures:
acquiring information to be checked in a business scene of payment compliance checking;
Information extraction is carried out based on the information to be checked to obtain information to be processed, wherein the information to be processed comprises target keywords of the information to be checked;
Searching the target keyword based on a pre-constructed three-level structure graph database to obtain a first embedded vector representation of search information related to the target keyword in each structural layer, wherein the three levels of structures are respectively a structural layer of a text block in payment related information, a graph structural layer representing a relationship between entities in the text block and a graph structural layer representing a relationship between communities of the entities;
And generating and enhancing the first embedded vector representation to obtain a compliance review result of the information to be reviewed.
In fig. 8, the bus architecture may comprise any number of interconnected buses and bridges, linking together one or more processors represented by the processor 801 and various memory circuits represented by the memory 802. The bus architecture may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. The bus interface 804 provides an interface. For different user equipment, the user interface 803 may also be an interface capable of externally or internally connecting required devices, including but not limited to a keypad, display, speaker, microphone, joystick, etc.
The processor 801 is responsible for managing the bus architecture and general processing, and the memory 802 may store data used by the processor 801 in performing operations.
In some embodiments, the processor 801 is further configured to:
text blocking is carried out on the payment related information to obtain text blocks in the payment related information;
extracting information based on the text block, and de-duplicating the extracted entity to construct a first graph structure based on the relationship between the entity after de-duplication and the entity, wherein the first graph structure represents the relationship between the entities in the text block;
Carrying out community division on the entities based on the first graph structure to construct a second graph structure, wherein the second graph structure represents the relationship among communities of the entities;
and constructing a graph database of the three-layer structure based on the text block, the first graph structure and the second graph structure.
In some embodiments, the processor 801 is further configured to:
Calculating the semantic density of each sentence in the text of the payment related information;
based on the text structure of the payment related information, performing first block processing on the payment related information to obtain candidate text blocks;
and performing second block division processing on the candidate text blocks based on the semantic density of sentences in the candidate text blocks to obtain text blocks in the payment related information.
In some embodiments, the processor 801 is further configured to:
calculating a first similarity between every two entities in the text block;
Under the condition that the first similarity is larger than a first preset threshold, creating a target entity, wherein the target entity is an entity obtained by combining two similar entities;
Calculating second similarity between the target entity and other uncombined entities in the text block;
and performing de-duplication of the target entity and other uncombined entities in the text block based on the second similarity.
In some embodiments, the processor 801 is further configured to:
Calculating the weight of each character in the two entities and calculating the similarity between the same characters in the two entities, and weighting the similarity of the same characters based on the weight of each character and the similarity between the same characters to obtain a first similarity between the two entities;
A second embedded vector representation based on text blocks corresponding to each of the two entities; A first similarity between the two entities is calculated based on the second embedded vector representation.
In some embodiments, the processor 801 is further configured to:
Calculating an allocation score of a first node in the first graph structure, and performing community allocation on the first node based on the allocation score of the first node, wherein the allocation score indicates a score for whether the first node should be allocated to a first community, and the first community comprises a second node in the first graph structure;
Calculating the similarity between different communities, and carrying out community merging judgment based on the similarity between different communities to obtain a plurality of communities;
And calculating global modularity of community division based on semantic density of nodes in the first graph structure and transaction relation strength among different nodes, and carrying out community adjustment on the nodes based on the global modularity to obtain a second graph structure.
In some embodiments, the processor 801 is further configured to:
calculating the semantic density of the first node based on the transaction amount score, the transaction frequency score and the node type score of the entity corresponding to the first node;
Calculating the transaction relation strength between the first node and the second node based on the transaction amount score and the transaction number score between the first node corresponding entity and the second node corresponding entity;
Calculating the total weight of all the second nodes in the first community;
An allocation score for the first node is determined based on the semantic density of the first node, the strength of the trade relationship between the first node and the second nodes, and the total weight of all second nodes in the first community.
In some embodiments, the processor 801 is further configured to:
Based on the target keywords, sequentially searching the target keywords according to the search routes of the structural layers of the text blocks, the graph structural layers representing the relation among the entities in the text blocks and the graph structural layers representing the relation among communities of the entities so as to sequentially obtain search information related to the target keywords in each structural layer;
and carrying out embedded vector representation on the retrieval information related to the target keyword in each structural layer in sequence to obtain a first embedded vector representation of the retrieval information related to the target keyword in each structural layer.
In some embodiments, the processor 801 is further configured to:
Calculating the sentence-to-sentence similarity between every two first text blocks based on at least two first text blocks related to the target keyword in the structural layer of the text blocks, constructing a similarity matrix based on the sentence-to-sentence similarity between every two first text blocks, converting the similarity matrix into a distance matrix, determining a first embedded vector representation of the first text blocks based on the distance matrix, and the retrieval information comprises the at least two first text blocks;
Based on a third node related to the target keyword in the graph structure layer representing the relationships between entities in the text blocks, and on the entity relationships of the third node, aggregating the entity relationships of the third node at different model levels by using a first network embedding model, and outputting a first embedded vector representation of the entity corresponding to the third node, wherein the third node is related to the first text blocks, and the retrieval information comprises the entity corresponding to the third node and its entity relationships;
based on the graph structure layer representing the relation between communities of the entities, the neighbor information connected with the third node side and the community information of the third node are aggregated under different model levels by utilizing a second network embedding model, the first embedding vector representation of the community corresponding to the third node is output, and the search information comprises the neighbor information connected with the third node side and the community information of the third node.
In some embodiments, the processor 801 is further configured to:
Vector aggregation is carried out based on the first embedded vector representation of the retrieval information related to the target keyword in each structural layer, so that an aggregation vector is obtained;
Based on the aggregate vector, adopting a recursion structure to generate an inference result of each inference step, and based on the aggregate vector and the generation probability determined by the target inference result, screening candidate examination results of the information to be examined by using a limited decoding strategy, wherein the target inference result comprises an inference result of the current inference step and an inference result obtained before the current inference step;
And carrying out compliance scoring on the candidate review results based on the global context information characterized by the aggregate vector, and determining the compliance review result of the information to be reviewed based on the scores.
In some embodiments, the aggregate vector further comprises at least one of:
The embedded vector corresponding to the candidate data set matched with the target keyword is obtained by carrying out vector retrieval on the vector of the target keyword based on a pre-constructed vector library, and the vector library comprises vectors of payment related information;
the embedded vector corresponding to the query sub-problem of the information to be checked is obtained by utilizing a large model to conduct query decomposition on the information to be checked based on the target keyword.
Preferably, the embodiment of the present invention further provides an electronic device 800, including a processor 801, a memory 802, and a computer program stored in the memory 802 and capable of running on the processor 801, where the computer program when executed by the processor 801 implements each process of the above embodiment of the payment compliance checking method, and the same technical effects can be achieved, and for avoiding repetition, a detailed description is omitted herein.
The embodiment of the invention also provides a readable storage medium storing a computer program, where the computer program, when executed by a processor, implements each process of the above embodiment of the payment compliance checking method and can achieve the same technical effects; details are not repeated here. The readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the application also provides a computer program product, which comprises computer instructions, wherein the computer instructions realize the processes of the embodiment of the payment compliance checking method when being executed by a processor, and the same technical effects can be achieved, so that repetition is avoided, and the description is omitted here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes various media capable of storing program codes such as a U disk, a mobile hard disk, a ROM, a RAM, a magnetic disk or an optical disk.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (15)

1. A method of payment compliance review, the method comprising:
acquiring information to be checked in a business scene of payment compliance checking;
Information extraction is carried out based on the information to be checked to obtain information to be processed, wherein the information to be processed comprises target keywords of the information to be checked;
Searching the target keyword based on a pre-constructed three-level structure graph database to obtain a first embedded vector representation of search information related to the target keyword in each structural layer, wherein the three levels of structures are respectively a structural layer of a text block in payment related information, a graph structural layer representing a relationship between entities in the text block and a graph structural layer representing a relationship between communities of the entities;
And generating and enhancing the first embedded vector representation to obtain a compliance review result of the information to be reviewed.
2. The method according to claim 1, wherein prior to obtaining information to be inspected in a business scenario for payment compliance inspection, the method comprises:
text blocking is carried out on the payment related information to obtain text blocks in the payment related information;
extracting information based on the text block, and de-duplicating the extracted entity to construct a first graph structure based on the relationship between the entity after de-duplication and the entity, wherein the first graph structure represents the relationship between the entities in the text block;
Carrying out community division on the entities based on the first graph structure to construct a second graph structure, wherein the second graph structure represents the relationship among communities of the entities;
and constructing a graph database of the three-layer structure based on the text block, the first graph structure and the second graph structure.
3. The method according to claim 2, wherein text-blocking the payment related information to obtain text blocks in the payment related information comprises:
Calculating the semantic density of each sentence in the text of the payment related information;
based on the text structure of the payment related information, performing first block processing on the payment related information to obtain candidate text blocks;
and performing second block division processing on the candidate text blocks based on the semantic density of sentences in the candidate text blocks to obtain text blocks in the payment related information.
4. The method of claim 2, wherein de-duplicating the extracted entity comprises:
calculating a first similarity between every two entities in the text block;
Under the condition that the first similarity is larger than a first preset threshold, creating a target entity, wherein the target entity is an entity obtained by combining two similar entities;
Calculating second similarity between the target entity and other uncombined entities in the text block;
and performing de-duplication of the target entity and other uncombined entities in the text block based on the second similarity.
5. The method of claim 4, wherein said calculating a first similarity between each two entities in said block of text comprises at least one of:
Calculating the weight of each character in the two entities and calculating the similarity between the same characters in the two entities, and weighting the similarity of the same characters based on the weight of each character and the similarity between the same characters to obtain a first similarity between the two entities;
A second embedded vector representation based on text blocks corresponding to each of the two entities; A first similarity between the two entities is calculated based on the second embedded vector representation.
6. The method according to claim 2, wherein the performing community division of the entity based on the first graph structure to construct a second graph structure includes:
Calculating an allocation score of a first node in the first graph structure, and performing community allocation on the first node based on the allocation score of the first node, wherein the allocation score indicates a score for whether the first node should be allocated to a first community, and the first community comprises a second node in the first graph structure;
Calculating the similarity between different communities, and carrying out community merging judgment based on the similarity between different communities to obtain a plurality of communities;
And calculating global modularity of community division based on semantic density of nodes in the first graph structure and transaction relation strength among different nodes, and carrying out community adjustment on the nodes based on the global modularity to obtain a second graph structure.
7. The method of claim 6, wherein calculating the allocation score for the first node in the first graph structure comprises:
calculating the semantic density of the first node based on the transaction amount score, the transaction frequency score and the node type score of the entity corresponding to the first node;
Calculating the transaction relation strength between the first node and the second node based on the transaction amount score and the transaction number score between the first node corresponding entity and the second node corresponding entity;
Calculating the total weight of all the second nodes in the first community;
An allocation score for the first node is determined based on the semantic density of the first node, the strength of the trade relationship between the first node and the second nodes, and the total weight of all second nodes in the first community.
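Claims 6 and 7 combine per-node scores into a semantic density and then into an allocation score. The sketch below is an assumed formulation: the linear weights in `semantic_density` and the normalisation by the community's total weight in `allocation_score` are illustrative choices, not the patent's actual formulas.

```python
def semantic_density(amount_score: float, freq_score: float,
                     type_score: float, w=(0.4, 0.4, 0.2)) -> float:
    # Linear combination of the three per-node scores of claim 7;
    # the weights `w` are illustrative assumptions.
    return w[0] * amount_score + w[1] * freq_score + w[2] * type_score

def allocation_score(density: float, strengths, community_weight: float) -> float:
    """Score for allocating a first node to a first community.

    `strengths` holds the transaction-relation strengths between the
    first node and each second node already in the community; dividing
    by the community's total weight is an assumed normalisation.
    """
    if community_weight <= 0.0:
        return 0.0
    return density * sum(strengths) / community_weight
```

A node would then be assigned to whichever community maximises this score, in the spirit of modularity-based community detection.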
8. The method according to claim 1, wherein the searching the target keyword based on the pre-built three-level structure graph database to obtain a first embedded vector representation of search information related to the target keyword in each structural layer includes:
Based on the target keyword, searching sequentially along a search route through the structural layer of text blocks, the graph structure layer representing relationships among entities in the text blocks, and the graph structure layer representing relationships among communities of the entities, so as to obtain in turn the search information related to the target keyword in each structural layer;
and carrying out embedded vector representation on the retrieval information related to the target keyword in each structural layer in sequence to obtain a first embedded vector representation of the retrieval information related to the target keyword in each structural layer.
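The three-layer search route of claim 8 can be sketched as follows. The interfaces are assumptions standing in for the patent's graph database: `text_blocks` is a list of block records, `entity_index` maps a block id to its entities, and `community_of` maps an entity to its community; only the layer ordering (text blocks, then entities, then communities) comes from the claim.

```python
def three_layer_search(keyword, text_blocks, entity_index, community_of):
    """Search the three structural layers in the order of claim 8.

    The index/graph interfaces are assumed stand-ins, not the
    patent's actual data model.
    """
    # Layer 1: text blocks containing the keyword.
    hit_blocks = [b for b in text_blocks if keyword in b["text"]]
    # Layer 2: entities attached to the matched blocks.
    hit_entities = [e for b in hit_blocks
                    for e in entity_index.get(b["id"], [])]
    # Layer 3: communities of the matched entities.
    hit_communities = sorted({community_of[e] for e in hit_entities
                              if e in community_of})
    return {"blocks": hit_blocks, "entities": hit_entities,
            "communities": hit_communities}

blocks = [{"id": 1, "text": "cross-border payment limit"},
          {"id": 2, "text": "domestic refund policy"}]
result = three_layer_search("payment", blocks,
                            {1: ["PayCo", "BankA"]},
                            {"PayCo": "C1", "BankA": "C1"})
```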
9. The method according to claim 8, wherein sequentially performing embedded vector representation on the search information related to the target keyword in each structural layer, to obtain a first embedded vector representation of the search information related to the target keyword in each structural layer, comprises at least one of:
calculating a sentence-to-sentence similarity between every two first text blocks based on at least two first text blocks related to the target keyword in the structural layer of text blocks, constructing a similarity matrix from the sentence-to-sentence similarities, converting the similarity matrix into a distance matrix, and determining a first embedded vector representation of the first text blocks based on the distance matrix, wherein the search information comprises the at least two first text blocks;
based on a third node related to the target keyword in the graph structure layer representing relationships among entities in the text block, and on the entity relationships of the third node, aggregating the entity relationships of the third node at different model levels using a first network embedding model, and outputting a first embedded vector representation of the entity corresponding to the third node, wherein the third node is related to the first text block, and the search information comprises the entity corresponding to the third node and its entity relationships;
based on the graph structure layer representing relationships among communities of the entities, aggregating, at different model levels using a second network embedding model, the neighbor information edge-connected to the third node and the community information of the third node, and outputting a first embedded vector representation of the community corresponding to the third node, wherein the search information comprises the neighbor information edge-connected to the third node and the community information of the third node.
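The layer-wise neighbour aggregation of claim 9 resembles GraphSAGE-style message passing. The sketch below is an assumed simplification: mean aggregation, concatenation with the node's own vector, and L2 normalisation are illustrative choices, not the patent's actual network embedding models.

```python
def aggregate_step(node_vec, neighbor_vecs):
    """One mean-aggregation layer over a node's edge-connected neighbours.

    A GraphSAGE-style sketch of the aggregation in claim 9; the mean
    aggregator, concatenation, and L2 normalisation are assumptions.
    """
    if not neighbor_vecs:
        return list(node_vec)
    dim = len(node_vec)
    # Mean of the neighbour vectors, dimension by dimension.
    mean = [sum(v[i] for v in neighbor_vecs) / len(neighbor_vecs)
            for i in range(dim)]
    # Concatenate self and neighbourhood, then L2-normalise.
    out = list(node_vec) + mean
    norm = sum(x * x for x in out) ** 0.5
    return [x / norm for x in out]
```

Stacking several such steps lets information from multi-hop neighbours (and, at the community layer, community membership) flow into each node's embedding.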
10. The method of claim 1, wherein performing generation enhancement processing based on the first embedded vector representation to obtain the compliance review result of the information to be reviewed comprises:
performing vector aggregation based on the first embedded vector representations of the search information related to the target keyword in each structural layer, to obtain an aggregate vector;
based on the aggregate vector, generating an inference result for each inference step using a recursive structure, and screening candidate review results of the information to be reviewed using a constrained decoding strategy based on the aggregate vector and the generation probability determined from the target inference results, wherein the target inference results comprise the inference result of the current inference step and the inference results obtained before the current inference step;
and performing compliance scoring on the candidate review results based on the global context information represented by the aggregate vector, and determining the compliance review result of the information to be reviewed based on the scores.
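The final steps of claim 10 — constrained decoding over candidate review results followed by scoring — can be sketched minimally. Everything concrete here is an assumption: `candidates` maps each candidate review result to its generation probability, `allowed` stands in for the constrained decoding vocabulary, and the probability threshold and fallback label are illustrative.

```python
def select_compliance_result(candidates, allowed, threshold=0.2):
    """Constrained-decoding-style filter over candidate review results.

    Sketch of claim 10's final steps; the threshold and the
    "manual-review" fallback label are assumptions.
    """
    # Keep only candidates in the allowed vocabulary that clear the
    # generation-probability threshold.
    viable = {c: p for c, p in candidates.items()
              if c in allowed and p >= threshold}
    if not viable:
        return "manual-review"
    # Highest-probability viable candidate wins the scoring step.
    return max(viable, key=viable.get)
```

Note how the constraint set discards a high-probability but out-of-vocabulary candidate, which is the point of constrained decoding in a compliance setting.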
11. The method of claim 10, wherein the aggregate vector further comprises at least one of:
The embedded vector corresponding to the candidate data set matched with the target keyword is obtained by carrying out vector retrieval on the vector of the target keyword based on a pre-constructed vector library, and the vector library comprises vectors of payment related information;
the embedded vector corresponding to the query sub-problem of the information to be checked is obtained by utilizing a large model to conduct query decomposition on the information to be checked based on the target keyword.
12. A payment compliance review device, the device comprising:
The acquisition module is used for acquiring information to be checked in a business scene of payment compliance checking;
The information extraction module is used for extracting information based on the information to be checked to obtain information to be processed, wherein the information to be processed comprises target keywords of the information to be checked;
The retrieval module is used for retrieving the target keyword based on a pre-constructed graph database with three layers of structures, and obtaining a first embedded vector representation of retrieval information related to the target keyword in each structure layer, wherein the three layers of structures are respectively a structure layer of a text block in payment related information, a graph structure layer representing the relationship among entities in the text block and a graph structure layer representing the relationship among communities of the entities;
and the generation enhancement module is configured to perform generation enhancement processing based on the first embedded vector representation, to obtain the compliance review result of the information to be reviewed.
13. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the payment compliance checking method of any one of claims 1 to 11.
14. A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the payment compliance checking method of any one of claims 1 to 11.
15. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the payment compliance checking method of any one of claims 1 to 11.
CN202510992359.XA 2025-07-18 2025-07-18 Payment compliance checking method and device and electronic equipment Pending CN120875891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510992359.XA CN120875891A (en) 2025-07-18 2025-07-18 Payment compliance checking method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN120875891A true CN120875891A (en) 2025-10-31

Family

ID=97451634

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination