CN119202126A - A message content extraction method, device, computer equipment and storage medium - Google Patents

A message content extraction method, device, computer equipment and storage medium

Info

Publication number
CN119202126A
Authority
CN
China
Prior art keywords
message
vector
domain
classification
reviewed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411081721.XA
Other languages
Chinese (zh)
Inventor
王多多
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202411081721.XA
Publication of CN119202126A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a message content extraction method, device, computer equipment and storage medium, and relates to the field of big data technology. First, the message to be reviewed is converted into a keyword vector through keyword extraction and vector conversion. Subsequently, the keyword vector is classified by domain using a pre-trained domain classification model to determine the business domain to which the message belongs. According to the domain classification result, a matching target business domain knowledge graph is selected, and its vector representation is generated. Then, in combination with the vector representation, a preset long short-term memory network model is used to semantically encode the message to generate a semantic representation vector. Finally, based on the semantic representation vector, the key entities, attributes and relationships in the message are accurately identified through a conditional random field model. The present application also relates to the field of blockchain technology, and the messages to be reviewed are stored on blockchain nodes. The present application improves the accuracy and efficiency of message content extraction, and can automatically identify key entities, attributes and relationships.

Description

Message content extraction method and device, computer equipment and storage medium
Technical Field
The application belongs to the technical field of big data, and particularly relates to a message content extraction method, a device, computer equipment and a storage medium.
Background
A touch audit (customer-reach review) is the series of review steps performed before specific information (such as promotional information, notifications, and activity details) is sent to target customers through channels such as SMS, WeChat, and in-app messages. Its purpose is to ensure that the content being sent is compliant, accurate, and error-free, and that it reaches the target customer in a proper form at the right moment, so that customers are not disturbed, lost, or adversely affected by mistaken, misdirected, or improper transmissions.
When facing a large volume of customer-reach (touch) messages, content extraction technology can quickly screen out the information points that need attention and accurately extract key content such as times, places, and offer details from large amounts of text, helping reviewers locate issues quickly and improving review efficiency. Current content extraction techniques rely primarily on manually written rules or templates, which are typically defined from features such as text structure, keywords, and sentence patterns and are used to identify and extract content that meets certain criteria. Rule-based extraction must be designed for a specific knowledge domain or text format, so the rules port poorly and are difficult to apply directly to text in other domains or formats. Writing comprehensive and accurate rules requires deep domain knowledge and a large time investment, and the rules may still fail to cover all language phenomena, leading to missed or false extractions. As text formats and content keep changing, manpower and material resources must be spent continuously updating and maintaining the existing rules to keep the extraction results accurate and reliable.
Disclosure of Invention
The embodiment of the application aims to provide a message content extraction method, a device, a computer device and a storage medium, and aims to provide a scheme for extracting message content by combining domain classification and business domain knowledge graph, so that the accuracy and efficiency of message content extraction are improved, and key entities, attributes and relations can be automatically identified.
In order to solve the above technical problems, the embodiment of the present application provides a message content extraction method, which adopts the following technical scheme:
a message content extraction method, comprising:
Extracting keywords from the message to be audited, and carrying out vector conversion on the extracted keywords to obtain keyword vectors;
Based on the keyword vector and the pre-trained domain classification model, carrying out domain classification on the message to be audited to obtain a domain classification result;
Determining a target business domain knowledge graph matched with the message to be checked according to the domain classification result, and generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector;
combining the target knowledge graph vector, performing semantic coding on the message to be audited by using a preset long short-term memory (LSTM) network model, and generating a semantic representation vector;
Based on the semantic representation vector, the key entities, entity attributes and entity relationships in the message to be audited are identified by utilizing the conditional random field model.
Further, the domain classification model is configured with a plurality of classification labels of different domains, and based on the keyword vector and the pre-trained domain classification model, the domain classification is performed on the message to be audited to obtain a domain classification result, which comprises the following steps:
calculating the similarity between the keyword vector and the classification label in the domain classification model to obtain a first similarity, and taking the first similarity value as the domain classification confidence coefficient of the message to be audited;
sorting the domain classification confidence coefficient of the message to be audited in a descending order to obtain a first classification confidence coefficient sequence;
And determining the service domain to which the message to be audited belongs according to the first classification confidence sequence, and obtaining a domain classification result.
Further, determining the service domain to which the message to be audited belongs according to the classification confidence sequence to obtain a domain classification result, including:
obtaining maximum classification confidence from the classification confidence sequence, wherein the maximum classification confidence is the maximum value in the classification confidence sequence;
comparing the maximum classification confidence with a preset confidence threshold;
And when the maximum classification confidence coefficient is greater than or equal to the confidence coefficient threshold value, acquiring a classification label corresponding to the maximum classification confidence coefficient, determining the service field represented by the classification label as the service field to which the message to be checked belongs, and obtaining a field classification result.
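The similarity, sorting, and threshold steps above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: cosine similarity, the label vectors, the 0.6 threshold, and returning `None` when the threshold is not met are all assumptions, since the text does not fix the similarity measure or the fallback behaviour.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_domain(keyword_vec, label_vecs, threshold=0.6):
    """Return (label, confidence), or (None, confidence) if below the threshold."""
    # The similarity to each classification label doubles as its confidence.
    scored = [(cosine(keyword_vec, vec), label) for label, vec in label_vecs.items()]
    # Descending sort yields the classification confidence sequence.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_conf, best_label = scored[0]
    if best_conf >= threshold:
        return best_label, best_conf
    return None, best_conf  # assumed fallback: no domain is assigned

# Hypothetical label vectors for three business domains.
labels = {"insurance": [1.0, 0.0, 0.0], "loans": [0.0, 1.0, 0.0], "cards": [0.0, 0.0, 1.0]}
result = classify_domain([0.9, 0.1, 0.0], labels)
```

Here `result[0]` is `"insurance"`, because the keyword vector points almost entirely along that label's axis.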
Further, the classification labels configured in the domain classification model include parent domain labels and child domain labels, and calculating the similarity between the keyword vector and the classification labels in the domain classification model and taking the similarity value as the domain classification confidence of the message to be audited includes:
determining a target parent field label matched with the keyword vector;
Acquiring all sub-domain labels under the target parent domain label to obtain a target sub-domain label;
Calculating the similarity between the keyword vector and the target sub-field label to obtain the similarity;
And taking the similarity value as the domain classification confidence of the message to be audited.
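The parent/child label lookup above might be sketched as a two-stage search. How the target parent label is "matched" is not specified in the text, so picking the most similar parent by cosine similarity is an assumption, as are all label names and vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_hierarchical(keyword_vec, parents, children):
    """Return the matched parent label and the confidences of its child labels."""
    # Step 1: pick the parent label closest to the keyword vector (assumed rule).
    target_parent = max(parents, key=lambda p: cosine(keyword_vec, parents[p]))
    # Step 2: only sub-domain labels under that parent remain candidates.
    candidates = children[target_parent]
    # Step 3: similarity to each child label is used as its confidence.
    confidences = {name: cosine(keyword_vec, vec) for name, vec in candidates.items()}
    return target_parent, confidences

# Hypothetical two-level label tree.
parents = {"insurance": [1.0, 0.0, 0.0], "lending": [0.0, 1.0, 0.0]}
children = {
    "insurance": {"life": [1.0, 0.0, 0.5], "auto": [1.0, 0.0, -0.5]},
    "lending": {"mortgage": [0.0, 1.0, 0.5], "consumer": [0.0, 1.0, -0.5]},
}
parent, confidences = classify_hierarchical([0.9, 0.1, 0.3], parents, children)
best_child = max(confidences, key=confidences.get)
```

Restricting the second stage to one parent's children keeps the number of similarity computations small when the label set is large.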
Further, generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector, including:
acquiring an entity associated with the message to be checked from the target business field knowledge graph to obtain an associated entity;
acquiring association relations between association entities to obtain an entity association relation set;
Calculating the structure position of the entity association relation set in the target service domain knowledge graph by utilizing the topology structure information of the target service domain knowledge graph to obtain the topology structure of the entity relation;
semantic representation is carried out on the entity and the topological structure of the entity relationship in the knowledge graph of the target service field, and a target knowledge graph vector is generated.
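One way to picture the four steps above is a toy graph stored as triples. Everything here is an assumption made for illustration: matching entities by substring, using node degree as the "structural position", and pooling entity embeddings with degree weights are stand-ins for whatever the actual system uses.

```python
def target_kg_vector(message, triples, embeddings):
    """Return (associated entities, their relations, pooled graph vector)."""
    dim = len(next(iter(embeddings.values())))
    # Associated entities: graph entities whose name occurs in the message text.
    entities = {e for h, _, t in triples for e in (h, t)}
    associated = sorted(e for e in entities if e in message)
    if not associated:
        return [], [], [0.0] * dim
    # Entity association set: triples linking two associated entities.
    relations = [(h, r, t) for h, r, t in triples if h in associated and t in associated]
    # Structural position, approximated by each entity's degree in the full graph.
    degree = {e: sum(e in (h, t) for h, _, t in triples) for e in associated}
    total = sum(degree.values())
    # Degree-weighted average of the entity embeddings.
    vector = [0.0] * dim
    for e in associated:
        weight = degree[e] / total
        for i, x in enumerate(embeddings[e]):
            vector[i] += weight * x
    return associated, relations, vector

# Hypothetical insurance-product graph and 2-dimensional entity embeddings.
triples = [
    ("car insurance", "has_premium", "premium"),
    ("car insurance", "covers", "collision"),
    ("premium", "paid_by", "policyholder"),
]
embeddings = {
    "car insurance": [1.0, 0.0], "premium": [0.0, 1.0],
    "collision": [1.0, 1.0], "policyholder": [0.5, 0.5],
}
associated, relations, kg_vec = target_kg_vector(
    "Your car insurance premium is due on Friday.", triples, embeddings)
```

Both matched entities have degree 2 in this toy graph, so they contribute equally to `kg_vec`.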
Further, the long short-term memory (LSTM) network model comprises an encoding layer, a long short-term memory unit and a knowledge fusion unit, and performing semantic coding on the message to be audited by using the preset LSTM network model in combination with the target knowledge graph vector to generate the semantic representation vector comprises:
acquiring a message text of the message to be audited, and loading the target knowledge graph vector and the message text into the LSTM network model;
encoding the message text through the encoding layer to obtain a message text vector;
processing the input message text vector by using the long short-term memory unit to obtain a hidden state vector;
and carrying out knowledge fusion on the target knowledge graph vector and the hidden state vector by using a knowledge fusion unit to obtain a semantic representation vector.
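The encoding layer, LSTM unit, and knowledge fusion unit can be sketched in plain Python as below. The LSTM step follows the standard gate equations; the fusion rule (a scalar sigmoid gate mixing the final hidden state with the graph vector), the toy `embed` function, the random weights, and the dimensions are all hypothetical, because the text does not say how fusion is computed.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h, c, params):
    """One LSTM step; params maps gate name -> (weight matrix, bias)."""
    z = x + h  # list concatenation: [x; h]
    pre = {name: [a + b for a, b in zip(matvec(W, z), bias)]
           for name, (W, bias) in params.items()}
    i = [sigmoid(v) for v in pre["i"]]    # input gate
    f = [sigmoid(v) for v in pre["f"]]    # forget gate
    o = [sigmoid(v) for v in pre["o"]]    # output gate
    g = [math.tanh(v) for v in pre["g"]]  # candidate cell state
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new

def fuse(hidden, kg_vec, gate_w):
    """Hypothetical gated fusion of the hidden state and the graph vector."""
    gate = sigmoid(sum(w * v for w, v in zip(gate_w, hidden + kg_vec)))
    return [gate * hv + (1.0 - gate) * kv for hv, kv in zip(hidden, kg_vec)]

def encode_message(tokens, kg_vec, params, gate_w, embed):
    hidden = [0.0] * len(kg_vec)
    cell = [0.0] * len(kg_vec)
    for token in tokens:  # the encoding layer feeds the LSTM unit token by token
        hidden, cell = lstm_step(embed(token), hidden, cell, params)
    return fuse(hidden, kg_vec, gate_w)

D, H = 4, 3  # assumed input and hidden sizes
rng = random.Random(0)
params = {name: ([[rng.uniform(-0.5, 0.5) for _ in range(D + H)] for _ in range(H)],
                 [0.0] * H) for name in "ifog"}
gate_w = [rng.uniform(-0.5, 0.5) for _ in range(2 * H)]
embed = lambda tok: [math.sin(len(tok) * (k + 1)) for k in range(D)]  # toy encoding layer
semantic_vec = encode_message("your premium is due".split(), [0.5, -0.5, 0.25],
                              params, gate_w, embed)
```

In this sketch the graph vector must have the same dimension as the hidden state so the two can be mixed element by element; a real system could instead project one onto the other.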
Further, identifying key entities, entity attributes and entity relationships in the message to be audited using the conditional random field model based on the semantic representation vector, comprising:
performing word segmentation and part-of-speech tagging on the message to be audited to obtain a vocabulary tagging sequence;
extracting characteristics of the vocabulary labeling sequence through a conditional random field model to obtain a plurality of vocabulary characteristic vectors;
Carrying out semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relations based on semantic analysis results.
Further, performing semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relationships based on semantic analysis results, includes:
carrying out semantic analysis on each vocabulary feature vector by adopting a semantic analysis algorithm based on the semantic representation vector to obtain semantic features corresponding to each vocabulary feature vector;
identifying key entities in the message to be checked through semantic features to obtain a key entity list;
based on the key entity list, entity attributes and entity relationships contained by each key entity are identified according to the semantic representation vector.
In order to solve the above technical problems, the embodiment of the present application further provides a message content extraction device, which adopts the following technical scheme:
a message content extraction apparatus comprising:
the keyword processing module is used for extracting keywords of the message to be audited, and carrying out vector conversion on the extracted keywords to obtain keyword vectors;
The field classification module is used for carrying out field classification on the message to be audited based on the keyword vector and the pre-trained field classification model to obtain a field classification result;
The map vectorization module is used for determining a target business field knowledge map matched with the message to be checked according to the field classification result, generating a vector representation of the target business field knowledge map, and obtaining a target knowledge map vector;
The semantic coding module is used for carrying out semantic coding on the message to be audited by combining the target knowledge graph vector and using a preset long short-term memory (LSTM) network model to generate a semantic representation vector;
And the content extraction module is used for identifying key entities, entity attributes and entity relations in the message to be audited by utilizing the conditional random field model based on the semantic representation vector.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the message content extraction method of any one of the preceding claims.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the message content extraction method as claimed in any one of the preceding claims.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
The application discloses a message content extraction method, a message content extraction device, computer equipment and a storage medium, and relates to the technical field of big data. First, the message to be audited is converted into a keyword vector through keyword extraction and vector conversion. Then, domain classification is performed on the keyword vector by using a pre-trained domain classification model to determine the business domain to which the message belongs. A matching target business domain knowledge graph is selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Next, in combination with this vector representation, the message is semantically encoded by a preset long short-term memory (LSTM) network model to generate a semantic representation vector. Finally, based on the semantic representation vector, the key entities, attributes and relations in the message are identified through a conditional random field model. By combining domain classification with a business domain knowledge graph and with the contextual semantic information of the message, the method extracts and structures the message content accurately, improves the accuracy and efficiency of message content extraction, and identifies key entities, attributes and relations automatically.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 illustrates a flow chart of one embodiment of a message content extraction method according to the present application;
FIG. 3 shows a schematic diagram of the structure of an embodiment of a message content extraction device according to the application;
FIG. 4 shows a schematic structural diagram of an embodiment of a computer device according to the application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs, the terms used in the description herein are used for the purpose of describing particular embodiments only and are not intended to limit the application, and the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the above description of the drawings are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop and desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal devices 101, 102, 103, and may be a stand-alone server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
It should be noted that, the method for extracting message content provided in the embodiment of the present application is generally executed by a server, and accordingly, the message content extracting device is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a message content extraction method according to the present application is shown. The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions. The message content extraction method comprises the following steps:
S201, extracting keywords of the message to be audited, and carrying out vector conversion on the extracted keywords to obtain keyword vectors.
Specifically, a keyword extraction technique, such as TF-IDF, TextRank, or a deep learning-based method, is first used to automatically identify the most representative words or phrases in the text as keywords. These keywords summarize the main content or core ideas of the text. Then, each keyword is mapped into a high-dimensional vector space by a vector conversion technique, for example a word embedding model such as Word2Vec or GloVe, or the embedding layer of BERT, to form a keyword vector. These vectors not only retain the semantic information of the words but also reflect, to some extent, the similarity and relationships between them.
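Step S201 can be illustrated with a small TF-IDF ranking followed by an embedding lookup. This is a sketch under assumptions: the smoothed IDF formula, the tiny corpus, the two-dimensional embedding table, and averaging the keyword embeddings into a single keyword vector are all choices made here for illustration, not details given in the application.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, top_k=2):
    """Rank the tokens of one document by TF-IDF against a tokenized corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(term in doc for doc in corpus)           # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0     # smoothed IDF (assumed form)
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

def keyword_vector(keywords, embeddings):
    """Average the embeddings of the keywords (assumed pooling rule)."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[k] for k in keywords if k in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

corpus = [
    ["your", "loan", "payment", "is", "due"],
    ["your", "card", "is", "ready"],
    ["your", "insurance", "premium", "is", "due", "premium"],
]
keywords = tfidf_keywords(corpus[2], corpus)
# Hypothetical embedding table standing in for Word2Vec, GloVe or BERT embeddings.
embeddings = {"premium": [1.0, 0.0], "insurance": [0.0, 1.0]}
vec = keyword_vector(keywords, embeddings)
```

Common words such as "your" and "is" appear in every document, so their IDF is low and they fall out of the top-ranked keywords.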
S202, carrying out field classification on the message to be audited based on the keyword vector and the pre-trained field classification model to obtain a field classification result.
Specifically, the keyword information is first extracted from the message to be audited by the keyword extraction technique and converted into vector form, namely the keyword vector. These vectors are fed as input into a pre-trained domain classification model. The domain classification model processes the keyword vector with a deep learning algorithm (such as a convolutional neural network (CNN), a recurrent neural network (RNN) or its LSTM/GRU variants, or possibly a Transformer architecture) to identify the specific domain or class to which the message belongs. The process uses both the feature representations that the pre-trained model has learned from a large amount of domain-related data and the semantic information contained in the keyword vector, so that domain classification of the message to be audited is automated and the accuracy and efficiency of classification are improved.
The domain classification model is a machine learning model for identifying the domain to which a text belongs, and can automatically classify the text into predefined domain categories, such as news, science and technology, entertainment, and the like, according to the content and characteristics of the text. The model realizes accurate classification of new texts by training and learning text features in different fields.
And S203, determining a target business domain knowledge graph matched with the message to be checked according to the domain classification result, and generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector.
Specifically, the message is first classified by the domain classification model to determine the business domain to which it belongs. Then, based on the classification result, the target business domain knowledge graph matching the message to be audited is selected. The business domain knowledge graph is a collection of structured data that contains comprehensive information about the entities, attributes and relations in the domain. For subsequent processing, the system also needs to generate a vector representation of the target business domain knowledge graph, namely the target knowledge graph vector. Through graph embedding techniques such as TransE and node2vec, the nodes and relations in the knowledge graph are converted into vectors in a high-dimensional space so that a computer can calculate and reason over them efficiently. In this way, the system can make full use of the rich information in the domain knowledge graph.
The business domain knowledge graph (Domain Knowledge Graph, DKG) is a knowledge graph built for a specific domain or industry. It is a graph-based data structure that organizes information as nodes (representing entities) and edges (representing relationships between entities) to form a comprehensive, structured representation of the domain knowledge. For example, an insurance product knowledge graph is built around insurance products and contains the basic information of various types of insurance products (such as product name, type, coverage scope, premium, and insured amount), term explanations, application requirements, the claim settlement process, and so on. As another example, a claim settlement process knowledge graph details the process, standards, and requirements of insurance claim settlement, including links such as claim reporting, investigation, loss assessment, claim review, and payment.
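To make the graph embedding idea concrete, the sketch below shows the TransE scoring function, which treats a relation as a translation so that a plausible triple satisfies head + relation ≈ tail. The entity names and two-dimensional embeddings are invented for illustration; a real system would learn them from the knowledge graph by training.

```python
import math

def transe_score(head, relation, tail):
    """Negative L2 distance ||head + relation - tail||; values near 0 are more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Hypothetical pre-trained embeddings for a small insurance knowledge graph.
entity = {
    "car_insurance": [0.1, 0.2],
    "premium": [0.6, 0.4],
    "claim_payment": [-0.5, 0.9],
}
relation = {"has_attribute": [0.5, 0.2]}

plausible = transe_score(entity["car_insurance"], relation["has_attribute"], entity["premium"])
implausible = transe_score(entity["car_insurance"], relation["has_attribute"], entity["claim_payment"])
```

With these made-up vectors, `plausible` is close to zero while `implausible` is clearly negative, which is the ordering a trained TransE model aims to produce for true versus false triples.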
S204, combining the target knowledge graph vector, performing semantic coding on the message to be audited by using a preset long short-term memory (LSTM) network model, and generating a semantic representation vector.
Specifically, the target knowledge graph vector is combined with a long short-term memory (LSTM) network model to perform deep semantic coding of the message to be audited. First, the knowledge of the target domain is abstracted into vector form, namely the target knowledge graph vector, through knowledge graph technology; these vectors carry rich domain knowledge and context information. The message to be audited is then combined with these vectors as input to the LSTM model. The LSTM model is a special type of recurrent neural network (RNN) that can capture long-distance dependencies in text and process sequence information through its internal gating mechanism. In this process, the LSTM model uses the domain knowledge provided by the target knowledge graph vector to understand and analyze the message to be audited at a deeper semantic level, and finally generates a semantic representation vector that expresses the core semantics of the message, thereby improving the accuracy and efficiency of text semantic coding.
S205, based on the semantic representation vector, identifying key entities, entity attributes and entity relations in the message to be audited by using the conditional random field model.
Specifically, a conditional random field (CRF) model, a sequence labeling technique from natural language processing (NLP), is used together with the semantic representation vector to identify key entities, entity attributes, and entity relationships in the message to be audited. First, the semantic representation vector obtained in the preceding step is input into the CRF model. As a sequence labeling model, the CRF takes the dependencies between labels into account and therefore optimizes the labeling result over the whole sequence rather than token by token. During identification, the CRF model considers not only the semantic information of the current word but also its context, and combines the domain knowledge carried by the semantic representation vector to identify the key entities in the message, their attributes, and the relationships between them, which improves the accuracy and robustness of entity and relationship identification.
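The global optimization over label dependencies that a linear-chain CRF performs at prediction time is usually done with Viterbi decoding, sketched below. The tag set, emission scores, and transition scores are made-up numbers; in a trained model the emissions would be derived from features such as the semantic representation vector and the transitions would be learned.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions: list of {tag: score}, one dict per token.
    transitions: {(previous_tag, current_tag): score}.
    """
    best = {t: emissions[0][t] for t in tags}
    back = []
    for em in emissions[1:]:
        prev_best, best, pointers = best, {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: prev_best[p] + transitions[(p, cur)])
            best[cur] = prev_best[prev] + transitions[(prev, cur)] + em[cur]
            pointers[cur] = prev
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["O", "B-ENT", "I-ENT"]
# Hypothetical transition scores: "I-ENT" directly after "O" is heavily penalized.
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I-ENT")] = -10.0
transitions[("B-ENT", "I-ENT")] = 1.0
# Hypothetical emission scores for the tokens "car", "insurance", "due".
emissions = [
    {"O": 0.2, "B-ENT": 1.0, "I-ENT": 0.9},
    {"O": 0.1, "B-ENT": 0.3, "I-ENT": 1.0},
    {"O": 1.0, "B-ENT": 0.1, "I-ENT": 0.2},
]
path = viterbi(emissions, transitions, tags)
```

For these scores the decoded path is `["B-ENT", "I-ENT", "O"]`, marking "car insurance" as one entity span and "due" as outside any entity.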
In the above embodiment, the application discloses a message content extraction method and relates to the technical field of big data. First, the message to be audited is converted into a keyword vector through keyword extraction and vector conversion. Next, a pre-trained domain classification model classifies the keyword vector by domain to determine the service domain to which the message belongs. A matching target service domain knowledge graph is then selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Combined with this vector representation, a preset long short-term memory network model semantically encodes the message to generate a semantic representation vector. Finally, based on the semantic representation vector, key entities, attributes, and relationships in the message are identified by a conditional random field model. The method combines domain classification with the service domain knowledge graph and with the contextual semantic information of the message content, so that the message content is extracted and structured accurately, the accuracy and efficiency of extraction are improved, and key entities, attributes, and relationships can be identified automatically.
Further, the domain classification model is configured with a plurality of classification labels of different domains, and based on the keyword vector and the pre-trained domain classification model, the domain classification is performed on the message to be audited to obtain a domain classification result, which comprises the following steps:
calculating the similarity between the keyword vector and the classification label in the domain classification model to obtain a first similarity, and taking the first similarity value as the domain classification confidence coefficient of the message to be audited;
sorting the domain classification confidence coefficient of the message to be audited in a descending order to obtain a first classification confidence coefficient sequence;
And determining the service domain to which the message to be audited belongs according to the first classification confidence sequence, and obtaining a domain classification result.
In the above embodiment, first, the similarity (that is, the first similarity) between the keyword vector and each classification label in the pre-trained domain classification model is calculated and assigned to the message to be audited as its domain classification confidence. This makes use of the spatial properties of the vectors and measures the degree of association between the message content and each domain. The confidence values are then sorted in descending order to form the first classification confidence sequence, which shows directly how well the message matches each domain. Finally, the service domain to which the message belongs is determined from this sequence, which ensures that the domain classification is accurate and reasonable.
In this embodiment, the domain of the message to be audited is determined by assigning domain classification confidences to the message, which improves classification accuracy and makes the classification process more interpretable.
In the above embodiment, a hierarchical classification label system, comprising parent domain labels and child domain labels, is introduced into the domain classification process. This not only improves classification accuracy but also makes the classification result better suited to actual service scenarios, because different child domains may differ significantly and these differences directly affect the subsequent knowledge graph matching and semantic encoding.
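The similarity-and-ranking step described above can be sketched as follows. This is a minimal illustration, not the application's implementation: cosine similarity is assumed as the similarity measure, each classification label is assumed to be represented by a vector of the same dimension as the keyword vector, and the label names and numbers are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_domains(keyword_vector, label_vectors):
    """Return (label, confidence) pairs sorted by confidence, highest first."""
    scored = [
        (label, cosine_similarity(keyword_vector, vector))
        for label, vector in label_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical label vectors and keyword vector, for illustration only.
label_vectors = {
    "loans": [1.0, 0.0, 0.0],
    "cards": [0.0, 1.0, 0.0],
    "wealth": [0.0, 0.0, 1.0],
}
ranking = rank_domains([0.9, 0.1, 0.0], label_vectors)
for label, confidence in ranking:
    print(f"{label}: {confidence:.3f}")
```

The first element of `ranking` corresponds to the head of the first classification confidence sequence.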
Further, determining the service domain to which the message to be audited belongs according to the classification confidence sequence to obtain a domain classification result, including:
obtaining maximum classification confidence from the classification confidence sequence, wherein the maximum classification confidence is the maximum value in the classification confidence sequence;
comparing the maximum classification confidence with a preset confidence threshold;
And when the maximum classification confidence coefficient is greater than or equal to the confidence coefficient threshold value, acquiring a classification label corresponding to the maximum classification confidence coefficient, determining the service field represented by the classification label as the service field to which the message to be checked belongs, and obtaining a field classification result.
In the above embodiment, first, the maximum value in the classification confidence sequence is selected as the maximum classification confidence, which ensures that the domain best matching the message to be audited is chosen. Comparing this value with a preset confidence threshold then introduces an explicit criterion for evaluating the reliability of the classification result. When the maximum classification confidence meets the threshold requirement (that is, it is greater than or equal to the confidence threshold), the classification result is considered reliable, and the corresponding classification label is adopted directly as the service domain of the message to be audited, which ensures that the domain classification result is accurate and valid. When the maximum classification confidence is smaller than the confidence threshold, an alarm prompt indicating that the classification confidence is below the threshold is output to inform the operator that the classification label set contains no child label matching the current keyword vector, and that a matching label must be registered so that a parent label and a child label matching the current keyword vector are added to the classification label set.
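The threshold comparison can be illustrated with a short sketch. The threshold value 0.8, the function name `decide_domain`, and the alarm text are assumptions made for the example; the application does not specify them.

```python
def decide_domain(ranking, threshold=0.8):
    """Return (domain_label, alarm_message).

    ranking: list of (label, confidence) pairs in any order.
    Exactly one element of the returned pair is None.
    """
    if not ranking:
        return None, "ALARM: classification label set is empty"
    best_label, best_confidence = max(ranking, key=lambda pair: pair[1])
    if best_confidence >= threshold:
        # Confidence meets the threshold: adopt the label as the service domain.
        return best_label, None
    # Below threshold: no label matches well enough, so prompt for label registration.
    alarm = (
        f"ALARM: maximum confidence {best_confidence:.2f} is below threshold "
        f"{threshold:.2f}; register a matching parent label and child label"
    )
    return None, alarm

# Hypothetical confidence values, for illustration only.
accepted = decide_domain([("loans", 0.93), ("cards", 0.41)])
rejected = decide_domain([("loans", 0.55), ("cards", 0.41)])
print(accepted)
print(rejected)
```

Note that a confidence exactly equal to the threshold is accepted, matching the "greater than or equal to" condition in the text.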
In the embodiment, the confidence level comparison mechanism is set to determine the service field to which the message to be checked belongs, so that the classification accuracy is improved, and the stability and reliability of the classification result are enhanced.
Further, the classification labels configured in the domain classification model include parent domain labels and child domain labels, and calculating the similarity between the keyword vector and the classification labels in the domain classification model and taking the similarity value as the domain classification confidence of the message to be audited includes:
determining a target parent field label matched with the keyword vector;
Acquiring all sub-domain labels under the target parent domain label to obtain a target sub-domain label;
Calculating the similarity between the keyword vector and the target sub-field label to obtain the similarity;
And taking the similarity value as the domain classification confidence of the message to be audited.
In the above embodiment, the parent domain label that best matches the keyword vector is first identified from the predefined classification label system by a matching mechanism. Then, for that parent domain, all child domain labels under it are traversed and obtained, so that the message content can be classified at a finer granularity. Next, a similarity measure (such as cosine similarity or Euclidean distance) is used to quantify the degree of association between the keyword vector and each target child domain label, producing a similarity value for each. Finally, these similarity values are used directly as the confidence of the domain classification of the message to be audited, which helps a decision maker or the system complete the classification automatically and improves classification accuracy and efficiency.
Through these steps, the application judges the domain of the message to be audited with high precision by combining a refined domain classification system (comprising parent and child domain labels) with the similarity between the keyword vector and the classification labels. This preserves the coverage of the classification, improves its accuracy at the finer level, and provides a high-confidence domain classification result for message processing. Layer-by-layer matching and similarity evaluation also raise the level of intelligence and automation of the system.
Further, when the maximum classification confidence calculated by the system does not reach the preset confidence threshold, no child domain label that closely matches the input keyword vector can be found in the current classification label set. In this case, the system outputs an alarm prompt that explicitly informs the operator of the classification failure and indicates that the classification label set lacks a corresponding matching item. This step improves the transparency and user-friendliness of the system and triggers the subsequent label registration. According to the alarm prompt, the operator registers new matching labels, including a parent label and child labels, to expand the classification label set, so that the system can handle similar keyword vectors more accurately in the future, which improves the accuracy and efficiency of the overall classification.
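A two-level lookup of this kind might look like the sketch below. It assumes that both parent and child labels have vector representations and that cosine similarity serves as the matching mechanism at both levels; the taxonomy contents are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_hierarchically(keyword_vector, taxonomy):
    """Pick the best parent label, then score only that parent's child labels.

    taxonomy: {parent: {"vector": [...], "children": {child: [...]}}}
    Returns (parent_label, [(child_label, confidence), ...]) sorted descending.
    """
    parent = max(
        taxonomy,
        key=lambda name: cosine_similarity(keyword_vector, taxonomy[name]["vector"]),
    )
    children = taxonomy[parent]["children"]
    scored = [
        (child, cosine_similarity(keyword_vector, vector))
        for child, vector in children.items()
    ]
    return parent, sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical two-level label system, for illustration only.
taxonomy = {
    "retail": {
        "vector": [1.0, 0.0],
        "children": {"loans": [0.9, 0.1], "cards": [0.6, 0.4]},
    },
    "corporate": {
        "vector": [0.0, 1.0],
        "children": {"trade_finance": [0.1, 0.9]},
    },
}
parent, child_ranking = classify_hierarchically([0.8, 0.2], taxonomy)
print(parent, child_ranking)
```

Restricting the second pass to one parent's children keeps the number of similarity computations small when the label system is large.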
Further, generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector, including:
acquiring an entity associated with the message to be checked from the target business field knowledge graph to obtain an associated entity;
acquiring association relations between association entities to obtain an entity association relation set;
Calculating the structure position of the entity association relation set in the target service domain knowledge graph by utilizing the topology structure information of the target service domain knowledge graph to obtain the topology structure of the entity relation;
semantic representation is carried out on the entity and the topological structure of the entity relationship in the knowledge graph of the target service field, and a target knowledge graph vector is generated.
In the above embodiment, first, the entities directly related to the message to be audited are extracted from the target service domain knowledge graph; these entities are the key to understanding and analyzing the content of the message. Then, the relationships between these associated entities are identified to form an entity association relation set, which reveals the business logic and context behind the message. Next, the topology information of the knowledge graph is used to calculate the positions of these entity relationships in the graph, that is, the topological structure of the entity relationships. Locating the relationships in the graph confirms their direct relevance and incorporates the overall structural characteristics of the knowledge graph, providing context information for the subsequent semantic representation. Finally, the entities and the topological structure of their relationships are semantically represented to generate the target knowledge graph vector. This representation captures the underlying logic and context of the entities and relationships in the knowledge graph, so that a machine learning model can understand the meaning and patterns in the data more accurately, which improves performance on tasks such as classification and reasoning.
The topological structure of entity relationship refers to a space or a logic structure formed by interconnecting entities through relationship in a knowledge graph or a network graph. Such a structure does not take into account the specific size, shape of the entities and relationships, but focuses on the manner of connection and relative location between them. Through the topological structure, the interaction and the dependency relationship between the entities can be clearly displayed, and powerful support is provided for tasks such as knowledge reasoning, data analysis and the like. The topology of the entity relationships is an important consideration when constructing knowledge maps or performing network analysis.
Calculating the structural position of the entity association relation set in the target service domain knowledge graph generally involves analyzing the connection paths and distances between the entities in the graph and the levels they occupy in the graph hierarchy. First, the entities associated with the message to be audited and the direct and indirect relationships between them are identified. Second, the shortest path or distance between the entities is calculated with a graph algorithm (such as Dijkstra's algorithm). Finally, the specific positions of the entities in the structure are determined according to their hierarchy or parent-child relationships in the graph. This process helps in understanding the relative importance of the entities and the interactions between them, and provides a basis for the subsequent semantic representation.
In this embodiment, the target knowledge graph vector is generated by identifying the topological structure of the entity relationships and semantically representing the entities together with that topological structure. The target knowledge graph vector captures the key entities and relationships related to the message and also incorporates the deep structural information of the knowledge graph, providing a semantically rich feature input for a subsequent machine learning model or algorithm and thereby improving the accuracy and efficiency of tasks such as domain classification and information retrieval.
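The shortest-path part of this calculation can be sketched with Dijkstra's algorithm, which the text names as one option. The graph below, its entity names, and its edge weights are hypothetical; how the application weights relationships is not stated.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distance from source to every reachable node.

    graph: {node: {neighbor: non_negative_weight}}
    """
    distances = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > distances.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distances

def pairwise_distances(graph, entities):
    """Shortest distances between every ordered pair of associated entities."""
    result = {}
    for src in entities:
        distances = dijkstra(graph, src)
        for dst in entities:
            if src != dst:
                result[(src, dst)] = distances.get(dst, float("inf"))
    return result

# Hypothetical fragment of a service domain knowledge graph.
graph = {
    "customer": {"account": 1.0, "branch": 2.0},
    "account": {"customer": 1.0, "card": 1.0},
    "card": {"account": 1.0, "branch": 5.0},
    "branch": {"customer": 2.0, "card": 5.0},
}
distances = pairwise_distances(graph, ["customer", "card", "branch"])
print(distances[("customer", "card")], distances[("branch", "card")])
```

Unreachable pairs are reported as infinity, which a later step could treat as "no structural relationship".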
Further, the long short-term memory network model comprises an encoding layer, a long short-term memory unit, and a knowledge fusion unit, and semantically encoding the message to be audited by using the preset long short-term memory network model in combination with the target knowledge graph vector to generate the semantic representation vector includes:
acquiring a message text of the message to be audited, and loading the target knowledge graph vector and the message text into the long short-term memory network model;
encoding the message text through an encoding layer to obtain a message text vector;
processing the input message text vector by using a long-short-time memory unit to obtain a hidden state vector;
and carrying out knowledge fusion on the target knowledge graph vector and the hidden state vector by using a knowledge fusion unit to obtain a semantic representation vector.
In this embodiment, combining the long short-term memory (LSTM) network model with the target service domain knowledge graph improves the ability to understand and represent the semantics of the message to be audited. First, the message text and the target knowledge graph vector are input into the LSTM model together, which achieves a preliminary integration of text information and domain knowledge. At the encoding layer, the message text is converted into a vector representation. The long short-term memory unit (LSTM unit) then processes the message text vector with its gating mechanism, extracting richer context-dependent information and generating a hidden state vector. Finally, the knowledge fusion unit fuses the target knowledge graph vector with the hidden state vector. This process retains the semantic information of the text and also incorporates the deep structure and logical relationships of the domain knowledge graph, so that the generated semantic representation vector is more comprehensive and accurate.
The long short-term memory (LSTM) network model is a special kind of recurrent neural network (RNN) designed to address the vanishing or exploding gradient problem that traditional RNNs encounter when processing long sequences. By introducing three gates, namely a forget gate, an input gate, and an output gate, the LSTM allows the network to capture long-term dependencies while selectively forgetting or retaining information. This structure makes the LSTM well suited to time series data, text data, and similar inputs, and enables it to capture more complex sequence features.
The hidden state vector is a key concept in recurrent neural networks (RNNs) and their variants (e.g., LSTM, GRU). It represents the internal information, or memory, that the network has accumulated from the input sequence up to the current time step. The hidden state vector is passed along the recurrent connections of the network so that the network can capture temporal dependencies and context information in the sequence. In generation tasks, the hidden state vector can also be used to produce the output of the next time step. In short, the hidden state vector is the data structure that RNNs and their variants use to store and pass on internal information when processing sequence data.
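To make the three gates concrete, the following sketch implements a single step of a standard LSTM cell in plain Python. It shows the textbook formulation only; the parameter layout, the tiny dimensions, and the weight values are assumptions for illustration and are not taken from the application.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each of "forget", "input", "output", "candidate" to (W, b),
    where W has one row per hidden unit and each row has len(x) + len(h_prev) entries.
    Returns the new hidden state h and cell state c.
    """
    z = x + h_prev  # concatenate input and previous hidden state

    def affine(name):
        weights, biases = params[name]
        return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, biases)]

    f = [sigmoid(v) for v in affine("forget")]       # how much old memory to keep
    i = [sigmoid(v) for v in affine("input")]        # how much new content to write
    o = [sigmoid(v) for v in affine("output")]       # how much memory to expose
    g = [math.tanh(v) for v in affine("candidate")]  # proposed new content
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def encode_sequence(inputs, params, hidden_size):
    """Run the cell over a sequence and return the final hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h

# Hypothetical small weights: 2 input features, 2 hidden units.
GATES = ("forget", "input", "output", "candidate")
params = {
    name: ([[0.1, -0.1, 0.05, 0.0], [0.0, 0.1, 0.0, -0.05]], [0.0, 0.0])
    for name in GATES
}
final_hidden = encode_sequence([[1.0, 0.5], [0.2, -0.3], [0.0, 1.0]], params, hidden_size=2)
print(final_hidden)
```

In practice a deep learning framework would provide the cell and learn the weights; the sketch only exposes the arithmetic behind the forget, input, and output gates.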
In the above embodiment, knowledge fusion is performed on the target knowledge graph vector and the hidden state vector to help combine the domain knowledge with the semantic information of the text itself. The hidden state vector captures context dependence and internal information of the text sequence, the target knowledge graph vector represents structured knowledge in the field, and a more comprehensive and accurate semantic representation vector can be generated by fusing the two vectors, and the vector not only contains specific meaning of the text, but also integrates constraint and context of the field knowledge, thereby improving accuracy and efficiency of subsequent tasks (such as classification, reasoning and the like).
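The application does not state how the knowledge fusion unit combines the two vectors. One common choice, shown here purely as an assumption, is a learned gate that mixes the hidden state vector and the knowledge graph vector element by element. The function name `gated_fusion` and all weights are hypothetical, and both vectors are assumed to have the same dimension.

```python
import math

def gated_fusion(hidden, knowledge, gate_weights, gate_bias):
    """Element-wise gated mix of a hidden state vector and a knowledge graph vector.

    gate = sigmoid(W [hidden; knowledge] + b)
    fused = gate * hidden + (1 - gate) * knowledge
    """
    if len(hidden) != len(knowledge):
        raise ValueError("hidden and knowledge vectors must have the same length")
    joint = hidden + knowledge  # concatenation
    fused = []
    for row, bias, h, k in zip(gate_weights, gate_bias, hidden, knowledge):
        pre_activation = sum(w * v for w, v in zip(row, joint)) + bias
        gate = 1.0 / (1.0 + math.exp(-pre_activation))
        fused.append(gate * h + (1.0 - gate) * k)
    return fused

# With zero weights the gate is 0.5, so the result is the element-wise average.
zero_weights = [[0.0] * 4, [0.0] * 4]
semantic_vector = gated_fusion([1.0, 0.0], [0.0, 1.0], zero_weights, [0.0, 0.0])
print(semantic_vector)
```

Simple concatenation of the two vectors would be another reasonable reading of "knowledge fusion"; the gated form is chosen here only because it keeps the output dimension unchanged.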
Further, identifying key entities, entity attributes and entity relationships in the message to be audited using the conditional random field model based on the semantic representation vector, comprising:
performing word segmentation and part-of-speech tagging on the message to be audited to obtain a vocabulary tagging sequence;
extracting characteristics of the vocabulary labeling sequence through a conditional random field model to obtain a plurality of vocabulary characteristic vectors;
and carrying out semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relations based on semantic analysis results.
In the embodiment, the accurate identification of key entities, entity attributes and relations in the message to be audited is realized by combining the semantic representation vector and the Conditional Random Field (CRF) model. Firstly, the message text is converted into a word labeling sequence which is easy to process through word segmentation and part-of-speech labeling technologies. Then, using the sequence modeling capability of the CRF model, extracting rich vocabulary feature vectors from the vocabulary labeling sequence, wherein the feature vectors contain context information and grammar rules between vocabularies. Further, based on the semantic representation vectors generated previously, deep semantic analysis is performed on each vocabulary feature vector, and the key entities, the attributes thereof and complex relationships between the key entities and the attributes thereof can be identified more accurately by fusing semantic information and domain knowledge of the text.
The conditional random field (CRF) model is a discriminative probabilistic undirected graphical model used to model and infer the conditional probability distribution of sequence data. It is particularly suited to labeling or parsing sequence data such as natural language text, and improves labeling accuracy by taking the dependencies between elements of the sequence into account. The CRF relaxes the independence assumptions of the hidden Markov model and can better capture the context of the sequence data and the dependencies between labels, so it is widely applied in fields such as natural language processing and computer vision.
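How a linear-chain CRF uses label dependencies at prediction time can be illustrated with Viterbi decoding, the standard way to find the highest-scoring label sequence. The tag set, emission scores, and transition scores below are hypothetical; in a trained model they would be learned from data.

```python
def viterbi(emissions, transitions, tags):
    """Highest-scoring tag sequence for a linear-chain model.

    emissions: list of {tag: score}, one dict per token.
    transitions: {(previous_tag, current_tag): score}; missing pairs score 0.0.
    """
    if not emissions:
        return []
    best = [{tag: emissions[0][tag] for tag in tags}]
    backpointers = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emission[cur]
            pointers[cur] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

tags = ["O", "B-ORG", "I-ORG"]
# Transition scores encode label dependencies: I-ORG should follow B-ORG or I-ORG, not O.
transitions = {
    ("B-ORG", "I-ORG"): 1.0,
    ("I-ORG", "I-ORG"): 1.0,
    ("O", "I-ORG"): -10.0,
}
tokens = ["Ping", "An", "Bank", "approved"]
emissions = [
    {"O": 0.1, "B-ORG": 2.0, "I-ORG": 0.5},
    {"O": 0.5, "B-ORG": 0.3, "I-ORG": 1.5},
    {"O": 1.0, "B-ORG": 0.2, "I-ORG": 0.9},
    {"O": 2.0, "B-ORG": 0.1, "I-ORG": 0.2},
]
best_path = viterbi(emissions, transitions, tags)
print(list(zip(tokens, best_path)))
```

For the third token the emission score alone favors "O", but the transition score from the preceding "I-ORG" tips the decision, which is the kind of global optimization over labels that the text attributes to the CRF.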
In the embodiment, the accuracy and efficiency of entity relation extraction are obviously improved by a method of fusing semantic and statistical models.
Performing semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relationships based on semantic analysis results, including:
carrying out semantic analysis on each vocabulary feature vector by adopting a semantic analysis algorithm based on the semantic representation vector to obtain semantic features corresponding to each vocabulary feature vector;
identifying key entities in the message to be checked through semantic features to obtain a key entity list;
based on the key entity list, entity attributes and entity relationships contained by each key entity are identified according to the semantic representation vector.
In the above embodiment, the fine extraction of key entities, entity attributes and relationships in the message is realized by deep fusion of semantic representation vectors and semantic analysis algorithms. Firstly, a semantic analysis algorithm is utilized to read the vocabulary feature vectors, and semantic features of each vocabulary are extracted. Then, based on these rich semantic features, key entities in the message are accurately identified, and a key entity list is constructed. Finally, by further analyzing the association of the semantic representation vector with the list of key entities, not only is the specific attribute of each key entity identified, but also the intricate entity relationship between the key entities is revealed.
Identifying the key entities in the message to be audited through semantic features relies mainly on the semantic analysis algorithm's analysis of the vocabulary feature vectors. The semantic features capture the deeper meaning and context of each word, so that the model can distinguish which words are significant in a given context. The semantic analysis algorithm analyzes these feature vectors and matches them against predefined entity types or patterns to identify key entities in the message, such as person names, place names, and organization names. This process combines semantic understanding with pattern recognition and improves the accuracy and efficiency of entity identification.
In this embodiment, the electronic device (e.g., the server shown in fig. 1) on which the message content extraction method runs may receive instructions or acquire data through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a Wi-Fi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connections now known or developed in the future.
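Once each word carries a label, building the key entity list is a matter of grouping labeled words into spans. The sketch below assumes a BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity) with the entity types PER, LOC, and ORG; this scheme and the example sentence are illustrative assumptions, since the application does not specify the label format.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag ends the current entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities

tokens = ["Ping", "An", "Bank", "approved", "Wang", "in", "Shenzhen"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER", "O", "B-LOC"]
key_entities = extract_entities(tokens, tags)
print(key_entities)
```

Tokens are joined with spaces here for readability; for Chinese text the segmented words would normally be concatenated without a separator.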
It is emphasized that, to further ensure the privacy and security of the message to be checked, the message to be checked may also be stored in a node of a blockchain.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, each of which contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Those skilled in the art will appreciate that all or part of the processes of the methods of the embodiments described above may be implemented by computer readable instructions stored on a computer readable storage medium, and that the instructions, when executed, may comprise the processes of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or it may be a random access memory (RAM).
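The idea that each block is cryptographically linked to the previous one, so that stored messages cannot be altered unnoticed, can be sketched with a simple hash chain. This illustrates only the linking principle, using SHA-256 from the standard library; it omits consensus, networking, and everything else a real blockchain platform provides, and the field names are hypothetical.

```python
import hashlib
import json

def block_hash(index, payload, prev_hash):
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = {"index": index, "payload": payload, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def append_block(chain, payload):
    """Append a block whose hash depends on the previous block's hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    chain.append({
        "index": index,
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    })
    return chain

def verify_chain(chain):
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block["index"], block["payload"], block["prev_hash"]):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, {"message_id": "m-001", "text": "message to be audited"})
append_block(chain, {"message_id": "m-002", "text": "another message"})
print(verify_chain(chain))
```

Changing the payload of any stored block makes `verify_chain` fail, because the recomputed hash no longer matches the stored one.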
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a message content extraction apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the message content extraction apparatus 300 according to the present embodiment includes:
The keyword processing module 301 is configured to extract keywords from a message to be audited, and perform vector conversion on the extracted keywords to obtain keyword vectors;
The domain classification module 302 is configured to perform domain classification on the message to be audited based on the keyword vector and the pre-trained domain classification model, so as to obtain a domain classification result;
The graph vectorization module 303 is configured to determine a target service domain knowledge graph matched with the message to be audited according to the domain classification result, and generate a vector representation of the target service domain knowledge graph to obtain a target knowledge graph vector;
The semantic coding module 304 is configured to perform semantic coding on the message to be audited by using a preset long short-term memory network model in combination with the target knowledge graph vector, so as to generate a semantic representation vector;
The content extraction module 305 is configured to identify key entities, entity attributes and entity relationships in the message to be audited using the conditional random field model based on the semantic representation vector.
Further, the domain classification model is configured with classification labels of a plurality of different domains, and the domain classification module 302 is specifically configured to:
calculate the similarity between the keyword vector and the classification labels in the domain classification model to obtain a first similarity, and take the first similarity value as the domain classification confidence of the message to be audited; sort the domain classification confidences of the message to be audited in descending order to obtain a first classification confidence sequence; and determine the service domain to which the message to be audited belongs according to the first classification confidence sequence to obtain a domain classification result.
Further, the domain classification module 302 is further configured to:
obtain a first classification confidence from the first classification confidence sequence, wherein the first classification confidence is the maximum value in the first classification confidence sequence; compare the first classification confidence with a preset confidence threshold; and, when the first classification confidence is greater than or equal to the confidence threshold, obtain the classification label corresponding to the first classification confidence and determine the service domain represented by that classification label as the service domain to which the message to be audited belongs, to obtain a domain classification result.
Further, the classification labels configured in the domain classification model include a parent domain label and a child domain label, and the domain classification module 302 is further configured to:
screening sub-domain labels matched with the keyword vectors to obtain first sub-domain labels, and calculating the similarity between the keyword vectors and the first sub-domain labels.
The domain classification module 302 is further configured to:
when the first classification confidence is smaller than the confidence threshold, determine a target parent domain label matched with the keyword vector, and obtain all child domain labels under the target parent domain label to obtain second child domain labels; calculate the similarity between the keyword vector and the second child domain labels to obtain a second similarity, and take the second similarity value as the domain classification confidence of the message to be audited; sort the domain classification confidences of the message to be audited in descending order to obtain a second classification confidence sequence; obtain a second classification confidence from the second classification confidence sequence, wherein the second classification confidence is the maximum value in the second classification confidence sequence; compare the second classification confidence with the preset confidence threshold; and, when the second classification confidence is greater than or equal to the confidence threshold, obtain the classification label corresponding to the second classification confidence and determine the service domain represented by that classification label as the service domain to which the message to be audited belongs, to obtain a domain classification result.
Further, the graph vectorization module 303 is specifically configured to:
obtain the entities associated with the message to be audited from the target service domain knowledge graph to obtain associated entities; obtain the association relations between the associated entities to obtain an entity association relation set; calculate the structural position of the entity association relation set in the target service domain knowledge graph by using the topology information of the target service domain knowledge graph, to obtain the topological structure of the entity relationships; and semantically represent the entities in the target service domain knowledge graph and the topological structure of the entity relationships to generate a target knowledge graph vector.
Further, the long short-term memory (LSTM) network model includes an encoding layer, a long short-term memory unit and a knowledge fusion unit, and the semantic encoding module 304 is specifically configured to:
Obtain the message text of the message to be reviewed, and load the target knowledge graph vector and the message text into the long short-term memory network model; encode the message text through the encoding layer to obtain a message text vector; process the input message text vector with the long short-term memory unit to obtain a hidden state vector; and perform knowledge fusion on the target knowledge graph vector and the hidden state vector with the knowledge fusion unit to obtain the semantic representation vector.
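The three units named above (encoding layer, long short-term memory unit, knowledge fusion unit) can be sketched in plain Python. This is an untrained toy under stated assumptions: the encoding layer is replaced by a fixed pseudo-random vector per token, the LSTM weights are random and biases are omitted, and the fusion is a simple scalar-gated mix that requires the knowledge graph vector and the hidden state to have the same dimension. The application does not specify the fusion formula; `TinyLSTM`, `embed` and `fuse` are hypothetical names.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def embed(token, dim):
    """Encoding-layer stand-in: a fixed pseudo-random vector per token."""
    rng = random.Random(token)  # string seeds give the same vector on every run
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

class TinyLSTM:
    """Single-layer LSTM with random, untrained weights and no biases."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim

        def gate_weights():
            return [[rng.uniform(-0.5, 0.5) for _ in range(input_dim + hidden_dim)]
                    for _ in range(hidden_dim)]

        self.w_i, self.w_f = gate_weights(), gate_weights()
        self.w_o, self.w_g = gate_weights(), gate_weights()

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        i = [sigmoid(v) for v in matvec(self.w_i, z)]    # input gate
        f = [sigmoid(v) for v in matvec(self.w_f, z)]    # forget gate
        o = [sigmoid(v) for v in matvec(self.w_o, z)]    # output gate
        g = [math.tanh(v) for v in matvec(self.w_g, z)]  # candidate cell state
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def encode(self, vectors):
        """Run the sequence through the cell and return the last hidden state."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in vectors:
            h, c = self.step(x, h, c)
        return h

def fuse(hidden, knowledge):
    """Assumed fusion rule: a scalar gate mixes hidden state and graph vector."""
    gate = sigmoid(sum(h * k for h, k in zip(hidden, knowledge)))
    return [gate * h + (1.0 - gate) * k for h, k in zip(hidden, knowledge)]

tokens = ["waive", "annual", "fee"]           # hypothetical message text
kg_vec = [0.5, 0.5]                           # hypothetical target knowledge graph vector
lstm = TinyLSTM(input_dim=4, hidden_dim=2)
hidden_state = lstm.encode([embed(t, 4) for t in tokens])
semantic_vector = fuse(hidden_state, kg_vec)
```

A trained model would learn both the gate weights and the fusion parameters; the sketch only fixes the data flow from message text to semantic representation vector.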
Further, the content extraction module 305 is specifically configured to:
Perform word segmentation and part-of-speech tagging on the message to be reviewed to obtain a vocabulary tagging sequence; perform feature extraction on the vocabulary tagging sequence through the conditional random field model to obtain a plurality of vocabulary feature vectors; perform semantic analysis on each vocabulary feature vector based on the semantic representation vector; and determine the key entities, entity attributes and entity relations based on the semantic analysis results.
The content extraction module 305 is further configured to:
Perform semantic analysis on each vocabulary feature vector with a semantic analysis algorithm, based on the semantic representation vector, to obtain the semantic feature corresponding to each vocabulary feature vector; identify the key entities in the message to be reviewed through the semantic features to obtain a key entity list; and, based on the key entity list, identify the entity attributes and entity relations contained in each key entity according to the semantic representation vector.
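As an illustration of the entity-identification step, the sketch below runs Viterbi decoding for a linear-chain conditional random field over BIO tags and collects the tagged spans into a key entity list. It covers only entity recognition, not attribute or relation extraction. The emission and transition scores are hand-set assumptions; in the described scheme they would be derived from the vocabulary feature vectors and the semantic representation vector. The tag set and function names are hypothetical.

```python
TAGS = ["O", "B-ENT", "I-ENT"]

def viterbi(emissions, transitions):
    """Return the highest-scoring tag path for a linear-chain CRF.

    emissions: one dict per token mapping tag -> score.
    transitions: dict mapping (previous_tag, current_tag) -> score.
    """
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    back = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emission[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

def entity_spans(tokens, tags):
    """Collect B-ENT/I-ENT runs into a key entity list."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-ENT":
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I-ENT" and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

# Hand-set scores (assumptions); a trained CRF would learn these.
tokens = ["waive", "annual", "fee", "today"]
emissions = [
    {"O": 2.0, "B-ENT": 0.1, "I-ENT": 0.0},
    {"O": 0.5, "B-ENT": 1.5, "I-ENT": 0.4},
    {"O": 0.6, "B-ENT": 0.5, "I-ENT": 1.2},
    {"O": 1.8, "B-ENT": 0.2, "I-ENT": 0.1},
]
transitions = {(p, c): 0.0 for p in TAGS for c in TAGS}
transitions[("O", "I-ENT")] = -5.0      # an entity cannot start with I-ENT
transitions[("B-ENT", "I-ENT")] = 1.0   # continuing an entity is favoured
transitions[("I-ENT", "I-ENT")] = 0.5

tags = viterbi(emissions, transitions)
key_entities = entity_spans(tokens, tags)
```

The transition scores are what distinguish a conditional random field from per-token classification: the penalty on ("O", "I-ENT") prevents an entity from starting mid-span even if a single token's emission score favours it.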
The above embodiment discloses a message content extraction device and relates to the technical field of big data. First, the message to be reviewed is converted into a keyword vector through keyword extraction and vector conversion. Next, a pre-trained domain classification model performs domain classification on the keyword vector to determine the service domain to which the message belongs. A matching target service domain knowledge graph is selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Combined with this vector representation, a preset long short-term memory network model semantically encodes the message to generate a semantic representation vector. Finally, based on the semantic representation vector, a conditional random field model identifies the key entities, attributes and relations in the message. By combining domain classification and the service domain knowledge graph with the contextual semantic information of the message, the application extracts and structures message content accurately, improves the accuracy and efficiency of message content extraction, and identifies key entities, attributes and relations automatically.
To solve the above technical problems, an embodiment of the application also provides a computer device. Refer to FIG. 4, which is a basic structural block diagram of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It should be noted that only the computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store the operating system and the various application software installed on the computer device 4, such as the computer readable instructions of the message content extraction method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example to execute the computer readable instructions of the message content extraction method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The above embodiment discloses a computer device and relates to the technical field of big data. First, the message to be reviewed is converted into a keyword vector through keyword extraction and vector conversion. Next, a pre-trained domain classification model performs domain classification on the keyword vector to determine the service domain to which the message belongs. A matching target service domain knowledge graph is selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Combined with this vector representation, a preset long short-term memory network model semantically encodes the message to generate a semantic representation vector. Finally, based on the semantic representation vector, a conditional random field model identifies the key entities, attributes and relations in the message. By combining domain classification and the service domain knowledge graph with the contextual semantic information of the message, the application extracts and structures message content accurately, improves the accuracy and efficiency of message content extraction, and identifies key entities, attributes and relations automatically.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of a message content extraction method as described above.
The above embodiment discloses a computer-readable storage medium and relates to the technical field of big data. First, the message to be reviewed is converted into a keyword vector through keyword extraction and vector conversion. Next, a pre-trained domain classification model performs domain classification on the keyword vector to determine the service domain to which the message belongs. A matching target service domain knowledge graph is selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Combined with this vector representation, a preset long short-term memory network model semantically encodes the message to generate a semantic representation vector. Finally, based on the semantic representation vector, a conditional random field model identifies the key entities, attributes and relations in the message. By combining domain classification and the service domain knowledge graph with the contextual semantic information of the message, the application extracts and structures message content accurately, improves the accuracy and efficiency of message content extraction, and identifies key entities, attributes and relations automatically.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
The application is operational with numerous general purpose or special purpose computer system environments or configurations, for example a personal computer, a server computer, a hand-held or portable device, a tablet device, a multiprocessor system, a microprocessor-based system, a set top box, programmable consumer electronics, a network PC, a minicomputer, a mainframe computer, a distributed computing environment that includes any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application, but they do not limit the scope of the patent claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for some of their elements. All equivalent structures made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, are likewise within the scope of the application.

Claims (10)

1. A message content extraction method, characterized by comprising:
performing keyword extraction on a message to be reviewed, and converting the extracted keywords into vectors to obtain a keyword vector;
performing domain classification on the message to be reviewed based on the keyword vector and a pre-trained domain classification model to obtain a domain classification result;
determining a target business domain knowledge graph that matches the message to be reviewed according to the domain classification result, and generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector;
semantically encoding the message to be reviewed with a preset long short-term memory network model, in combination with the target knowledge graph vector, to generate a semantic representation vector;
identifying key entities, entity attributes and entity relationships in the message to be reviewed with a conditional random field model, based on the semantic representation vector.

2. The message content extraction method according to claim 1, characterized in that the domain classification model is configured with classification labels of several different domains, and the performing domain classification on the message to be reviewed based on the keyword vector and the pre-trained domain classification model to obtain the domain classification result comprises:
calculating the similarity between the keyword vector and the classification labels in the domain classification model to obtain a similarity, and taking the similarity value as the domain classification confidence of the message to be reviewed;
sorting the domain classification confidences of the message to be reviewed in descending order to obtain a classification confidence sequence;
determining the business domain to which the message to be reviewed belongs according to the classification confidence sequence to obtain the domain classification result.

3. The message content extraction method according to claim 2, characterized in that the determining the business domain to which the message to be reviewed belongs according to the classification confidence sequence to obtain the domain classification result comprises:
obtaining a maximum classification confidence from the classification confidence sequence, wherein the maximum classification confidence is the maximum value in the classification confidence sequence;
comparing the maximum classification confidence with a preset confidence threshold;
when the maximum classification confidence is greater than or equal to the confidence threshold, obtaining the classification label corresponding to the maximum classification confidence, and determining the business domain represented by the classification label as the business domain to which the message to be reviewed belongs, to obtain the domain classification result.

4. The message content extraction method according to claim 3, characterized in that the classification labels configured in the domain classification model include parent domain labels and sub-domain labels, and the calculating the similarity between the keyword vector and the classification labels in the domain classification model and taking the similarity value as the domain classification confidence of the message to be reviewed comprises:
determining a target parent domain label that matches the keyword vector;
obtaining all sub-domain labels under the target parent domain label to obtain target sub-domain labels;
calculating the similarity between the keyword vector and the target sub-domain labels to obtain the similarity;
taking the similarity value as the domain classification confidence of the message to be reviewed.

5. The message content extraction method according to claim 1, characterized in that the generating a vector representation of the target business domain knowledge graph to obtain the target knowledge graph vector comprises:
obtaining entities associated with the message to be reviewed from the target business domain knowledge graph to obtain associated entities;
obtaining the association relationships between the associated entities to obtain an entity association relationship set;
calculating, using the topological structure information of the target business domain knowledge graph, the structural position of the entity association relationship set in the target business domain knowledge graph to obtain the topological structure of the entity relationships;
semantically representing the entities in the target business domain knowledge graph and the topological structure of the entity relationships to generate the target knowledge graph vector.

6. The message content extraction method according to claim 1, characterized in that the long short-term memory network model includes an encoding layer, a long short-term memory unit and a knowledge fusion unit, and the semantically encoding the message to be reviewed with a preset long short-term memory network model, in combination with the target knowledge graph vector, to generate a semantic representation vector comprises:
obtaining the message text of the message to be reviewed, and loading the target knowledge graph vector and the message text into the long short-term memory network model;
encoding the message text through the encoding layer to obtain a message text vector;
processing the input message text vector with the long short-term memory unit to obtain a hidden state vector;
performing knowledge fusion on the target knowledge graph vector and the hidden state vector with the knowledge fusion unit to obtain the semantic representation vector.

7. The message content extraction method according to claim 1, characterized in that the identifying key entities, entity attributes and entity relationships in the message to be reviewed with a conditional random field model, based on the semantic representation vector, comprises:
performing word segmentation and part-of-speech tagging on the message to be reviewed to obtain a vocabulary tagging sequence;
performing feature extraction on the vocabulary tagging sequence through the conditional random field model to obtain a plurality of vocabulary feature vectors;
performing semantic analysis on each vocabulary feature vector with a semantic analysis algorithm, based on the semantic representation vector, to obtain the semantic feature corresponding to each vocabulary feature vector;
identifying the key entities in the message to be reviewed through the semantic features to obtain a key entity list;
identifying, based on the key entity list, the entity attributes and entity relationships contained in each key entity according to the semantic representation vector.

8. A message content extraction device, characterized by comprising:
a keyword processing module, configured to perform keyword extraction on a message to be reviewed and convert the extracted keywords into vectors to obtain a keyword vector;
a domain classification module, configured to perform domain classification on the message to be reviewed based on the keyword vector and a pre-trained domain classification model to obtain a domain classification result;
a graph vectorization module, configured to determine a target business domain knowledge graph that matches the message to be reviewed according to the domain classification result, and generate a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector;
a semantic encoding module, configured to semantically encode the message to be reviewed with a preset long short-term memory network model, in combination with the target knowledge graph vector, to generate a semantic representation vector;
a content extraction module, configured to identify key entities, entity attributes and entity relationships in the message to be reviewed with a conditional random field model, based on the semantic representation vector.

9. A computer device, characterized by comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the message content extraction method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that computer-readable instructions are stored on the computer-readable storage medium, and the computer-readable instructions, when executed by a processor, implement the steps of the message content extraction method according to any one of claims 1 to 7.
CN202411081721.XA 2024-08-07 2024-08-07 A message content extraction method, device, computer equipment and storage medium Pending CN119202126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411081721.XA CN119202126A (en) 2024-08-07 2024-08-07 A message content extraction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411081721.XA CN119202126A (en) 2024-08-07 2024-08-07 A message content extraction method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN119202126A true CN119202126A (en) 2024-12-27

Family

ID=94040995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411081721.XA Pending CN119202126A (en) 2024-08-07 2024-08-07 A message content extraction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN119202126A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119557406A (en) * 2025-01-22 2025-03-04 火石创造科技有限公司 A method, device, equipment and medium for answering user questions
CN119741102A (en) * 2025-01-07 2025-04-01 中国工商银行股份有限公司 Product recommendation method, product recommendation model training method, device and equipment

Similar Documents

Publication Publication Date Title
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
EP3985578A1 (en) Method and system for automatically training machine learning model
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN112100401B (en) Knowledge graph construction method, device, equipment and storage medium for scientific and technological services
CN113051911B (en) Methods, devices, equipment, media and program products for extracting sensitive words
CN119202126A (en) A message content extraction method, device, computer equipment and storage medium
CN117874234A (en) Semantic-based text classification method, device, computer equipment and storage medium
CN119006144A (en) Business project management method, device, computer equipment and storage medium
CN119005189A (en) User portrait construction method, device, computer equipment and storage medium
CN119577148A (en) A text classification method, device, computer equipment and storage medium
CN115544210B (en) Methods for event extraction model training and event extraction based on continuous learning
CN119047484A (en) Intelligent document auditing method and device, computer equipment and storage medium
CN117033626A (en) Text auditing method, device, equipment and storage medium
CN116502624A (en) Corpus expansion method and device, computer equipment and storage medium
CN119670764B (en) Long text information processing method, device, computer equipment and storage medium
CN115168590A (en) Text feature extraction method, model training method, device, equipment and medium
CN115098687A (en) Alarm checking method and device for scheduling operation of electric power SDH optical transmission system
CN119691610A (en) Text multi-label classification method, device, computer equipment and storage medium
CN119475079A (en) Knowledge framework automatic generation method, device, computer equipment and storage medium
CN119128150A (en) A text clustering method, device, computer equipment and storage medium
CN116166858B (en) Information recommendation method, device, equipment and storage medium based on artificial intelligence
CN119203980A (en) A method, device, computer equipment and storage medium for reviewing letter of credit terms
Yang et al. Network Configuration Entity Extraction Method Based on Transformer with Multi-Head Attention Mechanism.
CN114282023B (en) Fingerprint generation method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination