Disclosure of Invention
The embodiments of the present application aim to provide a message content extraction method, an apparatus, a computer device and a storage medium that extract message content by combining domain classification with a business domain knowledge graph, thereby improving the accuracy and efficiency of message content extraction and allowing key entities, attributes and relations to be identified automatically.
In order to solve the above technical problems, the embodiment of the present application provides a message content extraction method, which adopts the following technical scheme:
a message content extraction method, comprising:
Extracting keywords from the message to be audited, and carrying out vector conversion on the extracted keywords to obtain keyword vectors;
Based on the keyword vector and the pre-trained domain classification model, carrying out domain classification on the message to be audited to obtain a domain classification result;
Determining a target business domain knowledge graph matched with the message to be audited according to the domain classification result, and generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector;
combining the target knowledge graph vector, performing semantic coding on the message to be audited by using a preset long short-term memory (LSTM) network model, and generating a semantic representation vector;
Based on the semantic representation vector, the key entities, entity attributes and entity relationships in the message to be audited are identified by utilizing the conditional random field model.
Further, the domain classification model is configured with classification labels for a plurality of different domains, and performing domain classification on the message to be audited based on the keyword vector and the pre-trained domain classification model to obtain a domain classification result comprises:
calculating the similarity between the keyword vector and the classification label in the domain classification model to obtain a first similarity, and taking the first similarity value as the domain classification confidence coefficient of the message to be audited;
sorting the domain classification confidence coefficient of the message to be audited in a descending order to obtain a first classification confidence coefficient sequence;
And determining the service domain to which the message to be audited belongs according to the first classification confidence sequence, and obtaining a domain classification result.
Further, determining the service domain to which the message to be audited belongs according to the classification confidence sequence to obtain a domain classification result, including:
obtaining maximum classification confidence from the classification confidence sequence, wherein the maximum classification confidence is the maximum value in the classification confidence sequence;
comparing the maximum classification confidence with a preset confidence threshold;
And when the maximum classification confidence is greater than or equal to the confidence threshold, acquiring the classification label corresponding to the maximum classification confidence, determining the service domain represented by the classification label as the service domain to which the message to be audited belongs, and obtaining the domain classification result.
Further, the classification labels configured in the domain classification model include parent domain labels and sub-domain labels, and calculating the similarity between the keyword vector and the classification labels in the domain classification model and taking the similarity value as the domain classification confidence of the message to be audited comprises:
determining a target parent domain label matched with the keyword vector;
Acquiring all sub-domain labels under the target parent domain label to obtain target sub-domain labels;
Calculating the similarity between the keyword vector and each target sub-domain label to obtain a similarity value;
And taking the similarity value as the domain classification confidence of the message to be audited.
Further, generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector, including:
acquiring the entities associated with the message to be audited from the target business domain knowledge graph to obtain associated entities;
acquiring the association relations between the associated entities to obtain an entity association relation set;
Calculating the structural position of the entity association relation set in the target business domain knowledge graph by using the topology information of the target business domain knowledge graph, to obtain the topological structure of the entity relations;
performing semantic representation on the entities in the target business domain knowledge graph and on the topological structure of the entity relations, and generating the target knowledge graph vector.
Further, the long short-term memory (LSTM) network model comprises a coding layer, a long short-term memory unit and a knowledge fusion unit, and combining the target knowledge graph vector and performing semantic coding on the message to be audited by using the preset LSTM network model to generate the semantic representation vector comprises:
acquiring the message text of the message to be audited, and loading the target knowledge graph vector and the message text into the LSTM network model;
encoding the message text through the coding layer to obtain a message text vector;
processing the input message text vector by using the long short-term memory unit to obtain a hidden state vector;
and carrying out knowledge fusion on the target knowledge graph vector and the hidden state vector by using the knowledge fusion unit to obtain the semantic representation vector.
Further, identifying key entities, entity attributes and entity relationships in the message to be audited using the conditional random field model based on the semantic representation vector, comprising:
performing word segmentation and part-of-speech tagging on the message to be audited to obtain a vocabulary tagging sequence;
extracting characteristics of the vocabulary labeling sequence through a conditional random field model to obtain a plurality of vocabulary characteristic vectors;
Carrying out semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relations based on semantic analysis results.
Further, performing semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relationships based on semantic analysis results, comprises:
carrying out semantic analysis on each vocabulary feature vector by adopting a semantic analysis algorithm based on the semantic representation vector to obtain semantic features corresponding to each vocabulary feature vector;
identifying key entities in the message to be audited through the semantic features to obtain a key entity list;
based on the key entity list, entity attributes and entity relationships contained by each key entity are identified according to the semantic representation vector.
In order to solve the above technical problems, the embodiment of the present application further provides a message content extraction device, which adopts the following technical scheme:
a message content extraction apparatus comprising:
the keyword processing module is used for extracting keywords of the message to be audited, and carrying out vector conversion on the extracted keywords to obtain keyword vectors;
The field classification module is used for carrying out field classification on the message to be audited based on the keyword vector and the pre-trained field classification model to obtain a field classification result;
The graph vectorization module is used for determining a target business domain knowledge graph matched with the message to be audited according to the domain classification result, and generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector;
The semantic coding module is used for carrying out semantic coding on the message to be audited by combining the target knowledge graph vector and using a preset long short-term memory (LSTM) network model to generate a semantic representation vector;
And the content extraction module is used for identifying key entities, entity attributes and entity relations in the message to be audited by utilizing the conditional random field model based on the semantic representation vector.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the message content extraction method of any one of the preceding claims.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the message content extraction method as claimed in any one of the preceding claims.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
The application discloses a message content extraction method, a message content extraction device, computer equipment and a storage medium, and relates to the technical field of big data. First, the message to be audited is converted into a keyword vector through keyword extraction and vector conversion. Next, a pre-trained domain classification model performs domain classification on the keyword vector to determine the service domain to which the message belongs. A matched target business domain knowledge graph is then selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Combining this vector representation, a preset long short-term memory (LSTM) network model performs semantic coding on the message and generates a semantic representation vector. Finally, based on the semantic representation vector, a conditional random field model identifies the key entities, attributes and relations in the message. By combining domain classification, the business domain knowledge graph and the contextual semantic information of the message, the method extracts and structures the message content, improves the accuracy and efficiency of message content extraction, and can automatically identify key entities, attributes and relations.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in the description herein are for the purpose of describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the description of the application, the claims and the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second" and the like in the description, the claims or the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal devices 101, 102, 103. It may be a stand-alone server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
It should be noted that, the method for extracting message content provided in the embodiment of the present application is generally executed by a server, and accordingly, the message content extracting device is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow chart of one embodiment of a message content extraction method according to the present application is shown. The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, techniques, and application systems that use a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning. The message content extraction method comprises the following steps:
S201, extracting keywords of the message to be audited, and carrying out vector conversion on the extracted keywords to obtain keyword vectors.
Specifically, a keyword extraction technique, such as TF-IDF, TextRank, or a deep-learning-based method, is first used to automatically identify the most representative words or phrases in the text as keywords. These keywords summarize the main content or core ideas of the text. Each keyword is then mapped into a high-dimensional vector space by a vector conversion technique, for example a word embedding model such as Word2Vec, GloVe, or the embedding layer of BERT, to form a keyword vector. These vectors retain the semantic information of the words and also reflect, to some extent, the similarity and relationships between words.
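The following is a minimal sketch of step S201 under stated assumptions: it uses a smoothed TF-IDF score as the keyword extraction technique and averages per-keyword embeddings into a single keyword vector. The function names, the tiny corpus and the two-dimensional embedding table are hypothetical and chosen only for illustration; the application does not prescribe a particular scoring formula or pooling strategy.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, top_k=3):
    """Rank the tokens of one document by TF-IDF against a small corpus."""
    n_docs = len(corpus_tokens)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus_tokens if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
        scores[term] = (count / len(doc_tokens)) * idf
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


def keyword_vector(keywords, embeddings):
    """Average the embeddings of the keywords found in the lookup table."""
    found = [embeddings[k] for k in keywords if k in embeddings]
    if not found:
        return []
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Hypothetical tokenized messages and embedding table.
corpus = [
    ["claim", "payment", "the"],
    ["policy", "premium", "the"],
    ["the", "claim", "report", "claim"],
]
keywords = tfidf_keywords(corpus[2], corpus, top_k=2)
vector = keyword_vector(keywords, {"claim": [1.0, 0.0], "report": [0.0, 1.0]})
```

In practice the embedding table would come from a trained model such as Word2Vec or GloVe rather than a hand-written dictionary.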
S202, carrying out field classification on the message to be audited based on the keyword vector and the pre-trained field classification model to obtain a field classification result.
Specifically, the keyword information extracted from the message to be audited in the previous step, already converted into vector form (the keyword vector), is fed as input into the pre-trained domain classification model. The domain classification model processes the keyword vector with a deep learning algorithm (such as a convolutional neural network (CNN), a recurrent neural network (RNN) or its variants LSTM/GRU, or a Transformer structure) to identify the specific domain or class to which the message belongs. This process draws on the feature representation capability that the pre-trained model has learned from a large amount of domain-related data, together with the semantic information contained in the keyword vector, so that the domain classification of the message to be audited is automated and its accuracy and efficiency are improved.
The domain classification model is a machine learning model for identifying the domain to which a text belongs, and can automatically classify the text into predefined domain categories, such as news, science and technology, entertainment, and the like, according to the content and characteristics of the text. The model realizes accurate classification of new texts by training and learning text features in different fields.
S203, determining a target business domain knowledge graph matched with the message to be audited according to the domain classification result, and generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector.
Specifically, the message is first classified by the domain classification model to determine the service domain to which it belongs. Then, based on the classification result, a target business domain knowledge graph matched with the message to be audited is selected. The business domain knowledge graph is a collection of structured data that contains comprehensive information on the entities, attributes and relations in the domain. For subsequent processing, the system also needs to generate a vector representation of the target business domain knowledge graph, i.e., the target knowledge graph vector. Graph embedding techniques, such as TransE and Node2Vec, convert the nodes and relations in the knowledge graph into vectors in a high-dimensional space so that a computer can calculate and reason over them efficiently. In this way, the system can make full use of the rich information in the domain knowledge graph.
The business domain knowledge graph (Domain Knowledge Graph, DKG) is a knowledge graph built for a specific domain or industry. It is a graph-based data structure that organizes information by nodes (representing entities) and edges (representing relationships between entities) to form a comprehensive, structured representation of the domain knowledge. For example, an insurance product knowledge graph is built around insurance products and contains the basic information of various types of insurance products (such as product name, type, coverage scope, premium, and insured amount), term explanations, application requirements, the claim settlement process, and so on. As another example, a claim settlement process knowledge graph details the procedures, standards and requirements of insurance claim settlement, covering links such as claim reporting, investigation, loss assessment, claim review, and payment.
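To make the graph embedding idea concrete, the sketch below scores knowledge graph triples in the style of TransE, where a triple (head, relation, tail) is considered plausible when head + relation lies close to tail in the embedding space. The entity names, relation name and two-dimensional vectors are hypothetical and hand-written; in a real system they would be learned by training, which is not shown here.

```python
import math


def transe_distance(head, relation, tail):
    """L2 distance ||head + relation - tail||; smaller means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical 2-dimensional embeddings for an insurance knowledge graph.
entities = {
    "auto_policy": [0.0, 1.0],
    "premium": [1.0, 1.0],
    "claim_report": [0.0, -1.0],
}
relations = {"has_attribute": [1.0, 0.0]}


def rank_tails(head, relation):
    """Order candidate tail entities by TransE distance, closest first."""
    h, r = entities[head], relations[relation]
    candidates = [name for name in entities if name != head]
    return sorted(candidates, key=lambda name: transe_distance(h, r, entities[name]))


ranking = rank_tails("auto_policy", "has_attribute")
```

Here `ranking` places `premium` ahead of `claim_report`, because the illustrative vectors were chosen so that `auto_policy + has_attribute` coincides with `premium`.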
S204, combining the target knowledge graph vector, performing semantic coding on the message to be audited by using a preset long short-term memory (LSTM) network model, and generating a semantic representation vector.
Specifically, the target knowledge graph vector is combined with a long short-term memory (LSTM) network model to perform deep semantic coding on the message to be audited. First, the knowledge of the target domain is abstracted into vector form, namely the target knowledge graph vector, through knowledge graph techniques; these vectors contain rich domain knowledge and context information. The message to be audited is then combined with these vectors as input to the LSTM model. As a special kind of recurrent neural network (RNN), the LSTM model uses its internal gate mechanism to capture long-distance dependencies in the text and to process sequence information. In this process, the LSTM model uses the domain knowledge provided by the target knowledge graph vector to understand and analyze the message to be audited at a deeper semantic level, and finally generates a semantic representation vector that expresses the core semantics of the message, which improves the accuracy and efficiency of text semantic coding.
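The application does not specify how the knowledge fusion is computed, so the following is only one possible sketch: an element-wise gated fusion in which a sigmoid gate decides, per dimension, how much of the LSTM hidden state and how much of the knowledge graph vector flows into the semantic representation vector. The hidden state is assumed to have been produced already by the LSTM unit, and the scalar weights `w_h`, `w_k` and `bias` are hypothetical stand-ins for learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(hidden, knowledge, w_h=1.0, w_k=1.0, bias=0.0):
    """Element-wise gated fusion of a hidden state and a knowledge graph vector.

    Each output component is gate * h + (1 - gate) * k, so it always lies
    between the corresponding hidden-state and knowledge components.
    """
    if len(hidden) != len(knowledge):
        raise ValueError("vectors must share a dimension")
    fused = []
    for h, k in zip(hidden, knowledge):
        gate = sigmoid(w_h * h + w_k * k + bias)  # value in (0, 1)
        fused.append(gate * h + (1.0 - gate) * k)
    return fused


# Hypothetical hidden state and target knowledge graph vector.
semantic_vector = fuse([2.0, 0.0, -1.0], [0.0, 0.0, 1.0])
```

Other fusion choices, such as concatenation followed by a linear layer or attention over graph entities, would fit the same description equally well.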
S205, based on the semantic representation vector, identifying key entities, entity attributes and entity relations in the message to be audited by using the conditional random field model.
Specifically, a conditional random field (CRF) model, as used in natural language processing (NLP), is combined with the semantic representation vector to identify key entities, entity attributes, and entity relationships in the message to be audited. First, the semantic representation vector is obtained through the above steps. This vector is then input into the conditional random field model, a sequence labeling model that takes the dependencies between labels into account and thereby optimizes the labeling result over the whole sequence. During identification, the CRF model considers not only the semantic information of the current word but also its context and, combined with the domain knowledge in the semantic representation vector, identifies the key entities in the message, the attributes of those entities and the relationships between them, which improves the accuracy and robustness of entity and relationship identification.
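The global optimization performed by a linear-chain CRF at prediction time can be sketched with Viterbi decoding, shown below. The emission scores stand in for per-token label scores that would be derived from the semantic representation vector, and the transition scores encode label dependencies (for example, an inside-entity label should not follow an outside label). All scores, the three-label scheme and the three-token example are hypothetical; training the CRF is not shown.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][y]   : score of label y at position t
    transitions[a][b] : score of moving from label a to label b
    """
    best = [dict(emissions[0])]  # best[t][y]: best score of a path ending in y
    back = []                    # back[t-1][y]: predecessor of y on that path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


labels = ["O", "B-ENT", "I-ENT"]
# Hypothetical scores for a three-token message.
emissions = [
    {"O": 0.1, "B-ENT": 2.0, "I-ENT": 0.5},
    {"O": 0.3, "B-ENT": 0.2, "I-ENT": 1.5},
    {"O": 2.0, "B-ENT": 0.1, "I-ENT": 0.4},
]
transitions = {
    "O": {"O": 0.5, "B-ENT": 0.2, "I-ENT": -10.0},  # O -> I-ENT is penalized
    "B-ENT": {"O": 0.0, "B-ENT": -1.0, "I-ENT": 1.0},
    "I-ENT": {"O": 0.3, "B-ENT": -0.5, "I-ENT": 0.4},
}
tags = viterbi(emissions, transitions, labels)
```

With these scores the decoded sequence marks the first two tokens as one entity and the third as outside any entity.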
In the above embodiment, the application discloses a message content extraction method and relates to the technical field of big data. First, the message to be audited is converted into a keyword vector through keyword extraction and vector conversion. Next, a pre-trained domain classification model performs domain classification on the keyword vector to determine the service domain to which the message belongs. A matched target business domain knowledge graph is then selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Combining this vector representation, a preset long short-term memory (LSTM) network model performs semantic coding on the message and generates a semantic representation vector. Finally, based on the semantic representation vector, a conditional random field model identifies the key entities, attributes and relations in the message. By combining domain classification, the business domain knowledge graph and the contextual semantic information of the message, the method extracts and structures the message content, improves the accuracy and efficiency of message content extraction, and can automatically identify key entities, attributes and relations.
Further, the domain classification model is configured with classification labels for a plurality of different domains, and performing domain classification on the message to be audited based on the keyword vector and the pre-trained domain classification model to obtain a domain classification result comprises:
calculating the similarity between the keyword vector and the classification label in the domain classification model to obtain a first similarity, and taking the first similarity value as the domain classification confidence coefficient of the message to be audited;
sorting the domain classification confidence coefficient of the message to be audited in a descending order to obtain a first classification confidence coefficient sequence;
And determining the service domain to which the message to be audited belongs according to the first classification confidence sequence, and obtaining a domain classification result.
In the above embodiment, the similarity (i.e., the first similarity) between the keyword vector and each classification label in the pre-trained domain classification model is first calculated, and a domain classification confidence is assigned to the message to be audited accordingly. This makes full use of the spatial characteristics of the vectors and measures how closely the message content is associated with each domain. The confidence values are then sorted in descending order to form the first classification confidence sequence, which shows directly how well the message matches each domain. Finally, the service domain to which the message belongs is determined based on this sequence, which helps ensure that the domain classification is accurate and reasonable.
In this embodiment, domain classification is performed by assigning domain classification confidences to the message to be audited, which improves classification accuracy and makes the classification process more interpretable.
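The confidence computation and descending sort described above can be sketched as follows, assuming cosine similarity as the similarity measure and assuming each classification label is represented by a vector of the same dimension as the keyword vector. The label names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_domains(keyword_vec, label_vecs):
    """Return (label, confidence) pairs sorted by confidence, highest first."""
    scored = [(label, cosine(keyword_vec, vec)) for label, vec in label_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical label vectors for three service domains.
label_vecs = {
    "claims": [1.0, 0.0],
    "underwriting": [0.0, 1.0],
    "billing": [1.0, 1.0],
}
confidence_sequence = rank_domains([2.0, 0.0], label_vecs)
```

The first element of `confidence_sequence` is the best-matching label together with its confidence, which is what the subsequent threshold comparison operates on.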
In the above embodiment, a hierarchical classification label system, comprising parent domain labels and sub-domain labels, is introduced into the domain classification process. This not only improves classification accuracy but also makes the classification result better fit actual business scenarios, because different sub-domains may differ significantly, and those differences directly affect the subsequent knowledge graph matching and semantic coding.
Further, determining the service domain to which the message to be audited belongs according to the classification confidence sequence to obtain a domain classification result, including:
obtaining maximum classification confidence from the classification confidence sequence, wherein the maximum classification confidence is the maximum value in the classification confidence sequence;
comparing the maximum classification confidence with a preset confidence threshold;
And when the maximum classification confidence is greater than or equal to the confidence threshold, acquiring the classification label corresponding to the maximum classification confidence, determining the service domain represented by the classification label as the service domain to which the message to be audited belongs, and obtaining the domain classification result.
In the above embodiment, the maximum value in the classification confidence sequence is first selected as the maximum classification confidence, so that the domain that best matches the message to be audited is chosen. Comparing it with a preset confidence threshold then introduces an explicit criterion for evaluating the reliability of the classification result. When the maximum classification confidence meets the threshold requirement (i.e., it is greater than or equal to the confidence threshold), the classification result is considered sufficiently reliable, and the corresponding classification label is adopted directly as the service domain of the message to be audited. When the maximum classification confidence is smaller than the confidence threshold, an alarm prompt indicating that the classification confidence is below the confidence threshold is output to inform a processor that no sub-domain label matching the current keyword vector exists in the classification label set, and that a matching label needs to be registered so that a parent label and a sub-label matching the current keyword vector are added to the classification label set.
In this embodiment, a confidence comparison mechanism is used to determine the service domain to which the message to be audited belongs, which improves classification accuracy and makes the classification result more stable and reliable.
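A minimal sketch of the threshold comparison follows. It assumes the classification confidence sequence is a list of (label, confidence) pairs already sorted in descending order; the default threshold of 0.8, the return convention and the wording of the alert are hypothetical.

```python
def decide_domain(confidence_sequence, threshold=0.8):
    """Pick the top label if it clears the threshold, otherwise return an alert.

    confidence_sequence: (label, confidence) pairs sorted in descending order.
    Returns (label, None) on success or (None, alert_message) otherwise.
    """
    if not confidence_sequence:
        return None, "no classification labels are configured"
    label, confidence = confidence_sequence[0]  # maximum classification confidence
    if confidence >= threshold:
        return label, None
    alert = (
        f"maximum classification confidence {confidence:.2f} is below "
        f"threshold {threshold:.2f}; register a matching label"
    )
    return None, alert


accepted = decide_domain([("claims", 0.91), ("billing", 0.40)])
rejected = decide_domain([("claims", 0.50), ("billing", 0.40)])
```

Returning the alert as a value, rather than raising an exception, keeps the low-confidence case as an ordinary branch that the caller can route to label registration.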
Further, the classification labels configured in the domain classification model include parent domain labels and sub-domain labels, and calculating the similarity between the keyword vector and the classification labels in the domain classification model and taking the similarity value as the domain classification confidence of the message to be audited comprises:
determining a target parent domain label matched with the keyword vector;
Acquiring all sub-domain labels under the target parent domain label to obtain target sub-domain labels;
Calculating the similarity between the keyword vector and each target sub-domain label to obtain a similarity value;
And taking the similarity value as the domain classification confidence of the message to be audited.
In the above embodiment, the parent domain label that best matches the keyword vector is first identified from the predefined classification label system by a matching mechanism. All sub-domain labels under that parent domain are then traversed and obtained, so that the message content can be classified at a finer granularity. Next, a similarity calculation method (such as cosine similarity or Euclidean distance) quantifies the degree of association between the keyword vector and each target sub-domain label and produces a similarity value. Finally, the similarity values are used directly as the confidence index for the domain classification of the message to be audited, helping a decision maker or the system complete the classification of the message automatically and improving classification accuracy and efficiency.
Through the above steps, the application determines the domain of the message to be audited with high precision by combining a refined domain classification system (comprising parent domain labels and sub-domain labels) with similarity calculation between the keyword vector and the classification labels. The layer-by-layer matching and similarity evaluation ensure the coverage of the classification, improve its depth and accuracy, provide a high-confidence domain classification result for message processing, and raise the level of intelligence and automation of the system.
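The parent-then-child matching can be sketched as follows. This is a minimal illustration that assumes every label is represented by a vector of the same dimension as the keyword vector and that cosine similarity is the chosen measure; the names `cosine`, `child_confidences` and the toy labels are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def child_confidences(
    keyword_vec: Sequence[float],
    parent_vecs: Dict[str, Sequence[float]],
    children: Dict[str, Dict[str, Sequence[float]]],
) -> Tuple[str, Dict[str, float]]:
    """Pick the best parent label, then score every sub-label under it."""
    # Step 1: target parent domain label = parent most similar to the keywords.
    target_parent = max(parent_vecs,
                        key=lambda p: cosine(keyword_vec, parent_vecs[p]))
    # Steps 2-4: similarity to each sub-label is used as the confidence.
    scores = {name: cosine(keyword_vec, vec)
              for name, vec in children[target_parent].items()}
    return target_parent, scores


# Toy label vectors, for illustration only.
parents = {"finance": [1.0, 0.0], "health": [0.0, 1.0]}
subs = {
    "finance": {"loan": [1.0, 0.0], "insurance": [1.0, 1.0]},
    "health": {"clinic": [0.0, 1.0]},
}
print(child_confidences([2.0, 0.0], parents, subs))
```

Restricting the second similarity pass to the sub-labels of one parent keeps the number of comparisons small when the label set is large.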
Further, when the maximum classification confidence calculated by the system does not reach the preset confidence threshold, this indicates that no sub-domain label that closely matches the input keyword vector can be found in the current classification label set. To handle this situation, the system outputs an alarm prompt that explicitly informs the handler of the classification failure and indicates that the classification label set lacks a corresponding match. This step improves the transparency and user friendliness of the system and prompts the subsequent label registration work. According to the alarm prompt, the handler registers new matching labels, including a parent label and a sub-label, to expand the classification label set, so that the system can handle similar keyword vectors more accurately in the future, thereby improving the accuracy and efficiency of the overall classification.
Further, generating a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector, including:
acquiring an entity associated with the message to be checked from the target business field knowledge graph to obtain an associated entity;
acquiring association relations between association entities to obtain an entity association relation set;
Calculating the structure position of the entity association relation set in the target service domain knowledge graph by utilizing the topology structure information of the target service domain knowledge graph to obtain the topology structure of the entity relation;
semantic representation is carried out on the entity and the topological structure of the entity relationship in the knowledge graph of the target service field, and a target knowledge graph vector is generated.
In the above embodiment, the entities directly related to the message to be audited are first extracted from the target business domain knowledge graph; these entities are key to understanding and analyzing the content of the message. The relationships between these associated entities are then identified to form an entity association relation set, which reveals the business logic and context behind the message. Next, the topological structure information of the knowledge graph is used to calculate the positions of these entity relations in the graph, namely the topological structure of the entity relations. This confirms how directly the entities are related and brings in the overall structural characteristics of the knowledge graph, which provides rich context information for the subsequent semantic representation. Finally, the entities and the topological structure of their relations are semantically represented to generate the target knowledge graph vector. This representation captures the underlying logic and context of the entities and relations in the knowledge graph, so that a machine learning model can understand the meaning and patterns in the data more accurately, which improves the performance of tasks such as classification and reasoning.
The topological structure of entity relationship refers to a space or a logic structure formed by interconnecting entities through relationship in a knowledge graph or a network graph. Such a structure does not take into account the specific size, shape of the entities and relationships, but focuses on the manner of connection and relative location between them. Through the topological structure, the interaction and the dependency relationship between the entities can be clearly displayed, and powerful support is provided for tasks such as knowledge reasoning, data analysis and the like. The topology of the entity relationships is an important consideration when constructing knowledge maps or performing network analysis.
Calculating the structural position of the entity association relation set in the target business domain knowledge graph generally involves analyzing the connection paths and distances between the entities in the graph, as well as the levels at which they sit in the graph hierarchy. First, the entities associated with the message to be audited and the direct and indirect relations between them are identified. Second, the shortest path or distance between the entities is calculated using a graph algorithm (such as Dijkstra's algorithm). Finally, the specific positions of the entities in the structure are determined according to their hierarchy or parent-child relations in the graph. This process helps in understanding the relative importance of the entities and the interactions between them, and provides a basis for the subsequent semantic representation.
In this embodiment, the target knowledge graph vector is generated by identifying the topological structure of the entity relations and semantically representing the entities together with that topological structure. The target knowledge graph vector captures the key entities and relations related to the message and also incorporates the deep structural information of the knowledge graph. It therefore provides a semantically rich feature input for a subsequent machine learning model or algorithm, which improves the accuracy and efficiency of tasks such as domain classification and information retrieval.
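As a minimal sketch of the distance calculation mentioned above, the following computes shortest-path distances from one associated entity to the others using Dijkstra's algorithm. The adjacency-list format, the unit edge weights and the entity names are assumptions made for illustration; the application does not specify how the knowledge graph is stored.

```python
import heapq
from typing import Dict, List, Tuple

# Adjacency list: entity -> list of (neighboring entity, edge weight).
Graph = Dict[str, List[Tuple[str, float]]]


def shortest_distances(graph: Graph, source: str) -> Dict[str, float]:
    """Dijkstra's algorithm; assumes all edge weights are non-negative."""
    dist: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # Stale heap entry; a shorter path was already found.
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical fragment of a business domain knowledge graph.
kg: Graph = {
    "policy": [("policyholder", 1.0), ("claim", 1.0)],
    "policyholder": [("policy", 1.0), ("account", 1.0)],
    "claim": [("policy", 1.0)],
    "account": [("policyholder", 1.0)],
}
print(shortest_distances(kg, "claim"))
```

The resulting distances are one possible numeric description of where each associated entity sits relative to the others, which could then feed the semantic representation step.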
Further, the long-short-time memory network model comprises an encoding layer, a long-short-time memory unit and a knowledge fusion unit, and performing semantic coding on the message to be audited by using the preset long-short-time memory network model in combination with the target knowledge graph vector to generate the semantic representation vector comprises:
acquiring a message text of the message to be audited, and loading the target knowledge graph vector and the message text into the long-short-time memory network model;
encoding the message text through an encoding layer to obtain a message text vector;
processing the input message text vector by using a long-short-time memory unit to obtain a hidden state vector;
and carrying out knowledge fusion on the target knowledge graph vector and the hidden state vector by using a knowledge fusion unit to obtain a semantic representation vector.
In this embodiment, combining the long-short-time memory network (LSTM) model with the target business domain knowledge graph improves the semantic understanding and representation of the message to be audited. First, the message text and the target knowledge graph vector are input into the LSTM model together, which achieves a preliminary integration of the text information and the domain knowledge. At the encoding layer, the message text is converted into a vectorized representation. The long-short-time memory unit (LSTM unit) then processes the message text vector with its gating mechanism, which extracts richer context-dependent information and generates a hidden state vector. Finally, the knowledge fusion unit fuses the target knowledge graph vector with the hidden state vector. This process retains the semantic information of the text and incorporates the deep structure and logical relations of the domain knowledge graph, so that the generated semantic representation vector is more comprehensive and accurate.
The long-short-time memory network (LSTM) model is a special recurrent neural network (RNN) designed to mitigate the vanishing gradient and exploding gradient problems that a traditional RNN encounters when processing long sequences. By introducing three gates, namely a forget gate, an input gate and an output gate, the LSTM allows the network to capture long-term dependencies while selectively forgetting or retaining information. This structure makes the LSTM well suited to time series data, text data and the like, and enables it to capture more complex sequence features.
The hidden state vector is a key concept in recurrent neural networks (RNNs) and their variants (e.g., LSTM, GRU). It represents the internal information, or memory, that the network has accumulated from the input sequence up to the current time step. The hidden state vector is passed along the recurrent connection of the network, so that the network can capture temporal dependencies and context information in the sequence. In generation tasks, the hidden state vector can also be used to generate the output of the next time step. In short, the hidden state vector is the data structure that RNNs and their variants use to store and pass on internal information when processing sequence data.
In the above embodiment, knowledge fusion is performed on the target knowledge graph vector and the hidden state vector to help combine the domain knowledge with the semantic information of the text itself. The hidden state vector captures context dependence and internal information of the text sequence, the target knowledge graph vector represents structured knowledge in the field, and a more comprehensive and accurate semantic representation vector can be generated by fusing the two vectors, and the vector not only contains specific meaning of the text, but also integrates constraint and context of the field knowledge, thereby improving accuracy and efficiency of subsequent tasks (such as classification, reasoning and the like).
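The encoding and fusion steps can be sketched in plain Python as below. This is only an illustration under stated assumptions: the weights are random and untrained, the token vectors stand in for the output of the encoding layer, and fusion by concatenation is an assumed choice, since the application does not specify how the knowledge fusion unit combines the two vectors. The names `LSTMCell`, `encode` and `fuse` are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Compute W @ x + b with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class LSTMCell:
    """Single LSTM cell with forget (f), input (i), candidate (c), output (o)."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        width = input_size + hidden_size
        self.hidden_size = hidden_size
        # Random, untrained weights: for illustration only.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                      for _ in range(hidden_size)] for g in "fico"}
        self.b = {g: [0.0] * hidden_size for g in "fico"}

    def step(self, x: Vector, h: Vector, c: Vector) -> Tuple[Vector, Vector]:
        z = x + h  # Concatenate the input with the previous hidden state.
        f = [sigmoid(v) for v in affine(self.w["f"], self.b["f"], z)]
        i = [sigmoid(v) for v in affine(self.w["i"], self.b["i"], z)]
        g = [math.tanh(v) for v in affine(self.w["c"], self.b["c"], z)]
        o = [sigmoid(v) for v in affine(self.w["o"], self.b["o"], z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def encode(cell: LSTMCell, token_vectors: List[Vector]) -> Vector:
    """Run the cell over the sequence and return the final hidden state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in token_vectors:
        h, c = cell.step(x, h, c)
    return h


def fuse(hidden: Vector, kg_vector: Vector) -> Vector:
    """Assumed fusion: concatenate hidden state and knowledge graph vector."""
    return hidden + kg_vector


cell = LSTMCell(input_size=3, hidden_size=4)
tokens = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]]  # Stand-in for encoded text.
semantic_vector = fuse(encode(cell, tokens), [0.7, 0.2])
print(len(semantic_vector))
```

Other fusion choices, such as a learned gate or an attention-weighted sum, would fit the same interface; concatenation is used here only because it is the simplest to show.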
Further, identifying key entities, entity attributes and entity relationships in the message to be audited using the conditional random field model based on the semantic representation vector, comprising:
performing word segmentation and part-of-speech tagging on the message to be audited to obtain a vocabulary tagging sequence;
extracting characteristics of the vocabulary labeling sequence through a conditional random field model to obtain a plurality of vocabulary characteristic vectors;
and carrying out semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relations based on semantic analysis results.
In this embodiment, key entities, entity attributes and entity relationships in the message to be audited are identified by combining the semantic representation vector with the conditional random field (CRF) model. First, the message text is converted into an easily processed vocabulary tagging sequence through word segmentation and part-of-speech tagging. Then, the sequence modeling capability of the CRF model is used to extract vocabulary feature vectors from the vocabulary tagging sequence; these feature vectors contain the context information and grammatical regularities between words. Further, semantic analysis is performed on each vocabulary feature vector based on the previously generated semantic representation vector. Because this step fuses the semantic information of the text with the domain knowledge, the key entities, their attributes and the relationships between them can be identified more accurately.
The conditional random field (CRF) model is a discriminative, undirected probabilistic graphical model used to model and infer the conditional probability distribution of sequence data. It is particularly suitable for labeling or analyzing sequence data such as natural language text, and it improves labeling accuracy by taking into account the dependencies between elements of the sequence. The CRF relaxes the independence assumptions of the hidden Markov model and can better capture the dependencies between the context of the sequence data and the labels, so it is widely applied in fields such as natural language processing and computer vision.
In the embodiment, the accuracy and efficiency of entity relation extraction are obviously improved by a method of fusing semantic and statistical models.
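To make the labeling step concrete, the following is a minimal sketch of Viterbi decoding, the standard way to find the highest-scoring tag sequence in a linear-chain CRF. The application does not state which decoding procedure is used, so this is an assumption; the emission and transition scores are hand-written rather than learned, and the tag set and tokens are hypothetical.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence for a linear-chain model.

    emissions[t][tag] is the score of `tag` at position t, and
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    # score[t][tag]: best score of any tag path that ends in `tag` at t.
    score = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        score.append({})
        back.append({})
        for cur in tags:
            best_prev = max(
                tags, key=lambda p: score[t - 1][p] + transitions[p][cur])
            score[t][cur] = (score[t - 1][best_prev]
                             + transitions[best_prev][cur]
                             + emissions[t][cur])
            back[t - 1][cur] = best_prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: score[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B-ENT", "I-ENT"]
# Hand-written scores for the tokens "Acme", "Bank", "approved".
EMISSIONS = [
    {"O": 0.1, "B-ENT": 2.0, "I-ENT": 1.5},
    {"O": 0.5, "B-ENT": 0.2, "I-ENT": 1.0},
    {"O": 2.0, "B-ENT": 0.1, "I-ENT": 0.3},
]
TRANSITIONS = {
    "O": {"O": 0.0, "B-ENT": 0.0, "I-ENT": -10.0},  # O -> I-ENT is penalized.
    "B-ENT": {"O": 0.0, "B-ENT": -1.0, "I-ENT": 1.0},
    "I-ENT": {"O": 0.0, "B-ENT": -1.0, "I-ENT": 0.5},
}
print(viterbi(EMISSIONS, TRANSITIONS, TAGS))  # ['B-ENT', 'I-ENT', 'O']
```

The transition scores are what distinguish a CRF from independent per-token classification: the penalty on `O -> I-ENT` prevents an entity from continuing without a beginning.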
Performing semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determining key entities, entity attributes and entity relationships based on semantic analysis results, including:
carrying out semantic analysis on each vocabulary feature vector by adopting a semantic analysis algorithm based on the semantic representation vector to obtain semantic features corresponding to each vocabulary feature vector;
identifying key entities in the message to be checked through semantic features to obtain a key entity list;
based on the key entity list, entity attributes and entity relationships contained by each key entity are identified according to the semantic representation vector.
In the above embodiment, the fine extraction of key entities, entity attributes and relationships in the message is realized by deep fusion of semantic representation vectors and semantic analysis algorithms. Firstly, a semantic analysis algorithm is utilized to read the vocabulary feature vectors, and semantic features of each vocabulary are extracted. Then, based on these rich semantic features, key entities in the message are accurately identified, and a key entity list is constructed. Finally, by further analyzing the association of the semantic representation vector with the list of key entities, not only is the specific attribute of each key entity identified, but also the intricate entity relationship between the key entities is revealed.
Identifying the key entities in the message to be audited through semantic features mainly depends on the analysis of the vocabulary feature vectors by the semantic analysis algorithm. The semantic features capture the meaning and context of each word, so that the model can distinguish which words carry key meaning in a specific context. The semantic analysis algorithm analyzes the feature vectors and matches them against predefined entity types or patterns to identify key entities in the message, such as person names, place names and organization names. This process combines semantic understanding with pattern recognition and improves the accuracy and efficiency of entity recognition.
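The construction of the key entity list can be illustrated with a short sketch that groups per-token tags into entities. It assumes that the preceding steps produce BIO-style tags (`B-` for the beginning of an entity, `I-` for its continuation, `O` for other tokens); this tagging scheme, the function name `collect_entities` and the sample tokens are assumptions, not details given in the application.

```python
from typing import List, Optional, Tuple


def collect_entities(tokens: List[str],
                     tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into a list of (entity text, entity type)."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Store the entity collected so far, if any.
        if current and current_type is not None:
            entities.append((" ".join(current), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # "O", or an "I-" tag with no matching open entity, ends the span.
            flush()
            current, current_type = [], None
    flush()
    return entities


# Hypothetical tokens and tags, for illustration only.
words = ["Acme", "Bank", "approved", "Zhang", "Wei"]
labels = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER"]
print(collect_entities(words, labels))
# [('Acme Bank', 'ORG'), ('Zhang Wei', 'PER')]
```

The resulting list plays the role of the key entity list, over which entity attributes and entity relationships would then be identified.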
In this embodiment, the electronic device (e.g., the server shown in fig. 1) on which the message content extraction method operates may receive instructions or acquire data through a wired connection or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connections now known or later developed.
It is emphasized that, to further ensure the privacy and security of the message to be checked, the message to be checked may also be stored in a node of a blockchain.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, in which each block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
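The way blocks are linked cryptographically can be shown with a minimal hash-chain sketch. It is a simplification for illustration only: it omits consensus, networking and signatures, and the block layout and function names (`make_block`, `verify_chain`) are hypothetical rather than taken from the application.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # Placeholder "previous hash" for the first block.


def _digest(payload: Any, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps({"payload": payload, "prev_hash": prev_hash},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def make_block(payload: Any, prev_hash: str) -> Dict[str, Any]:
    """Create a block whose hash covers its payload and the previous hash."""
    return {"payload": payload, "prev_hash": prev_hash,
            "hash": _digest(payload, prev_hash)}


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Check every block's hash and its link to the preceding block."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != _digest(block["payload"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True


chain: List[Dict[str, Any]] = []
previous = GENESIS_HASH
for message in ["message to be audited #1", "message to be audited #2"]:
    block = make_block(message, previous)
    chain.append(block)
    previous = block["hash"]
print(verify_chain(chain))  # True
```

Because each hash covers the previous block's hash, altering a stored message changes its block hash and breaks the link to every later block, which is what makes tampering detectable.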
Those skilled in the art will appreciate that all or part of the processes of the methods of the embodiments described above may be implemented by computer readable instructions stored on a computer readable storage medium, which, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk or a read-only memory (Read-Only Memory, ROM), or may be a random access memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a message content extraction apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the message content extraction apparatus 300 according to the present embodiment includes:
The keyword processing module 301 is configured to extract keywords from a message to be audited, and perform vector conversion on the extracted keywords to obtain keyword vectors;
The domain classification module 302 is configured to perform domain classification on the message to be audited based on the keyword vector and the pre-trained domain classification model, so as to obtain a domain classification result;
The graph vectorization module 303 is configured to determine a target business domain knowledge graph matched with the message to be audited according to the domain classification result, and generate a vector representation of the target business domain knowledge graph to obtain a target knowledge graph vector;
the semantic coding module 304 is configured to perform semantic coding on the message to be audited by using a preset long-short-time memory network model in combination with the target knowledge graph vector, so as to generate a semantic representation vector;
The content extraction module 305 is configured to identify key entities, entity attributes and entity relationships in the message to be audited using the conditional random field model based on the semantic representation vector.
Further, the domain classification model is configured with classification labels of a plurality of different domains, and the domain classification module 302 is specifically configured to:
calculate the similarity between the keyword vector and the classification labels in the domain classification model to obtain a first similarity; take the first similarity value as the domain classification confidence of the message to be audited; sort the domain classification confidences of the message to be audited in descending order to obtain a first classification confidence sequence; and determine the business domain to which the message to be audited belongs according to the first classification confidence sequence to obtain the domain classification result.
Further, the domain classification module 302 is further configured to:
obtain a first classification confidence from the first classification confidence sequence, the first classification confidence being the maximum value in the first classification confidence sequence; compare the first classification confidence with a preset confidence threshold; and, when the first classification confidence is greater than or equal to the confidence threshold, obtain the classification label corresponding to the first classification confidence and determine the business domain represented by that classification label as the business domain to which the message to be audited belongs, to obtain the domain classification result.
Further, the classification labels configured in the domain classification model include a parent domain label and a child domain label, and the domain classification module 302 is further configured to:
screen the sub-domain labels matched with the keyword vector to obtain a first sub-domain label, and calculate the similarity between the keyword vector and the first sub-domain label.
The domain classification module 302 is further configured to:
when the first classification confidence is smaller than the confidence threshold, determine a target parent domain label matched with the keyword vector; obtain all sub-domain labels under the target parent domain label to obtain second sub-domain labels; calculate the similarity between the keyword vector and the second sub-domain labels to obtain a second similarity; take the second similarity value as the domain classification confidence of the message to be audited; sort the domain classification confidences of the message to be audited in descending order to obtain a second classification confidence sequence; obtain a second classification confidence from the second classification confidence sequence, the second classification confidence being the maximum value in the second classification confidence sequence; compare the second classification confidence with the preset confidence threshold; and, when the second classification confidence is greater than or equal to the confidence threshold, obtain the classification label corresponding to the second classification confidence and determine the business domain represented by that classification label as the business domain to which the message to be audited belongs, to obtain the domain classification result.
Further, the graph vectorization module 303 is specifically configured to:
acquire the entities associated with the message to be audited from the target business domain knowledge graph to obtain associated entities; acquire the association relations between the associated entities to obtain an entity association relation set; calculate the structural position of the entity association relation set in the target business domain knowledge graph by using the topological structure information of the target business domain knowledge graph to obtain the topological structure of the entity relations; and semantically represent the entities in the target business domain knowledge graph and the topological structure of the entity relations to generate the target knowledge graph vector.
Further, the long-short-time memory network model includes an encoding layer, a long-short-time memory unit and a knowledge fusion unit, and the semantic encoding module 304 is specifically configured to:
acquire a message text of the message to be audited, and load the target knowledge graph vector and the message text into the long-short-time memory network model; encode the message text through the encoding layer to obtain a message text vector; process the input message text vector by using the long-short-time memory unit to obtain a hidden state vector; and perform knowledge fusion on the target knowledge graph vector and the hidden state vector by using the knowledge fusion unit to obtain the semantic representation vector.
Further, the content extraction module 305 is specifically configured to:
perform word segmentation and part-of-speech tagging on the message to be audited to obtain a vocabulary tagging sequence; perform feature extraction on the vocabulary tagging sequence through the conditional random field model to obtain a plurality of vocabulary feature vectors; and perform semantic analysis on each vocabulary feature vector based on the semantic representation vector, and determine the key entities, entity attributes and entity relationships based on the semantic analysis results.
The content extraction module 305 is further configured to:
perform semantic analysis on each vocabulary feature vector by using a semantic analysis algorithm based on the semantic representation vector to obtain the semantic features corresponding to each vocabulary feature vector; identify the key entities in the message to be audited through the semantic features to obtain a key entity list; and, based on the key entity list, identify the entity attributes and entity relationships contained in each key entity according to the semantic representation vector.
In the above embodiment, the application discloses a message content extraction device and relates to the technical field of big data. First, the message to be audited is converted into a keyword vector through keyword extraction and vector conversion. Then, the message is classified by domain using the keyword vector and a pre-trained domain classification model, which determines the business domain to which the message belongs. A matching target business domain knowledge graph is selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Next, in combination with this vector representation, the message is semantically encoded using a preset long-short-time memory network model to generate a semantic representation vector. Finally, based on the semantic representation vector, the key entities, entity attributes and entity relationships in the message are identified through a conditional random field model. By combining domain classification with the business domain knowledge graph and with the contextual semantic information of the message, the application extracts and structures the message content, improves the accuracy and efficiency of message content extraction, and automatically identifies key entities, attributes and relationships.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It should be noted that only the computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or an internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card) or the like provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store an operating system and various application software installed on the computer device 4, such as the computer readable instructions of the message content extraction method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as computer readable instructions for executing the message content extraction method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
In the above embodiment, the application discloses a computer device and relates to the technical field of big data. First, the message to be audited is converted into a keyword vector through keyword extraction and vector conversion. Then, the message is classified by domain using the keyword vector and a pre-trained domain classification model, which determines the business domain to which the message belongs. A matching target business domain knowledge graph is selected according to the domain classification result, and a vector representation of that knowledge graph is generated. Next, in combination with this vector representation, the message is semantically encoded using a preset long-short-time memory network model to generate a semantic representation vector. Finally, based on the semantic representation vector, the key entities, entity attributes and entity relationships in the message are identified through a conditional random field model. By combining domain classification with the business domain knowledge graph and with the contextual semantic information of the message, the application extracts and structures the message content, improves the accuracy and efficiency of message content extraction, and automatically identifies key entities, attributes and relationships.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of a message content extraction method as described above.
In the above embodiment, the application discloses a computer-readable storage medium and relates to the technical field of big data. First, the message to be audited is converted into a keyword vector through keyword extraction and vector conversion. Next, domain classification is performed on the keyword vector using a pre-trained domain classification model to determine the business domain to which the message belongs. A matching target business domain knowledge graph is then selected according to the domain classification result, and a vector representation of the target business domain knowledge graph is generated. Combined with this vector representation, the message is semantically encoded using a preset long short-term memory (LSTM) network model to generate a semantic representation vector. Finally, based on the semantic representation vector, the key entities, entity attributes and entity relationships in the message are identified by a conditional random field model. By combining domain classification, the business domain knowledge graph and the contextual semantic information of the message, the method extracts and structures the message content, improves the accuracy and efficiency of message content extraction, and automatically identifies key entities, attributes and relationships.
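The final step summarized above, identifying entities with a conditional random field model, is commonly carried out by Viterbi decoding over per-token label scores. The following is a minimal sketch under stated assumptions, not the implementation of the embodiments: the emission scores, which in the described method would be derived from the semantic representation vectors, are hard-coded, and the label set, the transition scores and the example tokens are hypothetical.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][j]   score of label j at token position t
    transitions[i][j] score of moving from label i to label j
    """
    n = len(labels)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            # Best previous label for reaching label j at position t.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[i] for i in path]


# Hypothetical BIO label set for entity spans.
LABELS = ["O", "B-ENT", "I-ENT"]

# Assumed transition scores; the large penalty forbids I-ENT directly after O.
TRANSITIONS = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, -1.0, 1.0],   # from B-ENT
    [0.0, -1.0, 0.5],   # from I-ENT
]

# Assumed emission scores, one row per token, one column per label.
tokens = ["approve", "personal", "loan", "today"]
EMISSIONS = [
    [2.0, 0.1, 0.0],
    [0.2, 1.5, 1.4],
    [0.1, 0.5, 1.6],
    [1.8, 0.2, 0.3],
]

tags = viterbi_decode(EMISSIONS, TRANSITIONS, LABELS)
print(list(zip(tokens, tags)))
```

With these assumed scores the decoder labels "personal loan" as a single entity span. The transition scores are what allow a conditional random field to reject inconsistent label sequences that a per-token classifier could produce.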
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) and comprises instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the method according to the embodiments of the present application.
The application is operational with numerous general-purpose or special-purpose computer system environments or configurations, for example personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application, but they do not limit the scope of the patent claims. The application may be embodied in many different forms; the embodiments are provided so that the disclosure will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in those embodiments may be modified, or that equivalents may be substituted for some of their technical features. Any equivalent structure made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.