
CN119312810A - A big data relationship mining method based on graph computing - Google Patents

A big data relationship mining method based on graph computing

Info

Publication number
CN119312810A
CN119312810A
Authority
CN
China
Prior art keywords
data
graph
model
namely
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411304787.0A
Other languages
Chinese (zh)
Other versions
CN119312810B (en)
Inventor
孙伟 (Sun Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jingjing Intelligent Technology Co ltd
Original Assignee
Guangzhou Jingjing Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jingjing Intelligent Technology Co ltd filed Critical Guangzhou Jingjing Intelligent Technology Co ltd
Priority to CN202411304787.0A priority Critical patent/CN119312810B/en
Publication of CN119312810A publication Critical patent/CN119312810A/en
Application granted granted Critical
Publication of CN119312810B publication Critical patent/CN119312810B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a big data relationship mining method based on graph computation, which generates a graph structure by preprocessing multi-source heterogeneous data, mines frequent subgraphs and dynamic relationships using graph computation techniques, and constructs a relationship model. The method involves data cleaning, entity recognition, relationship extraction, and graph computation: the data cleaning combines locality-sensitive hashing with machine-learning denoising, the entity recognition and relationship extraction adopt deep learning models, and the graph computation fuses a graph convolutional network with distributed graph traversal. Finally, relationship prediction is performed through a graph-based machine learning framework, and the analysis results are displayed using graph visualization techniques, providing deep data insight. The method effectively improves the accuracy and efficiency of big data relationship mining and is suitable for data analysis and decision support in complex scenarios.

Description

Big data relation mining method based on graph calculation
Technical Field
The invention relates to the technical field of data processing and mining, and in particular to a big data relationship mining method based on graph computation.
Background
In the big data age, the variety and volume of data grow exponentially, and traditional data processing and mining methods can hardly cope with complex data relationships and massive data. Graph computing is an effective tool for processing complex data relationships and structures, and has gradually drawn extensive attention from researchers and industry. Graph computing is a graph-theory-based computational model for processing graph-structured data consisting of nodes and edges; it is suitable for various application scenarios, including social network analysis, recommendation systems, and financial risk assessment, and can effectively capture the relationships and dependencies in data and reveal potential patterns and trends.
The prior art discloses a visualized person-relationship mining management system and method based on big data, which belongs to the technical field of big data and comprises a mining module, an analysis module, and a management module. The mining module comprises a person mining unit, a time mining unit, and a frequency mining unit: the person mining unit counts and computes the different accounts in the acquired mining data set to obtain a person data set, and the time mining unit counts and marks the dialogue time points in the person data set to obtain a set of time points. That disclosure also covers a visualized person-relationship mining management method based on big data, which addresses the limited depth and breadth of person-relationship mining in earlier art and the poor effect of managing and updating mined person relationships.
However, that data mining method does not fully exploit the advantages of deep learning models in relationship extraction, and it faces computational performance bottlenecks when processing large-scale graph data, so the efficiency of subgraph mining and pattern matching cannot be improved.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a big data relationship mining method based on graph computation that effectively remedies those defects.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a big data relation mining method based on graph calculation comprises the following steps:
S1, preprocessing multi-source heterogeneous big data, including data cleaning, entity identification, relation extraction and type labeling, to generate graph structure data, wherein the preprocessing step includes semantic role labeling of text data by using a natural language processing technology to identify semantic relations among entities;
S2, mining frequent subgraphs and dynamic relations from the graph by utilizing a graph calculation technology, wherein the graph calculation technology comprises a pattern matching algorithm, and the pattern matching algorithm combines deep learning feature extraction and a traditional graph traversing technology to improve accuracy and efficiency of subgraph mining;
S3, constructing a relation model based on the mining result, and applying the relation model to a specific scene for data analysis and insight, wherein the relation model comprises a machine learning framework based on a graph and is used for training and predicting the development and change of complex relations between entities.
The data cleansing step includes using data deduplication and noise filtering techniques, the data deduplication using a locality sensitive hashing method, comprising the steps of:
converting each data record into a feature vector;
selecting a hash function suitable for the data characteristics;
mapping each of the feature vectors into one or more hash buckets using the selected hash function, each of the hash buckets containing a plurality of the feature vectors, the feature vectors of each of the hash buckets being relatively close in hash space;
for each new data record, calculating the hash value of the characteristic vector of the new data record, and mapping the hash value into the corresponding hash bucket;
Repeatedly detecting in the hash bucket, comparing the similarity of all the feature vectors in the same hash bucket, identifying records with the similarity higher than a set threshold value as repeated data, removing the identified repeated data from a data set, and reserving unique data records;
the noise filtering technology comprises a method based on statistics and a method based on machine learning, and specifically comprises the following steps:
standard deviation analysis, namely calculating the mean and standard deviation of the data and identifying data points with a large deviation from the mean;
detecting abnormal distribution, namely identifying abnormal values in data by using a Box-plot (Box-plot);
applying the Z-Score method, namely treating data points with a Z-score greater than 3 or less than -3 as outliers;
the machine learning-based method uses local anomaly factors, specifically:
calculating the distance, namely calculating the neighborhood distance by using Euclidean distance, and calculating the distance between each data point and all other data points;
constructing local density, selecting k value, and determining the size (k value) of local neighborhood, namely k nearest neighbors of each data point;
calculating k distances, and for each data point, finding the maximum distance (k distance) among k nearest neighbors of the data point;
calculating a reachability distance, namely, for data points i and j, the larger of the k-distance of j and the actual distance between i and j;
calculating the local density of each data point, namely the inverse of the average reachability distance to its k nearest neighbors;
the relative density, namely the ratio of the local density of each data point to the local density of the data point in the neighborhood of the data point is calculated;
calculating the LOF value of each data point, namely the average of the relative densities; setting a threshold on the LOF value and regarding data points whose LOF value exceeds the threshold as anomalous;
outliers are filtered and these outliers are removed or marked from the dataset.
The entity identification step comprises using a named entity recognition (NER) model based on deep learning, wherein the recognition model combines a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF); the specific method comprises the following steps:
Acquiring a marked text data set which contains marked entity information;
Constructing a vocabulary and a tag set, wherein the vocabulary maps words in the text to unique index values;
The tag set defines an entity tag class;
converting words into vectors using pre-trained word embeddings (e.g., Word2Vec, GloVe) or custom word embeddings;
Bi-LSTM network:
an input layer embedding words as input to the Bi-LSTM layer, the Bi-LSTM layer capturing contextual information using Bi-directional LSTM;
CRF layer:
Sequence labeling, namely performing sequence labeling on Bi-LSTM output by using a CRF layer, and optimizing entity boundary identification;
Model training, using the negative log-likelihood of the CRF as the loss function, training the model with the Adam optimization algorithm, dividing the data into a training set and a validation set, performing model training, and tuning the hyperparameters to optimize performance.
The relationship extraction step includes using a deep learning based relationship extraction model that uses a pre-trained language model (e.g., BERT) for relationship prediction, comprising the steps of:
acquiring a large-scale corpus containing entity pairs and relationship labels;
arranging the text into a format suitable for model input, and dividing the data set into a training set, a verification set and a test set;
Text tokenization, using a word segmenter (Tokenizer) of BERT to segment text into vocabulary indices;
generating an input, creating an input sample for each entity pair, comprising:
token IDs, vocabulary index of text;
Attention Masks, indicating which locations should be attended by the model;
token Type IDs to distinguish two different text portions;
model construction, loading a pre-training BERT, and selecting a proper BERT variant;
the BERT layer is used for extracting context information in the text;
The relation classification head adds a fully connected layer on top of the [CLS] token output by BERT for relation classification;
defining a loss function, and optimizing classification tasks by using the cross entropy loss function;
And (3) performing model training, inputting a new text into the trained model, performing relationship prediction, and mapping the relationship category output by the model back to an actual relationship label.
The pattern matching algorithm in the graph computation technology comprises feature learning and pattern recognition on graph structure data using a graph convolutional network (GCN), wherein the GCN is used to learn local and global features of nodes; the specific steps are as follows:
collecting graph data, namely acquiring graph structure data with nodes and edges;
Data formatting, namely converting the data into a format suitable for GCN input, wherein the data comprises a node characteristic matrix and an adjacent matrix, and standardizing node characteristics so as to improve training effect;
Performing adjacency matrix processing, namely normalizing the adjacency matrix symmetrically using the inverse square root of the degree matrix;
convolution operation is carried out, wherein convolution operation is applied to each node and the neighborhood thereof, and information of neighbor nodes is aggregated;
Activating a function, namely using a ReLU activating function to increase the nonlinear expression capacity of the model;
stacking GCN layers, and stacking a plurality of graph convolution layers according to the requirement so as to capture the multi-level characteristics of the nodes;
the output layer is used for adding an appropriate output layer, such as a full connection layer or a pooling layer, for the subgraph mining task, and training the built model;
extracting features of nodes and subgraphs from GCN output;
sub-graph matching, namely identifying and extracting sub-graphs conforming to a specific mode by comparing the extracted features with a target mode;
The trained GCN model is deployed in a production environment and integrated into practical application, and real-time sub-graph matching and pattern recognition are carried out by using the model, so that data analysis and decision support are provided.
The graph traversal technique includes processing large-scale graph data using a distributed graph computation framework (GraphX) to accelerate the subgraph mining process;
Confirming Spark cluster version, and downloading GraphX packages compatible with Spark version;
Uploading GraphX packets to all nodes of the Spark cluster and placing the packets in a specified library directory;
Adjusting spark.executor.memory and spark.driver.memory parameters to allocate proper memory;
Setting spark.cores.max and spark.executor.instances parameters to configure core numbers and executor instances;
Data loading, loading data from HDFS, S3 or other storage systems using the method textFile or sequenceFile of Spark;
cleaning data, removing invalid or erroneous records, and converting the original data into vertex format and edge format;
Creating vertex RDD and edge RDD, and ensuring that the data type is consistent with GraphX requirements;
Creating vertex RDD, namely creating the vertex RDD by using a parallelize method of Spark, and distributing unique ID and corresponding attribute to each vertex;
creating an edge RDD, namely creating the edge RDD by using a parallelize method of Spark, and defining a source vertex, a target vertex and attributes of the edge;
Constructing a Graph object, transmitting the vertex RDD and the edge RDD to a Graph constructor, creating the Graph object, and setting default vertex attributes for the Graph;
Definition map calculation logic:
selecting a proper graph algorithm according to service requirements to realize algorithm logic;
Graph computation is performed using the GraphX API, calling GraphX's pregel and mapVertices operators;
writing user-defined functions to process the attributes of the vertexes and the edges;
Distributed graph computation:
Partitioning the graph, and selecting a proper partitioning strategy;
selecting an optimal partitioning strategy in consideration of characteristics of the graph;
and (3) parallel computing, namely dividing the graph data into a plurality of partitions by utilizing the distributed computing capability of GraphX, and processing the partitions in parallel on a cluster, monitoring the progress of the task and the use condition of resources, and ensuring the efficient operation of the computing task.
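GraphX itself exposes this vertex-centric computation through its Scala pregel API; purely as an illustration of the superstep model that API implements (the function name, the plain-Python data structures, and the connected-components example are assumptions for this sketch, not part of the invention), the parallel message-passing pattern described above can be written as:

```python
def pregel_components(vertices, edges, max_iter=20):
    """Pregel-style vertex-centric computation sketched in plain Python:
    in each superstep, every active vertex sends its current label to its
    neighbours, and each vertex adopts the minimum label it receives.
    Repeating until no vertex changes yields connected components."""
    label = {v: v for v in vertices}
    # undirected message passing: index neighbours over both edge directions
    adj = {v: [] for v in vertices}
    for s, d in edges:
        adj[s].append(d)
        adj[d].append(s)
    active = set(vertices)          # initially, every vertex runs its program
    for _ in range(max_iter):
        # message phase: active vertices propose their label to neighbours
        inbox = {}
        for v in active:
            for n in adj[v]:
                inbox[n] = min(inbox.get(n, label[n]), label[v])
        # vertex program: adopt a smaller label; stay active only if changed
        active = {v for v, m in inbox.items() if m < label[v]}
        for v in active:
            label[v] = inbox[v]
        if not active:              # vote to halt: no messages changed anything
            break
    return label

# two components: {1, 2, 3} and {4, 5}
labels = pregel_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

In GraphX the same logic would be partitioned across the cluster, with each partition processing its vertices' programs in parallel per superstep.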
The graph-based machine learning framework in the relational model includes modeling and predicting complex relationships between entities using a Graph Neural Network (GNN) that captures dependencies and dynamic changes between entities.
The data analysis and insight steps include using graph-based visualization techniques to present mined relationships and patterns as interactive atlases for in-depth analysis and understanding by the user.
A computer storage medium storing computer instructions which, when invoked, are operable to perform a graph-based big data relationship mining method as described above.
Compared with the prior art, the big data relationship mining method based on graph computation has the following beneficial effects. Deep learning models (Bi-LSTM combined with CRF for entity recognition, and BERT for relationship extraction) improve the model's understanding of complex language and relationships, raising the accuracy of relationship extraction; this remedies the deficiency of the prior art in semantic understanding and lets the model better capture deep relationships in text. A graph convolutional network (GCN) and a distributed graph computation framework (such as GraphX) are adopted to process large-scale graph data, resolving the performance bottleneck of traditional methods on big data, and distributed computation improves the efficiency of subgraph mining and pattern matching. Data deduplication and noise filtering techniques (including locality-sensitive hashing, statistical methods, and machine-learning methods) effectively improve data quality and provide more reliable input for subsequent mining. Finally, the mined relationships and patterns are presented interactively using graph-based visualization techniques, helping users understand complex relationships more intuitively and improving practical data analysis.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only exemplary, and a person skilled in the art can derive other implementation drawings from them without inventive effort.
FIG. 1 is a diagram of the steps of a big data relation mining method based on graph computation.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Examples
A big data relation mining method based on graph calculation, as shown in figure 1, comprises the following steps:
S1, preprocessing multi-source heterogeneous big data, including data cleaning, entity identification, relation extraction and type labeling, to generate graph structure data, wherein the preprocessing step includes semantic role labeling of text data by using a natural language processing technology to identify semantic relations among entities;
S2, mining frequent subgraphs and dynamic relations from the graph by utilizing a graph calculation technology, wherein the graph calculation technology comprises a pattern matching algorithm, and the pattern matching algorithm combines deep learning feature extraction and a traditional graph traversing technology to improve accuracy and efficiency of subgraph mining;
S3, constructing a relation model based on the mining result, and applying the relation model to a specific scene for data analysis and insight, wherein the relation model comprises a machine learning framework based on a graph and is used for training and predicting the development and change of complex relations between entities.
The data cleansing step includes using data deduplication and noise filtering techniques, the data deduplication using a locality sensitive hashing method, comprising the steps of:
converting each data record into a feature vector;
A hash function suitable for the data characteristics is chosen, such as MinHash for Jaccard similarity, random-projection LSH for cosine similarity, or p-stable LSH for Euclidean distance.
Mapping each of the feature vectors into one or more hash buckets using the selected hash function, each of the hash buckets containing a plurality of the feature vectors, the feature vectors of each of the hash buckets being relatively close in hash space;
for each new data record, calculating the hash value of the characteristic vector of the new data record, and mapping the hash value into the corresponding hash bucket;
Repeatedly detecting in the hash bucket, comparing the similarity of all the feature vectors in the same hash bucket, identifying records with the similarity higher than a set threshold value as repeated data, removing the identified repeated data from a data set, and reserving unique data records;
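The deduplication steps above can be sketched in Python as follows; this is an illustrative minimal implementation, in which the MinHash signature scheme, the MD5-seeded hash family, the banding parameters, and the whitespace tokenization are assumptions for the example, not prescribed by the method:

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the record's tokens."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes))

def lsh_deduplicate(records, num_hashes=32, bands=8, threshold=0.9):
    """Band the signatures into hash buckets, compare only records that
    share a bucket, and drop records whose Jaccard similarity with an
    earlier record exceeds the threshold."""
    rows = num_hashes // bands
    token_sets = [set(r.lower().split()) for r in records]
    sigs = [minhash_signature(t, num_hashes) for t in token_sets]
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(bands):                      # one bucket key per band
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    duplicates = set()
    for members in buckets.values():                # compare within buckets only
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = token_sets[members[i]], token_sets[members[j]]
                if len(a & b) / len(a | b) >= threshold:
                    duplicates.add(members[j])
    return [r for k, r in enumerate(records) if k not in duplicates]

records = [
    "alice 123 main st springfield",
    "alice 123 main st springfield",   # exact duplicate: same bucket, Jaccard 1.0
    "bob 45 oak ave shelbyville",
]
unique = lsh_deduplicate(records)
```

Identical records receive identical signatures and therefore always share a bucket, so the pairwise comparison is only ever performed inside small candidate groups rather than across the whole data set.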
the noise filtering technology comprises a method based on statistics and a method based on machine learning, and specifically comprises the following steps:
standard deviation analysis, calculating the standard deviation of each data, and identifying data points with larger deviation from the mean;
detecting abnormal distribution, namely identifying abnormal values in data by using a Box-plot (Box-plot);
A box plot (Box-plot), also known as a box-and-whisker plot, is a statistical chart for showing the distribution of a set of data. It provides summary statistics of the data: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum; the central tendency, degree of dispersion, and outliers of the data can be seen intuitively from the shape of the box and the length of the whiskers.
Applying the Z-Score method, namely treating data points with a Z-score greater than 3 or less than -3 as outliers;
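A minimal Python sketch of the two statistical filters above, the box-plot (IQR whisker) rule and the Z-score rule (the cutoff values follow the text; the sample data are invented for illustration):

```python
import statistics

def zscore_outliers(data, cutoff=3.0):
    """Flag points whose Z-score magnitude exceeds the cutoff (|z| > 3)."""
    mean = statistics.mean(data)
    std = statistics.stdev(data)
    return [x for x in data if abs((x - mean) / std) > cutoff]

def boxplot_outliers(data, k=1.5):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the whisker
    bounds of a standard box plot."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

# invented sample: 29 normal readings and one gross error
data = [11] * 29 + [100]
z_out = zscore_outliers(data)
box_out = boxplot_outliers(data)
```

Note that the Z-score rule needs a reasonably large sample: with n points, no Z-score can exceed (n-1)/sqrt(n), so a |z| > 3 cutoff can never fire on fewer than about ten points.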
the machine learning-based method uses local anomaly factors, specifically:
calculating the distance, namely calculating the neighborhood distance by using Euclidean distance, and calculating the distance between each data point and all other data points;
constructing local density, selecting k value, and determining the size (k value) of local neighborhood, namely k nearest neighbors of each data point;
calculating k distances, and for each data point, finding the maximum distance (k distance) among k nearest neighbors of the data point;
calculating a reachability distance, namely, for data points i and j, the larger of the k-distance of j and the actual distance between i and j;
calculating the local density of each data point, namely the inverse of the average reachability distance to its k nearest neighbors;
the relative density, namely the ratio of the local density of each data point to the local density of the data point in the neighborhood of the data point is calculated;
Calculating the LOF value of each data point, namely the average of its relative densities; the larger the LOF value, the more likely the data point is an outlier;
outliers are filtered and these outliers are removed or marked from the dataset.
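The LOF steps above (pairwise distances, k-distance, reachability distance, local density, relative density, LOF value) can be sketched directly in Python; this is an illustrative implementation on invented toy 2-D points, not production code:

```python
import math

def lof_scores(points, k=3):
    """Local Outlier Factor: ratio of each point's neighbours' density
    to its own. Steps mirror the text: pairwise distances -> k nearest
    neighbours and k-distance -> reachability distance -> local
    (reachability) density -> LOF as the mean relative density."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point (index 0 is the point itself)
    knn = [sorted(range(n), key=lambda j: dist[i][j])[1:k + 1]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]   # distance to k-th neighbour

    def reach(i, j):
        # reachability distance of i w.r.t. j: max(k-distance(j), d(i, j))
        return max(k_dist[j], dist[i][j])

    # local reachability density: inverse of mean reachability distance
    lrd = [k / sum(reach(i, j) for j in knn[i]) for i in range(n)]
    # LOF: average ratio of neighbours' density to the point's own density
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]

# four clustered points and one isolated point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(pts, k=3)
```

Points deep inside a cluster score close to 1, while the isolated point's LOF is much larger, so a threshold slightly above 1 separates inliers from outliers on this toy data.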
The entity identification step comprises using a named entity recognition (NER) model based on deep learning, wherein the recognition model combines a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF); the specific method comprises the following steps:
And acquiring a marked text data set which contains marked entity information such as a person name, a place name and an organization name.
Constructing a vocabulary and a tag set, wherein the vocabulary maps words in the text to unique index values;
The tag set defines an entity tag class, e.g. "B-PER" indicates the beginning of a person name and "I-PER" indicates the inside of a person name.
Converting words into vectors using pre-trained word embeddings (e.g., Word2Vec, GloVe) or custom word embeddings;
Bi-LSTM network:
an input layer embedding words as input to the Bi-LSTM layer, the Bi-LSTM layer capturing contextual information using Bi-directional LSTM;
CRF layer:
Sequence labeling, namely performing sequence labeling on Bi-LSTM output by using a CRF layer, and optimizing entity boundary identification;
Model training, using the negative log-likelihood of the CRF as the loss function, training the model with the Adam optimization algorithm, dividing the data into a training set and a validation set, performing model training, and tuning the hyperparameters to optimize performance.
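At prediction time, the CRF layer's job is to decode the best tag sequence from the Bi-LSTM's per-token scores via the Viterbi algorithm. A minimal decoder sketch follows; the emission and transition scores are hand-set and purely illustrative (an untrained stand-in for real Bi-LSTM outputs and learned CRF parameters), using the B-PER/I-PER/O tag scheme mentioned above:

```python
def viterbi_decode(emissions, transitions, tags):
    """CRF decoding: find the tag sequence maximising the sum of
    per-token emission scores and tag-to-tag transition scores."""
    n_tags = len(tags)
    score = list(emissions[0])      # best path score ending in each tag
    back = []                       # back-pointers per position
    for em in emissions[1:]:
        new_score, ptr = [], []
        for t in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda p: score[p] + transitions[p][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + em[t])
            ptr.append(best_prev)
        score, back = new_score, back + [ptr]
    # follow back-pointers from the best final tag
    t = max(range(n_tags), key=lambda x: score[x])
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return [tags[t] for t in reversed(path)]

tags = ["O", "B-PER", "I-PER"]
# illustrative transition scores: O -> I-PER is strongly penalised,
# which is how the CRF enforces valid entity boundaries
T = [[0.0, 0.0, -10.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
# hypothetical per-token emission scores for a 4-token sentence
E = [[2.0, 0.0, 0.0],
     [0.0, 1.5, 1.4],
     [0.0, 0.5, 1.0],
     [2.0, 0.0, 0.0]]
best = viterbi_decode(E, T, tags)
```

Even though token 3's emissions slightly favour I-PER on their own, the transition penalty steers the decoder to the globally consistent sequence, which is exactly the entity-boundary optimization the CRF layer provides over per-token classification.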
The relationship extraction step includes using a deep learning based relationship extraction model that uses a pre-trained language model (e.g., BERT) for relationship prediction, comprising the steps of:
A large-scale corpus containing entity pairs and relationship labels is acquired, such as the ACE (Automatic Content Extraction) and SemEval datasets.
Arranging the text into a format suitable for model input, and dividing the data set into a training set, a verification set and a test set;
Text tokenization, using a word segmenter (Tokenizer) of BERT to segment text into vocabulary indices;
generating an input, creating an input sample for each entity pair, comprising:
token IDs, vocabulary index of text;
Attention Masks, indicating which locations should be attended by the model;
Token Type IDs are used to distinguish between two different text portions, such as questions and answers.
Model construction, loading of pre-trained BERT, selection of appropriate BERT variants, such as BERT-base or BERT-large.
The BERT layer is used for extracting context information in the text;
The relation classification head adds a fully connected layer on top of the [CLS] token output by BERT for relation classification;
defining a loss function, and optimizing classification tasks by using the cross entropy loss function;
And (3) performing model training, inputting a new text into the trained model, performing relationship prediction, and mapping the relationship category output by the model back to an actual relationship label.
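The three inputs listed above (Token IDs, Attention Masks, Token Type IDs) can be illustrated with a toy whitespace vocabulary standing in for BERT's WordPiece Tokenizer; the vocabulary, the helper name, and the sample text pair are invented for this sketch:

```python
def encode_pair(text_a, text_b, vocab, max_len=16):
    """Build the three BERT-style input lists for a text pair:
    [CLS] text_a [SEP] text_b [SEP], padded to max_len.
    Token Type IDs are 0 for the first segment, 1 for the second."""
    toks_a = text_a.lower().split()
    toks_b = text_b.lower().split()
    tokens = ["[CLS]"] + toks_a + ["[SEP]"] + toks_b + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # Token IDs
    mask = [1] * len(ids)                                  # Attention Mask
    type_ids = [0] * (len(toks_a) + 2) + [1] * (len(toks_b) + 1)
    pad = max_len - len(ids)                               # pad with [PAD]=0
    return ids + [0] * pad, mask + [0] * pad, type_ids + [0] * pad

# invented toy vocabulary; a real model would use its pre-trained WordPiece vocab
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "acme": 4, "founded": 5, "by": 6, "alice": 7}
ids, mask, types = encode_pair("acme founded by alice", "alice", vocab)
```

The attention mask zeroes out the padding positions so the model attends only to real tokens, and the type IDs let BERT distinguish the two text portions of the pair.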
The pattern matching algorithm in the graph computation technology comprises feature learning and pattern recognition on graph structure data using a graph convolutional network (GCN), wherein the GCN is used to learn local and global features of nodes; the specific steps are as follows:
Graph data collection, namely acquiring graph structure data with nodes and edges (the data may be derived from a social network, molecular structures, or another dataset in graph form);
Data formatting, namely converting the data into a format suitable for GCN input, wherein the data comprises a node characteristic matrix and an adjacent matrix, and standardizing node characteristics so as to improve training effect;
Performing adjacency matrix processing, namely normalizing the adjacency matrix symmetrically using the inverse square root of the degree matrix;
convolution operation is carried out, wherein convolution operation is applied to each node and the neighborhood thereof, and information of neighbor nodes is aggregated;
Activating a function, namely using a ReLU activating function to increase the nonlinear expression capacity of the model;
stacking GCN layers, and stacking a plurality of graph convolution layers according to the requirement so as to capture the multi-level characteristics of the nodes;
the output layer is used for adding an appropriate output layer, such as a full connection layer or a pooling layer, for the subgraph mining task, and training the built model;
extracting features of nodes and subgraphs from GCN output;
sub-graph matching, namely identifying and extracting sub-graphs conforming to a specific mode by comparing the extracted features with a target mode;
The trained GCN model is deployed in a production environment and integrated into practical application, and real-time sub-graph matching and pattern recognition are carried out by using the model, so that data analysis and decision support are provided.
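The adjacency normalization, convolution and ReLU steps above can be sketched as a single GCN layer; the toy graph, input features and weight matrix below are hypothetical, and a practical system would use a dedicated library (e.g. PyTorch Geometric) with learned weights:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1)                 # degree of each node
    D_inv_sqrt = np.diag(deg ** -0.5)       # inverse square root of degree matrix
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)  # aggregate neighbours, then ReLU

# Hypothetical 3-node path graph, 2-dim node features, fixed weights.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)                            # simple one-hot-style features
W = np.array([[1., 0.], [0., 1.]])
H1 = gcn_layer(A, H, W)                     # stacking further calls yields deeper GCNs
```

Stacking several such calls corresponds to the multi-layer stacking step, and the resulting row vectors of the final layer are the node features used for subgraph matching.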
The graph traversal technique includes processing large-scale graph data using a distributed graph computation framework (GraphX) to accelerate the subgraph mining process;
Confirming Spark cluster version, and downloading GraphX packages compatible with Spark version;
Uploading GraphX packets to all nodes of the Spark cluster and placing the packets in a specified library directory;
Adjusting spark.executor.memory and spark.driver.memory parameters to allocate proper memory;
Setting spark.cores.max and spark.executor.instances parameters to configure core numbers and executor instances;
Data loading, loading data from HDFS, S3 or other storage systems using Spark's textFile or sequenceFile methods;
Cleaning the data, removing invalid or erroneous records, and converting the raw data into the vertex format (VertexId, VD) and the edge format Edge[ED];
Creating vertex RDD and edge RDD, and ensuring that the data type is consistent with GraphX requirements;
Creating vertex RDD, namely creating the vertex RDD by using a parallelize method of Spark, and distributing unique ID and corresponding attribute to each vertex;
creating an edge RDD, namely creating the edge RDD by using a parallelize method of Spark, and defining a source vertex, a target vertex and attributes of the edge;
Constructing a Graph object, transmitting the vertex RDD and the edge RDD to a Graph constructor, creating the Graph object, and setting default vertex attributes for the Graph;
Definition map calculation logic:
selecting a proper graph algorithm according to service requirements to realize algorithm logic, such as defining an iterative calculation process of PageRank;
Graph computation is performed using the GraphX API, calling GraphX's pregel and mapVertices operators;
writing user-defined functions to process the attributes of the vertexes and the edges;
Distributed graph computation
Partitioning the graph, selecting a proper partitioning strategy, such as CanonicalRandomVertexCut, to optimize computing performance;
selecting an optimal partitioning strategy in consideration of characteristics of the graph;
for highly connected graphs, CanonicalRandomVertexCut generally performs well;
for sparse or dense graphs, EdgeCut may be more suitable for dense graphs, while VertexCut suits sparse graphs.
Parallel computing, namely dividing the graph data into a plurality of partitions by utilizing the distributed computing capability of GraphX and processing the partitions in parallel on the cluster; monitoring task progress and resource usage to ensure that the computing task runs efficiently.
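GraphX itself exposes a Scala/JVM API; as a single-machine illustration only, the plain-Python loop below mirrors the message and vertex-program phases of the Pregel-style PageRank iteration that GraphX's pregel operator distributes across partitions. The toy graph is hypothetical:

```python
# Single-machine sketch of a Pregel-style PageRank iteration. In GraphX
# the message phase and vertex-update phase below would run in parallel
# over graph partitions; here they are plain Python for illustration.

def pagerank(edges, num_iters=20, damping=0.85):
    """edges: list of (src, dst) pairs; returns {vertex: rank}."""
    vertices = {v for e in edges for v in e}
    out_deg = {v: 0 for v in vertices}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # "Message" phase: each vertex sends rank/out_degree along its edges.
        msgs = {v: 0.0 for v in vertices}
        for src, dst in edges:
            msgs[dst] += rank[src] / out_deg[src]
        # "Vertex program" phase: combine received messages into the new rank.
        rank = {v: (1 - damping) + damping * msgs[v] for v in vertices}
    return rank

ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

Vertex "c" receives contributions along two edges and therefore ends with the highest rank, matching the intuition behind the iterative PageRank logic mentioned above.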
The graph-based machine learning framework in the relational model includes modeling and predicting complex relationships between entities using a Graph Neural Network (GNN) that captures dependencies and dynamic changes between entities.
The data analysis and insight steps include using graph-based visualization techniques to present mined relationships and patterns as interactive atlases for in-depth analysis and understanding by the user.
A computer storage medium storing computer instructions which, when invoked, are operable to perform a graph-based big data relationship mining method as described above.
In the specific implementation process, step S1 is to clean data and generate a graph structure
Converting each data record into a feature vector; performing de-duplication by using a locality-sensitive hashing (LSH) method: selecting a hash function suited to the data features, mapping the feature vectors into hash buckets, calculating the hash value of each new data record and mapping it into a hash bucket, performing duplicate detection within each hash bucket, identifying and removing duplicate data by calculating similarity, and retaining a unique record.
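The LSH de-duplication procedure above can be sketched as follows; the fixed hyperplanes, feature vectors and 0.95 similarity threshold are illustrative assumptions (a production system would draw random hyperplanes suited to the feature space):

```python
def lsh_signature(vec, planes):
    """Bucket key: the sign pattern of vec against a set of hyperplanes."""
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v))

def dedup(records, planes, threshold=0.95):
    """Keep a record only if no already-kept record in the same LSH
    bucket has cosine similarity above the threshold."""
    buckets, kept = {}, []
    for rec in records:
        key = lsh_signature(rec, planes)
        if all(cosine(rec, other) < threshold for other in buckets.get(key, [])):
            buckets.setdefault(key, []).append(rec)
            kept.append(rec)
    return kept

# Fixed hyperplanes for reproducibility; real LSH draws them at random.
planes = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, -1, 0]]
records = [[1.0, 0.0, 0.0],    # original record
           [0.99, 0.01, 0.0],  # near-duplicate of the first record
           [0.0, 1.0, 0.0]]    # distinct record
unique = dedup(records, planes)
```

The near-duplicate lands in the same bucket as the original and is removed by the within-bucket similarity check, while the distinct record hashes to a different bucket and is retained.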
Noise filtering:
Statistical methods and machine learning methods: in the statistical methods, outliers are identified and removed using standard deviation analysis, box plots and the Z-Score method; in the machine learning method, the local outlier factor (LOF) of each data point is calculated, a threshold is set, and outliers whose LOF values exceed the threshold are removed.
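The standard-deviation/Z-Score portion of the noise filtering can be sketched as below; the data set and the default threshold of 3 are illustrative assumptions, and the LOF-based step described above would additionally weigh each point's local density:

```python
from statistics import mean, stdev

def zscore_filter(values, z_max=3.0):
    """Statistical noise filtering: drop points whose Z-score magnitude
    exceeds z_max."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs((v - mu) / sigma) <= z_max]

# Hypothetical measurements with one gross outlier (120.0).
data = [10.1, 9.8, 10.0, 10.2, 9.9] * 4 + [120.0]
clean = zscore_filter(data)
```

On this toy data the outlier's Z-score is well above 3 and it is removed, while the remaining points all lie within a fraction of one standard deviation of the mean.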
Entity identification and relationship extraction:
Entity recognition, namely performing named entity recognition by using a deep learning model combining a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF): constructing a vocabulary and a tag set, converting words into vectors by using pre-trained or custom word embeddings, capturing context information with the Bi-LSTM layer, performing sequence labeling with the CRF layer, and optimizing entity boundary recognition.
And extracting the relation, namely performing relation prediction by using a BERT-based deep learning model, arranging texts into a format suitable for model input, extracting context information by using the BERT, and adding a full-connection layer to perform relation classification.
Graph structure data is generated, and the preprocessed data (including entities and relationships) is converted into graph structure data, including nodes (entities) and edges (relationships).
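Converting the extracted entities and relations into graph structure data can be sketched as follows; the (head, relation, tail) triples shown are hypothetical examples of the extraction output:

```python
from collections import defaultdict

def build_graph(triples):
    """Turn (head, relation, tail) triples from the extraction steps
    into a node set and a relation-labelled adjacency structure."""
    nodes = set()
    adjacency = defaultdict(list)   # node -> [(neighbour, relation), ...]
    for head, rel, tail in triples:
        nodes.update((head, tail))  # entities become nodes
        adjacency[head].append((tail, rel))  # relations become labelled edges
    return nodes, dict(adjacency)

triples = [("Alice", "works_at", "Acme"),
           ("Bob", "works_at", "Acme"),
           ("Alice", "knows", "Bob")]
nodes, edges = build_graph(triples)
```

The resulting node set and edge map are the graph structure data consumed by the graph computation steps of step S2.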
Step S2, graph calculation and frequent subgraph mining
The graph computation technique, using a graph convolutional network (GCN), is to collect and format graph data, create the node feature matrix and adjacency matrix, normalize the adjacency matrix, apply the convolution operation and ReLU activation function in the GCN, stack multiple GCN layers to capture multi-level features of the nodes, extract node and subgraph features from the GCN output, and perform pattern matching and subgraph recognition.
The pattern matching algorithm uses the graph convolutional network (GCN) for pattern recognition, identifies subgraphs matching the target pattern, and extracts frequent subgraphs.
Step S3, construction and application of relation model
A relationship model is built, complex relationships between entities are modeled and predicted by using a Graph Neural Network (GNN), and the graph neural network is trained to capture the dependency relationships and dynamic changes between the entities.
And (3) applying the model, and applying the trained relation model to a specific scene to perform data analysis and insight.
Step S4, data analysis and visualization
And (3) data analysis, namely, using a visualization technology based on a graph to present the mined relation and pattern as an interactive map, and enabling a user to conduct deep analysis through the map so as to understand the complex relation and pattern in the data.
Visualization techniques, selecting appropriate graph visualization tools and platforms (e.g., Gephi, Cytoscape);
The interactive interface is designed to allow the user to query and analyze nodes and edges in the graph, and provide filtering, scaling and layout functions of the graph for flexible exploration and analysis by the user.
The same or similar reference numerals correspond to the same or similar components;
The terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
It is to be understood that the above-described embodiments of the present invention are provided by way of illustration only and not as limitations of the embodiments of the present invention, and that various other changes and modifications may be made by one skilled in the art based on the above description, without the necessity of or without intending to be exhaustive of all embodiments, and any modifications, equivalents, improvements and modifications etc. within the spirit and principles of the present invention are intended to be included within the scope of the appended claims.

Claims (9)

1. The big data relation mining method based on graph calculation is characterized by comprising the following steps of:
S1, preprocessing multi-source heterogeneous big data, including data cleaning, entity identification, relation extraction and type labeling, to generate graph structure data, wherein the preprocessing step includes semantic role labeling of text data by using a natural language processing technology to identify semantic relations among entities;
S2, mining frequent subgraphs and dynamic relations from the graph by utilizing a graph calculation technology, wherein the graph calculation technology comprises a pattern matching algorithm, and the pattern matching algorithm combines deep learning feature extraction and a traditional graph traversing technology to improve accuracy and efficiency of subgraph mining;
s3, constructing a relation model based on the mining result, and applying the relation model to a specific scene for data analysis and insight, wherein the relation model comprises a machine learning framework based on a graph and is used for training and predicting the development and change of complex relations between entities.
2. The graph-based large data relationship mining method of claim 1, wherein the data cleansing step includes using data deduplication and noise filtering techniques, the data deduplication using a locality sensitive hashing method, comprising the steps of:
converting each data record into a feature vector;
selecting a hash function suitable for the data characteristics;
mapping each of the feature vectors into one or more hash buckets using the selected hash function, each of the hash buckets containing a plurality of the feature vectors, the feature vectors of each of the hash buckets being relatively close in hash space;
for each new data record, calculating the hash value of the characteristic vector of the new data record, and mapping the hash value into the corresponding hash bucket;
Repeatedly detecting in the hash bucket, comparing the similarity of all the feature vectors in the same hash bucket, identifying records with the similarity higher than a set threshold value as repeated data, removing the identified repeated data from a data set, and reserving unique data records;
the noise filtering technology comprises a method based on statistics and a method based on machine learning, and specifically comprises the following steps:
standard deviation analysis, calculating the standard deviation of each data, and identifying data points with larger deviation from the mean;
detecting abnormal distribution, namely identifying abnormal values in the data by using a box graph;
applying the Z-Score method, namely treating data points with a Z-Score greater than 3 or less than -3 as outliers;
the machine learning-based method uses local anomaly factors, specifically:
calculating the distance, namely calculating the neighborhood distance by using Euclidean distance, and calculating the distance between each data point and all other data points;
constructing local density, selecting k value, and determining the size of local neighborhood, namely k nearest neighbors of each data point;
Calculating k distances, and for each data point, finding the maximum distance among k nearest neighbors;
calculating a reachable distance, and for data points i and j, calculating the reachable distance, wherein the reachable distance is defined as the larger of the k-distance of j and the distance between i and j;
calculating the local density of each data point, namely the density of the data points in k neighborhood of the local density;
the relative density, namely the ratio of the local density of each data point to the local density of the data point in the neighborhood of the data point is calculated;
LOF value, calculating LOF value of each data point, namely average value of relative density, setting threshold value according to LOF value, and regarding the data point with LOF value exceeding the threshold value as abnormal point;
outliers are filtered and these outliers are removed or marked from the dataset.
3. The graph-based big data relation mining method according to claim 2, wherein the entity recognition step includes using a named entity recognition model based on deep learning, wherein the recognition model is a model combining a bidirectional long short-term memory network and a conditional random field, and the specific method steps are as follows:
The method comprises the steps of obtaining a marked text data set which contains marked entity information, constructing a vocabulary and a tag set, wherein the vocabulary maps vocabulary in a text to a unique index value;
The tag set defines an entity tag class;
Converting words into vectors using a pre-trained word embedding or custom word embedding layer;
Bi-LSTM network:
an input layer embedding words as input to the Bi-LSTM layer, the Bi-LSTM layer capturing contextual information using Bi-directional LSTM;
CRF layer:
Sequence labeling, namely performing sequence labeling on Bi-LSTM output by using a CRF layer, and optimizing entity boundary identification;
Model training, using the log likelihood of the CRF as a loss function, training the model using an Adam optimization algorithm, dividing the data into a training set and a verification set, performing model training, and adjusting the super parameters to optimize performance.
4. A graph computation based big data relationship mining method according to claim 3, wherein the relationship extraction step includes using a deep learning based relationship extraction model that uses a pre-trained language model for relationship prediction, comprising the steps of:
acquiring a large-scale corpus containing entity pairs and relationship labels;
arranging the text into a format suitable for model input, and dividing the data set into a training set, a verification set and a test set;
Marking a text, and dividing the text into vocabulary indexes by using a word segmentation device of BERT;
generating an input, creating an input sample for each entity pair, comprising:
token IDs, vocabulary index of text;
Attention Masks, indicating which locations should be attended by the model;
token Type IDs to distinguish two different text portions;
model construction, loading a pre-training BERT, and selecting a proper BERT variant;
the BERT layer is used for extracting context information in the text;
The relation classification head adds a fully connected layer on the [CLS] token output by BERT for relation classification;
defining a loss function, and optimizing classification tasks by using the cross entropy loss function;
And (3) performing model training, inputting a new text into the trained model, performing relationship prediction, and mapping the relationship category output by the model back to an actual relationship label.
5. The graph-computing-based big data relation mining method according to claim 4, wherein the pattern matching algorithm in the graph computing technology includes feature learning and pattern recognition on graph structure data by using a graph convolutional network, wherein the GCN is used for learning local and global features of nodes, and the specific steps are as follows:
collecting graph data, namely acquiring graph structure data with nodes and edges;
Data formatting, namely converting the data into a format suitable for GCN input, including a node feature matrix and an adjacency matrix, and standardizing node features so as to improve the training effect;
Performing adjacency matrix processing, namely normalizing the adjacency matrix by multiplying it on both sides by the inverse square root of the degree matrix;
convolution operation is carried out, wherein convolution operation is applied to each node and the neighborhood thereof, and information of neighbor nodes is aggregated;
Activating a function, namely using a ReLU activating function to increase the nonlinear expression capacity of the model;
stacking GCN layers, and stacking a plurality of graph convolution layers according to the requirement so as to capture the multi-level characteristics of the nodes;
the output layer is used for adding an appropriate output layer, such as a full connection layer or a pooling layer, for the subgraph mining task, and training the built model;
extracting features of nodes and subgraphs from GCN output;
sub-graph matching, namely identifying and extracting sub-graphs conforming to a specific mode by comparing the extracted features with a target mode;
The trained GCN model is deployed in a production environment and integrated into practical application, and real-time sub-graph matching and pattern recognition are carried out by using the model, so that data analysis and decision support are provided.
6. The graph computation based big data relationship mining method of claim 5, wherein the graph traversal technique includes processing the large-scale graph data with a distributed graph computation framework to accelerate the subgraph mining process;
Confirming Spark cluster version, and downloading GraphX packages compatible with Spark version;
Uploading GraphX packets to all nodes of the Spark cluster and placing the packets in a specified library directory;
Adjusting spark.executor.memory and spark.driver.memory parameters to allocate proper memory;
Setting spark.cores.max and spark.executor.instances parameters to configure core numbers and executor instances;
Data loading, loading data from HDFS, S3 or other storage systems using the method textFile or sequenceFile of Spark;
cleaning data, removing invalid or erroneous records, and converting the original data into vertex format and edge format;
Creating vertex RDD and edge RDD, and ensuring that the data type is consistent with GraphX requirements;
Creating vertex RDD, namely creating the vertex RDD by using a parallelize method of Spark, and distributing unique ID and corresponding attribute to each vertex;
creating an edge RDD, namely creating the edge RDD by using a parallelize method of Spark, and defining a source vertex, a target vertex and attributes of the edge;
Constructing a Graph object, transmitting the vertex RDD and the edge RDD to a Graph constructor, creating the Graph object, and setting default vertex attributes for the Graph;
Definition map calculation logic:
selecting a proper graph algorithm according to service requirements to realize algorithm logic;
Graph computation is performed using the GraphX API, calling GraphX's pregel and mapVertices operators;
writing user-defined functions to process the attributes of the vertexes and the edges;
Calculating a distributed graph, partitioning the graph, and selecting a proper partition strategy;
selecting an optimal partitioning strategy in consideration of characteristics of the graph;
parallel computing, namely dividing the graph data into a plurality of partitions by utilizing the distributed computing capability of GraphX and processing the partitions in parallel on the cluster; monitoring task progress and resource usage to ensure that the computing task runs efficiently.
7. The graph-based computational big data relationship mining method of claim 6, wherein the graph-based machine learning framework in the relationship model includes modeling and predicting complex relationships between entities using a graph neural network, wherein the graph neural network is used to capture dependency relationships and dynamic changes between entities.
8. The graph-based computed big data relationship mining method of claim 7, wherein the data analysis and insight step includes using graph-based visualization techniques to present mined relationships and patterns as interactive graphs for in-depth analysis and understanding by the user.
9. A computer storage medium storing computer instructions which, when invoked, are operable to perform a graph-based big data relationship mining method as claimed in any of claims 1-8.
CN202411304787.0A 2024-09-19 2024-09-19 A big data relationship mining method based on graph computing Active CN119312810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411304787.0A CN119312810B (en) 2024-09-19 2024-09-19 A big data relationship mining method based on graph computing

Publications (2)

Publication Number Publication Date
CN119312810A true CN119312810A (en) 2025-01-14
CN119312810B CN119312810B (en) 2025-05-30

Family

ID=94189926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411304787.0A Active CN119312810B (en) 2024-09-19 2024-09-19 A big data relationship mining method based on graph computing

Country Status (1)

Country Link
CN (1) CN119312810B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413707A (en) * 2019-07-22 2019-11-05 百融云创科技股份有限公司 The excavation of clique's relationship is cheated in internet and checks method and its system
US20230007965A1 (en) * 2020-03-23 2023-01-12 Zhejiang University Entity relation mining method based on biomedical literature
CN117453802A (en) * 2023-11-16 2024-01-26 中国电子科技集团公司第十研究所 A multi-domain cross-chapter event causality mining and determination device and method
CN117591944A (en) * 2024-01-19 2024-02-23 广东工业大学 A learning early warning method and system for big data analysis
CN118656410A (en) * 2024-06-24 2024-09-17 武汉安天信息技术有限责任公司 Entity relationship mining method, storage medium and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tang Dequan: "Research and Application of Graph-Pattern-Based Data Mining Algorithms", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 2022, 15 April 2022 (2022-04-15) *
Yan Peng; Gao Jianling: "Research on the Application of Graph Data Mining in Social Networks", Electronics World, no. 08, 23 April 2016 (2016-04-23) *

Also Published As

Publication number Publication date
CN119312810B (en) 2025-05-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant