
CN119312810A - A big data relationship mining method based on graph computing - Google Patents

A big data relationship mining method based on graph computing

Info

Publication number
CN119312810A
CN119312810A
Authority
CN
China
Prior art keywords
data
graph
model
namely
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411304787.0A
Other languages
Chinese (zh)
Other versions
CN119312810B (en)
Inventor
孙伟 (Sun Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jingjing Intelligent Technology Co ltd
Original Assignee
Guangzhou Jingjing Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jingjing Intelligent Technology Co ltd filed Critical Guangzhou Jingjing Intelligent Technology Co ltd
Priority to CN202411304787.0A priority Critical patent/CN119312810B/en
Publication of CN119312810A publication Critical patent/CN119312810A/en
Application granted granted Critical
Publication of CN119312810B publication Critical patent/CN119312810B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a big data relationship mining method based on graph computation, which generates a graph structure by preprocessing multi-source heterogeneous data, mines frequent subgraphs and dynamic relationships using graph computation techniques, and constructs a relationship model. The method involves data cleaning, entity recognition, relationship extraction, and graph computation: the data cleaning combines locality-sensitive hashing with machine-learning denoising, the entity recognition and relationship extraction adopt deep learning models, and the graph computation fuses a graph convolutional network with distributed graph traversal. Finally, relationship prediction is performed through a graph-based machine learning framework, and the analysis results are displayed using graph visualization techniques, providing deep data insight. The method effectively improves the accuracy and efficiency of big data relationship mining and is suitable for data analysis and decision support in complex scenarios.

Description

Big data relation mining method based on graph calculation
Technical Field
The invention relates to the technical field of data processing and mining, and in particular to a big data relationship mining method based on graph computation.
Background
In the big data age, the variety and volume of data grow exponentially, and traditional data processing and mining methods can hardly cope with complex data relationships and massive data. Graph computing is an effective tool for processing complex data relationships and structures, and has gradually drawn extensive attention from researchers and industry. Graph computing is a graph-theory-based computational model for processing graph-structured data consisting of nodes and edges; it is suitable for various application scenarios, including social network analysis, recommendation systems, and financial risk assessment, and can effectively capture the relationships and dependencies in data and reveal potential patterns and trends.
The prior art discloses a visualized person-relationship mining management system and method based on big data, which belongs to the technical field of big data and comprises a mining module, an analysis module, and a management module. The mining module comprises a person mining unit, a time mining unit, and a frequency mining unit: the person mining unit counts and computes the different accounts in the acquired mining data set to obtain a person data set, and the time mining unit counts and marks the dialogue time points in the person data set to obtain a set of time points. That disclosure also covers a visualized person-relationship mining management method based on big data, which addresses the limited depth and breadth of person-relationship mining in earlier art and the poor effect of managing and updating mined person relationships.
However, that data mining method does not fully exploit the advantages of deep learning models in relationship extraction, and it faces computational performance bottlenecks when processing large-scale graph data, so the efficiency of subgraph mining and pattern matching cannot be improved.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a big data relationship mining method based on graph computation that effectively remedies those defects.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a big data relation mining method based on graph calculation comprises the following steps:
S1, preprocessing multi-source heterogeneous big data, including data cleaning, entity identification, relation extraction and type labeling, to generate graph structure data, wherein the preprocessing step includes semantic role labeling of text data by using a natural language processing technology to identify semantic relations among entities;
S2, mining frequent subgraphs and dynamic relations from the graph by utilizing a graph calculation technology, wherein the graph calculation technology comprises a pattern matching algorithm, and the pattern matching algorithm combines deep learning feature extraction and a traditional graph traversing technology to improve accuracy and efficiency of subgraph mining;
S3, constructing a relation model based on the mining result, and applying the relation model to a specific scene for data analysis and insight, wherein the relation model comprises a machine learning framework based on a graph and is used for training and predicting the development and change of complex relations between entities.
The data cleansing step includes using data deduplication and noise filtering techniques, the data deduplication using a locality sensitive hashing method, comprising the steps of:
converting each data record into a feature vector;
selecting a hash function suitable for the data characteristics;
mapping each of the feature vectors into one or more hash buckets using the selected hash function, each of the hash buckets containing a plurality of the feature vectors, the feature vectors of each of the hash buckets being relatively close in hash space;
for each new data record, calculating the hash value of the characteristic vector of the new data record, and mapping the hash value into the corresponding hash bucket;
Repeatedly detecting in the hash bucket, comparing the similarity of all the feature vectors in the same hash bucket, identifying records with the similarity higher than a set threshold value as repeated data, removing the identified repeated data from a data set, and reserving unique data records;
the noise filtering technology comprises a method based on statistics and a method based on machine learning, and specifically comprises the following steps:
standard deviation analysis, namely calculating the mean and standard deviation of the data and identifying data points with a large deviation from the mean;
detecting abnormal distribution, namely identifying abnormal values in data by using a Box-plot (Box-plot);
applying the Z-Score method, namely treating data points with a Z-score greater than 3 or less than -3 as outliers;
the machine learning-based method uses local anomaly factors, specifically:
calculating the distance, namely calculating the neighborhood distance by using Euclidean distance, and calculating the distance between each data point and all other data points;
constructing local density, selecting k value, and determining the size (k value) of local neighborhood, namely k nearest neighbors of each data point;
calculating k distances, and for each data point, finding the maximum distance (k distance) among k nearest neighbors of the data point;
calculating a reachability distance, namely, for data points i and j, the larger of the k-distance of j and the actual distance between i and j;
calculating the local density of each data point, namely the inverse of the average reachability distance to its k nearest neighbors;
the relative density, namely the ratio of the local density of each data point to the local density of the data point in the neighborhood of the data point is calculated;
calculating the LOF value of each data point, namely the average of the relative densities; setting a threshold on the LOF value and regarding data points whose LOF value exceeds the threshold as anomalous;
outliers are filtered and these outliers are removed or marked from the dataset.
The entity identification step comprises using a named entity recognition (NER) model based on deep learning, wherein the recognition model combines a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF); the specific method comprises the following steps:
Acquiring a marked text data set which contains marked entity information;
Constructing a vocabulary and a tag set, wherein the vocabulary maps words in the text to unique index values;
The tag set defines an entity tag class;
converting words into vectors using pre-trained word embeddings (e.g., Word2Vec, GloVe) or custom word embeddings;
Bi-LSTM network:
an input layer embedding words as input to the Bi-LSTM layer, the Bi-LSTM layer capturing contextual information using Bi-directional LSTM;
CRF layer:
Sequence labeling, namely performing sequence labeling on Bi-LSTM output by using a CRF layer, and optimizing entity boundary identification;
Model training, using the negative log-likelihood of the CRF as the loss function, training the model with the Adam optimization algorithm, dividing the data into a training set and a validation set, performing model training, and tuning the hyperparameters to optimize performance.
The relationship extraction step includes using a deep learning based relationship extraction model that uses a pre-trained language model (e.g., BERT) for relationship prediction, comprising the steps of:
acquiring a large-scale corpus containing entity pairs and relationship labels;
arranging the text into a format suitable for model input, and dividing the data set into a training set, a verification set and a test set;
Text tokenization, using a word segmenter (Tokenizer) of BERT to segment text into vocabulary indices;
generating an input, creating an input sample for each entity pair, comprising:
token IDs, vocabulary index of text;
Attention Masks, indicating which locations should be attended by the model;
token Type IDs to distinguish two different text portions;
model construction, loading a pre-training BERT, and selecting a proper BERT variant;
the BERT layer is used for extracting context information in the text;
The relation classification head adds a fully connected layer on top of the [CLS] token output by BERT for relation classification;
defining a loss function, and optimizing classification tasks by using the cross entropy loss function;
And (3) performing model training, inputting a new text into the trained model, performing relationship prediction, and mapping the relationship category output by the model back to an actual relationship label.
The pattern matching algorithm in the graph computation technology comprises feature learning and pattern recognition on graph structure data using a graph convolutional network (GCN), wherein the GCN is used to learn local and global features of nodes; the specific steps are as follows:
collecting graph data, namely acquiring graph structure data with nodes and edges;
Data formatting, namely converting the data into a format suitable for GCN input, wherein the data comprises a node characteristic matrix and an adjacent matrix, and standardizing node characteristics so as to improve training effect;
Performing adjacency matrix processing, namely normalizing the adjacency matrix symmetrically using the inverse square root of the degree matrix;
convolution operation is carried out, wherein convolution operation is applied to each node and the neighborhood thereof, and information of neighbor nodes is aggregated;
Activating a function, namely using a ReLU activating function to increase the nonlinear expression capacity of the model;
stacking GCN layers, and stacking a plurality of graph convolution layers according to the requirement so as to capture the multi-level characteristics of the nodes;
the output layer is used for adding an appropriate output layer, such as a full connection layer or a pooling layer, for the subgraph mining task, and training the built model;
extracting features of nodes and subgraphs from GCN output;
sub-graph matching, namely identifying and extracting sub-graphs conforming to a specific mode by comparing the extracted features with a target mode;
The trained GCN model is deployed in a production environment and integrated into practical application, and real-time sub-graph matching and pattern recognition are carried out by using the model, so that data analysis and decision support are provided.
The graph traversal technique includes processing large-scale graph data using a distributed graph computation framework (GraphX) to accelerate the subgraph mining process;
Confirming Spark cluster version, and downloading GraphX packages compatible with Spark version;
Uploading GraphX packets to all nodes of the Spark cluster and placing the packets in a specified library directory;
Adjusting spark.executor.memory and spark.driver.memory parameters to allocate proper memory;
Setting spark.cores.max and spark.executor.instances parameters to configure core numbers and executor instances;
Data loading, loading data from HDFS, S3 or other storage systems using the method textFile or sequenceFile of Spark;
cleaning data, removing invalid or erroneous records, and converting the original data into vertex format and edge format;
Creating vertex RDD and edge RDD, and ensuring that the data type is consistent with GraphX requirements;
Creating vertex RDD, namely creating the vertex RDD by using a parallelize method of Spark, and distributing unique ID and corresponding attribute to each vertex;
creating an edge RDD, namely creating the edge RDD by using a parallelize method of Spark, and defining a source vertex, a target vertex and attributes of the edge;
Constructing a Graph object, transmitting the vertex RDD and the edge RDD to a Graph constructor, creating the Graph object, and setting default vertex attributes for the Graph;
Definition map calculation logic:
selecting a proper graph algorithm according to service requirements to realize algorithm logic;
Graph computation is performed using the GraphX API, calling GraphX's pregel and mapVertices operators;
writing user-defined functions to process the attributes of the vertexes and the edges;
Distributed graph computation:
Partitioning the graph, and selecting a proper partitioning strategy;
selecting an optimal partitioning strategy in consideration of characteristics of the graph;
and (3) parallel computing, namely dividing the graph data into a plurality of partitions by utilizing the distributed computing capability of GraphX, and processing the partitions in parallel on a cluster, monitoring the progress of the task and the use condition of resources, and ensuring the efficient operation of the computing task.
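GraphX itself exposes this vertex-centric computation through its Scala pregel API; purely as an illustration of the superstep model that API implements (the function name, the plain-Python data structures, and the connected-components example are assumptions for this sketch, not part of the invention), the parallel message-passing pattern described above can be written as:

```python
def pregel_components(vertices, edges, max_iter=20):
    """Pregel-style vertex-centric computation sketched in plain Python:
    in each superstep, every active vertex sends its current label to its
    neighbours, and each vertex adopts the minimum label it receives.
    Repeating until no vertex changes yields connected components."""
    label = {v: v for v in vertices}
    # undirected message passing: index neighbours over both edge directions
    adj = {v: [] for v in vertices}
    for s, d in edges:
        adj[s].append(d)
        adj[d].append(s)
    active = set(vertices)          # initially, every vertex runs its program
    for _ in range(max_iter):
        # message phase: active vertices propose their label to neighbours
        inbox = {}
        for v in active:
            for n in adj[v]:
                inbox[n] = min(inbox.get(n, label[n]), label[v])
        # vertex program: adopt a smaller label; stay active only if changed
        active = {v for v, m in inbox.items() if m < label[v]}
        for v in active:
            label[v] = inbox[v]
        if not active:              # vote to halt: no messages changed anything
            break
    return label

# two components: {1, 2, 3} and {4, 5}
labels = pregel_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

In GraphX the same logic would be partitioned across the cluster, with each partition processing its vertices' programs in parallel per superstep.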
The graph-based machine learning framework in the relational model includes modeling and predicting complex relationships between entities using a Graph Neural Network (GNN) that captures dependencies and dynamic changes between entities.
The data analysis and insight steps include using graph-based visualization techniques to present mined relationships and patterns as interactive atlases for in-depth analysis and understanding by the user.
A computer storage medium storing computer instructions which, when invoked, are operable to perform a graph-based big data relationship mining method as described above.
Compared with the prior art, the big data relationship mining method based on graph computation has the following beneficial effects. Deep learning models (Bi-LSTM combined with CRF for entity recognition, and BERT for relationship extraction) improve the model's understanding of complex language and relationships, raising the accuracy of relationship extraction; this remedies the deficiency of the prior art in semantic understanding and lets the model better capture deep relationships in text. A graph convolutional network (GCN) and a distributed graph computation framework (such as GraphX) are adopted to process large-scale graph data, resolving the performance bottleneck of traditional methods on big data, and distributed computation improves the efficiency of subgraph mining and pattern matching. Data deduplication and noise filtering techniques (including locality-sensitive hashing, statistical methods, and machine-learning methods) effectively improve data quality and provide more reliable input for subsequent mining. Finally, the mined relationships and patterns are presented interactively using graph-based visualization techniques, helping users understand complex relationships more intuitively and improving practical data analysis.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only exemplary, and a person skilled in the art can derive other implementation drawings from them without inventive effort.
FIG. 1 is a diagram of the steps of a big data relation mining method based on graph computation.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Examples
A big data relation mining method based on graph calculation, as shown in figure 1, comprises the following steps:
S1, preprocessing multi-source heterogeneous big data, including data cleaning, entity identification, relation extraction and type labeling, to generate graph structure data, wherein the preprocessing step includes semantic role labeling of text data by using a natural language processing technology to identify semantic relations among entities;
S2, mining frequent subgraphs and dynamic relations from the graph by utilizing a graph calculation technology, wherein the graph calculation technology comprises a pattern matching algorithm, and the pattern matching algorithm combines deep learning feature extraction and a traditional graph traversing technology to improve accuracy and efficiency of subgraph mining;
S3, constructing a relation model based on the mining result, and applying the relation model to a specific scene for data analysis and insight, wherein the relation model comprises a machine learning framework based on a graph and is used for training and predicting the development and change of complex relations between entities.
The data cleansing step includes using data deduplication and noise filtering techniques, the data deduplication using a locality sensitive hashing method, comprising the steps of:
converting each data record into a feature vector;
A hash function suitable for the data characteristics is chosen, such as MinHash for Jaccard similarity, random-projection LSH for cosine similarity, or p-stable LSH for Euclidean distance.
Mapping each of the feature vectors into one or more hash buckets using the selected hash function, each of the hash buckets containing a plurality of the feature vectors, the feature vectors of each of the hash buckets being relatively close in hash space;
for each new data record, calculating the hash value of the characteristic vector of the new data record, and mapping the hash value into the corresponding hash bucket;
Repeatedly detecting in the hash bucket, comparing the similarity of all the feature vectors in the same hash bucket, identifying records with the similarity higher than a set threshold value as repeated data, removing the identified repeated data from a data set, and reserving unique data records;
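The deduplication steps above can be sketched in Python as follows; this is an illustrative minimal implementation, in which the MinHash signature scheme, the MD5-seeded hash family, the banding parameters, and the whitespace tokenization are assumptions for the example, not prescribed by the method:

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the record's tokens."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes))

def lsh_deduplicate(records, num_hashes=32, bands=8, threshold=0.9):
    """Band the signatures into hash buckets, compare only records that
    share a bucket, and drop records whose Jaccard similarity with an
    earlier record exceeds the threshold."""
    rows = num_hashes // bands
    token_sets = [set(r.lower().split()) for r in records]
    sigs = [minhash_signature(t, num_hashes) for t in token_sets]
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(bands):                      # one bucket key per band
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    duplicates = set()
    for members in buckets.values():                # compare within buckets only
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = token_sets[members[i]], token_sets[members[j]]
                if len(a & b) / len(a | b) >= threshold:
                    duplicates.add(members[j])
    return [r for k, r in enumerate(records) if k not in duplicates]

records = [
    "alice 123 main st springfield",
    "alice 123 main st springfield",   # exact duplicate: same bucket, Jaccard 1.0
    "bob 45 oak ave shelbyville",
]
unique = lsh_deduplicate(records)
```

Identical records receive identical signatures and therefore always share a bucket, so the pairwise comparison is only ever performed inside small candidate groups rather than across the whole data set.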
the noise filtering technology comprises a method based on statistics and a method based on machine learning, and specifically comprises the following steps:
standard deviation analysis, calculating the standard deviation of each data, and identifying data points with larger deviation from the mean;
detecting abnormal distribution, namely identifying abnormal values in data by using a Box-plot (Box-plot);
A box plot (Box-plot), also known as a box-and-whisker plot, is a statistical chart for showing the distribution of a set of data. It provides summary statistics of the data: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum; the central tendency, degree of dispersion, and outliers of the data can be seen intuitively from the shape of the box and the length of the whiskers.
Applying the Z-Score method, namely treating data points with a Z-score greater than 3 or less than -3 as outliers;
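A minimal Python sketch of the two statistical filters above, the box-plot (IQR whisker) rule and the Z-score rule (the cutoff values follow the text; the sample data are invented for illustration):

```python
import statistics

def zscore_outliers(data, cutoff=3.0):
    """Flag points whose Z-score magnitude exceeds the cutoff (|z| > 3)."""
    mean = statistics.mean(data)
    std = statistics.stdev(data)
    return [x for x in data if abs((x - mean) / std) > cutoff]

def boxplot_outliers(data, k=1.5):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the whisker
    bounds of a standard box plot."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

# invented sample: 29 normal readings and one gross error
data = [11] * 29 + [100]
z_out = zscore_outliers(data)
box_out = boxplot_outliers(data)
```

Note that the Z-score rule needs a reasonably large sample: with n points, no Z-score can exceed (n-1)/sqrt(n), so a |z| > 3 cutoff can never fire on fewer than about ten points.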
the machine learning-based method uses local anomaly factors, specifically:
calculating the distance, namely calculating the neighborhood distance by using Euclidean distance, and calculating the distance between each data point and all other data points;
constructing local density, selecting k value, and determining the size (k value) of local neighborhood, namely k nearest neighbors of each data point;
calculating k distances, and for each data point, finding the maximum distance (k distance) among k nearest neighbors of the data point;
calculating a reachability distance, namely, for data points i and j, the larger of the k-distance of j and the actual distance between i and j;
calculating the local density of each data point, namely the inverse of the average reachability distance to its k nearest neighbors;
the relative density, namely the ratio of the local density of each data point to the local density of the data point in the neighborhood of the data point is calculated;
Calculating the LOF value of each data point, namely the average of its relative densities; the larger the LOF value, the more likely the data point is an outlier;
outliers are filtered and these outliers are removed or marked from the dataset.
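The LOF steps above (pairwise distances, k-distance, reachability distance, local density, relative density, LOF value) can be sketched directly in Python; this is an illustrative implementation on invented toy 2-D points, not production code:

```python
import math

def lof_scores(points, k=3):
    """Local Outlier Factor: ratio of each point's neighbours' density
    to its own. Steps mirror the text: pairwise distances -> k nearest
    neighbours and k-distance -> reachability distance -> local
    (reachability) density -> LOF as the mean relative density."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point (index 0 is the point itself)
    knn = [sorted(range(n), key=lambda j: dist[i][j])[1:k + 1]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]   # distance to k-th neighbour

    def reach(i, j):
        # reachability distance of i w.r.t. j: max(k-distance(j), d(i, j))
        return max(k_dist[j], dist[i][j])

    # local reachability density: inverse of mean reachability distance
    lrd = [k / sum(reach(i, j) for j in knn[i]) for i in range(n)]
    # LOF: average ratio of neighbours' density to the point's own density
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]

# four clustered points and one isolated point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(pts, k=3)
```

Points deep inside a cluster score close to 1, while the isolated point's LOF is much larger, so a threshold slightly above 1 separates inliers from outliers on this toy data.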
The entity identification step comprises using a named entity recognition (NER) model based on deep learning, wherein the recognition model combines a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF); the specific method comprises the following steps:
And acquiring a marked text data set which contains marked entity information such as a person name, a place name and an organization name.
Constructing a vocabulary and a tag set, wherein the vocabulary maps words in the text to unique index values;
The tag set defines an entity tag class, e.g. "B-PER" indicates the beginning of a person name and "I-PER" indicates the inside of a person name.
Converting words into vectors using pre-trained word embeddings (e.g., Word2Vec, GloVe) or custom word embeddings;
Bi-LSTM network:
an input layer embedding words as input to the Bi-LSTM layer, the Bi-LSTM layer capturing contextual information using Bi-directional LSTM;
CRF layer:
Sequence labeling, namely performing sequence labeling on Bi-LSTM output by using a CRF layer, and optimizing entity boundary identification;
Model training, using the negative log-likelihood of the CRF as the loss function, training the model with the Adam optimization algorithm, dividing the data into a training set and a validation set, performing model training, and tuning the hyperparameters to optimize performance.
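At prediction time, the CRF layer's job is to decode the best tag sequence from the Bi-LSTM's per-token scores via the Viterbi algorithm. A minimal decoder sketch follows; the emission and transition scores are hand-set and purely illustrative (an untrained stand-in for real Bi-LSTM outputs and learned CRF parameters), using the B-PER/I-PER/O tag scheme mentioned above:

```python
def viterbi_decode(emissions, transitions, tags):
    """CRF decoding: find the tag sequence maximising the sum of
    per-token emission scores and tag-to-tag transition scores."""
    n_tags = len(tags)
    score = list(emissions[0])      # best path score ending in each tag
    back = []                       # back-pointers per position
    for em in emissions[1:]:
        new_score, ptr = [], []
        for t in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda p: score[p] + transitions[p][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + em[t])
            ptr.append(best_prev)
        score, back = new_score, back + [ptr]
    # follow back-pointers from the best final tag
    t = max(range(n_tags), key=lambda x: score[x])
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return [tags[t] for t in reversed(path)]

tags = ["O", "B-PER", "I-PER"]
# illustrative transition scores: O -> I-PER is strongly penalised,
# which is how the CRF enforces valid entity boundaries
T = [[0.0, 0.0, -10.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
# hypothetical per-token emission scores for a 4-token sentence
E = [[2.0, 0.0, 0.0],
     [0.0, 1.5, 1.4],
     [0.0, 0.5, 1.0],
     [2.0, 0.0, 0.0]]
best = viterbi_decode(E, T, tags)
```

Even though token 3's emissions slightly favour I-PER on their own, the transition penalty steers the decoder to the globally consistent sequence, which is exactly the entity-boundary optimization the CRF layer provides over per-token classification.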
The relationship extraction step includes using a deep learning based relationship extraction model that uses a pre-trained language model (e.g., BERT) for relationship prediction, comprising the steps of:
A large-scale corpus containing entity pairs and relationship labels is acquired, such as the ACE (Automatic Content Extraction) and SemEval datasets.
Arranging the text into a format suitable for model input, and dividing the data set into a training set, a verification set and a test set;
Text tokenization, using a word segmenter (Tokenizer) of BERT to segment text into vocabulary indices;
generating an input, creating an input sample for each entity pair, comprising:
token IDs, vocabulary index of text;
Attention Masks, indicating which locations should be attended by the model;
Token Type IDs are used to distinguish between two different text portions, such as questions and answers.
Model construction, loading of pre-trained BERT, selection of appropriate BERT variants, such as BERT-base or BERT-large.
The BERT layer is used for extracting context information in the text;
The relation classification head adds a fully connected layer on top of the [CLS] token output by BERT for relation classification;
defining a loss function, and optimizing classification tasks by using the cross entropy loss function;
And (3) performing model training, inputting a new text into the trained model, performing relationship prediction, and mapping the relationship category output by the model back to an actual relationship label.
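The three inputs listed above (Token IDs, Attention Masks, Token Type IDs) can be illustrated with a toy whitespace vocabulary standing in for BERT's WordPiece Tokenizer; the vocabulary, the helper name, and the sample text pair are invented for this sketch:

```python
def encode_pair(text_a, text_b, vocab, max_len=16):
    """Build the three BERT-style input lists for a text pair:
    [CLS] text_a [SEP] text_b [SEP], padded to max_len.
    Token Type IDs are 0 for the first segment, 1 for the second."""
    toks_a = text_a.lower().split()
    toks_b = text_b.lower().split()
    tokens = ["[CLS]"] + toks_a + ["[SEP]"] + toks_b + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # Token IDs
    mask = [1] * len(ids)                                  # Attention Mask
    type_ids = [0] * (len(toks_a) + 2) + [1] * (len(toks_b) + 1)
    pad = max_len - len(ids)                               # pad with [PAD]=0
    return ids + [0] * pad, mask + [0] * pad, type_ids + [0] * pad

# invented toy vocabulary; a real model would use its pre-trained WordPiece vocab
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "acme": 4, "founded": 5, "by": 6, "alice": 7}
ids, mask, types = encode_pair("acme founded by alice", "alice", vocab)
```

The attention mask zeroes out the padding positions so the model attends only to real tokens, and the type IDs let BERT distinguish the two text portions of the pair.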
The pattern matching algorithm in the graph computation technology comprises feature learning and pattern recognition on graph structure data using a graph convolutional network (GCN), wherein the GCN is used to learn local and global features of nodes; the specific steps are as follows:
Graph data collection, namely acquiring graph structure data with nodes and edges (the data may be derived from a social network, molecular structures, or another dataset in graph form);
Data formatting, namely converting the data into a format suitable for GCN input, wherein the data comprises a node characteristic matrix and an adjacent matrix, and standardizing node characteristics so as to improve training effect;
Performing adjacency matrix processing, namely normalizing the adjacency matrix symmetrically using the inverse square root of the degree matrix;
convolution operation is carried out, wherein convolution operation is applied to each node and the neighborhood thereof, and information of neighbor nodes is aggregated;
Activating a function, namely using a ReLU activating function to increase the nonlinear expression capacity of the model;
stacking GCN layers, and stacking a plurality of graph convolution layers according to the requirement so as to capture the multi-level characteristics of the nodes;
the output layer is used for adding an appropriate output layer, such as a full connection layer or a pooling layer, for the subgraph mining task, and training the built model;
extracting features of nodes and subgraphs from GCN output;
sub-graph matching, namely identifying and extracting sub-graphs conforming to a specific mode by comparing the extracted features with a target mode;
The trained GCN model is deployed in a production environment and integrated into practical application, and real-time sub-graph matching and pattern recognition are carried out by using the model, so that data analysis and decision support are provided.
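The adjacency normalization, convolution and ReLU steps above can be sketched as a single GCN layer; the toy graph, input features and weight matrix below are hypothetical, and a practical system would use a dedicated library (e.g. PyTorch Geometric) with learned weights:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1)                 # degree of each node
    D_inv_sqrt = np.diag(deg ** -0.5)       # inverse square root of degree matrix
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)  # aggregate neighbours, then ReLU

# Hypothetical 3-node path graph, 2-dim node features, fixed weights.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)                            # simple one-hot-style features
W = np.array([[1., 0.], [0., 1.]])
H1 = gcn_layer(A, H, W)                     # stacking further calls yields deeper GCNs
```

Stacking several such calls corresponds to the multi-layer stacking step, and the resulting row vectors of the final layer are the node features used for subgraph matching.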
The graph traversal technique includes processing large-scale graph data using a distributed graph computation framework (GraphX) to accelerate the subgraph mining process;
Confirming Spark cluster version, and downloading GraphX packages compatible with Spark version;
Uploading GraphX packets to all nodes of the Spark cluster and placing the packets in a specified library directory;
Adjusting spark.executor.memory and spark.driver.memory parameters to allocate proper memory;
Setting spark.cores.max and spark.executor.instances parameters to configure core numbers and executor instances;
Data loading, loading data from HDFS, S3 or other storage systems using Spark's textFile or sequenceFile methods;
Cleaning the data, removing invalid or erroneous records, and converting the raw data into the vertex format (VertexId, VD) and the edge format Edge[ED];
Creating vertex RDD and edge RDD, and ensuring that the data type is consistent with GraphX requirements;
Creating vertex RDD, namely creating the vertex RDD by using a parallelize method of Spark, and distributing unique ID and corresponding attribute to each vertex;
creating an edge RDD, namely creating the edge RDD by using a parallelize method of Spark, and defining a source vertex, a target vertex and attributes of the edge;
Constructing a Graph object, transmitting the vertex RDD and the edge RDD to a Graph constructor, creating the Graph object, and setting default vertex attributes for the Graph;
Definition map calculation logic:
selecting a proper graph algorithm according to service requirements to realize algorithm logic, such as defining an iterative calculation process of PageRank;
Graph computation is performed using the GraphX API, calling GraphX's pregel and mapVertices operators;
writing user-defined functions to process the attributes of the vertexes and the edges;
Distributed graph computation
Partitioning the graph, selecting a proper partitioning strategy, such as CanonicalRandomVertexCut, to optimize computing performance;
selecting an optimal partitioning strategy in consideration of characteristics of the graph;
for highly connected graphs, CanonicalRandomVertexCut generally performs well;
for sparse or dense graphs, EdgeCut may be more suitable for dense graphs, while VertexCut suits sparse graphs.
Parallel computing, namely dividing the graph data into a plurality of partitions by utilizing the distributed computing capability of GraphX and processing the partitions in parallel on the cluster; monitoring task progress and resource usage to ensure that the computing task runs efficiently.
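GraphX itself exposes a Scala/JVM API; as a single-machine illustration only, the plain-Python loop below mirrors the message and vertex-program phases of the Pregel-style PageRank iteration that GraphX's pregel operator distributes across partitions. The toy graph is hypothetical:

```python
# Single-machine sketch of a Pregel-style PageRank iteration. In GraphX
# the message phase and vertex-update phase below would run in parallel
# over graph partitions; here they are plain Python for illustration.

def pagerank(edges, num_iters=20, damping=0.85):
    """edges: list of (src, dst) pairs; returns {vertex: rank}."""
    vertices = {v for e in edges for v in e}
    out_deg = {v: 0 for v in vertices}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # "Message" phase: each vertex sends rank/out_degree along its edges.
        msgs = {v: 0.0 for v in vertices}
        for src, dst in edges:
            msgs[dst] += rank[src] / out_deg[src]
        # "Vertex program" phase: combine received messages into the new rank.
        rank = {v: (1 - damping) + damping * msgs[v] for v in vertices}
    return rank

ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

Vertex "c" receives contributions along two edges and therefore ends with the highest rank, matching the intuition behind the iterative PageRank logic mentioned above.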
The graph-based machine learning framework in the relational model includes modeling and predicting complex relationships between entities using a Graph Neural Network (GNN) that captures dependencies and dynamic changes between entities.
The data analysis and insight steps include using graph-based visualization techniques to present mined relationships and patterns as interactive atlases for in-depth analysis and understanding by the user.
A computer storage medium storing computer instructions which, when invoked, are operable to perform a graph-based big data relationship mining method as described above.
In the specific implementation process, step S1 is to clean data and generate a graph structure
Converting each data record into a feature vector; performing de-duplication by using a locality-sensitive hashing (LSH) method: selecting a hash function suited to the data features, mapping the feature vectors into hash buckets, calculating the hash value of each new data record and mapping it into a hash bucket, performing duplicate detection within each hash bucket, identifying and removing duplicate data by calculating similarity, and retaining a unique record.
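The LSH de-duplication procedure above can be sketched as follows; the fixed hyperplanes, feature vectors and 0.95 similarity threshold are illustrative assumptions (a production system would draw random hyperplanes suited to the feature space):

```python
def lsh_signature(vec, planes):
    """Bucket key: the sign pattern of vec against a set of hyperplanes."""
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v))

def dedup(records, planes, threshold=0.95):
    """Keep a record only if no already-kept record in the same LSH
    bucket has cosine similarity above the threshold."""
    buckets, kept = {}, []
    for rec in records:
        key = lsh_signature(rec, planes)
        if all(cosine(rec, other) < threshold for other in buckets.get(key, [])):
            buckets.setdefault(key, []).append(rec)
            kept.append(rec)
    return kept

# Fixed hyperplanes for reproducibility; real LSH draws them at random.
planes = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, -1, 0]]
records = [[1.0, 0.0, 0.0],    # original record
           [0.99, 0.01, 0.0],  # near-duplicate of the first record
           [0.0, 1.0, 0.0]]    # distinct record
unique = dedup(records, planes)
```

The near-duplicate lands in the same bucket as the original and is removed by the within-bucket similarity check, while the distinct record hashes to a different bucket and is retained.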
Noise filtering:
Statistical methods and machine learning methods: in the statistical methods, outliers are identified and removed using standard deviation analysis, box plots and the Z-Score method; in the machine learning method, the local outlier factor (LOF) of each data point is calculated, a threshold is set, and outliers whose LOF values exceed the threshold are removed.
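The standard-deviation/Z-Score portion of the noise filtering can be sketched as below; the data set and the default threshold of 3 are illustrative assumptions, and the LOF-based step described above would additionally weigh each point's local density:

```python
from statistics import mean, stdev

def zscore_filter(values, z_max=3.0):
    """Statistical noise filtering: drop points whose Z-score magnitude
    exceeds z_max."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs((v - mu) / sigma) <= z_max]

# Hypothetical measurements with one gross outlier (120.0).
data = [10.1, 9.8, 10.0, 10.2, 9.9] * 4 + [120.0]
clean = zscore_filter(data)
```

On this toy data the outlier's Z-score is well above 3 and it is removed, while the remaining points all lie within a fraction of one standard deviation of the mean.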
Entity identification and relationship extraction:
Entity recognition, namely performing named entity recognition by using a deep learning model combining a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF): constructing a vocabulary and a tag set, converting words into vectors by using pre-trained or custom word embeddings, capturing context information with the Bi-LSTM layer, performing sequence labeling with the CRF layer, and optimizing entity boundary recognition.
And extracting the relation, namely performing relation prediction by using a BERT-based deep learning model, arranging texts into a format suitable for model input, extracting context information by using the BERT, and adding a full-connection layer to perform relation classification.
Graph structure data is generated, and the preprocessed data (including entities and relationships) is converted into graph structure data, including nodes (entities) and edges (relationships).
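Converting the extracted entities and relations into graph structure data can be sketched as follows; the (head, relation, tail) triples shown are hypothetical examples of the extraction output:

```python
from collections import defaultdict

def build_graph(triples):
    """Turn (head, relation, tail) triples from the extraction steps
    into a node set and a relation-labelled adjacency structure."""
    nodes = set()
    adjacency = defaultdict(list)   # node -> [(neighbour, relation), ...]
    for head, rel, tail in triples:
        nodes.update((head, tail))  # entities become nodes
        adjacency[head].append((tail, rel))  # relations become labelled edges
    return nodes, dict(adjacency)

triples = [("Alice", "works_at", "Acme"),
           ("Bob", "works_at", "Acme"),
           ("Alice", "knows", "Bob")]
nodes, edges = build_graph(triples)
```

The resulting node set and edge map are the graph structure data consumed by the graph computation steps of step S2.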
Step S2, graph calculation and frequent subgraph mining
The graph computation technique, using a graph convolutional network (GCN), is to collect and format graph data, create the node feature matrix and adjacency matrix, normalize the adjacency matrix, apply the convolution operation and ReLU activation function in the GCN, stack multiple GCN layers to capture multi-level features of the nodes, extract node and subgraph features from the GCN output, and perform pattern matching and subgraph recognition.
The pattern matching algorithm uses the graph convolutional network (GCN) for pattern recognition, identifies subgraphs matching the target pattern, and extracts frequent subgraphs.
Step S3, construction and application of relation model
A relationship model is built, complex relationships between entities are modeled and predicted by using a Graph Neural Network (GNN), and the graph neural network is trained to capture the dependency relationships and dynamic changes between the entities.
And (3) applying the model, and applying the trained relation model to a specific scene to perform data analysis and insight.
Step S4, data analysis and visualization
And (3) data analysis, namely, using a visualization technology based on a graph to present the mined relation and pattern as an interactive map, and enabling a user to conduct deep analysis through the map so as to understand the complex relation and pattern in the data.
Visualization techniques, selecting appropriate graph visualization tools and platforms (e.g., Gephi, Cytoscape);
The interactive interface is designed to allow the user to query and analyze nodes and edges in the graph, and provide filtering, scaling and layout functions of the graph for flexible exploration and analysis by the user.
The same or similar reference numerals correspond to the same or similar components;
The terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
It is to be understood that the above-described embodiments of the present invention are provided by way of illustration only and not as limitations of the embodiments of the present invention, and that various other changes and modifications may be made by one skilled in the art based on the above description, without the necessity of or without intending to be exhaustive of all embodiments, and any modifications, equivalents, improvements and modifications etc. within the spirit and principles of the present invention are intended to be included within the scope of the appended claims.

Claims (9)

1. The big data relation mining method based on graph calculation is characterized by comprising the following steps of:
S1, preprocessing multi-source heterogeneous big data, including data cleaning, entity identification, relation extraction and type labeling, to generate graph structure data, wherein the preprocessing step includes semantic role labeling of text data by using a natural language processing technology to identify semantic relations among entities;
S2, mining frequent subgraphs and dynamic relations from the graph by utilizing a graph calculation technology, wherein the graph calculation technology comprises a pattern matching algorithm, and the pattern matching algorithm combines deep learning feature extraction and a traditional graph traversing technology to improve accuracy and efficiency of subgraph mining;
s3, constructing a relation model based on the mining result, and applying the relation model to a specific scene for data analysis and insight, wherein the relation model comprises a machine learning framework based on a graph and is used for training and predicting the development and change of complex relations between entities.
2. The graph-based large data relationship mining method of claim 1, wherein the data cleansing step includes using data deduplication and noise filtering techniques, the data deduplication using a locality sensitive hashing method, comprising the steps of:
converting each data record into a feature vector;
selecting a hash function suitable for the data characteristics;
mapping each of the feature vectors into one or more hash buckets using the selected hash function, each of the hash buckets containing a plurality of the feature vectors, the feature vectors of each of the hash buckets being relatively close in hash space;
for each new data record, calculating the hash value of the characteristic vector of the new data record, and mapping the hash value into the corresponding hash bucket;
Repeatedly detecting in the hash bucket, comparing the similarity of all the feature vectors in the same hash bucket, identifying records with the similarity higher than a set threshold value as repeated data, removing the identified repeated data from a data set, and reserving unique data records;
the noise filtering technology comprises a method based on statistics and a method based on machine learning, and specifically comprises the following steps:
standard deviation analysis, calculating the standard deviation of each data, and identifying data points with larger deviation from the mean;
detecting abnormal distribution, namely identifying abnormal values in the data by using a box graph;
applying the Z-Score method, namely treating data points with a Z-Score greater than 3 or less than -3 as outliers;
the machine learning-based method uses local anomaly factors, specifically:
calculating the distance, namely calculating the neighborhood distance by using Euclidean distance, and calculating the distance between each data point and all other data points;
constructing local density, selecting k value, and determining the size of local neighborhood, namely k nearest neighbors of each data point;
Calculating k distances, and for each data point, finding the maximum distance among k nearest neighbors;
calculating a reachable distance, and for data points i and j, calculating the reachable distance, wherein the reachable distance is defined as the larger of the k-distance of j and the distance between i and j;
calculating the local density of each data point, namely the density of the data points in k neighborhood of the local density;
the relative density, namely the ratio of the local density of each data point to the local density of the data point in the neighborhood of the data point is calculated;
LOF value, calculating LOF value of each data point, namely average value of relative density, setting threshold value according to LOF value, and regarding the data point with LOF value exceeding the threshold value as abnormal point;
outliers are filtered and these outliers are removed or marked from the dataset.
3. The graph-based big data relation mining method according to claim 2, wherein the entity recognition step includes using a named entity recognition model based on deep learning, wherein the recognition model is a model combining a bidirectional long short-term memory network and a conditional random field, and the specific method steps are as follows:
The method comprises the steps of obtaining a marked text data set which contains marked entity information, constructing a vocabulary and a tag set, wherein the vocabulary maps vocabulary in a text to a unique index value;
The tag set defines an entity tag class;
Converting words into vectors using a pre-trained word embedding or custom word embedding layer;
Bi-LSTM network:
an input layer embedding words as input to the Bi-LSTM layer, the Bi-LSTM layer capturing contextual information using Bi-directional LSTM;
CRF layer:
Sequence labeling, namely performing sequence labeling on Bi-LSTM output by using a CRF layer, and optimizing entity boundary identification;
Model training, using the log likelihood of the CRF as a loss function, training the model using an Adam optimization algorithm, dividing the data into a training set and a verification set, performing model training, and adjusting the super parameters to optimize performance.
4. A graph computation based big data relationship mining method according to claim 3, wherein the relationship extraction step includes using a deep learning based relationship extraction model that uses a pre-trained language model for relationship prediction, comprising the steps of:
acquiring a large-scale corpus containing entity pairs and relationship labels;
arranging the text into a format suitable for model input, and dividing the data set into a training set, a verification set and a test set;
Marking a text, and dividing the text into vocabulary indexes by using a word segmentation device of BERT;
generating an input, creating an input sample for each entity pair, comprising:
token IDs, vocabulary index of text;
Attention Masks, indicating which locations should be attended by the model;
token Type IDs to distinguish two different text portions;
model construction, loading a pre-training BERT, and selecting a proper BERT variant;
the BERT layer is used for extracting context information in the text;
The relation classification head adds a fully connected layer on the [CLS] token output by BERT for relation classification;
defining a loss function, and optimizing classification tasks by using the cross entropy loss function;
And (3) performing model training, inputting a new text into the trained model, performing relationship prediction, and mapping the relationship category output by the model back to an actual relationship label.
5. The graph-computing-based big data relation mining method according to claim 4, wherein the pattern matching algorithm in the graph computing technology includes feature learning and pattern recognition on graph structure data by using a graph convolutional network, wherein the GCN is used for learning local and global features of nodes, and the specific steps are as follows:
collecting graph data, namely acquiring graph structure data with nodes and edges;
Data formatting, namely converting the data into a format suitable for GCN input, including a node feature matrix and an adjacency matrix, and standardizing node features so as to improve the training effect;
Performing adjacency matrix processing, namely normalizing the adjacency matrix by multiplying it on both sides by the inverse square root of the degree matrix;
convolution operation is carried out, wherein convolution operation is applied to each node and the neighborhood thereof, and information of neighbor nodes is aggregated;
Activating a function, namely using a ReLU activating function to increase the nonlinear expression capacity of the model;
stacking GCN layers, and stacking a plurality of graph convolution layers according to the requirement so as to capture the multi-level characteristics of the nodes;
the output layer is used for adding an appropriate output layer, such as a full connection layer or a pooling layer, for the subgraph mining task, and training the built model;
extracting features of nodes and subgraphs from GCN output;
sub-graph matching, namely identifying and extracting sub-graphs conforming to a specific mode by comparing the extracted features with a target mode;
The trained GCN model is deployed in a production environment and integrated into practical application, and real-time sub-graph matching and pattern recognition are carried out by using the model, so that data analysis and decision support are provided.
6. The graph computation based big data relationship mining method of claim 5, wherein the graph traversal technique includes processing the large-scale graph data with a distributed graph computation framework to accelerate the subgraph mining process;
Confirming Spark cluster version, and downloading GraphX packages compatible with Spark version;
Uploading GraphX packets to all nodes of the Spark cluster and placing the packets in a specified library directory;
Adjusting spark.executor.memory and spark.driver.memory parameters to allocate proper memory;
Setting spark.cores.max and spark.executor.instances parameters to configure core numbers and executor instances;
Data loading, loading data from HDFS, S3 or other storage systems using the method textFile or sequenceFile of Spark;
cleaning data, removing invalid or erroneous records, and converting the original data into vertex format and edge format;
Creating vertex RDD and edge RDD, and ensuring that the data type is consistent with GraphX requirements;
Creating vertex RDD, namely creating the vertex RDD by using a parallelize method of Spark, and distributing unique ID and corresponding attribute to each vertex;
creating an edge RDD, namely creating the edge RDD by using a parallelize method of Spark, and defining a source vertex, a target vertex and attributes of the edge;
Constructing a Graph object, transmitting the vertex RDD and the edge RDD to a Graph constructor, creating the Graph object, and setting default vertex attributes for the Graph;
Definition map calculation logic:
selecting a proper graph algorithm according to service requirements to realize algorithm logic;
Graph computation is performed using the GraphX API, calling GraphX's pregel and mapVertices operators;
writing user-defined functions to process the attributes of the vertexes and the edges;
Calculating a distributed graph, partitioning the graph, and selecting a proper partition strategy;
selecting an optimal partitioning strategy in consideration of characteristics of the graph;
parallel computing, namely dividing the graph data into a plurality of partitions by utilizing the distributed computing capability of GraphX and processing the partitions in parallel on the cluster; monitoring task progress and resource usage to ensure that the computing task runs efficiently.
7. The graph-based computational big data relationship mining method of claim 6, wherein the graph-based machine learning framework in the relationship model includes modeling and predicting complex relationships between entities using a graph neural network, wherein the graph neural network is used to capture dependency relationships and dynamic changes between entities.
8. The graph-based computed big data relationship mining method of claim 7, wherein the data analysis and insight step includes using graph-based visualization techniques to present mined relationships and patterns as interactive graphs for in-depth analysis and understanding by the user.
9. A computer storage medium storing computer instructions which, when invoked, are operable to perform a graph-based big data relationship mining method as claimed in any of claims 1-8.
CN202411304787.0A 2024-09-19 2024-09-19 A big data relationship mining method based on graph computing Active CN119312810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411304787.0A CN119312810B (en) 2024-09-19 2024-09-19 A big data relationship mining method based on graph computing

Publications (2)

Publication Number Publication Date
CN119312810A true CN119312810A (en) 2025-01-14
CN119312810B CN119312810B (en) 2025-05-30

Family

ID=94189926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411304787.0A Active CN119312810B (en) 2024-09-19 2024-09-19 A big data relationship mining method based on graph computing

Country Status (1)

Country Link
CN (1) CN119312810B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413707A (en) * 2019-07-22 2019-11-05 百融云创科技股份有限公司 The excavation of clique's relationship is cheated in internet and checks method and its system
US20230007965A1 (en) * 2020-03-23 2023-01-12 Zhejiang University Entity relation mining method based on biomedical literature
CN117453802A (en) * 2023-11-16 2024-01-26 中国电子科技集团公司第十研究所 A multi-domain cross-chapter event causality mining and determination device and method
CN117591944A (en) * 2024-01-19 2024-02-23 广东工业大学 A learning early warning method and system for big data analysis
CN118656410A (en) * 2024-06-24 2024-09-17 武汉安天信息技术有限责任公司 Entity relationship mining method, storage medium and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tang Dequan: "Research and Application of Graph-Pattern-Based Data Mining Algorithms", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 2022, 15 April 2022 (2022-04-15) *
Yan Peng; Gao Jianling: "Research on the Application of Graph Data Mining in Social Networks", Electronics World, no. 08, 23 April 2016 (2016-04-23) *

Also Published As

Publication number Publication date
CN119312810B (en) 2025-05-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant