CN119557607B - Data tracing method and system based on big data and blockchain multidimensional features

Info

Publication number: CN119557607B
Application number: CN202510128479.5A
Authority: CN
Inventors: 刘柱; 潘玲玲; 宁海元; 闵佳
Original assignee: Hangzhou Daishu Technology Co ltd
Current assignee: Hangzhou Daishu Technology Co ltd
Priority date: 2025-02-05
Filing date: 2025-02-05
Publication date: 2025-03-28
Anticipated expiration: 2045-02-05
Also published as: CN119557607A

Abstract

The application discloses a data tracing method and system based on big data and blockchain multidimensional features, belonging to the technical field of big data. The method comprises: obtaining a target data set; calculating the responsibility value of each related node and the interaction weights between adjacent nodes, and constructing a multidimensional feature matrix; inputting the multidimensional feature matrix into a graph convolution network for feature fusion, and constructing a tracing graph with the related nodes as vertices and the interaction weights as edges; calculating the responsibility propagation probability of each node in the tracing graph and determining abnormal data; determining responsibility nodes according to the transmission paths of the abnormal data and calculating the final responsibility score of each responsibility node; and performing responsibility attribution according to the final responsibility scores and constructing a node-edge weight visualization model from the attribution results for visual tracing. The method addresses the uncertainty of responsibility attribution in traditional schemes and improves the accuracy and practicability of tracing.

Description

Data tracing method and system based on big data and blockchain multidimensional features

Technical Field

The invention relates to the technical field of big data, in particular to a data tracing method and system based on big data and multi-dimensional characteristics of a blockchain.

Background

With the rapid development of information technology, big data technology has been widely applied across industries, enabling the integration and analysis of massive, multi-source and heterogeneous data. This innovation has greatly improved the intelligence and decision support capabilities of complex systems. However, in practical applications, problems such as data inconsistency, ambiguous sources and error propagation seriously affect the reliability of systems and the accuracy of analysis results.

The diversity and complexity of data makes the data extremely prone to bias and error during acquisition, transmission, processing, and analysis. These problems originate not only from defects in the data itself, but may also be closely related to individual links in the data transmission path. Therefore, how to effectively track the source, transmission path and change process of data becomes the key to ensure the data quality and the accuracy of analysis results.

To address these problems, the industry has proposed the concept of data tracing, which aims to monitor and manage the full life cycle of data by establishing a complete data tracing mechanism. Existing data tracing methods generally comprise the steps of data acquisition and preprocessing, multidimensional feature modeling, dynamic tracing path construction, anomaly detection and responsibility attribution, visual tracing, tracing result storage and the like.

In the data acquisition and preprocessing stage, the system acquires data stream information from a plurality of heterogeneous data sources, performs cleaning and format normalization processing, and generates a unique identifier for each piece of data for subsequent tracking. The multidimensional feature modeling is to construct a multidimensional feature matrix based on the data content, the transmission path and the contextual features, perform feature embedding on the data by using a deep learning model, and represent the multidimensional features of the data in the form of low-dimensional vectors so as to provide a basis for subsequent traceability analysis.

In the dynamic tracing path construction stage, the system adopts a graph-structured data tracing model, establishes a dynamic tracing path from the data source to the target node, and uses an improved graph neural network to extract relation features among paths, thereby tracking the data flow globally. This step is critical to understanding how the data propagates and changes along its path.

In the stage of abnormality detection and responsibility attribution, the system establishes a data flow abnormality detection mechanism based on historical data and a characteristic model, and when abnormal data is detected, responsibility nodes are identified through a path backtracking algorithm and a data traceability report is generated. However, when the data transfer path includes multiple possible responsible nodes, the responsibility attribution based on the simple backtracking algorithm may not accurately determine a single responsible party, which is a major challenge faced by current data tracing technologies.

Finally, the visual tracing and tracing result storage stage displays the data flow and tracing path through the graphical interface, supports the query, storage and history comparison analysis of tracing results, and provides an intuitive and convenient tool for data quality management and decision making.

In summary, although the existing data tracing method solves the problems of inconsistent data, unknown source, error propagation and the like to a certain extent, the existing data tracing method still has a defect in the accuracy of responsibility attribution.

Disclosure of Invention

The invention aims to provide a data tracing method and system based on big data and blockchain multidimensional features, so as to solve the problem that responsibility attribution is not accurate enough during data tracing in the prior art.

In order to achieve the above purpose, the present application adopts the following technical scheme:

In a first aspect, the present application provides a data tracing method based on big data and blockchain multidimensional features, comprising the steps of:

acquiring a target data set, wherein the set comprises a plurality of target data records, and context environment states of each target data record and historical behavior records of related nodes of each target data record, and the target data records comprise data content and a transmission path;

calculating the responsibility value of each related node and the interaction weight between adjacent nodes based on the target data record, the context environment state and the historical behavior record, and constructing a multidimensional feature matrix according to the data content, the transmission path, the context environment state, the historical behavior record and the interaction weight;

inputting the multidimensional feature matrix into a graph convolution network to perform feature fusion to obtain a global feature graph, and constructing, according to the global feature graph, a tracing graph which takes the related nodes as vertices and the interaction weights as edges;

calculating the responsibility propagation probability of each node in the traceability graph according to the responsibility values, and determining abnormal data in the target data set by adopting a multi-level detection mechanism;

determining responsibility nodes in the traceability graph according to the transmission path of the abnormal data, and calculating the final responsibility score of each responsibility node according to the responsibility propagation probability and the historical behavior record;

and carrying out responsibility attribution according to the final responsibility score, and constructing a node-edge weight visualization model according to the responsibility attribution result so as to carry out visual tracing.

In a second aspect, the present application provides a data tracing system based on big data and blockchain multidimensional features, comprising:

The acquisition module is used for acquiring a target data set, wherein the set comprises a plurality of target data records, the context environment state of each target data record and the historical behavior records of the related nodes of each target data record, and the target data records comprise data content and a transmission path;

The first construction module is used for calculating the responsibility value of each related node and the interaction weight between adjacent nodes based on the target data record, the context environment state and the historical behavior record, and constructing a multidimensional feature matrix according to the data content, the transmission path, the context environment state, the historical behavior record and the interaction weight;

The second construction module is used for inputting the multidimensional feature matrix into a graph convolution network to perform feature fusion to obtain a global feature graph, and constructing, according to the global feature graph, a tracing graph which takes the related nodes as vertices and the interaction weights as edges;

the detection module is used for calculating the responsibility propagation probability of each node in the traceability graph according to the responsibility values and determining abnormal data in the target data set by adopting a multi-level detection mechanism;

The calculation module is used for determining the responsibility nodes in the traceability graph according to the transmission path of the abnormal data and calculating the final responsibility score of each responsibility node according to the responsibility propagation probability and the historical behavior record;

The visualization module is used for carrying out responsibility attribution according to the final responsibility score, and constructing a node-edge weight visualization model according to the responsibility attribution result so as to carry out visual tracing.

In a third aspect, the present application provides a computer-readable storage medium storing a computer program, which when executed by a computer, implements a data tracing method based on big data and blockchain multidimensional features as described in any of the above.

The invention has the following beneficial effects:

The final responsibility score of each node is calculated based on a dynamic responsibility distribution model, hierarchical attribution is supported, and intuitive visualization results are generated. This solves the problem of uncertain responsibility attribution in traditional schemes and improves the accuracy and practicability of the tracing method.

Drawings

In order to illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a data tracing method based on big data and blockchain multidimensional features provided by an embodiment of the application;

FIG. 2 is a diagram of a multi-dimensional feature matrix relationship provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a dynamic trace-source path provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a visual interface provided by an embodiment of the present application;

FIG. 5 is a structural diagram of a data tracing system based on big data and blockchain multidimensional features provided by an embodiment of the present application.

Detailed Description

In order to make the technical scheme of the application clearer, the application is further described in detail below with reference to the attached drawings and specific embodiments. The terms "first", "second" and the like in the claims and the description of the application are used to distinguish between similar objects and do not necessarily describe a particular sequential or chronological order; terms so used may be interchanged where appropriate and merely describe how objects of the same nature are distinguished in the embodiments of the application. Furthermore, the terms "comprise" and "have" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, system, article, or apparatus.

In one embodiment, as shown in fig. 1, the present application provides a data tracing method based on big data and multi-dimensional characteristics of blockchain, comprising the following steps:

S110, acquiring a target data set, wherein the set comprises a plurality of target data records, the context environment state of each target data record and the historical behavior record of the relevant node, and the target data records comprise data content and a transmission path;

S120, calculating the responsibility value of each related node and the interaction weight between adjacent nodes based on the target data record, the context environment state and the historical behavior record, and constructing a multidimensional feature matrix according to the data content, the transmission path, the context environment state, the historical behavior record and the interaction weight;

S130, inputting the multidimensional feature matrix into a graph convolution network to perform feature fusion to obtain a global feature graph, and constructing, according to the global feature graph, a tracing graph which takes the related nodes as vertices and the interaction weights as edges;

S140, calculating the responsibility propagation probability of each node in the traceability graph according to the responsibility values, and determining abnormal data in the target data set by adopting a multi-level detection mechanism;

S150, determining responsibility nodes in the traceability graph according to the transmission path of the abnormal data, and calculating the final responsibility score of each responsibility node according to the responsibility propagation probability and the historical behavior record;

S160, performing responsibility attribution according to the final responsibility score, and constructing a node-edge weight visualization model according to the responsibility attribution result so as to perform visual tracing.

The embodiment comprises five steps, namely data acquisition and preprocessing, multidimensional feature modeling, dynamic tracing path construction, anomaly detection and responsibility attribution, and visual tracing and tracing result storage.

First, data acquisition and preprocessing.

The data acquisition and preprocessing are the basis of the whole traceability process, and mainly extract necessary information from multi-source heterogeneous data related to blockchain transactions and carry out cleaning and normalization processing so as to provide high-quality data support for subsequent multidimensional feature modeling and responsibility attribution through comprehensive and accurate data input.

Further, a plurality of original data records are extracted from the multi-source heterogeneous data stream, and each original data record is cleaned and format-normalized to obtain a target data record;

the context environment state of each target data record and the related historical behavior record of each node are acquired, wherein the historical behavior record comprises the historical anomaly frequency and the stability score;

and a target data set is constructed from the target data records, their context environment states and the related historical behavior records of the nodes.

Data acquisition and preprocessing is divided into two parts: data acquisition and cleaning, and context information supplementation. Data acquisition refers to extracting a plurality of original data records from a multi-source heterogeneous data stream, where each original data record includes information such as a timestamp, data content, a source node and a transfer path from the data transfer process. In this embodiment, the timestamp covers all key time points in a transaction, including but not limited to the transaction creation timestamp and the transaction verification timestamp; the data content includes the transaction amount, information about the transaction initiator and the public/private key verification result; the source node is the blockchain node of the transaction initiator; and the transfer path consists of all blockchain nodes through which the data passes during the transaction. At collection time, each original data record is assigned a unique identifier (UID) so that its source and flow direction can be traced in the subsequent tracing process, enabling both data tracing and the monitoring and analysis of the data flow.

After the data acquisition is completed, the data is subjected to cleaning and format normalization processing to obtain target data records, so that noise data and repeated data are removed, and the consistency of the data formats among all data sources is ensured. The format normalization comprises format unification of the time stamps so as to ensure that the time data can be accurately analyzed and utilized in the subsequent calculation and modeling process. Through data cleaning and format normalization processing, the data quality can be effectively ensured, so that the subsequent calculation and modeling are ensured to be performed efficiently and accurately.
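The acquisition and cleaning step can be illustrated with a minimal Python sketch. The record field names (`timestamp`, `content`, `source_node`, `path`), the use of `uuid4` for the UID, the choice of UTC ISO 8601 as the unified timestamp format, and the rule that records with missing fields count as noise are all assumptions made for illustration; the embodiment does not prescribe them. For brevity the sketch assigns the UID when a record is accepted rather than at collection time.

```python
import uuid
from datetime import datetime, timezone

# Assumed field names; the text only says a record carries a timestamp,
# data content, a source node and a transfer path.
REQUIRED_FIELDS = ("timestamp", "content", "source_node", "path")

def normalize_timestamp(value):
    """Convert an epoch number or an ISO 8601 string to a UTC ISO 8601 string."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:  # assumption: naive timestamps are already UTC
            dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

def clean_records(raw_records):
    """Drop noisy and duplicate records, unify timestamps, attach a UID."""
    seen = set()
    cleaned = []
    for raw in raw_records:
        # Treat records with missing fields as noise (assumed rule).
        if any(raw.get(field) is None for field in REQUIRED_FIELDS):
            continue
        record = dict(raw)
        record["timestamp"] = normalize_timestamp(raw["timestamp"])
        record["path"] = tuple(raw["path"])
        # Two records are duplicates if all normalized fields agree.
        key = (record["timestamp"], repr(record["content"]),
               record["source_node"], record["path"])
        if key in seen:
            continue
        seen.add(key)
        record["uid"] = uuid.uuid4().hex  # unique identifier for later tracing
        cleaned.append(record)
    return cleaned

raw = [
    {"timestamp": 0, "content": {"amount": 5}, "source_node": "n1", "path": ["n1", "n2"]},
    {"timestamp": "1970-01-01T00:00:00+00:00", "content": {"amount": 5},
     "source_node": "n1", "path": ["n1", "n2"]},  # same record, different timestamp format
    {"timestamp": 10, "content": None, "source_node": "n3", "path": ["n3"]},  # noise
]
target_records = clean_records(raw)
```

Normalizing the timestamp before the duplicate check matters here: the first two records only compare equal once both are expressed in the same format.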

Context information supplementation further obtains, on the basis of the cleaned and normalized data, the context environment state of each target data record and the relevant historical behavior record of each node. The context environment state includes data such as network delay, transmission bandwidth and node load, which are quantified by a context evaluation value and can be obtained through network monitoring tools and system performance monitoring tools. For example, the network delay is determined by measuring the round-trip time of communication between nodes with a network testing tool; the transmission bandwidth is determined from the currently available bandwidth reported by an interface of the network equipment or operating system; and the node load is determined from the usage of resources such as CPU and memory reported by the node's resource management system. The historical behavior record of a node includes the historical anomaly frequency and the stability score, which are obtained from system log records, previous anomaly detection results and the like. For example, the historical anomaly frequency of a node is obtained as the ratio of its number of abnormal operations to its total number of operations over a past period. This context information provides an important auxiliary basis for subsequent responsibility deduction.

All target data records, the context environment state of each target data record and the historical behavior records of its related nodes are added to an empty set to obtain the target data set.

Further, a context evaluation value corresponding to each context state is determined, and the responsibility value of each relevant node is obtained by carrying out weighted average on the corresponding historical abnormal frequency, stability score and context evaluation value.

Another important algorithm in data acquisition and preprocessing is the modeling of the historical behavior characteristics of responsible nodes. Its objective is to construct the responsibility value of each node in the transfer path by comprehensively evaluating the node's historical abnormal behavior, transfer environment and stability. For this purpose, a weighted calculation formula is introduced to evaluate the responsibility value of each node; in subsequent steps, the responsibility propagation mechanism traces back and determines the responsibility for the data flow based on this value.

Specifically, the responsibility value is calculated as:

$$R_i = w_1 H_i + w_2 C_i + w_3 S_i$$

Wherein R_i represents the responsibility value of node i, H_i is the historical anomaly frequency of node i, C_i is the context evaluation value of node i, S_i is the stability score of node i, and w₁, w₂ and w₃ are weight coefficients reflecting the importance of each feature in responsibility evaluation. H_i, C_i, S_i, w₁, w₂ and w₃ all take values in [0,1], and w₁ + w₂ + w₃ = 1.

It should be noted that the historical anomaly frequency is usually obtained by evaluating the historical behavior of the node over the past 3-6 months. This period covers a sufficient number of transactions or data transmission events to reflect the node's historical behavior comprehensively, while avoiding an overly long window in which early data becomes stale or excessively dilutes recent data.

Illustratively, if the historical anomaly frequency of node i is 0.1, the context evaluation value is 0.8, the stability score is 0.9, and the weight coefficients are w₁ = 0.5, w₂ = 0.3 and w₃ = 0.2, then the responsibility value of node i is:

$$R_i = 0.5 \times 0.1 + 0.3 \times 0.8 + 0.2 \times 0.9 = 0.47$$

The responsibility value reflects the degree of responsibility of a node under different data transfer scenarios and serves as the basis for subsequent tracing path construction and responsibility deduction.
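The responsibility value calculation can be sketched in a few lines of Python. The sketch assumes the weighted form R_i = w₁·H_i + w₂·C_i + w₃·S_i, which is consistent with the weighted average described in the text and with the value 0.47 used later as the prior responsibility probability. The function name and the default weights (taken from the illustrative example) are hypothetical.

```python
def responsibility_value(h, c, s, w1=0.5, w2=0.3, w3=0.2):
    """Weighted responsibility value R_i = w1*H_i + w2*C_i + w3*S_i.

    h: historical anomaly frequency, c: context evaluation value,
    s: stability score. All inputs and weights must lie in [0, 1],
    and the weights must sum to 1.
    """
    for name, value in (("h", h), ("c", c), ("s", s),
                        ("w1", w1), ("w2", w2), ("w3", w3)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w1 * h + w2 * c + w3 * s

# Values from the illustrative example in the text.
r_i = responsibility_value(0.1, 0.8, 0.9)
print(round(r_i, 2))  # prints 0.47
```

Because the weights sum to 1 and every input lies in [0, 1], the result also lies in [0, 1], which is what allows it to be reused as a prior probability in the later responsibility propagation step.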

Second, multidimensional feature modeling.

The multidimensional feature modeling is a core step for ensuring that the features such as data sources, transmission paths, node historical behaviors and the like can be effectively integrated and deeply analyzed. The method aims at establishing a multidimensional feature matrix capable of integrating data content features, transfer path features, context features and responsibility deduction related features, and based on the matrix, performing feature integration by using a deep learning model so as to generate a global feature map and provide support for subsequent tracing path construction and responsibility attribution.

Further, extracting the characteristics of the data content, the transmission path, the context environment state and the historical behavior record to obtain the characteristics of the data content, the transmission path, the context environment and the historical behavior;

and constructing a multidimensional feature matrix according to the data content features, the transfer path features, the context environment features, the historical behavior features and the interaction weights.

In this step, features are first extracted from the collected data content, transmission path and context information. Feature extraction on the data content yields, for example, the amount validity and the public key matching rate, which describe the influence of the data content on tracing and responsibility judgment. Feature extraction on the transmission path yields the path length and the transmission delay of each path, which reflect the influence of the transmission path on data tracing. Feature extraction on the context information yields the context environment features and the historical behavior features, where the context environment features include node load and delay fluctuation and the historical behavior features include historical anomaly records; these reflect the influence of the network environment on data tracing.

Since the historical behavior features, the context environment features and the interaction weights between nodes all play a key role in the responsibility derivation process, they are also referred to as responsibility derivation related features in this embodiment. Specifically, the historical behavior features reflect the past anomalies of the nodes, the context environment features reflect the influence of the network environment, and the interaction weights provide a basis for responsibility judgment from the perspective of the relationships among nodes. Combined, these features provide rich and indispensable information for responsibility attribution.

In order to effectively fuse the extracted features, as shown in fig. 2, this embodiment constructs a multidimensional feature matrix from them, where each dimension of the matrix represents one aspect of the data stream, such as the historical behavior of a node, the transmission delay or the node load. A graph convolution network (GCN) is then used for feature fusion; the GCN processes the complex relationships between nodes in graph-structured form and effectively extracts the dependency features between nodes, forming a low-dimensional representation of the data.
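To make the feature fusion step concrete, the following pure-Python sketch applies a single graph convolution layer to a small feature matrix. The embodiment does not specify the GCN architecture, so the widely used propagation rule ReLU(D^(-1/2) (A + I) D^(-1/2) X W) is assumed here; the edge weights, the feature columns (anomaly frequency, transmission delay, node load) and the weight matrix `weights` are hypothetical values, and in practice the weight matrix would be learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weights):
    """One graph convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj:      n x n matrix of non-negative edge weights
    features: n x d multidimensional feature matrix (one row per node)
    weights:  d x k projection matrix (assumed given; normally learned)
    """
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of the self-loop
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    hidden = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]  # ReLU

# Hypothetical 3-node example; columns: anomaly frequency, delay, node load.
adj = [[0.0, 0.343, 0.0],
       [0.343, 0.0, 0.5],
       [0.0, 0.5, 0.0]]
features = [[0.1, 0.5, 0.3],
            [0.4, 0.2, 0.7],
            [0.0, 0.9, 0.6]]
weights = [[0.5, -0.2],
           [0.1, 0.3],
           [-0.4, 0.6]]
embeddings = gcn_layer(adj, features, weights)
print(len(embeddings), len(embeddings[0]))  # prints "3 2"
```

Each output row is a two-dimensional embedding of one node that mixes its own features with those of its neighbors, weighted by the edge weights, which is the sense in which the layer produces a low-dimensional fused representation.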

In the process of multidimensional feature modeling, the interaction relations among nodes are represented by a weighted graph structure. Specifically, the interaction weight of each edge is dynamically adjusted according to factors such as the interaction frequency, transmission delay and data transmission reliability between adjacent nodes on the transmission path.

Further, the interaction frequency, the transmission delay and the reliability score between adjacent nodes on the transmission path are obtained, and the interaction weights between all the adjacent nodes are obtained through calculation according to the interaction frequency, the transmission delay and the reliability score.

Assuming that the interaction weight between node i and node j is w_ij, the calculation formula is as follows:

$$w_{ij} = \frac{f(E_{ij}, T_{ij}, Q_{ij})}{\sum_{u \in N(i)} f(E_{iu}, T_{iu}, Q_{iu})}$$

Wherein f(E_ij, T_ij, Q_ij) is a comprehensive function of the interaction frequency E_ij, the transmission delay T_ij and the reliability score Q_ij between node i and node j. The function calculates the interaction strength between the two nodes, and its specific form can be set according to actual requirements. The denominator is the sum of the interaction strengths between node i and all of its neighbor nodes N(i), and u is an index variable that traverses all neighbor nodes of node i, i.e., u belongs to N(i).

It should be noted that the reliability score is an index for measuring the reliability of interaction between two nodes, which is determined by comprehensively considering various factors such as stability and accuracy of data transmission between adjacent nodes.

The finally obtained interaction weight can more accurately reflect the relation between the path and the node in the data stream transmission process, so that the GCN can better embed and fuse the data.

For example, if the interaction frequency between node i and node j is 0.8, the transmission delay is 0.5, the reliability score is 0.9, and the sum of the interaction strengths between node i and all of its neighbor nodes is 3.5, then the interaction weight is:

$$w_{ij} = \frac{f(0.8, 0.5, 0.9)}{3.5}$$

Given the calculated value f(0.8, 0.5, 0.9) = 1.2, then

$$w_{ij} = \frac{1.2}{3.5} \approx 0.343$$

That is, the interaction weight of the edge between node i and node j in the graph is 0.343, which, depending on the actual situation, can be judged to indicate a strong interaction relation between the two nodes. Through the GCN, the interaction weights influence the feature fusion of the nodes and improve the precision with which responsible nodes are determined.
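The normalization of interaction strengths into interaction weights can be sketched as follows. The text leaves the function f to be set according to actual requirements, so the form f(E, T, Q) = E + Q - T used here is purely an assumption (chosen because frequency and reliability should raise the strength while delay lowers it, and because it happens to give 1.2 for the inputs of the worked example). The neighbor nodes `k` and `m` and their values are hypothetical, picked so that the strengths sum to 3.5.

```python
def interaction_strength(e, t, q):
    """Hypothetical f(E, T, Q): frequency and reliability add, delay subtracts."""
    return e + q - t

def interaction_weights(neighbors, f=interaction_strength):
    """Normalize interaction strengths of one node over its neighbor set N(i).

    neighbors: dict mapping neighbor id -> (E, T, Q) for edges of node i.
    Returns a dict mapping neighbor id -> w_ij; the weights sum to 1.
    """
    strengths = {u: f(*etq) for u, etq in neighbors.items()}
    total = sum(strengths.values())
    if total <= 0:
        raise ValueError("total interaction strength must be positive")
    return {u: s / total for u, s in strengths.items()}

# Node i with three neighbors; "j" uses the values from the worked example.
neighbors_of_i = {
    "j": (0.8, 0.5, 0.9),  # strength 1.2
    "k": (0.9, 0.2, 0.6),  # strength 1.3 (hypothetical)
    "m": (0.7, 0.3, 0.6),  # strength 1.0 (hypothetical)
}
w_i = interaction_weights(neighbors_of_i)
print(round(w_i["j"], 3))  # prints 0.343
```

Passing `f` as a parameter keeps the normalization independent of the particular strength function, which matches the statement that the function can be chosen per deployment.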

By jointly modeling the historical behavior features, context features and interaction weights of the nodes, and combining a graph convolution network for deep embedding of complex features, the dynamic features of the data can be comprehensively captured and adaptability to various scenarios is improved.

Third, dynamic tracing path construction.

As shown in fig. 3, the goal of dynamic tracing path construction is to model multiple nodes and paths through a graph structure so as to find the truly responsible node among the multiple possible paths through which the data flow passes. The introduced responsibility propagation probability model overcomes the defect of traditional schemes, which cannot effectively determine responsibility attribution when multiple potentially responsible nodes exist.

Further, the conditional probabilities of the historical behavior feature and the context environment feature of each related node given the responsible party, together with their joint probability, are obtained;

and determining the prior responsibility probability of each related node according to the responsibility values, and calculating the responsibility propagation probability of each related node according to the conditional probability, the joint probability and the prior responsibility probability.

The responsibility propagation probability model is introduced to accurately calculate the probability that each node is the responsible party when the data flow passes through multiple nodes. The model builds a responsibility causal chain based on a Bayesian network from the historical behavior features, the context environment features and the responsibility propagation probabilities among nodes.

Specifically, the responsibility propagation probability of node i in a path is calculated by the following formula:

$$P(R_i \mid X_i, Y_i) = \frac{P(X_i \mid R_i)\, P(Y_i \mid R_i)\, P(R_i)}{P(X_i, Y_i)}$$

Where P(R_i | X_i, Y_i) represents the responsibility propagation probability of node i in the path as the responsible party; X_i and Y_i represent the historical behavior feature and the context environment feature of node i, respectively; P(R_i) is the prior responsibility probability of node i, which is essentially the responsibility value R_i expressed as a probability (for the value R_i = 0.47 calculated above, the prior responsibility probability is 47%) and can also be understood as the probability that node i was the directly responsible node in past events; P(X_i | R_i) and P(Y_i | R_i) are the conditional probabilities of the historical behavior feature and the context environment feature of node i, respectively; and the denominator P(X_i, Y_i) is the joint probability of the features, used for normalization.

Specifically, the conditional probability is the probability that event A occurs given that event B has occurred, calculated as P(A|B) = P(AB) / P(B), where P(AB) is the joint probability, i.e., the probability that A and B occur simultaneously. For example, if a transaction is associated with a known malicious address, the conditional probability that the transaction is malicious may increase.

The joint probability refers to the probability that events occur simultaneously, for example, the probability that two events occur simultaneously is recorded as P (a, B) or P (a n B) or P (AB) now A, B, if event a and event B are independent of each other, then there is P (a, B) =p (a) P (B), and if a transaction has both a large transfer and is associated with a known malicious address, then the joint probability that the transaction is malicious is higher when the two features occur simultaneously.

Assume that, in the blockchain, a user has a smart contract for detecting abnormal transactions. The user sets the prior responsibility probability of a node to 5% according to the historical data. When a new transaction occurs, the user finds that the transaction is associated with a known malicious address, which increases the conditional probability that the transaction is malicious to 20%. Meanwhile, if the transaction amount exceeds a certain threshold, the joint probability of exceeding this threshold and being associated with the malicious address may further increase to 40%. In this way, the user can adjust the assessed responsibility propagation probability for the transaction based on the prior responsibility probability, the conditional probability and the joint probability, thereby identifying abnormal transactions more accurately.
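The Bayesian update described above can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation: it assumes that the two features are conditionally independent given the responsibility state, and it expands the denominator P(X_i, Y_i) with the law of total probability. The prior 0.47 is the responsibility value quoted above; the function name and all four likelihood values are hypothetical.

```python
def responsibility_posterior(prior, p_x_given_r, p_y_given_r,
                             p_x_given_not_r, p_y_given_not_r):
    """Return P(R | X, Y), assuming X and Y are independent given R."""
    # Numerator of Bayes' rule: P(X | R) * P(Y | R) * P(R).
    numerator = p_x_given_r * p_y_given_r * prior
    # Denominator P(X, Y), expanded over "responsible" and "not responsible".
    joint = numerator + p_x_given_not_r * p_y_given_not_r * (1.0 - prior)
    if joint == 0.0:
        raise ValueError("P(X, Y) is zero: the features are impossible under the model")
    return numerator / joint


# Prior taken from the responsibility value R_i = 0.47 in the text;
# the likelihoods below are hypothetical illustration values.
posterior = responsibility_posterior(
    prior=0.47,
    p_x_given_r=0.8, p_y_given_r=0.6,
    p_x_given_not_r=0.3, p_y_given_not_r=0.4,
)
print(f"P(R_i | X_i, Y_i) = {posterior:.3f}")
```

Because the observed features are more likely under the "responsible" hypothesis than under its complement, the posterior ends up above the prior, which is the adjustment the example in the text describes.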

Step four, anomaly detection and responsibility attribution.

Anomaly detection and responsibility attribution are the core links by which the system accurately identifies data anomalies and responsible nodes. By establishing an efficient anomaly detection mechanism and combining it with a dynamic responsibility allocation model, the system locates data anomalies and presents the responsibility attribution hierarchically, step by step.

Further, calculating probability that each data point in the multidimensional feature matrix belongs to kth Gaussian distribution by using a Gaussian mixture model, and comparing the probability with a first set threshold value, wherein k is an integer greater than or equal to 0;

when the Gaussian distribution probability is smaller than a first set threshold value, judging that the data point is suspicious;

calculating the anomaly score of the corresponding data point by using an isolation forest algorithm, and carrying out a weighted average of the anomaly score and the Gaussian distribution probability of that data point to obtain the final anomaly score;

and comparing the final anomaly score with a second set threshold, and judging the data point as anomaly data when the final anomaly score is smaller than the second set threshold.

First, in order to improve the accuracy and robustness of anomaly detection, the present embodiment employs a multi-level detection mechanism. In the initial stage, the data content features and the contextual environmental features are examined with a Gaussian mixture model, which adapts to complex data patterns by assuming that the data distribution is a linear combination of a plurality of Gaussian distributions. Assume that the multidimensional feature matrix generated in step two is X, with dimension n × m, where n is the number of data records and m is the number of features. The probability that each data point x_i belongs to the kth Gaussian distribution is:

$$P(k \mid x_i) = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{o=1}^{E} \pi_o\, \mathcal{N}(x_i \mid \mu_o, \Sigma_o)}$$

Wherein π_k is the weight of the kth Gaussian distribution, N(x_i | μ_k, Σ_k) is the Gaussian density with mean μ_k and covariance Σ_k, E is the number of Gaussian distributions, and E, o and k are integers greater than or equal to 0.

It should be noted here that π_k, μ_k and Σ_k are obtained by using the Expectation Maximization (EM) algorithm, which is prior art and will not be described herein.

The calculated probability is then compared with the first set threshold. If the probability is smaller than the threshold, the data point x_i is judged suspicious and is assigned to the abnormal distribution of the kth category. This category is used to determine the distribution of data anomalies and, through the data characteristics associated with that distribution, affects the calculation of related parameters in the responsibility allocation formula, such as the responsibility propagation probability.
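The first detection stage can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the data is one-dimensional, the mixture parameters are given directly instead of being fitted by EM, and the function names, parameter values and the threshold 0.05 are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def component_probabilities(x, weights, means, variances):
    """Return the probability that x belongs to each mixture component."""
    weighted = [w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
    total = sum(weighted)
    if total == 0.0:
        # Every density underflowed: x is far away from all components.
        return [0.0] * len(weighted)
    return [p / total for p in weighted]


def is_suspicious(x, k, first_threshold, weights, means, variances):
    """Flag x when its probability for component k is below the threshold."""
    return component_probabilities(x, weights, means, variances)[k] < first_threshold


# Hypothetical two-component mixture (in practice these come from EM).
weights, means, variances = [0.7, 0.3], [0.0, 5.0], [1.0, 1.0]
near = component_probabilities(0.2, weights, means, variances)
far = component_probabilities(4.8, weights, means, variances)
print(near[0], far[0])
print(is_suspicious(4.8, 0, 0.05, weights, means, variances))
```

For multidimensional feature rows the scalar density would be replaced by a multivariate Gaussian density with a covariance matrix; the normalisation step stays the same.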

On this basis, an isolation forest is introduced as a supplementary test. Specifically, a feature is randomly selected from the given data set, and a segmentation value is randomly selected between the maximum value and the minimum value of the selected feature. The data set is divided into two subsets, with data points smaller than the segmentation value placed in the left subset and data points larger than the segmentation value placed in the right subset. These steps are repeated for each subset until a stopping condition is met (for example, the number of data points in the subset is smaller than a certain threshold, or the tree reaches a preset maximum depth), which yields one isolation tree. The tree construction process is repeated to obtain a plurality of isolation trees that form the isolation forest. The path length of each data point in the isolation forest (the number of edges passed from the root node to the leaf node where the data point is located) is then recorded, and the average path length and the anomaly score of the data point are calculated.
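The tree construction and path-length scoring described above can be sketched in pure Python. The text does not give the score formula, so this sketch assumes the commonly used isolation forest normalisation s = 2^(-E[h] / c(n)), where c(n) is the average path length of an unsuccessful binary-search-tree lookup; the function names, tree count, sample size and the toy data are hypothetical.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Split on a random feature and value until a stopping condition holds."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(tree, point, depth=0):
    """Edges from the root to the leaf holding point, plus a leaf-size correction."""
    if "feature" not in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def anomaly_scores(data, n_trees=100, sample_size=64, seed=0):
    """Score every point in (0, 1); higher means easier to isolate."""
    if len(data) < 2:
        raise ValueError("need at least two data points")
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(size))
    forest = [build_tree(rng.sample(data, size), 0, max_depth, rng)
              for _ in range(n_trees)]
    norm = c_factor(size)
    return [2.0 ** (-sum(path_length(t, p) for t in forest) / n_trees / norm)
            for p in data]


# Toy data: a tight cluster plus one far-away point (all values hypothetical).
gen = random.Random(42)
cluster = [[gen.gauss(0.0, 0.5), gen.gauss(0.0, 0.5)] for _ in range(50)]
data = cluster + [[10.0, 10.0]]
scores = anomaly_scores(data)
print(scores.index(max(scores)))  # index of the most easily isolated point
```

In this formulation a short average path gives a score close to 1, so the far-away point is the one that stands out.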

A weighted average of the Gaussian distribution probability and the isolation forest anomaly score of each data point is then calculated, and whether the corresponding data point is abnormal is judged according to the weighted average result. Combining the two methods improves the coverage and the accuracy of data anomaly detection.

For the detected abnormal data, all nodes in the traceability graph through which the abnormal data flowed are determined, all of these nodes are regarded as responsible nodes, and a hierarchical analysis is carried out through a dynamic responsibility distribution model.
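The two-stage decision can be sketched as follows. The text does not specify the averaging weights or how the two scores are oriented, so this sketch makes explicit assumptions: the isolation forest score s lies in [0, 1] with higher meaning more anomalous, it is converted to 1 - s so that both inputs are "low means anomalous", and the weight 0.5 and both thresholds are hypothetical.

```python
def final_anomaly_score(gmm_probability, isolation_score, gmm_weight=0.5):
    """Weighted average of the GMM probability and the inverted forest score."""
    if not 0.0 <= gmm_weight <= 1.0:
        raise ValueError("gmm_weight must lie in [0, 1]")
    # Assumption: invert the forest score so that low values mean "anomalous",
    # matching the orientation of the GMM probability.
    normality_from_forest = 1.0 - isolation_score
    return gmm_weight * gmm_probability + (1.0 - gmm_weight) * normality_from_forest


def is_anomalous(gmm_probability, isolation_score,
                 first_threshold, second_threshold, gmm_weight=0.5):
    """Apply the first threshold, then the second one to the fused score."""
    # Stage 1: only points below the first threshold are suspicious.
    if gmm_probability >= first_threshold:
        return False
    # Stage 2: a suspicious point is abnormal when the fused score is
    # smaller than the second threshold.
    fused = final_anomaly_score(gmm_probability, isolation_score, gmm_weight)
    return fused < second_threshold


print(is_anomalous(0.01, 0.8, first_threshold=0.05, second_threshold=0.3))
print(is_anomalous(0.90, 0.8, first_threshold=0.05, second_threshold=0.3))
```

Running the second stage only on suspicious points keeps the supplementary test cheap, since most data points are filtered out by the first threshold.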

The responsibility distribution model integrates the historical behavior features, the contextual environmental features and the responsibility propagation probability of the nodes, calculates the contribution of each responsible node to the abnormal data, and provides support for the subsequent hierarchical attribution report. Assuming that node i is a responsible node, the model is as follows:

$$R_i = \alpha \cdot P(R_i \mid X_i, Y_i) + \beta \cdot H_i$$

Wherein R_i is the final responsibility score of node i, P(R_i | X_i, Y_i) is the responsibility propagation probability of node i calculated in step three, H_i is the historical anomaly frequency of node i, and α and β are adjusting weights that balance the influence of the propagation probability and the historical behavior. The values of α and β both lie between 0 and 1, and the specific values can be determined according to the characteristics of the data and their importance in the actual application scenario.

Combining the Gaussian mixture model with the isolation forest improves the accuracy and the robustness of anomaly detection, markedly reduces the possibility of false detection and missed detection, and provides more stable input data for the traceability system.

Further, the node with the highest final responsibility score in the traceability graph is listed as a direct responsibility node, and other nodes are listed as secondary responsibility nodes.

The responsibility attribution results are displayed in a hierarchical form and comprise direct responsibility nodes and secondary responsibility nodes, wherein the direct responsibility nodes refer to the nodes with the highest final responsibility scores and are usually the main sources of abnormal data, and the secondary responsibility nodes refer to the nodes indirectly participating in abnormal data transmission, and the final responsibility scores of the secondary responsibility nodes are slightly lower than those of the direct responsibility nodes, but the secondary responsibility nodes still need attention.

Illustratively, in a certain data anomaly, the traceability graph contains nodes A, B and C. The responsibility propagation probability of node A is 0.6 and its historical anomaly frequency is 0.8; the responsibility propagation probability of node B is 0.4 and its historical anomaly frequency is 0.5; the responsibility propagation probability of node C is 0.2 and its historical anomaly frequency is 0.4. If α = 0.7 and β = 0.3, the final responsibility score of node A is:

$$R_A = 0.7 \times 0.6 + 0.3 \times 0.8 = 0.66$$

Following similar steps, the final responsibility scores of nodes B and C are 0.43 and 0.26, respectively. Node A therefore has the highest score and is determined to be the direct responsible node, and nodes B and C are secondary responsible nodes.
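The scoring and ranking in this example can be reproduced with a short sketch. It uses the weighted-sum model and the input values stated above; the function and variable names are hypothetical.

```python
def final_responsibility_score(propagation_probability,
                               historical_anomaly_frequency,
                               alpha, beta):
    """Weighted sum of propagation probability and historical anomaly frequency."""
    return alpha * propagation_probability + beta * historical_anomaly_frequency


# (responsibility propagation probability, historical anomaly frequency)
nodes = {"A": (0.6, 0.8), "B": (0.4, 0.5), "C": (0.2, 0.4)}
alpha, beta = 0.7, 0.3

scores = {name: final_responsibility_score(p, h, alpha, beta)
          for name, (p, h) in nodes.items()}

# The highest-scoring node is the direct responsible node; the rest are secondary.
ranked = sorted(scores, key=scores.get, reverse=True)
direct, secondary = ranked[0], ranked[1:]

for name in ranked:
    print(name, round(scores[name], 2))
print("direct:", direct, "secondary:", secondary)
```

Changing α and β shifts the balance between the current propagation probability and the historical record, which can change the ranking when nodes score closely.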

The dynamic responsibility distribution model combines the multidimensional features and responsibility propagation probability to generate a grading attribution report, and clearly displays the direct responsibility and the secondary responsibility nodes, thereby not only improving the accuracy of responsibility attribution, but also providing strong interpretability support for users.

Step five, visual tracing and storage of tracing results.

Visual traceability and traceability result storage aims at presenting data flow and responsibility attribution results in an intuitive form and provides efficient traceability result storage and analysis capability, as shown in fig. 4. The method mainly combines dynamic visualization and storage management, provides comprehensive traceability operation experience for users, and ensures traceability and reusability of data analysis results.

Further, the responsible node with the highest final responsibility score is listed as a direct responsible node, and other nodes are listed as secondary responsible nodes;

Normalizing the prior responsibility probability to obtain a color value of each responsibility node, and calculating the transfer weight of the edge according to the data content characteristics, the transfer path characteristics, the context environment characteristics and the historical behavior characteristics among the adjacent nodes;

A node-edge weight visualization model is constructed based on the color values and the pass weights to visualize the critical path of the abnormal data flow and the final responsibility score for each responsible node.

The core of dynamic visualization is to display the tracing path, the node responsibility distribution and the historical change trend through a multidimensional graphical interface, for which a node-edge weight visualization model is constructed. In the graph structure of the tracing path, the prior responsibility probability P(R_i) of each node and the transfer weight of each edge serve as the key parameters of the dynamic visualization, and the graphical interface displays node responsibility and the main data flow through color shade and edge width, respectively.

The display color shade of the node is calculated by the following formula:

$$C_i = \mathrm{Normalize}\big(P(R_i)\big)$$

Where C_i is the color value of node i, and Normalize represents normalizing the prior responsibility probability P(R_i) to ensure that the node colors have a uniform visual range in the graph. With this formula, areas with high responsibility values are shown in darker colors, and potential responsible nodes are visually highlighted.

The width of the edge is adjusted according to the transmission weight, and the calculation formula of the transmission weight is as follows:

$$W_{ij} = f(X_{ij}, Y_{ij})$$

Where X_ij and Y_ij represent the data transfer features and the contextual environmental features between nodes i and j, respectively. The function f is a weighted-sum mapping that converts the feature inputs into the weight value of an edge; paths with larger edge widths are regarded as critical paths through which data anomalies may flow.

In particular,

$$f(X_{ij}, Y_{ij}) = \sum_{z \in V(ij)} \left( a\, X_{ij,z}^{(1)} + b\, X_{ij,z}^{(2)} + c\, Y_{ij,z}^{(1)} + d\, Y_{ij,z}^{(2)} \right)$$

Wherein X^(1)_{ij,z} and X^(2)_{ij,z} represent different aspects of the data transfer features, and Y^(1)_{ij,z} and Y^(2)_{ij,z} represent different factors of the contextual environmental features; z is an index variable that traverses the elements in the specific range V(ij) associated with nodes i and j, and the transfer weight of the edge is determined by summing the terms over this index variable; a, b, c and d are the corresponding weight coefficients, used for adjusting the degree of influence of the different factors on the edge weight.
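The weighted-sum mapping for an edge can be sketched as below. The representation is an assumption made for illustration: each element z associated with the edge is a tuple of two data transfer feature values and two contextual environmental feature values, and the coefficient and feature values are hypothetical.

```python
def transfer_weight(elements, a, b, c, d):
    """Sum a*x1 + b*x2 + c*y1 + d*y2 over all elements z of one edge.

    Each element is (x1, x2, y1, y2): two data transfer feature values and
    two contextual environmental feature values (hypothetical layout).
    """
    return sum(a * x1 + b * x2 + c * y1 + d * y2
               for x1, x2, y1, y2 in elements)


# Hypothetical elements associated with the edge between nodes i and j.
edge_elements = [
    (0.5, 0.2, 0.1, 0.4),
    (0.3, 0.6, 0.2, 0.1),
]
weight = transfer_weight(edge_elements, a=0.4, b=0.3, c=0.2, d=0.1)
print(round(weight, 2))
```

Raising one coefficient relative to the others makes the corresponding feature dominate the edge width in the visualization.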

It should be noted here that the data transfer features include a data content feature and a transfer path feature.

Through dynamic visualization, a user can quickly identify high-responsibility nodes, key paths and corresponding responsibility distribution in the interface, and meanwhile, the interface also supports a time slider, so that the user can dynamically check the time change trend of the final responsibility distribution of the nodes.

The tracing result storage aims to store the analysis data in a structured way so that historical results can be queried and compared.

Three types of core data tables are designed for the result storage, namely a node responsibility score table, a path weight table and a time responsibility change table. Each table is associated with the unique identifier of an event to facilitate comparative analysis.

Meanwhile, in order to support efficient storage and quick retrieval, an optimized storage strategy based on a time sequence database is adopted, and the final responsibility score, path weight and event change data of the nodes are stored in time sequence by taking the event as an index, so that efficient time range query and dynamic update are supported.

Illustratively, a tracing event includes nodes A, B and C, the tracing path is A→B→C, the final responsibility scores of the nodes are 0.8, 0.5 and 0.3, and the transfer weights are 0.7 and 0.4. The normalization formula gives color values of 1.0, 0.625 and 0.375 for nodes A, B and C, respectively. That is, in the visual interface node A is shown in a dark color and node C in a light color, the edge A→B is wider and the edge B→C is narrower, which highlights A→B as the critical path of the data anomaly.

In the tracing result storage, the node responsibility score table records the final responsibility scores and calculation sources of nodes A, B and C, the path weight table records the transfer weights of the edges A→B and B→C, and the time responsibility change table records how the final responsibility scores of nodes A, B and C change at different time points, which provides a basis for subsequent analysis.
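The color values in this example are consistent with dividing each score by the largest score, so the sketch below assumes that form of normalization. The mapping from transfer weight to a pixel width, including the 1 to 9 pixel range, is a hypothetical rendering choice and not part of the described method.

```python
def normalize_by_max(values):
    """Scale every value by the maximum so that the largest becomes 1.0."""
    peak = max(values.values())
    if peak <= 0.0:
        return {key: 0.0 for key in values}
    return {key: value / peak for key, value in values.items()}


def edge_width(weight, min_px=1.0, max_px=9.0):
    """Map a transfer weight in [0, 1] linearly to a line width in pixels."""
    clamped = min(max(weight, 0.0), 1.0)
    return min_px + (max_px - min_px) * clamped


scores = {"A": 0.8, "B": 0.5, "C": 0.3}
edges = {("A", "B"): 0.7, ("B", "C"): 0.4}

colors = normalize_by_max(scores)           # darker node = higher value
widths = {edge: edge_width(w) for edge, w in edges.items()}
critical_edge = max(edges, key=edges.get)   # widest edge marks the critical path

print(colors)
print(widths)
print(critical_edge)
```

A min-max normalization would also give a uniform visual range, but it would map the lowest-scoring node to 0 and so would not reproduce the values 0.625 and 0.375 quoted in the example.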
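The three tables can be sketched with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in for the time-series database mentioned above; the table names, column names, event identifier, timestamps and the historical score values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node_score (
    event_id TEXT, node_id TEXT, final_score REAL, source TEXT,
    PRIMARY KEY (event_id, node_id));
CREATE TABLE path_weight (
    event_id TEXT, src TEXT, dst TEXT, weight REAL,
    PRIMARY KEY (event_id, src, dst));
CREATE TABLE score_history (
    event_id TEXT, node_id TEXT, ts INTEGER, final_score REAL);
-- Index on (event, time) to support time-range queries per event.
CREATE INDEX idx_history_time ON score_history (event_id, ts);
""")

event_id = "evt-001"  # hypothetical unique event identifier
conn.executemany("INSERT INTO node_score VALUES (?, ?, ?, ?)", [
    (event_id, "A", 0.8, "weighted sum"),
    (event_id, "B", 0.5, "weighted sum"),
    (event_id, "C", 0.3, "weighted sum"),
])
conn.executemany("INSERT INTO path_weight VALUES (?, ?, ?, ?)", [
    (event_id, "A", "B", 0.7),
    (event_id, "B", "C", 0.4),
])
conn.executemany("INSERT INTO score_history VALUES (?, ?, ?, ?)", [
    (event_id, "A", 1000, 0.6),  # hypothetical earlier scores of node A
    (event_id, "A", 2000, 0.7),
    (event_id, "A", 3000, 0.8),
])

# Time-range query: how did node A's score evolve between ts 1500 and 3000?
rows = conn.execute(
    "SELECT ts, final_score FROM score_history "
    "WHERE event_id = ? AND node_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
    (event_id, "A", 1500, 3000),
).fetchall()

# Direct responsible node of the event: the highest final score.
top_node = conn.execute(
    "SELECT node_id FROM node_score WHERE event_id = ? "
    "ORDER BY final_score DESC LIMIT 1",
    (event_id,),
).fetchone()[0]

print(rows, top_node)
```

Keying every table on the event identifier is what allows results from different events to be compared side by side.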

By introducing the responsibility propagation probability model and Bayesian reasoning, the method and the system improve the accuracy with which responsible nodes in the tracing path are identified. Multidimensional feature modeling is optimized by combining context information with historical behavior features, and the final responsibility score of each node is calculated based on the dynamic responsibility distribution model, which supports hierarchical attribution and generates intuitive visualization results. This solves the problem of uncertain responsibility attribution in the traditional scheme and improves the accuracy and the practicability of the tracing method.

In another embodiment, as shown in fig. 5, the present application further provides a data tracing system based on big data and multi-dimensional characteristics of blockchain, including:

The acquisition module is used for acquiring a target data set, wherein the set comprises a plurality of target data records, the contextual environment state of each target data record and the historical behavior records of the relevant nodes, and the target data records comprise data content and a transmission path;

This embodiment is used to implement the method provided by the foregoing embodiments and has the corresponding beneficial effects of that method. Technical details not described in detail in this embodiment can be found in the methods provided in the foregoing embodiments of the invention.

In yet another embodiment, the present application further provides a computer readable storage medium storing a computer program, where the computer program causes a computer to implement a data tracing method based on big data and blockchain multidimensional features as described above when executed.

By way of example, the computer program may be divided into one or more modules/units, which are stored in a memory and executed by a processor, with data transmitted through the I/O interfaces (an input interface and an output interface), to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in a computer device.

The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or the like. The computer device may include, but is not limited to, a memory and a processor. Those skilled in the art will appreciate that the present embodiment is merely an example of a computer device and does not limit it; the computer device may include more or fewer components, combine certain components, or use different components. For example, the computer device may also include an input device, a network access device, a bus, and the like.

The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may be an internal storage unit of the computer device, such as a hard disk or internal memory of the computer device. The memory may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, or the like. Further, the memory may include both an internal storage unit and an external storage device of the computer device. The memory may be used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store such programs and data for an output device. The foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims

1. The data tracing method based on the big data and the multi-dimensional characteristics of the blockchain is characterized by comprising the following steps of:

2. The method for tracing data based on big data and multi-dimensional characteristics of blockchain of claim 1, wherein the obtaining the target data set comprises:

Extracting a plurality of original data records from the multi-source heterogeneous data stream, and cleaning and format standardization processing are carried out on each original data record to obtain a target data record;

3. The method for tracing data based on big data and multi-dimensional characteristics of blockchain according to claim 2, wherein the calculating the responsibility value of each relevant node and the interaction weight between adjacent nodes based on the target data record, the context environmental state and the history behavior record comprises:

Determining a context environment evaluation value corresponding to each context environment state, and carrying out weighted average on the corresponding historical abnormal frequency, stability score and context environment evaluation value to obtain a responsibility value of each relevant node;

and acquiring interaction frequency, transmission delay and reliability score between adjacent nodes on a transmission path, and calculating to obtain interaction weights between all the adjacent nodes according to the interaction frequency, the transmission delay and the reliability score.

4. The method for tracing data based on big data and blockchain multidimensional features according to claim 1, wherein the constructing a multidimensional feature matrix according to the data content, the transfer path, the context environment state, the historical behavior record and the interaction weight comprises:

performing feature extraction on the data content, the transfer path, the context environment state and the historical behavior record to obtain data content features, transfer path features, context environment features and historical behavior features;

5. The method for tracing data based on big data and multi-dimensional characteristics of blockchain according to claim 1, wherein the calculating the probability of propagation of responsibility of each node in the tracing graph according to the responsibility value comprises:

Acquiring the respective corresponding conditional probability and joint probability of each relevant node historical behavior feature and context environmental feature under the condition of a given responsible party;

And determining the prior responsibility probability of each related node according to the responsibility value, and calculating the responsibility propagation probability of each related node according to the conditional probability, the joint probability and the prior responsibility probability.

6. The method for tracing data based on big data and multi-dimensional characteristics of blockchain as in claim 1, wherein determining abnormal data using a multi-level detection mechanism comprises:

Calculating the probability that each data point in the multidimensional feature matrix belongs to the kth Gaussian distribution by using a Gaussian mixture model, and comparing the probability with a first set threshold value, wherein k is an integer greater than or equal to 0;

when the Gaussian distribution probability is smaller than the first set threshold, judging that the data point is suspicious;

7. The method for tracing data based on big data and blockchain multidimensional features according to claim 2, wherein the determining the responsible nodes in the tracing graph according to the transmission path of the abnormal data and calculating the final responsibility score of each responsible node according to the responsibility propagation probability and the historical behavior record comprises:

And regarding all nodes through which the abnormal data flow in the traceability graph as responsible nodes, and calculating the final responsibility score according to the responsibility propagation probability and the history abnormal frequency of each responsible node.

8. The method for tracing data based on big data and blockchain multidimensional features of claim 5, wherein said attributing responsibility according to said final responsibility score and constructing a node-edge weight visualization model for visual tracing based on responsibility attribution results comprises:

the responsible node with the highest final responsibility score is listed as a direct responsible node, and other nodes are listed as secondary responsible nodes;

And constructing a node-edge weight visualization model according to the color values and the transfer weights so as to visualize the critical path of abnormal data flow and the final responsibility score of each responsible node.

9. A data tracing system based on big data and blockchain multidimensional features, comprising:

10. A computer readable storage medium storing a computer program, wherein the computer program when executed causes a computer to implement a data tracing method based on big data and blockchain multidimensional features as claimed in any one of claims 1 to 8.