CN118520397A

CN118520397A - A method and system for detecting data anomaly in a distributed network

Info

Publication number: CN118520397A
Application number: CN202410659935.4A
Authority: CN
Inventors: 魏兴慎; 刘苇; 祁龙云; 刘寅; 李科; 曹永明; 李慧水; 田秋涵; 张付存; 刘全; 黄天明; 刘剑; 董勤伟; 程长春; 冒佳明; 曹永健; 周剑; 郭楠楠
Original assignee: State Grid Jiangsu Electric Power Co Ltd; NARI Information and Communication Technology Co; Yangzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd; State Grid Electric Power Research Institute; State Grid Corp of China SGCC
Current assignee: State Grid Jiangsu Electric Power Co Ltd; NARI Information and Communication Technology Co; Yangzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd; State Grid Electric Power Research Institute; State Grid Corp of China SGCC
Priority date: 2024-05-27
Filing date: 2024-05-27
Publication date: 2024-08-20

Abstract

The invention discloses a data anomaly detection method and a system of a distributed network, wherein the method acquires the data of each distributed node in the distributed network, divides the data into a plurality of categories, extracts the characteristics in each category, and obtains the fusion characteristics of the data in all the categories through weighted average; calculating a plurality of statistic weighted summation of the data to be used as an index for evaluating the fusion characteristics, and selecting a plurality of fusion characteristics as the characteristics of the node; each distributed node obtains normal data points and abnormal data points through clustering; the distributed network builds a graph structure according to the characteristics and the connection relation of the nodes, and judges whether each node is an abnormal node or not through a graph neural network model. The invention can process multi-mode data in the distributed network and perform proper feature extraction, and can accurately detect abnormal nodes by establishing a graph structure in the distributed network and utilizing a normal data point and abnormal data point training model of the nodes.

Description

Data anomaly detection method and system for distributed network

Technical Field

The invention relates to network data anomaly detection, in particular to a data anomaly detection method and system of a distributed network.

Background

The continuous progress of the Internet of things technology, the cloud computing technology and the big data analysis technology provides reliable connection and communication, strong computing and storage capacity and capacity of efficiently processing mass data, so that the explosive development of the distributed Internet of things is promoted, and the scenes of real-time monitoring, intelligent decision making and optimal management are realized. However, these physical devices may fail or even suffer from attacks, so accurately identifying and solving these problems is critical to improving system efficiency and performance.

Traditional fault detection methods are mainly based on physical models and rules, but the methods generally require accurate system parameters and complex mathematical models, and limit applicability and expandability. At the same time, however, the extraction of the features of the anomaly pattern may be limited by the quality of the data, which may lead to a decrease in the accuracy of anomaly detection if the quality of the data is poor or inaccurate. Meanwhile, data representation learning is a core stage of anomaly detection technology. Different failure modes may require different feature representation methods, so how to select the appropriate features and perform efficient representation learning is a key issue.

In addition, when dealing with complex anomaly patterns, conventional techniques model relationships in a distributed system based on simplified assumptions or rules, failing to accurately capture interactions and dependencies between complex nodes, and may fail to provide accurate anomaly detection results. Meanwhile, the prior art often focuses on the characteristics of the node itself, and less consideration is given to the fact that the context information around the node may have limited characteristic representation capability. Finally, these techniques may require manual adjustment of parameters or updating of rules to accommodate the new anomaly pattern. This non-real time nature and lack of adaptivity may lead to reduced performance of conventional techniques in handling distributed anomaly detection.

Disclosure of Invention

The invention aims to: the invention aims to provide a data anomaly detection method and system for a distributed network, which improve anomaly detection performance in a distributed scene through data security modeling and attack anomaly detection based on a neural network.

The technical scheme is as follows: the data anomaly detection method of the distributed network comprises the following steps:

Acquiring data of each distributed node in a distributed network and carrying out standardization processing to obtain a standardized data set of each distributed node;

in the standardized data set, dividing the data into a plurality of categories, extracting features in each category, and obtaining fusion features of the data in all the categories through weighted average;

Calculating a plurality of statistics of the data in the standardized data set, taking the weighted sum of the statistics as an index for evaluating the fusion characteristics, and selecting a plurality of fusion characteristics as characteristics of the distributed node according to the index value;

each distributed node obtains normal data points and abnormal data points through clustering;

The distributed network builds a graph structure according to the characteristics of each distributed node and the connection relation of the distributed nodes, trains a graph neural network model by using normal data points and abnormal data points of each distributed node as labels, and judges whether each distributed node is an abnormal node or not through the trained graph neural network model.

Further, in the standardized dataset, classifying the data into a plurality of categories, extracting features in each category, and obtaining fusion features of the data in all categories through weighted average includes: in the standardized dataset, data are classified into several categories by a density clustering method.

Further, in the standardized dataset, classifying the data into a plurality of categories, extracting features in each category, and obtaining fusion features of the data in all categories through weighted average includes: and extracting features from each category by a kernel principal component analysis method, and obtaining fusion features of data in all categories by weighted average.

Further, the calculating the plurality of statistics of the data in the standardized dataset, the weighting and summing the plurality of statistics to be used as an index for evaluating the fusion characteristics, and selecting the plurality of fusion characteristics as the characteristics of the distributed node according to the index value includes: calculating three statistics of the data in the standardized dataset, namely T ² statistics, SPE statistics and SWE statistics; the three statistics are normalized, weighted and summed to obtain an index for evaluating the fusion characteristic; and selecting fusion characteristics with index values exceeding a threshold value as characteristics of the distributed node, or selecting a plurality of fusion characteristics before as characteristics of the distributed node according to the sequence from high index values to low index values.

Further, the obtaining the normal data point and the abnormal data point by the distributed nodes through clustering includes: the distance measurement of the clusters is the distance between data points, the distance is calculated according to the similarity between the data points, and the calculation method is as follows:

Wherein, AndRepresenting any two of the feature vectors,AndFor posterior probability values, m is the dimension of the feature vector, delta ^U is the allowed dissimilarity, the value is between 0 and 1, alpha is a constant, and beta is an adjustment parameter.

The step of obtaining normal data points and abnormal data points by clustering the distributed nodes comprises the following steps:

Setting the cluster number K _max, and distributing each data point to the cluster to which the cluster center closest to the data point belongs;

And selecting a cluster center point in each cluster, selecting the data point with the largest density from the cluster with the number of 1 as a new cluster center if the number of the cluster center points is smaller than K _max, and removing the data point from the cluster with the number of 1.

The invention relates to a data anomaly detection system of a distributed network, which comprises:

the data acquisition unit is used for acquiring the data of each distributed node in the distributed network and carrying out standardized processing to obtain a standardized data set of each distributed node;

The feature extraction unit is used for dividing the data into a plurality of categories in the standardized data set, extracting features in each category, and obtaining fusion features of the data in all the categories through weighted average;

The abnormal data point detection unit is used for calculating a plurality of statistics of the data in the standardized data set, weighting and summing the statistics to be used as an index for evaluating the fusion characteristics, and selecting the fusion characteristics as the characteristics of the distributed node according to the index value; each distributed node obtains normal data points and abnormal data points through clustering;

The abnormal node detection unit is used for constructing a graph structure in the distributed network according to the characteristics of each distributed node and the connection relation of the distributed nodes, training a graph neural network model by using normal data points and abnormal data points of each distributed node as labels, and judging whether abnormal data appear in each distributed node through the trained graph neural network model.

The electronic equipment comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the computer program realizes the data anomaly detection method of the distributed network when being loaded to the processor.

The computer readable storage medium of the present invention stores a computer program which, when executed by a processor, implements the data anomaly detection method of the distributed network.

The beneficial effects are that: compared with the prior art, the invention has the advantages that: (1) The invention simplifies the complex information data into key features, can process multi-mode data, performs feature extraction and dimension reduction on different types of data, can rapidly screen out the most valuable features, and reduces the data volume and the calculation complexity; meanwhile, the multi-statistic weighted evaluation considers the overall outlier degree and importance of the data, and can help to identify and process abnormal values in the data, so that the data quality and the robustness of the model are improved; (2) According to the method, the normal data point and the abnormal data point are identified in each node, the graph structure is built in the distributed network, the graph neural network model is trained through the normal data point and the abnormal data point data of each node, and further the abnormal node can be identified.

Drawings

FIG. 1 is a schematic diagram of the multi-mode feature extraction step of the present invention.

FIG. 2 is a schematic representation of the feature selection based on three statistics of the present invention.

FIG. 3 is a flowchart of a clustering algorithm based on similarity measurement according to the present invention.

Fig. 4 is a schematic diagram of abnormal behavior detection of a distributed node based on a graph neural network according to the present invention.

Detailed Description

The technical scheme of the invention is further described below with reference to the accompanying drawings.

The data anomaly detection method of the distributed network is characterized in that feature extraction and selection are carried out on data collected by distributed nodes, key features are obtained through multi-mode feature extraction and kernel function-based data dimension reduction, and clustering is carried out by using a fuzzy Gaussian member function as a distance measurement algorithm, so that anomaly data are removed. And then, constructing a graph neural network by utilizing the relation among a plurality of distributed nodes, and judging whether the operation and interaction behavior of the nodes are abnormal or not. Comprises the following steps.

Step 1: the data is first processed by feature extraction, from which the most sensitive and relevant feature information for anomaly detection is extracted. These features generally reflect certain key aspects such as correlations between variables, abnormal patterns or trends, etc., to better distinguish between normal operation and abnormal conditions. In order to distinguish whether a transition in the data is due to a change in the mode of operation or an attack, the present invention uses a multi-mode design to sort multiple operating point data simultaneously to reduce the false alarm rate.

The invention adopts a feature extraction method based on clustering and a kernel principal component analysis algorithm (KPCA) for performing dimension reduction and feature mapping on data. The method comprises the steps of firstly using a clustering algorithm to divide data into different categories, then using KPCA to extract characteristics of clustered data, and then fusing the characteristics extracted from each cluster. The multi-mode feature extraction can simultaneously consider data of a plurality of modes, cluster analysis is carried out in a feature extraction stage, and the data of different modes are grouped and the features of the data are extracted.

Step 1.1: the raw data is normalized so that the features of each dimension are of equal importance. The normalization formula is as follows:

where X is raw data, mean (X) is the mean of X, std (X) is the standard deviation of X.

Step 1.2: data is classified into different categories using a density clustering algorithm. The method comprises the following specific steps:

(1) The neighborhood radius eps and minimum number of samples parameter min _s amples are initialized. The neighborhood radius eps is used to determine the neighborhood relationship between samples and the minimum number of samples min _s amples is used to determine the minimum number of neighborhood samples for the core point.

(2) The following operations are performed for each sample: the distance d (p _i,p_j) between the sample and the other samples is calculated. Judging whether the sample is a core point or not, wherein the formula of the judging condition is d (p _i,p_j). Ltoreq.eps, and the neighborhood of p _i at least contains min _s amples sample points. If the core point is the core point, the core point and the samples in the adjacent area are classified into one type. If not, it is marked as a noise point or merged with other classes.

Step 1.3: and performing feature extraction on the clustered data by using a KPCA algorithm. KPCA is a nonlinear dimension reduction method that uses radial basis function kernels (also known as gaussian RBF kernels) for feature mapping. The method comprises the following specific steps:

(1) A kernel matrix K is calculated, where K _ij represents the kernel similarity between samples i and j. The kernel matrix can be calculated using the gaussian RBF kernel function as follows:

Where X _std [ i ] represents the normalized ith sample, σ is the parameters of the Gaussian kernel, which can be determined by the data variance and dimension.

(2) The kernel matrix K is centered so that each element subtracts the mean value of all elements, the centering formula is as follows: where n is the number of samples and I is the full 1 vector.

(3) And carrying out eigenvalue decomposition on the core matrix K' after centering to obtain eigenvalues and corresponding eigenvectors.

(4) And selecting the eigenvectors corresponding to the previous k eigenvalues to form a projection matrix V'.

(5) Mapping the standardized data X _std to obtain the dimension-reduced data X _kpca. The formula of the mapping is: x _kpca＝X_std.V ', wherein X _std is normalized data, and V ' is a feature vector corresponding to the first k ' feature values.

Step 1.4: and (3) giving weights to the features in different clusters, and carrying out weighted average on the features to obtain the fused features. The weight can be set according to the importance of the clusters, and before further processing, the features in the clusters can be considered to represent the data representation in different modes, so the weight of the features can be set asWhere n _c is the number of clusters.

Step 2: after feature extraction is completed, more appropriate and efficient features should be selected. Feature selection provides a subset of features of meaningful information for classification problems, aiming to improve quality with minimal information loss. The eigenvalue matrix, eigenvector and principal component matrix obtained after passing through the KPCA feature can be expressed as:

P＝[P_l P_m-l],T＝[T_l T_m-l]；

where l represents the number of principal components retained in the principal component analysis model.

Step 2.1: considering the top l highest eigenvalues and their corresponding eigenvectors, the matrix X can be expressed as: I.e. T _l＝XP_l, whereas E is the residual matrix.

For a certain sample vectorThree statistics are calculated in turn: the Hotelling's T-squared distribution, T ² square distribution is a multivariate statistic used to measure the distance between a sample point and a sample mean in multidimensional space. The formula for calculating the T ² statistic is:

The squared prediction error (Squared prediction error, SPE) statistic is a statistic used in spectral analysis to measure the difference between the spectral energy of a sample and the average spectral energy. It may be used to detect anomalies or anomalous patterns of signals; the formula for calculating SPE statistics is:

the weighted error squared (Squared weighted error, SWE) statistic may measure the dispersion or complexity of the spectral energy distribution. The formula for calculating the SWE statistic is:

Step 2.1: data is linearly transformed into a specified range using Min-Max scaling, which is formulated for a statistic (e.g., T ², SPE, or SWE) as follows:

where x is the value of the original statistic, min (x) is the minimum value of the statistic, and max (x) is the maximum value of the statistic.

Step 2.3: and carrying out weighted comprehensive evaluation on the three normalized statistics to obtain a comprehensive characteristic importance index. Each statistic may be assigned an appropriate weight according to the specific problem and requirement, or may be comprehensively evaluated using equal weights. A weighted summation approach may be used:

wherein the method comprises the steps of W _SPE and w _SWE are corresponding weights for adjusting the importance of different statistics in the overall evaluation.

Step 2.4: and selecting the features with higher importance as a final feature set according to the comprehensive feature importance indexes. A threshold η may be set and only features whose index exceeds the threshold η are selected or the top t features are selected.

Step 3: the feature selection can be used for obtaining the data after dimension reduction, and the next step is to learn the characteristics of normal data and abnormal attack data from the data so as to distinguish the normal data and the abnormal attack data.

The invention adopts the anomaly detection based on distance measurement, and the core idea is to judge whether the data points are abnormal or not by calculating the distance between the data points, namely, the distance between normal data points is smaller, and the distance between the abnormal data points and the normal data points is larger. A distance measure is first defined for measuring the similarity between data points. By calculating the distance between the data point and other data points, a distance matrix or distance vector can be obtained, and the data points are divided into normal data points and abnormal data points according to the distance.

Is provided withAndRepresenting any two vectors in the form ofAndWherein the method comprises the steps ofIs a posterior probability value, andVector quantityAndIs equal to m. Delta ^U is a predefined allowed distance. Arbitrary two vectorsAndThe dissimilarity value between may be calculated by the following function, where α= 0.3679:

Wherein, Delta ^f is calculated by the following formula, where delta ^U is the allowed dissimilarity, and the value is between 0 and 1:

Step 3.1: for each data point a, calculating the similarity between the data point a and other data points b according to the fuzzy Gaussian membership function, wherein the similarity calculation formula is as follows:

Step 3.2: and (3) for all the data points, executing the step 3.1, calculating the similarity between all the data points, and calculating the distance between two points by using an exponential function to obtain a final distance matrix. Specifically, the distance between two points Where β is an adjustment parameter, the smaller its value, the more significant the effect of similarity will be, and the more sensitive the change in distance will be, and vice versa.

Step 3.3:

(1) Selecting a cluster number K _max: selecting an appropriate number of clusters, K _max.K_max, represents dividing the data points into K _max clusters, according to the characteristics and requirements of the data.

(2) The distance parameter δ ^U of the fuzzy gaussian membership function is specified in advance.

(3) The points with obvious offset at the mark are outliers according to the density information

(4) Initializing a clustering center: 2 data points were randomly selected as initial cluster centers.

(5) Assigning data points to nearest cluster centers: and (3) according to the distance matrix obtained in the step 3.2, each data point is allocated to the cluster to which the cluster center closest to the data point belongs.

(6) The average value of each class is calculated as the virtual center point of the current cluster.

(7) Updating a clustering center: if the points with the coincident virtual centers exist in the current class, setting the virtual center point as a clustering center point; otherwise, selecting the point closest to the virtual center point in the class and farthest from the outlier as a new clustering center point.

(8) If the number of cluster center points is less than K _max, selecting the point with the greatest density in the first set as a new cluster center, and removing the whole point from the first set. The first set is the first cluster in all clusters which are obtained currently, namely a certain set with the number of 1 which is obtained through a clustering step, wherein the point with the highest density is the point with the most neighbors in the cluster or the cluster.

(9) Repeating the steps (5) to (8) until no change occurs in the cluster center or a specified number of iterations is reached.

Step 3.4:

(1) Calculating a distance threshold value: according to the characteristics and the requirements of the data, an appropriate abnormal threshold value is selected.

(2) Computing data points and their belongings distance of clustering center: for each data point, the distance between it and the cluster center to which it belongs is calculated according to the formula in step 3.2.

(3) Identifying outlier data points: data points whose distance exceeds an anomaly threshold are identified as outlier data points. These data points are far from the cluster center to which they belong and are considered outlier data points.

Step 4: step 3.4 the marked outlier data points may represent outlier or offending behavior in the system, by finding similarities and aggressors inside the data, the similar data points are grouped into the same cluster. This process can help understand the structure and distribution of data and find patterns and outliers therein. However, clustering algorithms do not directly capture complex relationships and interactions between data points. Thus, subsequent machine learning algorithms may be introduced to further capture data features in order to more accurately identify patterns of attacks or anomalies. The graph neural network is a machine learning model suitable for graph structure data, and can learn the characteristics of nodes and edges and capture local and global characteristics in the graph through the relationship between the nodes. In an anomaly detection scene, the graph neural network can further analyze complex relationships among nodes in the data, and predict the nodes and detect anomalies.

In a distributed environment, information and communications need to be shared between nodes through messaging and aggregation. Each node can transmit own characteristic information to the neighbor nodes and aggregate the information transmitted by the neighbor nodes, namely, the interaction of the information exists among the nodes. Thus, in the present invention constructed neural network of the graph, nodes represent entities or objects in the graph, such as individual data points, devices, users, or other entities in the system. Each node may have its own attributes such as feature vectors extracted and selected according to the present invention. Edges represent connections or relationships between nodes and may represent direct associations, interactions, or dependencies between nodes, such as communications, data flows, dependencies, etc. between nodes. By combining the nodes and the edges, a complete graph structure is formed. The graph structure may demonstrate relationships, interactions, and connections between nodes, providing more comprehensive information for anomaly detection and model training.

Step 4.1: (1) Each participant calculates respective feature vectors according to the feature extraction and selection steps, and sends the node features to a trusted central server.

(1) The central server gathers node characteristics uploaded by the various participants (nodes).

(2) And the central server constructs a global graph structure according to the data and the connection relation of the participants. Each participant may be represented as a node in the graph and the connection between the participants may be represented as edges.

Step 4.2: training is performed on the graph neural network model on a central server using the constructed global graph and node features as inputs. The model outputs an anomaly probability for each node, indicating the probability that the node belongs to anomalies. The anomaly probability is a continuous value between 0 and 1, representing the confidence of the node anomaly. In the training stage, supervised learning training can be performed by using the anomaly samples with labels to obtain anomaly scores or anomaly probabilities of the nodes.

Step 4.3:

(1) In the reasoning stage, a global graph structure is constructed by using the new data, and the characteristics of the nodes are extracted. This can be done according to the graph construction method and feature extraction method in the previous training process. And taking the constructed global graph and node characteristics as input, and carrying out reasoning and anomaly detection through a model.

(2) Abnormality determination: according to the anomaly probability, the anomaly state of the node can be determined according to a preset threshold value or other judgment criteria, and the node with the anomaly score higher than the threshold value is identified as the anomaly node.

Nodes in a distributed network are typically distributed sensor nodes, the data of which often comprises multiple modalities, e.g. different sensors, different time scales, data in different modes of operation, etc. In this embodiment, taking a distributed photophysical environment monitoring network as an example, the distributed network collects data through different distributed sensor nodes, and it is assumed that three sensors exist in a current node and are respectively used for monitoring temperature, humidity and illumination intensity. The data anomaly detection method of the distributed network comprises the following steps.

Step one, as shown in fig. 1, node a collects data in three different modes of operation, namely temperature, humidity and illumination intensity (respectively identified as circles, triangles and rectangles in fig. 1). For example, in a temperature mode of operation, the sensor may collect temperature data at a higher frequency and be sensitive to temperature changes, and other sensors (humidity and illumination intensity) may be at a lower sampling frequency. The collected data and the data in the other two operation modes are obviously different, so that each operation mode is regarded as a clustering type, and the dimension of the data in each type is reduced to extract the characteristics in the operation mode.

In a certain operation category, the kernel-based data dimension reduction method of the invention is to be executed. Sensor data typically has complex nonlinear characteristics. While conventional principal component analysis can only deal with linear features, principal component analysis based on kernel functions can effectively deal with nonlinear features by mapping data to a high-dimensional feature space using kernel functions. For data collected by the sensor, the following operations are performed:

(1) Assuming that the number of samples collected is n, each sample therein has a corresponding number

{1,2, …, I, …, j, …, n }. For sample i, the kernel matrix is normalized to give X _std [ i ] before it is calculated.

(2) For n samples, a standard deviation X _var of the data was calculated and 0.15X _var was taken as the parameter σ of the gaussian kernel.

(3) A kernel matrix K of the n samples is calculated, each element K _ij of the matrix representing the kernel similarity K _ij between samples i and j.

(4) The core matrix K is centered to obtain K'.

(3) And carrying out eigenvalue decomposition on the core matrix K' after centering to obtain an eigenvalue X _egv and a corresponding eigenvector V.

(4) And selecting the eigenvectors corresponding to the first k 'eigenvalues of X _egv to form a projection matrix V'.

(5) The data after dimension reduction are: x _kpca＝X_std.V'.

Assume that three clusters of data are reduced in dimension to obtain three sets of features: For each set of features, it is necessary to normalize them to ensure that they have similar dimensions and ranges. The Robust normalization is a normalization method which is Robust to outliers and is applicable to the situation that more outliers exist in sensor data:

(1) Median X _median of X _kpca was calculated: for a dataset with n ₁ data, the median is the value of (n ₁ +1)/2 data. If n ₁ is an odd number, then the median is uniquely determined; if n ₁ is an even number, then the median is the mean of the n ₁/2 and (n ₁/2) +1 data.

(2) Absolute median deviation X _mad of X _kpca was calculated: the absolute median deviation is the median of the absolute difference of each data from the median.

For a given dataset, the absolute difference of each data from the median is calculated, |x _kpca-X_median |.

(3) The median of the above difference is calculated, namely:

X_mad＝median(|X_kpca-X_median|),X_robust＝(X-X_median)/X_mad。

(4) Assume that three clusters of data are normalized to obtain three sets of features: and carrying out weighted splicing on each group of characteristics according to the determined weight. Specifically, each set of features may be stitched by column and then multiplied by a corresponding weight. The weighted splice fused features can be expressed as:

Step two, fig. 2 is a schematic diagram of a feature selection method based on three statistics in the present invention. Sensor data is typically generated in a continuous stream, and is voluminous in data, so feature selection is required. By evaluating and sorting a large number of features by calculating statistics, the most valuable features can be rapidly screened out, and the data volume and the calculation complexity are reduced. At the same time, the sensor data may be affected by noise, drift, or outliers. Statistics and weighted evaluation in this feature selection method take into account the overall outlier degree and importance of the data, and can help identify and process outliers in the data, thereby improving data quality and model robustness.

The feature extraction step extracts a series of features from the sensor data. These features include statistical features (e.g., mean, variance, skewness, kurtosis, etc.), frequency domain features (e.g., power spectral density, frequency features, etc.), time domain features (e.g., autocorrelation, cross-correlation, etc.), and the like. For statistical features, such as mean, variance, skewness, kurtosis, etc., the SWE may measure the difference between features by calculating the square of the weighted euclidean distance between them; the SPE calculates the square of the model prediction error, measures the fitting effect of the prediction model, and can help evaluate the prediction capability of different frequency domain feature subsets; for time domain features, such as auto-correlations, cross-correlations, etc., T ² may help evaluate the distinctions and representativeness of different time domain feature subsets, and covariance matrices and mean vectors of feature subsets may be calculated to measure their location and outliers in the multidimensional space. The construction process of the feature subset of the sensor data is as follows:

(1) After feature extraction, a sensor data sample vector x (k) ∈r ^m is collected, where k represents the sample index and m represents the feature dimension.

(2) Three statistics of x (k) are computed in turn: t ², SPE, and SWE.

(3) For each statistic, the data is linearly transformed into a specified range using Min-Max scaling to eliminate dimensional differences in the data.

(4) And carrying out weighted comprehensive evaluation on the three normalized statistics to obtain a comprehensive characteristic importance index. Each statistic may be assigned an appropriate weight, or may be comprehensively evaluated using equal weights, depending on the particular problem and requirements.

Score＝w_T2·T²+w_SPE·SPE+w_SWE·SWE

(5) And selecting the features with higher importance as a final feature set according to the comprehensive feature importance indexes. A threshold η may be set and only features whose index exceeds the threshold η are selected: x _select = { f|score (f) > η }, or the top t features that are top ranked may be selected as the final feature set: x _select = { f|rank (f) > t }.

X _select denotes the final selected feature set, f denotes the individual features in the fused feature X _fusion, f=

X _fusion [: i ], represents the extraction of data for column i from the fused feature set X _fusion. Score (f) represents the overall feature importance index for feature f, rank (f) represents the Rank of feature f in the feature importance index ranking, and t represents the number of features that need to be selected.

By calculating the three statistics and performing weighted evaluation, the feature with the highest evaluation score can be selected, so that the dimension of the data can be reduced, redundant features and noise can be removed, the effect and the interpretation of the model can be improved, and the understanding and the utilization of the sensor data can be facilitated.

Step three, fig. 3 is a flowchart of a clustering algorithm based on similarity measurement in the present invention. The detection and elimination of abnormal data can be carried out by utilizing the similarity-based and improved k-means clustering algorithm based on the invention on the premise that a group of sensor temperature data is subjected to feature extraction and selection, and the specific implementation flow is as follows:

(1) The number of clusters K _max is selected, assuming that K _max is 3, i.e. it is desired to divide the data into 3 clusters.

(2) The distance parameter δ ^U of the fuzzy gaussian member function is specified in advance, and is assumed to be set to 0.5.

(3) Points marked with significant offset are outliers based on the density information. For example, if a certain temperature data is less dense with surrounding data points, it may be marked as an outlier.

(4) Two temperature data points were randomly selected as initial cluster centers.

(5) Assigning data points to nearest cluster centers: each data point is assigned to the cluster to which the closest cluster center belongs.

(6) Calculating the average value of each class as the virtual center point of the current cluster: for the temperature data points within each cluster, the average value is calculated as the virtual center point of the current cluster.

(7) Updating a clustering center: and updating the clustering center according to the information of the virtual center point and the outlier. If the points with the coincident virtual centers exist in the current class, setting the virtual center point as a clustering center point; otherwise, the point closest to the virtual center point and farthest from the outlier in the class is selected as a new cluster center point.

(8) If the number of cluster center points is less than 3, the most dense points in the first set may be selected as new cluster centers and the entire point removed from the first set.

(10) Finally, according to the clustering result, the outliers and the outliers can be marked. For example, data points that are more than a threshold distance from the nearest cluster center point may be defined as outliers, while data points that are farther from other data points and do not belong to any cluster may be defined as outliers.

(11) In order to reduce the influence of the cluster data, the number of clusters may be appropriately increased, and each cluster center may be selected as representative data of the class to participate in subsequent operations.

The method fully considers the characteristics of the sensor data, can better control the fuzzy degree of clustering by predefining the distance parameter, ensures that the data points have greater flexibility when being distributed to the clustering center, and reduces the hard classification of the algorithm to the data points. By using density information to label outliers, outliers in the data can be better handled. Therefore, the influence of abnormal values on the clustering result can be reduced, and the accuracy and stability of clustering are improved. Meanwhile, two data points are randomly selected as an initial clustering center, so that the sensitivity of the traditional k-means algorithm to the initial clustering center can be avoided, and the stability and convergence rate of the algorithm are improved. When the clustering data is smaller than K _max, the method increases the clustering centers by selecting the first aggregation point, so that the number of the clustering centers can be dynamically increased according to the density information, and the flexibility and the adaptability of an algorithm are improved.

Step four, fig. 4 is a schematic diagram of detecting abnormal behaviors of distributed nodes based on a graph neural network. The distributed photophysical information system is assumed to be composed of a plurality of photovoltaic panels, and each panel is provided with a sensor node for collecting data.

(1) And each sensor node calculates the characteristic vector of the node according to a predefined characteristic extraction and selection method. For example, characteristics of the temperature, voltage, and current of the panel may be extracted. For example, the feature vector calculated by sensor node a is X _a, the feature vector calculated by sensor node B is X _b, and so on.

(2) Each sensor node transmits the calculated feature vector to a trusted central server. Sensor node a sends feature vector X _a to the central server, sensor node B sends feature vector X _b to the central server, and so on.

(3) The central server collects feature vectors from the individual sensor nodes. The central server collects feature vectors from sensor node a, node B, etc.

(4) And the central server constructs a global graph structure according to the data and the connection relation of the sensor nodes. Each sensor node can be used as one node in the graph, and the connection relation between the sensor nodes can be expressed as an edge, so that a global graph structure for representing the whole photovoltaic power generation system is obtained.

(5) Training is performed on the graph neural network model on a central server using the constructed global graph and node features as inputs. The graph neural network model takes the characteristics of each node as input, and outputs an anomaly probability for each node, which represents the probability that the node belongs to anomaly. In the training stage, supervised learning training can be performed by using the anomaly samples with labels to obtain anomaly scores or anomaly probabilities of the nodes.

In implementing step (5), new physical data may be inferred and anomaly detected using the trained model. According to the anomaly probability, the anomaly state of the node can be determined according to a preset threshold value or other judgment criteria, and the node with the anomaly score higher than the threshold value is identified as the anomaly node.

The method can be applied to various fields such as agricultural modernization, medical health, military national defense and the like. For example, in a photovoltaic power generation system, a wireless sensor network may be used to collect key parameter information such as temperature, voltage, current, etc. of a solar panel. Valuable features such as temperature change trend of the photovoltaic cell panel, voltage fluctuation and the like can be extracted from the original data through a data feature extraction technology. By selecting the appropriate features, an accurate model can be built to describe the operating state of the photovoltaic power generation system. The abnormality detection algorithm can timely find abnormal conditions in the photovoltaic power generation system, such as abnormal temperature rise or abnormal voltage fluctuation. For example, when the temperature of the photovoltaic panel is abnormally increased, there may be a fault or damage condition, and timely maintenance or replacement is required. By accurately extracting and selecting the characteristics and combining an abnormality detection algorithm, the abnormal conditions can be rapidly identified, so that the reliability and efficiency of the photovoltaic power generation system are improved. Meanwhile, through analyzing the characteristic data, potential problems in the photovoltaic system can be identified, and corresponding measures are taken to prevent faults and optimize the system performance. For example, according to the characteristics of voltage fluctuation, the inclination angle or the cleaning degree of the photovoltaic cell panel is adjusted so as to improve the power generation efficiency of the system. Therefore, the method for detecting the data modeling abnormality has important application value in the field of information.

The computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer.

The processor is configured to execute the computer program stored in the memory to implement the steps in the method according to the above-mentioned embodiments.

Claims

1. A method for detecting data anomalies in a distributed network, comprising the following steps:

The data of each distributed node in the distributed network is obtained and standardized to obtain a standardized data set of each distributed node;

In the standardized data set, the data is divided into several categories, features are extracted in each category, and fusion features of the data in all categories are obtained by weighted averaging;

Calculate several statistics of the data in the standardized data set, and use the weighted sum of the several statistics as an index for evaluating the fusion feature, and select several fusion features as the features of the distributed node according to the index value;

The distributed network constructs a graph structure based on the characteristics of each distributed node and the connection relationship of the distributed nodes. The normal data points and abnormal data points of each distributed node are used as labels to train the graph neural network model. The trained graph neural network model is used to determine whether each distributed node is an abnormal node.

2. The data anomaly detection method for a distributed network according to claim 1 is characterized in that in the standardized data set, the data is divided into several categories, feature extraction is performed in each category, and the fusion features of the data in all categories are obtained by weighted average, including:

In the standardized data set, the data are divided into several categories by density clustering method.

3. The data anomaly detection method for a distributed network according to claim 1 is characterized in that in the standardized data set, the data is divided into several categories, feature extraction is performed in each category, and the fusion features of the data in all categories are obtained by weighted average, including:

In each category, feature extraction is performed using the kernel principal component analysis method, and the fusion features of the data in all categories are obtained by weighted averaging.

4. The data anomaly detection method for a distributed network according to claim 1 is characterized in that the calculating of several statistics of the data in the standardized data set, the weighted sum of several statistics as an index for evaluating the fusion feature, and the selecting of several fusion features as the features of the distributed node according to the index value comprises:

Calculating three statistics of the data in the standardized data set, namely, ^T2 statistic, SPE statistic and SWE statistic; normalizing the three statistics and performing weighted summation to obtain an index for evaluating the fusion feature;

The fused features whose index values exceed the threshold are selected as the features of the distributed node, or the first several fused features are selected as the features of the distributed node by sorting the index values from high to low.

5. The method for detecting data anomaly in a distributed network according to claim 1, wherein the step of obtaining normal data points and abnormal data points by clustering each distributed node comprises:

The distance metric of the clustering is the distance between data points, and the distance is calculated based on the similarity between the data points. The calculation method is:

in, and represents any two eigenvectors, and is the posterior probability value, m is the dimension of the feature vector, δ ^U is the allowed dissimilarity, ranging between 0 and 1, α is a constant, and β is the adjustment parameter.

6. The method for detecting data anomalies in a distributed network according to claim 5, wherein the step of obtaining normal data points and abnormal data points by clustering each distributed node comprises:

Set the number of clusters K _max and assign each data point to the cluster to which the cluster center closest to it belongs;

Select cluster centers in each cluster. If the number of cluster centers is less than K _max , select the data point with the largest density from cluster 1 as the new cluster center and remove the data point from cluster 1.

7. A distributed network data anomaly detection system, comprising:

A data acquisition unit is used to obtain data of each distributed node in the distributed network and perform standardized processing to obtain a standardized data set of each distributed node;

A feature extraction unit, used to classify the data in the standardized data set into several categories, extract features in each category, and obtain fused features of the data in all categories by weighted average;

An abnormal data point detection unit is used to calculate several statistics of the data in the standardized data set, and the weighted sum of the several statistics is used as an index for evaluating the fusion feature, and several fusion features are selected as the features of the distributed node according to the index value; each distributed node obtains normal data points and abnormal data points by clustering;

The abnormal node detection unit is used to construct a graph structure in the distributed network according to the characteristics of each distributed node and the connection relationship of the distributed nodes, use the normal data points and abnormal data points of each distributed node as labels to train the graph neural network model, and use the trained graph neural network model to determine whether abnormal data appears in each distributed node.

8. The data anomaly detection system for a distributed network according to claim 7 is characterized in that in the standardized data set, the data is divided into several categories, feature extraction is performed in each category, and the fusion features of the data in all categories are obtained by weighted average, including:

9. The data anomaly detection system for a distributed network according to claim 7 is characterized in that in the standardized data set, the data is divided into several categories, feature extraction is performed in each category, and the fusion features of the data in all categories are obtained by weighted average, including:

10. The data anomaly detection system for a distributed network according to claim 7, characterized in that the calculating of several statistics of the data in the standardized data set, the weighted sum of several statistics as an index for evaluating the fusion feature, and the selecting of several fusion features as the features of the distributed node according to the index value include:

11. The data anomaly detection system of a distributed network according to claim 7, wherein the normal data points and abnormal data points obtained by clustering at each distributed node include:

12. The data anomaly detection system of a distributed network according to claim 11, wherein the normal data points and abnormal data points obtained by clustering at each distributed node include:

13. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the computer program is loaded into the processor, the method for detecting anomalies in a distributed network according to any one of claims 1 to 6 is implemented.

14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data anomaly detection method for a distributed network according to any one of claims 1 to 6.