
CN116303386B - A method and system for intelligent imputation of missing data based on relation graphs - Google Patents

A method and system for intelligent imputation of missing data based on relation graphs

Info

Publication number
CN116303386B
Authority
CN
China
Prior art keywords
missing
variables
imputation
data
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310146169.7A
Other languages
Chinese (zh)
Other versions
CN116303386A (en
Inventor
廖伟
夏欢
陈肇欣
潘野
张涛
郑奕
薛方冉
陈哲
晏楠欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Second Research Institute of CAAC
Original Assignee
Second Research Institute of CAAC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Second Research Institute of CAAC filed Critical Second Research Institute of CAAC
Priority to CN202310146169.7A priority Critical patent/CN116303386B/en
Publication of CN116303386A publication Critical patent/CN116303386A/en
Application granted granted Critical
Publication of CN116303386B publication Critical patent/CN116303386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)

Abstract

This invention relates to the field of information processing technology, specifically to an intelligent imputation method and system for missing data based on relational graphs. Generally, it is based on the idea of regression imputation, introducing a relational graph of the data as the input control strategy for the missing value prediction model; it employs an improved neural network model, allowing the same model to be used for the prediction of missing values for multiple variables; and for scenarios with "large-scale and large-proportion" missing data, it constructs a highly reliable imputation order control strategy and a secondary imputation strategy. In summary, this invention reduces the complexity of the imputation system while improving the computational efficiency of the imputation process.

Description

Method and system for intelligent imputation of missing data based on relation graphs
Technical Field
The invention relates to the technical field of information processing, and in particular to a method and system for intelligent imputation of missing data based on a relation graph.
Background
With the wide adoption of machine learning and digital twin technology, software systems depend far more heavily on data, which places higher demands on the completeness and trustworthiness of input data. Because of defects in acquisition and storage, however, raw data is frequently missing, so imputing missing data is a problem the engineering field cannot avoid.
The prior art mainly comprises hot-deck imputation, regression imputation and multiple imputation, wherein:
Hot-deck imputation finds, in the complete data, the object most similar to the one with the missing value; when more than one similar object is found, one of the matching objects is selected at random to supply the fill value. The method is conceptually simple and uses the relationships within the data to estimate null values, but the similarity criterion is hard to define precisely and is strongly influenced by subjective factors.
Multiple imputation assumes that values are missing at random. A multiple imputation algorithm such as MICE first estimates the values to be imputed by regression imputation, then simulates noise to form several candidate sets of imputed values, and finally compares the resulting data sets with the original data set and selects the one whose distribution deviates least from the original as the final result. Multiple imputation can only handle data that is missing at random, cannot handle data that is missing not at random, and requires a large amount of computation.
Regression imputation uses supervised machine learning models, such as regression, nearest neighbor, random forest or support vector machine models, to build a prediction model from the complete data set, and feeds the known attributes into the model to predict the missing attributes.
Specifically, regression imputation builds one missing value prediction model per variable. In big data scenarios with "many tables and many fields", modeling every variable consumes substantial resources and greatly increases system complexity. In addition, during both model training and actual prediction, regression imputation takes all variables other than the target variable as input, which consumes a great deal of computing power and time and makes the prediction depend on all of those variables.
Disclosure of Invention
The invention aims to provide a one-stop data imputation technique. Building on the idea of regression imputation, it introduces a relation graph between the data as input for training a missing value prediction model, so that a single model can predict the missing values of multiple variables. By constructing an imputation order control strategy that takes into account the missing range and the missing correlation within the same row, it is suited to scenarios in which data is missing over a large range and in a large proportion, thereby achieving data imputation with high reliability and high computational efficiency and solving the problems identified in the background.
The embodiments of the invention are realized by the following technical solution: a method for intelligent imputation of missing data based on a relation graph, comprising the following steps:
generating a variable data set, and preprocessing it by feature digitization and numerical normalization;
establishing a variable relation graph based on the correlation coefficients between the variables;
training a neural network model with the adjacent variables of each variable in the relation graph as input to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to the target variable in the relation graph;
performing intelligent imputation of the missing data with the missing value prediction model, based on an imputation order control strategy that takes into account the missing range and the missing correlation within the same row;
and decoding and restoring the variables.
According to a preferred embodiment, establishing the variable relation graph based on the correlation coefficients between the variables includes:
calculating the correlation matrix between all variables, and binarizing the correlation matrix;
and setting the diagonal elements of the binarized correlation matrix to 0 to obtain an adjacency matrix, and constructing the relation graph from the adjacency matrix.
According to a preferred embodiment, establishing the variable relation graph based on the correlation coefficients between the variables further includes:
optimizing and adjusting the obtained adjacency matrix based on expert experience data.
According to a preferred embodiment, training the neural network model with the adjacent variables of each variable in the relation graph as input to obtain the missing value prediction model includes:
taking the adjacency vector of each variable and computing its Hadamard product with the input tensor row by row, to generate N intermediate tensors of the same dimensions, where N is the number of variables in the input tensor;
performing N rounds of forward propagation with the intermediate tensors as model input to generate N output tensors, wherein no parameter update is performed after each round of forward propagation;
performing one round of forward propagation with the input tensor as input, and updating the process parameters;
setting all elements of each output tensor other than the j-th column to zero, and summing the N output tensors to obtain a final output tensor, where j is the forward propagation round that produced the output tensor;
and performing back propagation based on the deviation between the final output tensor and the input tensor, and repeating until the network converges or the number of training iterations reaches a set value, which completes the training of the missing value prediction model.
According to a preferred embodiment, the imputation order control strategy that takes into account the missing range within the same row is:
performing a sufficiency check on all null values of the current data row, and filling the null values that pass the sufficiency check;
and iterating until no null value passes the sufficiency check.
According to a preferred embodiment, performing intelligent imputation of the missing data with the missing value prediction model includes:
computing the Hadamard product of the current data row and the adjacency vector of the target null value to obtain a masked vector;
taking the masked vector as input, and computing a result with the missing value prediction model;
and extracting the column of the result that corresponds to the target null value as the predicted value, which replaces the target null value.
According to a preferred embodiment, the imputation order control strategy that takes into account the missing correlation within the same row is:
ordering the null values by their missing correlation, wherein the missing correlation is the share of correlation contributed by missing values among all adjacent variables of the current null value, and is expressed as follows:
In the above formula, r_ij is the element in row i and column j of the correlation matrix, L_j is the j-th element of the adjacency vector L, and Z_j is the j-th element of the missing state vector Z; Z_j = 1 if the j-th entry of the data row is null, and Z_j = 0 otherwise;
and filling in order of missing correlation from low to high, with default values substituted for the null values among the adjacent variables as input.
According to a preferred embodiment, after the filling, the method further comprises:
calculating the reliability of the imputed data to form a reliability lookup table, wherein the imputed data is divided into original data and imputed values, and the calculation expression for imputed values is as follows:
In the above formula, ε is a harmonic coefficient, η is the model damage coefficient, which represents the reliability loss caused by model prediction, and λ_j is the reliability of the j-th variable of the current data row.
According to a preferred embodiment, training the neural network model with the adjacent variables of each variable in the relation graph as input to obtain the missing value prediction model further includes:
taking the missing value prediction model as a pre-trained model, and using the pre-trained model to perform intelligent imputation of the missing data;
and calculating the average reliability of each imputed data row, and using the data rows whose average reliability is higher than a preset threshold as new input for secondary training of the pre-trained model, to obtain the final missing value prediction model.
The invention also provides a system for intelligent imputation of missing data based on a relation graph, applied to the method described above and comprising:
a processing module, configured to generate a variable data set and preprocess it by feature digitization and numerical normalization;
a relation graph construction module, configured to establish a variable relation graph based on the correlation coefficients between the variables;
a training module, configured to train a neural network model with the adjacent variables of each variable in the relation graph as input to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to the target variable in the relation graph;
an imputation module, configured to perform intelligent imputation of the missing data with the missing value prediction model, based on an imputation order control strategy that takes into account the missing range and the missing correlation within the same row;
and a decoding module, configured to decode and restore the variables.
The technical solution of the embodiments of the invention has at least the following advantages and beneficial effects. First, it includes a strategy for constructing a relation graph of the variables and a control strategy that uses the relation graph to regulate the model's inputs and outputs; compared with traditional imputation methods, this greatly reduces the number of input variables and the dependence on other variables in the data set, and copes better with data that is missing over a large range and in a large proportion. Second, it includes a unified training strategy for the missing value prediction model: in big data scenarios with many tables and many fields, all missing variables are predicted with the same model, which greatly reduces modeling time and system complexity. Third, it includes an imputation order control strategy that takes into account the missing range and the missing correlation within the same row; by adjusting the imputation order it preserves the authenticity of the data as far as possible and provides an important reliability reference for subsequent work.
Drawings
Fig. 1 is a flow chart of the relation-graph-based method for intelligent imputation of missing data provided in embodiment 1 of the invention;
Fig. 2 is a schematic diagram of the relation graph provided in embodiment 1 of the invention;
Fig. 3 is a schematic diagram of the forward propagation and loss calculation process provided in embodiment 1 of the invention;
Fig. 4 is a schematic flow chart of the intelligent imputation provided in embodiment 1 of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Example 1
This embodiment discloses a method for intelligent imputation of missing data based on a relation graph. Its flow chart is shown in Fig. 1, and the method is carried out in the following steps:
1) Data preprocessing. Because input variables may come in several formats, such as numerical values, characters and times, the variable data set must be preprocessed before modeling, as follows:
1.1) Digitize the character information, using methods including but not limited to label encoding, one-hot encoding, ordinal encoding, frequency encoding and relative time.
One-hot encoding is briefly described below as an example:
For example, the characters include China Southern Airlines, Air China, China Eastern Airlines, Hainan Airlines, Xiamen Airlines, Sichuan Airlines, Shenzhen Airlines, Shandong Airlines, Juneyao Airlines, Spring Airlines, etc. These values are discrete and unordered rather than continuous.
The digitization encodes N states with an N-bit state register. For the example above it proceeds as follows (only nine categories are encoded here, so N = 9):
China Southern Airlines → 100000000
Air China → 010000000
China Eastern Airlines → 001000000
Hainan Airlines → 000100000
Xiamen Airlines → 000010000
Sichuan Airlines → 000001000
Shenzhen Airlines → 000000100
Juneyao Airlines → 000000010
Spring Airlines → 000000001
1.2) Normalize the numerical information, using methods including but not limited to min-max normalization, logarithmic normalization and zero-mean normalization, which are not described in detail here.
After preprocessing is complete, the processing methods and their parameters are stored for subsequent decoding.
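The preprocessing and the later decoding of step 6) can be sketched as a pair of encode/decode helpers. This is a minimal illustration, assuming one-hot encoding for character fields and min-max normalization for numerical fields; the function names, the parameter dictionary and the sample values are hypothetical and do not come from the patent.

```python
import numpy as np

def one_hot_encode(values):
    """Encode discrete, unordered labels as one-hot rows; also return the category list."""
    categories = sorted(set(values))
    index = {c: k for k, c in enumerate(categories)}
    codes = np.zeros((len(values), len(categories)))
    codes[np.arange(len(values)), [index[v] for v in values]] = 1.0
    return codes, categories

def one_hot_decode(codes, categories):
    """Map each one-hot row back to its label using the stored category list."""
    return [categories[k] for k in np.argmax(codes, axis=1)]

def min_max_encode(x):
    """Scale values to [0, 1]; return the parameters needed to undo the scaling."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanmin(x), np.nanmax(x)
    span = hi - lo if hi > lo else 1.0   # guard against constant columns
    return (x - lo) / span, {"lo": lo, "span": span}

def min_max_decode(scaled, params):
    """Invert min_max_encode with the stored parameters."""
    return np.asarray(scaled) * params["span"] + params["lo"]

# Illustrative (made-up) sample values
airlines = ["Air China", "Hainan Airlines", "Air China", "Spring Airlines"]
codes, cats = one_hot_encode(airlines)
scaled, params = min_max_encode([120, 180, 150, 90])
```

Storing `cats` and `params` next to the encoded data is what allows the imputed values to be mapped back to their original form at the end of the pipeline.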
2) Construct a relation graph among all variables in the data set according to statistical correlation and/or business correlation, as follows:
2.1) Construct the relation graph based on statistical correlation, as follows:
2.1.1) Calculate the correlation matrix between all variables. Available measures include but are not limited to the Pearson, Spearman and Kendall correlation coefficients; the Pearson correlation coefficient is calculated as follows:
$$ r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}} $$
In the above formula, $\bar{X}$ is the mean of the variable X_i and $\bar{Y}$ is the mean of the variable Y_i. Note that the Pearson correlation coefficient ranges from -1 to 1: a value of |r| = 1 indicates an exact linear relationship between X_i and Y_i, and r = 0 indicates no linear relationship between them. The coefficient is positive when X_i and Y_i tend to fall on the same side of their respective means, and negative when they tend to fall on opposite sides.
2.1.2) Binarize the correlation matrix. The binarization threshold can be preset or left at a default value, and is not described further here.
2.1.3) Set the diagonal elements of the binarized correlation matrix to 0, so that a variable is not treated as adjacent to itself. The resulting matrix is the adjacency matrix used as the relation graph, which describes the associations between the variables, as shown in Table 1, an example of an adjacency matrix between variables provided by this embodiment:
Table 1. Adjacency matrix between variables
|                      | Number of passengers | Number of bags | Door closing time | Task timeout rate | ... |
| Number of passengers | 0                    | 1              | 1                 | 0                 | ... |
| Number of bags       | 1                    | 0              | 1                 | 0                 | ... |
| Door closing time    | 1                    | 1              | 0                 | 1                 | ... |
| Task timeout rate    | 0                    | 0              | 1                 | 0                 | ... |
| ...                  | ...                  | ...            | ...               | ...               | ... |
The adjacency matrix is symmetric. The adjacency vector of a target variable is the matrix row corresponding to that variable; an element of 1 indicates that the variables of that row and column are associated, and an element of 0 indicates that they are not.
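Steps 2.1.1) to 2.1.3) can be sketched with NumPy as follows. This is an illustration under stated assumptions: the Pearson coefficient is used, the matrix is binarized on the absolute value of the correlation, and the threshold of 0.5 is an arbitrary placeholder; the patent does not specify the threshold or whether the sign is ignored. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def build_adjacency(data, threshold=0.5):
    """data: (rows, variables) array of complete rows. Returns (corr, adjacency)."""
    corr = np.corrcoef(data, rowvar=False)                 # Pearson correlation matrix
    adjacency = (np.abs(corr) >= threshold).astype(int)    # binarize (assumed: on |r|)
    np.fill_diagonal(adjacency, 0)                         # a variable is not its own neighbour
    return corr, adjacency

# Synthetic stand-in data: two strongly related variables and one unrelated variable
rng = np.random.default_rng(0)
passengers = rng.normal(size=200)
bags = passengers + 0.1 * rng.normal(size=200)
unrelated = rng.normal(size=200)
corr, adj = build_adjacency(np.column_stack([passengers, bags, unrelated]))
```

Row `adj[j]` is then the adjacency vector of variable `j` used in the later training and imputation steps.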
2.2) Optimize and adjust the relation graph based on business correlation. Practitioners are more familiar with the business data, so they can capture causal relationships beyond statistical correlation and can also remove redundant associations from the relation graph. Taking the civil aviation data in Fig. 2 as an example, the number of passengers and the number of bags are highly correlated; removing one of them when predicting the door closing time reduces both the amount of computation and the data dependence, and this benefit outweighs the loss in accuracy. Therefore, starting from the generated adjacency matrix and drawing on expert experience data, this embodiment sets the elements of associations to be removed to 0 and the elements of associations to be added to 1. The resulting relation graph is shown in Fig. 2.
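The expert adjustment amounts to overwriting individual entries of the adjacency matrix. A small sketch, assuming that each override is applied symmetrically (the text describes the adjacency matrix as symmetric but does not say how overrides are stored); the function name and the chosen override are hypothetical.

```python
import numpy as np

def adjust_adjacency(adjacency, remove=(), add=()):
    """Return a copy of the adjacency matrix with expert overrides applied."""
    adjusted = np.array(adjacency, dtype=int, copy=True)
    for i, j in remove:                      # associations judged redundant
        adjusted[i, j] = adjusted[j, i] = 0
    for i, j in add:                         # associations known from business experience
        adjusted[i, j] = adjusted[j, i] = 1
    return adjusted

# Table 1 order: passengers, bags, door closing time, task timeout rate
table1 = np.array([[0, 1, 1, 0],
                   [1, 0, 1, 0],
                   [1, 1, 0, 1],
                   [0, 0, 1, 0]])
# Drop the bags / door-closing-time link, keeping passengers as the predictor
adjusted = adjust_adjacency(table1, remove=[(1, 2)])
```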
In summary, step 2) greatly reduces the number of input variables and the dependence on other variables in the data set compared with traditional imputation methods, and copes better with data that is missing over a large range and in a large proportion.
3) Train the neural network model with the adjacent variables of each variable in the relation graph as input to obtain the missing value prediction model, as follows:
3.1) Initialize the model. Set the general parameters for training the neural network model, such as the number of layers, the number of neurons, the activation function, the learning rate, the loss function and the optimizer. In this embodiment, the model requires the input and output dimensions to be equal, both being N.
In addition, the neuron weights must be initialized before forward propagation begins; this is the same as for a conventional feedforward neural network and is not repeated here.
It should be noted that this embodiment uses an improved neural network model. "Improved" here means that, starting from a deep feedforward network, the timing and number of forward and backward propagation passes and the organization of inputs and outputs are modified; it places no restriction on general parameters such as the number of layers, the number of neurons or the activation function.
3.2) Forward propagation. In this embodiment the input tensor is P_{M×N}, a matrix of dimension M×N, where M is the batch size, i.e. the number of data rows in a batch, and N is the number of variables in the data set. Note that the input tensors used during training are complete data rows from the complete data set.
Next, take the adjacency vector of each variable in P_{M×N} and compute its Hadamard product with P_{M×N} row by row, generating N intermediate tensors of dimension M×N. The aim is to produce intermediate tensors in which the non-adjacent variables and the target variable itself are masked out; the adjacent variables are the variables directly connected to the target variable in the relation graph.
The N intermediate tensors are used as model input for N rounds of forward propagation, generating N output tensors. No gradient-based parameter update is performed after each round of forward propagation; only the output is recorded.
Finally, one round of forward propagation is performed with P_{M×N} as input, to update the process parameters used in the subsequent gradient calculation.
3.3) Calculate the loss, as follows:
3.3.1) In each of the N output tensors, set all elements other than the j-th column to zero, where j is the forward propagation round that produced that output tensor, and sum the N output tensors to obtain the final output tensor O_{M×N}. The purpose is to extract the valid column from each of the N output tensors and assemble them into the final output.
3.3.2) Calculate the deviation between O_{M×N} and P_{M×N} with a loss function. Note that P_{M×N} is the correct value of the output, so the deviation between O_{M×N} and P_{M×N} is the loss of the current model. Applicable loss functions include but are not limited to general machine learning losses such as L1 norm loss, mean squared error loss, cross-entropy loss and KL divergence loss, which are not described in detail here.
4) Back propagation. Based on the deviation between O_{M×N} and P_{M×N}, the contribution of each neuron to the loss is calculated and the weights are updated according to the gradients computed by the back propagation algorithm; this is the same as for a conventional feedforward neural network and is not described here. Note that throughout deep neural network training, back propagation takes much more time than forward propagation, so the multiple rounds of forward propagation do not greatly increase the training time. The forward propagation and loss calculation process is shown in Fig. 3.
Steps 2) to 4) are repeated until the network converges or the number of training iterations reaches the set value, which completes the training of the missing value prediction model.
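The masked forward passes, the column-wise assembly of the output and the loss can be sketched in NumPy. This is a deliberately simplified illustration, not the patent's network: it assumes a single linear layer `W` of shape N×N with no bias or activation, a mean squared error loss, and a hand-written gradient; it also omits the extra unmasked forward pass described in step 3.2). All names and the synthetic data are hypothetical.

```python
import numpy as np

def masked_forward(P, A, W):
    """Run N masked forward passes and keep column j of the j-th output."""
    M, N = P.shape
    O = np.zeros((M, N))
    for j in range(N):
        masked = P * A[j]          # Hadamard product of every row with variable j's adjacency vector
        out_j = masked @ W         # j-th output tensor, shape (M, N)
        O[:, j] = out_j[:, j]      # equivalent to zeroing all other columns and summing over j
    return O

def train(P, A, W, lr=0.1, epochs=200):
    """Gradient descent on the MSE between the assembled output O and the input P."""
    M, N = P.shape
    losses = []
    for _ in range(epochs):
        O = masked_forward(P, A, W)
        err = O - P
        losses.append(float(np.mean(err ** 2)))
        grad = np.zeros_like(W)
        for j in range(N):         # column j of W only sees the inputs masked for variable j
            grad[:, j] = 2.0 / (M * N) * (P * A[j]).T @ err[:, j]
        W = W - lr * grad
    return W, losses

# Synthetic complete rows with three mutually related variables
rng = np.random.default_rng(0)
x0 = rng.normal(size=64)
x1 = 0.8 * x0 + 0.2 * rng.normal(size=64)
x2 = -0.5 * x1 + 0.2 * rng.normal(size=64)
P = np.column_stack([x0, x1, x2])
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
W, losses = train(P, A, np.zeros((3, 3)))
```

Because the adjacency vector has a zero at the target variable's own position, the diagonal of `W` never receives a gradient in this sketch, which reflects the requirement that a variable is predicted only from its neighbours.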
In summary, with the model training provided up to step 4), all missing variables are predicted with the same model in big data scenarios with many tables and many fields, which greatly reduces modeling time and system complexity.
5) Perform intelligent imputation of the missing data with the missing value prediction model, based on an imputation order control strategy that takes into account the missing range and the missing correlation within the same row, as follows:
5.1) Set the initial row index m = 1 and screen the null values (see Fig. 4), as follows:
5.1.1) Perform a sufficiency check on every null value in the current data row. The sufficiency check tests whether the adjacent variables of the current null value are all non-null; if they are, the check is passed.
5.1.2) Fill the null values that pass the sufficiency check, as follows:
5.1.2.1) Compute the Hadamard product of the current data row and the adjacency vector of the target null value to obtain a masked vector;
5.1.2.2) Take the masked vector as input and compute a result with the missing value prediction model;
5.1.2.3) Extract the column of the result that corresponds to the target null value as the predicted value, which replaces the target null value.
5.1.3) Iterate until no null value passes the sufficiency check.
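Step 5.1) can be sketched for a single data row as follows. Null values are represented as NaN, and `predict` is a hypothetical stand-in for the trained missing value prediction model: any callable that maps a length-N vector to a length-N vector. Masking uses `np.where` rather than a literal product, because NaN multiplied by 0 is still NaN in floating-point arithmetic.

```python
import numpy as np

def fill_sufficient(row, A, predict):
    """Fill nulls (NaN) whose adjacent variables are all known; repeat until none qualify."""
    row = np.array(row, dtype=float)
    filled = []
    while True:
        # Sufficiency check: every adjacent variable of the null must be non-null
        ready = [j for j in np.flatnonzero(np.isnan(row))
                 if not np.isnan(row[A[j] == 1]).any()]
        if not ready:
            return row, filled
        for j in ready:
            masked = np.where(A[j] == 1, row, 0.0)   # keep only the adjacent variables
            row[j] = predict(masked)[j]              # take the column of the target variable
            filled.append(int(j))

# Adjacency matrix of Table 1 and a dummy model (hypothetical, for illustration only)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
dummy_predict = lambda v: np.full(v.shape, v.sum())

row_a, filled_a = fill_sufficient([100.0, np.nan, 30.0, np.nan], A, dummy_predict)
row_b, filled_b = fill_sufficient([np.nan, np.nan, 30.0, 0.2], A, dummy_predict)
```

In the second row the two nulls are adjacent to each other, so neither passes the sufficiency check and both are left for the ordering strategy of step 5.2).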
5.2) When no null value passes the sufficiency check, this does not mean that all null values have been filled: several null values may depend on one another and therefore cannot be filled by the procedure above. The remaining null values are ordered as follows:
5.2.1) Order the null values by their missing correlation R, which is expressed as follows:
In the above formula, r_ij is the element in row i and column j of the correlation matrix, l_j is the j-th element of the adjacency vector, and z_j is the j-th element of the missing state vector; z_j = 1 if the j-th entry of the data row is null, and z_j = 0 otherwise.
5.2.2) Assign and fill. Fill the null values in order of missing correlation R from low to high, following the filling procedure of step 5.1.2). The process ends once the row index m exceeds the total number of rows M; otherwise m is incremented by 1 and the process returns to step 5.1.1).
Note that before filling, default values are substituted for the null values among the adjacent variables as model input. The default value may be, without limitation, the median, mode or mean of the variable in the data set, and is not described further here.
In the early stage of filling a given row, the input variables contain more default values, but their influence is small because the missing correlation R is low; toward the later stage, the missing correlation is higher but the input variables contain fewer default values. Overall, this maximizes the reliability of the imputed data.
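The ordering of step 5.2) can be sketched as below. The formula image is not reproduced in this text, so the sketch assumes one plausible reading of the verbal definition (the share of correlation contributed by missing values among all adjacent variables): R_i = Σ_j |r_ij| l_j z_j / Σ_j |r_ij| l_j. The use of the absolute value, the choice to sort once per row, and all names and numbers are assumptions; `predict` is again a hypothetical stand-in for the trained model.

```python
import numpy as np

def missing_correlation(i, corr, A, is_null):
    """Assumed form: correlation mass of the null neighbours over that of all neighbours."""
    weights = np.abs(corr[i]) * A[i]
    total = weights.sum()
    return float((weights * is_null).sum() / total) if total > 0 else 0.0

def fill_by_missing_correlation(row, corr, A, defaults, predict):
    """Fill the remaining nulls in ascending order of missing correlation."""
    row = np.array(row, dtype=float)
    is_null = np.isnan(row)
    order = sorted(np.flatnonzero(is_null).tolist(),
                   key=lambda i: missing_correlation(i, corr, A, is_null))
    for i in order:
        inputs = np.where(np.isnan(row), defaults, row)   # defaults replace still-null neighbours
        masked = np.where(A[i] == 1, inputs, 0.0)         # keep only the adjacent variables
        row[i] = predict(masked)[i]
    return row, order

# Made-up correlation matrix, Table 1 adjacency, default values and dummy model
corr = np.array([[1.0, 0.9, 0.6, 0.1],
                 [0.9, 1.0, 0.7, 0.2],
                 [0.6, 0.7, 1.0, 0.8],
                 [0.1, 0.2, 0.8, 1.0]])
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
defaults = np.array([100.0, 120.0, 25.0, 0.1])
dummy_predict = lambda v: np.full(v.shape, v.sum())

row, order = fill_by_missing_correlation([np.nan, np.nan, 30.0, 0.2],
                                         corr, A, defaults, dummy_predict)
```

In this example the second variable has the lower missing correlation, so it is filled first from a default value, and its imputed value then serves as a real input when the first variable is filled.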
Further, calculating the reliability of the interpolation data after each time of performing interpolation on the empty value to form a reliability comparison table, wherein the interpolation data is divided into original data and interpolation values, and the interpolation value calculation expression is as follows:
In the above formula, ε is a harmonic coefficient, η is a model damage coefficient, λ j is the reliability loss caused by model prediction, λ j =1 if the current data line is the reliability of the j-th variable (λ j =μ if the current data line is the original data, μ is the default damage coefficient, μ is the reliability loss caused by using the default value as input, λ j is the calculated value of the formula in the previous step if the value is the value generated by interpolation in the previous step), ε, η and μ are constants, and the default value can be preset or used.
In summary, step 5) adjusts the imputation order so as to preserve the authenticity of the data as far as possible, and provides an important reliability reference for subsequent work.
6) Because the imputed data and the original data are still in the encoded state, this embodiment also decodes and restores the variables using the processing methods and parameters stored in step 1.2), finally forming a new dataset with the imputation applied.
Example 2
To further improve the prediction accuracy of the model, this embodiment differs from Embodiment 1 as follows: the missing value prediction model obtained in step 3) is used as a pre-trained model, which is first used to perform intelligent imputation of the missing data and is then trained a second time;
the average reliability of the imputed data in each row is calculated, and the data rows whose average reliability exceeds a preset threshold are used as new input for secondary training of the pre-trained model, yielding the final missing value prediction model, after which prediction and imputation are performed again.
By combining pre-training with secondary training, the scheme of this embodiment further improves data utilization and adapts better when data are missing on a large scale and few complete data rows are available. This improves the model's tolerance of large-scale missingness and can further improve prediction accuracy compared with the scheme of Embodiment 1.
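The row selection used for secondary training can be sketched as below. The sketch assumes that the average is taken over every cell of a row (original cells counting as reliability 1) and that the comparison is strict; the function name, threshold, and data are hypothetical.

```python
def select_rows_for_retraining(rows, reliabilities, threshold):
    """Keep the rows whose average reliability exceeds the threshold."""
    selected = []
    for row, rel in zip(rows, reliabilities):
        average = sum(rel) / len(rel)
        if average > threshold:
            selected.append(row)
    return selected


# Hypothetical imputed dataset with its reliability comparison table.
rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
reliabilities = [[1.0, 1.0], [1.0, 0.4], [0.9, 0.8]]
selected = select_rows_for_retraining(rows, reliabilities, threshold=0.8)
```

Here the second row is dropped because its average reliability of 0.7 falls below the threshold, while the fully original first row and the mostly reliable third row are kept.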
Example 3
This embodiment of the invention provides an intelligent imputation system for missing data based on a relation graph, which applies the method described in Embodiment 1 or Embodiment 2 and comprises:
a processing module, configured to generate a variable dataset and perform feature numericalization and numerical normalization preprocessing;
a relation graph construction module, configured to build a variable relation graph based on the correlation coefficients between variables;
a training module, configured to train a neural network model using, as input, the adjacent variables of each variable in the relation graph, so as to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to the target variable in the relation graph;
an imputation module, configured to perform intelligent imputation of the missing data using the missing value prediction model, based on an imputation order control strategy that considers the missing range and the missing correlation degree within the same row;
and a decoding module, configured to decode and restore the variables.
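As an illustration of what the relation graph construction module might compute, the following sketch binarizes a correlation matrix and zeroes its diagonal to obtain an adjacency matrix. The use of absolute correlations, the threshold value, and the function name are assumptions made for the example.

```python
def build_adjacency(corr, threshold):
    """Binarize a correlation matrix and set the diagonal to 0.

    Entry (i, j) is 1 when |corr[i][j]| >= threshold and i != j,
    otherwise 0. Row i of the result is the adjacency vector of
    variable i.
    """
    n = len(corr)
    return [
        [1 if i != j and abs(corr[i][j]) >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical correlation matrix for three variables.
corr = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.6], [0.2, 0.6, 1.0]]
A = build_adjacency(corr, threshold=0.5)
```

With this threshold, variable 1 is adjacent to both other variables, while variables 0 and 2 are adjacent only to variable 1.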
The above is merely a preferred embodiment of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. A method for intelligent imputation of missing data based on relation graphs, characterized by comprising the following steps:
generating a variable dataset, and performing feature numericalization and numerical normalization preprocessing;
establishing a variable relation graph based on the correlation coefficients between variables;
training a neural network model using the adjacent variables of each variable in the relation graph as input, to obtain a missing value prediction model, the adjacent variables being the variables directly connected to the target variable in the relation graph;
using the missing value prediction model to perform intelligent imputation of missing data, based on an imputation order control strategy that considers the missing range and the missing correlation degree within the same row;
decoding and restoring the variables;
wherein training the neural network model using the adjacent variables of each variable in the relation graph as input, to obtain the missing value prediction model, comprises:
taking the adjacency vector of each variable and computing its Hadamard product with the input tensor row by row, to generate N intermediate tensors of the same dimensions, N being the number of variables in the input tensor;
performing N rounds of forward propagation with the intermediate tensors as model input, to generate N output tensors, wherein no parameter update is performed after any of these rounds;
performing one round of forward propagation with the input tensor as input, and updating the process parameters;
setting to zero all elements of each output tensor other than those in its j-th column, and summing the N output tensors to obtain the final output tensor, j being the forward propagation round of that output tensor;
performing backpropagation based on the deviation between the final output tensor and the input tensor, and repeating the above until the network converges or the number of training iterations reaches a set value, thereby completing the training of the missing value prediction model.

2. The method for intelligent imputation of missing data based on relation graphs according to claim 1, characterized in that establishing the variable relation graph based on the correlation coefficients between variables comprises:
calculating the correlation matrix of all variables, and binarizing the correlation matrix;
setting the diagonal elements of the binarized correlation matrix to 0 to obtain an adjacency matrix, and constructing the relation graph based on the adjacency matrix.

3. The method for intelligent imputation of missing data based on relation graphs according to claim 2, characterized in that establishing the variable relation graph based on the correlation coefficients between variables further comprises:
optimizing and adjusting the obtained adjacency matrix based on expert experience data.

4. The method for intelligent imputation of missing data based on relation graphs according to claim 1, characterized in that the imputation order control strategy considering the missing range within the same row is:
performing sufficiency verification on all null values in the current data row, and filling the null values that meet the sufficiency verification requirements;
iterating until no null value meets the sufficiency verification requirements.

5. The method for intelligent imputation of missing data based on relation graphs according to claim 1, characterized in that using the missing value prediction model to perform intelligent imputation of missing data comprises:
computing the Hadamard product of the current data row and the adjacency vector of the target null value, to obtain a masked vector;
using the masked vector as input, and calculating a result with the missing value prediction model;
extracting, from the calculated result, the column corresponding to the target null value as the predicted value, and replacing the target null value with it.

6. The method for intelligent imputation of missing data based on relation graphs according to claim 1, characterized in that the imputation order control strategy considering the missing correlation degree within the same row is:
sorting the null values according to their missing correlation degree, the missing correlation degree being the share of correlation accounted for by missing values among all adjacent variables of the current null value, with the following expression:
in the above formula, $r_{ij}$ denotes the element in row i and column j of the correlation matrix, $l_j$ is the j-th element of the adjacency vector, and $z_j$ is the j-th element of the missing-state vector; if the j-th entry of the data row is null, $z_j = 1$, otherwise $z_j = 0$;
using default values in place of the null values in the adjacent variables as input, and performing filling in ascending order of missing correlation degree.

7. The method for intelligent imputation of missing data based on relation graphs according to claim 6, characterized in that, after the filling is performed, the method further comprises:
calculating the reliability of the imputed data to form a reliability comparison table, wherein the imputed data comprise original data and imputed values, and the imputed value calculation expression is as follows:
in the above formula, $\varepsilon$ is the harmonic coefficient, $\eta$ is the model loss coefficient, representing the reliability loss caused by using model prediction, and $\lambda_j$ denotes the reliability of the j-th variable in the current data row.

8. The method for intelligent imputation of missing data based on relation graphs according to claim 1, characterized in that training the neural network model using the adjacent variables of each variable in the relation graph as input, to obtain the missing value prediction model, further comprises:
using the missing value prediction model as a pre-trained model, and using the pre-trained model to perform intelligent imputation of missing data;
calculating the average reliability of the imputed data in each row, and using the data rows whose average reliability exceeds a preset threshold as new input for secondary training of the pre-trained model, to obtain the final missing value prediction model.

9. A system for intelligent imputation of missing data based on relation graphs, applying the method according to any one of claims 1 to 8, characterized by comprising:
a processing module, configured to generate a variable dataset and perform feature numericalization and numerical normalization preprocessing;
a relation graph construction module, configured to establish a variable relation graph based on the correlation coefficients between variables;
a training module, configured to train a neural network model using the adjacent variables of each variable in the relation graph as input, to obtain a missing value prediction model, the adjacent variables being the variables directly connected to the target variable in the relation graph;
an imputation module, configured to use the missing value prediction model to perform intelligent imputation of missing data, based on an imputation order control strategy that considers the missing range and the missing correlation degree within the same row;
a decoding module, configured to decode and restore the variables.
CN202310146169.7A 2023-02-21 2023-02-21 A method and system for intelligent imputation of missing data based on relation graphs Active CN116303386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310146169.7A CN116303386B (en) 2023-02-21 2023-02-21 A method and system for intelligent imputation of missing data based on relation graphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310146169.7A CN116303386B (en) 2023-02-21 2023-02-21 A method and system for intelligent imputation of missing data based on relation graphs

Publications (2)

Publication Number Publication Date
CN116303386A CN116303386A (en) 2023-06-23
CN116303386B CN116303386B (en) 2026-01-02

Family

ID=86837083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310146169.7A Active CN116303386B (en) 2023-02-21 2023-02-21 A method and system for intelligent imputation of missing data based on relation graphs

Country Status (1)

Country Link
CN (1) CN116303386B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437086A (en) * 2023-12-20 2024-01-23 中国电建集团贵阳勘测设计研究院有限公司 A method and system for interpolating solar resource missing data based on deep learning

Citations (1)

Publication number Priority date Publication date Assignee Title
CN113569062A (en) * 2021-09-26 2021-10-29 深圳索信达数据技术有限公司 Knowledge graph completion method and system

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN107273445A (en) * 2017-05-26 2017-10-20 电子科技大学 The apparatus and method that missing data mixes multiple interpolation in a kind of big data analysis
EP4094194A1 (en) * 2020-01-23 2022-11-30 Umnai Limited An explainable neural net architecture for multidimensional data
CN113254669B (en) * 2021-06-15 2021-10-19 广东电网有限责任公司湛江供电局 Knowledge graph-based power distribution network CIM model information completion method and system

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN113569062A (en) * 2021-09-26 2021-10-29 深圳索信达数据技术有限公司 Knowledge graph completion method and system

Also Published As

Publication number Publication date
CN116303386A (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant