CN116303386B

CN116303386B - A method and system for intelligent imputation of missing data based on relation graphs

Info

Publication number: CN116303386B
Application number: CN202310146169.7A
Authority: CN
Inventors: 廖伟; 夏欢; 陈肇欣; 潘野; 张涛; 郑奕; 薛方冉; 陈哲; 晏楠欣
Original assignee: Second Research Institute of CAAC
Current assignee: Second Research Institute of CAAC
Priority date: 2023-02-21
Filing date: 2023-02-21
Publication date: 2026-01-02
Anticipated expiration: 2043-02-21
Also published as: CN116303386A

Abstract

This invention relates to the field of information processing technology, specifically to an intelligent imputation method and system for missing data based on relational graphs. Generally, it is based on the idea of regression imputation, introducing a relational graph of the data as the input control strategy for the missing value prediction model; it employs an improved neural network model, allowing the same model to be used for the prediction of missing values for multiple variables; and for scenarios with "large-scale and large-proportion" missing data, it constructs a highly reliable imputation order control strategy and a secondary imputation strategy. In summary, this invention reduces the complexity of the imputation system while improving the computational efficiency of the imputation process.

Description

Intelligent interpolation method and system for missing data based on relational graph

Technical Field

The invention relates to the technical field of information processing, in particular to a method and a system for intelligent interpolation of missing data based on a relational graph.

Background

With the wide application of machine learning and digital twin technology, software systems depend far more heavily on data, which places higher requirements on the completeness and credibility of the input data. However, because of defects in acquisition and storage, raw data are frequently missing, and imputing the missing data is a problem the engineering field has to face.

The prior art mainly comprises the hot-deck imputation method, the regression imputation method and the multiple imputation method, wherein:

The hot-deck imputation method finds, in the complete data, the object most similar to the object containing the missing value and uses its value as the fill value; when more than one similar object is found, one of the matching objects is selected at random. The method is conceptually simple and uses the relationships between data to estimate null values, but its disadvantage is that the similarity criterion is difficult to define precisely and is strongly influenced by subjective factors.

The multiple imputation method assumes that the missing values are randomly distributed. A multiple imputation algorithm such as MICE first estimates the values to be imputed by regression imputation, then adds simulated noise to form several candidate sets of imputed values, and finally compares the resulting data sets with the original data set and selects the one whose distribution deviates least from the original as the final result. Multiple imputation can handle only random missingness, not non-random missingness, and it requires a large amount of computation.

The regression imputation method uses supervised machine learning models, such as regression, nearest neighbor, random forest and support vector machine, to build a prediction model on the complete data set, and substitutes the known attributes into the model to predict the missing attributes.

Specifically, the regression imputation method builds a missing value prediction model for each variable. In big-data scenarios with 'many tables and many fields', modeling each variable separately consumes a great deal of resources and greatly increases system complexity. In addition, during both model training and actual prediction, the regression imputation method takes all variables other than the target variable as inputs, which consumes a great deal of computing power and time and also creates a dependence on those variables.

Disclosure of Invention

The invention aims to provide a one-stop data interpolation technology. Based on the idea of regression interpolation, it introduces a relation graph between the data as input for training a missing value prediction model, so that the same model can predict the missing values of a plurality of variables. By constructing an interpolation sequence control strategy that considers the missing range and the missing correlation in the same row, it is also suitable for 'large-scale, large-proportion' data missing scenarios, thereby achieving data interpolation with high reliability and high computational efficiency and solving the problems pointed out in the background.

The embodiment of the invention is realized by the following technical scheme that the intelligent interpolation method for missing data based on the relation map comprises the following steps:

Generating a variable data set, and performing feature numeralization and numerical normalization pretreatment;

Based on the correlation coefficient between the variables, establishing a variable relation graph;

Training a neural network model by taking adjacent variables of all variables in the relation graph as input to obtain a missing value prediction model, wherein the adjacent variables are variables which are directly connected with a target variable in the relation graph;

Based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row, the intelligent interpolation of the missing data is realized by using the missing value prediction model;

and decoding and restoring the variable.

According to a preferred embodiment, the establishing a variable relation map based on the correlation coefficient between the variables includes:

calculating a correlation matrix among all variables, and performing binarization processing on the correlation matrix;

and setting the diagonal elements of the correlation matrix subjected to binarization processing to 0 to obtain an adjacent matrix, and constructing a relationship map based on the adjacent matrix.

According to a preferred embodiment, the establishing a variable relation map based on the correlation coefficient between the variables further includes:

based on the obtained adjacency matrix, the adjacency matrix is optimally adjusted based on expert experience data.

According to a preferred embodiment, the training the neural network model by using the adjacent variable of each variable in the relation map as an input to obtain a missing value prediction model includes:

taking the adjacent vectors of the variables, carrying out hadamard product on the adjacent vectors and the input tensors row by row to generate N intermediate tensors with the same dimension, wherein N is the number of the variables in the input tensors;

performing N rounds of forward propagation by taking the intermediate tensor as model input to generate N output tensors, wherein parameter updating is not performed after each round of forward propagation;

performing one round of forward propagation by taking the input tensor as input, and updating a process parameter;

setting the other elements except for the j-th column in the output tensor to zero, and summing N output tensors to obtain a final output tensor, wherein j is the forward propagation round number of the output tensor;

And carrying out back propagation based on the deviation of the final output tensor and the input tensor, repeating until the network converges or the training times reach a set value, and completing the training of the missing value prediction model.

According to a preferred embodiment, the interpolation sequence control strategy based on considering the missing ranges in the same row is:

Performing sufficiency verification on all null values of the current data line, and performing filling on the null values meeting the sufficiency verification requirement;

iterating in a loop until no null value satisfies the sufficiency verification requirement.

According to a preferred embodiment, the intelligent interpolation of missing data using the missing value prediction model includes:

Performing a Hadamard product of the current data line and the adjacency vector of the target null value to obtain a masked vector;

taking the vector as input, and calculating a result through a missing value prediction model;

and extracting a column corresponding to the target null value in the calculation result as a predicted value to replace the target null value.

According to a preferred embodiment, the interpolation sequence control strategy based on considering the missing correlation in the same row is:

the null values are ordered according to their missing correlation, wherein the missing correlation is the proportion of correlation accounted for by missing values among all adjacent variables of the current null value. In its expression, r_ij represents the element in the i-th row and j-th column of the correlation matrix, L_j is the j-th element of the adjacency vector L, and Z_j is the j-th element of the missing state vector Z: if the j-th entry of the data line is null, Z_j = 1, otherwise Z_j = 0;

filling is performed in order of missing correlation from low to high, with default values substituted for the null values of the adjacent variables in the model input.

According to a preferred embodiment, after the filling, the method further comprises:

calculating the reliability of the interpolation data to form a reliability comparison table, wherein the interpolation data is divided into original data and interpolated values, and the reliability of an interpolated value is calculated by an expression in which ε is a harmonic coefficient, η is a model damage coefficient representing the reliability loss caused by model prediction, and λ_j represents the reliability of the j-th variable of the current data line.

According to a preferred embodiment, the training the neural network model by using the adjacent variable of each variable in the relation map as an input to obtain a missing value prediction model further includes:

Taking the missing value prediction model as a pre-training model, and using the pre-training model to realize intelligent interpolation of missing data;

And calculating the average credibility of each row of interpolation data, and taking the data row with the average credibility higher than a preset threshold value as a new input to perform secondary training on the pre-training model to obtain a final missing value prediction model.

The invention also provides a missing data intelligent interpolation system based on the relation map, which is applied to the method described above and comprises:

the processing module is used for generating a variable data set and carrying out feature numeralization and numerical normalization preprocessing;

The relation map construction module is used for building a variable relation map based on the correlation coefficient between the variables;

The training module is used for training the neural network model by taking adjacent variables of all variables in the relation graph as input so as to obtain a missing value prediction model, wherein the adjacent variables are variables which are directly connected with the target variables in the relation graph;

The interpolation module is used for realizing intelligent interpolation of the missing data by using the missing value prediction model based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row;

And the decoding module is used for decoding and restoring the variable.

The technical scheme of the embodiment of the invention has the following advantages and beneficial effects. First, it comprises a strategy for constructing the relation map of the variables and a control strategy that uses the relation map to regulate the model's inputs and outputs; compared with traditional interpolation methods, this greatly reduces the number of input variables and the dependence on other variables in the data set, and makes the method more compatible with 'large-scale, large-proportion' missing data. Second, it comprises a unified training strategy for the missing value prediction model: in big-data scenarios with 'many tables and many fields', all missing variables are predicted with the same model, which greatly reduces modeling time and system complexity. Third, it comprises an interpolation sequence control strategy that considers the missing range and the missing correlation in the same row; by adjusting the interpolation order it preserves the authenticity of the data as far as possible and provides an important reliability reference for subsequent work.

Drawings

Fig. 1 is a flow chart of a relation graph-based intelligent interpolation method for missing data provided in embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a relationship diagram according to embodiment 1 of the present invention;

FIG. 3 is a schematic diagram of the forward propagation and loss calculation process according to embodiment 1 of the present invention;

Fig. 4 is a schematic flow chart of intelligent interpolation provided in embodiment 1 of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Example 1

The invention discloses a missing data intelligent interpolation method based on a relation map, wherein a flow chart is shown in fig. 1, and the method is implemented according to the following steps:

1) Data preprocessing. Because the input variables may come in several formats, such as numerical values, characters and times, the variable data set needs to be preprocessed before modeling. The specific steps are as follows:

1.1 Digitizing the character information, by means including but not limited to label encoding, one-hot encoding, ordinal encoding, frequency encoding, relative time, etc.

The following is a brief description taking one-hot encoding as an example:

For example, the characters include China Southern Airlines, Air China, China Eastern Airlines, Hainan Airlines, Xiamen Airlines, Sichuan Airlines, Shenzhen Airlines, Shandong Airlines, Juneyao Airlines, Spring Airlines, etc. These values are discrete and unordered rather than continuous.

One-hot encoding represents N states with an N-bit state register in which exactly one bit is set for each state. For the above example the encoding is as follows (only nine categories are encoded here, so N = 9):

China Southern Airlines → 100000000
Air China → 010000000
China Eastern Airlines → 001000000
Hainan Airlines → 000100000
Xiamen Airlines → 000010000
Sichuan Airlines → 000001000
Shenzhen Airlines → 000000100
Juneyao Airlines → 000000010
Spring Airlines → 000000001

1.2 Normalizing the numerical information, by means including but not limited to min-max (dispersion) normalization, logarithmic normalization, zero-mean (z-score) normalization, etc., which are not described in detail here.

After the preprocessing is completed, the processing methods and parameters are stored for subsequent decoding.
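The preprocessing of step 1) can be sketched as follows. This is a minimal illustration, assuming one-hot encoding for character fields and min-max normalization for numeric fields; the function names, the sample values and the parameter dictionary are hypothetical and are not taken from the patent.

```python
import numpy as np

def one_hot_encode(values):
    """Return (codes, categories); the category list is kept for later decoding."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    codes = np.zeros((len(values), len(categories)))
    codes[np.arange(len(values)), [index[v] for v in values]] = 1.0
    return codes, categories

def min_max_normalize(x):
    """Scale a numeric column to [0, 1]; return the parameters needed to undo it."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    span = hi - lo if hi > lo else 1.0  # avoid division by zero for constant columns
    return (x - lo) / span, {"min": lo, "span": span}

# Hypothetical sample data
airlines = ["Air China", "Hainan Airlines", "Air China", "Spring Airlines"]
codes, categories = one_hot_encode(airlines)
passengers, params = min_max_normalize([120, 180, 150, 90])
```

Keeping `categories` and `params` alongside the encoded data is what makes the decoding of step 6) possible.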

2) According to the statistical correlation and/or the business correlation, a relation map among all variables in the data set is constructed, and the specific steps are as follows:

2.1 The specific steps of constructing a relationship map based on statistical correlation are as follows:

2.1.1 Calculating the correlation matrix between all variables. The correlation measure includes, but is not limited to, the Pearson correlation coefficient, the Spearman correlation coefficient, the Kendall correlation coefficient, etc., wherein the Pearson correlation coefficient is calculated as follows:

$$ r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}} $$

In the above formula, $\bar{X}$ represents the mean of the variable X_i and $\bar{Y}$ represents the mean of the variable Y_i. It should be noted that the Pearson correlation coefficient ranges from -1 to 1: a coefficient r of 1 or -1 indicates a perfect linear relationship between X_i and Y_i, and a coefficient r of 0 indicates no linear relationship. Specifically, the correlation coefficient is positive when X_i and Y_i tend to fall on the same side of their respective means, and negative when they tend to fall on opposite sides.
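As a small numerical check of the Pearson coefficient described above, the sketch below computes it directly from deviations about the mean. The function name and the sample values are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical sample: passenger counts and baggage counts for five flights
passengers = [120, 180, 150, 90, 160]
baggage = [100, 170, 140, 80, 150]
r = pearson(passengers, baggage)
```

The result should agree with `np.corrcoef(passengers, baggage)[0, 1]`, which computes the same quantity.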

2.1.2 Binarizing the correlation matrix. The binarization threshold can be preset or a default value can be used; this is not described further here.

2.1.3 Setting the diagonal elements of the binarized correlation matrix to 0, so that a variable is not considered adjacent to itself. The resulting matrix is the adjacency matrix used as the relation map, which describes the association relationships between the variables, as shown in Table 1, an example of the adjacency matrix between variables provided by the embodiment of the invention:

TABLE 1 adjacency matrix between variables

	Number of passengers	Luggage number	Door closing time	Task timeout rate	...
Number of passengers	0	1	1	0	...
Luggage number	1	0	1	0	...
Door closing time	1	1	0	1	...
Task timeout rate	0	0	1	0	...
...	...	...	...	...	...

The adjacency matrix is a symmetric matrix, and the adjacency vector of a target variable is the matrix row corresponding to that variable. An element of 1 indicates that the two variables corresponding to the row and the column are associated, and an element of 0 indicates that they are not.
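Steps 2.1.1 to 2.1.3 can be sketched as below. The threshold of 0.5 and the use of the absolute value of the correlation are assumptions made for illustration, since the text leaves the binarization rule open; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def build_adjacency(data, threshold=0.5):
    """Correlation matrix -> binarized matrix -> zero diagonal."""
    corr = np.corrcoef(data, rowvar=False)          # columns are variables
    adj = (np.abs(corr) >= threshold).astype(int)   # assumed binarization rule
    np.fill_diagonal(adj, 0)                        # a variable is not its own neighbour
    return corr, adj

# Synthetic data: b depends strongly on a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)
corr, adj = build_adjacency(np.column_stack([a, b, c]))
```

Row `adj[j]` is then the adjacency vector of variable `j` in the sense used by the later steps.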

2.2 Optimizing and adjusting the relation map based on business correlation. Practitioners are more familiar with the business data, so they can capture causal relationships beyond statistical correlation and can also remove homogeneous associations from the relation map. Taking the civil aviation scenario data of fig. 2 as an example, the number of passengers and the number of baggage items are highly correlated; removing one of them when predicting the door closing time reduces the amount of calculation and the data dependence, and the benefit outweighs the loss in precision. Therefore, starting from the generated adjacency matrix and based on expert experience data, the element corresponding to each association to be removed is set to 0 and the element corresponding to each association to be added is set to 1. The resulting relation map is shown in fig. 2.
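The expert adjustment of step 2.2 amounts to overwriting individual entries of the adjacency matrix. The sketch below uses the matrix of Table 1 and assumes that edits are applied symmetrically so the matrix stays symmetric; the function name and the specific edit are hypothetical.

```python
import numpy as np

def apply_expert_edits(adj, remove=(), add=()):
    """Return a copy of adj with the listed (i, j) associations removed or added."""
    out = adj.copy()
    for i, j in remove:
        out[i, j] = out[j, i] = 0   # assumed: edits are mirrored to keep symmetry
    for i, j in add:
        out[i, j] = out[j, i] = 1
    np.fill_diagonal(out, 0)
    return out

# Table 1: passengers, luggage, door closing time, task timeout rate
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
# Drop the luggage / door-closing-time association, as in the example of fig. 2
adjusted = apply_expert_edits(adj, remove=[(2, 1)])
```

Whether an expert edit should affect both directions of an association is a design choice the text does not settle; a one-directional edit would simply set `out[i, j]` alone.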

In summary, through step 2), the number of input variables and the dependence on other variables in the data set are greatly reduced on the basis of the traditional interpolation method, and the method has stronger compatibility for the situations of 'large scale and large proportion' of data.

3) Training the neural network model by taking adjacent variables of all variables in the relation map as input to obtain a missing value prediction model, wherein the method comprises the following specific steps of:

3.1 The model is initialized, and general parameters of the training neural network model such as the number of layers of the neural network, the number of neurons, an activation function, a learning rate, a loss function, an optimizer and the like are set, wherein in the embodiment, the model requires that the input dimension and the output dimension are the same and are N.

In addition, before the forward propagation starts, the neuron weights need to be initialized, and the process is the same as that of the traditional feedforward neural network, and is not repeated here.

It should be noted that, the embodiment of the present invention adopts an improved neural network model, where the improved neural network model refers to the improvement of the timing and the number of forward propagation and backward propagation and the organization of input and output based on the deep feed forward network, and does not limit the network layer number, the neuron number, the activation function and other general parameters of the neural network.

3.2 Forward propagation. In this embodiment the input tensor is P_{M×N}, a matrix of dimension M×N, where M is the batch size, i.e. the number of data lines in the batch input, and N is the number of variables in the data set. Note that the input tensors used in training are complete data lines taken from the complete data set.

Further, the adjacency vector of each variable in P_{M×N} is taken, and its Hadamard product with P_{M×N} is computed row by row, generating N intermediate tensors of dimension M×N. The purpose is to obtain, for each target variable, an intermediate tensor in which the non-adjacent variables and the target variable itself are masked out; the adjacent variables are the variables directly connected to the target variable in the relation map.

The N intermediate tensors are used as model inputs for N rounds of forward propagation, generating N output tensors. No gradient-based parameter update is performed after each of these rounds; only the outputs are recorded.

Finally, P_{M×N} is used as input for one more round of forward propagation, which updates the process parameters used in the subsequent gradient calculation.
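The construction of the N intermediate tensors can be sketched with NumPy broadcasting. The function name and the toy values are hypothetical; the only assumption is that the adjacency matrix has a zero diagonal, so the target variable is masked together with the non-adjacent variables.

```python
import numpy as np

def masked_inputs(P, adj):
    """Return an (N, M, N) array whose j-th slice is P with every row
    multiplied element-wise by the adjacency vector of variable j."""
    return adj[:, None, :] * P[None, :, :]

P = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # M = 2 rows, N = 3 variables
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
X = masked_inputs(P, adj)
```

Here `X[0]` keeps only columns 1 and 2 (the neighbours of variable 0), while `X[1]` and `X[2]` keep only column 0.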

3.3 Calculating loss, wherein the specific steps are as follows:

3.3.1 In the j-th output tensor, setting all elements other than those in the j-th column to zero, and summing the N output tensors to obtain the final output tensor O_{M×N}. The purpose is to extract the valid column from each of the N output tensors to form the final output, where j is the index of the forward propagation round that produced the output tensor.

3.3.2 Calculating the deviation between O_{M×N} and P_{M×N} with a loss function. Note that P_{M×N} is the correct value of the output, so the deviation between O_{M×N} and P_{M×N} is the loss of the current model. The loss function includes, but is not limited to, general machine learning loss functions such as L1 norm loss, mean square error loss, cross entropy loss and KL divergence loss, which are not described in detail here.
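Step 3.3 can be sketched as follows: column j is kept from the j-th output tensor, the results are summed, and the loss is taken against the input tensor. Mean square error is used here as one of the loss functions the text allows; the function names and the toy arrays are hypothetical.

```python
import numpy as np

def assemble_output(outputs):
    """outputs has shape (N, M, N); keep column j of the j-th output tensor,
    zero everything else, and sum over the N tensors."""
    n = outputs.shape[0]
    keep = np.eye(n)[:, None, :]              # keep[j, 0, k] is 1 only when k == j
    return (outputs * keep).sum(axis=0)       # shape (M, N)

def mse_loss(O, P):
    """Mean square error between the final output and the input tensor."""
    return float(np.mean((O - P) ** 2))

# Toy example with N = 2 output tensors of shape (3, 2)
outputs = np.arange(12, dtype=float).reshape(2, 3, 2)
O = assemble_output(outputs)
```

In this toy case column 0 of `O` comes from `outputs[0]` and column 1 from `outputs[1]`.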

4) Back propagation. Based on the deviation between O_{M×N} and P_{M×N}, the contribution of each neuron to the loss is calculated and the weights are updated according to the gradients computed by the back propagation algorithm; this is the same as in a conventional feedforward neural network and is not repeated here. Note that in deep neural network training, back propagation takes much more time than forward propagation, so the multiple rounds of forward propagation do not greatly increase the training time. The forward propagation and loss calculation procedure is shown in fig. 3.

Steps 2) to 4) are repeated until the network converges or the number of training iterations reaches the set value, completing the training of the missing value prediction model.
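The training loop can be illustrated with a deliberately simplified model. The sketch below replaces the deep feedforward network with a single linear layer and writes the gradient by hand, and it omits the additional unmasked forward pass; these are simplifications chosen for brevity, not the patent's network. All names and the synthetic data are hypothetical.

```python
import numpy as np

def predict_columns(P, adj, W, b):
    """Column j of the result is computed from P masked by adjacency vector j."""
    X = adj[:, None, :] * P[None, :, :]                       # (N, M, N) masked inputs
    O = np.stack([X[j] @ W[:, j] + b[j] for j in range(P.shape[1])], axis=1)
    return X, O

def train(P, adj, epochs=500, lr=0.1):
    """Gradient descent on the squared deviation between O and P (linear stand-in)."""
    M, N = P.shape
    W, b = np.zeros((N, N)), np.zeros(N)
    for _ in range(epochs):
        X, O = predict_columns(P, adj, W, b)
        err = O - P                                           # deviation from the input tensor
        for j in range(N):
            W[:, j] -= lr * (2.0 / M) * (X[j].T @ err[:, j])
            b[j] -= lr * 2.0 * err[:, j].mean()
    return W, b

# Synthetic complete data: column 1 is a linear function of column 0
rng = np.random.default_rng(0)
a = rng.uniform(size=64)
P = np.column_stack([a, 0.5 * a + 0.2, rng.uniform(size=64)])
adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
W, b = train(P, adj)
_, O = predict_columns(P, adj, W, b)
loss_before = float(np.mean(P ** 2))        # loss with all-zero parameters
loss_after = float(np.mean((O - P) ** 2))
```

The point of the sketch is that one parameter set `(W, b)` serves every target variable, with the adjacency vector deciding which inputs each prediction may see.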

In summary, with the model obtained through step 4), the same model is used to predict all missing variables in big-data scenarios with 'many tables and many fields', which greatly reduces modeling time and system complexity.

5) Based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row, the intelligent interpolation of the missing data is realized by using the missing value prediction model. The specific steps are as follows:

5.1 Initializing the line number m = 1 and performing null value screening (see fig. 4), specifically as follows:

5.1.1 And carrying out sufficiency verification on all null values of the current data line, wherein the sufficiency verification refers to whether adjacent variables corresponding to the current null values are all non-null, and if the adjacent variables are all non-null, the sufficiency verification is satisfied.

5.1.2 Filling null values meeting sufficiency verification requirements, specifically as follows:

5.1.2.1 Performing a Hadamard product of the current data line and the adjacency vector of the target null value to obtain a masked vector;

5.1.2.2 Taking the masked vector as input and calculating the result through the missing value prediction model;

5.1.2.3 Extracting the column of the calculation result corresponding to the target null value as the predicted value, which replaces the target null value.

5.1.3 Iterating in a loop until no null value satisfies the sufficiency verification requirement.
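Step 5.1 can be sketched as a loop over one data line, with null values represented as NaN. The `predict` argument stands in for the trained missing value prediction model; the stub used in the example is purely hypothetical, as are the function name and the sample rows. `np.where` is used for the masking because multiplying NaN by 0 would still give NaN.

```python
import numpy as np

def fill_sufficient(row, adj, predict):
    """Fill every null whose adjacent variables are all non-null; repeat
    until no remaining null passes the sufficiency check."""
    row = row.copy()
    progress = True
    while progress:
        progress = False
        for j in np.flatnonzero(np.isnan(row)):
            neighbours = adj[j].astype(bool)
            if not np.isnan(row[neighbours]).any():       # sufficiency verification
                masked = np.where(neighbours, row, 0.0)    # masked input vector
                row[j] = predict(masked)[j]                # take the target column
                progress = True
    return row

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
stub_model = lambda x: np.full(x.shape, x.sum())           # hypothetical stand-in model
filled = fill_sufficient(np.array([0.2, np.nan, 0.4, np.nan]), adj, stub_model)
stuck = fill_sufficient(np.array([np.nan, np.nan, 0.4, 0.1]), adj, stub_model)
```

In the second call the two nulls are adjacent to each other, so neither passes the check and both stay null, which is the situation step 5.2 addresses.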

5.2 When no remaining null value satisfies the sufficiency verification, this does not mean that all null values have been filled: several null values may depend on one another and therefore cannot be filled in this way. These null values are sorted as follows:

5.2.1 Sorting according to the missing correlation R of each current null value. In the expression of the missing correlation, r_ij represents the element in the i-th row and j-th column of the correlation matrix, l_j represents the j-th element of the adjacency vector, and z_j represents the j-th element of the missing state vector: if the j-th entry of the data line is null, z_j = 1, otherwise z_j = 0.
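The expression for the missing correlation is not reproduced in this text. The sketch below therefore assumes one plausible reading of its description, namely the correlation weight of the missing adjacent variables divided by the correlation weight of all adjacent variables, R_i = Σ_j |r_ij| l_j z_j / Σ_j |r_ij| l_j. Both the use of the absolute value and the exact form are assumptions, and the function name and sample values are hypothetical.

```python
import numpy as np

def missing_correlation(i, corr, adj, row):
    """Assumed form: share of |correlation| with adjacent variables that is
    attributable to adjacent variables whose value is missing in this row."""
    z = np.isnan(row).astype(float)          # missing state vector
    w = np.abs(corr[i]) * adj[i]             # correlation restricted to neighbours
    total = w.sum()
    return float((w * z).sum() / total) if total > 0 else 0.0

corr = np.array([[1.0, 0.8, 0.6],
                 [0.8, 1.0, 0.7],
                 [0.6, 0.7, 1.0]])
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]])
row = np.array([np.nan, np.nan, 0.3])
R0 = missing_correlation(0, corr, adj, row)
R1 = missing_correlation(1, corr, adj, row)
```

Under this reading, variable 1 has the lower missing correlation in the sample row and would be filled first.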

5.2.2 Assigning and filling. The null values are filled in order of missing correlation R from low to high, following the filling flow of step 5.1.2). After the row is filled, m is increased by 1 and the procedure returns to step 5.1.1), until the line number m exceeds the total number of lines M.

It should be noted that, before filling, default values are used as model input in place of the null values in the adjacent variables. The default values include, but are not limited to, the median, mode or mean of the variable in the data set, and are not described further here.

In the early stage of filling a row, the inputs contain more default values, but their influence is small because the missing correlation R is low. The later the stage, the fewer default values the inputs contain and the higher the missing correlation. In this way the reliability of the interpolated data as a whole is improved as far as possible.
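The ordered filling with default values can be sketched as below. The missing correlation uses the same assumed form as a ratio of correlation weights, the order is computed once before filling (recomputing it after each fill would be an equally reasonable reading), and `predict` is a hypothetical stand-in for the trained model.

```python
import numpy as np

def missing_correlation(i, corr, adj, row):
    """Assumed form: weight of missing neighbours over weight of all neighbours."""
    z = np.isnan(row).astype(float)
    w = np.abs(corr[i]) * adj[i]
    return float((w * z).sum() / w.sum()) if w.sum() > 0 else 0.0

def fill_in_order(row, corr, adj, defaults, predict):
    """Fill nulls from low to high missing correlation; nulls among the
    inputs are temporarily replaced by per-variable default values."""
    row = row.copy()
    nulls = np.flatnonzero(np.isnan(row))
    order = sorted(nulls, key=lambda i: missing_correlation(i, corr, adj, row))
    for target in order:
        inputs = np.where(np.isnan(row), defaults, row)    # defaults stand in for nulls
        masked = np.where(adj[target] == 1, inputs, 0.0)   # keep adjacent variables only
        row[target] = predict(masked)[target]
    return row, [int(i) for i in order]

corr = np.array([[1.0, 0.8, 0.6], [0.8, 1.0, 0.7], [0.6, 0.7, 1.0]])
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
defaults = np.array([0.5, 0.5, 0.5])                       # e.g. column means
stub_model = lambda x: np.full(x.shape, x.sum())           # hypothetical stand-in model
filled, order = fill_in_order(np.array([np.nan, np.nan, 0.3]),
                              corr, adj, defaults, stub_model)
```

Variable 1 is filled first using a default for variable 0; variable 0 is then filled using the freshly predicted value of variable 1 rather than a default.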

Further, the reliability of the interpolation data is calculated after each null value is filled, forming a reliability comparison table, wherein the interpolation data is divided into original data and interpolated values, and the reliability of an interpolated value is calculated by an expression defined over the following quantities:

ε is a harmonic coefficient; η is a model damage coefficient, representing the reliability loss caused by model prediction; λ_j is the reliability of the j-th variable of the current data line. If the j-th variable is original data, λ_j = 1; if it is a default value, λ_j = μ, where μ is the default damage coefficient representing the reliability loss caused by using a default value as input; if it is a value generated by interpolation in a previous step, λ_j is the value calculated by the expression in that step. ε, η and μ are constants that can be preset or left at their default values.

In summary, the invention can adjust the interpolation sequence to keep the data authenticity to the greatest extent through the step 5), and provides important reliability reference for subsequent work.

6) Since the interpolated data and the original data are both in the encoded state, this embodiment also needs to decode and restore the variables according to the processing methods and parameters stored in step 1.2), finally forming the new data set after interpolation.
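Decoding can be sketched as the inverse of the preprocessing. The example assumes min-max normalization with stored `min` and `span` parameters and one-hot encoding with a stored category list; taking the largest component with `argmax` for predicted, non-exact one-hot vectors is an assumption, and all names and values are hypothetical.

```python
import numpy as np

def decode_min_max(scaled, params):
    """Undo min-max normalization using the stored parameters."""
    return np.asarray(scaled, dtype=float) * params["span"] + params["min"]

def decode_one_hot(codes, categories):
    """Map each (possibly predicted, non-exact) one-hot row to its category."""
    return [categories[i] for i in np.argmax(codes, axis=1)]

restored = decode_min_max([0.0, 0.5, 1.0], {"min": 90.0, "span": 90.0})
labels = decode_one_hot([[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]],
                        ["Air China", "Hainan Airlines", "Spring Airlines"])
```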

Example 2

To further improve the prediction accuracy of the model, this embodiment differs from embodiment 1 in that the missing value prediction model obtained in step 3) is used as a pre-training model, and the pre-training model is used to perform intelligent interpolation of the missing data;

the average reliability of each row of interpolated data is then calculated, the data rows whose average reliability is higher than a preset threshold are used as new input for secondary training of the pre-training model to obtain the final missing value prediction model, and prediction and interpolation are performed again.
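The row selection for secondary training can be sketched as a simple filter on the reliability comparison table. The reliability values and the threshold of 0.8 below are hypothetical, since the reliability expression itself is not reproduced in this text.

```python
import numpy as np

def select_for_retraining(data, reliability, threshold=0.8):
    """Keep the rows whose mean per-cell reliability exceeds the threshold."""
    keep = reliability.mean(axis=1) > threshold
    return data[keep], keep

data = np.array([[0.2, 0.6],
                 [0.1, 0.9],
                 [0.7, 0.3]])
reliability = np.array([[1.0, 1.0],    # fully original row
                        [1.0, 0.5],    # one low-reliability interpolated value
                        [0.9, 0.8]])   # two fairly reliable interpolated values
selected, keep = select_for_retraining(data, reliability)
```

The selected rows would then be appended to the complete data lines and passed through the same training procedure, starting from the pre-trained parameters.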

According to the scheme provided by the embodiment, the data utilization rate is further improved through a mode of combining the pre-training and the secondary training, and the method has stronger adaptability under the conditions of large-scale missing of data and fewer complete data lines, so that the compatibility of the model to the large-scale missing condition is improved, and the prediction accuracy can be further improved compared with the scheme of the embodiment 1.

Example 3

The embodiment of the invention provides a missing data intelligent interpolation system based on a relation map, which is applied to the method described in embodiment 1 or embodiment 2, and comprises:

And the decoding module is used for decoding and restoring the variable.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for intelligent imputation of missing data based on relation graphs, characterized by comprising the following steps:

Generate a variable dataset and perform feature numericalization and numerical normalization preprocessing;

Based on the correlation coefficients between variables, establish a variable relationship graph;

The adjacent variables of each variable in the relation graph are used as inputs to train the neural network model to obtain a missing value prediction model. The adjacent variables are the variables in the relation graph that are directly connected to the target variable.

Based on the imputation order control strategy that considers the missing range and missing correlation within the same row, the missing value prediction model is used to realize intelligent imputation of missing data.

Decode and restore the variables;

The step of training a neural network model by using the adjacent variables of each variable in the relation graph as input to obtain a missing value prediction model includes:

Take the adjacency vector of each variable and compute its Hadamard product with each row of the input tensor, generating N intermediate tensors of the same dimension, where N is the number of variables in the input tensor;

The intermediate tensors are used as model input for N rounds of forward propagation to generate N output tensors. No parameter updates are performed after any of these rounds of forward propagation.

The input tensor is used as input for one round of forward propagation, and the process parameters are updated.

In each output tensor, set all elements except those in the j-th column to zero, where j is the index of the forward propagation round that produced that output tensor, and sum the N output tensors to obtain the final output tensor;

Backpropagation is performed based on the deviation between the final output tensor and the input tensor. This process is repeated until the network converges or the number of training iterations reaches a set value, thus completing the training of the missing value prediction model.
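As a non-limiting illustration, the training procedure of claim 1 can be sketched in NumPy as follows. The sketch rests on several assumptions that are not part of the claim: the neural network is replaced by a toy linear layer, the deviation is measured as mean squared error, and plain gradient descent is used. The names `masked_forward` and `train` and all parameter values are hypothetical. The separate forward pass on the unmasked input tensor is omitted because the toy model has no process parameters to update.

```python
import numpy as np

def masked_forward(X, A, W, b):
    """Run one masked forward pass per variable and assemble the final output."""
    final = np.zeros_like(X)
    for j in range(X.shape[1]):
        X_j = X * A[j]           # Hadamard product of every row with adjacency vector j
        Y_j = X_j @ W + b        # round j of forward propagation (toy linear model)
        final[:, j] = Y_j[:, j]  # keep column j only; equals zeroing the rest and summing
    return final

def train(X, A, lr=0.1, max_iters=500, tol=1e-8, seed=0):
    """Gradient descent on the deviation between the final output and the input."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(scale=0.1, size=(n, n))
    b = np.zeros(n)
    losses = []
    for _ in range(max_iters):
        Y = masked_forward(X, A, W, b)   # parameters stay fixed during the N rounds
        err = Y - X
        losses.append(float(np.mean(err ** 2)))
        if losses[-1] < tol:             # assumed convergence criterion
            break
        grad_Y = 2.0 * err / err.size
        W -= lr * (X.T @ grad_Y) * A.T   # only weights from adjacent inputs get gradient
        b -= lr * grad_Y.sum(axis=0)
    return W, b, losses

# Hypothetical normalized data: three mutually related variables.
x0 = np.linspace(0.0, 1.0, 50)
X = np.column_stack([x0, 0.5 * x0 + 0.2, 1.0 - x0])
A = np.ones((3, 3)) - np.eye(3)          # every variable adjacent to the other two
W, b, losses = train(X, A)
```

Because column j of the final output is computed only from the adjacent variables of variable j, a single set of weights serves as the predictor for every variable, which is the property the claim relies on.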

2. The intelligent imputation method for missing data based on relationship graphs as described in claim 1, characterized in that establishing a variable relationship graph based on the correlation coefficients between variables includes:

Calculate the correlation matrix among all variables and binarize the correlation matrix.

The diagonal elements of the binarized correlation matrix are set to 0 to obtain the adjacency matrix, and a relational graph is constructed based on the adjacency matrix.
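As a non-limiting illustration, the construction of the adjacency matrix in claim 2 can be sketched as follows. The claim does not fix the binarization rule; the sketch assumes the Pearson correlation coefficient and a threshold on its absolute value. The function name `build_adjacency` and the threshold of 0.5 are hypothetical.

```python
import numpy as np

def build_adjacency(data, threshold=0.5):
    """Return the correlation matrix and a binarized adjacency matrix with zero diagonal."""
    corr = np.corrcoef(data, rowvar=False)          # columns are variables
    adj = (np.abs(corr) >= threshold).astype(int)   # assumed binarization rule
    np.fill_diagonal(adj, 0)                        # a variable is not its own neighbor
    return corr, adj

# Hypothetical data: x1 depends linearly on x0, x2 is only weakly related to both.
x0 = np.arange(10.0)
x1 = 2.0 * x0 + 1.0
x2 = np.tile([1.0, -1.0], 5)
corr, adj = build_adjacency(np.column_stack([x0, x1, x2]))
```

Row i of `adj` is the adjacency vector of variable i, and the nonzero entries of that row identify its adjacent variables in the relation graph.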

3. The intelligent imputation method for missing data based on relationship graphs as described in claim 2, characterized in that, the step of establishing a variable relationship graph based on the correlation coefficients between variables further includes:

The obtained adjacency matrix is optimized and adjusted according to expert experience data.

4. The intelligent imputation method for missing data based on relational graphs as described in claim 1, characterized in that the imputation order control strategy that considers the missing range within the same row is:

Perform sufficiency checks on all null values in the current data row, and fill in the null values that meet the sufficiency check requirements;

The process is repeated until no null value is found that satisfies the sufficiency verification requirements.
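As a non-limiting illustration, the repeated sufficiency check of claim 4 can be sketched as follows. This passage does not define the sufficiency check itself, so the sketch assumes that a null value passes when at least a fraction `min_ratio` of its adjacent variables are currently known. The predictor `neighbor_mean` is a hypothetical stand-in for the missing value prediction model, and all names are illustrative.

```python
import numpy as np

def neighbor_mean(row, adj_vec):
    """Hypothetical stand-in for the trained model: mean of the known adjacent values."""
    return float(np.nanmean(row[adj_vec == 1]))

def fill_row(row, adj, predict=neighbor_mean, min_ratio=0.5):
    """Fill null values pass by pass until none passes the sufficiency check."""
    row = row.astype(float).copy()
    while True:
        missing = np.isnan(row)
        ready = []
        for i in np.flatnonzero(missing):
            neighbors = np.flatnonzero(adj[i])
            # Assumed check: enough adjacent variables are known (min_ratio > 0).
            if neighbors.size and (~missing[neighbors]).mean() >= min_ratio:
                ready.append(i)
        if not ready:
            break                                  # no null value passes the check
        values = [predict(row, adj[i]) for i in ready]
        row[ready] = values                        # values filled here feed the next pass
    return row

# Chain graph 0-1-2-3: only variable 0 is known at the start.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
filled = fill_row(np.array([1.0, np.nan, np.nan, np.nan]), adj)
untouched = fill_row(np.full(4, np.nan), adj)
```

In the chain example the known value propagates one variable per pass, so a null value that initially fails the check can pass it once its neighbors have been filled. A row with no known values is left unchanged.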

5. The intelligent imputation method for missing data based on relation graphs as described in claim 1, characterized in that the intelligent imputation of missing data using the missing value prediction model includes:

Compute the Hadamard product of the current data row and the adjacency vector of the target null value to obtain the masked vector;

The masked vector is used as input, and the result is calculated using the missing value prediction model.

Extract the column corresponding to the target null value from the calculation results as the predicted value and replace the target null value.
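As a non-limiting illustration, the three steps of claim 5 can be sketched as follows. The model is represented by a hypothetical linear map, and null values are replaced by a placeholder before the product is taken, since NaN multiplied by zero remains NaN in floating-point arithmetic. The names `impute_one` and `toy_model` and the weights are assumptions for demonstration only.

```python
import numpy as np

def impute_one(row, adj_vec, target, model, placeholder=0.0):
    """Predict the null value at index `target` from its adjacent variables."""
    known = np.where(np.isnan(row), placeholder, row)  # avoid NaN * 0 = NaN
    masked = known * adj_vec                           # Hadamard product: masked vector
    output = model(masked)                             # full output vector of the model
    result = row.copy()
    result[target] = output[target]                    # keep only the target column
    return result

# Hypothetical trained model: variable 2 is predicted as the sum of variables 0 and 1.
W = np.zeros((3, 3))
W[0, 2] = 1.0
W[1, 2] = 1.0

def toy_model(v):
    return v @ W

row = np.array([0.2, 0.3, np.nan])
adj_vec = np.array([1, 1, 0])          # adjacency vector of variable 2
result = impute_one(row, adj_vec, target=2, model=toy_model)
```

Only the entry at the target index is taken from the model output; the remaining entries of the data row are left as they were.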

6. The intelligent imputation method for missing data based on relational graphs as described in claim 1, characterized in that the imputation order control strategy that considers the missing correlation within the same row is:

The null values are sorted according to their missing correlation, where the missing correlation of a null value is the proportion of correlation attributable to missing values among all of its adjacent variables. The expression is as follows:

In the above formula, the terms denote, in order: the element in the i-th row and j-th column of the correlation matrix; the j-th element of the adjacency vector; and the j-th element of the missing state vector, which indicates whether the j-th element of the data row is empty.

Replace the null values in the adjacent variables with default values to form the input, and perform imputation in order of missing correlation from low to high.
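As a non-limiting illustration, one reading of the missing correlation that is consistent with the verbal definition in claim 6 is $c_i = \sum_j |r_{ij}|\, a_{ij}\, m_j \,/\, \sum_j |r_{ij}|\, a_{ij}$, where $r_{ij}$ is an element of the correlation matrix, $a_{ij}$ is the j-th element of the adjacency vector of variable i, and $m_j$ is 1 when the j-th element of the data row is empty and 0 otherwise. The symbol names, the use of the absolute value, and the 1/0 encoding are assumptions made for this sketch, not a reproduction of the claimed expression. The function names are hypothetical.

```python
import numpy as np

def missing_correlation(corr, adj, row, i):
    """Share of variable i's adjacent correlation that belongs to missing neighbors."""
    m = np.isnan(row).astype(float)          # missing state vector (assumed 1 = empty)
    weights = np.abs(corr[i]) * adj[i]       # correlation restricted to adjacent variables
    total = weights.sum()
    return float((weights * m).sum() / total) if total > 0 else 0.0

def fill_order(corr, adj, row):
    """Return null-value indices sorted by missing correlation, lowest first."""
    targets = np.flatnonzero(np.isnan(row))
    scores = {int(i): missing_correlation(corr, adj, row, i) for i in targets}
    return sorted(scores, key=scores.get), scores

# Hypothetical correlation matrix and the adjacency matrix obtained with threshold 0.5.
corr = np.array([[1.0, 0.8, 0.6, 0.1],
                 [0.8, 1.0, 0.7, 0.2],
                 [0.6, 0.7, 1.0, 0.9],
                 [0.1, 0.2, 0.9, 1.0]])
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
row = np.array([np.nan, 0.4, np.nan, np.nan])
order, scores = fill_order(corr, adj, row)
```

In this example variable 0 is filled first because most of its adjacent correlation comes from a known variable, while variable 3 is filled last because its only adjacent variable is itself missing.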

7. The intelligent imputation method for missing data based on relational graphs as described in claim 6, characterized in that, after performing the imputation, it further includes:

The confidence of the imputed data is calculated to form a confidence comparison table. The imputed data consists of original data and imputed values. The confidence of an imputed value is calculated using the following expression:

In the above formula, the terms denote, in order: the harmonic coefficient; the model loss coefficient, which represents the loss of confidence incurred by using the model's prediction; and the confidence level of the j-th variable in the current data row.

8. The intelligent imputation method for missing data based on relation graphs as described in claim 1, characterized in that, the step of training the neural network model by using the adjacent variables of each variable in the relation graph as input to obtain the missing value prediction model further includes:

The missing value prediction model is used as a pre-trained model, and the pre-trained model is used to realize intelligent imputation of missing data.

Calculate the average confidence level of each row of imputed data, and use the rows with an average confidence level higher than a preset threshold as new input to train the pre-trained model a second time to obtain the final missing value prediction model.
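As a non-limiting illustration, the row selection for secondary training in claim 8 can be sketched as follows. The confidence table in the example holds hypothetical values and is not computed with the expression of claim 7; the function name and the threshold of 0.8 are likewise assumptions.

```python
import numpy as np

def select_rows_for_retraining(imputed, confidence, threshold=0.8):
    """Keep the imputed rows whose average confidence exceeds the preset threshold."""
    row_confidence = confidence.mean(axis=1)   # average confidence of each row
    keep = row_confidence > threshold
    return imputed[keep], row_confidence

# Hypothetical imputed data and confidence table (1.0 marks original values).
imputed = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6],
                    [0.7, 0.8, 0.9]])
confidence = np.array([[1.0, 1.0, 1.0],
                       [1.0, 0.9, 0.6],
                       [0.5, 0.4, 1.0]])
selected, row_confidence = select_rows_for_retraining(imputed, confidence)
```

The selected rows would then be passed to the same training routine that produced the pre-trained model, so that partially imputed but trustworthy rows enlarge the training set.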

9. A missing data intelligent imputation system based on relational graphs, applied to the method described in any one of claims 1 to 8, characterized in that it comprises:

The processing module is used to generate a variable dataset and perform feature numericalization and numerical normalization preprocessing.

The relational graph construction module is used to build a relational graph of variables based on the correlation coefficients between variables;

The training module is used to train the neural network model by taking the adjacent variables of each variable in the relation graph as input to obtain the missing value prediction model. The adjacent variables are the variables in the relation graph that are directly connected to the target variable.

The imputation module is used to intelligently impute missing data using the missing value prediction model, based on an imputation order control strategy that considers the missing range and missing correlation within the same row.

The decoding module is used to decode and restore variables.
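As a non-limiting illustration, the relationship between the preprocessing performed by the processing module and the restoration performed by the decoding module can be sketched as follows. The claims do not specify the encoding, so the sketch assumes integer label encoding for categorical features and min-max normalization; all function names and sample values are hypothetical.

```python
import numpy as np

def encode_column(values):
    """Feature numericalization: map each category to an integer code."""
    categories = sorted(set(values))
    codes = np.array([categories.index(v) for v in values], dtype=float)
    return codes, categories

def normalize(col):
    """Min-max normalization; returns the statistics needed to undo it."""
    lo, hi = col.min(), col.max()
    span = hi - lo if hi > lo else 1.0
    return (col - lo) / span, (lo, span)

def decode_column(norm, stats, categories=None):
    """Undo normalization and, for categorical features, map codes back to labels."""
    lo, span = stats
    raw = norm * span + lo
    if categories is None:
        return raw
    idx = np.clip(np.rint(raw), 0, len(categories) - 1).astype(int)
    return [categories[k] for k in idx]

values = ["red", "green", "red", "blue"]
codes, categories = encode_column(values)
norm, stats = normalize(codes)
restored = decode_column(norm, stats, categories)
# An imputed value in normalized space is rounded to the nearest category.
imputed_label = decode_column(np.array([0.45]), stats, categories)
```

Storing the normalization statistics and category list at preprocessing time is what allows imputed values, which are produced in normalized space, to be restored to the original representation.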