Disclosure of Invention
The invention aims to provide a one-stop data interpolation technique. Based on the idea of regression interpolation, it introduces a relation map between variables as input to train a missing value prediction model, so that a single model can predict the missing values of multiple variables. By constructing an interpolation sequence control strategy that considers the missing range and the missing correlation within the same row, the technique is suitable for large-scale, large-proportion data-missing scenarios, thereby achieving data interpolation with high reliability and high computational efficiency and solving the problems pointed out in the background art.
The embodiments of the invention are realized by the following technical scheme: an intelligent interpolation method for missing data based on a relation map, comprising the following steps:
Generating a variable data set, and performing feature numeralization and numerical normalization pretreatment;
Based on the correlation coefficient between the variables, establishing a variable relation graph;
Training a neural network model by taking adjacent variables of all variables in the relation graph as input to obtain a missing value prediction model, wherein the adjacent variables are variables which are directly connected with a target variable in the relation graph;
Based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row, realizing intelligent interpolation of the missing data by using the missing value prediction model;
and decoding and restoring the variable.
According to a preferred embodiment, the establishing a variable relation map based on the correlation coefficient between the variables includes:
calculating a correlation matrix among all variables, and performing binarization processing on the correlation matrix;
and setting the diagonal elements of the binarized correlation matrix to 0 to obtain an adjacency matrix, and constructing the relation map based on the adjacency matrix.
According to a preferred embodiment, the establishing a variable relation map based on the correlation coefficient between the variables further includes:
based on the obtained adjacency matrix, the adjacency matrix is optimally adjusted based on expert experience data.
According to a preferred embodiment, the training the neural network model by using the adjacent variable of each variable in the relation map as an input to obtain a missing value prediction model includes:
taking the adjacency vector of each variable, and computing the Hadamard product of the adjacency vector and the input tensor row by row to generate N intermediate tensors with the same dimension, wherein N is the number of variables in the input tensor;
performing N rounds of forward propagation by taking the intermediate tensor as model input to generate N output tensors, wherein parameter updating is not performed after each round of forward propagation;
performing one round of forward propagation by taking the input tensor as input, and updating a process parameter;
setting the other elements except for the j-th column in the output tensor to zero, and summing N output tensors to obtain a final output tensor, wherein j is the forward propagation round number of the output tensor;
And carrying out back propagation based on the deviation of the final output tensor and the input tensor, repeating until the network converges or the training times reach a set value, and completing the training of the missing value prediction model.
According to a preferred embodiment, the interpolation sequence control strategy based on considering the missing ranges in the same row is:
Performing sufficiency verification on all null values of the current data line, and performing filling on the null values meeting the sufficiency verification requirement;
Iterating cyclically until no null value meeting the sufficiency verification requirement remains.
According to a preferred embodiment, the intelligent interpolation of missing data using the missing value prediction model includes:
Computing the Hadamard product of the adjacency vector of the target null value and the current data row to obtain a masked vector;
taking the vector as input, and calculating a result through a missing value prediction model;
and extracting a column corresponding to the target null value in the calculation result as a predicted value to replace the target null value.
According to a preferred embodiment, the interpolation sequence control strategy based on considering the correlation of the deletions in the same row is:
the null values are ordered according to their missing correlation, wherein the missing correlation is the proportion of correlation contributed by missing values among all adjacent variables of the current null value, and the expression is as follows:
In the above formula, $r_{ij}$ represents the element in the i-th row and j-th column of the correlation matrix, $L_j$ is the j-th element of the adjacency vector L, and $Z_j$ is the j-th element of the missing state vector Z; if the j-th entry of the data row is null, $Z_j = 1$, otherwise $Z_j = 0$;
filling is performed in ascending order of missing correlation, with default values substituted for the null values in the adjacent variables as input.
According to a preferred embodiment, after the filling, the method further comprises:
calculating the reliability of the interpolation data to form a reliability comparison table, wherein the interpolation data is divided into original data and interpolated values, and the reliability of an interpolated value is calculated by the following expression:
In the above formula, ε is a harmonic coefficient, η is a model damage coefficient representing the reliability loss caused by model prediction, and $\lambda_j$ represents the reliability of the j-th variable of the current data row.
According to a preferred embodiment, the training the neural network model by using the adjacent variable of each variable in the relation map as an input to obtain a missing value prediction model further includes:
Taking the missing value prediction model as a pre-training model, and using the pre-training model to realize intelligent interpolation of missing data;
And calculating the average credibility of each row of interpolation data, and taking the data row with the average credibility higher than a preset threshold value as a new input to perform secondary training on the pre-training model to obtain a final missing value prediction model.
The invention also provides an intelligent interpolation system for missing data based on the relation map, which applies the method described above and comprises the following modules:
the processing module is used for generating a variable data set and carrying out feature numeralization and numerical normalization preprocessing;
The relation map construction module is used for building a variable relation map based on the correlation coefficient between the variables;
The training module is used for training the neural network model by taking adjacent variables of all variables in the relation graph as input so as to obtain a missing value prediction model, wherein the adjacent variables are variables which are directly connected with the target variables in the relation graph;
The interpolation module is used for realizing intelligent interpolation of the missing data by using the missing value prediction model based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row;
And the decoding module is used for decoding and restoring the variable.
The technical scheme of the embodiments of the invention has at least the following advantages and beneficial effects. First, the method comprises a relation map construction strategy for the variables and a control strategy that uses the relation map to regulate the inputs and outputs of the model; compared with traditional interpolation methods, this greatly reduces the number of input variables and the dependence on other variables in the data set, giving stronger compatibility with large-scale, large-proportion missing data. Second, the method comprises a unified training strategy for the missing value prediction model: in big-data scenarios with many tables and many fields, all missing variables are predicted by the same model, which greatly reduces modeling time and system complexity. Third, the method comprises an interpolation sequence control strategy that considers the missing range and the missing correlation within the same row; by adjusting the interpolation order, it preserves data authenticity to the greatest extent and provides an important reliability reference for subsequent work.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Example 1
The invention discloses a missing data intelligent interpolation method based on a relation map, wherein a flow chart is shown in fig. 1, and the method is implemented according to the following steps:
1) The data preprocessing step, because the input variable may have a plurality of formats such as numerical value, character, time and the like, the variable data set needs to be preprocessed before modeling, and the specific steps are as follows:
1.1 The character information is digitized; the available methods include, but are not limited to, label encoding, one-hot encoding, ordinal encoding, frequency encoding, relative time, etc.
The following is a brief description taking one-hot encoding as an example:
For example, the characters include China Southern Airlines, Air China, China Eastern Airlines, Hainan Airlines, Xiamen Airlines, Sichuan Airlines, Shenzhen Airlines, Shandong Airlines, Juneyao Airlines, Spring Airlines, etc. These values are not continuous; they are discrete and unordered.
One-hot encoding uses an N-bit state register to encode N states. For the above character example, nine categories are encoded here, so N = 9:
China Southern Airlines → 100000000
Air China → 010000000
China Eastern Airlines → 001000000
Hainan Airlines → 000100000
Xiamen Airlines → 000010000
Sichuan Airlines → 000001000
Shenzhen Airlines → 000000100
Juneyao Airlines → 000000010
Spring Airlines → 000000001
1.2 The numerical information is normalized; the available methods include, but are not limited to, min-max (dispersion) normalization, logarithmic normalization, zero-mean normalization, etc., and are not described in detail here.
After the preprocessing is completed, the processing methods and parameters are stored for subsequent decoding.
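The preprocessing of step 1) can be sketched as follows. This is a minimal illustration and not the embodiment's implementation: the function names, the three-airline category list, and the choice of min-max normalization are assumptions made for the example. The point it shows is that the encoding parameters (category order, minimum, maximum) are kept so that the values can be decoded in step 6).

```python
import numpy as np

def one_hot_encode(values, categories):
    """Encode each value as an N-bit vector with a single 1, N = len(categories)."""
    index = {name: i for i, name in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return encoded

def one_hot_decode(encoded, categories):
    """Restore the original labels from one-hot rows."""
    return [categories[i] for i in np.argmax(encoded, axis=1)]

def fit_min_max(column):
    """Store the parameters needed for normalization and later decoding."""
    column = np.asarray(column, dtype=float)
    return {"min": float(np.nanmin(column)), "max": float(np.nanmax(column))}

def normalize(column, params):
    """Min-max normalization; missing entries (NaN) stay missing."""
    column = np.asarray(column, dtype=float)
    span = params["max"] - params["min"]
    if span == 0:
        return np.where(np.isnan(column), np.nan, 0.0)
    return (column - params["min"]) / span

def denormalize(scaled, params):
    """Inverse of normalize, using the stored parameters."""
    scaled = np.asarray(scaled, dtype=float)
    return scaled * (params["max"] - params["min"]) + params["min"]

# Hypothetical example data.
airlines = ["Hainan Airlines", "Xiamen Airlines", "Sichuan Airlines"]
codes = one_hot_encode(["Xiamen Airlines", "Hainan Airlines"], airlines)

passengers = [120.0, np.nan, 180.0, 150.0]
params = fit_min_max(passengers)
scaled = normalize(passengers, params)
restored = denormalize(scaled, params)
```

Missing entries are represented here as NaN and pass through normalization unchanged, so they remain identifiable as null values in the later interpolation steps.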
2) According to the statistical correlation and/or the business correlation, a relation map among all variables in the data set is constructed, and the specific steps are as follows:
2.1 The specific steps of constructing a relationship map based on statistical correlation are as follows:
2.1.1 Calculate the correlation matrix between all variables. The available correlation measures include, but are not limited to, the Pearson correlation coefficient, the Spearman correlation coefficient, the Kendall correlation coefficient, etc. The Pearson correlation coefficient is calculated as follows:
$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
In the above formula, $\bar{X}$ represents the mean of the variable $X_i$, $\bar{Y}$ represents the mean of the variable $Y_i$, and n is the number of observations. It should be noted that the Pearson correlation coefficient ranges from -1 to 1: a coefficient value r of 1 or -1 indicates a perfect linear relationship between $X_i$ and $Y_i$, and a coefficient value r of 0 indicates no linear relationship between them. Specifically, the correlation coefficient is positive when $X_i$ and $Y_i$ tend to fall on the same side of their respective means, and negative when they tend to fall on opposite sides.
2.1.2 Binarize the correlation matrix. The binarization threshold can be preset or a default value can be used; this is not described in detail here.
2.1.3 Set the diagonal elements of the binarized correlation matrix to 0, so that a variable is not regarded as adjacent to itself. The resulting matrix is the adjacency matrix used as the relation map, which describes the associations between variables, as shown in Table 1, an example of an adjacency matrix between variables provided by the embodiment of the invention:
TABLE 1 Adjacency matrix between variables
| | Number of passengers | Luggage number | Door closing time | Task timeout rate | ... |
|---|---|---|---|---|---|
| Number of passengers | 0 | 1 | 1 | 0 | ... |
| Luggage number | 1 | 0 | 1 | 0 | ... |
| Door closing time | 1 | 1 | 0 | 1 | ... |
| Task timeout rate | 0 | 0 | 1 | 0 | ... |
| ... | ... | ... | ... | ... | ... |
The adjacency matrix is a symmetric matrix, and the adjacency vector of a target variable is the matrix row corresponding to that variable. An element of 1 indicates that the two variables corresponding to the row and the column are associated, and an element of 0 indicates that they are not associated.
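Steps 2.1.1 to 2.1.3 can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_adjacency`, the threshold of 0.5, and thresholding on the absolute value of the Pearson coefficient are choices made for the example; the embodiment leaves the threshold and the correlation measure open.

```python
import numpy as np

def build_adjacency(data, threshold=0.5):
    """Build a relation-map adjacency matrix from complete data rows.

    data: array of shape (rows, N), one column per variable.
    Returns the N x N correlation matrix and the binary adjacency matrix.
    """
    corr = np.corrcoef(data, rowvar=False)            # 2.1.1 Pearson correlation matrix
    adjacency = (np.abs(corr) >= threshold).astype(int)  # 2.1.2 binarization (assumed rule)
    np.fill_diagonal(adjacency, 0)                     # 2.1.3 a variable is not its own neighbour
    return corr, adjacency

# Hypothetical data: columns 0 and 1 move together, column 2 alternates.
data = np.array([
    [1.0,  2.0,  1.0],
    [2.0,  4.0, -1.0],
    [3.0,  6.0,  1.0],
    [4.0,  8.0, -1.0],
    [5.0, 10.0,  1.0],
    [6.0, 13.0, -1.0],
])
corr, adjacency = build_adjacency(data)
```

Row j of `adjacency` is the adjacency vector of variable j, which the later steps use as a mask.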
2.2 Optimize and adjust the relation map based on business correlation. Practitioners are more familiar with the business data, so they can capture causal relationships beyond statistical correlation and can also remove homogeneous associations from the relation map. Taking the civil aviation scenario data in fig. 2 as an example, the number of passengers and the number of baggage items are highly correlated; removing one of them when predicting the door closing time reduces the amount of calculation and the data dependence, and the benefit outweighs the loss in precision. Therefore, starting from the generated adjacency matrix and based on expert experience data, the elements corresponding to associations to be removed are set to 0 and the elements corresponding to associations to be added are set to 1. The finally obtained relation map is shown in fig. 2.
In summary, through step 2), the number of input variables and the dependence on other variables in the data set are greatly reduced on the basis of the traditional interpolation method, and the method has stronger compatibility for the situations of 'large scale and large proportion' of data.
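The expert adjustment of step 2.2) amounts to overwriting individual entries of the adjacency matrix. The sketch below uses the Table 1 matrix; the function name `apply_expert_edits` and the specific edits are hypothetical, and updating both (i, j) and (j, i) is an assumption made here to keep the matrix symmetric.

```python
import numpy as np

def apply_expert_edits(adjacency, remove=(), add=()):
    """Return a copy of the adjacency matrix with expert edits applied.

    remove / add: iterables of (i, j) index pairs; each edit is applied
    symmetrically (an assumption of this sketch).
    """
    adjusted = np.array(adjacency, dtype=int, copy=True)
    for i, j in remove:
        adjusted[i, j] = adjusted[j, i] = 0
    for i, j in add:
        adjusted[i, j] = adjusted[j, i] = 1
    return adjusted

# Table 1: passengers, luggage, door closing time, task timeout rate.
table1 = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
# Hypothetical edits: drop luggage <-> door closing time as redundant,
# and add passengers <-> task timeout rate.
adjusted = apply_expert_edits(table1, remove=[(1, 2)], add=[(0, 3)])
```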
3) Training the neural network model by taking adjacent variables of all variables in the relation map as input to obtain a missing value prediction model, wherein the method comprises the following specific steps of:
3.1 The model is initialized, and general parameters of the training neural network model such as the number of layers of the neural network, the number of neurons, an activation function, a learning rate, a loss function, an optimizer and the like are set, wherein in the embodiment, the model requires that the input dimension and the output dimension are the same and are N.
In addition, before the forward propagation starts, the neuron weights need to be initialized, and the process is the same as that of the traditional feedforward neural network, and is not repeated here.
It should be noted that, the embodiment of the present invention adopts an improved neural network model, where the improved neural network model refers to the improvement of the timing and the number of forward propagation and backward propagation and the organization of input and output based on the deep feed forward network, and does not limit the network layer number, the neuron number, the activation function and other general parameters of the neural network.
3.2 Forward propagation. In this embodiment, the input tensor is $P_{M \times N}$, a matrix of dimension M×N, where M is the batch size, i.e. the number of data rows in the batch input, and N is the number of variables in the data set. It should be noted that the input tensors used in the training process are complete data rows from the complete data set.
Further, the adjacency vector of each variable in $P_{M \times N}$ is taken, and the Hadamard product of the adjacency vector and $P_{M \times N}$ is computed row by row, generating N intermediate tensors of dimension M×N. The purpose is to generate intermediate tensors from which the non-adjacent variables and the target variable itself have been removed; the adjacent variables are the variables directly connected to the target variable in the relation map.
The intermediate tensors are used as model inputs for N rounds of forward propagation, generating N output tensors. The parameters are not updated by backward gradients immediately after each round of forward propagation; only the outputs are recorded.
Finally, one round of forward propagation is performed with $P_{M \times N}$ as input, so as to update the process parameters for the subsequent gradient calculation.
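The masking that produces the N intermediate tensors can be sketched with NumPy broadcasting. The function name `masked_inputs` and the toy batch are assumptions for illustration; the adjacency matrix is the one from Table 1.

```python
import numpy as np

def masked_inputs(P, adjacency):
    """Return an array of shape (N, M, N).

    Slice j is the Hadamard product of every row of P with the adjacency
    vector of variable j, so only the neighbours of variable j are kept.
    """
    return adjacency[:, None, :] * P[None, :, :]

adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
# Hypothetical batch with M = 2 complete rows and N = 4 variables.
P = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
])
masked = masked_inputs(P, adjacency)
```

Because the diagonal of the adjacency matrix is 0, slice j never contains the target variable itself, so the model cannot simply copy its input to its output.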
3.3 Calculating loss, wherein the specific steps are as follows:
3.3.1 In each output tensor, set all elements except those in the j-th column to zero, where j is the forward propagation round in which that output tensor was generated, and sum the N output tensors to obtain the final output tensor $O_{M \times N}$. The purpose is to extract the valid column from each of the N output tensors and assemble these columns into the final output.
3.3.2 Calculate the deviation between $O_{M \times N}$ and $P_{M \times N}$ based on a loss function. It should be noted that $P_{M \times N}$ is the correct value of the output, so the deviation between $O_{M \times N}$ and $P_{M \times N}$ is the loss of the current model. The loss function used includes, but is not limited to, general machine learning loss functions such as L1 norm loss, mean square error loss, cross entropy loss and KL divergence loss, which are not described in detail here.
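The column extraction and loss calculation of step 3.3) can be sketched as follows. The names `assemble_output` and `mse_loss` are assumptions, the stacked outputs are filled with placeholder values rather than real model outputs, and mean square error is just one of the loss functions the text permits.

```python
import numpy as np

def assemble_output(outputs):
    """Combine N output tensors into the final output tensor O.

    outputs: array of shape (N, M, N), where outputs[j] is the model output
    of forward round j. Column j of outputs[j] is kept, everything else is
    zeroed, and the N tensors are summed.
    """
    n = outputs.shape[0]
    keep = np.eye(n)[:, None, :]          # keep[j] selects column j only
    return (outputs * keep).sum(axis=0)

def mse_loss(O, P):
    """Mean square error between the final output and the complete input."""
    return float(np.mean((O - P) ** 2))

# Placeholder outputs for M = 2, N = 3: round j outputs the constant j + 1.
outputs = np.stack([np.full((2, 3), float(j + 1)) for j in range(3)])
O = assemble_output(outputs)

P = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 5.0]])
loss = mse_loss(O, P)
```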
4) Back propagation. Based on the deviation between $O_{M \times N}$ and $P_{M \times N}$, the contribution of each neuron to the loss is calculated, and the weights are updated according to the gradients computed by the back propagation algorithm; this is the same as in a conventional feedforward neural network and is not described here. It should be noted that, throughout the training of a deep neural network, back propagation takes much more time than forward propagation, so the multiple rounds of forward propagation do not greatly increase the training time. The specific forward propagation and loss calculation procedure is shown in fig. 3.
Steps 3.2) to 4) are repeated until the network converges or the number of training iterations reaches a set value, which completes the training of the missing value prediction model.
In summary, with the unified model obtained through steps 3) and 4), the same model is used to predict all missing variables in big-data scenarios with many tables and many fields, which greatly reduces modeling time and system complexity.
5) Based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row, intelligent interpolation of the missing data is performed using the missing value prediction model. The specific steps are as follows:
5.1 Initialize the row number m = 1 and perform null value screening (see fig. 4), which specifically includes the following steps:
5.1.1 Perform sufficiency verification on all null values of the current data row. Sufficiency verification checks whether the adjacent variables corresponding to the current null value are all non-null; if they are all non-null, the sufficiency verification is satisfied.
5.1.2 Filling null values meeting sufficiency verification requirements, specifically as follows:
5.1.2.1 Compute the Hadamard product of the adjacency vector of the target null value and the current data row to obtain a masked vector;
5.1.2.2 Take the masked vector as input and calculate the result with the missing value prediction model;
5.1.2.3 Extract the column corresponding to the target null value from the calculation result as the predicted value, and use it to replace the target null value.
5.1.3 Iterate cyclically until no null value meeting the sufficiency verification requirement remains.
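The screening loop of step 5.1) can be sketched as follows. Everything here is illustrative: `fill_sufficient` is a hypothetical name, null values are represented as NaN, and `toy_predict` is a stand-in for the trained missing value prediction model, not the model itself.

```python
import numpy as np

def fill_sufficient(row, adjacency, predict):
    """Repeatedly fill null values whose adjacent variables are all non-null.

    row: 1-D array with NaN for null values.
    adjacency: N x N binary adjacency matrix.
    predict: callable mapping a masked length-N vector to a length-N output.
    """
    row = np.array(row, dtype=float, copy=True)
    filled_any = True
    while filled_any:                                   # 5.1.3 iterate
        filled_any = False
        for j in np.flatnonzero(np.isnan(row)):
            neighbours = adjacency[j].astype(bool)
            if np.isnan(row[neighbours]).any():         # 5.1.1 sufficiency check
                continue
            # 5.1.2.1 masking; np.where is used because NaN * 0 is NaN
            masked = np.where(neighbours, row, 0.0)
            row[j] = predict(masked)[j]                 # 5.1.2.2 and 5.1.2.3
            filled_any = True
    return row

adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

def toy_predict(masked):
    """Stand-in model: every output equals the sum of the masked inputs."""
    return np.full(masked.shape, masked.sum())

filled = fill_sufficient([100.0, np.nan, 30.0, np.nan], adjacency, toy_predict)
stuck = fill_sufficient([np.nan, np.nan, 30.0, 5.0], adjacency, toy_predict)
```

In the second call, variables 0 and 1 are each other's neighbours and both are null, so neither passes the sufficiency check and both remain null after the loop; this is the case handled by the ordering stage that follows.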
5.2 The absence of null values that satisfy the sufficiency verification does not mean that all null values have been filled; several null values may depend on one another and therefore cannot be filled. In this case, null value ordering is performed, which specifically includes the following steps:
5.2.1 Ordering according to the missing correlation R of the current null value, wherein the expression of the missing correlation is as follows:
In the above formula, r ij represents an element of the ith row and jth column of the correlation matrix, l j represents a jth element of the adjacent vector, z j represents a jth element of the missing state vector, if the jth bit of the data row is null, z j =1, otherwise z j =0.
5.2.2 Assignment filling: the null values are filled in ascending order of missing correlation R according to the filling procedure of step 5.1.2). This is repeated until the row number m is greater than the total number of rows M; otherwise, m is incremented by 1 and the process returns to step 5.1.1).
It should be noted that, before filling, a default value is used as a model input instead of a null value in the adjacent variable, where the default value includes, but is not limited to, a median, a mode, or a mean value of the variables in the dataset, and the description is not repeated here.
In the initial stage of assignment filling within a row, the inputs contain more default values, but their influence is small because the missing correlation R is low. Toward the later stage, the missing correlation is higher, but the inputs contain fewer default values. Overall, this improves the reliability of the interpolated data to the greatest extent.
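The ordering of step 5.2) can be sketched as follows. The expression for R is not reproduced in this text, so the form used here, the share of absolute correlation over the adjacent variables that belongs to currently missing ones, $R_i = \sum_j |r_{ij}| l_j z_j / \sum_j |r_{ij}| l_j$, is an assumption based on the verbal definition and the symbols $r_{ij}$, $l_j$, $z_j$. The function names and the example matrices are also hypothetical.

```python
import numpy as np

def missing_correlation(corr, adjacency, row, i):
    """Assumed form of R for null value i: correlation share of missing neighbours."""
    weights = np.abs(corr[i]) * adjacency[i]   # |r_ij| * l_j
    z = np.isnan(row).astype(float)            # missing state vector
    total = weights.sum()
    return float((weights * z).sum() / total) if total > 0 else 0.0

def fill_order(corr, adjacency, row):
    """Indices of null values sorted from low to high missing correlation."""
    nulls = [int(i) for i in np.flatnonzero(np.isnan(row))]
    return sorted(nulls, key=lambda i: missing_correlation(corr, adjacency, row, i))

def with_defaults(row, defaults):
    """Replace remaining null values with per-variable defaults before prediction."""
    return np.where(np.isnan(row), defaults, row)

corr = np.array([
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.6, 0.7, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
row = np.array([np.nan, np.nan, np.nan, 5.0])
order = fill_order(corr, adjacency, row)
model_input = with_defaults(row, defaults=np.array([0.5, 0.5, 0.5, 0.5]))
```

In this example variable 2 has one non-null neighbour (variable 3), so its R is about 0.62 and it is filled first; variables 0 and 1 have only null neighbours, so their R is 1.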
Further, after each interpolation of a null value, the reliability of the interpolation data is calculated to form a reliability comparison table, wherein the interpolation data is divided into original data and interpolated values, and the reliability of an interpolated value is calculated by the following expression:
In the above formula, ε is a harmonic coefficient; η is a model damage coefficient representing the reliability loss caused by model prediction; and $\lambda_j$ is the reliability of the j-th variable of the current data row. If the j-th variable is original data, $\lambda_j = 1$; if it is a default value, $\lambda_j = \mu$, where μ is the default damage coefficient representing the reliability loss caused by using a default value as input; if it is a value generated by interpolation in a previous step, $\lambda_j$ is the value calculated by the above formula in that step. ε, η and μ are constants that can be preset or take default values.
In summary, the invention can adjust the interpolation sequence to keep the data authenticity to the greatest extent through the step 5), and provides important reliability reference for subsequent work.
6) Since the interpolation data and the original data are in the encoded state, the present embodiment also needs to decode and restore the variable according to the processing method and parameters stored in step 1.2), and finally form a new data set after the interpolation is implemented.
Example 2
In order to further improve the prediction accuracy of the model, this embodiment differs from embodiment 1 in that the missing value prediction model obtained in step 3) is used as a pre-training model and is trained a second time. Specifically, the pre-training model is first used to realize intelligent interpolation of the missing data;
and calculating the average reliability of the interpolation data of each row, taking the data row with the average reliability higher than a preset threshold value as a new input to perform secondary training on the pre-training model, obtaining a final missing value prediction model, and performing prediction interpolation again.
According to the scheme provided by the embodiment, the data utilization rate is further improved through a mode of combining the pre-training and the secondary training, and the method has stronger adaptability under the conditions of large-scale missing of data and fewer complete data lines, so that the compatibility of the model to the large-scale missing condition is improved, and the prediction accuracy can be further improved compared with the scheme of the embodiment 1.
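The row selection for secondary training can be sketched as follows. The function name `select_rows_for_retraining`, the threshold of 0.8, and the example reliability table are assumptions for illustration; the reliability values themselves would come from the reliability comparison table of embodiment 1.

```python
import numpy as np

def select_rows_for_retraining(data, reliability, threshold=0.8):
    """Keep interpolated rows whose average reliability exceeds the threshold.

    data, reliability: arrays of shape (rows, N); reliability holds one
    value per cell (1 for original data, lower for interpolated values).
    """
    mean_reliability = reliability.mean(axis=1)
    keep = mean_reliability > threshold
    return data[keep], mean_reliability

# Hypothetical interpolated data and reliability comparison table.
data = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])
reliability = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 0.5, 0.6],
    [1.0, 1.0, 0.7],
])
selected, mean_reliability = select_rows_for_retraining(data, reliability)
```

The selected rows are then treated as complete data rows and fed to the same training procedure as in step 3), starting from the pre-trained weights.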
Example 3
The embodiment of the invention provides an intelligent interpolation system for missing data based on a relation map, which applies the method described in embodiment 1 or embodiment 2 and comprises the following modules:
the processing module is used for generating a variable data set and carrying out feature numeralization and numerical normalization preprocessing;
The relation map construction module is used for building a variable relation map based on the correlation coefficient between the variables;
The training module is used for training the neural network model by taking adjacent variables of all variables in the relation graph as input so as to obtain a missing value prediction model, wherein the adjacent variables are variables which are directly connected with the target variables in the relation graph;
The interpolation module is used for realizing intelligent interpolation of the missing data by using the missing value prediction model based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row;
And the decoding module is used for decoding and restoring the variable.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.