WO2020034593A1 - Method and apparatus for processing missing feature in crowd performance feature prediction - Google Patents
- Publication number
- WO2020034593A1 (PCT/CN2019/073294)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- missing
- feature
- distribution
- gaussian
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Definitions
- the performance level of the crowd can usually be predicted by training the performance prediction model of the crowd.
- the larger the amount of data in the training set the higher the prediction accuracy of the prediction model obtained by training.
- there are fewer complete training sets that can be used to train a crowd performance level prediction model and in most cases there will be missing features in the training set. Therefore, in order to improve the prediction accuracy of a prediction model, it is usually necessary to fill or complete missing features.
- the present application provides a method and a device for processing missing features in the prediction of crowd performance characteristics, mainly to avoid deviations when filling missing features and to avoid deviations in the association between the prediction results of the trained prediction model and the corresponding features, thereby improving the prediction accuracy of the trained prediction model.
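- a method for processing missing features in a crowd performance prediction, including:
- obtaining the existing features in the crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features, the preset mixed Gaussian model consisting of a multivariate Gaussian distribution corresponding to the missing features;
- estimating the values corresponding to the missing features according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model;
- filling the values corresponding to the missing features into the crowd performance prediction training set.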
- a missing feature processing device for crowd performance prediction including:
- An obtaining unit configured to obtain an existing feature and a preset mixed Gaussian model corresponding to the missing feature in the training set for predicting the performance of the crowd;
- the preset mixed Gaussian model is composed of a multivariate Gaussian distribution corresponding to the missing feature;
- An estimation unit configured to estimate a value corresponding to the missing feature according to a maximum expectation algorithm of the existing feature and the preset mixed Gaussian model
- the filling unit is configured to fill a value corresponding to the missing feature into the crowd performance prediction training set.
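- a computer non-volatile readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the following steps are implemented:
- obtaining the existing features in the crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features, the preset mixed Gaussian model consisting of a multivariate Gaussian distribution corresponding to the missing features;
- estimating the values corresponding to the missing features according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model;
- filling the values corresponding to the missing features into the crowd performance prediction training set.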
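- a computer device including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
- when the processor executes the computer-readable instructions, the following steps are implemented:
- obtaining the existing features in the crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features, the preset mixed Gaussian model consisting of a multivariate Gaussian distribution corresponding to the missing features;
- estimating the values corresponding to the missing features according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model;
- filling the values corresponding to the missing features into the crowd performance prediction training set.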
- compared with the conventional filling methods currently used to fill missing features in the crowd performance prediction training set, such as special value interpolation and hot card interpolation (hot-deck imputation), the method and device of the present application work as follows.
- the present application can obtain the existing features in the crowd performance prediction training set and the preset mixed Gaussian model corresponding to the missing features, the preset mixed Gaussian model being composed of a multivariate Gaussian distribution corresponding to the missing features;
- a value corresponding to the missing feature is estimated according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model.
- the estimated values corresponding to the missing features can be filled into the crowd performance prediction training set, so that the missing data is filled based on the preset mixed Gaussian model corresponding to the missing features.
- because the preset mixed Gaussian model is composed of a multivariate Gaussian distribution corresponding to the missing feature, the filled missing feature reflects its own distribution and remains associated with its own sample. This avoids bias in filling the missing features and bias in the association between the prediction results of the trained prediction model and the corresponding features, which can improve the prediction accuracy of the trained prediction model.
- FIG. 1 shows a flowchart of a method for processing missing features in a crowd performance prediction according to an embodiment of the present application
- FIG. 2 shows a flowchart of another method for processing missing features in a crowd performance prediction provided by an embodiment of the present application
- FIG. 3 is a schematic structural diagram of a device for processing missing features in a crowd performance prediction provided by an embodiment of the present application
- FIG. 4 is a schematic structural diagram of another missing feature processing device in crowd performance prediction according to an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
- an embodiment of the present application provides a method for processing missing features in a crowd performance prediction. As shown in FIG. 1, the method includes:
- the preset mixed Gaussian model may be composed of a multivariate Gaussian distribution corresponding to the missing feature.
- the training set may include a crowd performance feature and a crowd performance level, and the missing feature in the embodiment of the present application may be a feature in the crowd performance feature.
- the performance characteristics of the crowd may include, but are not limited to, the average number of courses per month, the latitude and longitude of the work address, the number of Internet transaction products in a single month, and the level of Internet transactions in six months. For example, if there are 100 training samples, and 40 of the training samples do not have Internet transaction levels within six months, you can confirm that the "Internet transaction levels within six months" of the 40 training samples are missing features.
- Existing features can be "the average number of learning courses per month,
- each Gaussian distribution in the preset mixed Gaussian model may correspond to one category of the Internet transaction level within half a year.
- the number of categories of the Internet transaction level within half a year may specifically be m.
- the maximum expectation algorithm (commonly known as the expectation-maximization, or EM, algorithm) may be an iterative algorithm that includes a maximum likelihood estimation calculation and an expectation calculation, which are performed alternately.
- through these iterations, the probability that the missing feature belongs to each meta-Gaussian distribution can be estimated, so that the Gaussian distribution to which the missing feature belongs can be estimated and the value corresponding to the missing feature can be determined.
- for example, training sample 3 is (average number of courses per month: 80; latitude and longitude of the work address: (123.435, 41.819); number of Internet trading products in a single month: 100; Internet trading level within half a year: not recorded).
- the missing feature in training sample 3 is therefore "Internet trading level within half a year".
- compared with the conventional filling methods currently used to fill missing features in the crowd performance prediction training set, such as special value interpolation and hot card interpolation, the method for processing missing features provided by this embodiment works as follows.
- the existing features in the crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features can be obtained, the preset mixed Gaussian model being composed of a multivariate Gaussian distribution corresponding to the missing features; the missing feature is then estimated according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model.
- the estimated missing features can be filled into the crowd performance prediction training set, so that the missing data is filled based on the preset mixed Gaussian model corresponding to the missing features.
- because the preset mixed Gaussian model is composed of a multivariate Gaussian distribution corresponding to the missing features, the filled missing features reflect their own distribution and remain associated with their own samples. This avoids bias in filling the missing features and bias in the association between the prediction results of the trained prediction model and the corresponding features, which can improve the prediction accuracy of the trained prediction model.
- this embodiment of the present application provides another method for processing missing features in the crowd performance prediction. As shown in FIG. 2, the method includes:
- the preset mixed Gaussian model may be composed of a multivariate Gaussian distribution corresponding to the missing feature.
- the existing features may exist in the form of feature vectors. If the features of the three dimensions "average number of courses per month, latitude and longitude of work address, and number of Internet trading products in a single month" together uniquely identify a category of the Internet transaction level within half a year, these three features can be expressed as a feature vector.
- the method may further include: determining a multivariate Gaussian distribution corresponding to the missing feature; and constructing the preset mixed Gaussian model according to the multivariate Gaussian distribution.
- for example, the missing feature is "Internet transaction level within half a year".
- the multiple categories corresponding to "Internet transaction level within half a year" are determined, and each category can correspond to a one-dimensional Gaussian distribution.
- the "average number of monthly courses, work address latitude and longitude, and number of Internet trading products in a single month" can be taken as the observation sample vector, that is, the observed sample vector is used to observe the multi-dimensional Gaussian distribution of the missing feature; the Internet transaction level within six months is divided into multiple categories, and then
- the weight, the corresponding mean vector, and the covariance matrix of each category are used to construct the preset mixed Gaussian model corresponding to the Internet transaction level within half a year.
- the distribution parameters may include a mixing coefficient, a mean, and a covariance. If the existing features exist in the form of feature vectors, the distribution parameters may include a mixing coefficient, a mean vector, and a covariance matrix, and the mixing coefficient may be a ratio of the number of samples belonging to the corresponding Gaussian distribution to the total number of samples.
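The distribution parameters described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `estimate_component_parameters`, the two-dimensional feature vectors (average monthly courses, Internet trading products in a single month), and the sample values are all hypothetical and do not come from the application. The mixing coefficient is computed as the ratio of samples in a category to the total number of samples, as the text states.

```python
def estimate_component_parameters(vectors, labels):
    """Per-category mixing coefficient, mean vector and covariance matrix.

    vectors: list of equal-length feature vectors (hypothetical data).
    labels:  category of the otherwise-missing feature for each vector.
    """
    total = len(vectors)
    dim = len(vectors[0])
    params = {}
    for label in sorted(set(labels)):
        members = [v for v, lb in zip(vectors, labels) if lb == label]
        count = len(members)
        # Mean vector: per-dimension average over the category's samples.
        mean = [sum(v[d] for v in members) / count for d in range(dim)]
        # Covariance matrix (maximum likelihood form, divided by count).
        cov = [[sum((v[r] - mean[r]) * (v[c] - mean[c]) for v in members) / count
                for c in range(dim)] for r in range(dim)]
        # Mixing coefficient: share of all samples that fall in this category.
        params[label] = {"mixing": count / total, "mean": mean, "cov": cov}
    return params


# Hypothetical complete samples: (courses per month, trading products per month).
vectors = [[100, 10], [80, 20], [50, 30], [60, 50]]
labels = [1, 1, 2, 2]
params = estimate_component_parameters(vectors, labels)
```

With these made-up samples each category holds half of the data, so both mixing coefficients are 0.5.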
- the training set may include a first training set with complete features and a second training set with missing features, and the existing features include the first existing feature, the second existing feature, and the first training set in the first training set.
- the third existing feature in the two training sets the first existing feature corresponds to the third existing feature
- the second existing feature corresponds to the missing feature
- step 202 may specifically include : Estimating an initial mixing coefficient, an initial mean value, and an initial agreement of each elementary Gaussian distribution of the preset mixed Gaussian model according to the first existing feature, the second existing feature, and the maximum likelihood estimation calculation variance.
- the calculation of the maximum likelihood estimation may include:
- in this calculation, the term indexed by i and j can be expressed as the probability that the sample x_j belongs to the i-th Gaussian distribution.
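For reference, the standard maximum likelihood updates for a Gaussian mixture can be written as follows. The symbols $\alpha_i$ (mixing coefficient), $\mu_i$ (mean), $\Sigma_i$ (covariance) and $\gamma_{ij}$ (probability that sample $x_j$ belongs to the $i$-th Gaussian distribution) are assumed notation and are not necessarily those of the application; $l$ is assumed to be the number of samples used at this stage.

```latex
\alpha_i = \frac{1}{l}\sum_{j=1}^{l}\gamma_{ij}, \qquad
\mu_i = \frac{\sum_{j=1}^{l}\gamma_{ij}\,x_j}{\sum_{j=1}^{l}\gamma_{ij}}, \qquad
\Sigma_i = \frac{\sum_{j=1}^{l}\gamma_{ij}\,(x_j-\mu_i)(x_j-\mu_i)^{\mathsf T}}{\sum_{j=1}^{l}\gamma_{ij}}
```

If, for a complete sample, $\gamma_{ij}$ is taken as 1 for the category the sample is labeled with and 0 otherwise (an assumption), $\alpha_i$ reduces to the ratio of the number of samples belonging to the $i$-th Gaussian distribution to the total number of samples, which matches the description of the mixing coefficient above.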
- the missing feature in the second training set may be y_i,
- a Gaussian distribution to which the missing feature belongs is initially estimated.
- step 203 may specifically include: initially estimating the Gaussian distribution to which the missing feature belongs according to the initial mixing coefficient, the initial mean, the initial covariance, the third existing feature, and the expectation calculation.
- specifically, the probability that the missing feature belongs to each meta-Gaussian distribution may be preliminarily estimated based on the initial mixing coefficient, the initial mean, the initial covariance, the third existing feature, and the expectation calculation; then, according to the probability of belonging to each meta-Gaussian distribution, the Gaussian distribution to which the missing feature belongs is initially estimated.
- the expectation calculation may include:
- m can be the total number of Gaussian distributions.
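As a hedged reference, the expectation calculation for a Gaussian mixture is usually written in the following textbook form. The symbols $\gamma_{ij}$, $\alpha_i$, $\mu_i$, $\Sigma_i$ and the dimension $d$ are assumed notation rather than the application's own; $m$ is the total number of Gaussian distributions, as stated above.

```latex
\gamma_{ij} = \frac{\alpha_i\,\mathcal{N}(x_j \mid \mu_i, \Sigma_i)}
                   {\sum_{k=1}^{m}\alpha_k\,\mathcal{N}(x_j \mid \mu_k, \Sigma_k)},
\qquad
\mathcal{N}(x \mid \mu, \Sigma) =
\frac{\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right)}
     {\sqrt{(2\pi)^{d}\,\lvert\Sigma\rvert}}
```

Under this form, the Gaussian distribution to which a missing feature belongs can be taken as the index $i$ with the largest $\gamma_{ij}$ for that sample.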
- in step 204, the distribution parameters of the respective Gaussian distributions can be iteratively updated according to all features, that is, the upper limit value in the maximum likelihood estimation calculation is changed from l to n:
- when the iteratively updated distribution parameters converge, the Gaussian distribution estimated from the converged distribution parameters and the expectation calculation is obtained, and the value corresponding to the missing feature is estimated according to that Gaussian distribution.
- the method may further include: calculating the parameter difference between the distribution parameters updated in two successive iterations; and, if the parameter difference is less than a preset threshold, determining that the updated distribution parameters have converged.
- specifically, the difference between the mixing coefficients updated in two successive iterations may be calculated; if this difference is less than a preset mixing coefficient threshold, the mixing coefficient can be determined to have converged. Alternatively, the difference between the means updated in two successive iterations may be calculated; if this difference is less than a preset mean threshold, the mean can be determined to have converged. Alternatively, the difference between the covariances updated in two successive iterations may be calculated; if this difference is less than a preset covariance threshold, the covariance can be determined to have converged. That is, when any one of the above parameters converges, the iterative maximum likelihood estimation calculation and expectation calculation can be stopped.
- the missing feature is estimated according to the Gaussian distribution to which the missing feature belongs in the last iteration estimation.
- the preset mixing coefficient threshold, the preset mean threshold, and the preset covariance threshold may all be set according to user requirements or according to a system default mode, which is not limited in this embodiment of the present application.
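The stopping rule described above, where iteration stops as soon as any one of the three parameters changes by less than its preset threshold, can be sketched as follows. The function name `has_converged`, the dictionary layout, the use of one scalar per Gaussian distribution, and all numeric values are assumptions made for illustration only.

```python
def has_converged(previous, current, thresholds):
    """Return True as soon as any one parameter group has converged."""
    for name in ("mixing", "mean", "covariance"):
        # Largest absolute change of this parameter across all Gaussian distributions.
        difference = max(abs(new - old)
                         for old, new in zip(previous[name], current[name]))
        if difference < thresholds[name]:
            return True
    return False


# Hypothetical parameters from two successive iterations (two distributions).
previous = {"mixing": [0.40, 0.60], "mean": [90.0, 60.0], "covariance": [100.0, 66.0]}
current = {"mixing": [0.40, 0.60], "mean": [88.0, 61.0], "covariance": [110.0, 70.0]}
thresholds = {"mixing": 1e-3, "mean": 1e-2, "covariance": 1e-2}

stop = has_converged(previous, current, thresholds)
```

Here the mixing coefficients did not change, so the rule reports convergence even though the means and covariances are still moving. A stricter variant that requires every parameter to converge would be an equally simple change, and the text leaves the thresholds themselves to the user or system defaults.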
- the crowd performance prediction model may be a decision tree model, a logistic regression model, or the like, for determining a crowd performance level. Specifically, if the crowd performance prediction model is a decision tree model, a decision tree algorithm can be used to train on the crowd performance prediction training set after the missing features are filled, to obtain a decision tree model. If the crowd performance prediction model is a logistic regression model, a logistic regression algorithm can be used to train on the crowd performance prediction training set after the missing features are filled, to obtain a logistic regression model.
- for example, the training set for predicting the performance characteristics of the crowd has the following training samples:
- training sample 1 (average number of monthly learning courses 100, Internet transaction level 1 within six months)
- training sample 2 (average number of monthly learning courses 50, Internet transaction level 2 within six months)
- training sample 3 (average number of monthly learning courses 60, Internet transaction level 2 within six months)
- training sample 4 (average number of monthly learning courses 80, Internet transaction level within six months missing)
- training sample 5 (average number of monthly learning courses 70, Internet transaction level within six months missing)
- the missing feature is the "Internet transaction level within six months" in training samples 4 and 5.
- There are two categories of Internet transaction levels within six months respectively: within
- the "average number of learning courses per month" and "Internet transaction level within six months" in training samples 1, 2, and 3 can be substituted into the maximum likelihood estimation calculation to obtain preliminary estimates of the mixing coefficient, mean, and covariance of each Gaussian distribution. The "average number of learning courses per month" of training samples 4 and 5 is then substituted into the expectation calculation to calculate, for example, the probability that the "Internet transaction level within six months" in training sample 4 is level 1 and the probability that it is level 2.
- from these probabilities, the category of the "Internet transaction level within six months" in training sample 4 can be preliminarily determined as Internet transaction level 1 within six months.
- similarly, the category of the "Internet transaction level within six months" in training sample 5 can be preliminarily calculated as Internet transaction level 1 within six months.
- the "average number of learning courses per month" and "Internet transaction level within six months" of training samples 1 to 5 are then substituted into the maximum likelihood estimation calculation to iteratively update the mixing coefficient, mean, and covariance, and the categories to which the "Internet transaction level within six months" in training samples 4 and 5 belong are updated according to the updated parameters and the expectation calculation. When the mixing coefficient, mean, or covariance converges, the category estimated in the last iteration is taken as the final estimate. For example, it is finally determined that the "Internet transaction level within six months" in training sample 4 belongs to Internet transaction level 1 within six months, and that in training sample 5 belongs to Internet transaction level 2 within six months.
- after filling, the training set obtained can be: training sample 1 (average number of monthly learning courses 100, Internet transaction level 1 within six months), training sample 2 (average number of monthly learning courses 50, Internet transaction level 2 within six months), training sample 3 (average number of monthly learning courses 60, Internet transaction level 2 within six months), training sample 4 (average number of monthly learning courses 80, Internet transaction level 1 within six months), and training sample 5 (average number of monthly learning courses 70, Internet transaction level 2 within six months).
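The worked example above can be reproduced with a short sketch. It is not the application's implementation: it assumes a one-dimensional feature, uses a hard-assignment variant in which each incomplete sample is given its most probable category at every iteration, applies a variance floor (`var_floor`, an assumption needed because level 1 initially has a single sample), and stops only when every parameter changes by less than `tol`, which is stricter than the any-parameter rule described earlier. All function and parameter names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_components(xs, levels, categories, var_floor):
    """Maximum likelihood step: (mixing coefficient, mean, variance) per category."""
    params = {}
    for c in categories:
        members = [x for x, lv in zip(xs, levels) if lv == c]
        mean = sum(members) / len(members)
        var = sum((x - mean) ** 2 for x in members) / len(members)
        params[c] = (len(members) / len(xs), mean, max(var, var_floor))
    return params


def assign_level(x, params):
    """Expectation step (hard variant): the most probable category for x."""
    return max(params, key=lambda c: params[c][0] * gaussian_pdf(x, params[c][1], params[c][2]))


def fill_missing_levels(complete, incomplete, var_floor=25.0, tol=1e-6, max_iter=50):
    """complete: list of (courses, level); incomplete: list of courses only."""
    xs = [x for x, _ in complete]
    levels = [lv for _, lv in complete]
    categories = sorted(set(levels))
    # Initial parameters come from the complete samples only.
    params = fit_components(xs, levels, categories, var_floor)
    guesses = [assign_level(x, params) for x in incomplete]
    for _ in range(max_iter):
        # Re-estimate the parameters from all samples, using the current guesses.
        new_params = fit_components(xs + incomplete, levels + guesses, categories, var_floor)
        diff = max(abs(a - b) for c in categories for a, b in zip(new_params[c], params[c]))
        params = new_params
        guesses = [assign_level(x, params) for x in incomplete]
        if diff < tol:
            break
    return guesses


complete = [(100, 1), (50, 2), (60, 2)]   # training samples 1 to 3
incomplete = [80, 70]                     # training samples 4 and 5
filled = fill_missing_levels(complete, incomplete)
print(filled)  # [1, 2]
```

Under these assumptions the sketch assigns level 1 to training sample 4 and level 2 to training sample 5, the same final categories as in the example above. The intermediate estimates need not match the application's, since the variance floor and the hard assignment are choices made here for illustration.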
- compared with the conventional filling methods currently used to fill missing features in the crowd performance prediction training set, such as special value interpolation and hot card interpolation, this other method for processing missing features in the prediction of crowd performance characteristics works as follows.
- the embodiment of the present application can obtain the existing features in the crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features, the preset mixed Gaussian model being composed of a multivariate Gaussian distribution corresponding to the missing features; the missing feature is then estimated according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model.
- the estimated missing features can be filled into the crowd performance prediction training set, so that the missing data is filled based on the preset mixed Gaussian model corresponding to the missing features.
- because the preset mixed Gaussian model is composed of a multivariate Gaussian distribution corresponding to the missing features, the filled missing features reflect their own distribution and remain associated with their own samples. This avoids bias in filling the missing features and bias in the association between the prediction results of the trained prediction model and the corresponding features, which can improve the prediction accuracy of the trained prediction model.
- an embodiment of the present application provides a missing feature processing device in crowd performance prediction.
- the device includes: an obtaining unit 31, an estimation unit 32, and a filling unit 33.
- the obtaining unit 31 may be configured to obtain a preset mixed Gaussian model corresponding to existing features and missing features in a crowd performance prediction training set, where the preset mixed Gaussian model is composed of a multivariate Gaussian distribution corresponding to the missing features.
- the obtaining unit 31 is a main functional module for obtaining a preset mixed Gaussian model corresponding to existing features and missing features in a crowd performance prediction training set in the device.
- the estimation unit 32 may be configured to estimate a value corresponding to the missing feature according to the existing feature and the maximum expectation algorithm of the preset mixed Gaussian model.
- the estimation unit 32 is a main function module for estimating the missing feature according to the maximum expectation algorithm of the existing feature and the preset mixed Gaussian model, and is also a core module.
- the filling unit 33 may be configured to fill a value corresponding to the missing feature into the crowd performance prediction training set.
- the filling unit 33 is a main functional module in the device that fills the values corresponding to the missing features into the crowd performance prediction training set.
- the maximum expectation algorithm may include a maximum likelihood estimation calculation and an expectation calculation.
- the estimation unit 32 may include an estimation module 321, an update module 322, and an acquisition module 323, as shown in FIG. 4.
- the estimation module 321 may be configured to estimate an initial distribution parameter of each meta-Gaussian distribution of the preset mixed Gaussian model according to the existing features and the maximum likelihood estimation calculation.
- the estimation module 321 may be further configured to initially estimate a Gaussian distribution to which the missing feature belongs according to the initial distribution parameter and the expectation calculation.
- the updating module 322 may be configured to iteratively update the distribution parameters of the respective Gaussian distributions according to the existing features, the Gaussian distribution to which the missing feature belongs, and the maximum likelihood estimation calculation, and to iteratively update the Gaussian distribution to which the missing feature belongs.
- the obtaining module 323 may be configured to obtain a Gaussian distribution estimated according to the converged distribution parameters and the expectation when the iteratively updated distribution parameters converge.
- the estimation module 321 may be further configured to estimate a value corresponding to the missing feature according to a last estimated Gaussian distribution.
- the estimation module 321 may be specifically configured to initially estimate the probability that the missing feature belongs to each meta-Gaussian distribution according to the initial distribution parameter and the expectation calculation, and then, according to the probability of belonging to each meta-Gaussian distribution, to initially estimate the Gaussian distribution to which the missing feature belongs.
- the training set includes a first training set with complete features and a second training set with missing features.
- the existing features include the first existing feature and the second existing feature in the first training set, and
- the third existing feature in the second training set; the first existing feature corresponds to the third existing feature, and
- the second existing feature corresponds to the missing feature.
- the estimation module 321 may be specifically configured to estimate an initial mixing coefficient, an initial mean, and an initial covariance of each meta-Gaussian distribution of the preset mixed Gaussian model according to the first existing feature, the second existing feature, and the maximum likelihood estimation calculation.
- the estimation module 321 may also be specifically configured to initially estimate the Gaussian distribution to which the missing feature belongs according to the initial mixing coefficient, the initial mean, the initial covariance, the third existing feature, and the expectation calculation.
- the estimation unit 32 may further include a calculation module 324 and a determination module 325.
- the calculation module 324 may be configured to calculate a parameter difference of the distribution parameter updated by two iterations before and after.
- the determining module 325 may be configured to determine that the updated distribution parameter converges if the parameter difference is less than a preset threshold.
- the apparatus may further include a determining unit 34 and a constructing unit 35.
- the determining unit 34 may be configured to determine a multivariate Gaussian distribution corresponding to the missing feature.
- the determining unit is a main functional module for determining a multivariate Gaussian distribution corresponding to the missing feature in the device.
- the constructing unit 35 may be configured to construct the preset mixed Gaussian model according to the multivariate Gaussian distribution.
- the constructing unit 35 is a main functional module for constructing the preset mixed Gaussian model according to the multivariate Gaussian distribution in the device.
- the device may further include a training unit 36.
- the training unit 36 may be configured to train a crowd performance prediction model according to a crowd performance prediction training set after filling in missing features.
- the training unit 36 is a main functional module for training a crowd performance prediction model in the device according to a crowd performance prediction training set after filling in missing features.
- an embodiment of the present application further provides a computer non-volatile readable storage medium that stores computer-readable instructions.
- when the computer-readable instructions are executed by a processor, the following steps are implemented: obtaining the existing features in a crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features, the preset mixed Gaussian model consisting of a multivariate Gaussian distribution corresponding to the missing features; estimating a value corresponding to the missing feature according to the existing features and a maximum expectation algorithm of the preset mixed Gaussian model; and filling the value corresponding to the missing feature into the crowd performance prediction training set.
- the embodiment of the present application further provides a physical structure diagram of a computer device.
- the computer device includes a processor 41, a memory 42, and computer-readable instructions stored on the memory 42 and executable on the processor 41, where the memory 42 and the processor 41 are both arranged on a bus 43. When the processor 41 executes the computer-readable instructions, the following steps are implemented: obtaining the existing features in the crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features, the preset mixed Gaussian model being composed of a multivariate Gaussian distribution corresponding to the missing features; estimating the value corresponding to the missing feature according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model; and filling the value corresponding to the missing feature into the crowd performance prediction training set.
- through the technical solution of the present application, the existing features in a crowd performance prediction training set and a preset mixed Gaussian model corresponding to the missing features can be obtained, the preset mixed Gaussian model being composed of a multivariate Gaussian distribution corresponding to the missing features;
- a value corresponding to the missing feature is estimated according to the existing features and the maximum expectation algorithm of the preset mixed Gaussian model.
- the estimated values corresponding to the missing features can be filled into the crowd performance prediction training set, so that the missing data is filled based on the preset mixed Gaussian model corresponding to the missing features.
- because the preset mixed Gaussian model is composed of a multivariate Gaussian distribution corresponding to the missing feature, the filled missing feature reflects its own distribution and remains associated with its own sample. This avoids bias in filling the missing features and bias in the association between the prediction results of the trained prediction model and the corresponding features, which can improve the prediction accuracy of the trained prediction model.
- modules or steps of the present application may be implemented by a general-purpose computing device, and they may be concentrated on a single computing device or distributed in a network composed of multiple computing devices.
- they can be implemented with computer-readable instruction code executable by the computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from the one given here,
- or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. As such, this application is not limited to any particular combination of hardware and software.
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Development Economics (AREA)
- Educational Administration (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Data Mining & Analysis (AREA)
- Strategic Management (AREA)
- General Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Operations Research (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Tourism & Hospitality (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Image Analysis (AREA)
Abstract
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on August 13, 2018, with application number 2018109185213 and titled "Method and Apparatus for Processing Missing Features in Crowd Performance Feature Prediction", the entire contents of which are incorporated herein by reference.
In recent years, many industries have begun to pay attention to crowd performance, and in particular to identifying crowd performance levels; identifying a target group and rewarding it can greatly improve a company's overall results. At present, crowd performance levels are usually predicted by training a crowd performance prediction model, and the larger the training set, the higher the prediction accuracy of the trained model. In practice, however, few complete training sets are available for training a crowd performance level prediction model, and in most cases the training set has missing features. To improve the prediction accuracy of the model, the missing features therefore usually need to be filled in or completed.
At present, missing features in a crowd performance prediction training set are usually filled with traditional methods such as special-value imputation and hot-deck imputation. However, the effectiveness of these methods is limited by how the feature is distributed, and missing features are usually missing not at random: whether such a feature is missing is related to the feature itself. For example, in an income survey, neither high-income nor low-income respondents fill in a specific income, so whether the income level is missing is related to the income level itself. Filling missing features with the above methods biases the filled values, which in turn lowers the prediction accuracy of the trained prediction model.
Summary of the Invention
This application provides a method and an apparatus for processing missing features in crowd performance feature prediction. Its main purpose is to avoid bias in filling the missing features and bias in the association between the trained model's predictions and the corresponding features, and thereby to improve the prediction accuracy of the trained prediction model.
According to a first aspect of the present application, a method for processing missing features in crowd performance prediction is provided, including:
obtaining existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to missing features, the preset Gaussian mixture model being composed of multiple Gaussian components corresponding to the missing features;
estimating values corresponding to the missing features according to the existing features and the expectation-maximization (EM) algorithm of the preset Gaussian mixture model;
filling the values corresponding to the missing features into the crowd performance prediction training set.
According to a second aspect of the present application, an apparatus for processing missing features in crowd performance prediction is provided, including:
an obtaining unit, configured to obtain existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to missing features, the preset Gaussian mixture model being composed of multiple Gaussian components corresponding to the missing features;
an estimation unit, configured to estimate values corresponding to the missing features according to the existing features and the expectation-maximization (EM) algorithm of the preset Gaussian mixture model;
a filling unit, configured to fill the values corresponding to the missing features into the crowd performance prediction training set.
According to a third aspect of the present application, a non-volatile computer-readable storage medium is provided, on which computer-readable instructions are stored. When executed by a processor, the computer-readable instructions implement the following steps:
obtaining existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to missing features, the preset Gaussian mixture model being composed of multiple Gaussian components corresponding to the missing features;
estimating values corresponding to the missing features according to the existing features and the expectation-maximization (EM) algorithm of the preset Gaussian mixture model;
filling the values corresponding to the missing features into the crowd performance prediction training set.
According to a fourth aspect of the present application, a computer device is provided, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps:
obtaining existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to missing features, the preset Gaussian mixture model being composed of multiple Gaussian components corresponding to the missing features;
estimating values corresponding to the missing features according to the existing features and the expectation-maximization (EM) algorithm of the preset Gaussian mixture model;
filling the values corresponding to the missing features into the crowd performance prediction training set.
Compared with the traditional filling methods currently in common use, such as special-value imputation and hot-deck imputation, the method and apparatus provided by this application obtain the existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to the missing features, the model being composed of multiple Gaussian components corresponding to the missing features, and estimate the values corresponding to the missing features according to the existing features and the expectation-maximization (EM) algorithm of that model. The estimated values are then filled into the crowd performance prediction training set, so that the missing data is filled on the basis of the preset Gaussian mixture model corresponding to the missing features. Because that model is composed of multiple Gaussian components corresponding to the missing features, the filled values reflect the feature's distribution and its dependence on the feature itself. This avoids bias in filling the missing features and bias in the association between the trained model's predictions and the corresponding features, and thereby improves the prediction accuracy of the trained prediction model.
The drawings described here are provided for a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a method for processing missing features in crowd performance prediction according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for processing missing features in crowd performance prediction according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for processing missing features in crowd performance prediction according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of another apparatus for processing missing features in crowd performance prediction according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the physical structure of a computer device according to an embodiment of the present application.
The present application is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another.
As noted in the background, missing features in a crowd performance prediction training set are currently filled with traditional methods such as special-value imputation and hot-deck imputation. However, the effectiveness of these methods is limited by how the feature is distributed, and missing features are usually missing not at random: whether such a feature is missing is related to the feature itself. For example, in an income survey, neither high-income nor low-income respondents fill in a specific income, so whether the income level is missing is related to the income level itself. Filling missing features with the above methods biases the filled values, which in turn lowers the prediction accuracy of the trained prediction model.
To solve the above problem, an embodiment of the present application provides a method for processing missing features in crowd performance prediction. As shown in FIG. 1, the method includes:
101. Obtain existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to missing features.
The preset Gaussian mixture model may be composed of multiple Gaussian components corresponding to the missing feature. The training set may include crowd performance features and crowd performance levels, and the missing feature in this embodiment may be one of the crowd performance features. The crowd performance features may include, but are not limited to: average number of courses studied per month, longitude and latitude of the work address, number of Internet transaction products in a single month, and Internet transaction level within half a year. For example, if there are 100 training samples and 40 of them lack the Internet transaction level within half a year, then "Internet transaction level within half a year" in those 40 samples is identified as the missing feature. The existing features are then "average number of courses studied per month, longitude and latitude of the work address, number of Internet transaction products in a single month, Internet transaction level within half a year" in the other 60 training samples, together with "average number of courses studied per month, longitude and latitude of the work address, number of Internet transaction products in a single month" in the 40 training samples.
It should be noted that the probability distribution of the preset Gaussian mixture model can be expressed as follows:
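A standard way to write this distribution, using the symbols that appear in the later steps (m Gaussian components with mixing coefficients π_i, means μ_i and covariances Σ_i), is sketched below. It is a reconstruction under those definitions and should be read as an assumption about the formula's form, not as the application's verbatim expression.

```latex
p(x) = \sum_{i=1}^{m} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),
\qquad \pi_i \ge 0, \quad \sum_{i=1}^{m} \pi_i = 1
```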
If the missing feature in this embodiment is "Internet transaction level within half a year", the Gaussian components may correspond to the categories of that level; according to the above formula, there may be m such categories.
102. Estimate the values corresponding to the missing features according to the existing features and the expectation-maximization (EM) algorithm of the preset Gaussian mixture model.
The EM algorithm may be an iterative algorithm that includes a maximum likelihood estimation calculation and an expectation calculation, which may be performed alternately over the iterations. In this embodiment, the distribution parameters of each Gaussian component of the preset Gaussian mixture model may be calculated iteratively from the existing features using the maximum likelihood estimation calculation. The expectation calculation and the iteratively calculated distribution parameters are then used to calculate the probability that the missing feature belongs to each Gaussian component, so that the Gaussian component to which the missing feature belongs can be estimated and the value corresponding to the missing feature determined.
For example, if 60 of 100 training samples have complete features and 40 have a missing feature, the missing feature in the 40 samples can be estimated from the existing features in the 60 samples and the preset Gaussian mixture model. Once the values corresponding to the missing feature have been estimated, filling them into the crowd performance prediction training set yields 100 training samples with complete features.
103. Fill the values corresponding to the missing features into the crowd performance prediction training set.
It should be noted that the value corresponding to the missing feature may be filled into the corresponding position of the crowd performance prediction training set according to the positions of the existing features. For example, training sample 3 is (average number of courses studied per month: 80; longitude and latitude of the work address: (123.435, 41.819); number of Internet transaction products in a single month: 100; ), in which the missing feature is "Internet transaction level within half a year". When the value estimated for that feature in training sample 3 is Internet transaction level 2, it can be filled into training sample 3 alongside the three existing features, giving (average number of courses studied per month: 80; longitude and latitude of the work address: (123.435, 41.819); number of Internet transaction products in a single month: 100; Internet transaction level within half a year: 2).
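The filling step can be illustrated with a small Python sketch. The dictionary keys, the `fill_missing` helper and the use of `None` to mark an empty slot are all hypothetical choices made for this illustration; the application does not prescribe a data layout.

```python
from typing import Dict, List

# Hypothetical field name, used only for this illustration.
MISSING_FIELD = "internet_transaction_level_half_year"


def fill_missing(samples: List[Dict[str, object]],
                 estimates: Dict[int, object],
                 field: str = MISSING_FIELD) -> List[Dict[str, object]]:
    """Return copies of the samples with estimated values written into empty slots.

    `estimates` maps a sample's index to the value estimated for `field`.
    Values that are already present are never overwritten.
    """
    filled = []
    for index, sample in enumerate(samples):
        row = dict(sample)  # copy so the input samples are left untouched
        if row.get(field) is None and index in estimates:
            row[field] = estimates[index]
        filled.append(row)
    return filled


# Training sample 3 from the text, with the level still missing.
sample_3 = {
    "avg_courses_per_month": 80,
    "work_address_lon_lat": (123.435, 41.819),
    "internet_products_single_month": 100,
    MISSING_FIELD: None,
}
completed = fill_missing([sample_3], {0: 2})
```

Returning copies keeps the incomplete training set available for comparison after filling.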
Compared with the traditional filling methods currently in common use, such as special-value imputation and hot-deck imputation, the method provided by this embodiment obtains the existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to the missing features, the model being composed of multiple Gaussian components corresponding to the missing features, and estimates the missing features according to the existing features and the expectation-maximization (EM) algorithm of that model. The estimated missing features are then filled into the crowd performance prediction training set, so that the missing data is filled on the basis of the preset Gaussian mixture model corresponding to the missing features. Because that model is composed of multiple Gaussian components corresponding to the missing features, the filled values reflect the feature's distribution and its dependence on the feature itself. This avoids bias in filling the missing features and bias in the association between the trained model's predictions and the corresponding features, and thereby improves the prediction accuracy of the trained prediction model.
Further, to better explain the above process of handling missing features in crowd performance prediction, and as a refinement and extension of the foregoing embodiment, an embodiment of the present application provides another method for processing missing features in crowd performance prediction. As shown in FIG. 2, the method includes:
201. Obtain existing features in a crowd performance prediction training set and a preset Gaussian mixture model corresponding to missing features.
The preset Gaussian mixture model may be composed of multiple Gaussian components corresponding to the missing feature. In this embodiment, the existing features may take the form of feature vectors. If the three features "average number of courses studied per month, longitude and latitude of the work address, number of Internet transaction products in a single month" uniquely determine a category of the Internet transaction level within half a year, they can be expressed together as a feature vector.
In this embodiment, to obtain the preset Gaussian mixture model, the method may further include: determining the multiple Gaussian components corresponding to the missing feature; and constructing the preset Gaussian mixture model from those Gaussian components.
For example, the missing feature is "Internet transaction level within half a year". The multiple categories of that level are determined, and each category may correspond to one Gaussian component. The features "average number of courses studied per month, longitude and latitude of the work address, number of Internet transaction products in a single month" can be taken as the observation sample vector; that is, the multi-component Gaussian distribution of the missing feature is observed through the observation sample vector, with the Internet transaction level within half a year divided into multiple categories. The preset Gaussian mixture model corresponding to the Internet transaction level within half a year is then constructed from the weight of the category to which an observation sample vector belongs and the corresponding mean vector and covariance matrix.
202. Estimate the initial distribution parameters of each Gaussian component of the preset Gaussian mixture model according to the existing features and the maximum likelihood estimation calculation of the EM algorithm of the preset Gaussian mixture model.
The distribution parameters may include a mixing coefficient, a mean and a covariance. If the existing features take the form of feature vectors, the distribution parameters may include a mixing coefficient, a mean vector and a covariance matrix, where the mixing coefficient may be the ratio of the number of samples belonging to the corresponding Gaussian component to the total number of samples. The training set may include a first training set with complete features and a second training set with missing features. The existing features include a first existing feature and a second existing feature in the first training set and a third existing feature in the second training set; the first existing feature corresponds to the third existing feature, and the second existing feature corresponds to the missing feature. Step 202 may specifically include: estimating the initial mixing coefficient, initial mean and initial covariance of each Gaussian component of the preset Gaussian mixture model according to the first existing feature, the second existing feature and the maximum likelihood estimation calculation.
It should be noted that the maximum likelihood estimation calculation may include:
calculating the mixing coefficient of the i-th Gaussian component:
calculating the mean of the i-th Gaussian component:
calculating the covariance of the i-th Gaussian component:
where γ_ij denotes the probability that sample x_j belongs to the i-th Gaussian component.
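The three formulas are not legible in this copy. The standard maximum likelihood estimates for a Gaussian mixture, written with the symbols defined here and summed over the l fully observed samples, take the form below. This is a reconstruction under two statements in the text: that the mixing coefficient is the share of samples belonging to a component, and that the upper limit of the sums is l at this stage. It is an assumed form, not a verbatim copy of the application's formulas.

```latex
\pi_i = \frac{1}{l} \sum_{j=1}^{l} \gamma_{ij}, \qquad
\mu_i = \frac{\sum_{j=1}^{l} \gamma_{ij}\, x_j}{\sum_{j=1}^{l} \gamma_{ij}}, \qquad
\Sigma_i = \frac{\sum_{j=1}^{l} \gamma_{ij}\,(x_j - \mu_i)(x_j - \mu_i)^{\mathsf{T}}}{\sum_{j=1}^{l} \gamma_{ij}}
```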
For example, let the training set be D = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l), x_{l+1}, x_{l+2}, …, x_n}. The first training set, with complete features, may be D_1 = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l)}, and the second training set, with missing features, may be D_2 = {x_{l+1}, x_{l+2}, …, x_n}. The first existing feature in the first training set may be x_j, j = 1, …, l, and the second existing feature may be y_j, j = 1, …, l; the third existing feature in the second training set may be x_j, j = l+1, …, n, and the missing feature in the second training set may be y_j, j = l+1, …, n. Specifically, x_j may correspond to (average number of courses studied per month, longitude and latitude of the work address, number of Internet transaction products in a single month), and y_j may correspond to the probabilities γ_ij that x_j belongs to the i-th category of "Internet transaction level within half a year": the probability is 1 for the category to which x_j belongs and 0 for all other categories. The first existing feature x_j and the second existing feature γ_ij can therefore each be substituted into the maximum likelihood estimation calculation to obtain the initial mixing coefficient, initial mean and initial covariance of each Gaussian component.
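As a minimal sketch of this initial estimate, the Python function below computes the mixing coefficient, mean vector and covariance matrix of each component from fully observed samples whose γ_ij are one-hot. The function name `initial_parameters` and the encoding of categories as integers 0 to m-1 are assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def initial_parameters(xs: Sequence[Vector],
                       labels: Sequence[int],
                       m: int) -> Tuple[List[float], List[Vector], List[Matrix]]:
    """Maximum likelihood estimates from the l complete samples.

    labels[j] is the index of the category that x_j belongs to, which is
    the same as gamma_ij being 1 for that category and 0 for the others.
    """
    l, d = len(xs), len(xs[0])
    pis, mus, sigmas = [], [], []
    for i in range(m):
        members = [x for x, y in zip(xs, labels) if y == i]
        n_i = len(members)  # equals the sum of gamma_ij over j
        if n_i == 0:
            raise ValueError(f"category {i} has no complete sample")
        mu = [sum(x[k] for x in members) / n_i for k in range(d)]
        sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in members) / n_i
                  for b in range(d)]
                 for a in range(d)]
        pis.append(n_i / l)  # share of samples in this category
        mus.append(mu)
        sigmas.append(sigma)
    return pis, mus, sigmas


# Three complete samples with two-dimensional features and two categories.
pis, mus, sigmas = initial_parameters(
    [[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]], [0, 0, 1], m=2)
```

A category with a single sample gets an all-zero covariance matrix, so a practical implementation would need more data per category or some regularization before evaluating densities.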
203. Make a preliminary estimate of the Gaussian component to which the missing feature belongs according to the initial distribution parameters and the expectation calculation of the EM algorithm.
In this embodiment, corresponding to step 202, step 203 may specifically include: making a preliminary estimate of the Gaussian component to which the missing feature belongs according to the initial mixing coefficient, the initial mean, the initial covariance, the third existing feature and the expectation calculation. Specifically, the probability that the missing feature belongs to each Gaussian component may first be estimated from the initial mixing coefficient, initial mean, initial covariance, third existing feature and expectation calculation; the Gaussian component to which the missing feature belongs is then estimated from those probabilities.
It should be noted that the expectation calculation may include:
where m may be the total number of Gaussian components.
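The formula itself is not legible in this copy. For a Gaussian mixture with m components, the expectation step is conventionally the posterior probability below, where N(x | μ, Σ) denotes the Gaussian density. It is given as a reconstruction consistent with the symbols γ_ij, π_i, μ_i, Σ_i and m used in the surrounding text, and is an assumed form.

```latex
\gamma_{ij} = \frac{\pi_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}
                   {\sum_{k=1}^{m} \pi_k \, \mathcal{N}(x_j \mid \mu_k, \Sigma_k)}
```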
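For example, continuing the example from step 202, once the initial parameters π_i, μ_i and Σ_i have been calculated, x_j (j = l+1, …, n), π_i, μ_i and Σ_i can be substituted into the above formula to calculate the probability that "Internet transaction level within half a year" belongs to each Gaussian component, that is, the probability γ_ij of belonging to each category of "Internet transaction level within half a year". The Gaussian component with the highest probability can then be taken as the Gaussian component of "Internet transaction level within half a year".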
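The expectation calculation and the choice of the most probable component can be sketched in Python as follows. To stay within the standard library, the sketch is simplified to a single scalar feature, so each component has a mean and a standard deviation instead of a mean vector and covariance matrix. The function names are hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def responsibilities(x: float,
                     pis: Sequence[float],
                     mus: Sequence[float],
                     sigmas: Sequence[float]) -> List[float]:
    """Posterior probability that x belongs to each component (one-dimensional case)."""
    weighted = [pi * NormalDist(mu, sigma).pdf(x)
                for pi, mu, sigma in zip(pis, mus, sigmas)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("x is too far from every component to be scored")
    return [w / total for w in weighted]


def most_likely_component(x: float,
                          pis: Sequence[float],
                          mus: Sequence[float],
                          sigmas: Sequence[float]) -> int:
    """Index of the component with the highest posterior probability."""
    gamma = responsibilities(x, pis, mus, sigmas)
    return max(range(len(gamma)), key=gamma.__getitem__)


# Two equally weighted components centred at 0 and 10.
gamma_mid = responsibilities(5.0, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
nearest = most_likely_component(1.0, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

Extending the sketch to feature vectors would replace `NormalDist.pdf` with a multivariate Gaussian density, which needs the inverse and determinant of each covariance matrix.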
204. Iteratively update the distribution parameters of each Gaussian component, and iteratively estimate the Gaussian component to which the missing feature belongs, according to the existing features, the Gaussian component to which the missing feature belongs, and the maximum likelihood estimation calculation.
It should be noted that step 204 allows the distribution parameters of each Gaussian component to be updated iteratively from all features; that is, the upper limit of the sums in the maximum likelihood estimation calculation changes from l to n:
calculating the mixing coefficient of the i-th Gaussian component:
calculating the mean of the i-th Gaussian component:
calculating the covariance of the i-th Gaussian component:
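With the upper limit raised from l to n as the text states, the standard update formulas take the form below. They are given as a reconstruction under that statement, because the formulas themselves are not legible in this copy.

```latex
\pi_i = \frac{1}{n} \sum_{j=1}^{n} \gamma_{ij}, \qquad
\mu_i = \frac{\sum_{j=1}^{n} \gamma_{ij}\, x_j}{\sum_{j=1}^{n} \gamma_{ij}}, \qquad
\Sigma_i = \frac{\sum_{j=1}^{n} \gamma_{ij}\,(x_j - \mu_i)(x_j - \mu_i)^{\mathsf{T}}}{\sum_{j=1}^{n} \gamma_{ij}}
```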
205. When the iteratively updated distribution parameters converge, obtain the Gaussian component estimated from the converged distribution parameters and the expectation calculation, and estimate the value corresponding to the missing feature from that Gaussian component.
In this embodiment, to determine whether the updated distribution parameters have converged, the method may further include: calculating the difference between the distribution parameters from two successive iterations; and, if the difference is smaller than a preset threshold, determining that the updated distribution parameters have converged.
Specifically, if the distribution parameters are the mixing coefficient, mean and covariance, the difference between the mixing coefficients from two successive iterations may be calculated; if it is smaller than a preset mixing coefficient threshold, the mixing coefficient is determined to have converged. Alternatively, the difference between the means from two successive iterations may be calculated; if it is smaller than a preset mean threshold, the mean is determined to have converged. Alternatively, the difference between the covariances from two successive iterations may be calculated; if it is smaller than a preset covariance threshold, the covariance is determined to have converged. That is, when any one of these parameters converges, the iteration of the maximum likelihood estimation calculation and the expectation calculation can be stopped, and the missing feature is then estimated from the Gaussian component to which it was assigned in the last iteration. The preset mixing coefficient threshold, preset mean threshold and preset covariance threshold may each be set according to user requirements or according to a system default mode, which is not limited in this embodiment.
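The stopping rule described here, where iteration ends as soon as any one parameter group changes by less than its own threshold, can be sketched as below. Measuring the "difference" of a vector or matrix as the largest absolute element-wise change is an assumption of this sketch, as are the function names and the flat-list representation of each parameter group.

```python
from typing import Dict, Sequence


def max_abs_change(previous: Sequence[float], current: Sequence[float]) -> float:
    """Largest absolute element-wise change between two iterations."""
    return max(abs(c - p) for p, c in zip(previous, current))


def should_stop(previous: Dict[str, Sequence[float]],
                current: Dict[str, Sequence[float]],
                thresholds: Dict[str, float]) -> bool:
    """True as soon as any one parameter group has changed by less than its threshold."""
    return any(
        max_abs_change(previous[name], current[name]) < thresholds[name]
        for name in thresholds
    )


thresholds = {"mixing": 1e-3, "mean": 1e-3, "covariance": 1e-3}
before = {"mixing": [0.5, 0.5], "mean": [1.0, 11.0], "covariance": [0.7, 0.7]}
after_stable_mixing = {"mixing": [0.5, 0.5], "mean": [1.2, 10.9], "covariance": [0.9, 0.6]}
after_all_moving = {"mixing": [0.6, 0.4], "mean": [1.2, 10.9], "covariance": [0.9, 0.6]}
```

Replacing `any` with `all` gives the stricter rule that every parameter group must have settled, which is the more common choice in EM implementations.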
206. Fill the values corresponding to the missing features into the crowd performance prediction training set, and train a crowd performance prediction model on the training set after the values have been filled in.
It should be noted that the position at which each value corresponding to a missing feature is filled into the crowd performance prediction training set may be determined from the positions of the existing features. If the training set before filling is $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l), x_{l+1}, x_{l+2}, \ldots, x_n\}$, then $y_{l+1}, y_{l+2}, \ldots, y_n$ may be filled in at the positions of $x_{l+1}, x_{l+2}, \ldots, x_n$ respectively, and the training set after filling may be $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l), (x_{l+1}, y_{l+1}), (x_{l+2}, y_{l+2}), \ldots, (x_n, y_n)\}$. In addition, in the embodiments of the present application, the crowd performance prediction model may be a decision tree model, a logistic regression model, or the like, used to determine a crowd performance level. Specifically, if the crowd performance prediction model is a decision tree model, a decision tree algorithm may be used to train on the filled training set to obtain the decision tree model. If the crowd performance prediction model is a logistic regression model, a logistic regression algorithm may be used to train on the filled training set to obtain the logistic regression model.
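The filling step can be sketched as below. The representation is an assumption made for illustration: each sample is an `(x, y)` pair with `y` set to `None` where the feature is missing, and the estimated values arrive as a dict keyed by sample index. The function name `fill_missing` is hypothetical, and training the downstream decision tree or logistic regression model is not shown.

```python
def fill_missing(training_set, estimates):
    """Return a new training set in which every missing y is replaced by its estimate.

    training_set: list of (x, y) pairs; y is None where the feature is missing.
    estimates: dict mapping the index of a sample with a missing y to the
               estimated value for that sample.
    """
    filled = []
    for index, (x, y) in enumerate(training_set):
        if y is None:
            # The fill position is given by the position of the existing feature x.
            y = estimates[index]
        filled.append((x, y))
    return filled


# Usage with made-up values: the last two samples lack y.
before = [(0.2, "a"), (0.7, "b"), (0.4, None), (0.9, None)]
after = fill_missing(before, {2: "a", 3: "b"})
```

Returning a new list rather than modifying the input keeps the original, incomplete training set available for comparison.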
To aid understanding of the embodiments of the present application, the following application scenario is provided as a non-limiting example. Suppose the crowd performance feature prediction training set contains training sample 1 (average number of courses studied per month: 100; Internet transaction level within the past half year: 1), training sample 2 (courses per month: 50; transaction level: 2), training sample 3 (courses per month: 60; transaction level: 2), training sample 4 (courses per month: 80; transaction level: missing), and training sample 5 (courses per month: 70; transaction level: missing). The missing feature is then the "Internet transaction level within the past half year" of training samples 4 and 5. This feature has two categories: level 1 and level 2.
First, the "average number of courses studied per month" and the "Internet transaction level within the past half year" of training samples 1, 2, and 3 may be substituted into the maximum likelihood estimation calculation to obtain initial estimates of $\pi_i$, $\mu_i$, and $\Sigma_i$. Then the "average number of courses studied per month" of training samples 4 and 5 may be substituted into the expectation calculation to compute, for training sample 4, the probability that its transaction level is level 1 and the probability that it is level 2. If the probability of level 1 is greater than the probability of level 2, the transaction level of training sample 4 is determined to be level 1. Similarly, the transaction level of training sample 5 may be computed to be, for example, level 1.
Then, the "average number of courses studied per month" and the "Internet transaction level within the past half year" of training samples 1 to 5 may be substituted into the maximum likelihood estimation calculation to iteratively update $\pi_i$, $\mu_i$, and $\Sigma_i$, and the transaction levels of training samples 4 and 5 are updated according to the updated $\pi_i$, $\mu_i$, $\Sigma_i$ and the expectation calculation. When $\pi_i$, $\mu_i$, and $\Sigma_i$ converge, the estimated category is taken as the final estimate. For example, the final result may be that the transaction level of training sample 4 is level 1 and the transaction level of training sample 5 is level 2.
Therefore, after the values corresponding to the missing features are filled in, the resulting training set may be: training sample 1 (courses per month: 100; transaction level: 1), training sample 2 (courses per month: 50; transaction level: 2), training sample 3 (courses per month: 60; transaction level: 2), training sample 4 (courses per month: 80; transaction level: 1), and training sample 5 (courses per month: 70; transaction level: 2).
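The scenario above can be traced with a small sketch. It is an illustration under stated assumptions, not the application's implementation: the single observed feature is modelled with one univariate Gaussian per transaction level, the missing level is assigned to the most probable component at each step (a hard-assignment form of the iteration described in the text), and a variance floor `VAR_FLOOR` is introduced because level 1 has only one complete sample. The floor value, the tolerance, and all function names are hypothetical.

```python
import math

VAR_FLOOR = 100.0  # assumption: keeps the variance positive when a level has one sample


def m_step(samples):
    """Maximum likelihood estimates (pi, mu, var) per level from (x, level) pairs."""
    params = {}
    for level in sorted({lev for _, lev in samples}):
        xs = [x for x, lev in samples if lev == level]
        mu = sum(xs) / len(xs)
        var = max(sum((x - mu) ** 2 for x in xs) / len(xs), VAR_FLOOR)
        params[level] = (len(xs) / len(samples), mu, var)
    return params


def e_step(x, params):
    """Posterior probability that a sample with observed feature x has each level."""
    weighted = {
        level: pi * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        for level, (pi, mu, var) in params.items()
    }
    total = sum(weighted.values())
    return {level: w / total for level, w in weighted.items()}


def impute(complete, incomplete, tol=1e-6, max_iter=100):
    """Alternate the two calculations until the parameters stop changing."""
    params = m_step(complete)  # initial estimate from the complete samples only
    levels = []
    for _ in range(max_iter):
        # Expectation calculation: assign each incomplete sample its most probable level.
        levels = []
        for x in incomplete:
            posterior = e_step(x, params)
            levels.append(max(posterior, key=posterior.get))
        # Maximum likelihood calculation on all samples, including the filled ones.
        new_params = m_step(complete + list(zip(incomplete, levels)))
        change = max(
            abs(a - b)
            for level in params
            for a, b in zip(params[level], new_params[level])
        )
        params = new_params
        if change < tol:
            break
    return levels, params


complete = [(100, 1), (50, 2), (60, 2)]  # training samples 1 to 3
incomplete = [80, 70]                    # training samples 4 and 5
levels, params = impute(complete, incomplete)
```

With these assumed settings the sketch assigns level 1 to training sample 4 and level 2 to training sample 5, which matches the illustrative final result in the text. A different variance floor can change the assignment, so the outcome should be read as a property of this sketch rather than of the method.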
Compared with the current common practice of filling missing features in a crowd performance prediction training set with traditional imputation methods such as special-value imputation and hot-deck imputation, the other method for processing missing features in crowd performance feature prediction provided by this embodiment can acquire the existing features in the crowd performance prediction training set and the preset Gaussian mixture model corresponding to the missing feature, the preset Gaussian mixture model being composed of the multiple Gaussian distributions corresponding to the missing feature, and can estimate the missing feature according to the existing features and the expectation-maximization (EM) algorithm of the preset Gaussian mixture model. The estimated missing feature can then be filled into the crowd performance prediction training set, so that missing data is filled on the basis of the preset Gaussian mixture model corresponding to the missing feature. Because the preset Gaussian mixture model is composed of the multiple Gaussian distributions corresponding to the missing feature, the filled-in values reflect the distribution of the missing feature and are related to the feature itself. This avoids bias in the filled-in values and in the association between the predictions of the trained model and the corresponding features, and thereby improves the prediction accuracy of the trained prediction model.
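Further, as a specific implementation of FIG. 1, an embodiment of the present application provides an apparatus for processing missing features in crowd performance prediction. As shown in FIG. 3, the apparatus includes an acquiring unit 31, an estimating unit 32, and a filling unit 33.
The acquiring unit 31 may be configured to acquire the existing features in the crowd performance prediction training set and the preset Gaussian mixture model corresponding to the missing feature, the preset Gaussian mixture model being composed of the multiple Gaussian distributions corresponding to the missing feature. The acquiring unit 31 is the main functional module of the apparatus for acquiring the existing features and the preset Gaussian mixture model corresponding to the missing feature.
The estimating unit 32 may be configured to estimate the value corresponding to the missing feature according to the existing features and the expectation-maximization algorithm of the preset Gaussian mixture model. The estimating unit 32 is the main functional module of the apparatus for estimating the missing feature, and is also its core module.
The filling unit 33 may be configured to fill the value corresponding to the missing feature into the crowd performance prediction training set. The filling unit 33 is the main functional module of the apparatus for filling the value corresponding to the missing feature into the crowd performance prediction training set.
In this embodiment of the present application, the expectation-maximization algorithm may include a maximum likelihood estimation calculation and an expectation calculation, and the estimating unit 32 may include an estimating module 321, an updating module 322, and an acquiring module 323, as shown in FIG. 4.
The estimating module 321 may be configured to estimate, according to the existing features and the maximum likelihood estimation calculation, the initial distribution parameters of each component Gaussian distribution of the preset Gaussian mixture model.
The estimating module 321 may further be configured to make an initial estimate, according to the initial distribution parameters and the expectation calculation, of the Gaussian distribution to which the missing feature belongs.
The updating module 322 may be configured to iteratively update the distribution parameters of each component Gaussian distribution according to the existing features, the Gaussian distribution to which the missing feature belongs, and the maximum likelihood estimation calculation, and to iteratively update the Gaussian distribution to which the missing feature belongs.
The acquiring module 323 may be configured to acquire, when the iteratively updated distribution parameters converge, the Gaussian distribution estimated according to the converged distribution parameters and the expectation calculation.
The estimating module 321 may further be configured to estimate the value corresponding to the missing feature according to the finally estimated Gaussian distribution.
In a specific application scenario, the estimating module 321 may specifically be configured to make an initial estimate, according to the initial distribution parameters and the expectation calculation, of the probability that the missing feature belongs to each component Gaussian distribution, and to make an initial estimate of the Gaussian distribution to which the missing feature belongs according to those probabilities.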
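For reference, the probability that a sample belongs to each component is conventionally the posterior responsibility of a Gaussian mixture. The expression below is the standard textbook form written with the document's symbols $\pi_i$, $\mu_i$, $\Sigma_i$. The symbols $x$ (the existing feature observed for the incomplete sample), $K$ (the number of components), $\gamma_i$, and $\hat{k}$ are notation assumed here, and the application's own formula may differ in detail.

```latex
\gamma_i(x) = \frac{\pi_i \,\mathcal{N}(x \mid \mu_i, \Sigma_i)}
                   {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x \mid \mu_j, \Sigma_j)},
\qquad
\hat{k} = \arg\max_{i \in \{1, \dots, K\}} \gamma_i(x)
```

Under this reading, the missing feature is assigned to the component $\hat{k}$ with the largest posterior probability.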
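It should be noted that the training set includes a first training set with complete features and a second training set containing missing features. The existing features include a first existing feature and a second existing feature in the first training set and a third existing feature in the second training set; the first existing feature corresponds to the third existing feature, and the second existing feature corresponds to the missing feature. The estimating module 321 may specifically be configured to estimate, according to the first existing feature, the second existing feature, and the maximum likelihood estimation calculation, the initial mixing coefficient, initial mean, and initial covariance of each component Gaussian distribution of the preset Gaussian mixture model.
The estimating module 321 may specifically further be configured to make an initial estimate of the Gaussian distribution to which the missing feature belongs according to the initial mixing coefficient, the initial mean, the initial covariance, the third existing feature, and the expectation calculation.
In this embodiment of the present application, to determine whether the updated distribution parameters have converged, the estimating unit 32 may further include a calculating module 324 and a determining module 325.
The calculating module 324 may be configured to calculate the difference between the distribution parameters obtained in two successive iterations.
The determining module 325 may be configured to determine that the updated distribution parameters have converged if the difference is less than a preset threshold.
In this embodiment of the present application, to obtain the preset Gaussian mixture model corresponding to the missing feature, the apparatus may further include a determining unit 34 and a constructing unit 35.
The determining unit 34 may be configured to determine the multiple Gaussian distributions corresponding to the missing feature. The determining unit 34 is the main functional module of the apparatus for determining the multiple Gaussian distributions corresponding to the missing feature.
The constructing unit 35 may be configured to construct the preset Gaussian mixture model from the multiple Gaussian distributions. The constructing unit 35 is the main functional module of the apparatus for constructing the preset Gaussian mixture model from the multiple Gaussian distributions.
In addition, to obtain the crowd performance prediction model, the apparatus may further include a training unit 36.
The training unit 36 may be configured to train the crowd performance prediction model on the crowd performance prediction training set after the missing features have been filled in. The training unit 36 is the main functional module of the apparatus for training the crowd performance prediction model on the filled training set.
It should be noted that, for other descriptions of the functional modules of the apparatus for processing missing features in crowd performance prediction provided in the embodiments of the present application, reference may be made to the corresponding description of the method shown in FIG. 1; details are not repeated here.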
Based on the method shown in FIG. 1, an embodiment of the present application correspondingly further provides a non-volatile computer-readable storage medium storing computer-readable instructions. When executed by a processor, the computer-readable instructions implement the following steps: acquiring the existing features in the crowd performance prediction training set and the preset Gaussian mixture model corresponding to the missing feature, the preset Gaussian mixture model being composed of the multiple Gaussian distributions corresponding to the missing feature; estimating the value corresponding to the missing feature according to the existing features and the expectation-maximization algorithm of the preset Gaussian mixture model; and filling the value corresponding to the missing feature into the crowd performance prediction training set.
Based on the embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 3, an embodiment of the present application further provides a physical structure diagram of a computer device. As shown in FIG. 5, the computer device includes a processor 41, a memory 42, and computer-readable instructions stored in the memory 42 and executable on the processor, where the memory 42 and the processor 41 are both arranged on a bus 43. When the processor 41 executes the computer-readable instructions, the following steps are implemented: acquiring the existing features in the crowd performance prediction training set and the preset Gaussian mixture model corresponding to the missing feature, the preset Gaussian mixture model being composed of the multiple Gaussian distributions corresponding to the missing feature; estimating the value corresponding to the missing feature according to the existing features and the expectation-maximization algorithm of the preset Gaussian mixture model; and filling the value corresponding to the missing feature into the crowd performance prediction training set.
With the technical solution of the present application, the existing features in the crowd performance prediction training set and the preset Gaussian mixture model corresponding to the missing feature can be acquired, the preset Gaussian mixture model being composed of the multiple Gaussian distributions corresponding to the missing feature, and the value corresponding to the missing feature can be estimated according to the existing features and the expectation-maximization algorithm of the preset Gaussian mixture model. The estimated value can then be filled into the crowd performance prediction training set, so that missing data is filled on the basis of the preset Gaussian mixture model corresponding to the missing feature. Because the preset Gaussian mixture model is composed of the multiple Gaussian distributions corresponding to the missing feature, the filled-in values reflect the distribution of the missing feature and are related to the feature itself. This avoids bias in the filled-in values and in the association between the predictions of the trained model and the corresponding features, and thereby improves the prediction accuracy of the trained prediction model.
Those skilled in the art will understand that the modules or steps of the present application described above may be implemented with a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented as computer-readable instruction code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from the one given here, or the modules or steps may be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. The present application is therefore not limited to any particular combination of hardware and software.
The above are only preferred embodiments of the present application and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.
Claims (20)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810918521.3 | 2018-08-13 | ||
| CN201810918521.3A CN109325655A (en) | 2018-08-13 | 2018-08-13 | Missing characteristic processing method and device in the prediction of crowd's performance feature |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020034593A1 (en) | 2020-02-20 |
Family
ID=65264112
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/073294 Ceased WO2020034593A1 (en) | 2018-08-13 | 2019-01-27 | Method and apparatus for processing missing feature in crowd performance feature prediction |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN109325655A (en) |
| WO (1) | WO2020034593A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112906793A (en) * | 2021-02-22 | 2021-06-04 | 深圳市市政设计研究院有限公司 | Monitoring data repairing method and system for bridge health monitoring system |
| CN113159194A (en) * | 2021-04-26 | 2021-07-23 | 中南大学 | Missing value filling method based on attribute dynamic selection and gray level correlation analysis |
| CN115408375A (en) * | 2022-08-24 | 2022-11-29 | 大连理工大学 | A method for real-time reconstruction and online repair of missing sensor data in energy monitoring system |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111429185B (en) * | 2020-03-27 | 2023-06-02 | 京东城市(北京)数字科技有限公司 | Crowd figure prediction method, device, equipment and storage medium |
| CN113076970A (en) * | 2021-02-24 | 2021-07-06 | 浙江师范大学 | Gaussian mixture model clustering machine learning method under deficiency condition |
| CN113901039A (en) * | 2021-10-11 | 2022-01-07 | 杭萧钢构股份有限公司 | A three-dimensional visual monitoring method, device, storage medium and terminal of a steel structure factory |
| JP2023072958A (en) * | 2021-11-15 | 2023-05-25 | 株式会社レゾナック | Model generation device, model generation method, and data estimation device |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101493886A (en) * | 2009-02-24 | 2009-07-29 | 武汉兰丁医学高科技有限公司 | Karyoplast categorization and identification method in case of unsoundness of characteristic parameter |
| CN104573685A (en) * | 2015-01-29 | 2015-04-29 | 中南大学 | Natural scene text detecting method based on extraction of linear structures |
| US20160180234A1 (en) * | 2014-12-23 | 2016-06-23 | InsideSales.com, Inc. | Using machine learning to predict performance of an individual in a role based on characteristics of the individual |
| CN105989843A (en) * | 2015-01-28 | 2016-10-05 | 中兴通讯股份有限公司 | Method and device of realizing missing feature reconstruction |
| CN107193876A (en) * | 2017-04-21 | 2017-09-22 | 美林数据技术股份有限公司 | A kind of missing data complementing method based on arest neighbors KNN algorithms |
| CN107842713A (en) * | 2017-11-03 | 2018-03-27 | 东北大学 | Submarine pipeline magnetic flux leakage data missing interpolating method based on KNN SVR |
- 2018-08-13: CN application CN201810918521.3A, published as CN109325655A, status: active, pending
- 2019-01-27: WO application PCT/CN2019/073294, published as WO2020034593A1, status: not active, ceased
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112906793A (en) * | 2021-02-22 | 2021-06-04 | 深圳市市政设计研究院有限公司 | Monitoring data repairing method and system for bridge health monitoring system |
| CN112906793B (en) * | 2021-02-22 | 2023-12-22 | 深圳市市政设计研究院有限公司 | A monitoring data repair method and system for bridge health monitoring system |
| CN113159194A (en) * | 2021-04-26 | 2021-07-23 | 中南大学 | Missing value filling method based on attribute dynamic selection and gray level correlation analysis |
| CN115408375A (en) * | 2022-08-24 | 2022-11-29 | 大连理工大学 | A method for real-time reconstruction and online repair of missing sensor data in energy monitoring system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109325655A (en) | 2019-02-12 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19849831; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19849831; Country of ref document: EP; Kind code of ref document: A1 |