CN109978701A

CN109978701A - Personal hospitalization probability prediction method and system

Info

Publication number: CN109978701A
Application number: CN201910258525.8A
Authority: CN
Inventors: 万湘琳
Original assignee: Pacific Health Management Co Ltd
Current assignee: Pacific Health Management Co Ltd
Priority date: 2019-04-01
Filing date: 2019-04-01
Publication date: 2019-07-05

Abstract

The present invention discloses a personal hospitalization probability prediction method and system, comprising: acquiring the basic medical insurance reimbursement data of the current year and the corresponding insured-person information data; performing data standardization on the basic medical insurance settlement data, the basic medical insurance settlement detail data, and the insured-person information data to obtain standard basic medical insurance settlement data, standard basic medical insurance settlement detail data, and standard insured-person information data; generating four major classes of predictive factors (personal information, health care costs, medical behavior, and disease type) from the standardized data, and generating multiple sub-predictive factors from the four major classes; performing feature conversion on the sub-predictive factors; performing feature dimensionality reduction on the converted sub-predictive factors to reduce their number; and establishing a logistic regression model based on the sub-predictive factors after feature selection to predict the next-year hospitalization rate.

Description

Personal hospitalization probability prediction method and system

Technical field

The present invention relates to the technical field of personal hospitalization probability prediction, and in particular to a personal hospitalization probability prediction method and a personal hospitalization probability prediction system.

Background art

Basic medical insurance reimbursement data covers many areas, including the insured person's personal information, disease information, medical behavior, medical expenses, and social security type. The incidence of medical visits is much higher than the incidence of claims under other types of insurance, and the data granularity is fine, so the data can be used to build a profile of the insured person in terms of medical care and health. A hospitalization rate prediction model built on such data can support:

Pricing and underwriting of hospitalization benefit insurance.

Identification of high-risk groups. Medical expenses derive mainly from inpatient payments, so intervening in and managing groups with a high hospitalization probability in advance can effectively curb the rapid growth of medical expenses.

Assessment of hospital treatment capability. Predicting the probability that patients hospitalized in the current year are hospitalized again within a subsequent period can, to some extent, assess a hospital's level of treatment.

At present, major insurance companies and research institutions study individual-level hospitalization probability prediction using their own claims data or data obtained from open resources. However, limited data volume and limited data granularity leave the prediction accuracy of these studies inadequate.

Summary of the invention

In view of the problems and shortcomings of the prior art, the present invention provides a personal hospitalization probability prediction method and system.

The present invention solves the above technical problem through the following technical solutions:

The present invention provides a personal hospitalization probability prediction method, characterized in that it comprises the following steps:

Step 1, sample collection: acquiring the basic medical insurance reimbursement data of the current year and the corresponding insured-person information data, the basic medical insurance reimbursement data comprising basic medical insurance settlement data and basic medical insurance settlement detail data;

Step 2, data standardization: performing data standardization on the basic medical insurance settlement data, the basic medical insurance settlement detail data, and the insured-person information data to obtain standard basic medical insurance settlement data, standard basic medical insurance settlement detail data, and standard insured-person information data;

Step 3, feature engineering: generating four major classes of predictive factors based on the standard basic medical insurance settlement data, the standard basic medical insurance settlement detail data, and the standard insured-person information data, the four major classes comprising personal information, health care costs, medical behavior, and disease type, and generating multiple sub-predictive factors based on the four major classes;

Step 4, feature conversion: performing feature conversion on the sub-predictive factors;

Step 5, feature selection: performing feature dimensionality reduction on the converted sub-predictive factors to reduce the number of sub-predictive factors;

Step 6, model building: establishing a logistic regression model based on the sub-predictive factors after feature selection to predict the next-year hospitalization rate;

where Y denotes the next-year hospitalization rate; θ_i denotes the coefficients of the independent variables, 0 ≤ i ≤ n; X_j denotes the j-th sub-predictive factor among the sub-predictive factors after feature selection, 1 ≤ j ≤ n; and n denotes the number of sub-predictive factors after feature selection.

Preferably, in step 4, impact coding is used for the feature conversion of numeric predictive factors, and one-hot encoding is used for the feature conversion of character-type predictive factors.

Preferably, in step 5, factor correlation analysis is used to identify mutually correlated factors among the converted sub-predictive factors, only one factor of each set of mutually correlated factors is retained, and the XGBoost algorithm is used to remove factors with weak predictive power.

Preferably, the fields of the basic medical insurance settlement data mainly include personal ID number, visit number, visit time, visit category, admission/discharge time, diagnosis code, diagnosis name, department name, total medical cost, medical insurance reimbursement amount, self-borne amount, critical illness reimbursement amount, other reimbursement amounts, etc.;

The fields of the basic medical insurance settlement detail data mainly include personal ID number, visit number, settlement slip number, medical insurance catalog code, medical insurance catalog name, unit price, quantity, amount, self-payment ratio, self-funded amount, etc.;

The fields of the insured-person information data mainly include age, gender, insurance type, retirement status, household registration type, education level, political affiliation, job category, etc.

The present invention also provides a personal hospitalization probability prediction system, characterized in that it comprises a data acquisition module, a data processing module, a data generation module, a feature conversion module, a feature selection module, and a model building module;

The data acquisition module is used to acquire the basic medical insurance reimbursement data of the current year and the corresponding insured-person information data, the basic medical insurance reimbursement data comprising basic medical insurance settlement data and basic medical insurance settlement detail data;

The data processing module is used to perform data standardization on the basic medical insurance settlement data, the basic medical insurance settlement detail data, and the insured-person information data to obtain standard basic medical insurance settlement data, standard basic medical insurance settlement detail data, and standard insured-person information data;

The data generation module is used to generate four major classes of predictive factors based on the standard basic medical insurance settlement data, the standard basic medical insurance settlement detail data, and the standard insured-person information data, the four major classes comprising personal information, health care costs, medical behavior, and disease type, and to generate multiple sub-predictive factors based on the four major classes;

The feature conversion module is used to perform feature conversion on the sub-predictive factors;

The feature selection module is used to perform feature dimensionality reduction on the converted sub-predictive factors to reduce the number of sub-predictive factors;

The model building module is used to establish a logistic regression model based on the sub-predictive factors after feature selection to predict the next-year hospitalization rate;

Preferably, the feature conversion module is used to perform the feature conversion of numeric predictive factors using impact coding and the feature conversion of character-type predictive factors using one-hot encoding.

Preferably, the feature selection module is used to identify mutually correlated factors among the converted sub-predictive factors using factor correlation analysis, to retain only one factor of each set of mutually correlated factors, and to remove factors with weak predictive power using the XGBoost algorithm.

On the basis of common knowledge in the art, the above preferred conditions can be combined in any way to obtain the preferred embodiments of the present invention.

The positive effects of the present invention are as follows:

Basic medical insurance reimbursement data spans a long time period and a wide geographic area. At the individual level, the medical history, visit information, medication information, and information on examinations, treatments, and operations are relatively comprehensive, which greatly improves the accuracy of the hospitalization rate prediction model.

Brief description of the drawings

Fig. 1 is a flowchart of the personal hospitalization probability prediction method according to a preferred embodiment of the present invention.

Fig. 2 is a structural block diagram of the personal hospitalization probability prediction system according to a preferred embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

The hospitalization rate prediction model uses the basic medical insurance reimbursement data and insured-person information data of the current year to predict the hospitalization probability in the next year. The variable pool of the model contains about 1,500 predictive factors in four major classes: personal information, health care costs, medical behavior, and disease and medical information. A logistic regression model then predicts the individual-level hospitalization probability for the next year. The expression of the model is:

The hospitalization rate prediction model is described in detail below.

As shown in Fig. 1, this embodiment provides a personal hospitalization probability prediction method comprising the following steps:

Step 1, sample collection: acquire the basic medical insurance reimbursement data of the current year and the corresponding insured-person information data. The basic medical insurance reimbursement data comprises basic medical insurance settlement data and basic medical insurance settlement detail data.

The fields of the basic medical insurance settlement data mainly include personal ID number, visit number, visit time, visit category, admission/discharge time, diagnosis code, diagnosis name, department name, total medical cost, medical insurance reimbursement amount, self-borne amount, critical illness reimbursement amount, other reimbursement amounts, etc.

The fields of the basic medical insurance settlement detail data mainly include personal ID number, visit number, settlement slip number, medical insurance catalog code, medical insurance catalog name, unit price, quantity, amount, self-payment ratio, self-funded amount, etc.

Step 2, data standardization: perform data standardization on the basic medical insurance settlement data, the basic medical insurance settlement detail data, and the insured-person information data to obtain standard basic medical insurance settlement data, standard basic medical insurance settlement detail data, and standard insured-person information data.

Data standardization and the establishment of standard tables are important prerequisites for using this model. The tables in the database of every new city must be standardized. The field formats of medical insurance data differ from city to city; after standardization, the standard tables have the same content format, which makes it easy to reuse subsequent code.
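
The standardization step can be sketched as a per-city field mapping onto one shared schema. The following is a minimal illustration only: the city names, the raw field names, and the three standard fields are hypothetical, since the text does not disclose the actual table layouts.

```python
# Hypothetical per-city column mappings; real source field names will differ.
CITY_FIELD_MAPS = {
    "city_a": {"grbh": "person_id", "jzh": "visit_id", "zje": "total_cost"},
    "city_b": {"PersonNo": "person_id", "VisitNo": "visit_id", "TotalAmt": "total_cost"},
}

# Hypothetical standard schema shared by all cities.
STANDARD_FIELDS = ("person_id", "visit_id", "total_cost")


def standardize(record: dict, city: str) -> dict:
    """Rename one raw settlement record's fields to the standard schema."""
    mapping = CITY_FIELD_MAPS[city]
    renamed = {mapping[k]: v for k, v in record.items() if k in mapping}
    missing = [f for f in STANDARD_FIELDS if f not in renamed]
    if missing:
        raise ValueError(f"missing standard fields: {missing}")
    # Normalize types so downstream code sees a single format.
    renamed["person_id"] = str(renamed["person_id"])
    renamed["visit_id"] = str(renamed["visit_id"])
    renamed["total_cost"] = float(renamed["total_cost"])
    return renamed


row_a = standardize({"grbh": 1001, "jzh": "V1", "zje": "250.5"}, "city_a")
row_b = standardize({"PersonNo": "1001", "VisitNo": "V2", "TotalAmt": 80}, "city_b")
```

Once both cities produce records with identical keys and types, the feature engineering code can be written once against the standard fields.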

Step 3, feature engineering: generate four major classes of predictive factors based on the standard basic medical insurance settlement data, the standard basic medical insurance settlement detail data, and the standard insured-person information data. The four major classes comprise personal information, health care costs, medical behavior, and disease type. More than 1,500 sub-predictive factors are generated from the four major classes, laying the foundation for building the prediction model.

(a) Personal information

In the Tonglu data, personal information includes the insured person's age, gender, insurance type, occupation, household registration, and employment status.

(b) Health care costs

Health care costs are the most important feature class. They are stratified by hospital grade, hospital category, visit type, quarterly cost, and payment account type. For each person, statistics of expenditure within each category are computed and used as features, including total, average, and maximum expenditure; for example, the maximum, average, and sum of pooled-fund payments for outpatient visits at general hospitals.
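
The per-person, per-category cost statistics described above can be sketched as follows. This is an illustration under assumptions: the row layout `(person_id, category, amount)` and the category names are hypothetical stand-ins for the stratified settlement data.

```python
from collections import defaultdict

# Hypothetical standardized settlement rows: (person_id, category, amount).
visits = [
    ("p1", "general_hospital_outpatient", 120.0),
    ("p1", "general_hospital_outpatient", 80.0),
    ("p1", "general_hospital_inpatient", 3000.0),
    ("p2", "general_hospital_outpatient", 50.0),
]


def cost_features(rows):
    """Return {person_id: {feature_name: value}} with sum, mean and max per category."""
    grouped = defaultdict(list)
    for person_id, category, amount in rows:
        grouped[(person_id, category)].append(amount)

    features = defaultdict(dict)
    for (person_id, category), amounts in grouped.items():
        features[person_id][f"{category}_sum"] = sum(amounts)
        features[person_id][f"{category}_mean"] = sum(amounts) / len(amounts)
        features[person_id][f"{category}_max"] = max(amounts)
    return dict(features)


features = cost_features(visits)
```

Crossing several stratification dimensions (hospital grade, visit type, quarter, and so on) with three statistics each is one plausible way a pool of this kind grows to many hundreds of cost features.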

(c) Medical behavior

Medical behavior comprises predictive factors related to the number of visits, length of stay, and the like. Its granularity and stratification are consistent with those of the health care cost predictive factors.

(d) Disease information

Disease information is labeled using an independently developed disease grouping method, which divides more than 20,000 Chinese ICD-10 codes into more than 360 groups according to clinical correlation, cost correlation, and diagnosis-and-treatment pathway correlation. Each insured person can be given more than 360 disease labels based on his or her medical history: a label is 1 if a diagnosis in that disease group is present and 0 otherwise.
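
The 0/1 disease labels can be sketched as a lookup from diagnosis codes to groups. The mapping below is a hypothetical three-group stand-in; the actual grouping of more than 360 groups is not disclosed in the text.

```python
# Hypothetical mapping from ICD-10 codes to disease groups (illustration only).
ICD10_TO_GROUP = {
    "E11.9": "diabetes",
    "I10": "hypertension",
    "E78.5": "hyperlipidemia",
}
GROUPS = sorted(set(ICD10_TO_GROUP.values()))


def disease_labels(diagnosis_codes):
    """Return one 0/1 label per disease group for a person's diagnosis history."""
    present = {ICD10_TO_GROUP[c] for c in diagnosis_codes if c in ICD10_TO_GROUP}
    return {f"disease_{g}": int(g in present) for g in GROUPS}


# Repeated codes and codes outside the mapping do not change the labels.
labels = disease_labels(["I10", "I10", "J00"])
```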

(e) Diagnosis and treatment information

Through data analysis, the 84 diagnosis-and-treatment groups most correlated with medical expenses were selected. Each insured person is then given 84 diagnosis-and-treatment labels according to his or her visit information.

(f) Drug information

On the basis of PCG drug grouping, this method divides 889 drug items into 27 disease categories. For example, an insured person who has used pravastatin is given the hyperlipidemia label.

The disease type factors are obtained through (d), (e), and (f).

Step 4, feature conversion: perform feature conversion on the sub-predictive factors, using impact coding for numeric predictive factors and one-hot encoding for character-type predictive factors.

Feature conversion transforms the values of the original predictive factors into values that are more relevant to the prediction target. Different feature conversion methods can be used for different types of predictive factors.
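
For character-type factors, one-hot encoding can be sketched as below. The category values are hypothetical examples of an insurance-type field; the helper name `one_hot` is introduced here for illustration.

```python
def one_hot(values):
    """Encode categorical strings as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


categories, encoded = one_hot(["employee", "resident", "employee"])
```

Each row contains exactly one 1, in the column of that record's category.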

Impact coding is used to transform numeric variables, including cost variables, because it establishes a better linear relationship with the prediction target. Impact coding first divides the field into 100 buckets by some method; within each bucket, a conversion method maps the original field value to a new value.

transformed value_bucket_i = f(original value)
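
The text leaves both the bucketing method and the conversion function f unspecified. The sketch below makes two assumptions that are not from the source: buckets are equal-frequency (rank-based), and f maps a value to the mean outcome of its bucket minus the overall mean outcome, which is one common form of impact coding.

```python
def impact_encode(values, targets, n_buckets=100):
    """Replace each numeric value by (mean target in its bucket - overall mean target).

    Assumptions: equal-frequency buckets by rank; ties may straddle a bucket
    boundary, which a production version would need to handle.
    """
    n = len(values)
    n_buckets = min(n_buckets, n)
    order = sorted(range(n), key=lambda i: values[i])
    overall = sum(targets) / n

    bucket_of = [0] * n
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for rank, i in enumerate(order):
        b = rank * n_buckets // n  # equal-frequency bucket index
        bucket_of[i] = b
        sums[b] += targets[i]
        counts[b] += 1

    return [sums[bucket_of[i]] / counts[bucket_of[i]] - overall for i in range(n)]


# Toy data: annual cost and whether the person was hospitalized the next year.
costs = [10.0, 20.0, 30.0, 40.0]
hospitalized = [0, 0, 1, 1]
encoded_costs = impact_encode(costs, hospitalized, n_buckets=2)
```

With two buckets, the low-cost bucket encodes to -0.5 and the high-cost bucket to 0.5, so the transformed value moves linearly with the outcome rate even when the raw cost does not.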

Step 5, feature selection: perform feature dimensionality reduction on the converted sub-predictive factors to reduce the number of sub-predictive factors. Factor correlation analysis is used to identify mutually correlated factors among the converted sub-predictive factors, only one factor of each set of mutually correlated factors is retained, and the XGBoost algorithm is used to remove factors with weak predictive power.

Feature selection uses statistical or modeling methods to screen out a subset of the features; this process is also called dimensionality reduction. Because a large number of predictive factors are generated, two major classes of methods are used to perform feature selection systematically and efficiently.

First, mutually correlated factors are removed automatically by factor correlation analysis.
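
The correlation-based removal can be sketched as a greedy filter. The correlation measure (Pearson), the threshold of 0.9, and the rule of keeping the first factor encountered are all assumptions; the text does not state them.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """Keep a factor only if it is not highly correlated with any factor already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical factor columns; the second is an exact multiple of the first.
columns = {
    "total_cost": [100.0, 200.0, 300.0, 400.0],
    "total_cost_x2": [200.0, 400.0, 600.0, 800.0],
    "visit_count": [3.0, 1.0, 4.0, 1.0],
}
kept = drop_correlated(columns)
```

The filter needs only pairwise statistics, which is consistent with the speed advantage the text attributes to the correlation-based method.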

Then, a model is used to automatically remove factors with weak predictive power. The advantage of the correlation-based method is its fast computation. The advantage of the model-based method is that it is more effective at improving the model's prediction accuracy, but its disadvantage is slower computation. The XGBoost algorithm is used for this feature dimensionality reduction.

Finally, factors are removed manually. Domain knowledge and the opinions of industry experts are very important; some predictive factors need to be added or removed manually in light of expert opinion.

Step 6, model building: establish a logistic regression model based on the sub-predictive factors after feature selection to predict the next-year hospitalization rate.

The logistic regression model is a generalized linear model (Generalized Linear Model) commonly used to predict the probability of a disease or of some event occurring. Its dependent variable can be binary or multi-class, but the binary case is more common. Logistic regression assumes that the dependent variable and the residuals follow a binomial distribution, that the independent variables are linearly related to the logit of the event probability, and that the independent variables are mutually independent. Logistic regression applies the logit transformation to the dependent variable, and the model expression is:

The probability predicted by the model is:
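
The two formulas referred to above are not present in this text. Assuming the standard form of logistic regression and the symbols defined in the summary (Y, θ_i with 0 ≤ i ≤ n, X_j with 1 ≤ j ≤ n), they would plausibly read as follows; this is a reconstruction under that assumption, not the published formula.

```latex
\operatorname{logit}(Y) = \ln\frac{Y}{1 - Y} = \theta_0 + \sum_{j=1}^{n} \theta_j X_j

Y = \frac{1}{1 + e^{-\left(\theta_0 + \sum_{j=1}^{n} \theta_j X_j\right)}}
```

Here θ_0 is read as the intercept, which would account for the index of θ starting at 0 while the index of X starts at 1.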

The model uses 41 independent variables: personal information, 2; health care costs, 18; medical behavior, 1; disease type, 16; diagnosis-and-treatment type, 4.

Among them, the health care cost feature most strongly positively correlated with next-year hospitalization probability is medical spending in the fourth quarter; the most strongly positively correlated personal information feature is the retirement flag; the most strongly positively correlated disease type feature is pregnancy status; and the diagnosis-and-treatment type feature is childbirth-related and is negatively correlated with next-year hospitalization probability.

R2, AUROC, Gini, and KS are used to measure the model's prediction results, each with a different emphasis.

R2 is the ratio of the regression sum of squares to the total sum of squares. It reflects how well the regression equation explains the prediction target, and it is more sensitive to data with large absolute values.

AUROC is the area under the ROC curve (Area under ROC Curve). The ROC curve (receiver operating characteristic curve), also known as the sensitivity curve, is drawn from a series of different binary classification cutoffs (cut-off values or thresholds), with the true positive rate as the ordinate and the false positive rate as the abscissa. The ROC curve makes it easy to see the ability to identify the target event (a disease, hospitalization, etc.) at any cutoff. The closer the ROC curve is to the upper-left corner, the higher the accuracy of the test. AUROC can be intuitively understood as the probability that a uniformly drawn random positive sample is ranked ahead of a uniformly drawn random negative sample. Its value lies between 0.5 and 1; the higher the value, the better the model's predictive ability.

The Gini coefficient measures how well the model discriminates between positive and negative clients (Gini = AUROC × 2 − 1). Its value lies between 0 and 1; the higher the value, the better the model's discrimination. In assessing model capability, a Gini coefficient between 0.3 and 0.39 indicates medium discriminating power, between 0.4 and 0.59 indicates high discriminating power, and above 0.6 indicates excellent discriminating power.

The KS (Kolmogorov-Smirnov) index is the maximum, over the different binary classification cutoffs (cut-off values or thresholds), of the difference between the model's true positive rate and false positive rate. It indicates the model's ability to separate positive from negative clients. The KS value lies between 0 and 1; the larger the value, the better the model's discriminating power. Generally, KS > 0.2 indicates that the model has good predictive accuracy.
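
The three ranking metrics defined above can be computed directly from their definitions. The sketch below is illustrative: the function name `roc_metrics` and the toy scores are hypothetical, and the pairwise AUROC computation is quadratic in sample size, so it suits small examples rather than full data sets.

```python
def roc_metrics(scores, labels):
    """Return (auroc, gini, ks) for binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    # AUROC: share of (positive, negative) pairs where the positive scores higher;
    # ties count as half.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    auroc = wins / (len(pos) * len(neg))

    # Gini as defined in the text: AUROC * 2 - 1.
    gini = 2 * auroc - 1

    # KS: maximum of (true positive rate - false positive rate) over all thresholds.
    ks = 0.0
    for t in sorted(set(scores)):
        tpr = sum(p >= t for p in pos) / len(pos)
        fpr = sum(q >= t for q in neg) / len(neg)
        ks = max(ks, tpr - fpr)
    return auroc, gini, ks


# Toy predicted probabilities and observed next-year hospitalization flags.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auroc, gini, ks = roc_metrics(scores, labels)
```

In this toy example 8 of the 9 positive-negative pairs are ordered correctly, so AUROC is 8/9 and Gini is 7/9; the largest gap between true and false positive rates is 2/3.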

On the test set, the model's R-Square reaches 8.16% and its KS reaches 29.15%, indicating good predictive accuracy.

Model type	R-Square	AUROC	Gini	KS
XGBoost	10.75%	69.24%	38.49%	30.44%
Logistic Regression	8.16%	67.66%	35.32%	29.15%

This modeling method draws on basic medical insurance reimbursement data and insured-person information data to predict, in a big-data setting, the hospitalization probability of insured persons in the next year. Its prediction accuracy is greatly improved compared with the prediction models that commercial insurance companies typically build on their own claims data. This represents a major improvement in the insurance industry's pricing, underwriting, and risk control capabilities and is of great significance.

As shown in Fig. 2, this embodiment also provides a personal hospitalization probability prediction system comprising a data acquisition module 1, a data processing module 2, a data generation module 3, a feature conversion module 4, a feature selection module 5, and a model building module 6.

The data acquisition module 1 is used to acquire the basic medical insurance reimbursement data of the current year and the corresponding insured-person information data. The basic medical insurance reimbursement data comprises basic medical insurance settlement data and basic medical insurance settlement detail data.

The data processing module 2 is used to perform data standardization on the basic medical insurance settlement data, the basic medical insurance settlement detail data, and the insured-person information data to obtain standard basic medical insurance settlement data, standard basic medical insurance settlement detail data, and standard insured-person information data.

The data generation module 3 is used to generate four major classes of predictive factors based on the standard basic medical insurance settlement data, the standard basic medical insurance settlement detail data, and the standard insured-person information data. The four major classes comprise personal information, health care costs, medical behavior, and disease type, and multiple sub-predictive factors are generated from them.

The feature conversion module 4 is used to perform feature conversion on the sub-predictive factors.

The feature selection module 5 is used to perform feature dimensionality reduction on the converted sub-predictive factors to reduce the number of sub-predictive factors.

The model building module 6 is used to establish a logistic regression model based on the sub-predictive factors after feature selection to predict the next-year hospitalization rate.

Although specific embodiments of the present invention have been described above, those skilled in the art will understand that they are merely illustrative and that the protection scope of the present invention is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and substance of the present invention, and all such changes and modifications fall within the protection scope of the present invention.

Claims

1. A personal hospitalization probability prediction method, characterized in that it comprises the following steps:

Step 1, sample collection: acquiring the basic medical insurance reimbursement data of the current year and the corresponding insured-person information data, the basic medical insurance reimbursement data comprising basic medical insurance settlement data and basic medical insurance settlement detail data;

Step 2, data standardization: performing data standardization on the basic medical insurance settlement data, the basic medical insurance settlement detail data, and the insured-person information data to obtain standard basic medical insurance settlement data, standard basic medical insurance settlement detail data, and standard insured-person information data;

Step 3, feature engineering: generating four major classes of predictive factors based on the standard basic medical insurance settlement data, the standard basic medical insurance settlement detail data, and the standard insured-person information data, the four major classes comprising personal information, health care costs, medical behavior, and disease type, and generating multiple sub-predictive factors based on the four major classes;

Step 4, feature conversion: performing feature conversion on the sub-predictive factors;

Step 5, feature selection: performing feature dimensionality reduction on the converted sub-predictive factors to reduce the number of sub-predictive factors;

Step 6, model building: establishing a logistic regression model based on the sub-predictive factors after feature selection to predict the next-year hospitalization rate;

where Y denotes the next-year hospitalization rate; θ_i denotes the coefficients of the independent variables, 0 ≤ i ≤ n; X_j denotes the j-th sub-predictive factor among the sub-predictive factors after feature selection, 1 ≤ j ≤ n; and n denotes the number of sub-predictive factors after feature selection.

2. The personal hospitalization probability prediction method according to claim 1, characterized in that, in step 4, impact coding is used for the feature conversion of numeric predictive factors, and one-hot encoding is used for the feature conversion of character-type predictive factors.

3. The personal hospitalization probability prediction method according to claim 1, characterized in that, in step 5, factor correlation analysis is used to identify mutually correlated factors among the converted sub-predictive factors, only one factor of each set of mutually correlated factors is retained, and the XGBoost algorithm is used to remove factors with weak predictive power.

4. The personal hospitalization probability prediction method according to claim 1, characterized in that the fields of the basic medical insurance settlement data mainly include personal ID number, visit number, visit time, visit category, admission/discharge time, diagnosis code, diagnosis name, department name, total medical cost, medical insurance reimbursement amount, self-borne amount, critical illness reimbursement amount, other reimbursement amounts, etc.;

The fields of the basic medical insurance settlement detail data mainly include personal ID number, visit number, settlement slip number, medical insurance catalog code, medical insurance catalog name, unit price, quantity, amount, self-payment ratio, self-funded amount, etc.;

The fields of the insured-person information data mainly include age, gender, insurance type, retirement status, household registration type, education level, political affiliation, job category, etc.

5. A personal hospitalization probability prediction system, characterized in that it comprises a data acquisition module, a data processing module, a data generation module, a feature conversion module, a feature selection module, and a model building module;

The data acquisition module is used to acquire the basic medical insurance reimbursement data of the current year and the corresponding insured-person information data, the basic medical insurance reimbursement data comprising basic medical insurance settlement data and basic medical insurance settlement detail data;

The data processing module is used to perform data standardization on the basic medical insurance settlement data, the basic medical insurance settlement detail data, and the insured-person information data to obtain standard basic medical insurance settlement data, standard basic medical insurance settlement detail data, and standard insured-person information data;

The data generation module is used to generate four major classes of predictive factors based on the standard basic medical insurance settlement data, the standard basic medical insurance settlement detail data, and the standard insured-person information data, the four major classes comprising personal information, health care costs, medical behavior, and disease type, and to generate multiple sub-predictive factors based on the four major classes;

The feature conversion module is used to perform feature conversion on the sub-predictive factors;

The feature selection module is used to perform feature dimensionality reduction on the converted sub-predictive factors to reduce the number of sub-predictive factors;

The model building module is used to establish a logistic regression model based on the sub-predictive factors after feature selection to predict the next-year hospitalization rate;

6. The personal hospitalization probability prediction system according to claim 5, characterized in that the feature conversion module is used to perform the feature conversion of numeric predictive factors using impact coding and the feature conversion of character-type predictive factors using one-hot encoding.

7. The personal hospitalization probability prediction system according to claim 5, characterized in that the feature selection module is used to identify mutually correlated factors among the converted sub-predictive factors using factor correlation analysis, to retain only one factor of each set of mutually correlated factors, and to remove factors with weak predictive power using the XGBoost algorithm.

8. The personal hospitalization probability prediction system according to claim 5, characterized in that the fields of the basic medical insurance settlement data mainly include personal ID number, visit number, visit time, visit category, admission/discharge time, diagnosis code, diagnosis name, department name, total medical cost, medical insurance reimbursement amount, self-borne amount, critical illness reimbursement amount, other reimbursement amounts, etc.;