CN110543904A - Enterprise risk classification model construction method based on Bayes

Info

Publication number: CN110543904A
Application number: CN201910790138.9A
Authority: CN
Inventors: 杨为琛; 伺彦伟; 张婷; 祁洪波; 徐爱华; 郭冰洁
Original assignee: Hebei Aerospace Information Technology Co Ltd
Current assignee: Hebei Aerospace Information Technology Co Ltd
Priority date: 2019-08-26
Filing date: 2019-08-26
Publication date: 2019-12-06

Abstract

The invention provides a Bayesian-based method for constructing an enterprise risk classification model, and belongs to the technical field of computers. Step one: select the static features, tax payment behavior features and high tax risk features of an enterprise as data sources, and clean and de-noise the data. Step two: perform feature selection on the data sources obtained in step one using the information gain method, and store the resulting feature information gain values in a database table in descending order; the larger a feature's information gain value, the stronger its classification capability. Step three: according to the ranking of the feature information gain values, select the top 15-25 ranked features as the input of the Bayesian algorithm model. Step four: establish the Bayesian-based enterprise risk classification model, classify enterprises with it, and find abnormal enterprises, thereby providing early warning of tax risks.

Description

Enterprise risk classification model construction method based on Bayes

Technical Field

The invention relates to an enterprise risk classification model construction method based on Bayes, and belongs to the technical field of computers.

Background

With the development of informatization, enterprise systems have accumulated a large amount of financial and tax information; how to extract effective information from it to give early warning of tax payment risks deserves in-depth research. Traditional data and risk analysis relies largely on surveys and mathematical statistics; the data lag behind, lack comprehensive consideration of multiple dimensions, and cannot reflect the state of an enterprise in a timely and comprehensive manner. Therefore, in order to control abnormal situations and the risk handling process and to prevent tax violations, a closed-loop risk monitoring and management approach relying on big data technology should be actively explored.

Disclosure of Invention

The invention aims to provide an enterprise risk classification model based on a Bayesian algorithm, using machine learning and data mining algorithms, so that the above problems can be effectively prevented. The model builds an analysis model from the financial, tax and business data of the enterprise, identifies possible risk points in advance, and helps reduce tax evasion and other illegal behavior by enterprises. The technical scheme is as follows:

A Bayesian-based enterprise risk classification model construction method comprises the following steps:

Step one, selecting static features, tax payment behavior features and high tax risk features of an enterprise as data sources, and cleaning and de-noising the data;

Step two, performing feature selection on the data sources obtained in step one by using an information gain method, and storing the feature information gain values in a database table in descending order;

Step three, selecting the top 15-25 ranked features as the input of a Bayesian algorithm model according to the ranking of the feature information gain values;

Step four, establishing a Bayesian-based enterprise risk classification model, classifying enterprises through the model, and finding abnormal enterprises, thereby providing early warning of tax risks.
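The ranking and top-k selection of steps two and three can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the in-memory dictionary standing in for the database table, and the example feature names are all assumptions.

```python
def rank_features(gain_by_feature):
    """Sort (feature, information gain) pairs from largest to smallest gain."""
    return sorted(gain_by_feature.items(), key=lambda kv: kv[1], reverse=True)


def select_top_features(gain_by_feature, k=20):
    """Return the names of the k features with the largest information gain.

    The text selects the top 15-25 features; k=20 is an assumed default.
    """
    return [name for name, _ in rank_features(gain_by_feature)[:k]]


# Hypothetical feature names and gain values, for illustration only.
example_gains = {
    "registered_capital": 0.12,
    "late_filings": 0.41,
    "input_output_ratio": 0.33,
}
top_two = select_top_features(example_gains, k=2)
```

In practice the ranked pairs would be written to the database table described in step two before the top-ranked features are read back as model input.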

Further, the specific process of obtaining the feature information gain value in step two includes:

First, selecting an enterprise data set D, where |D| is the number of samples; there are K classes Ck, k = 1, 2, …, K, and |Ck| is the number of samples belonging to class Ck; the feature X takes n different values {x1, x2, …, xn}, and D is divided according to the value of X into n subsets D1, D2, …, Dn, where |Di| is the number of samples in the i-th subset; the set of samples in subset Di that belong to class Ck is denoted Dik, and |Dik| is the number of samples in Dik;

Secondly, inputting the enterprise data set D and calculating its empirical entropy H(D), the empirical entropy H(D) being as follows:

H(D) = -Σ_{k=1}^{K} (|Ck|/|D|) log2(|Ck|/|D|)

Thirdly, calculating the empirical conditional entropy H(D|X) of the feature X with respect to the data set D, the empirical conditional entropy H(D|X) being as follows:

H(D|X) = Σ_{i=1}^{n} (|Di|/|D|) H(Di) = -Σ_{i=1}^{n} (|Di|/|D|) Σ_{k=1}^{K} (|Dik|/|Di|) log2(|Dik|/|Di|)

Fourthly, calculating the feature information gain value from the empirical entropy H(D) and the empirical conditional entropy H(D|X), the feature information gain value being as follows:

g(D,X) = H(D) - H(D|X)

This completes the acquisition of the feature information gain value.
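The calculation of g(D,X) from counted samples can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the logarithm is taken to base 2, and the feature is assumed to be already discretized so that its values can be used directly to split D into the subsets Di.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy(labels):
    """H(D): entropy of the class labels, estimated from relative frequencies."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def empirical_conditional_entropy(feature_values, labels):
    """H(D|X): entropy of each subset Di weighted by |Di| / |D|."""
    subsets = defaultdict(list)  # feature value x_i -> class labels of the samples in Di
    for x, y in zip(feature_values, labels):
        subsets[x].append(y)
    total = len(labels)
    return sum(len(s) / total * empirical_entropy(s) for s in subsets.values())


def information_gain(feature_values, labels):
    """g(D, X) = H(D) - H(D|X)."""
    return empirical_entropy(labels) - empirical_conditional_entropy(feature_values, labels)
```

A feature that separates the two classes perfectly attains the maximum gain H(D), while a feature whose values are spread evenly across both classes has a gain of zero.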

Further, K = 2, indicating that enterprises are classified into normal enterprises and abnormal enterprises.

Further, the process in step four of classifying enterprises through the Bayesian algorithm model comprises:

Step 1, after feature selection according to the feature information gain values, preprocessing the feature data of the enterprises by normalization and discretization;

Step 2, dividing the processed enterprise data into a training set and a test set, wherein the ratio of the training set to the test set is 7:3;

Step 3, using the Bayesian algorithm model, counting the number of enterprises in the training set whose class attribute is normal and the number whose class attribute is abnormal, and calculating the prior probability P(Y = ck) as follows:

P(Y = ck), k = 1, 2, …, K

Step 4, calculating, for each feature, the conditional probability P(Y = ck | X = x) of each class, wherein the conditional probability P(Y = ck | X = x) is as follows:

wherein x represents a data sample; y represents the category, namely normal enterprise or abnormal enterprise; N represents the total number of samples; j represents the j-th sample;

Step 5, using the test samples, obtaining the class to which each enterprise belongs according to the enterprise risk classification model, wherein the classification model is as follows:

wherein y represents the test sample.
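The preprocessing, 7:3 split and classification steps listed above might be sketched as follows. Several choices here are assumptions not stated in the text: min-max normalization with equal-width binning, a seeded random shuffle for the split, add-one (Laplace) smoothing of the conditional probabilities, and all function and class names.

```python
import math
import random
from collections import Counter, defaultdict


def min_max_discretize(column, bins=3):
    """Min-max normalize a numeric column to [0, 1], then cut it into equal-width bins."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [min(int((v - lo) / span * bins), bins - 1) for v in column]


def split_train_test(rows, labels, train_ratio=0.7, seed=0):
    """Shuffle the sample indices and split them 7:3 into training and test sets."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(len(idx) * train_ratio)
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


class NaiveBayes:
    """Naive Bayes over discrete features with add-one smoothing (an assumption)."""

    def fit(self, rows, labels):
        self.n = len(labels)
        self.class_counts = Counter(labels)       # class -> number of training samples
        self.value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.values = defaultdict(set)            # feature index -> values seen in training
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.value_counts[(j, y)][v] += 1
                self.values[j].add(v)
        return self

    def log_score(self, row, c):
        """log P(Y = c) plus the sum over features of log P(feature value | Y = c)."""
        score = math.log(self.class_counts[c] / self.n)
        for j, v in enumerate(row):
            num = self.value_counts[(j, c)][v] + 1
            den = self.class_counts[c] + len(self.values[j])
            score += math.log(num / den)
        return score

    def predict(self, row):
        """Return the class with the highest (unnormalized) posterior score."""
        return max(self.class_counts, key=lambda c: self.log_score(row, c))
```

Scores are accumulated in log space so that the product over many features does not underflow; because the logarithm is monotonic, the class chosen is the same as with the plain product.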

The invention has the beneficial effects that:

The Bayesian enterprise risk model is evaluated using recall, precision and the F1 value (F1-score). Recall is the proportion of all positive samples that the model judges to be positive; precision is the proportion of all samples judged positive by the model that are truly positive. F1 = 2PR/(P + R), where P is precision and R is recall.

About 1 million enterprises in a city of Hebei Province were selected as training samples, and 2000 enterprises were extracted as test samples to judge enterprise risk types. The trained classification model achieves a recall of 75%, a precision of 88%, and an F1 value of 81%. The experimental results show that the classification model of this patent classifies well and can find abnormal enterprises, thereby providing early warning of tax risks.
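The three evaluation measures can be computed as in the sketch below, which also checks that the reported F1 value follows from the reported precision and recall. The function names are hypothetical, and treating the abnormal class as the positive class is an assumption; the text does not say which class is positive.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_pred, positive="abnormal"):
    """Compute precision, recall and F1 for the class given as positive (assumed: abnormal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check of the reported figures: P = 0.88, R = 0.75.
reported_f1 = round(f1_score(0.88, 0.75), 2)
```

With P = 0.88 and R = 0.75, 2PR/(P + R) = 1.32/1.63, which rounds to 0.81 and agrees with the stated F1 value.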

Drawings

FIG. 1 is a flow chart of a Bayesian-based enterprise risk classification model construction method of the present invention;

FIG. 2 is a schematic diagram of IG values of some features in the database and their ordering.

Detailed Description

The present invention will be further described with reference to the following specific examples, but the present invention is not limited to these examples.

Example 1:

The static features of an enterprise are derived from its tax registration information, i.e. basic enterprise information, most of which is entered when the enterprise first registers for tax. Once recorded, these data seldom change; examples include the business address and the registered capital of the enterprise. This embodiment classifies the features refined from this portion of the data as static features.

The tax payment behavior features reflect the fact that an enterprise's tax payment behavior often changes periodically because of declaration periods, tax settlement periods and other reasons; this embodiment classifies such changes as tax payment behavior features. For features such as those concerning the upstream and downstream of an enterprise, different choices of statistical period may produce different results, so different time granularities can be set for them. Although setting different granularities sacrifices some feature independence, it reflects the actual situation of an enterprise more comprehensively, and tax payment features can be refined at different period granularities.

The high tax risk features are the risk features of industries with a high incidence of tax risk, identified according to the industry subdivision results. Because these features apply only to specific industries and are absent or insignificant in other industries, where they have no discriminative value, they are computed during data processing only for enterprises belonging to those specific industries.

Combining the features from these three perspectives, this embodiment compiles fifty features as the basis for subsequent data analysis.

Step two, the specific process of obtaining the characteristic information gain value includes:

g(D,X) = H(D) - H(D|X)

Feature selection means selecting the features with stronger classification capability for the classifier, so as to improve classification efficiency and obtain better classification results. This patent performs feature selection on the fifty compiled features, ranks them by classification capability, and selects the features with better classification capability. This embodiment uses the information gain method for feature selection.

In information theory and probability statistics, entropy is a measure of the uncertainty of a random variable. The entropy of a random variable X is defined as:

H(X) = -Σ_{i=1}^{n} pi log pi (1)

where pi is the probability distribution of X. The larger the entropy of the random variable X, the greater its uncertainty.

Conditional entropy H(Y|X) is the uncertainty of a random variable Y given the random variable X, defined as the expectation over X of the entropy of the conditional probability distribution of Y given X:

H(Y|X) = Σ_{i=1}^{n} pi H(Y|X = xi) (2)

pi = P(X = xi), i = 1, 2, …, n (3)

When the entropy and the conditional entropy are estimated from data, they are called the empirical entropy and the empirical conditional entropy, respectively.
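The two definitions can be illustrated on explicit probability distributions. The sketch below is illustrative only: the function names, the base-2 logarithm, and the representation of the joint distribution as a nested list (joint[i][k] = P(X = xi, Y = yk)) are assumptions.

```python
import math


def entropy(probs):
    """Entropy of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def conditional_entropy(joint):
    """H(Y|X) from a joint table, where joint[i][k] = P(X = x_i, Y = y_k)."""
    total = 0.0
    for row in joint:
        p_i = sum(row)  # marginal probability P(X = x_i)
        if p_i > 0:
            # entropy of the conditional distribution P(Y | X = x_i), weighted by p_i
            total += p_i * entropy([p / p_i for p in row])
    return total


uniform = entropy([0.5, 0.5])  # most uncertain two-outcome distribution
skewed = entropy([0.9, 0.1])   # less uncertain, so smaller entropy
```

When Y is fully determined by X the conditional entropy is zero, and when Y is independent of X it equals the entropy of Y.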

The information gain (IG) indicates the degree to which knowing the feature X reduces the uncertainty about the class Y. The information gain g(D, X) of the feature X on the data set D is defined as the difference between the empirical entropy H(D) of the data set D and the empirical conditional entropy H(D|X) of D given the feature X, i.e.

g(D,X) = H(D) - H(D|X) (4)

Here the empirical entropy H(D) represents the uncertainty in classifying the data set D, and the empirical conditional entropy H(D|X) represents the uncertainty in classifying D given the feature X. Their difference, the information gain, represents the degree to which the feature X reduces the classification uncertainty of D. According to formula (4), the information gain on the data set D depends on the feature X: different features yield different information gain values, and features with larger information gain have stronger classification capability. Here K = 2, meaning that enterprises are divided into normal enterprises and abnormal enterprises.

Step four, the specific process of establishing the Bayesian-based enterprise risk classification model comprises the following steps:

Identifying the abnormal enterprises among a set of enterprises, i.e. labeling each enterprise, is a classification process. This embodiment classifies enterprises with a naive Bayes algorithm to find abnormal enterprises, thereby providing early warning of tax risks. The Bayesian algorithm is used to derive the training target P(Y = ck | X = x) from the training data D, i.e. for a given sample x, to determine the probability that the sample belongs to class ck.

Assume that the training data set is D = {(x1, y1), (x2, y2), …, (xN, yN)} and that the output class label set is Y = {c1, c2, …, cK} (in this patent K = 2, a binary classification with a normal enterprise class and an abnormal enterprise class). The naive Bayes algorithm learns the joint probability distribution P(X, Y) from the training data set; specifically, it learns the following prior probability distribution and conditional probability distribution. The prior probability distribution expresses the proportion of each class of samples in the sample space:

P(Y = ck), k = 1, 2, …, K (8)

Conditional probability distribution:

In naive Bayes classification, the learned model is used to calculate the posterior probability distribution P(Y = ck | X = x) for an input sample x, and the class with the highest posterior probability is output as the class of x. The posterior probability is calculated by Bayes' theorem:

Substituting equation (9) into equation (10) yields

The Bayesian classifier uses the argmax function for classification, so the naive Bayes classifier can be expressed as:

The denominator in equation (12) is the same for all classes, so equation (12) can be simplified to

Formula (13) is the Bayesian-based enterprise risk classification model.
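The chain of formulas referred to in this derivation corresponds to the standard naive Bayes derivation. A sketch of that standard form is given below under the usual conditional-independence assumption; the feature index j, the number of features n, and the superscript notation x^{(j)} are notational assumptions not given in the text, and the equation tags are chosen only to mirror the numbers cited above.

```latex
\begin{align}
P(X = x \mid Y = c_k)
  &= \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_k\bigr) \tag{9} \\
P(Y = c_k \mid X = x)
  &= \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}
          {\sum_{k'=1}^{K} P(X = x \mid Y = c_{k'})\, P(Y = c_{k'})} \tag{10} \\
P(Y = c_k \mid X = x)
  &= \frac{P(Y = c_k) \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_k\bigr)}
          {\sum_{k'=1}^{K} P(Y = c_{k'}) \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_{k'}\bigr)} \tag{11} \\
y &= \arg\max_{c_k}
     \frac{P(Y = c_k) \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_k\bigr)}
          {\sum_{k'=1}^{K} P(Y = c_{k'}) \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_{k'}\bigr)} \tag{12} \\
y &= \arg\max_{c_k}\; P(Y = c_k) \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_k\bigr) \tag{13}
\end{align}
```

The step from (12) to (13) relies only on the fact that the denominator does not depend on c_k, so dropping it does not change which class attains the maximum.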

Step four, the process of classifying the enterprises through the Bayesian algorithm model comprises the following steps:

Step 1, carrying out normalization and discretization on the feature information gain value, wherein the feature information gain value represents enterprise data;

P(Y = ck), k = 1, 2, …, K

Step 5, using the test samples, obtaining the class to which each enterprise belongs according to the enterprise risk classification model, wherein the classification model is as follows:

wherein y represents the test sample.

Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A Bayesian-based enterprise risk classification model construction method, characterized by comprising the following steps:

2. The Bayesian-based enterprise risk classification model construction method according to claim 1, wherein the specific process of obtaining the feature information gain value in step two comprises:

g(D,X) = H(D) - H(D|X)

3. The Bayesian-based enterprise risk classification model construction method according to claim 2, wherein K = 2, indicating that enterprises are classified into normal enterprises and abnormal enterprises.

4. The Bayesian-based enterprise risk classification model construction method according to claim 1, wherein the process in step four of classifying enterprises through the Bayesian algorithm model comprises the following steps:

P(Y = ck), k = 1, 2, …, K

Step 5, using the test set, obtaining the class to which each enterprise belongs according to the classification model, wherein the classification model is as follows:

wherein y represents the test sample.