Disclosure of Invention
The application aims to provide a method, a device and a storage medium for determining an enterprise performance credit level. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of an embodiment of the present application, there is provided a method for determining an enterprise performance credit level, including:
acquiring invoice data of an enterprise;
establishing a performance rating model according to the invoice data;
calculating a performance score of the enterprise by using the performance rating model;
determining a level of performance credit based on the performance score.
Further, acquiring the invoice data of the enterprise includes: extracting invoicing behavior data of the invoicing enterprise from a business system, performing ETL processing on the data, and storing the result in a data warehouse.
Further, establishing the performance rating model according to the invoice data includes:
deriving, from the invoice data, features that affect the enterprise performance risk, extracting the derived features, modeling the sample data, identifying the feature variables highly correlated with default risk, and establishing a corresponding logistic regression model.
Further, modeling the sample data after extracting the derived features includes:
finding the optimal number of bins and the bin boundaries for each feature;
binning each feature according to its optimal bin boundaries to obtain the bin boundaries of the feature and the WOE value of each bin;
processing the feature matrices of the training set and the test set by replacing every value in them with the WOE value of the corresponding bin;
modeling using the training set.
Further, binning each feature according to the optimal bin boundaries includes: first choosing a relatively large number of bins, performing equal-frequency binning, and calculating the WOE value of each bin and the IV value of the feature; then merging similar bins according to the chi-square test value and recalculating the WOE value of each bin and the IV value of the feature, repeating until the number of bins falls to a relatively small value; and plotting the curve of IV value against number of bins to find the optimal number of bins and the boundaries of each bin.
According to an aspect of an embodiment of the present application, there is provided an apparatus for determining an enterprise performance credit level, including:
the acquisition module is used for acquiring invoice data of enterprises;
the modeling module is used for establishing a performance rating model according to the invoice data;
the calculation module is used for calculating the performance score of the enterprise by utilizing the performance rating model;
and the rating module is used for determining the performance credit level according to the performance score.
Further, the acquisition module is specifically configured to extract invoicing behavior data of the invoicing enterprise from the business system, perform ETL processing on the data, and store the result in the data warehouse.
Further, the modeling module is specifically configured to derive, from the invoice data, features that affect the enterprise performance risk, extract the derived features, model the sample data, identify the feature variables highly correlated with default risk, and establish a corresponding logistic regression model.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the method for determining the enterprise performance credit level.
According to another aspect of the embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the method for determining the enterprise performance credit level.
The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:
the method for determining the enterprise performance credit level provided by the embodiments of the application is soundly designed, can effectively monitor financial risk and quickly determine the credit level of an enterprise, improves working efficiency, is simple to use, and meets the needs of practical application.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As shown in Fig. 1, a first embodiment of the present application provides a method for determining an enterprise performance credit level, including:
S1, acquiring invoice data of the enterprise;
S2, establishing a performance rating model according to the invoice data;
S3, calculating the performance score of the enterprise by using the performance rating model;
S4, determining the performance credit level according to the performance score.
In some embodiments, acquiring the invoice data of the enterprise includes: extracting invoicing behavior data of the invoicing enterprise from a business system, performing ETL processing on the data, and storing the result in a data warehouse.
In some embodiments, establishing the performance rating model according to the invoice data includes:
deriving, from the invoice data, features that affect the enterprise performance risk, extracting the derived features, modeling the sample data, identifying the feature variables highly correlated with default risk, and establishing a corresponding logistic regression model.
In some embodiments, modeling the sample data after extracting the derived features includes:
finding the optimal number of bins and the bin boundaries for each feature;
binning each feature according to its optimal bin boundaries to obtain the bin boundaries of the feature and the WOE value of each bin;
processing the feature matrices of the training set and the test set by replacing every value in them with the WOE value of the corresponding bin;
modeling using the training set.
In some embodiments, modeling using the training set establishes a logistic regression model, and that logistic regression model serves as the performance rating model.
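The WOE replacement step described above can be illustrated with a minimal sketch. The feature names, bin boundaries, and WOE values below are hypothetical and are not taken from the application; the sketch only assumes that each bin is closed on its upper boundary and that the binning learned on the training set is reused unchanged for the test set.

```python
from bisect import bisect_left

# Hypothetical binning result learned on the training set: for each feature,
# the upper boundaries of every bin except the last, and one WOE value per bin.
BINNING = {
    "invoiced_months": {"edges": [6, 10], "woe": [0.9, 0.1, -0.7]},
    "void_ratio": {"edges": [0.05, 0.2], "woe": [-0.5, 0.2, 1.1]},
}


def to_woe(rows, binning):
    """Replace each raw feature value with the WOE value of its bin."""
    encoded = []
    for row in rows:
        encoded.append({
            name: spec["woe"][bisect_left(spec["edges"], row[name])]
            for name, spec in binning.items()
        })
    return encoded


train = [{"invoiced_months": 12, "void_ratio": 0.01},
         {"invoiced_months": 5, "void_ratio": 0.30}]
test = [{"invoiced_months": 8, "void_ratio": 0.10}]

# The same bin boundaries and WOE values are applied to both sets.
train_woe = to_woe(train, BINNING)
test_woe = to_woe(test, BINNING)
```

After this step every column holds WOE values only, so the logistic regression is fitted on WOE-encoded features rather than on raw values.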
In some embodiments, binning the features according to the optimal bin boundaries includes: first choosing a relatively large number of bins, performing equal-frequency binning, and calculating the WOE value of each bin and the IV value of the feature; then merging similar bins according to the chi-square test value and recalculating the WOE value of each bin and the IV value of the feature, repeating until the number of bins falls to a relatively small value; and plotting the curve of IV value against number of bins to find the optimal number of bins and the boundaries of each bin.
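The binning procedure described above might be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the application: the function names are hypothetical, only adjacent bins are merged, the pair with the smallest chi-square statistic is merged at each step, feature values are assumed to be distinct, and a smoothing constant `eps` (an assumption added here) avoids taking the logarithm of zero when a bin contains only one class.

```python
import math


def equal_freq_bins(values, labels, q):
    """Sort the samples and cut them into q bins of (nearly) equal size.
    Each bin is [low, high, n_good, n_bad]; label 1 marks a default."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    bins = []
    for k in range(q):
        chunk = pairs[k * n // q:(k + 1) * n // q]
        bad = sum(y for _, y in chunk)
        bins.append([chunk[0][0], chunk[-1][0], len(chunk) - bad, bad])
    return bins


def chi2(a, b):
    """Chi-square statistic of the 2x2 table formed by two adjacent bins."""
    rows = [(a[2], a[3]), (b[2], b[3])]
    total = sum(map(sum, rows))
    stat = 0.0
    for row in rows:
        for j in (0, 1):
            expected = sum(row) * (rows[0][j] + rows[1][j]) / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat


def woe_iv(bins, eps=0.5):
    """Return the WOE value of each bin and the IV value of the feature."""
    good_total = sum(b[2] for b in bins) + eps * len(bins)
    bad_total = sum(b[3] for b in bins) + eps * len(bins)
    woes, iv = [], 0.0
    for _, _, good, bad in bins:
        g, b = (good + eps) / good_total, (bad + eps) / bad_total
        woes.append(math.log(b / g))
        iv += (b - g) * woes[-1]
    return woes, iv


def iv_curve(values, labels, start_bins=10, min_bins=2):
    """Merge the most similar adjacent bins step by step and record,
    for every number of bins, the IV value and the bins themselves."""
    bins = equal_freq_bins(values, labels, start_bins)
    curve = {}
    while True:
        curve[len(bins)] = (woe_iv(bins)[1], [b[:] for b in bins])
        if len(bins) <= min_bins:
            return curve
        i = min(range(len(bins) - 1), key=lambda k: chi2(bins[k], bins[k + 1]))
        a, b = bins[i], bins[i + 1]
        bins[i:i + 2] = [[a[0], b[1], a[2] + b[2], a[3] + b[3]]]


# Synthetic data: low feature values default more often.
values = list(range(1, 41))
labels = [1 if v <= 10 or v % 9 == 0 else 0 for v in values]
curve = iv_curve(values, labels)
for n_bins in sorted(curve):
    print(n_bins, round(curve[n_bins][0], 3))
```

The printed pairs are the points of the curve of IV value against number of bins; the number of bins at which the IV value stops improving noticeably is a reasonable candidate for the optimal number of bins.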
The embodiment also provides an apparatus for determining the performance credit level of an enterprise, including:
the acquisition module is used for acquiring invoice data of enterprises;
the modeling module is used for establishing a performance rating model according to the invoice data;
the calculation module is used for calculating the performance score of the enterprise by utilizing the performance rating model;
and the rating module is used for determining the performance credit level according to the performance score.
In some embodiments, the acquisition module is specifically configured to extract invoicing behavior data of the invoicing enterprise from the business system, perform ETL processing on the data, and store the result in the data warehouse.
In some embodiments, the modeling module is specifically configured to combine the invoice data to derive features that affect the business performance risk, extract the derived features, model the sample data, find out feature variables highly related to the default risk, and establish a corresponding logistic regression model.
Compared with the prior art, the method of the embodiment can effectively save the cost of human resources, greatly improve the working efficiency, and is simple and convenient to use and low in maintenance cost.
The embodiment also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the method for determining the enterprise performance credit level.
The present embodiment also provides a computer-readable storage medium, on which a computer program is stored, the program being executed by a processor to implement the method for determining the enterprise performance credit level.
A second embodiment of the present application provides a method for determining an enterprise performance credit level, including:
1.1, extracting user invoicing data from the data system and creating a new data table to store it;
1.2, analyzing, through the invoice data, the factors that affect the enterprise performance risk, and designing a logistic regression model:
first, features that affect the enterprise performance risk are derived from the invoice data; after the derived features are extracted, the sample data is modeled, the feature variables highly correlated with default risk are identified, and a corresponding logistic regression model is established.
Logistic regression model:
The odds of an event (Odds) is the ratio of the probability that the event occurs to the probability that it does not occur. If the default probability of a customer is p, the probability of normal performance is 1-p, from which it follows that:
Odds=p/(1-p);
the default probability p of the customer can then be expressed as:
p=Odds/(1+Odds);
the scorecard expression is:
Score=A-B*log(Odds)
where A and B are constants and B is positive. Since the log function is monotonically increasing on (0, +∞), the greater the default odds of a user, the lower the Score.
Given:
(1) the score value S0 at a particular Odds value;
(2) the amount PDO by which the score decreases when that Odds value is doubled;
A and B can be found by substituting the given values S0 and PDO into the scorecard expression.
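As a worked illustration of solving for the two constants, the sketch below assumes that PDO is the number of points by which the score drops when Odds doubles, which follows from Score = A - B*log(Odds) with B positive. Substituting gives S0 = A - B*log(Odds0) and S0 - PDO = A - B*log(2*Odds0), hence B = PDO/log(2) and A = S0 + B*log(Odds0). The function names and the numbers used (600 points at odds of 1/60, PDO of 20) are hypothetical.

```python
import math


def scorecard_constants(s0, odds0, pdo):
    """Solve S0 = A - B*log(odds0) and S0 - PDO = A - B*log(2*odds0)."""
    b = pdo / math.log(2)
    a = s0 + b * math.log(odds0)
    return a, b


def score(a, b, odds):
    """Scorecard expression Score = A - B*log(Odds)."""
    return a - b * math.log(odds)


A, B = scorecard_constants(s0=600, odds0=1 / 60, pdo=20)
print(round(score(A, B, 1 / 60), 6))  # score at the reference odds
print(round(score(A, B, 2 / 60), 6))  # score after the odds double
```

With these inputs the score is 600 at the reference odds and 580 once the odds have doubled.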
Through the above analysis, the problem of computing the Score of a user is converted into the problem of finding the log odds of default, log(Odds), of that user.
A prediction function is constructed according to binary logistic regression:
hθ(x)=1/(1+e^(-θ^T x))
where hθ(x) represents the probability that the result takes the value 1, and θ^T x=θ0+θ1x1+...+θnxn.
The log odds of the event are derived as follows:
log(Odds)=log(hθ(x)/(1-hθ(x)))=θ^T x
It can be seen that, in the logistic regression model, the log odds of the output Y=1 is a linear function of the input x.
Thus, log(Odds)=θ0+θ1x1+...+θnxn
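The relationship between the prediction function and the log odds can be checked numerically. The sketch below uses the standard logistic (sigmoid) function; the parameter values and the input are arbitrary illustrative numbers, and the function name `h` is an assumption.

```python
import math


def h(theta, x):
    """Logistic prediction function: probability that the output is 1.
    theta[0] is the intercept and theta[1:] are the feature coefficients."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


theta = [-1.0, 0.8, -0.5]   # arbitrary illustrative parameters
x = [1.5, 2.0]              # arbitrary illustrative input

p = h(theta, x)
log_odds = math.log(p / (1 - p))
linear = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

# The log odds of the predicted probability equal the linear combination.
print(round(log_odds, 6), round(linear, 6))
```

Both printed values agree, which is the property the scorecard relies on: the score is an affine function of the same linear combination.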
1.3 analyzing the sample data and establishing a performance rating model
1) Data preprocessing is performed on the feature matrix X, including removal of duplicate values, filling of missing values (for a feature that is missing in only a few samples, the affected samples can be deleted directly; a feature such as the number of family members can be filled with the mean; a feature such as income can be filled using a random forest; and so on), handling of outliers, and handling of sample imbalance (using over-sampling and under-sampling methods).
2) The optimal number of bins and the bin boundaries are found for each feature. The optimal number of bins is the one for which the IV value of the feature falls within the preferred IV range as far as possible (so that the importance of each feature is as high as possible), with samples similar within each bin and differing strongly between bins. The steps are therefore: first choose a relatively large number of bins, perform equal-frequency binning, and calculate the WOE value of each bin and the IV value of the feature; then merge similar bins according to the chi-square test value and recalculate the WOE value of each bin and the IV value of the feature, repeating until the number of bins falls to a relatively small value. The curve of IV value against number of bins is then plotted to find the optimal number of bins and the boundaries of each bin.
3) Each feature is binned according to its optimal bin boundaries, yielding the bin boundaries of the feature and the WOE value of each bin.
4) The feature matrices X of the training set and the test set are processed by replacing every value in them with the WOE value of the corresponding bin.
5) Modeling is performed using the training set and the model score is calculated using the test set; learning curves are used to tune the regularization coefficient C and the maximum number of iterations max_iter so as to improve the model score.
6) A scorecard is made.
Score=A-B*log(Odds)
The values of the coefficients A and B are calculated from this formula, and the intercept lr.intercept_ and the coefficient of each feature lr.coef_ are obtained from the logistic regression. The base score of the scorecard is calculated by the formula base_score = A - B * lr.intercept_, and the score list of each feature (one score per bin) is calculated by col_score = woeall["i_colName"] * (-B * lr.coef_[0][i]), where woeall["i_colName"] holds the list of bin boundaries of the i-th feature and the corresponding list of WOE values.
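The scorecard construction in step 6) might look like the following sketch. All numbers are hypothetical: `intercept` and `coef` stand in for the fitted values lr.intercept_ and lr.coef_, the dictionary `woeall` and the bin labels are illustrative, and A and B are calibrated with assumed values (600 points at odds of 1/60, 20 points per doubling of the odds).

```python
import math

# Hypothetical scorecard calibration: 600 points at default odds of 1/60,
# and 20 points lost each time the odds double.
B = 20 / math.log(2)
A = 600 + B * math.log(1 / 60)

# Hypothetical fitted model (stand-ins for lr.intercept_ and lr.coef_).
intercept = -2.0
coef = {"invoiced_months": 0.8, "void_ratio": 0.6}

# Hypothetical WOE value of each bin of each feature.
woeall = {
    "invoiced_months": {"<=6": 0.9, "7-10": 0.1, ">10": -0.7},
    "void_ratio": {"<=5%": -0.5, "5-20%": 0.2, ">20%": 1.1},
}

# Base score and one score per bin of every feature.
base_score = A - B * intercept
col_score = {
    name: {label: woe * (-B * coef[name]) for label, woe in bins.items()}
    for name, bins in woeall.items()
}


def total_score(bins_hit):
    """Add the base score and the score of the bin hit for each feature."""
    return base_score + sum(col_score[f][b] for f, b in bins_hit.items())


enterprise = {"invoiced_months": ">10", "void_ratio": "<=5%"}
log_odds = intercept + sum(coef[f] * woeall[f][b] for f, b in enterprise.items())
print(round(total_score(enterprise), 2), round(A - B * log_odds, 2))
```

The two printed numbers coincide: summing the base score and the per-bin scores is the same as evaluating Score = A - B*log(Odds) on the log odds predicted by the model, which is why the scorecard can be applied by table lookup alone.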
1.4 determining an enterprise performance credit level according to a performance rating model.
A third embodiment of the present application provides a method for determining an enterprise performance credit level, including:
S10, extracting user invoice data from the data system and creating a new data table to store it;
S20, preprocessing the invoice behavior data and storing the preprocessed incremental data; the invoice behavior data includes the invoicing amount, changes in the invoicing amount, voided invoices, invoicing time, and invoicing behavior;
S30, analyzing, through the invoice data, the factors that affect the enterprise performance risk, and designing a logistic regression model;
S40, analyzing the sample data and establishing a performance rating model;
S50, determining the enterprise performance credit level by using the performance rating model.
Step S40 includes: establishing, from the invoice feature data of the selected samples, a logistic regression model of the factors affecting the enterprise performance credit;
Step S50 includes: calculating the performance risk score and level according to the logistic regression model.
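Fitting the logistic regression model of step S40 could be sketched as below. This is a plain gradient-descent illustration rather than the solver used by the application: the function names, learning rate, and sample data are hypothetical, and `C` is treated as an inverse regularization strength, which is an assumption about how the regularization coefficient is defined.

```python
import math


def fit_logistic(X, y, C=1.0, max_iter=2000, lr=0.1):
    """Gradient descent on the L2-regularized log loss.
    Returns (intercept, coefficients); a larger C means weaker regularization."""
    n, d = len(X), len(X[0])
    b, w = 0.0, [0.0] * d
    for _ in range(max_iter):
        grad_b, grad_w = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            grad_b += (p - yi) / n
            for j in range(d):
                grad_w[j] += (p - yi) * xi[j] / n
        b -= lr * grad_b
        w = [wj - lr * (gj + wj / (C * n)) for wj, gj in zip(w, grad_w)]
    return b, w


def predict_proba(b, w, x):
    """Predicted default probability for one WOE-encoded sample."""
    return 1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, x)))))


# One WOE-encoded feature; a higher WOE value goes with more defaults.
X = [[-0.7]] * 4 + [[0.1]] * 4 + [[0.9]] * 4
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0]

intercept, coefs = fit_logistic(X, y)
print(round(predict_proba(intercept, coefs, [-0.7]), 3),
      round(predict_proba(intercept, coefs, [0.9]), 3))
```

The predicted default probability is lower for the low-WOE bin than for the high-WOE bin, and these probabilities are what step S50 converts into a score and a level.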
Compared with the prior art, the method of the embodiment can effectively save the cost of human resources, greatly improve the working efficiency, and is simple and convenient to use and low in maintenance cost.
A fourth embodiment of the present application provides a method for determining an enterprise performance credit level, including:
Step 01, extracting data
Invoicing behavior data of the invoicing enterprise is extracted from the business system, subjected to ETL processing, and stored in a data warehouse; the data items mainly include the invoicing amount, red-letter (reversal) invoices, voided invoices, invoicing time, and other items in the invoice data.
To handle the large volume of enterprise invoice data encountered in practice, this embodiment adopts a data warehouse: an ETL process extracts, transforms, and loads the invoice data from the business system, and a customer-centric wide table containing multiple attributes is built and stored in the data warehouse for later modeling and analysis. Here, the data is extracted from the source system, converted into the data structure required by the analysis and aggregated, and, once conversion and aggregation are complete, loaded into the target data warehouse.
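A minimal sketch of such an ETL step is given below, using two in-memory SQLite databases as stand-ins for the business system and the data warehouse. The table names, column names, and sample rows are hypothetical and are not taken from the application.

```python
import sqlite3

# Stand-in for the business system (source) with raw invoice rows.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE invoice (seller TEXT, buyer TEXT, amount REAL, "
    "issued TEXT, status TEXT)"
)
source.executemany("INSERT INTO invoice VALUES (?, ?, ?, ?, ?)", [
    ("E1", "C1", 1000.0, "2023-01-15", "normal"),
    ("E1", "C2", 500.0, "2023-02-03", "normal"),
    ("E1", "C1", 200.0, "2023-02-20", "void"),
    ("E2", "C3", 300.0, "2023-01-09", "normal"),
])

# Stand-in for the data warehouse (target) with a customer-centric wide table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_wide (seller TEXT PRIMARY KEY, total_amount REAL, "
    "invoice_count INTEGER, invoiced_months INTEGER, void_count INTEGER, "
    "buyer_count INTEGER)"
)

# Extract and transform: one aggregated row per invoicing enterprise.
rows = source.execute("""
    SELECT seller,
           SUM(CASE WHEN status = 'normal' THEN amount ELSE 0 END),
           COUNT(*),
           COUNT(DISTINCT substr(issued, 1, 7)),
           SUM(CASE WHEN status = 'void' THEN 1 ELSE 0 END),
           COUNT(DISTINCT buyer)
    FROM invoice
    GROUP BY seller
""").fetchall()

# Load the aggregated rows into the warehouse.
warehouse.executemany("INSERT INTO customer_wide VALUES (?, ?, ?, ?, ?, ?)", rows)
warehouse.commit()

wide = warehouse.execute("SELECT * FROM customer_wide ORDER BY seller").fetchall()
print(wide)
```

Each row of the wide table gathers the attributes of one enterprise, so later modeling steps can read one record per enterprise instead of scanning the raw invoices.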
Step 02, data processing
Four kinds of processing are mainly performed: handling of missing values, handling of outliers, deduplication, and handling of noisy data. To gain a preliminary understanding of the data as a whole, an exploratory analysis is carried out first to obtain the basic attributes and distribution of the data; in addition, the relationships among the features of the data indicators can be explored preliminarily through univariate and multivariate analysis, so as to verify the hypotheses put forward in the business analysis stage.
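The cleaning operations listed above might be sketched as follows. The specific choices are assumptions made for illustration only: duplicates are detected by a record key, outliers are clipped to 1.5 interquartile ranges beyond the quartiles, and missing values are filled with the mean of the clipped values. The function and field names are hypothetical.

```python
from statistics import mean, quantiles


def preprocess(records, key, field):
    """Deduplicate on `key`, clip outliers of `field`, then fill missing values."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:          # keep the first copy of each record
            seen.add(rec[key])
            unique.append(dict(rec))

    present = [r[field] for r in unique if r[field] is not None]
    q1, _, q3 = quantiles(present, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    fill = mean(min(max(v, low), high) for v in present)

    for r in unique:
        value = fill if r[field] is None else r[field]
        r[field] = min(max(value, low), high)
    return unique


amounts = [100, 105, 110, 115, None, 120, 125, 130, 10000]
records = [{"id": f"E{i}", "amount": a} for i, a in enumerate(amounts, 1)]
records.append({"id": "E1", "amount": 100})   # duplicate record

clean = preprocess(records, "id", "amount")
print([r["amount"] for r in clean])
```

Clipping before filling keeps a single extreme invoice amount from distorting the value used for the missing entry.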
Invoicing amount: the invoicing amount reflects the operating income of the enterprise. Different credit products target different customer groups, and the invoicing amount indicates how well a credit product suits enterprises of different operating scales.
Invoicing frequency: the number of consecutive months with invoices within one year is a typical signal. An enterprise operating normally has many invoiced months, possibly all 12; conversely, fewer than 12 invoiced months may indicate problems in the enterprise and is a risk signal. As for invoice types, an ordinary invoice cannot be used to deduct input tax, which clearly distinguishes it from a special value-added tax invoice, so the proportion of special invoices is also a dimension of credit performance.
Proportion of red-letter and voided invoices: an invoice of a given amount may be reversed by a red-letter invoice or voided in the current or the following month. Returns and wrongly issued invoices should not be frequent in a normally operating enterprise, so an excessively high proportion of red-letter and voided invoices may indicate problems in operation or management and is a potential risk point.
Downstream customer concentration: the more downstream customers an enterprise has, the wider its sales channels and the smaller the influence of any single downstream customer, so concentration indicators such as the number of downstream customers and their shares are added to the features.
Downstream customer stability: the more stable the downstream customers of an enterprise, the more stable its operation and cash flow, so the proportion of lost customers and the proportion of new customers among its downstream customers are added to the indicators.
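The indicators discussed above could be derived from raw invoice rows roughly as follows. The record layout and the exact definitions are assumptions made for this sketch: ratios are counted over invoices, customer concentration is the amount share of the largest buyer, the lost-customer ratio is taken relative to the buyers of the previous period, and the new-customer ratio relative to the current buyers. The input is assumed to be non-empty.

```python
from collections import Counter


def derive_features(invoices, previous_buyers):
    """Derive invoice-based risk features for one enterprise over one year."""
    months = sorted({inv["month"] for inv in invoices})
    longest = run = 0
    for idx, month in enumerate(months):
        # Extend the run when this month directly follows the previous one.
        run = run + 1 if idx and month == months[idx - 1] + 1 else 1
        longest = max(longest, run)

    n = len(invoices)
    amount_by_buyer = Counter()
    for inv in invoices:
        if inv["status"] == "normal":
            amount_by_buyer[inv["buyer"]] += inv["amount"]
    total = sum(amount_by_buyer.values())
    buyers = set(amount_by_buyer)

    return {
        "invoiced_months": len(months),
        "longest_run_months": longest,
        "special_ratio": sum(i["kind"] == "special" for i in invoices) / n,
        "red_void_ratio": sum(i["status"] in ("red", "void") for i in invoices) / n,
        "buyer_count": len(buyers),
        "top_buyer_share": max(amount_by_buyer.values()) / total,
        "lost_ratio": len(previous_buyers - buyers) / len(previous_buyers),
        "new_ratio": len(buyers - previous_buyers) / len(buyers),
    }


invoices = [
    {"month": 1, "buyer": "C1", "amount": 1000.0, "kind": "special", "status": "normal"},
    {"month": 2, "buyer": "C1", "amount": 500.0, "kind": "special", "status": "normal"},
    {"month": 3, "buyer": "C2", "amount": 500.0, "kind": "ordinary", "status": "normal"},
    {"month": 3, "buyer": "C2", "amount": 200.0, "kind": "ordinary", "status": "void"},
    {"month": 5, "buyer": "C1", "amount": 300.0, "kind": "special", "status": "red"},
]
features = derive_features(invoices, previous_buyers={"C1", "C4"})
print(features)
```

Each returned value is one column of the feature matrix that is later binned and WOE-encoded.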
Step 03, single factor analysis
To ensure the modeling effect, trend analysis, correlation, feature importance, collinearity among features, and significance are calculated for each feature, and the feature variables that have strong predictive power for the prediction result, weak correlation with one another, and strong significance are retained.
The WOE value is a form of encoding of the original independent variable; to WOE-encode a variable, the variable must first be binned. WOE expresses the difference between the ratio of responding customers to non-responding customers in the current bin and the same ratio over all samples. This difference is expressed as the ratio of the two ratios, after which the logarithm is taken. The larger the WOE value, the greater the difference and the more likely a sample in the bin is to respond; the smaller the WOE value, the smaller the difference and the less likely a sample in the bin is to respond.
IV stands for Information Value; the IV value of a feature expresses the amount of information the feature carries.
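The two quantities can be computed from the counts of responding and non-responding customers in each bin. The sketch below uses the common definitions WOE_i = ln((r_i/R)/(n_i/N)) and IV = sum over bins of (r_i/R - n_i/N) * WOE_i, where r_i and n_i are the counts in bin i and R and N the totals; the bin counts are hypothetical and every bin is assumed to contain both classes.

```python
import math


def woe_iv(bin_counts):
    """bin_counts: list of (responding, non_responding) counts per bin.
    Returns the WOE value of each bin and the IV value of the feature."""
    resp_total = sum(r for r, _ in bin_counts)
    non_total = sum(n for _, n in bin_counts)
    woes, iv = [], 0.0
    for r, n in bin_counts:
        p_resp, p_non = r / resp_total, n / non_total
        woe = math.log(p_resp / p_non)
        woes.append(woe)
        iv += (p_resp - p_non) * woe
    return woes, iv


# Hypothetical bins: (defaulting customers, non-defaulting customers).
woes, iv = woe_iv([(10, 90), (30, 70), (60, 40)])
print([round(w, 3) for w in woes], round(iv, 3))
```

Bins with a larger share of responding customers receive a larger WOE value, and every bin contributes a non-negative term to the IV value.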
Step 04, logistic regression modeling
Logistic regression modeling is performed to predict the default probability of the enterprise, and the performance risk rating of the enterprise is determined according to the interval in which the default probability falls.
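Mapping a predicted default probability to a rating by interval could be sketched as below. The interval boundaries and level names are hypothetical; the application does not specify them.

```python
from bisect import bisect_right

# Hypothetical interval boundaries and rating levels (best to worst).
THRESHOLDS = [0.02, 0.05, 0.10, 0.20]
LEVELS = ["A", "B", "C", "D", "E"]


def credit_level(default_probability):
    """Return the rating whose interval contains the default probability.
    Intervals are closed on the left: [0, 0.02) is A, [0.02, 0.05) is B, etc."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default probability must lie in [0, 1]")
    return LEVELS[bisect_right(THRESHOLDS, default_probability)]


for p in (0.01, 0.02, 0.07, 0.50):
    print(p, credit_level(p))
```

Keeping the boundaries in a sorted list makes it easy to recalibrate the rating scale without touching the lookup logic.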
It should be noted that:
the term "module" is not intended to be limited to a particular physical form. Depending on the particular application, a module may be implemented as hardware, firmware, software, and/or combinations thereof. Furthermore, different modules may share common components or even be implemented by the same component. There may or may not be clear boundaries between the various modules.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the apparatus for determining the enterprise performance credit level according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not restricted to the exact order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The above-mentioned embodiments only express the embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.