Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides the network freight platform freight guarantee product risk-escaping behavior prediction method and system based on the interpretable machine learning.
The invention provides a network freight platform freight guarantee product risk-escaping behavior prediction method based on interpretable machine learning, which comprises the following steps:
S1, acquiring original sample data and the corresponding historical risk-out results from a network freight platform, and classifying the fields of the acquired original sample data to serve as candidate fields for the subsequent feature selection process;
S2, preprocessing the original sample data
Extracting the orders for which freight guarantee products were purchased from the original sample data obtained in step S1; cleaning and filtering the extracted order data to delete order data with missing values; converting categorical data in text form into numerical data so that it can be accepted by a machine learning algorithm; and applying binary classification labels to the corresponding historical risk-out results;
S3, performing undersampling and data set division based on the preprocessed data
Performing random undersampling so that the ratio of the number of risk-out samples to the number of non-risk-out samples is 1:1, and establishing from the result a prediction sample data set with complete data and standard format; extracting part of the sample data from the prediction sample data set as a training set, and taking the remaining sample data as a test set;
S4, performing feature selection based on the processed prediction sample data
Constructing a prediction model by adopting a plurality of machine learning algorithms based on the prediction sample data set, calculating marginal contribution of each characteristic parameter to a prediction result of the machine learning model, obtaining characteristic importance ranking, and selecting a plurality of characteristics with highest importance to be reserved as input characteristics of a subsequent prediction model;
S5, constructing, training and testing various machine learning prediction models
Extracting a new prediction sample data set based on the characteristics selected in the step S4, extracting part of sample data from the new prediction sample data set to serve as a new training set, and taking the rest part of sample data as a new testing set;
S6, comparing and evaluating multiple machine learning prediction models
According to a preset model evaluation method, the prediction accuracy and generalization of various machine learning prediction models are compared and analyzed, and an optimal prediction model is determined;
S7, processing a prediction request by the optimal prediction model obtained in the step S6, and returning a prediction result, wherein the returned prediction result comprises prediction of an order risk-out result and independent prediction of each characteristic risk-out behavior factor in the order.
Preferably, in S1, the feature classification includes at least three major categories of "driver feature", "vehicle feature", "order feature".
Preferably, the "driver feature" category at least comprises the fields of driver mobile phone number and driver name; the "vehicle feature" category at least comprises the fields of license plate number and vehicle type; and the "order feature" category at least comprises the fields of project ID, cargo secondary classification, origin, destination, mileage and order creation time.
Preferably, in S2, for categorical features in text form, LabelEncoder is used to encode the string-typed categorical data into numerical data so that it can be accepted by the machine learning algorithm.
Preferably, in S3, random undersampling is performed using the RandomUnderSampler class of the imbalanced-learn library.
Preferably, in S3, the extraction of the training set sample data is completed by using a train_test_split function.
Preferably, the proportion of the sample data assigned to the training set is greater than the proportion assigned to the test set.
Preferably, in S4, the marginal contribution of each feature parameter to the prediction result of the machine learning model is calculated through the feature_importances_ attribute built into the model.
Preferably, in S5, the extraction of the new training set sample data is done using the train_test_split function.
Preferably, the proportion of the new sample data assigned to the new training set is greater than the proportion assigned to the new test set.
Preferably, in S6, the model evaluation method includes:
Obtaining the prediction results of the multiple machine learning prediction models; determining the accuracy, precision, recall and F1 score of each prediction model based on those prediction results; and determining the optimal prediction model based on the accuracy, precision, recall and F1 score of each prediction model.
Preferably, the method also comprises S8, multi-method interpretability comparison analysis and risk factor weighting based on an optimal prediction model, wherein the specific method is as follows:
Importance ranking results of all feature parameters for predicting risk-out of the freight guarantee product are obtained separately using a plurality of interpretability analysis methods and are compared and analyzed. If the comparison result meets a preset condition, the importance rankings obtained by the different methods are determined to be similar, and a final importance ranking of the feature parameters is thereby determined. A risk factor weighting result for risk-out of the freight guarantee product is generated on the basis of the final importance ranking and is fed back to the pricing strategy department of the network freight platform to guide the formulation and optimization of the overall base price of the freight guarantee product on the project side.
The invention discloses a network freight platform freight guarantee product risk-escaping behavior prediction system based on interpretable machine learning, which comprises a data receiving module and a prediction module;
The data receiving module is used for receiving data input by a user, checking the validity of the received data and transmitting the checked valid data to the prediction module;
The prediction module comprises the optimal prediction model obtained in the step S6, and the prediction module automatically calls the prediction model in the prediction module to process a prediction request according to the received data and returns a prediction result, wherein the returned prediction result comprises the prediction of the order risk-out result and the independent prediction of each characteristic risk-out behavior factor in the order.
The invention discloses a freight guarantee product risk-escaping behavior prediction program development method based on an optimal prediction model, which comprises the following steps:
The optimal prediction model obtained in step S6 is saved as a model file, and the model file is loaded by HTML program code to obtain a web-based freight guarantee product risk prediction program. The program automatically calls the saved prediction model according to the parameter data input by the user, and the freight guarantee product risk prediction result is obtained after calculation by the prediction model; the result comprises the prediction of the order risk-out result and the independent prediction of the risk-out behavior factor of each feature in the order.
Based on the order features recorded by the platform, including project ID, order ID, driver ID, project_driver_history cumulative transport volume, cargo name, cargo secondary classification, vehicle type, mileage, order creation time, origin, destination and the like, the invention uses machine learning algorithms to identify risk factors, predict order risk-out, and interpret and analyze the importance of the related features, so that the platform can accurately identify order risk-out. Compared with the prior art, the invention has the following advantages:
1. A machine learning method enables fast and accurate prediction of order risk-out for freight guarantee products on the network freight platform; the risk-out outcome of an order can be determined without additional experiments, which greatly saves labor and time costs;
2. Feature selection is performed on the feature parameters recorded at the platform level by means of machine learning feature engineering, which removes the need in traditional risk prediction schemes for special hardware to record driver and vehicle feature data, and gives the method higher practical value;
3. To address the weak interpretability of traditional machine learning methods, the invention combines the freight guarantee product risk prediction model with multiple interpretability analysis methods, so that the prediction results of the machine learning model can be fully understood, the importance ranking of the feature parameters can be obtained, and the interpretability of the model is enhanced;
4. The importance degree of the order feature in the prediction model can be determined from a qualitative angle, so that effective feedback guidance is carried out on the optimization of the pricing strategy of the network freight platform in practical application.
Detailed Description
Referring to fig. 1, the network freight platform freight guarantee product risk-escaping behavior prediction method based on interpretable machine learning provided by the invention comprises the following steps:
S1, acquiring original sample data and the corresponding historical risk-out results from a network freight platform, and dividing the data fields into three major categories, "driver features", "vehicle features" and "order features", which serve as candidate fields for the subsequent feature selection process. The "driver features" category at least comprises fields such as driver mobile phone number and driver name; the "vehicle features" category at least comprises fields such as license plate number and vehicle type; and the "order features" category at least comprises fields such as project ID, cargo secondary classification, origin, destination, mileage and order creation time. The original sample data acquired from the network freight platform form the sample data for predicting risk-escaping behavior of freight guarantee products, and the collected historical risk-out results serve as the target parameter data, that is, the data labels, of the data set.
S2, preprocessing original sample data
The orders for which freight guarantee products were purchased are extracted from the original sample data obtained in step S1, cleaned and filtered, and order data containing missing values are deleted. For categorical features in text form, LabelEncoder is used to encode the string-typed categorical data into numerical data so that it can be accepted by a machine learning algorithm, and the corresponding historical risk-out results are given binary classification labels (that is, risk-out is marked as 1 and no risk-out is marked as 0).
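The preprocessing in step S2 can be sketched roughly as follows. This is a minimal illustration under assumptions: the column names (`bought_guarantee`, `vehicle_type`, `cargo_class`, `mileage_km`, `risk_out_result`) and the sample rows are hypothetical and are not taken from the platform's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw orders; every column name and value here is illustrative.
raw = pd.DataFrame({
    "bought_guarantee": [1, 1, 0, 1, 1],
    "vehicle_type": ["flatbed", "van", "van", None, "flatbed"],
    "cargo_class": ["steel", "coal", "coal", "steel", "grain"],
    "mileage_km": [120.0, 45.5, 300.0, 80.0, 210.0],
    "risk_out_result": ["yes", "no", "no", "yes", "no"],
})

# Keep only orders that purchased the guarantee product, then drop rows
# that contain missing values.
orders = raw[raw["bought_guarantee"] == 1].dropna().reset_index(drop=True)

# Encode each string-typed categorical column as integer codes.
encoders = {}
for col in ["vehicle_type", "cargo_class"]:
    encoders[col] = LabelEncoder()
    orders[col] = encoders[col].fit_transform(orders[col])

# Binary label: 1 = risk-out occurred, 0 = no risk-out.
orders["label"] = (orders["risk_out_result"] == "yes").astype(int)
orders = orders.drop(columns=["risk_out_result", "bought_guarantee"])
```

Keeping the fitted encoders in a dictionary makes it possible to apply the same category-to-integer mapping to new orders at prediction time.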
S3, undersampling processing and data set division are carried out based on the preprocessed data
Because the distribution of risk-out behavior for freight guarantee products on the network freight platform is markedly imbalanced, with risk-out samples (2.65%) far fewer than non-risk-out samples (97.35%), the RandomUnderSampler class of the imbalanced-learn library is used for random undersampling so that the ratio of the number of risk-out samples to the number of non-risk-out samples is 1:1. The sampled data form a prediction sample data set with complete data and standard format.
Using the train_test_split function, 70% of the sample data are extracted from the established prediction sample data set as a training set, and the remaining 30% are used as a test set. The train_test_split function automates data set division and saves considerable time; by setting a random seed it performs a reproducible random division, which avoids human bias and ensures the randomness of the data set. The train_test_split function is called in the following form:
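The effect of 1:1 random undersampling can be illustrated with the sketch below. It does not call RandomUnderSampler itself; it is a NumPy-only stand-in for the same idea, and the helper name `random_undersample` and the synthetic labels are assumptions made for illustration.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n = min(len(pos_idx), len(neg_idx))
    keep = np.concatenate([
        rng.choice(pos_idx, size=n, replace=False),
        rng.choice(neg_idx, size=n, replace=False),
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Synthetic labels with the imbalance reported above (2.65% positive).
y = np.zeros(10000, dtype=int)
y[:265] = 1
X = np.arange(10000).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y)
```

After the call, the balanced set holds 265 risk-out rows and 265 non-risk-out rows; the remaining majority-class rows are discarded.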
x_train, x_test, y_train, y_test = train_test_split(
    train_data, train_target, test_size=test_size, random_state=random_state, shuffle=shuffle)
Here, x_train is the feature parameters of the training set after division; x_test is the feature parameters of the test set after division; y_train is the target parameter of the training set after division; y_test is the target parameter of the test set after division; train_data is the sample feature parameter data set to be divided; and train_target is the sample target parameter data set to be divided. test_size is the size of the test set: a floating-point number between 0 and 1 represents the proportion of the data assigned to the test set, an integer represents the number of samples in the test set, and the remaining data are assigned to the training set. random_state is the random number seed that determines how the data set is shuffled and divided: if a different seed is set each time, train_test_split divides the data set into training and test sets differently each time, and if the same seed is set each time, the same training and test sets are obtained each time. shuffle indicates whether the data are randomly shuffled before division: if True, the order of the samples is shuffled before the split, and if False, the samples are divided in their original order.
In this embodiment, the training set and the test set are divided by adopting the train_test_split function method, and the related parameter setting conditions of the functions are as follows:
train_data is the feature parameter data selected from the data set;
the train_target is target parameter data, namely a historical risk-out result;
test_size is 0.3 and random_state is 42.
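The parameter settings above can be exercised with a short sketch. The arrays below are synthetic placeholders for the feature and target data; only `test_size=0.3` and `random_state=42` come from this embodiment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders: 100 samples with 2 features and a binary target.
train_data = np.arange(200).reshape(100, 2)
train_target = np.array([0, 1] * 50)

x_train, x_test, y_train, y_test = train_test_split(
    train_data, train_target, test_size=0.3, random_state=42, shuffle=True
)

# Repeating the call with the same seed reproduces the same split.
x_train2, x_test2, _, _ = train_test_split(
    train_data, train_target, test_size=0.3, random_state=42, shuffle=True
)
```

With 100 samples, this yields 70 training rows and 30 test rows, and the second call returns the same rows as the first.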
S4, performing feature selection based on the processed prediction sample data
Based on the processed prediction sample data set, prediction models are constructed separately using a decision tree (Decision Tree) algorithm, a random forest (RandomForest) algorithm, a K-nearest-neighbor (KNN) algorithm and an extreme gradient boosting (XGBoost) algorithm. The marginal contribution of each feature parameter to the prediction result of the machine learning model is calculated using the feature_importances_ attribute built into the model to obtain a feature importance ranking. The feature importances given by the four prediction models are averaged, and the several features with the highest average importance are retained as input features of the subsequent prediction models. The specific number of selected features may be 3, 4, 5, or 6.
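A rough sketch of averaging importances across several models is given below, with two substitutions that are assumptions rather than the method described here. First, scikit-learn's KNeighborsClassifier does not expose a feature_importances_ attribute, so permutation importance is used for it as a stand-in. Second, scikit-learn's GradientBoostingClassifier stands in for XGBoost to keep the example free of extra dependencies. The data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 8 candidate features, 3 of which carry signal.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=42)
feature_names = [f"f{i}" for i in range(X.shape[1])]

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),  # XGBoost stand-in
}

importances = []
for name, model in models.items():
    model.fit(X, y)
    if hasattr(model, "feature_importances_"):
        imp = model.feature_importances_
    else:
        # Assumption: permutation importance replaces the missing attribute for KNN.
        result = permutation_importance(model, X, y, n_repeats=5, random_state=42)
        imp = np.clip(result.importances_mean, 0, None)
    total = imp.sum()
    importances.append(imp / total if total > 0 else imp)

# Average the normalized importances and keep the top-k features.
mean_importance = np.mean(importances, axis=0)
top_k = 4
selected = [feature_names[i] for i in np.argsort(mean_importance)[::-1][:top_k]]
```

Normalizing each model's importances before averaging keeps a single model with a larger numeric scale from dominating the ranking.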
S5, constructing, training and testing various machine learning prediction models
A new prediction sample data set is extracted based on the features selected in step S4. Using the train_test_split function, 70% (or 80%) of the sample data are extracted from the new prediction sample data set as a new training set, and the remaining 30% (or 20%) are used as a new test set. A plurality of machine learning prediction models are constructed based on the new training set and the new test set, including a decision tree (Decision Tree) prediction model, a random forest (RandomForest) prediction model, a K-nearest-neighbor (KNN) prediction model and an extreme gradient boosting (XGBoost) prediction model. The optimal hyperparameter combination of each model is determined by grid search, with the performance of each hyperparameter combination evaluated by ten-fold cross-validation. Each prediction model is then trained and tested using its optimal hyperparameter combination, and the freight guarantee product risk prediction result of each model is output.
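Grid search combined with ten-fold cross-validation can be sketched with scikit-learn's GridSearchCV as follows. The parameter grid, the scoring choice and the synthetic data are assumptions made for illustration; only one of the four model types is shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature-selected prediction sample data set.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical grid; a real grid would be chosen per model type.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

# cv=10 evaluates every hyperparameter combination with ten-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=10, scoring="f1"
)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # refit on the full training set
test_accuracy = best_model.score(X_test, y_test)
```

The held-out test set is used only once, after the search, so the reported test score is not influenced by the hyperparameter selection.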
S6, comparing and evaluating multiple machine learning prediction models
According to the two-classification model evaluation method, the prediction accuracy and generalization of various machine learning prediction models are compared and analyzed, an optimal prediction model is determined, and model evaluation indexes comprise:
Accuracy: the proportion of correctly classified samples to the total number of samples.
Precision: the ratio of the number of samples correctly classified as positive by the classifier to the total number of samples output as positive by the classifier.
Recall: the ratio of the number of samples correctly classified as positive by the classifier to the total number of samples that are actually positive.
F1 score (F1-Score): the harmonic mean of precision and recall, which takes both into account.
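The evaluation indexes are calculated as follows:
For a binary classification problem, the confusion matrix is a 2x2 matrix over the classes 0 and 1, in which each row represents a true value and each column represents a predicted value. Denoting its entries as true negatives (TN), false positives (FP), false negatives (FN) and true positives (TP), the indexes defined above are:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$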
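The four indexes defined above can be computed directly from the confusion matrix counts, as in the sketch below. The label vectors are made-up illustrative values, not results from the platform.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative true and predicted labels (1 = risk-out, 0 = no risk-out).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# Rows are true values and columns are predicted values, in label order [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

For these vectors the counts are TN = 4, FP = 1, FN = 1 and TP = 4, so all four indexes equal 0.8.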
And selecting the prediction model with the highest comprehensive performance of the evaluation index as an optimal prediction model.
S7, processing a prediction request by the optimal prediction model obtained in the step S6, and returning a prediction result, wherein the returned prediction result comprises prediction of an order risk-out result and independent prediction of each characteristic risk-out behavior factor in the order.
A traditional machine learning model behaves much like a black box, and its prediction results suffer from opacity and weak interpretability. Carrying out an interpretability analysis of the prediction model is therefore of great significance for helping users understand the specific prediction process of the model and for determining the importance ranking of each feature parameter.
In model interpretability analysis, the key to determining the importance ranking of the feature parameters is to understand the degree to which each feature parameter contributes to the model prediction. This contribution is typically measured by different metrics, which correspond to different model interpretability analysis methods. The methods used here comprise the feature_importances_ method built into the model and the SHAP (SHapley Additive exPlanations) method.
The feature_importances_ method is commonly used for tree-based models such as decision trees, random forests and gradient boosting trees. The importance of each feature parameter is calculated during model training from the Gini importance or the information gain, so that after training is complete the importance ranking of the feature parameters can be obtained through the feature_importances_ attribute, and the feature parameters with the greatest influence on model prediction can be identified, which helps in understanding the prediction process of the model.
The SHAP method is a game-theory-based method for interpreting the model output of a single prediction. The marginal contribution of each feature parameter to the model prediction result is defined by calculating the SHAP value of each feature parameter in the sample data set, that is, the value assigned to that feature parameter. For each sample in the sample data set, the prediction model generates a predicted value equal to the baseline value plus the sum of the SHAP values of the feature parameters. The SHAP values of the feature parameters obey the following expression:
$$y_i = y_{base} + f(x_{i1}) + f(x_{i2}) + \cdots + f(x_{ik})$$
where $y_i$ is the final model prediction value for the $i$-th sample; $y_{base}$ is the baseline value of the model, i.e., the mean of all sample target parameter data in the sample data set; $x_{i1}$ is the 1st feature parameter of the $i$-th sample, $x_{i2}$ is the 2nd feature parameter of the $i$-th sample, and so on; $f(x_{i1})$ is the SHAP value of $x_{i1}$, i.e., the contribution of the 1st feature parameter of the $i$-th sample to the final model prediction value, $f(x_{i2})$ is the SHAP value of $x_{i2}$, i.e., the contribution of the 2nd feature parameter of the $i$-th sample to the final model prediction value, and so on.
The advantage of the SHAP method is that it provides an intuitive explanation of the extent to which a feature parameter influences the model prediction and also indicates the sign of that influence: when $f(x_{ik}) > 0$, the feature parameter acts positively on the final model prediction value, and when $f(x_{ik}) < 0$, it acts negatively. In addition, the SHAP method can provide explanations at both the global and the local level, and is applicable to various types of models, including tree models, linear models and neural networks.
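The additive relationship above can be checked on a small example by computing exact Shapley values by brute force. The sketch below does not use the SHAP library; the toy `model` function, the background sample and the helper names are all assumptions for illustration. Here the baseline value is taken as the mean model output over the background sample, which is the choice under which the sum holds exactly.

```python
from itertools import combinations
from math import factorial

import numpy as np

def model(X):
    """Toy risk score with an interaction term; stands in for a trained model."""
    return 0.5 * X[:, 0] + 0.2 * X[:, 1] * X[:, 2] - 0.1 * X[:, 2]

def coalition_value(x, background, subset):
    """Mean model output when the features in `subset` take the values of x
    and all other features keep their background values."""
    mixed = background.copy()
    cols = list(subset)
    if cols:
        mixed[:, cols] = x[cols]
    return model(mixed).mean()

def shapley_values(x, background):
    """Exact Shapley value of every feature for one sample x."""
    n = x.shape[0]
    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = (coalition_value(x, background, subset + (j,))
                        - coalition_value(x, background, subset))
                phi[j] += weight * gain
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, -1.0])

phi = shapley_values(x, background)
y_base = model(background).mean()
y_pred = model(x[None, :])[0]
```

The brute-force enumeration grows exponentially with the number of features, which is why practical SHAP implementations rely on model-specific or sampling-based approximations.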
In this embodiment, the freight guarantee product risk prediction model determined to be optimal is the extreme gradient boosting (XGBoost) model, which calculates the importance of the feature parameters during training and has a built-in feature_importances_ attribute. Therefore, two methods, feature_importances_ and SHAP, are selected for the interpretability analysis of the model, which helps to quickly determine the risk-out outcome of an order and the importance ranking of the feature parameters, and guides the formulation of the pricing strategy for freight guarantee products based on the result.
S8, multi-method interpretability comparison analysis and risk factor weighting based on an optimal prediction model, wherein the specific method is as follows:
Importance ranking results of all feature parameters for predicting risk-out of the freight guarantee product are obtained separately using a plurality of interpretability analysis methods and are compared and analyzed. If the comparison result meets a preset condition, the importance rankings obtained by the different methods are determined to be similar, and a final importance ranking of the feature parameters is thereby determined. A risk factor weighting result for risk-out of the freight guarantee product is generated on the basis of the final importance ranking and is fed back to the pricing strategy department of the network freight platform to guide the formulation and optimization of the overall base price of the freight guarantee product on the project side.
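One possible way to compare two importance rankings and derive risk factor weights is sketched below. The text does not specify the comparison condition, so the use of Spearman rank correlation, the 0.8 threshold, the averaging rule and all the numbers are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical importance scores from two interpretability methods.
feature_names = ["mileage", "cargo_class", "vehicle_type", "origin", "destination"]
builtin_importance = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # feature_importances_
mean_abs_shap = np.array([0.90, 0.50, 0.20, 0.30, 0.10])       # mean |SHAP value|

# Assumed similarity condition: Spearman rank correlation above a threshold.
rho, _ = spearmanr(builtin_importance, mean_abs_shap)
SIMILARITY_THRESHOLD = 0.8  # hypothetical value
rankings_similar = rho >= SIMILARITY_THRESHOLD

# If the rankings agree, average the normalized scores to obtain weights.
weights = None
final_ranking = None
if rankings_similar:
    weights = (builtin_importance / builtin_importance.sum()
               + mean_abs_shap / mean_abs_shap.sum()) / 2
    final_ranking = [feature_names[i] for i in np.argsort(weights)[::-1]]
```

In this made-up example the two methods disagree only on the order of two adjacent features, giving a rank correlation of 0.9, so the rankings are treated as similar and the weights sum to 1.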
Referring to fig. 2-3, the network freight platform freight guarantee product risk-escaping behavior prediction system based on interpretable machine learning disclosed by the invention comprises a data receiving module and a prediction module;
The data receiving module is used for receiving data input by a user, checking the validity of the received data and transmitting the checked valid data to the prediction module, wherein the data input by the user comprises one or more of driver characteristic data, vehicle characteristic data and order characteristic data.
The prediction module comprises the optimal prediction model obtained in the step S6, and the prediction module automatically calls the prediction model in the prediction module to process a prediction request according to the received data and returns a prediction result, wherein the returned prediction result comprises the prediction of the order risk-out result and the independent prediction of each characteristic risk-out behavior factor in the order.
The invention discloses a freight guarantee product risk-escaping behavior prediction program development method based on an optimal prediction model, which comprises the following steps:
The optimal prediction model obtained in step S6 is saved as a model file and loaded by HTML program code to obtain a web-based freight guarantee product risk prediction program. The specific process is as follows:
A Web application and an API are created using the lightweight Web framework Flask, and the trained optimal prediction model is stored in the back-end service for processing prediction requests. The prediction endpoint is defined with the @app.route decorator and handles POST requests: it acquires the feature data input by the user, performs validity checking and encoding, runs the prediction with the trained model, and returns the prediction result. The program automatically calls the prediction model according to the parameter data input by the user and obtains the freight guarantee product risk prediction result after calculation; the result comprises the prediction of the order risk-out result and the independent prediction of the risk-out behavior factor of each feature in the order.
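A minimal Flask prediction endpoint along these lines might look like the sketch below. The route path `/predict`, the field names in `FEATURES`, the response keys and the stand-in model trained on synthetic data are all assumptions; a real service would load the saved model file instead, and this sketch returns only the order-level prediction.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, already-encoded input fields expected from the client.
FEATURES = ["mileage_km", "vehicle_type_code", "cargo_class_code"]

# Stand-in model fitted on synthetic data; a real service would load
# the saved optimal model from its model file here.
rng = np.random.default_rng(42)
X_demo = rng.random((200, len(FEATURES)))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Validity check: the body must be JSON and contain every expected field.
    payload = request.get_json(silent=True) or {}
    missing = [f for f in FEATURES if f not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400
    try:
        row = np.array([[float(payload[f]) for f in FEATURES]])
    except (TypeError, ValueError):
        return jsonify({"error": "all fields must be numeric"}), 400

    # Predict the probability of risk-out and threshold it at 0.5.
    proba = float(model.predict_proba(row)[0, 1])
    return jsonify({"risk_out": int(proba >= 0.5), "probability": proba})
```

A client would send a POST request with a JSON body containing the three fields and receive a JSON response with the predicted label and its probability; requests with missing or non-numeric fields are rejected with status 400.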
Because the reliability of the model has been verified in the above steps, the developed freight guarantee product risk prediction program can serve as an effective prediction tool. It provides the network freight platform with an intuitive and convenient interface, helps the platform quickly determine the risk-out outcome of an order and the importance ranking of the feature parameters, and guides the formulation of the pricing strategy for freight guarantee products based on the result. In addition, the freight guarantee product risk-escaping behavior prediction system is convenient, fast and accurate, and the model does not need to be retrained, which saves a large amount of time.
In summary, based on the order features recorded by the platform, including project ID, order ID, driver ID, project_driver_history cumulative transport volume, cargo name, cargo secondary classification, vehicle type, mileage, order creation time, origin, destination and the like, the invention uses machine learning algorithms to identify risk factors, predict order risk-out, and interpret and analyze the importance of the related features, so that the platform can accurately identify order risk-out. Compared with the prior art, the invention has the following advantages:
1. Feature engineering is used to select, from the other feature data that the network freight platform already records for each order, the features used to predict order risk-out, which avoids the difficulty of collecting vehicle driving data faced by traditional prediction methods;
2. A machine learning method enables fast and accurate prediction of order risk-out for freight guarantee products on the network freight platform; the risk-out outcome of an order can be determined without additional experiments, which greatly saves labor and time costs;
3. Feature selection is performed on the feature parameters recorded at the platform level by means of machine learning feature engineering, which removes the need in traditional risk prediction schemes for special hardware to record driver and vehicle feature data, and gives the method higher practical value;
4. To address the weak interpretability of traditional machine learning methods, the invention combines the freight guarantee product risk prediction model with multiple interpretability analysis methods, so that the prediction results of the machine learning model can be fully understood, the importance ranking of the feature parameters can be obtained, and the interpretability of the model is enhanced;
5. The importance degree of the order feature in the prediction model can be determined from a qualitative angle, so that effective feedback guidance is carried out on the optimization of the pricing strategy of the network freight platform in practical application.
6. The HTML program for predicting risk-out of freight guarantee products can automatically predict order risk-out with one click from the imported order feature data, which avoids redundant experiments and facilitates the popularization and application of the trained prediction model in industrial practice.
The foregoing is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution and the inventive concept of the present invention, shall fall within the scope of protection of the present invention.