CN118134047A - Information prediction method, device, equipment and storage medium

Info

Publication number: CN118134047A
Application number: CN202410301832.0A
Authority: CN
Inventors: 危红康; 朱钰森; 宋新彤; 吴剑飞; 刘柏
Original assignee: Netease Hangzhou Network Co Ltd
Current assignee: Netease Hangzhou Network Co Ltd
Priority date: 2024-03-14
Filing date: 2024-03-14
Publication date: 2024-06-04

Abstract

The application provides an information prediction method, apparatus, device and storage medium. The method comprises: inputting feature data to be predicted of a target user into a loss prediction model to obtain a loss probability corresponding to the target user, wherein the loss prediction model is trained based on sample user feature data and comprises a plurality of primary models and a secondary model, the secondary model being trained based on prediction results of the plurality of primary models; and, if the loss probability is greater than a preset threshold, performing attribution analysis using at least one attribution algorithm according to the feature data to be predicted and the loss probability, to obtain the loss cause of the target user. The application improves the prediction accuracy of the loss prediction model and can analyze the causal relationship between user behavior features and the loss result, thereby obtaining the loss cause of the user.

Description

Information prediction method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to an information prediction method, an apparatus, a device, and a storage medium.

Background

In the game field, the group of players likely to be lost needs to be predicted in advance, the specific reasons for their loss analyzed, and targeted measures taken, so as to improve players' sense of achievement and experience in the game and reduce the player loss rate.

In the prior art, the cause of loss is generally determined by observing the feature differences between the lost population and the retained population. Specifically, data reports of lost players and retained players are built from player behavior logs, and the feature differences between the two groups are compared statistically to determine the loss cause. An early warning model is then constructed according to the loss cause, and early intervention is performed through the early warning model.

However, the statistical approach in the prior art has high labor cost, low efficiency and poor transferability. The loss causes obtained by the analysis are overly broad and do not apply to each lost individual, so accurate win-back cannot be achieved. Moreover, loss analysis and recall can only be performed on players who have already been lost, so the recall rate is low. In addition, the early warning model constructed with the existing method has poor interpretability, its prediction results are not comprehensive enough, and it is difficult to present the causal relationship between player behavior and the final loss result intuitively.

Disclosure of Invention

The application aims to provide an information prediction method, apparatus, device and storage medium to solve the problems of inaccurate loss causes, poor model interpretability and incomplete prediction in the prior art.

In order to achieve the above purpose, the technical scheme adopted by the application is as follows:

In a first aspect, the present application provides an information prediction method, the method comprising:

inputting feature data to be predicted of a target user into a loss prediction model to obtain loss probability corresponding to the target user, wherein the loss prediction model is obtained by training based on sample user feature data, the loss prediction model comprises a plurality of primary models and a secondary model, and the secondary model is obtained by training based on prediction results of the plurality of primary models;

and if the loss probability is greater than a preset threshold, carrying out attribution analysis processing by using at least one attribution algorithm according to the feature data to be predicted and the loss probability to obtain the loss information of the target user.

In a second aspect, the present application provides an information prediction apparatus, the apparatus comprising:

The prediction module is used for inputting the feature data to be predicted of the target user into the loss prediction model to obtain the loss probability corresponding to the target user, the loss prediction model is obtained by training based on the feature data of the sample user, the loss prediction model comprises a plurality of primary models and a secondary model, and the secondary model is obtained by training based on the prediction results of the plurality of primary models;

And the attribution module is used for carrying out attribution analysis processing by using at least one attribution algorithm according to the feature data to be predicted and the loss probability if the loss probability is larger than a preset threshold value, so as to obtain the loss information of the target user.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over a bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of an information prediction method according to any one of the first aspects.

In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of an information prediction method according to any of the first aspects.

The beneficial effects of the application are as follows: by training the model in an ensemble learning manner, the loss prediction model can fully integrate the features of the user, obtain more comprehensive and more accurate prediction results, and improve its prediction accuracy. By performing attribution analysis on the user's feature data to be predicted together with the loss probability, the causal relationship between user behavior features and the loss result can be obtained. This addresses the weak causal visualization capability of some models and the poor interpretability of black-box models, and helps game operators identify groups at risk of loss and formulate recall strategies.

In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 shows a flowchart of an information prediction method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of the architecture of a loss prediction model according to an embodiment of the present application;

FIG. 3 is a flowchart of training a loss prediction model provided by an embodiment of the present application;

FIG. 4 shows a flowchart of obtaining sample user feature data according to an embodiment of the present application;

FIG. 5 shows a flowchart of processing user feature data provided by an embodiment of the present application;

FIG. 6 is a flowchart of a process for encoding user feature data according to an embodiment of the present application;

FIG. 7 is a flowchart of determining loss information based on a first loss cause and a second loss cause, according to an embodiment of the present application;

FIG. 8 shows a flowchart of determining a contribution value of a feature tag provided by an embodiment of the present application;

FIG. 9 shows a flowchart of determining a second loss cause provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an information prediction apparatus according to an embodiment of the present application;

FIG. 11 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.

It should be noted that the term "comprising" will be used in embodiments of the application to indicate the presence of the features stated hereafter, but not to exclude the addition of other features.

In the first prior-art approach, historical player data is collected and data reports of lost players and retained players are constructed, so that the feature differences between the two groups can be compared statistically, the loss cause of the lost players determined, and game parameters adjusted accordingly.

However, this approach can only yield broad loss causes. Because of individual differences among players, the resulting loss causes do not apply to each lost individual, so accurate win-back cannot be achieved. In addition, loss analysis and recall can only be performed on players who have already been lost, so the recall rate is low.

In the second approach, a loss early warning model is constructed from the players' historical behavior logs and used to predict on the players' current game data, so as to determine whether a player will be lost and to intervene in advance.

However, loss early warning models are currently built by mathematical modeling with a single machine learning algorithm on the players' historical behavior logs. It is difficult to learn a model that is stable and performs well in all respects in this way, and the resulting model is poorly interpretable, so the causal relationship between player behavior and the final loss result cannot be presented intuitively.

Based on this, the application provides an information prediction method in which a plurality of basic machine learning models are trained on features extracted from users' historical game data, and the models are combined by an ensemble learning method to obtain a loss prediction model, improving the accuracy and comprehensiveness of its predictions. The prediction results of the loss prediction model are then analyzed with attribution algorithms, so that the cause of a user's loss can be determined more accurately and game operators can make decisions based on the loss cause.

Fig. 1 is a flowchart of an information prediction method according to the present application, where an execution subject of the method may be an electronic device, as shown in fig. 1, and the method includes:

S101, inputting feature data to be predicted of a target user into a loss prediction model to obtain loss probability corresponding to the target user, wherein the loss prediction model is obtained by training based on the feature data of a sample user, the loss prediction model comprises a plurality of primary models and a secondary model, and the secondary model is obtained by training based on prediction results of the plurality of primary models.

According to the application, the historical data of the lost user and the retained user can be obtained, the historical data of the lost user is taken as a positive sample, the historical data of the retained user is taken as a negative sample, and the historical data is subjected to data processing and labeling to obtain the sample user characteristic data.

Optionally, the loss prediction model may be an ensemble model consisting of a plurality of primary models and a secondary model, the secondary model being trained based on the predictions of the plurality of primary models. Referring to fig. 2, after each primary model is trained based on sample user feature data, the prediction results of the primary models may be assembled into a secondary training set, and the secondary model may be trained based on the secondary training set.

Each individual primary model has a certain accuracy, and the structures of the primary models can differ. For example, machine learning algorithms such as logistic regression, decision trees, support vector machines and neural networks may be chosen as the primary models, and the number of primary models is not limited herein.

Optionally, the secondary model may be any one of machine learning algorithms such as logistic regression, a decision tree, a support vector machine or a neural network.

It should be noted that, in the model training stage, modeling may be performed by ensemble learning. The weight of each primary model is learned when training the secondary model, and the primary models are combined based on the learned weights, so that the features learned by each primary model are fully integrated and a more comprehensive and more accurate prediction result is obtained. The result is a more stable loss prediction model with better generalization performance.

The feature data to be predicted can be obtained by processing data of the target user in a certain period of time. The target user may include a plurality of users, and after the feature data to be predicted of each user is input into the loss prediction model, the loss probability of each user may be obtained.

Optionally, after the feature data to be predicted of the target user is input into the loss prediction model, each primary model in the loss prediction model may perform prediction processing on the feature data to be predicted and output a prediction result to the secondary model, and the secondary model may predict the loss probability of the user according to the prediction result of each primary model.

As a possible implementation manner, when the feature data to be predicted of a plurality of users is input, the loss prediction model may predict on the feature data of each user in turn and output the loss probability of each user.
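By way of illustration, the two-stage prediction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the secondary model is assumed to be a logistic regression over the primary outputs, and the toy primary models, the weights, the bias and the feature layout (login days, play hours) are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

# A primary model maps one user's feature vector to a score in [0, 1].
PrimaryModel = Callable[[Sequence[float]], float]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def predict_loss_probability(
    features: Sequence[float],
    primary_models: List[PrimaryModel],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Two-stage prediction: primary models first, then the secondary model."""
    # Stage 1: every primary model scores the same feature vector.
    primary_outputs = [model(features) for model in primary_models]
    # Stage 2: the secondary model (assumed logistic regression) combines them.
    logit = bias + sum(w * z for w, z in zip(weights, primary_outputs))
    return sigmoid(logit)


# Hypothetical stand-ins for trained primary models.
# Assumed feature layout: [login days, play hours].
toy_primary_models: List[PrimaryModel] = [
    lambda x: 1.0 if x[0] < 3 else 0.0,
    lambda x: sigmoid(-0.5 * x[1]),
]

# Several users are predicted one after another.
users = {"user_a": [1.0, 0.5], "user_b": [6.0, 8.0]}
loss_probabilities = {
    user_id: predict_loss_probability(x, toy_primary_models, [2.0, 2.0], -2.0)
    for user_id, x in users.items()
}
```

In this toy setup the rarely active `user_a` receives a higher loss probability than the active `user_b`.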

S102, if the loss probability is larger than a preset threshold, carrying out attribution analysis processing by using at least one attribution algorithm according to the feature data to be predicted and the loss probability to obtain loss information of the target user.

The attribution algorithm may include post-hoc model interpretability algorithms such as the SHAP (SHapley Additive exPlanations) algorithm and LIME (Local Interpretable Model-agnostic Explanations). It may also include PDP (Partial Dependence Plot), ICE (Individual Conditional Expectation) and similar algorithms; the application is not limited in this respect.

Optionally, the loss information of the target user may include the loss cause of the target user, or the loss causes together with the loss probability of each loss cause.

Optionally, the attribution analysis may be performed based on one attribution algorithm, or based on a plurality of attribution algorithms, in which case the loss cause of the target user is determined from the analysis results of the respective attribution algorithms.

After the loss prediction model is obtained, the loss probability of a user can be obtained by inputting the feature data to be predicted. According to the loss probability and preset thresholds, users can be divided into high-, medium- and low-risk loss groups. Attribution analysis is then performed on the users or groups that need early warning, the specific loss causes are summarized, and corresponding early warning measures are formulated in a targeted manner.
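The grouping by risk level and the threshold check of step S102 can be sketched as below. The cut-off values 0.7, 0.4 and 0.5 are hypothetical; the application only states that preset thresholds are used.

```python
from typing import Dict, List


def risk_tier(probability: float, high: float = 0.7, medium: float = 0.4) -> str:
    """Map a loss probability to a risk tier (cut-offs are assumed values)."""
    if probability >= high:
        return "high"
    if probability >= medium:
        return "medium"
    return "low"


def select_for_attribution(
    probabilities: Dict[str, float], threshold: float = 0.5
) -> List[str]:
    """Return the users whose loss probability exceeds the preset threshold."""
    return [uid for uid, p in sorted(probabilities.items()) if p > threshold]


example_probabilities = {"user_a": 0.82, "user_b": 0.45, "user_c": 0.10}
tiers = {uid: risk_tier(p) for uid, p in example_probabilities.items()}
to_analyze = select_for_attribution(example_probabilities)
```

Only the users returned by `select_for_attribution` would be passed on to the attribution algorithms.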

It should be noted that, although the importance of the features may be checked in some machine learning algorithms, it is not possible to determine the relationship between the features and the final predicted result, i.e., whether the features of the user are positively correlated, negatively correlated, or otherwise complex with respect to the churn result, and how each feature affects an individual player. In the application, at least one attribution algorithm is adopted for attribution analysis processing, so that the relation between various characteristics of the user and the loss result can be analyzed, and the loss reason of the user can be determined.

According to the embodiment of the application, model training is carried out based on sample user characteristic data to obtain a loss prediction model, the loss prediction model comprises a plurality of primary models and a secondary model, the secondary model is obtained based on the prediction results of the plurality of primary models by training, the characteristic data to be predicted of a target user is input into the loss prediction model to obtain loss probability corresponding to the target user, and if the loss probability is greater than a preset threshold, attribution analysis processing is carried out by using at least one attribution algorithm according to the characteristic data to be predicted and the loss probability to obtain loss reasons of the target user.

By training the model in an ensemble learning manner, the loss prediction model can fully integrate the features of the user, obtain more comprehensive and more accurate prediction results, and improve its prediction accuracy. By performing attribution analysis on the user's feature data to be predicted together with the loss probability, the causal relationship between user behavior features and the loss result can be obtained. This addresses the weak causal visualization capability of some models and the poor interpretability of black-box models, and helps game operators identify groups at risk of loss and formulate recall strategies.

The training process of the loss prediction model is further described below. As shown in fig. 3, the process includes:

S301, model training is conducted based on sample user characteristic data, and a plurality of primary models and first prediction results of the primary models are obtained.

S302, combining the plurality of primary models and the secondary model into a loss prediction model.

Optionally, the sample user characteristic data may be input into each initial primary model respectively for training to obtain primary models corresponding to each initial primary model, and after the primary models are obtained, the sample user characteristic data is input into the primary models to obtain the first prediction results of each primary model.

As a possible implementation manner, the sample user feature data may be divided into a training set, a validation set and a test set. The primary models are trained on the training set and validated on the validation set. After the primary models are obtained, the data of the test set may be input into each primary model to obtain the first prediction result output by each primary model.

S303, performing model training based on the first prediction results of the primary models to obtain secondary models.

It should be noted that the training set of the secondary model is generated using the primary models. If the training set of the primary models were used directly to generate it, the risk of overfitting would be high. Therefore, the training samples of the secondary model are typically generated from samples that were not used to train the primary models, for example by cross-validation.

Taking k-fold cross-validation as an example, the training data set $D \in \mathbb{R}^{m \times d}$ of the primary models may be randomly divided into $k$ data sets $D = \{D_1, D_2, \ldots, D_k\}$ of similar size. The training data and test data of the $j$-th fold are then $\bar{D}_j$ and $D_j$, respectively, where $\bar{D}_j = D \setminus D_j$. For $N$ primary models, $h_n^{(j)}$ denotes the primary model $n$ learned by the $n$-th machine learning algorithm on the training data $\bar{D}_j$. For each sample $x_i$ in $D_j$, the output of primary model $n$ is $z_{in} = h_n^{(j)}(x_i)$. The secondary-model training sample generated by all $N$ primary models at sample point $x_i$ has the features $z_i = (z_{i1}, z_{i2}, \ldots, z_{iN})$, and the corresponding label is still $y_i$. After all $k$ folds have been processed by the $N$ primary models, the resulting secondary training set is $D' = \{(z_i, y_i)\}_{i=1}^{m}$, and $D'$ is then used to train the secondary model.
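The k-fold procedure for building the secondary training set can be sketched as follows. This is a minimal sketch: `make_threshold_learner` is a hypothetical toy learning algorithm used only to keep the example self-contained, and the fold assignment by striding over shuffled indices is one possible way to obtain folds of similar size.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Predictor = Callable[[Sample], float]
Learner = Callable[[List[Sample], List[int]], Predictor]


def make_threshold_learner(feature_index: int) -> Learner:
    """Toy learning algorithm (hypothetical): split one feature at the
    midpoint between the mean of lost users and the mean of retained users."""
    def fit(X: List[Sample], y: List[int]) -> Predictor:
        values = [x[feature_index] for x in X]
        lost = [v for v, label in zip(values, y) if label == 1]
        kept = [v for v, label in zip(values, y) if label == 0]
        overall = sum(values) / len(values)
        mean_lost = sum(lost) / len(lost) if lost else overall
        mean_kept = sum(kept) / len(kept) if kept else overall
        cut = (mean_lost + mean_kept) / 2.0
        direction = 1.0 if mean_lost >= mean_kept else -1.0
        return lambda x: 1.0 if direction * (x[feature_index] - cut) > 0 else 0.0
    return fit


def build_secondary_training_set(
    X: List[Sample], y: List[int], learners: List[Learner], k: int, seed: int = 0
) -> Tuple[List[List[float]], List[int]]:
    """Return (Z, y): Z[i] holds the out-of-fold outputs of all primary models."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[j::k] for j in range(k)]  # k folds of similar size
    Z: List[List[float]] = [[] for _ in X]
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Each primary model is fitted without seeing the held-out fold ...
        models = [learn(X_train, y_train) for learn in learners]
        # ... and then predicts exactly those held-out samples.
        for i in held_out:
            Z[i] = [model(X[i]) for model in models]
    return Z, list(y)


# Assumed feature layout: [login days, days since last payment].
X_demo = [[1, 9], [2, 8], [1, 7], [8, 1], [9, 2], [7, 3]]
y_demo = [1, 1, 1, 0, 0, 0]
Z_demo, y_secondary = build_secondary_training_set(
    X_demo, y_demo, [make_threshold_learner(0), make_threshold_learner(1)], k=3
)
```

Because every row of `Z_demo` comes from models that never saw that sample during fitting, the secondary model is trained on predictions that resemble those it will receive for unseen users.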

After training of the primary models and the secondary model is completed, they may be combined into a loss prediction model. It should be noted that the weight of each primary model is also learned during the training of the secondary model, so the secondary model can predict based on the weight of each primary model and the first prediction result of each primary model to obtain the loss probability of the target user.
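How the secondary model can learn one weight per primary model is sketched below, assuming the secondary model is a logistic regression trained by batch gradient descent. The learning rate, the number of epochs and the toy training data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def train_secondary_model(
    Z: List[Sequence[float]], y: List[int], epochs: int = 2000, lr: float = 0.5
) -> Tuple[List[float], float]:
    """Fit logistic-regression weights over the primary-model outputs Z."""
    n_models = len(Z[0])
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for z, label in zip(Z, y):
            logit = bias + sum(w * v for w, v in zip(weights, z))
            error = 1.0 / (1.0 + math.exp(-logit)) - label
            for n in range(n_models):
                grad_w[n] += error * z[n]
            grad_b += error
        # Average-gradient step on the log-loss.
        weights = [w - lr * g / len(Z) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(Z)
    return weights, bias


# Column 0: an informative primary model; column 1: a nearly uninformative one.
Z_train = [[0.9, 0.5], [0.8, 0.4], [0.2, 0.6], [0.1, 0.5]]
y_train = [1, 1, 0, 0]
learned_weights, learned_bias = train_secondary_model(Z_train, y_train)
```

In this toy data the informative primary model ends up with the larger weight, which is the sense in which the secondary model learns how much to trust each primary model.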

Next, a process of generating the sample user feature data before the model training based on the sample user feature data will be described, as shown in fig. 4, the process includes:

S401, extracting corresponding initial user characteristic data from the historical user data according to the preset characteristic labels.

S402, performing feature processing on the initial user feature data to obtain sample user feature data.

Optionally, the feature tags may be preset based on the user's data. Each feature tag may correspond to one or more fields in the user's data, and each feature tag may correspond to one or more loss causes. The historical user data is joined and aggregated based on the feature tags to obtain the initial feature data corresponding to each feature tag.

For example, the loss cause corresponding to the feature tag "level" may be "upgrading too slowly".

Taking the loss of game users as an example, loss causes can be divided into external causes and internal causes. External causes are often difficult to reflect in the user's game data, so for the game scenario the loss causes in the present application refer to internal causes, such as slow game progress, game balance problems, and lack of social activity. Relevant feature tags, such as consumption, growth, dungeon progress and friends, are designed for the users' game behavior logs, and each feature tag is mapped to the corresponding fields in the game log tables. Because different types of game behavior logs are usually stored in separate tables and the data volume is huge, Hive can be used to join the relevant tables and aggregate the data, obtaining the initial user feature data corresponding to each feature tag.
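The mapping from feature tags to log fields and the per-user aggregation can be sketched in plain Python as below. The application performs this step with Hive on large log tables; the in-memory version here only illustrates the logic, and the table names, field names and aggregation functions are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical mapping: feature tag -> (log table, field, aggregation).
FEATURE_TAGS: Dict[str, Tuple[str, str, Callable[[List[Any]], Any]]] = {
    "consumption": ("payment_log", "amount", sum),
    "level": ("growth_log", "level", max),
    "friends": ("social_log", "friend_id", lambda values: len(set(values))),
}

# Each feature tag may correspond to one or more loss causes.
TAG_TO_CAUSES = {"level": ["upgrading too slowly"]}


def build_initial_features(
    tables: Dict[str, List[Dict[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Join the log tables on user_id and aggregate one value per feature tag."""
    collected: Dict[str, Dict[str, List[Any]]] = defaultdict(lambda: defaultdict(list))
    for tag, (table_name, field, _) in FEATURE_TAGS.items():
        for row in tables.get(table_name, []):
            collected[row["user_id"]][tag].append(row[field])
    features: Dict[str, Dict[str, Any]] = {}
    for user_id, per_tag in collected.items():
        features[user_id] = {
            # Users without any row for a tag get 0 for that tag.
            tag: aggregate(per_tag[tag]) if per_tag.get(tag) else 0
            for tag, (_, _, aggregate) in FEATURE_TAGS.items()
        }
    return features


example_tables = {
    "payment_log": [
        {"user_id": "u1", "amount": 30},
        {"user_id": "u1", "amount": 12},
    ],
    "growth_log": [
        {"user_id": "u1", "level": 5},
        {"user_id": "u1", "level": 9},
        {"user_id": "u2", "level": 3},
    ],
    "social_log": [
        {"user_id": "u1", "friend_id": "f1"},
        {"user_id": "u1", "friend_id": "f1"},
    ],
}
initial_features = build_initial_features(example_tables)
```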

Specifically, in the step S402, the process of performing feature processing on the initial user feature data to obtain sample user feature data, as shown in fig. 5, includes:

S501, according to the data type of the feature tag, the initial user feature data corresponding to the feature tag is subjected to coding processing to obtain coded data.

The feature tags may include both discrete and continuous data types. Different encoding methods can be adopted for the different data types, so that the initial user feature data of each feature tag is processed into encoded data.

S502, classifying the user types of the coded data based on the initial user characteristic data to obtain classified data.

Different types of users, such as paying players and non-paying players, often differ greatly in loss rate and loss cause. Unified modeling would greatly affect the accuracy of the final model predictions, so different types of player groups should be modeled separately.

S503, determining an observation window and a performance window based on a preset observation point, and adding annotation information to the classified data according to the user data of the observation window and the user data of the performance window.

An observation point can be set for the user feature data. The time window before the observation point is defined as the observation window, which is usually used to obtain the player's feature variables, namely the input features of the model. The time window after the observation point is defined as the performance window, which is generally used to define whether the player is lost. After determining whether the player is lost based on the data of the performance window and the observation window, annotation information can be added to the classified data. For example, a retained user may be labeled 0 and a lost user labeled 1.

The observation point, the time span of the performance window and the time span of the observation window can be set based on historical experience or business requirements.
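The labeling with an observation window and a performance window can be sketched as follows. The sketch assumes that "lost" means no login inside the performance window, and the window lengths of 14 and 7 days are hypothetical values; the application leaves the exact definition and spans to experience or business requirements.

```python
from datetime import date, timedelta
from typing import List, Optional


def label_user(
    login_dates: List[date],
    observation_point: date,
    observation_days: int = 14,
    performance_days: int = 7,
) -> Optional[int]:
    """Return 1 for a lost user, 0 for a retained user, or None when the user
    has no activity in the observation window (no input features available)."""
    observation_start = observation_point - timedelta(days=observation_days)
    performance_end = observation_point + timedelta(days=performance_days)
    active_before = any(
        observation_start <= d < observation_point for d in login_dates
    )
    active_after = any(
        observation_point <= d < performance_end for d in login_dates
    )
    if not active_before:
        return None
    return 0 if active_after else 1


observation_point = date(2024, 3, 1)
retained_label = label_user([date(2024, 2, 20), date(2024, 3, 3)], observation_point)
lost_label = label_user([date(2024, 2, 25)], observation_point)
no_data_label = label_user([date(2024, 1, 1)], observation_point)
```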

S504, eliminating abnormal points of the classified data added with the labeling information to obtain sample user characteristic data.

In some cases, a user may be active in the observation window but have no login in the performance window, which is most likely caused by external factors. Therefore, when analyzing the influence of internal factors on user loss, these sample points may be regarded as abnormal sample points, that is, sample points that cannot be accurately described by the internal features, and should be removed.

As a possible implementation manner, k-means clustering can be performed on the users' sample data according to lost users and retained users to obtain a cluster of lost users and a cluster of retained users. Abnormal data points are then identified in the two clusters according to the actual loss status of each user in the performance window, and removing these abnormal data points reduces the interference of external factors with model prediction.
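One possible reading of this k-means based filtering is sketched below. It is an assumption, not the procedure of the application: two clusters are initialized from the mean of the retained samples and the mean of the lost samples, a fixed number of k-means iterations is run, and a sample is treated as abnormal when its label from the performance window disagrees with the cluster it falls into. Both classes are assumed to be present in the data.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean_point(points: List[Sequence[float]]) -> List[float]:
    return [sum(column) / len(points) for column in zip(*points)]


def assign_clusters(X, centroids) -> List[int]:
    return [
        min((0, 1), key=lambda c: squared_distance(x, centroids[c])) for x in X
    ]


def remove_abnormal_samples(
    X: List[Sequence[float]], y: List[int], iterations: int = 20
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep only samples whose label matches their cluster (0 retained, 1 lost)."""
    # Cluster c starts at the mean of the samples labeled c.
    centroids = [
        mean_point([x for x, label in zip(X, y) if label == c]) for c in (0, 1)
    ]
    for _ in range(iterations):
        assignment = assign_clusters(X, centroids)
        for c in (0, 1):
            members = [x for x, a in zip(X, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    assignment = assign_clusters(X, centroids)
    keep = [i for i, (a, label) in enumerate(zip(assignment, y)) if a == label]
    return [X[i] for i in keep], [y[i] for i in keep]


X_raw = [[1, 1], [1.2, 0.9], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.8, 5.2], [5, 5.1]]
y_raw = [1, 1, 1, 0, 0, 0, 1]  # the last user looks retained but is labeled lost
X_clean, y_clean = remove_abnormal_samples(X_raw, y_raw)
```

The last sample behaves like the retained cluster yet is labeled lost, so it is dropped as a point that the internal features cannot explain.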

Further, in the step S501, according to the data type of the feature tag, the process of encoding the initial user feature data corresponding to the feature tag to obtain encoded data, as shown in fig. 6, includes:

S601, if the feature tag is a discrete feature, determining a target coding mode according to the feature value of the initial user feature data corresponding to the feature tag, and coding the initial user feature data corresponding to the feature tag based on the target coding mode.

Optionally, when the feature tag is a discrete feature, the target encoding mode may be label encoding or one-hot encoding, and is determined according to the relationship between the feature values. If there is an ordering (magnitude) relationship between the values of the discrete feature, label encoding may be used; otherwise one-hot encoding is used.

Label encoding means that the feature values are mapped to 0, 1, 2, 3 and so on according to the order of the data. One-hot encoding means converting the feature into a vector of 0s and 1s.
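The two encoding modes for discrete features, label (tag) encoding and one-hot encoding, can be sketched as follows. The example categories (VIP tiers and character classes) are hypothetical.

```python
from typing import List, Sequence


def label_encode(values: Sequence[str], ordered_categories: Sequence[str]) -> List[int]:
    """Ordered categories: map each value to its rank 0, 1, 2, ..."""
    rank = {category: i for i, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]


def one_hot_encode(values: Sequence[str], categories: Sequence[str]) -> List[List[int]]:
    """Unordered categories: one 0/1 vector per value."""
    return [[1 if value == category else 0 for category in categories] for value in values]


# A feature with a natural order uses label encoding.
vip_codes = label_encode(["gold", "bronze", "silver"], ["bronze", "silver", "gold"])

# A feature without a natural order uses one-hot encoding.
class_vectors = one_hot_encode(["mage", "warrior"], ["warrior", "mage", "archer"])
```

Label encoding keeps the ordering information in a single column, while one-hot encoding avoids implying an order that does not exist.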

S602, if the feature tag is a continuous feature, the initial user feature data corresponding to the feature tag is encoded according to a preset reference encoding mode.

Optionally, continuous feature tags may be processed according to a preset reference encoding mode such that, after processing, each feature dimension has a mean of 0 and a variance of 1. As an example, a specific calculation method is shown in formula (1):

$$x' = \frac{x - \mu}{\sigma} \qquad (1)$$

where $x'$ is the normalized data, $x$ is the original data, $\mu$ is the mean of the original data, and $\sigma$ is the standard deviation of the original data.
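The standardization of a continuous feature to zero mean and unit variance can be sketched as below. Using the population standard deviation and returning zeros for a constant feature are assumptions of this sketch.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


play_hours = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
play_hours_scaled = standardize(play_hours)
```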

Next, the process of performing attribution analysis using at least one attribution algorithm according to the feature data to be predicted and the loss probability, to obtain the loss information of the target user, is described. As shown in fig. 7, step S102 includes:

S701, carrying out feature analysis on feature data to be predicted based on a first attribution algorithm and a loss prediction model to obtain at least one first loss reason.

S702, carrying out feature analysis on feature data to be predicted and loss probability based on a second attribution algorithm and a loss prediction model to obtain at least one second loss reason.

Optionally, the first attribution algorithm may be the SHAP algorithm, which can calculate the Shapley value of each feature tag based on the feature data to be predicted. The Shapley value characterizes the contribution of the feature tag to the final loss result: the larger the Shapley value, the greater the influence of the feature tag on the loss result. The main factors affecting player loss can therefore be analyzed by calculating Shapley values.

Optionally, the second attribution algorithm may be the LIME algorithm. LIME is a local interpretability method that provides an understanding of individual feature tags or distributions within a small range. By performing feature analysis on the feature data to be predicted and the loss probability, at least one second loss cause and a correlation value for each second loss cause can be obtained, where the correlation value indicates the importance of the second loss cause.

S703, determining the loss information of the target user according to the first loss reason and the second loss reason.

In a first implementation, the second fluid loss cause may be verified based on the first fluid loss cause, and the attrition information of the target user is determined in at least one second fluid loss cause.

In a second implementation, the first attribution algorithm may output the first loss reasons and the Shapley value of each first loss reason, and the second attribution algorithm may output the second loss reasons and the related value of each second loss reason. A weighted calculation is performed on the Shapley values of the first loss reasons and the related values of the second loss reasons, the calculation results are sorted, and the loss information of the target user is determined according to the sorting result.

Further, the process of performing feature analysis on the feature data to be predicted based on the first attribution algorithm to obtain the first loss reason includes:

And determining the contribution value of each feature label in the feature data to be predicted based on the first attribution algorithm and the loss prediction model.

And determining a first lost reason according to the contribution value of each characteristic label.

The contribution value represents the degree of influence of a feature label on the loss result. After the contribution value of each feature label is determined, the loss reason corresponding to each feature label whose contribution value is larger than a preset threshold is determined as a first loss reason.

Next, with reference to a specific example, the above process of determining the contribution value of each feature tag in the feature data to be predicted based on the first attribution algorithm and the attrition prediction model will be described, and as shown in fig. 8, the process includes:

S801, carrying out data replacement on a plurality of first feature labels in feature data to be predicted to obtain a plurality of first feature data.

S802, predicting the plurality of first characteristic data based on the loss prediction model to obtain a plurality of first loss probabilities.

S803, analyzing and processing each first loss probability based on the first attribution algorithm to obtain a contribution value of each first feature label in the feature data to be predicted.

It should be noted that, when determining the first loss probabilities, data replacement may be performed on the data of each first feature label in turn, with each resulting first feature data predicted to obtain a first loss probability. Alternatively, data replacement may be performed on all first feature labels in parallel, and the resulting plurality of first feature data predicted in parallel to obtain the plurality of first loss probabilities. The application does not limit the specific processing order.

Optionally, the first feature label may be any one of the feature labels.

Wherein, the calculation of the contribution value can be realized by the following formula (2).
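Formula (2) itself is not reproduced in this text. Assuming it is the standard Shapley value definition, which is consistent with the symbols explained in the paragraph that follows, it can be written as:

$$\phi_j(val) = \sum_{S \subseteq \{x_1,\ldots,x_M\} \setminus \{x_j\}} \frac{|S|!\,(M-|S|-1)!}{M!}\,\bigl(val(S \cup \{x_j\}) - val(S)\bigr)$$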

Where S is a subset of the features used in the model, x is the vector of feature values of the instance to be explained, M is the number of features, {x_1, …, x_M} is the set of all input features, {x_1, …, x_M} \ {x_j} is the set of all input features excluding x_j, the combinatorial coefficient (|S|!(M − |S| − 1)!/M! in the standard Shapley value formula) is the weight of subset S, val(S) is the SHAP value of subset S, and val(S ∪ {x_j}) is the SHAP value of the new subset after feature x_j is added.
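As an illustration of the definitions above, the following is a minimal sketch that enumerates every subset S and applies the standard Shapley weighting. The function name `shapley_values`, the toy linear scoring function, and the baseline vector are hypothetical and are not taken from the application; exact enumeration is only practical for a small number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(val, M):
    """Exact Shapley values for a value function val(frozenset of feature indices)."""
    phi = []
    for j in range(M):
        others = [i for i in range(M) if i != j]
        total = 0.0
        for size in range(M):
            # Weight of a subset S with |S| = size (standard Shapley weight).
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for subset in combinations(others, size):
                S = frozenset(subset)
                # Marginal contribution of feature j when added to S.
                total += weight * (val(S | {j}) - val(S))
        phi.append(total)
    return phi


# Hypothetical toy example: a linear score stands in for the loss prediction model.
x = [3.0, 1.0, 2.0]          # instance to be explained
baseline = [1.0, 1.0, 0.0]   # replacement values for features absent from S
w = [0.1, 0.2, 0.05]         # toy model coefficients


def toy_val(S):
    # Features in S keep their original value, the rest take the baseline value.
    return sum(w[i] * (x[i] if i in S else baseline[i]) for i in range(len(x)))


contributions = shapley_values(toy_val, len(x))
```

For a linear toy model such as this one, each contribution reduces to the coefficient multiplied by the difference between the original and the baseline value, which makes the result easy to check by hand.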

Taking Kernel SHAP as an example, the data of any one feature label in the feature data to be predicted may first be replaced to obtain first feature data. Illustratively, an indication vector z′ ∈ {0,1}^M is taken, where 1 indicates that the feature is not replaced and 0 indicates that the feature is replaced. The indication vector is converted into the original feature space: each position with value 1 takes the original feature value of the feature data to be predicted, and each position with value 0 is replaced by the feature value of a randomly drawn sample. Denoting this conversion function by h_x, the loss prediction model is then applied to the converted feature data, f(h_x(z′)), to obtain a first loss probability.
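The conversion from an indication vector back to the original feature space can be sketched as follows. This is a minimal sketch under stated assumptions: the names `h_x` and `first_loss_probability`, the background sample list, and the toy model are hypothetical, and the model is passed in as a plain callable rather than the trained loss prediction model.

```python
import random


def h_x(z, x, background, rng):
    """Map a 0/1 indication vector z to the original feature space.

    Positions with 1 keep the original value from x; positions with 0 take
    the value of a randomly drawn background sample.
    """
    sample = rng.choice(background)
    return [x[j] if z[j] == 1 else sample[j] for j in range(len(x))]


def first_loss_probability(model, z, x, background, rng):
    """Predict the converted feature data, i.e. compute f(h_x(z'))."""
    return model(h_x(z, x, background, rng))


# Hypothetical usage with a single background sample and a toy model.
rng = random.Random(0)
x = [10, 20, 30]
background = [[1, 2, 3]]
replaced = h_x([1, 0, 1], x, background, rng)
probability = first_loss_probability(lambda f: sum(f) / 100, [1, 0, 1], x, background, rng)
```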

After data replacement has been performed for each feature label in turn and a plurality of first loss probabilities have been obtained, the average of all the first loss probabilities may be calculated to obtain a reference value, the weight of each indication vector is calculated, and the contribution value of each feature label is calculated based on the weights of the indication vectors and the reference value.

Specifically, the weight of each indication vector may be determined using the following formula (3).
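Formula (3) is not reproduced in this text. Assuming it is the standard Kernel SHAP weighting kernel (the symbol π_x is an assumption introduced here), the weight of an indication vector z′ would be:

$$\pi_x(z') = \frac{M-1}{\binom{M}{|z'|}\,|z'|\,(M-|z'|)}$$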

Where z′ denotes the indication vector, M is the number of features, and |z′| is the number of 1s in z′, i.e. the number of features currently present. After the weight of each indication vector is determined, the contribution value of the feature label may be determined from the indication vectors with higher weights. The specific calculation can be realized by the following formula (4).
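Formula (4) is likewise not reproduced in this text. Assuming it is the additive linear explanation model used by Kernel SHAP, it can be written as:

$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j\, z'_j$$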

Where φ_j is the Shapley value of the j-th feature label, φ_0 is the reference value, and g(z′) is a linear model.
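The weighting of indication vectors and the evaluation of the linear model can be sketched as below. The sketch assumes the standard Kernel SHAP kernel for the weight; the function names `kernel_weight`, `rank_indication_vectors`, and `g` are hypothetical, and the sketch does not fit the φ values, it only shows how the weights order the indication vectors and how g(z′) is evaluated once the φ values are known.

```python
from itertools import product
from math import comb


def kernel_weight(M, ones):
    """Kernel SHAP weight of an indication vector with `ones` features present."""
    if ones == 0 or ones == M:
        # The all-zero and all-one vectors receive infinite weight in the
        # standard formulation and are usually handled as constraints.
        return float("inf")
    return (M - 1) / (comb(M, ones) * ones * (M - ones))


def rank_indication_vectors(M):
    """List all non-trivial indication vectors, highest weight first."""
    vectors = [z for z in product((0, 1), repeat=M) if 0 < sum(z) < M]
    return sorted(vectors, key=lambda z: kernel_weight(M, sum(z)), reverse=True)


def g(z, phi0, phi):
    """Evaluate the linear model: reference value plus contributions of present features."""
    return phi0 + sum(p * zj for p, zj in zip(phi, z))


ranked = rank_indication_vectors(4)
```

Vectors with very few or very many features present receive the highest weights, because they isolate the effect of a single feature most directly.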

After the contribution value of each feature label of each user in the feature data to be predicted is obtained, the loss reasons can be analyzed from the individual, group, and overall perspectives.

When the analysis is performed from the individual perspective, the feature labels may be ranked by contribution value. A higher contribution value indicates that the feature label has a greater influence on the final loss result. A threshold may therefore be set for the contribution value, and the loss reasons corresponding to all feature labels whose contribution value exceeds the threshold are determined as the first loss reasons of the individual.
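The individual-level selection described above can be sketched as follows. This is a minimal sketch: the function name `individual_loss_reasons`, the feature label names, and the threshold value are hypothetical, and the feature label itself stands in for its corresponding loss reason.

```python
def individual_loss_reasons(contributions, threshold):
    """Rank feature labels by contribution value and keep those above the threshold."""
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [label for label, value in ranked if value > threshold]


# Hypothetical contribution values for one user.
user_contributions = {"level": 0.4, "login_days": 0.1, "guild": 0.25}
reasons = individual_loss_reasons(user_contributions, threshold=0.2)
```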

If the real reasons affecting a player's churn cannot be judged from the individual perspective alone, the application can also explore the churn reasons from the perspective of the lost group. When analyzing from the group perspective, the process of determining the first loss reason according to the contribution value of each feature label includes:

and clustering the feature labels of the user group based on the contribution value to obtain a plurality of clusters of the lost user group.

Optionally, in a supervised clustering manner, the contribution values of the feature data to be predicted on each feature label may be used as the features to be clustered, and the loss reasons of the lost group can be obtained by examining the resulting clusters.

And sorting the clusters based on the contribution values, and determining at least one first loss reason from the sorting result according to a preset reference value.

As a possible implementation manner, after the clusters are obtained, the clusters may be ordered according to the magnitude of the contribution value thereof, the feature labels corresponding to the clusters with the largest contribution value are determined, and the loss reasons corresponding to the feature labels are used as the first loss reason.

For example, if after clustering the feature labels corresponding to the cluster with the highest contribution value include class and challenge checkpoint, the loss reasons corresponding to class and challenge checkpoint can be used as the first loss reasons.
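The group-level analysis can be sketched as below. The application does not name a clustering algorithm, so plain k-means over the per-user contribution vectors is used here purely as an assumption; the function names `kmeans` and `group_loss_reasons`, the deterministic initialization, and the sample values are all hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Tiny k-means; the first k points are used as initial centers (assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each user to the nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def group_loss_reasons(feature_labels, contribution_rows, k, top_n):
    """Cluster users by contribution values and report the top labels of the strongest cluster."""
    _, centers = kmeans(contribution_rows, k)
    # Rank clusters by total mean contribution and keep the largest one.
    top = max(range(k), key=lambda c: sum(centers[c]))
    ranked = sorted(zip(feature_labels, centers[top]), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:top_n]]


# Hypothetical contribution values: one row per lost user, one column per feature label.
feature_labels = ["class", "challenge_checkpoint", "login_days"]
rows = [
    [0.50, 0.40, 0.00],
    [0.00, 0.10, 0.05],
    [0.60, 0.50, 0.10],
    [0.05, 0.00, 0.10],
]
group_reasons = group_loss_reasons(feature_labels, rows, k=2, top_n=2)
```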

When analyzing from the overall point of view, the process of determining the first loss cause according to the contribution value of each feature tag includes:

and determining the sum of the contribution values of all the characteristic labels of all the users, sorting the characteristic labels according to the sum of the contribution values of all the characteristic labels, and determining a first loss reason according to the sorting result.

For example, after the contribution value of each user on each feature label is determined, the sum of the contribution values on each feature label may be calculated and the feature labels ranked by this sum; the loss reasons corresponding to the feature labels whose sums rank in the top n are taken as the first loss reasons.
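The overall analysis can be sketched as a sum-and-rank over all users. The function name `overall_loss_reasons` and the sample data are hypothetical, and the feature label again stands in for its corresponding loss reason.

```python
def overall_loss_reasons(per_user_contributions, n):
    """Sum contribution values per feature label over all users and return the top n labels."""
    totals = {}
    for contributions in per_user_contributions:
        for label, value in contributions.items():
            totals[label] = totals.get(label, 0.0) + value
    ranked = sorted(totals, key=lambda label: totals[label], reverse=True)
    return ranked[:n]


# Hypothetical contribution values for three users.
all_users = [
    {"level": 0.4, "guild": 0.1, "pvp": 0.2},
    {"level": 0.3, "guild": 0.0, "pvp": 0.3},
    {"level": 0.1, "guild": 0.2, "pvp": 0.3},
]
overall = overall_loss_reasons(all_users, n=2)
```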

The following further describes the process of performing feature analysis on the feature data to be predicted and the loss probability based on the second attribution algorithm and the loss prediction model to obtain at least one second loss reason. As shown in fig. 9, step S702 includes:

And S901, modifying the feature data to be predicted to obtain modified feature data.

In this step, any one piece of data in the feature data to be predicted may be selected and slightly modified to obtain modified feature data that remains similar to the feature data to be predicted. It should be noted that the feature data to be predicted is the feature data of a plurality of users, and when the data is modified, the feature data of one of the users may be slightly modified.

S902, predicting the modified characteristic data based on the loss prediction model to obtain a second prediction result.

And inputting the modified characteristic data into the loss prediction model to obtain a second prediction result of the modified characteristic data.

S903, combining the modified characteristic data and the second prediction result into a new data set, and training an interpretability model based on the new data set to obtain the interpretability model.

S904, carrying out feature analysis on the loss probability according to the interpretability model to obtain at least one second loss cause.

Optionally, after the modified feature data and the second prediction result are combined into a new data set, a linear model is trained on the new data set as the interpretability model. The loss probability obtained in step S102 is then explained based on the interpretability model to obtain the related value of each feature label, where the related value describes the influence of the feature label on the loss result. The feature labels are sorted by related value, and the loss reasons corresponding to the top n feature labels are determined as the second loss reasons.
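Steps S901 to S904 can be sketched as a simplified LIME-style procedure. This is a sketch under stated assumptions, not the LIME library itself: Gaussian perturbation, an exponential proximity kernel, and a weighted least-squares linear surrogate are choices made here for illustration, and the names `solve`, `lime_explain`, and `top_reasons` are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def lime_explain(model, x, n_samples=200, scale=0.1, width=1.0, seed=0):
    """Return one related value per feature from a local weighted linear surrogate."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in x]   # S901: slightly modify the data
        rows.append([1.0] + z)                       # leading 1.0 is the intercept column
        targets.append(model(z))                     # S902: second prediction result
        dist2 = sum((a - b) ** 2 for a, b in zip(z, x))
        weights.append(math.exp(-dist2 / width ** 2))  # closer samples weigh more
    d = len(x) + 1
    # S903: fit the surrogate via weighted normal equations (X^T W X) beta = X^T W y.
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(d)] for i in range(d)]
    b = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets)) for i in range(d)]
    beta = solve(A, b)
    return beta[1:]                                  # S904: coefficients as related values


def top_reasons(feature_labels, related_values, n):
    """Rank feature labels by the magnitude of their related value."""
    ranked = sorted(zip(feature_labels, related_values), key=lambda item: abs(item[1]), reverse=True)
    return [label for label, _ in ranked[:n]]


# Hypothetical toy model standing in for the loss prediction model.
toy_model = lambda z: 0.2 + 0.5 * z[0] - 0.3 * z[1]
related = lime_explain(toy_model, [1.0, 2.0])
```

Because the toy model is itself linear, the surrogate recovers its coefficients almost exactly; for a real loss prediction model the coefficients are only a local approximation around the explained sample.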

After the first and second loss reasons are obtained, the loss information may be determined based on the at least one first loss reason and the at least one second loss reason, the process comprising:

And verifying at least one first loss reason according to at least one second loss reason, and determining the first loss reason which is the same as the second loss reason as the loss information of the target user.

In a first implementation, the first loss cause that is the same as the second loss cause may be used as the loss cause of the target user, and the loss cause may be used as the loss information of the target user.

In the second implementation, a weighted calculation may be performed on the related values of the second loss reasons and the contribution values of the first loss reasons to obtain a weighted value for each feature label; the feature labels are ranked based on the weighted values, and the loss information of the target user is determined from the ranking result.
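The two implementations can be sketched as follows. The function names, the equal weighting factor `alpha`, and the sample values are hypothetical; the application does not specify how the weights are chosen, and the sketch assumes the contribution values and related values are on comparable scales.

```python
def verified_reasons(first_reasons, second_reasons):
    """First implementation: keep the first loss reasons that also appear as second loss reasons."""
    second = set(second_reasons)
    return [reason for reason in first_reasons if reason in second]


def weighted_reasons(contribution_values, related_values, alpha=0.5, n=2):
    """Second implementation: weighted sum of contribution and related values, top n labels."""
    labels = sorted(set(contribution_values) | set(related_values))
    score = {
        label: alpha * contribution_values.get(label, 0.0)
        + (1 - alpha) * related_values.get(label, 0.0)
        for label in labels
    }
    return sorted(labels, key=lambda label: score[label], reverse=True)[:n]


# Hypothetical outputs of the first and second attribution algorithms.
common = verified_reasons(["level", "guild", "pvp"], ["pvp", "level"])
combined = weighted_reasons({"level": 0.4, "guild": 0.1}, {"level": 0.2, "pvp": 0.5})
```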

Based on the same inventive concept, the embodiment of the present application further provides an information prediction device corresponding to the information prediction method, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the information prediction method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.

Fig. 10 is a schematic structural diagram of an information prediction apparatus according to an embodiment of the present application, where, as shown in fig. 10, the apparatus includes:

the prediction module 1001 is configured to input feature data to be predicted of a target user into a loss prediction model, to obtain a loss probability corresponding to the target user, where the loss prediction model is obtained by training based on feature data of a sample user, the loss prediction model includes a plurality of primary models and a secondary model, and the secondary model is obtained by training based on prediction results of the plurality of primary models;

And an attribution module 1002, configured to perform attribution analysis processing by using at least one attribution algorithm according to the feature data to be predicted and the attrition probability if the attrition probability is greater than a preset threshold value, so as to obtain attrition information of the target user.
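The cooperation of the prediction module and the attribution module can be sketched as below. This is a minimal sketch: models are represented as plain callables, and the names `predict_loss_probability` and `predict_and_attribute`, the averaging secondary model, and the threshold value are hypothetical stand-ins for the trained primary models, the trained secondary model, and the preset threshold.

```python
def predict_loss_probability(features, primary_models, secondary_model):
    """Feed the primary models' predictions into the secondary model."""
    primary_outputs = [model(features) for model in primary_models]
    return secondary_model(primary_outputs)


def predict_and_attribute(features, primary_models, secondary_model, attribute, threshold=0.5):
    """Run attribution analysis only when the loss probability exceeds the threshold."""
    probability = predict_loss_probability(features, primary_models, secondary_model)
    if probability > threshold:
        return probability, attribute(features, probability)
    return probability, None


# Hypothetical stand-ins for trained models and an attribution routine.
primary_models = [lambda f: 0.8, lambda f: 0.6]
secondary_model = lambda outputs: sum(outputs) / len(outputs)
attribute = lambda features, probability: ["level"]

probability, reasons = predict_and_attribute({"level": 3}, primary_models, secondary_model, attribute)
```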

In a possible embodiment, the information prediction apparatus of the present application further includes: the training module is specifically used for:

model training is carried out based on sample user characteristic data, and a plurality of primary models and first prediction results of the primary models are obtained;

model training is carried out based on the first prediction results of the primary models to obtain secondary models;

The plurality of primary models and the secondary model are combined into the loss prediction model.

In a possible embodiment, the apparatus further comprises a data processing module, in particular for:

Extracting corresponding initial user characteristic data from the historical user data according to a preset characteristic label;

And performing feature processing on the initial user feature data to obtain sample user feature data.

In a possible embodiment, the data processing module is further specifically configured to:

according to the data type of the feature tag, carrying out coding processing on the initial user feature data corresponding to the feature tag to obtain coded data;

classifying the user types of the coded data based on the initial user characteristic data to obtain classified data;

determining an observation window and a representation window based on a preset observation point, and adding annotation information to the classified data according to the user data of the observation window and the user data of the representation window;

and removing abnormal points from the classified data added with the labeling information to obtain sample user characteristic data.

If the feature tag is a discrete feature, determining a target coding mode according to the feature value of the initial user feature data corresponding to the feature tag, and coding the initial user feature data corresponding to the feature tag based on the target coding mode;

If the feature tag is a continuous feature, the initial user feature data corresponding to the feature tag is encoded according to a preset reference encoding mode.

In one possible embodiment, attribution module 1002 is specifically configured to:

Performing feature analysis on the feature data to be predicted based on a first attribution algorithm and a loss prediction model to obtain at least one first loss reason;

Performing feature analysis on the feature data to be predicted and the loss probability based on a second attribution algorithm and a loss prediction model to obtain at least one second loss cause;

And determining the loss information of the target user according to the first loss reason and the second loss reason.

Determining a contribution value of each feature tag in the feature data to be predicted based on a first attribution algorithm and a loss prediction model;

Performing data replacement on a plurality of first feature labels in the feature data to be predicted to obtain a plurality of first feature data;

predicting a plurality of first characteristic data based on a loss prediction model to obtain a plurality of first loss probabilities;

And analyzing and processing each first loss probability based on a first attribution algorithm to obtain the contribution value of each first feature label in the feature data to be predicted.

clustering the feature labels of the user group based on the contribution value to obtain a plurality of clusters of the lost user group;

Modifying the feature data to be predicted to obtain modified feature data;

Predicting the modified characteristic data based on the loss prediction model to obtain a second prediction result;

Combining the modified characteristic data and the second prediction result into a new data set, and training an interpretability model based on the new data set to obtain the interpretability model;

and carrying out feature analysis on the loss probability according to the interpretability model to obtain at least one second loss cause.

In one possible embodiment, the attribution module 1002 is specifically configured to:

According to the embodiment of the application, model training is performed by means of ensemble learning, so that the loss prediction model can fully integrate all the features of the user and obtain a more comprehensive and more accurate prediction result, which improves the prediction accuracy of the loss prediction model. By analyzing the loss probability together with the feature data to be predicted of the user, the causal relationship between the user behavior features and the loss result can be obtained. This addresses the weak causal visualization capability of some models and the poor interpretability of black-box models, and can effectively assist game operators in identifying the lost group and formulating a win-back strategy.

Fig. 11 shows a schematic structural diagram of an electronic device according to an embodiment of the present application, including: a processor 1101, a storage medium 1102, and a bus 1103. The storage medium 1102 stores machine readable instructions executable by the processor 1101. When the electronic device runs the information prediction method of the embodiment, the processor 1101 and the storage medium 1102 communicate via the bus 1103, and the processor 1101 executes the machine readable instructions to perform the following steps:

Inputting feature data to be predicted of a target user into a loss prediction model to obtain loss probability corresponding to the target user, wherein the loss prediction model is obtained by training based on sample user feature data, the loss prediction model comprises a plurality of primary models and a secondary model, and the secondary model is obtained by training based on prediction results of the plurality of primary models;

If the loss probability is larger than the preset threshold, carrying out attribution analysis processing by using at least one attribution algorithm according to the feature data to be predicted and the loss probability to obtain loss information of the target user.

In a possible embodiment, before the processor 1101 performs the input of the feature data to be predicted of the target user into the attrition prediction model, the method is specifically used for:

In one possible embodiment, the processor 1101, before performing model training based on the sample user feature data, further comprises:

In a possible embodiment, the processor 1101 is specifically configured to, when performing feature processing on the initial user feature data to obtain sample user feature data:

In a possible embodiment, the processor 1101 is specifically configured to, when executing the encoding process on the initial user feature data corresponding to the feature tag according to the data type of the feature tag, obtain encoded data:

In a possible embodiment, the processor 1101 is specifically configured to, when executing the attribution analysis processing according to the feature data to be predicted and the attrition probability using at least one attribution algorithm, obtain the attrition information of the target user:

In a possible embodiment, the processor 1101 is specifically configured to, when performing feature analysis on the feature data to be predicted based on a first attribution algorithm to obtain a first loss reason:

In a possible embodiment, the processor 1101 is specifically configured to, when executing the determining the contribution value of each feature tag in the feature data to be predicted based on the first attribution algorithm and the attrition prediction model:

In a possible embodiment, the processor 1101 is specifically configured to, when executing the determination of the first loss reason according to the contribution value of each feature tag:

In a possible embodiment, the processor 1101 is specifically configured to, when performing feature analysis on the feature data to be predicted and the loss probability based on the second attribution algorithm and the loss prediction model, obtain at least one second loss cause:

Modifying the feature data to be predicted to obtain modified feature data;

In a possible embodiment, the processor 1101 is specifically configured to, when executing the determining the attrition information according to at least one first attrition cause and at least one second attrition cause:

According to the application, model training is performed by means of ensemble learning, so that the loss prediction model can fully integrate all the features of the user and obtain a more comprehensive and more accurate prediction result, which improves the prediction accuracy of the loss prediction model. By analyzing the loss probability together with the feature data to be predicted of the user, the causal relationship between the user behavior features and the loss result can be obtained. This addresses the weak causal visualization capability of some models and the poor interpretability of black-box models, and can effectively assist game operators in identifying the lost group and formulating a win-back strategy.

The embodiment of the application also provides a computer readable storage medium storing a computer program; when the computer program is executed by a processor, the processor performs the following steps:

In a possible implementation manner, before the processor inputs the feature data to be predicted of the target user into the attrition prediction model to obtain the attrition probability corresponding to the target user, the processor is specifically configured to:

In one possible embodiment, the processor, prior to performing model training based on the sample user feature data, further comprises:

In a possible embodiment, the processor is specifically configured to, when performing feature processing on the initial user feature data to obtain sample user feature data:

In a possible implementation manner, the processor is specifically configured to, when executing the encoding processing on the initial user feature data corresponding to the feature tag according to the data type of the feature tag to obtain encoded data:

In a possible implementation manner, the processor is specifically configured to, when executing attribution analysis processing according to the feature data to be predicted and the loss probability by using at least one attribution algorithm, obtain loss information of the target user:

In a possible embodiment, the processor is specifically configured to, when performing feature analysis on the feature data to be predicted based on a first attribution algorithm to obtain a first loss reason:

In a possible embodiment, the processor is specifically configured to, when executing the determining the contribution value of each feature tag in the feature data to be predicted based on the first attribution algorithm and the churn prediction model:

In a possible embodiment, the processor is specifically configured to, when executing the determination of the first loss reason according to the contribution value of each feature tag:

In a possible embodiment, the processor is specifically configured to, when performing feature analysis on the feature data to be predicted and the loss probability based on the second attribution algorithm and the loss prediction model to obtain at least one second loss cause:

Modifying the feature data to be predicted to obtain modified feature data;

In a possible embodiment, the processor is specifically configured to, when executing the determining the churn information according to at least one first churn reason and at least one second churn reason:

In an embodiment of the present application, the computer program may further execute other machine readable instructions when executed by a processor to perform the method as described in other embodiments, and the specific implementation of the method steps and principles are referred to in the description of the embodiments and are not described in detail herein.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interface, device, or unit, and may be electrical, mechanical, or of another form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments provided in the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

It should be noted that: like reference numerals and letters in the following figures denote like items, and thus once an item is defined in one figure, no further definition or explanation of it is required in the following figures, and furthermore, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that the above examples are only specific embodiments of the present application and are not intended to limit its scope of protection. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art should understand that, within the technical scope disclosed herein, any person skilled in the art may still modify or readily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions of some of the technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments, and are intended to be encompassed within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

1. An information prediction method, comprising:

2. The method according to claim 1, wherein before inputting the feature data to be predicted of the target user into the attrition prediction model to obtain the attrition probability corresponding to the target user, the method comprises:

Model training is carried out based on the sample user characteristic data, and a plurality of primary models and first prediction results of the primary models are obtained;

Model training is carried out based on the first prediction results of the primary models, and the secondary models are obtained;

the plurality of primary models and the secondary model are combined into the attrition prediction model.

3. The method of claim 1, wherein prior to model training based on the sample user characteristic data, further comprising:

and carrying out feature processing on the initial user feature data to obtain the sample user feature data.

4. A method according to claim 3, wherein said performing feature processing on said initial user feature data to obtain said sample user feature data comprises:

According to the data type of the feature tag, carrying out coding processing on initial user feature data corresponding to the feature tag to obtain coded data;

And removing abnormal points from the classified data added with the labeling information to obtain the sample user characteristic data.

5. The method of claim 4, wherein the encoding the initial user feature data corresponding to the feature tag according to the data type of the feature tag to obtain encoded data includes:

and if the feature tag is a continuous feature, encoding initial user feature data corresponding to the feature tag according to a preset reference encoding mode.

6. The method according to any one of claims 1-5, wherein said performing an attribution analysis process using at least one attribution algorithm according to the feature data to be predicted and the attrition probability, to obtain attrition information of the target user, comprises:

performing feature analysis on the feature data to be predicted based on a first attribution algorithm and the loss prediction model to obtain at least one first loss reason;

Performing feature analysis on the feature data to be predicted and the loss probability based on a second attribution algorithm and the loss prediction model to obtain at least one second loss cause;

7. The method of claim 6, wherein the performing feature analysis on the feature data to be predicted based on a first attribution algorithm to obtain a first loss reason comprises:

Determining a contribution value of each feature tag in the feature data to be predicted based on the first attribution algorithm and the loss prediction model;

And determining the first loss reason according to the contribution value of each characteristic label.

8. The method of claim 7, wherein the determining the contribution value of each feature tag in the feature data to be predicted based on the first attribution algorithm and the churn prediction model comprises:

predicting the plurality of first characteristic data based on the loss prediction model to obtain a plurality of first loss probabilities;

And analyzing and processing each first loss probability based on the first attribution algorithm to obtain a contribution value of each first feature label in the feature data to be predicted.

9. The method of claim 7, wherein the determining the first loss reason according to the contribution value of each feature tag comprises:

sorting the clusters based on the contribution values, and determining at least one first loss reason from the sorted result according to a preset reference value.
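The ranking step of claim 9 can be sketched as follows. This is an illustration under stated assumptions: the mapping `feature_to_cluster`, the cluster names, and the contribution numbers are hypothetical, the cluster score is assumed to be the sum of its features' contribution values, and the preset reference value is interpreted here as the number of top-ranked clusters to keep (`top_k`).

```python
from collections import defaultdict


def top_loss_reasons(contributions, feature_to_cluster, top_k):
    """Aggregate contributions per cluster, sort descending, keep top_k."""
    cluster_scores = defaultdict(float)
    for feature, value in contributions.items():
        cluster_scores[feature_to_cluster[feature]] += value
    # Highest total contribution first.
    ranked = sorted(cluster_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical contribution values and cluster assignment.
contributions = {
    "days_inactive": 0.30,
    "login_gap": 0.10,
    "losing_streak": 0.25,
    "friends_online": -0.05,
}
feature_to_cluster = {
    "days_inactive": "activity",
    "login_gap": "activity",
    "losing_streak": "combat",
    "friends_online": "social",
}
first_reasons = top_loss_reasons(contributions, feature_to_cluster, top_k=2)
```

Other readings of the preset reference value are equally possible, for example a minimum contribution threshold instead of a count.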

10. The method of claim 7, wherein the determining the first loss reason according to the contribution value of each feature tag comprises:

11. The method of claim 6, wherein the performing feature analysis on the feature data to be predicted and the loss probability based on a second attribution algorithm and the loss prediction model to obtain at least one second loss reason comprises:

modifying the feature data to be predicted to obtain modified feature data;

combining the modified feature data and the second prediction result into a new data set, and training an interpretable model based on the new data set;

and performing feature analysis on the loss probability according to the interpretable model to obtain the at least one second loss reason.
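The steps of claim 11 resemble a local surrogate approach such as LIME, although the claim does not name the algorithm. The sketch below is an assumption-laden illustration: `toy_model` and the feature names are hypothetical, the modification step uses a deterministic grid that shifts every feature by plus or minus `delta` (rather than random sampling with a proximity kernel), and the interpretable model is assumed to be a linear model fitted by least squares.

```python
from itertools import product


def local_surrogate(predict, x, delta=0.1):
    """Fit a local linear surrogate around x and return its slopes."""
    names = list(x)
    # Step 1: modified feature data, each feature shifted by -delta or +delta.
    offsets = list(product((-delta, delta), repeat=len(names)))
    modified = [{f: x[f] + d for f, d in zip(names, row)} for row in offsets]
    # Step 2: second prediction results from the loss prediction model.
    predictions = [predict(m) for m in modified]
    # Step 3: least-squares fit on the new data set. The grid is balanced,
    # so the offset columns are zero-mean and mutually orthogonal, and the
    # least-squares slope of each feature reduces to a per-feature ratio.
    coefficients = {}
    for j, feature in enumerate(names):
        numerator = sum(row[j] * y for row, y in zip(offsets, predictions))
        denominator = sum(row[j] ** 2 for row in offsets)
        coefficients[feature] = numerator / denominator
    return coefficients


# Hypothetical stand-in for the trained loss prediction model.
def toy_model(features):
    return 0.2 + 0.6 * features["days_inactive"] - 0.3 * features["friends_online"]


sample = {"days_inactive": 0.5, "friends_online": 0.4}
coefficients = local_surrogate(toy_model, sample)
# Features whose local slope raises the loss probability, strongest first.
second_reasons = [
    f for f in sorted(coefficients, key=coefficients.get, reverse=True)
    if coefficients[f] > 0
]
```

The full grid has 2**n rows, so it only suits a small number of features; random perturbation with sample weighting would be the usual choice at scale.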

12. The method of claim 6, wherein the determining the loss information according to the at least one first loss reason and the at least one second loss reason comprises:

verifying the at least one first loss reason according to the at least one second loss reason, and determining each first loss reason that is the same as a second loss reason as the loss information of the target user.
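The verification in claim 12 amounts to keeping only the first loss reasons that the second attribution algorithm also reports. A minimal sketch, with hypothetical reason labels, might look like this:

```python
def verify_reasons(first_reasons, second_reasons):
    """Keep first loss reasons confirmed by the second algorithm, in order."""
    confirmed = set(second_reasons)
    return [reason for reason in first_reasons if reason in confirmed]


# Hypothetical outputs of the two attribution algorithms.
first = ["activity", "combat", "social"]
second = ["combat", "activity", "payment"]
loss_information = verify_reasons(first, second)
```

Preserving the order of the first list keeps the ranking produced by the first attribution algorithm.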

13. An information prediction apparatus, comprising:

a prediction module, configured to input feature data to be predicted of a target user into a loss prediction model to obtain a loss probability corresponding to the target user, wherein the loss prediction model is trained based on sample user feature data and comprises a plurality of primary models and a secondary model, and the secondary model is trained based on prediction results of the plurality of primary models;
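The two-level structure of the loss prediction model (a plurality of primary models whose prediction results train a secondary model) corresponds to what is commonly called stacking. The sketch below is illustrative only: the two rule-based primary models, the feature names, the training samples, and the choice of logistic regression as the secondary model are all assumptions, not details taken from the claims.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical primary models: each maps raw features to a score in [0, 1].
def primary_activity(f):
    return min(f["days_inactive"] / 7.0, 1.0)


def primary_results(f):
    return 1.0 - f["win_rate"]


PRIMARY_MODELS = [primary_activity, primary_results]


def primary_outputs(f):
    return [model(f) for model in PRIMARY_MODELS]


def train_secondary(samples, labels, epochs=2000, lr=0.5):
    """Fit logistic regression on primary outputs by batch gradient descent."""
    inputs = [primary_outputs(s) for s in samples]
    weights = [0.0] * len(PRIMARY_MODELS)
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(inputs, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, row))) - y
            grad_b += err
            for j, v in enumerate(row):
                grad_w[j] += err * v
        bias -= lr * grad_b / len(inputs)
        weights = [w - lr * g / len(inputs) for w, g in zip(weights, grad_w)]
    return weights, bias


def predict_loss(f, weights, bias):
    """Loss probability from the secondary model applied to primary outputs."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, primary_outputs(f))))


# Hypothetical sample users: label 1 means the user was lost.
samples = [
    {"days_inactive": 6, "win_rate": 0.2},
    {"days_inactive": 5, "win_rate": 0.3},
    {"days_inactive": 1, "win_rate": 0.6},
    {"days_inactive": 0, "win_rate": 0.7},
]
labels = [1, 1, 0, 0]
weights, bias = train_secondary(samples, labels)
```

In practice the secondary model is usually trained on out-of-fold predictions of the primary models so that it does not overfit to their training error; that refinement is omitted here for brevity.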

14. An electronic device, comprising: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, the processor and the storage medium communicate over the bus when the electronic device is running, and the processor executes the machine-readable instructions to perform the steps of the information prediction method according to any one of claims 1 to 12.

15. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the information prediction method according to any one of claims 1 to 12.