Disclosure of Invention
The invention aims to provide an artificial intelligence algorithm model for building a personalized recommendation engine, so as to solve the problems described in the background art.
The purpose of the invention is realized by the following technical scheme: the method for building the artificial intelligence algorithm model of the personalized recommendation engine comprises the following steps:
1) feature aggregation method
The feature aggregation method is an absolute-value-based aggregation method: several absolute-value intervals are set, and features are aggregated into the corresponding intervals according to their click rate or registration rate. Considering the distribution of the click rate or registration rate of the features, smaller intervals are used where the values are small. Since the effect of 'smaller intervals for smaller values' can be achieved by applying an exponential transformation to the click rate or registration rate, the exponential transformation is described as follows:
x_i is the click rate or registration rate of feature i, and y_i is the transformation result. Following a sorting-based approach, y_i is set to the rank ratio rank_i/n, from which we calculate the value of alpha:
taking the data of channel 10008 as an example, alpha is computed to be 0.22 for the click-rate feature and 0.5 for the registration-rate feature;
the transformed y_i is then integer-quantized to realize feature aggregation:
because alpha is estimated from the ranking, y_i is distributed fairly uniformly over [0,1], so the setting of m is also close to the number of features after aggregation; we set m = 1000 for the click-rate model and m = 500 for the registration-rate model;
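The aggregation described above can be sketched in code. This is a minimal reading of the text, assuming the exponential transform is y = x ** alpha and the integer transform is floor(y * m); the production scripts are not reproduced here.

```python
def aggregate_feature(x, alpha, m):
    """Map a click/registration rate x in (0, 1] to one of m buckets.

    Assumption (our reading of the text): the exponential transform is
    y = x ** alpha, and the integer transform is floor(y * m). Because
    x ** alpha changes fastest near 0, small rates land in finer intervals.
    """
    y = x ** alpha                 # exponential transform
    return min(int(y * m), m - 1)  # integer transform into buckets 0..m-1

# Click-rate model: alpha = 0.22, m = 1000 (values from the text)
buckets = [aggregate_feature(x, 0.22, 1000) for x in (0.001, 0.01, 0.05)]
```

With these parameters, rates spanning two orders of magnitude spread across well-separated buckets, while the bucket index stays stable across days because it depends only on the absolute rate.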
2) code flow
get_alpha.py
Calculates and tests the exponential transformation parameter alpha. Input: floating-point numbers between 0 and 1, one per line. Output: the count distribution of the input data over 100 buckets after the exponential transformation, and the alpha value.
ctr_feature_stat.sh
Modify the analysis_cross_features task to use the script analysis_cross_features_alpha.py or analysis_cross_features_share_param.py, together with the associated configurations, such as:
analysis_ctr.10008.conf
run_all.sh / run_rgr_all.sh
Model incremental update is performed by specifying the online flag bit (the second parameter):
ctr_feature_stat.sh reads the incremental training data specified by the configuration file CTR_ONLINE_FEATURES (ensuring that the negative-sample sampling rate is consistent with the original data) and generates a new mapping table;
cal_features.sh reads the incremental training data and generates incremental training samples;
train_model.local.sh reads the incremental training samples and the current model parameters, performs incremental training, and generates a new ctr mapping table, etc.
Further, the method for handling missing online sample features in the artificial intelligence algorithm model for building a personalized recommendation engine comprises the following steps:
solution 1: in the initially used scheme, a certain proportion of samples are copied and the feature value is set to -1, serving as the parameter of the feature when the feature is absent; this solves the problem fairly well;
solution 2: when initializing the model, the constant-term parameter is set directly to the value corresponding to the statistical average of the samples, so that the model's baseline becomes the more reasonable statistical average; when features are missing, the model falls back to this statistical average, which also solves the problem;
solution 3: the constant term is determined as in solution 2, while samples are copied and features are randomly discarded or set to -1 as in solution 1.
The invention has the beneficial effects that:
1. Fast statistical analysis queries: IndexR uses columnar storage and provides efficient indexing for very large data sets, reducing IO by filtering out irrelevant data and quickly locating valid data. It uses the excellent Apache Drill as the upper-layer query engine, and is especially suitable for ad-hoc OLAP queries.
2. Real-time data import: IndexR supports ultra-high-speed real-time import of data. As soon as data reaches an IndexR node, it can be queried. Real-time data and historical data can be queried together, so the so-called T+1 architecture need not be considered. Unlike other systems with similar functions, IndexR never actively discards any data.
3. Efficient hardware utilization: compared to other systems, IndexR can run on inexpensive machines. You can get very good performance without expensive SSDs, high-end CPUs, or large-memory machines, although it will run faster on such hardware. While running on a JVM, it manually manages almost all of its memory using carefully designed, compact data structures.
4. Highly available, easily scalable, easily managed, simple clusters: distributed systems have evolved to the point where high availability and scalability are standard. IndexR is characterized by a very simple, reliable architecture and few mandatory configuration items.
5. Deep integration with the Hadoop ecosystem: IndexR stores its data on HDFS. This means you can process these files with MapReduce or any Hadoop tool. We currently provide Hive plug-ins for various ETL-related jobs and off-line tasks. Spark integration is in progress and will be used for data mining as well as machine learning.
6. Highly compressed data format: IndexR stores data in columns and provides an ultra-high compression rate, which can significantly reduce IO and network overhead.
7. Convenient data management: IndexR can conveniently import and delete data, and supports modifying table schemas, such as adding, deleting, and modifying columns.
Detailed Description
An artificial intelligence algorithm model for building a personalized recommendation engine, comprising: logistic regression, which estimates the occurrence probability of the target event through linear weighting and a sigmoid transformation:
1) model description
wherein y is the target variable taking values in {0,1}, x is the feature vector, and w is the model parameter.
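The formula referenced here is not reproduced in the text; the standard logistic-regression form consistent with the surrounding description (linear weighting followed by a sigmoid) is:

```latex
p(y=1 \mid x; w) = \sigma(w^{\top} x) = \frac{1}{1 + e^{-w^{\top} x}},
\qquad
p(y=0 \mid x; w) = 1 - p(y=1 \mid x; w)
```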
We introduce a hidden variable z, the degree of membership (a softmax) of sample x in sub-model z, with corresponding model parameters w_z:
2) model training
With the hidden variable introduced, we use the EM algorithm to train the model. For the dataset, the log-likelihood function of the model is as follows:
for the E-step of the EM algorithm, we want equality to hold in the last inequality, i.e., the term after the log should be a constant:
wherein p (y, z | x; W) is defined as follows:
p(y,z|x;W) = p(y|x,z;W) · p(z|x;W)
W = {w_z, φ_z}
for the M-step of the EM algorithm, we adjust W to maximize the lower bound of the likelihood function:
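The likelihood and the E/M-step formulas are not reproduced in the text. The standard mixture-of-logistic-regressions forms consistent with the definitions above (our reconstruction, not the original equations) are:

```latex
% Model: a softmax gate over components, each a logistic regression
p(y \mid x; W) = \sum_{z} p(z \mid x; \phi)\, p(y \mid x, z; w_z),
\qquad
p(z \mid x; \phi) = \frac{e^{\phi_z^{\top} x}}{\sum_{z'} e^{\phi_{z'}^{\top} x}}

% Log-likelihood lower bound via Jensen's inequality:
\log L(W) = \sum_i \log \sum_z q_i(z)\, \frac{p(y_i, z \mid x_i; W)}{q_i(z)}
\;\ge\; \sum_i \sum_z q_i(z) \log \frac{p(y_i, z \mid x_i; W)}{q_i(z)}

% E-step (makes the bound tight):
q_i(z) = p(z \mid x_i, y_i; W)
= \frac{p(z \mid x_i; \phi)\, p(y_i \mid x_i, z; w_z)}
       {\sum_{z'} p(z' \mid x_i; \phi)\, p(y_i \mid x_i, z'; w_{z'})}

% M-step:
W \leftarrow \arg\max_{W} \sum_i \sum_z q_i(z) \log p(y_i, z \mid x_i; W)
```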
3) Mixture Logistic Regression
we summarize the mixture logistic regression algorithm as follows; the M-step optimization can use stochastic gradient descent, or even FTRL-based stochastic gradient descent (modifying the gradient computation therein):
E-step: for each i, j:
M-step: for each i, j:
The E-step and M-step are executed repeatedly until convergence. To avoid synchronized updates, it is preferable to initialize φ randomly; otherwise the parameters learned by each logistic regression would be identical. Because of the hidden variables, the likelihood function may not be convex and the model may be trapped in a local optimum, so it may be necessary to train and evaluate the model multiple times.
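The EM loop above can be sketched in code. This is a minimal batch-gradient version under the standard mixture-of-logistic-regressions formulation; the FTRL-based stochastic variant mentioned in the text would replace the M-step's inner update. It is an illustrative sketch, not the production trainer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mixture_lr(X, y, k=2, iters=50, lr=0.5, seed=0):
    """EM training of a mixture of k logistic regressions (sketch).

    E-step: responsibilities q[i, z] = p(z | x_i, y_i; W).
    M-step: one gradient step per component on the q-weighted
    log-likelihood. phi is initialized randomly to break the symmetry,
    as the text advises.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(0, 0.1, (k, d))     # per-component LR weights w_z
    phi = rng.normal(0, 0.1, (k, d))   # gating (softmax) parameters phi_z
    for _ in range(iters):
        # E-step: posterior membership of each sample in each component
        gate = np.exp(X @ phi.T)
        gate /= gate.sum(axis=1, keepdims=True)      # p(z|x)
        p = sigmoid(X @ w.T)                         # p(y=1|x,z)
        lik = np.where(y[:, None] == 1, p, 1 - p)    # p(y|x,z)
        q = gate * lik
        q /= q.sum(axis=1, keepdims=True)            # q(z) = p(z|x,y)
        # M-step: gradient ascent on the weighted log-likelihood
        for z in range(k):
            grad_w = X.T @ (q[:, z] * (y - p[:, z])) / n
            grad_phi = X.T @ (q[:, z] - gate[:, z]) / n
            w[z] += lr * grad_w
            phi[z] += lr * grad_phi
    return w, phi

def predict(X, w, phi):
    """Mixture prediction: gate-weighted average of component sigmoids."""
    gate = np.exp(X @ phi.T)
    gate /= gate.sum(axis=1, keepdims=True)
    return (gate * sigmoid(X @ w.T)).sum(axis=1)     # p(y=1|x)
```

Because the objective is non-convex, different seeds can converge to different local optima, which is exactly why the text recommends training and evaluating multiple times.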
The method for generating the feature aggregation mode and training data of the artificial intelligence algorithm model for building a personalized recommendation engine comprises the following steps:
1) feature aggregation
The original feature aggregation was sorting-based: given a parent feature, the sub-features beneath it were sorted by click rate or registration rate and then uniformly aggregated into a number of buckets. The result of such aggregation varies from day to day, so the physical meaning of the aggregated features also differs from day to day. However, incremental learning, which continues training from yesterday's result with new samples from the current day, requires the physical meaning of the aggregated features to remain unchanged. The feature aggregation method is therefore changed to an absolute-value-based one: several absolute-value intervals are set, and features are aggregated into the corresponding intervals according to click rate or registration rate. Considering the distribution of the click rate or registration rate of the features, smaller intervals are used where the values are small. Tests show that this change has no obvious effect on performance.
Since the effect of 'smaller intervals for smaller values' can be achieved by applying an exponential transformation to the click rate or registration rate, the exponential transformation is described as follows:
x_i is the click rate or registration rate of feature i, and y_i is the transformation result. Following a sorting-based approach, y_i is set to the rank ratio rank_i/n, from which we calculate the value of alpha:
taking the data of channel 10008 as an example, alpha is computed to be 0.22 for the click-rate feature and 0.5 for the registration-rate feature;
the transformed y_i is then integer-quantized to realize feature aggregation:
because alpha is estimated from the ranking, y_i is distributed fairly uniformly over [0,1], so the setting of m is also close to the number of features after aggregation; we set m = 1000 for the click-rate model and m = 500 for the registration-rate model;
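How get_alpha.py actually fits alpha is not shown in the text; the relation y_i = x_i ** alpha = rank_i/n suggests a least-squares fit in log space. The sketch below is one plausible estimator under that assumption, not the original script.

```python
import math

def estimate_alpha(rates):
    """Estimate the exponent alpha so that x ** alpha tracks the rank ratio.

    Assumption: the text sets y_i = rank_i / n and y_i = x_i ** alpha, so we
    fit alpha by least squares in log space:
        alpha * log(x_i) ~ log(rank_i / n)
    The closed-form solution is sum(lx * ly) / sum(lx * lx).
    """
    xs = sorted(r for r in rates if 0.0 < r < 1.0)  # ascending click rates
    n = len(xs)
    num = den = 0.0
    for rank, x in enumerate(xs, start=1):          # rank_i of feature i
        lx, ly = math.log(x), math.log(rank / n)
        num += lx * ly
        den += lx * lx
    return num / den
```

On synthetic rates generated with a known exponent, the estimator recovers that exponent; on real channel data it would play the role of the alpha = 0.22 / 0.5 values quoted above.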
2) code flow
get_alpha.py
Calculates and tests the exponential transformation parameter alpha. Input: floating-point numbers between 0 and 1, one per line. Output: the count distribution of the input data over 100 buckets after the exponential transformation, and the alpha value.
ctr_feature_stat.sh
Modify the analysis_cross_features task to use the script analysis_cross_features_alpha.py or analysis_cross_features_share_param.py, together with the associated configurations, such as: analysis_ctr.10008.conf, run_all.sh / run_rgr_all.sh
Model incremental update is performed by specifying the online flag bit (the second parameter):
ctr_feature_stat.sh reads the incremental training data specified by the configuration file CTR_ONLINE_FEATURES (ensuring that the negative-sample sampling rate is consistent with the original data) and generates a new mapping table;
cal_features.sh reads the incremental training data and generates incremental training samples;
train_model.local.sh reads the incremental training samples and the current model parameters, performs incremental training, and generates a new ctr mapping table, etc.
When the LR model is used online to estimate the click or registration rate, because the model is trained on historical data, when new features appear online on the same day (for example, when a new creative package is submitted), the model contains no parameters for those features, and the estimates will be biased.
The method for handling missing online sample features in the artificial intelligence algorithm model for building a personalized recommendation engine comprises the following steps:
solution 1: in the initially used scheme, a certain proportion of samples are copied and the feature value is set to -1, serving as the parameter of the feature when the feature is absent; this solves the problem fairly well;
Disadvantages: the samples require additional processing; when features are sparse, the small number of copied samples easily deviates from the original statistical distribution; when copying samples, the feature can also simply be discarded instead of adding the -1 value.
Solution 2: when initializing the model, the constant-term parameter is set directly to the value corresponding to the statistical average of the samples, so that the model's baseline becomes the more reasonable statistical average; when features are missing, the model falls back to this statistical average, which also solves the problem;
Advantages: compared with solution 1, training converges to parameters with reasonable physical meaning, the constant term directly corresponding to the statistical average of the samples; no special processing of the samples is required.
Disadvantages: co-occurrence relationships among the other features can still make the parameters converge to unreasonable values.
Solution 3: the constant term is determined as in solution 2, while samples are copied and features are randomly discarded or set to -1 as in solution 1.
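The three solutions combine into a simple preprocessing step. The sketch below illustrates solution 3 under an assumed (feature_value, label) sample format, which is hypothetical; the actual sample format of the training pipeline is not given in the text.

```python
import math
import random

def expand_samples(samples, copy_prob=0.1, missing_value=-1, seed=0):
    """Solution 1/3 sketch: copy a fraction of samples and mark the feature
    as missing (-1), so the model learns a parameter for 'feature absent'.
    `samples` is a list of (feature_value, label) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    out = list(samples)
    for feat, label in samples:
        if rng.random() < copy_prob:
            out.append((missing_value, label))  # copy with feature dropped
    return out

def init_constant_term(samples):
    """Solution 2 sketch: set the model's constant term to the logit of the
    samples' statistical average, so missing features fall back to it."""
    avg = sum(label for _, label in samples) / len(samples)
    return math.log(avg / (1.0 - avg))
```

With the worked numbers later in this document (64 positives out of 2000 samples, rate 0.032), `init_constant_term` returns roughly -3.41, matching the constant term derived there.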
On-line cold start
Currently, for new activities (a new PackageID in the features) we use feature sampling of the samples: for each sample, a new sample is created with a certain probability, and the feature value in the new sample is set to -1 (representing a new feature value). When the online system encounters a new activity, the model parameter table is consulted; if there is no model parameter for that activity, its feature value is set to -1, and the model parameter corresponding to this default feature value is used. This is equivalent to learning an 'average' parameter for new features; for a new activity, we use the population-average behavior to predict its performance on the next day. For new ad slots, we do not sample by default and do not score.
For online learning, a new activity accumulates some data during the day, so in the remaining time of the day an estimate can be given for the new activity that is better than the average estimate (after all, the differences between activities are large). The exploration of new things and the exploitation of old things can be balanced with reference to the explore/exploit approach discussed in the industry. The industry treats this as a Multi-Armed Bandit problem: within a time interval [1, ..., T], an action a can be selected at each time t, obtaining the corresponding reward r(a, t), with the goal of maximizing the total reward. The currently popular method is UCB (Upper Confidence Bound), i.e., at each step selecting the action with the largest possible reward (roughly, the estimate plus the estimate's variance). It pursues the greatest overall reward by sacrificing some short-term reward for exploration. Specifically, when a bidding request arrives, we estimate the eCPM of each candidate campaign and its upper bound, and select the campaign with the maximum eCPM upper bound for delivery (the bid may be the eCPM). The eCPM upper bound of a new activity is high, so it can obtain display opportunities; once enough feedback has been obtained, the variance of its eCPM shrinks and the upper bound approaches the eCPM itself, i.e., it obtains display opportunities according to its real value. The FTRL logistic regression that we use can estimate the variance and the upper bound from statistics of the training process, which facilitates implementing the UCB method. The explore/exploit problem also has corresponding evaluation methods, but these require collecting some randomized data.
Feature merging (merge) maps features with the same meaning to the same value, so there are currently three main cases in the ctr model where features are merged (i.e., /usr/local/services/dsp_mini_ctr_rgr_model/train/conf/.merge):
Festivals and holidays: the merge is a feature mapping from dates to holiday status, which is 1 for holidays.
The same ad slot: the current resources are divided into three channels (10008, 10022, 10060) whose ad slot ids differ but whose locations are the same, so it is desirable to merge their data during training. Currently this is configured on the algorithm parameter configuration page, and the data is converted into a merge file during model training.
Merging of creative packages: different creative packages may actually be the same creative, so the creative package content is mapped by its md5 before being processed into a merge. Currently, the merge procedure is called periodically from the stat_rate entry point, and the file is updated in the early morning; see bin/group_package.
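The UCB-based campaign selection described above can be sketched as follows. This is an illustrative version of the standard UCB1 rule, not the production bidding code; the campaign dictionary fields (`n`, `mean`) are assumed names.

```python
import math

def ucb_select(campaigns, t):
    """UCB sketch for campaign selection. Each campaign dict tracks its
    impression count `n` and observed mean eCPM `mean` (hypothetical
    fields); a new campaign (n == 0) gets an infinite upper bound and is
    therefore explored first.
    """
    def upper_bound(c):
        if c["n"] == 0:
            return float("inf")  # always explore unseen campaigns first
        # mean + confidence radius; the radius shrinks as feedback accumulates
        return c["mean"] + math.sqrt(2.0 * math.log(t) / c["n"])
    return max(campaigns, key=upper_bound)
```

In production the confidence radius would come from the FTRL variance statistics mentioned above rather than the count-based term used here.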
Experimental example
Taking tagging a user with game product ids as an example, assume the games below are web page games.
1. First, define the format of the tagging output:
userid->[gameid1,gameid2,gameid3,...]
2. Each web game has an official website, an advertisement with a landing page, and a game type (strategy, action, legend, beauty, etc.)
3. For each user, assume we have the game types the user likes (data produced by another team)
4. Furthermore, we can collect the following data:
dwell time on the game landing page: short, medium, long; the number of views of the game's landing page; the number of visits to the game's official website homepage; the number of game registrations/logins.
5. Each action has a time of occurrence, divided into 7 time windows: within 1 day, 1-3 days, 3-7 days, 7-15 days, 15-30 days, 30-60 days, 60-90 days.
6. The user's actions on the game, their occurrence times, and whether the game's type matches the game types the user likes are assembled into a feature vector,
yielding: userid_gameid -> [feature1, feature2, feature3, ...]
7. The target value of each feature vector, i.e., the click rate, is obtained from the PV and Click of each userid on each game (official website) on the following day
8. Aggregate the feature vectors, stripping the label, cookie, and gameid, and merge vectors with identical behavior features, thereby reducing the number of training samples. The final training sample data has the following format:
feature list (consisting of dimension IDs), click rate (number of clicks / total PV count)
9. Inputting training sample data into a machine learning algorithm to generate an algorithm model;
10. Prediction:
i. Predict the ctr (degree of interest) of a userid for a gameid: construct a feature vector in the same format as the training samples from the userid's behavior list on the gameid, input the features into the model, and let the model compute the ctr value;
ii. Sort by ctr in descending order and select the top N gameids as the game tags the user is interested in.
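The prediction steps above can be sketched as follows. The model is represented as any callable mapping a feature vector to a ctr (a stand-in for the trained algorithm model); the names are illustrative, not from the original pipeline.

```python
def top_n_game_tags(model, user_features, n=3):
    """Score each candidate gameid with the ctr model and keep the top-N
    as the user's game tags.

    `model` is any callable feature-vector -> ctr (a stand-in for the
    trained model). `user_features` maps gameid -> feature vector built
    from the user's behavior on that game, in the same format as the
    training samples.
    """
    scored = [(gameid, model(fv)) for gameid, fv in user_features.items()]
    scored.sort(key=lambda kv: kv[1], reverse=True)  # sort by ctr, descending
    return [gameid for gameid, _ in scored[:n]]
```

The output matches the tagging format defined in step 1: userid -> [gameid1, gameid2, ...].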
Intuitive models and mathematical interpretations
Consider simply a model with only one feature, such as gender.
The model has a constant-term parameter, set as c; since gender is a discrete feature, there are actually two mutually exclusive features g1, g2
If each gender has 1000 samples, with 47 and 17 positive examples respectively, the corresponding click rates are 0.047 and 0.017 and the corresponding weights (logits) are -3 and -4, so the trained model parameters should satisfy the following relations
c+g1=-3
c+g2=-4
Obviously there are infinitely many satisfying solutions. Since the initial values for training are generally set to 0, the parameters obtained at convergence are relatively close to 0, for example c = -1, g1 = -2, g2 = -3
When the gender feature is missing, only the constant term c remains, giving a click rate of sigmoid(-1) ≈ 0.269, far higher than the actual rates and biased toward the original baseline of 0.5
Applying solution 1
Randomly copy a part of the samples and set their gender feature to g_-1. If the copied proportion is large enough, the distribution of this part of the samples should be consistent with the overall distribution: the click rate is 0.032, and the corresponding weight is -3.4
So the model after training should satisfy
c + g_-1 = -3.4
c+g1=-3
c+g2=-4
At this point there are still multiple solutions, but when the feature is missing it takes the value g_-1, which matches the overall average distribution
Applying solution 2
Set the constant term c directly to -3.4 according to the global distribution of the samples, so the trained model satisfies
-3.4+g1=-3
-3.4+g2=-4
Solving gives g1 = 0.4 and g2 = -0.6; each parameter has a unique value with a reasonable meaning: c represents the baseline of the statistical sample, and g1 and g2 represent the influence of the different features on that baseline
The problems that still exist are:
When there are multiple features, incomplete combinations still affect the parameters. For example, add one more feature f with two possible values f1 and f2; the trained model needs to satisfy
c+g1+f1=w1
c+g1+f2=w2
c+g2+f1=w3
c+g2+f2=w4
At this point, if c can be determined in advance, there is still a unique solution. In reality, however, not all combinations of feature g and feature f necessarily appear in the samples, so the model still lacks a unique solution and missing features still have an influence. But since the overall distribution of the samples (the constant term c) has been fixed, its contribution is dominant; the remaining features are weights on top of the overall distribution, so the effect of missing features is smaller.
Applying solution 3
Following solution 2 yields c = -3.4
Copying part of the samples, the model satisfies
c = -3.4, or c + g_-1 = -3.4
c+g1=-3
c+g2=-4
Reasonable, unique parameter solutions are directly available, although in the multi-feature case the resulting system may have no solution: LR itself rests on a linearity assumption, and for data that does not satisfy linearity a perfect solution is impossible.
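The arithmetic in the worked example above can be checked directly. This is a small verification script for the quoted numbers, not part of the invention itself.

```python
import math

def logit(p):
    """Inverse sigmoid: the LR weight whose sigmoid equals rate p."""
    return math.log(p / (1.0 - p))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# The weights in the worked example are the logits of the click rates
assert abs(logit(0.047) - (-3.0)) < 0.02   # gender g1: 47/1000 clicks
assert abs(logit(0.017) - (-4.0)) < 0.07   # gender g2: 17/1000 clicks
assert abs(logit(0.032) - (-3.4)) < 0.02   # overall:   64/2000 clicks

# Solution 2: with c fixed at the overall logit, g1 and g2 become unique
c = logit(0.032)
g1 = -3.0 - c      # from c + g1 = -3
g2 = -4.0 - c      # from c + g2 = -4
assert abs(g1 - 0.4) < 0.02 and abs(g2 - (-0.6)) < 0.02

# Missing-feature bias under the near-zero solution c = -1:
bias = sigmoid(-1.0)   # ~0.269, far above both true rates 0.047 and 0.017
```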
The above are only typical examples of the present invention, and besides, the present invention may have other embodiments, and all the technical solutions formed by equivalent substitutions or equivalent changes are within the scope of the present invention as claimed.