
CN107203628B - Artificial intelligence algorithm model for building personalized recommendation engine - Google Patents


Info

Publication number
CN107203628B
CN107203628B (application CN201710393578.1A)
Authority
CN
China
Prior art keywords
model
features
feature
value
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710393578.1A
Other languages
Chinese (zh)
Other versions
CN107203628A (en)
Inventor
李华煜
梁丽丽
谭荣棉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Sunteng Information Technology Co ltd
Original Assignee
Guangzhou Sunteng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Sunteng Information Technology Co ltd
Priority to CN201710393578.1A
Publication of CN107203628A
Application granted
Publication of CN107203628B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/18 - Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 - Advertisements

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Mathematical Physics (AREA)
  • Finance (AREA)
  • Computational Mathematics (AREA)
  • Development Economics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Evolutionary Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an artificial intelligence algorithm model for constructing a personalized recommendation engine. The logistic regression of the algorithm model predicts the occurrence probability of the target event through linear weighting and a sigmoid transformation. The artificial intelligence algorithm model of the personalized recommendation engine follows the combined-advertisement-position approach of the optimal-library channel: similar advertisement positions can share aggregated statistics and model parameters, while different advertisement positions keep their data separate for training.

Description

Artificial intelligence algorithm model for building personalized recommendation engine
Technical Field
The invention belongs to the technical field of internet, and particularly relates to an artificial intelligence algorithm model for building an individualized recommendation engine.
Background
At present, some channels have a large number of advertisement positions, and the data for most of them are sparse. Training models on these long-tail advertisement positions gives unstable results. Following the way the optimal-library channel combines advertisement positions, advertisement positions can be aggregated: similar advertisement positions share aggregated statistics and model parameters, while different advertisement positions keep their data separate for training. Borrowing ideas from Gaussian mixture and topic models, we can construct multiple logistic regression models and introduce a hidden variable z obeying a multinomial distribution that specifies to which logistic regression model a sample belongs.
Disclosure of Invention
The invention aims to provide an artificial intelligence algorithm model for building a personalized recommendation engine, so as to solve the problems in the background technology.
The purpose of the invention is realized by the following technical scheme: the method for building the artificial intelligence algorithm model of the personalized recommendation engine comprises the following steps of:
1) feature aggregation method
The feature aggregation method is absolute-value based: absolute-value intervals are set, and features are aggregated into the interval matching their click rate or registration rate. Since the distribution of click or registration rates concentrates at small values, smaller intervals are used in the low-value region; the same effect can be achieved by applying an exponential (power) transformation to the rate:
y_i = x_i^alpha
where x_i is the click rate or registration rate of feature i and y_i is the transformed value. Taking a ranking-based view, y_i is set to the ranking rate rank_i/n, and alpha is chosen to fit it:
alpha* = argmin_alpha sum_i (x_i^alpha - rank_i/n)^2
Taking the data of channel 10008 as an example, alpha comes out to about 0.22 for the click-rate feature and about 0.5 for the registration-rate feature.
The transformed y_i is then converted to an integer to complete the feature aggregation:
bucket_i = floor(m * y_i)
Because alpha is estimated from the ranking, y_i is distributed fairly uniformly over [0,1], so m is close to the number of features after aggregation; we set m = 1000 for the click-rate model and m = 500 for the registration-rate model.
2) code flow
get_alpha.py
Calculates and tests the exponential transformation parameter alpha. Input: one floating-point number in (0,1) per line. Output: the count distribution of the input data over 100 buckets after the exponential transformation, and the alpha value.
ctr_feature_stat.sh
Modify the analysis_cross_features task to use the script analysis_cross_features_alpha.py or analysis_cross_features_share_param.py, plus the associated configuration, e.g. analysis_ctr.10008.conf
run_all.sh / run_rgr_all.sh
Model incremental update is performed by specifying the online flag bit (second parameter):
ctr_feature_stat.sh reads the incremental training data specified by the configuration file CTR_ONLINE_FEATURES (ensuring that the negative-sample sampling rate is consistent with the original data) and generates a new mapping table; cal_features.sh reads the incremental training data and generates incremental training samples;
train_model.local.sh reads the incremental training samples and the current model parameters, performs incremental training, and generates a new ctr mapping table, etc.
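As a sketch of what the pipeline above computes, the following hypothetical Python reimplementation fits alpha by a grid-based least-squares match between x_i^alpha and the ranking rate rank_i/n, then buckets rates with floor(m * y_i). The function names and the grid search are illustrative, not the actual get_alpha.py.

```python
def fit_alpha(rates):
    """Fit alpha so that rate**alpha approximates each rate's ranking
    quantile rank_i/n, via least squares over a coarse grid
    (hypothetical reconstruction of what get_alpha.py might do)."""
    xs = sorted(r for r in rates if r > 0)
    n = len(xs)
    targets = [(i + 1) / n for i in range(n)]  # ranking rate rank_i/n
    best_alpha, best_err = None, float("inf")
    for k in range(1, 200):
        alpha = k / 200.0
        err = sum((x ** alpha - t) ** 2 for x, t in zip(xs, targets))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha

def bucketize(rate, alpha, m):
    """Aggregate a click/registration rate into one of m buckets:
    bucket = floor(m * rate**alpha), clipped to the last bucket."""
    return min(int(m * rate ** alpha), m - 1)
```

For example, if the true rates follow the fourth power of their ranking quantile, the grid search recovers alpha close to 0.25.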
Further, the handling of missing online sample features for the artificial intelligence algorithm model of the personalized recommendation engine comprises:
Solution 1: the scheme used initially. Copy a certain proportion of the samples and set the feature value to -1; the parameter learned for -1 then serves as the feature's parameter when the feature is absent. This solves the problem reasonably well.
Solution 2: when initializing the model, set the constant-term parameter directly from the statistical average of the samples, so that the model's baseline becomes the more reasonable statistical average; when features are missing, the model falls back to this average, which also solves the problem.
Solution 3: determine the constant term using Solution 2 while also copying samples and randomly dropping features or setting them to -1 as in Solution 1.
The invention has the beneficial effects that:
1. Fast statistical analysis queries: IndexR uses columnar storage, provides efficient indexing for very large data sets, and reduces IO by filtering out irrelevant data and quickly locating valid data. It uses Apache Drill as the upper-layer query engine and is especially suitable for ad-hoc OLAP queries.
2. Real-time data import: IndexR supports very high-speed real-time import of data. As soon as data reaches an IndexR node, it can be queried. Real-time data and historical data can be queried together, so the so-called T+1 architecture is unnecessary. Unlike other systems with similar functionality, IndexR never actively discards any data.
3. High hardware efficiency: compared with other systems, IndexR can run on inexpensive machines. Very good performance can be obtained without expensive SSDs, high-end CPUs, or minicomputers, although it will run faster on such hardware. While running on a JVM, it manually manages almost all of its memory, using carefully designed, compact data structures.
4. The cluster is highly available, easy to scale, and simple to manage: distributed systems have evolved to the point where high availability and scalability are standard. IndexR is characterized by a very simple, reliable architecture and few required configuration items.
5. Deep integration with the Hadoop ecosystem: IndexR stores its data in HDFS, which means these files can be processed with MapReduce or any Hadoop tool. Hive plug-ins are provided for various ETL-related work and off-line tasks. Spark integration is in progress and will be used for data mining and machine learning.
6. Highly compressed data format: IndexR stores data in columns and achieves very high compression rates, which significantly reduces IO and network overhead.
7. Convenient data management: IndexR can conveniently import and delete data, and supports modifying table schemas, e.g. adding, deleting, or modifying columns.
Detailed Description
An artificial intelligence algorithm model for building a personalized recommendation engine, comprising: logistic regression that estimates the occurrence probability of the target event through linear weighting and a sigmoid transformation:
1) model description
p(y=1 | x; w) = 1 / (1 + exp(-w·x))
where y is the target variable with values {0,1}, x is the feature vector, and w is the model parameter vector.
We introduce a hidden variable z, the degree of membership of sample x to sub-model z, with corresponding model parameters w_z; z follows a multinomial distribution with weights phi_z (the posterior membership is a softmax-like normalization over the components):
p(z | x; W) = phi_z
p(y | x, z; W) = h_z(x)^y * (1 - h_z(x))^(1-y), where h_z(x) = 1 / (1 + exp(-w_z·x))
p(y | x; W) = sum_z phi_z * h_z(x)^y * (1 - h_z(x))^(1-y)
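The mixture prediction above can be sketched as follows, assuming multinomial mixing weights phi_z (the document describes z as following a multinomial distribution; an input-dependent softmax gating would be a variant):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def mixture_predict(x, W, phi):
    """p(y=1|x) = sum_z phi_z * sigmoid(w_z . x): a mixture of
    logistic regressions with multinomial mixing weights phi."""
    return sum(p * sigmoid(sum(w * xi for w, xi in zip(wz, x)))
               for p, wz in zip(phi, W))
```

With two components of opposite sign and equal weight, the mixture prediction sits halfway between the two component outputs.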
2) model training
With hidden variables introduced, we train the model with the EM algorithm. For the dataset, the log-likelihood of the model satisfies (by Jensen's inequality):
L(W) = sum_i log sum_z p(y_i, z | x_i; W) >= sum_i sum_z Q_i(z) log [ p(y_i, z | x_i; W) / Q_i(z) ]
For the E-step of the EM algorithm, we want the last inequality to hold with equality, i.e., the quantity inside the log must be constant in z:
Q_i(z) = p(y_i, z | x_i; W) / sum_z' p(y_i, z' | x_i; W)
Q_i(z) = p(z | x_i, y_i; W)
where p(y, z | x; W) is defined as follows:
P(y,z|x;W) = p(y|x,z;W) p(z|x;W)
p(y | x, z; W) = h_z(x)^y (1 - h_z(x))^(1-y)
p(z | x; W) = phi_z
W = {w_z, phi_z}
For the M-step of the EM algorithm, we adjust W to maximize the lower bound of the likelihood function:
W = argmax_W sum_i sum_z Q_i(z) log [ p(y_i, z | x_i; W) / Q_i(z) ]
phi_z = (1/N) sum_i Q_i(z)
grad_{w_z} = sum_i Q_i(z) (y_i - h_z(x_i)) x_i
w_z <- w_z + eta * grad_{w_z}
3) Mixture Logistic Regression
We summarize the mixture logistic regression algorithm as follows; the M-step optimization can use stochastic gradient descent, or even FTRL-based stochastic gradient descent (modifying its gradient computation):
E-step: for each i, j:
Q_i(j) = phi_j p(y_i | x_i, j; W) / sum_k phi_k p(y_i | x_i, k; W)
M-step: for each j:
phi_j = (1/N) sum_i Q_i(j)
for each sample i:
w_j <- w_j + eta * Q_i(j) (y_i - h_j(x_i)) x_i
Repeat E-step and M-step until convergence. To avoid the components updating identically, phi (or the weights) should be initialized randomly; otherwise every logistic regression learns the same parameters. Because of the hidden variables, the likelihood function may be non-convex and the model can get trapped in a local optimum, so it may be necessary to train and evaluate several times.
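A toy end-to-end sketch of the EM loop above, assuming multinomial mixing weights and plain gradient steps in the M-step rather than the FTRL variant the text mentions; function names and hyperparameters are illustrative:

```python
import math, random

def sigmoid(t):
    # clamp to avoid overflow in exp for extreme logits
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, t))))

def em_mixture_lr(data, K, iters=50, lr=0.1, seed=0):
    """EM for a mixture of K logistic regressions.
    data: list of (feature_vector, label) with label in {0, 1}."""
    rng = random.Random(seed)
    d = len(data[0][0])
    # random init so the components do not stay identical
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(K)]
    phi = [rng.random() + 0.5 for _ in range(K)]
    total = sum(phi)
    phi = [p / total for p in phi]
    N = len(data)
    for _ in range(iters):
        # E-step: Q[i][z] = p(z | x_i, y_i; W), the posterior membership
        Q = []
        for x, y in data:
            joint = []
            for z in range(K):
                h = sigmoid(sum(w * xi for w, xi in zip(W[z], x)))
                joint.append(phi[z] * (h if y == 1 else 1.0 - h))
            norm = sum(joint) or 1e-12
            Q.append([j / norm for j in joint])
        # M-step: closed-form update for phi, gradient steps for each w_z
        phi = [sum(Q[i][z] for i in range(N)) / N for z in range(K)]
        for z in range(K):
            for i, (x, y) in enumerate(data):
                h = sigmoid(sum(w * xi for w, xi in zip(W[z], x)))
                g = Q[i][z] * (y - h)
                W[z] = [w + lr * g * xi for w, xi in zip(W[z], x)]
    return W, phi
```

On a tiny separable dataset (one feature plus a bias term), the mixture learns to score positive inputs above 0.5 and negative inputs below it.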
The feature aggregation method and training-data generation for the artificial intelligence algorithm model of the personalized recommendation engine comprise the following steps:
1) feature aggregation
The original feature aggregation was ranking-based: given a parent feature, its sub-features were sorted by click rate or registration rate and then uniformly aggregated into a number of buckets. The bucket assignments under such aggregation change from day to day, so the physical meaning of the aggregated features changes as well. Incremental learning, however, continues training from yesterday's result with new samples from the current day, which requires the meaning of the aggregated features to remain fixed. The aggregation method is therefore changed to be absolute-value based: absolute-value intervals are set, and features are aggregated into the interval matching their click rate or registration rate. Since the distribution of click or registration rates concentrates at small values, smaller intervals are used in the low-value region. Tests show this change has no obvious effect on model quality.
The effect of "smaller intervals in the low-value region" can equivalently be achieved by applying an exponential (power) transformation to the click or registration rate:
y_i = x_i^alpha
where x_i is the click rate or registration rate of feature i and y_i is the transformed value. Taking a ranking-based view, y_i is set to the ranking rate rank_i/n, and alpha is chosen to fit it:
alpha* = argmin_alpha sum_i (x_i^alpha - rank_i/n)^2
Taking the data of channel 10008 as an example, alpha comes out to about 0.22 for the click-rate feature and about 0.5 for the registration-rate feature.
The transformed y_i is then converted to an integer to complete the feature aggregation:
bucket_i = floor(m * y_i)
Because alpha is estimated from the ranking, y_i is distributed fairly uniformly over [0,1], so m is close to the number of features after aggregation; we set m = 1000 for the click-rate model and m = 500 for the registration-rate model.
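The absolute-value bucketing described above can be sketched as follows, assuming the floor(m * y_i) rule; the function name and dict-based interface are illustrative. Because the bucket boundaries are absolute rather than rank-based, the same rate always lands in the same bucket across days, which is what incremental training needs:

```python
def aggregate_features(ctr_by_feature, alpha, m):
    """Map each sub-feature to a bucket id floor(m * ctr**alpha),
    clipped to the last bucket. Absolute boundaries keep bucket
    meanings stable from day to day."""
    return {feat: min(int(m * ctr ** alpha), m - 1)
            for feat, ctr in ctr_by_feature.items()}
```

Two features with the same rate always share a bucket, regardless of how the rest of the day's data shifts.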
2) code flow
get_alpha.py
Calculates and tests the exponential transformation parameter alpha. Input: one floating-point number in (0,1) per line. Output: the count distribution of the input data over 100 buckets after the exponential transformation, and the alpha value.
ctr_feature_stat.sh
Modify the analysis_cross_features task to use the script analysis_cross_features_alpha.py or analysis_cross_features_share_param.py, plus the associated configuration, e.g. analysis_ctr.10008.conf, run_all.sh / run_rgr_all.sh
Model incremental update is performed by specifying the online flag bit (second parameter):
ctr_feature_stat.sh reads the incremental training data (ensuring that the negative-sample sampling rate is consistent with the original data) specified by the configuration file CTR_ONLINE_FEATURES and generates a new mapping table; cal_features.sh reads the incremental training data and generates incremental training samples; train_model.local.sh reads the incremental training samples and the current model parameters, performs incremental training, and generates a new ctr mapping table, etc.
When the LR model is used online to estimate the click or registration rate, the model has been trained on historical data; when new features appear online during the day, for example when a new creative package is submitted, the model has no parameters for those features and the estimates are biased.
The processing mode for the on-line sample characteristic loss of the artificial intelligence algorithm model for building the personalized recommendation engine comprises the following steps:
Solution 1: the scheme used initially. Copy a certain proportion of the samples and set the feature value to -1; the parameter learned for -1 then serves as the feature's parameter when the feature is absent. This solves the problem reasonably well.
Disadvantages: the samples require extra processing; when features are sparse, the small number of copied samples easily deviates from the original statistical distribution; when copying samples, the feature can also simply be dropped rather than set to -1.
Solution 2: when initializing the model, set the constant-term parameter directly from the statistical average of the samples, so that the model's baseline becomes the more reasonable statistical average; when features are missing, the model falls back to this average, which also solves the problem.
Advantages: compared with Solution 1, training converges to parameters with reasonable physical meaning, the constant term corresponding directly to the statistical average of the samples, and no special sample processing is required.
Disadvantages: co-occurrence relationships among the other features can still make parameters converge to unreasonable values.
Solution 3: determine the constant term using Solution 2 while also copying samples and randomly dropping features or setting them to -1 as in Solution 1.
On-line cold start
Currently, for new activities (a new PackageID appearing in the features) we use feature sampling of the samples: for each sample, a new sample is created with a certain probability and its feature value is changed to -1 (representing a new feature value). When the online system encounters a new activity, it looks up the model parameter table; if there is no model parameter for that activity, the activity's feature value is changed to -1 and the model parameter for the default feature value is used. This is equivalent to learning an "average" parameter for new features; for a new activity, we predict its next-day performance from the overall average. For new ad slots, we do not sample by default and do not score.
With online learning, a new activity accumulates some data during the day, so for the remaining hours of the day an estimate based on that data can beat the average estimate (after all, the differences between activities are large). The exploration of new items and the exploitation of known ones can be balanced with the explore/exploit approach discussed in the industry, commonly framed as a Multi-Armed Bandit problem: within a time horizon [1, ..., T], an action a is selected at each time t and yields a reward r(a, t); the goal is to maximize the total reward. A currently popular method is UCB (Upper Confidence Bound): at each step, select the action with the largest plausible reward (roughly, the estimate plus its variance), sacrificing some short-term reward for exploration in pursuit of the greatest total reward. Specifically, when a bidding request arrives, estimate each candidate campaign's eCPM and its upper bound, and deliver the campaign with the maximum eCPM upper bound (the bid may be placed at that eCPM). A new activity's eCPM upper bound is high, so it wins display opportunities; once enough feedback has accumulated, the variance of its eCPM shrinks, the upper bound approaches the estimate, and it wins impressions according to its real value. The FTRL logistic regression we use can estimate the variance and upper bound from statistics of the training process, which makes the UCB method easy to implement. The explore/exploit problem also has corresponding evaluation methods, but these require collecting some randomized data.
Feature merging needs to map features with the same meaning to the same value. There are currently three main cases of feature merging in the ctr model (/usr/local/services/dsp_mini_ctr_rgr_model/train/conf/.merge):
Festivals and holidays: merge maps dates to a holiday feature, which is 1 on holidays.
The same ad slot: the current cool resources are divided into three channels (10008, 10022, 10060) whose ad slot ids differ but whose positions are the same, so their data should be merged during training. This is currently set on the algorithm-parameter configuration page, and the data are converted into a merge file during model training.
Merging of creative packages: different creative packages may actually contain the same creative, so creative-package content is merged by md5 mapping before being processed into a package. Currently, the merge procedure is called periodically from the stat_rate entry and the file is updated in the early morning; see bin/group_package.
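The UCB-style campaign selection described above might be sketched as follows, assuming the upper bound is the eCPM estimate plus a multiple of its standard deviation (the exact bound used with the FTRL variance estimate is not specified in the text):

```python
import math

def ucb_score(mean_ecpm, var_ecpm, k=2.0):
    """Upper confidence bound: estimate plus k standard deviations.
    New campaigns with high variance get a high bound and thus win
    impressions until feedback shrinks the bound toward the estimate."""
    return mean_ecpm + k * math.sqrt(var_ecpm)

def pick_campaign(candidates, k=2.0):
    """candidates: {campaign_id: (mean_ecpm, var_ecpm)}.
    Return the id with the largest eCPM upper bound."""
    return max(candidates, key=lambda c: ucb_score(*candidates[c], k=k))
```

Note how a new campaign with a lower point estimate but large variance is picked first; once its variance drops, the established campaign wins again.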
Examples of the experiments
Take attaching game product id tags to a user as an example, and assume the games below are page (web) games.
1. Firstly, defining the format of the output result of the labeling:
userid->[gameid1,gameid2,gameid3,...]
2. Each page game has an official website, advertisements with landing pages, and a game type (strategy, action, legend, beauty, etc.).
3. For each user, assume we have the game types the user likes (data produced by another team).
4. Furthermore, we can collect these data:
Dwell time on the game's landing page (short, medium, long); the number of views of the game's landing page; the number of visits to the game's official website homepage; the number of game registrations/logins.
5. Each action has an occurrence time, divided into 7 time windows: within 1 day, 1-3 days, 3-7 days, 7-15 days, 15-30 days, 30-60 days, 60-90 days.
6. The user's actions on the game, their occurrence times, and whether the game's type matches the types the user likes are assembled into a feature vector, giving: userid_gameid -> [feature1, feature2, feature3, ...]
7. The target value of each feature vector, i.e. the click rate, is obtained from each userid's next-day PV and clicks on each game's official website.
8. Strip the userid, cookie, and gameid from the feature vectors and merge vectors with identical behavior features, thereby reducing the number of training samples. The final training-sample format is:
feature list (consisting of dimension IDs), click rate (number of clicks / total PV)
9. Inputting training sample data into a machine learning algorithm to generate an algorithm model;
10. Prediction:
i. Predict the ctr (interest degree) of a userid for a gameid: build a feature vector in the same format as the training samples from the userid's behavior list on the gameid, feed it into the model, and the model computes the ctr value;
ii. Sort by ctr in descending order and select the top N gameids as the user's game interest tags.
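Steps 5 and 10 above can be sketched as follows (the window edges come from step 5; the function names and data shapes are illustrative):

```python
WINDOWS = [1, 3, 7, 15, 30, 60, 90]  # day boundaries of the 7 windows

def time_window(days_ago):
    """Map an action's age in days to one of the 7 windows (0..6)."""
    for i, edge in enumerate(WINDOWS):
        if days_ago <= edge:
            return i
    return len(WINDOWS) - 1  # anything older falls in the last window

def top_game_tags(user_game_ctr, n=3):
    """Step 10: rank a user's predicted ctr per gameid and keep the
    top-N gameids as the user's interest tags."""
    ranked = sorted(user_game_ctr.items(), key=lambda kv: kv[1], reverse=True)
    return [gameid for gameid, _ in ranked[:n]]
```

For a user whose model scores are {"g1": 0.1, "g2": 0.9, "g3": 0.5}, the top-2 tags are g2 then g3.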
Intuitive models and mathematical interpretations
Consider, for simplicity, a model with only one feature, e.g. gender.
The model has a constant-term parameter, set as c; since gender is a discrete feature, it actually becomes two features g1, g2 that never occur together.
If each gender has 1000 samples, with 47 and 17 positive examples respectively, the corresponding click rates are 0.047 and 0.017 and the corresponding logits (log-odds) are -3 and -4, so the trained model parameters should satisfy the relations
c + g1 = -3
c + g2 = -4
Obviously there are infinitely many satisfying solutions; since training is generally initialized at 0, the parameters obtained at convergence stay relatively close to 0, for example c = -1, g1 = -2, g2 = -3.
When the gender feature is missing, only the constant term c remains, giving a click rate of sigmoid(-1), roughly 0.269, far higher than reality and pulled toward the original baseline of 0.5.
Application solution 1
Randomly copy part of the samples and set their gender feature to the sentinel value -1 (with weight g_-1). If the copied proportion is large enough, the distribution of this subset matches the overall distribution: click rate 0.032, corresponding logit -3.4.
The trained model should therefore satisfy
c + g_-1 = -3.4
c + g1 = -3
c + g2 = -4
There are still infinitely many solutions, but when the feature is missing the sentinel weight g_-1 is used, which matches the overall average distribution.
Application solution 2
Set the constant term c = -3.4 directly from the global distribution of the samples, so that the trained model satisfies
-3.4 + g1 = -3
-3.4 + g2 = -4
Solving gives g1 = 0.4 and g2 = -0.6. Each parameter now has a unique value with a reasonable meaning: c represents the baseline of the statistical sample, while g1 and g2 represent the influence of the different feature values relative to that baseline.
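The numbers in this worked example can be checked with the sigmoid: with c = -3.4, g1 = 0.4, g2 = -0.6, the model reproduces the observed click rates.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Solution 2 fixes the constant term at the logit of the global average,
# so the per-gender offsets are uniquely determined.
c, g1, g2 = -3.4, 0.4, -0.6

print(round(sigmoid(c), 3))       # prints 0.032, the overall click rate
print(round(sigmoid(c + g1), 3))  # prints 0.047, matching 47/1000
print(round(sigmoid(c + g2), 3))  # prints 0.018, close to the observed 17/1000
```

The small residual on the second gender (0.018 vs 0.017) comes from rounding the logit to -4 in the example.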
The problems that still exist are:
when there are multiple features, incomplete combinations still affect the parameters, such as adding one feature f, two possible values f1 and f2, and the trained model needs to satisfy
c+g1+f1=w1
c+g1+f2=w2
c+g2+f1=w3
c+g2+f2=w4
At this point, if c could be determined in advance, there would again be a unique solution. However, in reality not all combinations of feature g and feature f necessarily appear in the samples, so the model still lacks a unique solution and missing features still have some influence. But since the overall distribution of the samples (the constant term c) has been fixed, its contribution dominates and the remaining features only add weights on top of it, so the effect of missing features is smaller.
Application solution 3
Following Solution 2, c = -3.4.
After copying part of the samples, the model satisfies
c = -3.4 and c + g_-1 = -3.4 (hence g_-1 = 0)
c + g1 = -3
c + g2 = -4
A reasonable and unique parameter solution is directly available. For the multi-feature case the resulting system can be unsolvable exactly, since LR itself rests on a linearity assumption, and for data that does not satisfy linearity no perfect solution exists.
The above are only typical examples of the present invention, and besides, the present invention may have other embodiments, and all the technical solutions formed by equivalent substitutions or equivalent changes are within the scope of the present invention as claimed.

Claims (1)

1. An artificial intelligence algorithm model for building a personalized recommendation engine, characterized in that the logistic regression of the artificial intelligence algorithm model estimates the occurrence probability of a target event through linear weighting and a sigmoid transformation, comprising the following steps:
1) feature aggregation method
The feature aggregation method is absolute-value based: absolute-value intervals are set and features are aggregated into the interval matching their click rate or registration rate; since the distribution of click or registration rates concentrates at small values, smaller intervals are set in the low-value region; the same effect can be achieved by applying an exponential (power) transformation to the click rate or registration rate:
y_i = x_i^(alpha*)
where x_i is the click rate or registration rate of feature i and y_i is the transformation result; y_i is set to the ranking rate, and with a ranking-based method the value of alpha* is calculated:
alpha* = argmin_alpha sum_i (x_i^alpha - rank_i/n)^2
The transformed y_i is converted to an integer to realize feature aggregation:
bucket_i = floor(m * y_i)
Since alpha* is estimated from the ranking, y_i is distributed fairly uniformly over [0,1], so the setting of m is close to the number of features after aggregation; m is set to 1000 for the click-rate model and 500 for the registration-rate model;
2) code flow
get_alpha.py
Calculating and testing an exponential transformation parameter alpha*Inputting: and (3) outputting floating point numbers between 0 and 1 obtained by dividing the rows: distribution of counts in 100 buckets of input data after exponential transformation, and alpha*The value of the one or more of,
ctr_feature_stat.sh
The modified analysis_cross_features task uses analysis_cross_features_alpha.py or analysis_cross_features_share_param.py as its script, together with analysis_ctr.10008.conf; run_all
3) Incremental model updating is performed by specifying the online flag bit:
ctr_feature_stat.sh reads the incremental training data specified by the configuration file CTR_ONLINE_FEATURES and generates a new mapping table; _features.sh reads the incremental training data and generates incremental training samples;
The incremental training samples and the current model parameters are read, incremental training is performed, and a new ctr mapping table is generated; the ways of handling missing features in online samples in the artificial intelligence algorithm model for building the personalized recommendation engine include:
Solution 1: the protocol used initially; the sample is copied and every default feature value is set to -1, which serves as the feature-value parameter for any feature without a collected value;
Solution 2: when the model is initialized, the constant term of the model is set directly from the statistical average of the samples, so that the model's baseline becomes the more reasonable statistical average; when a feature is missing, the model regresses to the statistical average;
Solution 3: the constant term is determined using solution 2, while the sample is copied and features are discarded randomly, or solution 1 is used to set the values of features for which no value was collected to -1.
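The three missing-feature treatments can be sketched together. The function names, the dict-based sample representation, and the use of the logit for the constant term are assumptions layered on what the claim states (the -1 sentinel, the baseline at the statistical average, and random feature dropout):

```python
import math
import random

MISSING = -1.0  # solution 1: sentinel value for features never collected

def fill_missing(sample, n_features):
    """Solution 1: copy the sample and set every absent feature to -1."""
    filled = dict(sample)
    for f in range(n_features):
        filled.setdefault(f, MISSING)
    return filled

def bias_from_average(positives, total):
    """Solution 2: initialise the model's constant term so that, with
    all features missing, the prediction falls back to the statistical
    average rate (i.e. the logit of positives / total)."""
    p = positives / total
    return math.log(p / (1.0 - p))

def drop_features(sample, drop_prob, rng):
    """Solution 3 (in part): copy the sample and randomly discard
    features, so the model learns to cope with absent values."""
    return {f: v for f, v in sample.items() if rng.random() >= drop_prob}
```

Feeding bias_from_average(5, 100) back through a sigmoid returns 0.05, i.e. with every feature missing the model regresses to the statistical average rate, as solution 2 intends.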
CN201710393578.1A 2017-05-27 2017-05-27 Artificial intelligence algorithm model for building personalized recommendation engine Active CN107203628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710393578.1A CN107203628B (en) 2017-05-27 2017-05-27 Artificial intelligence algorithm model for building personalized recommendation engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710393578.1A CN107203628B (en) 2017-05-27 2017-05-27 Artificial intelligence algorithm model for building personalized recommendation engine

Publications (2)

Publication Number Publication Date
CN107203628A CN107203628A (en) 2017-09-26
CN107203628B true CN107203628B (en) 2021-04-09

Family

ID=59906630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710393578.1A Active CN107203628B (en) 2017-05-27 2017-05-27 Artificial intelligence algorithm model for building personalized recommendation engine

Country Status (1)

Country Link
CN (1) CN107203628B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442969B2 (en) * 2007-08-14 2013-05-14 John Nicholas Gross Location based news and search engine
US8700464B1 (en) * 2007-03-30 2014-04-15 Amazon Technologies, Inc. Monitoring user consumption of content
CN106056427A (en) * 2016-05-25 2016-10-26 中南大学 Spark-based big data hybrid model mobile recommending method
CN106127525A (en) * 2016-06-27 2016-11-16 浙江大学 A kind of TV shopping Method of Commodity Recommendation based on sorting algorithm

Also Published As

Publication number Publication date
CN107203628A (en) 2017-09-26

Similar Documents

Publication Publication Date Title
US11704366B2 (en) Methods and systems for associating internet devices
US9535938B2 (en) Efficient and fault-tolerant distributed algorithm for learning latent factor models through matrix factorization
US9348924B2 (en) Almost online large scale collaborative filtering based recommendation system
US11636394B2 (en) Differentiable user-item co-clustering
CN105512242B (en) A kind of parallel recommendation method based on social network structure
CN112749330B (en) Information pushing method, device, computer equipment and storage medium
US8364615B2 (en) Local graph partitioning using evolving sets
CN111898698B (en) Object processing method and device, storage medium and electronic equipment
Bhatia et al. A parallel fuzzy clustering algorithm for large graphs using Pregel
Pramanik et al. Discovery of closed high utility itemsets using a fast nature-inspired ant colony algorithm
Wang et al. QoS prediction of web services based on reputation-aware network embedding
Sundara Kumar et al. RETRACTED: Improving big data analytics data processing speed through map reduce scheduling and replica placement with HDFS using genetic optimization techniques
CN106407379A (en) Hadoop platform based movie recommendation method
US20190243923A1 (en) Online diverse set generation from partial-click feedback
Wang et al. Adaptive relation discovery from focusing seeds on large networks
CN110941771A (en) Commodity parallel dynamic pushing method in e-commerce platform
Jia et al. Dynamic group recommendation algorithm based on member activity level
CN107203628B (en) Artificial intelligence algorithm model for building personalized recommendation engine
CN116680090B (en) Edge computing network management method and platform based on big data
KR102791501B1 (en) Method of data sampling for active learning
Zhang et al. Heterogeneous information assisted bandit learning: Theory and application
CN113688934A (en) Migration learning based distributed expectation maximization financial data clustering method and system
Wen et al. Parallel naïve Bayes regression model-based collaborative filtering recommendation algorithm and its realisation on Hadoop for big data
Zhai et al. Scalable dynamic self-organising maps for mining massive textual data
Gao et al. An improved PSO-based clustering algorithm inspired by tissue-like P system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510665 Room 401, No.3, East Tangdong Road, Tianhe District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 510000 commercial building, No.5, Tangdong East Road, Tianhe District, Guangzhou City, Guangdong Province (Location: B-425)

Applicant before: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information

Address after: 510665 Room 401, No.3, East Tangdong Road, Tianhe District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 510665 Room 401, 3 Tangdong East Road, Tianhe District, Guangzhou City, Guangdong Province

Applicant before: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An Artificial Intelligence Algorithm Model for Building Personalized Recommendation Engines

Granted publication date: 20210409

Pledgee: Industrial Commercial Bank of China Ltd. Guangzhou branch Yuexiu

Pledgor: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY CO.,LTD.

Registration number: Y2024980024232