Background
In psychology, the OCEAN model describes human personality along five broad dimensions and is based on the five-factor model of personality (the Big Five). The five factors are openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism: O stands for Openness to experience, C for Conscientiousness, E for Extraversion, A for Agreeableness, and N for Neuroticism. These five factors provide a rich conceptual framework. In addition, previous research has found that the Big Five model is strongly related to people's behavior on social network sites.
Current personalized recommendation algorithms can be roughly classified into four categories:
(1) Demographic-based recommendation is the easiest method to implement: it computes the similarity between users from their basic profile information and then recommends to the current user the items that similar users like.
(2) Content-based recommendation was the most widely applied mechanism when recommendation engines first emerged. Its core idea is to compute the relevance between items or content from their metadata and then recommend similar items to a user based on the user's past preference records. This approach is mostly used in information applications: labels are extracted from each article as its keywords, and the similarity of two articles is then evaluated through these labels.
(3) Association-rule-based recommendation is more common in e-commerce systems and has proven to work well. Its practical meaning is that users who have purchased certain items tend to purchase certain others. The primary goal of an association-rule-based recommendation system is to mine association rules, i.e., sets of items that many users purchase together, so that the items in a set can be recommended alongside one another.
(4) Collaborative filtering is widely used in recommendation systems. It rests on the assumption that "things of a kind come together, and people of a mind fall into the same group": users who like the same items are more likely to share the same interests. Collaborative filtering is generally applied to systems with user ratings, where the ratings describe user preferences for items. It is regarded as a typical use of collective intelligence: it requires no special processing of the items and instead establishes associations between items through users. Currently, collaborative filtering recommendation systems are divided into two types: user-based and item-based.
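As an illustration only, user-based collaborative filtering can be sketched as follows. The rating data, the names `ratings`, `cosine`, and `recommend`, and the similarity-weighted average used for scoring are assumptions made for this sketch and are not taken from the prior art described above.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: score}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 3, "C": 5, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def recommend(target, ratings, top_n=1):
    """Score items the target has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, their_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], their_ratings)
        for item, score in their_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * score
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(
        ((scores[i] / weights[i], i) for i in scores if weights[i] > 0),
        reverse=True,
    )
    return [item for _, item in ranked[:top_n]]
```

Here the only item "alice" has not rated is "D", so it is the only candidate that can be returned for her.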
However, current personalized recommendation methods fall into the above four categories and do not make good use of users' personality traits for marketing. User behavior is not random; it contains many specific patterns. A user's social behavior online reflects the user's personality, and the personality in turn influences the behavior. Taking personality into account can therefore yield better results in precise online marketing, online product recommendation, social recommendation, and product design assistance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a personalized recommendation method based on an OCEAN model, which is used for performing personalized recommendation based on the personality of a user.
In order to achieve the above object, the invention provides a method for personalized recommendation based on an OCEAN model, which is characterized by comprising the following steps:
(1) establishing an OCEAN model of social network site users
(1.1) selecting a plurality of microblog accounts, administering a Big Five personality test to their users to obtain scores on the five personality dimensions, and taking these five scores as the OCEAN model of each tested user;
(1.2) acquiring page content by simulating a browser, capturing the microblog data of the tested users, and assembling the microblog data of each user into a text document;
(1.3) preprocessing the text document: filtering the text document, performing word segmentation, removing stop words, and storing the result in a specified database;
(1.4) importing the text documents of all tested users in the database into an LDA topic model, which outputs the topic probability distribution of each tested user's text document;
(1.5) taking the document topic probability distribution of each tested user as the sample input and the OCEAN model of that user as the sample output, training a BP neural network to establish a mapping model between a user's document topic distribution and the user's OCEAN model, and using the mapping model to predict the OCEAN model of social network site users;
(2) performing personalized recommendation for users based on the OCEAN model of social network site users
(2.1) user clustering
Based on the OCEAN models of the social network site users, dividing the users into K user groups with different personalities by using a K-means clustering algorithm;
(2.2) performing personalized recommendation for a target user according to the category to which the target user belongs
When a target user appears, first determining the cluster in which the target user is located; then taking all microblogs posted by each user in that cluster as a candidate set item, performing random text feature extraction on each candidate set item using term frequency-inverse document frequency (TF-IDF), and constructing an n-dimensional vector as the attribute data of each candidate set item, wherein each microblog is extracted as a one-dimensional vector;
assembling a text document from the microblog data of the target user, and performing random text feature extraction on the text document of the target user using TF-IDF to construct an m-dimensional vector as the preference data of the target user;
and calculating, according to the cosine similarity formula, the similarity between the preference data of the target user and the attribute data of each candidate set item, and recommending the candidate set items with the highest similarity to the target user as the recommendation set.
The object of the invention is achieved as follows:
The invention discloses an OCEAN-model-based personalized recommendation method that first establishes an OCEAN model of microblog users. When the model is established, the users' microblog text is fed into the LDA model, which discovers latent meaning in the text in an unsupervised manner and thereby improves prediction accuracy. Meanwhile, personalized recommendation is built on user clustering, which narrows the search range of users and reduces the computation required for real-time recommendation. Combining the user's OCEAN model with personalized recommendation brings the user's personality traits into the process, so that the recommendations better match the user's psychology and are more accurate.
In addition, the personalized recommendation method based on the OCEAN model has the following beneficial effects:
(1) An OCEAN model of microblog users is established, so that user personality is taken into account before conventional personalized recommendation is performed. By integrating the user's personality with the user's preferences, the method achieves high accuracy and better matches the user's psychology.
(2) When users are clustered, the initial cluster centers are not selected at random; instead, users with high microblog page visit counts are manually selected as the cluster centers, which better reduces isolated points.
Detailed Description
The following describes embodiments of the present invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that detailed descriptions of known functions and designs are omitted below where they would obscure the subject matter of the invention.
Examples
FIG. 1 is a flow chart of a method for personalized recommendation based on an OCEAN model according to the present invention.
In this embodiment, as shown in fig. 1, a method for personalized recommendation based on an OCEAN model in the present invention includes the following steps:
S1, selecting a plurality of microblog accounts, administering a Big Five personality test to their users to obtain scores on the five personality dimensions, and taking these five scores as the OCEAN model of each tested user;
In this embodiment, the Big Five Inventory (BFI), compiled in 1991 by the psychologist Oliver P. John of the University of California, Berkeley on the basis of OCEAN model theory, is used. It is a widely recognized personality test scale whose reliability and validity have been verified in many psychological experiments, and it is used in this application to obtain the user OCEAN models required for training.
S2, acquiring page content by simulating a browser and capturing the microblog data of the tested users, wherein the microblog data of a user is divided into two parts: the text document and the basic user information. The text document is the collection of all microblog texts posted by the user; the basic user information includes the registration time, the number of accounts the user follows, the number of microblogs, whether a personalized signature exists, and so on. The microblog data of each user is then assembled into one text document;
S3, preprocessing the text document: filtering the text document, performing word segmentation, removing stop words, and storing the result in a specified database;
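As an illustration only, the preprocessing step can be sketched as follows. The filtering rules (removing URLs and @mentions), the tiny stop-word list, and the name `preprocess` are assumptions made for this sketch. Splitting on non-alphanumeric characters stands in for word segmentation; Chinese microblog text would require a dedicated word segmenter instead.

```python
import re

# Hypothetical stop-word list; a real system would load a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}


def preprocess(text):
    """Filter noise, split the text into words, and drop stop words."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    # Simple stand-in for word segmentation: lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("Reading a book in the park @tom http://t.cn/abc")` keeps only the content words `reading`, `book`, and `park`.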
S4, importing the text documents of all tested users in the database into an LDA topic model, which outputs the topic probability distribution of each tested user's text document;
In this embodiment, the LDA topic model is shown in fig. 2, and the parameter definitions of the LDA topic model are shown in table 1;
TABLE 1: symbol interpretation
Input of the LDA topic model: the set of all user text documents, the number of topics K, and the hyper-parameters α and β, set according to the usual empirical values: K = 10, β = 0.01, γ = 20.
Output of the LDA topic model: the topic probability distribution of each user text document.
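As an illustration only, the input and output of the LDA topic model can be sketched with a small collapsed Gibbs sampler. The inference procedure is not specified above, so Gibbs sampling, the name `lda_gibbs`, the toy documents, and the values of `alpha`, `num_topics`, and `iters` used here are all assumptions; only β = 0.01 follows the setting given in this embodiment.

```python
import random


def lda_gibbs(docs, num_topics, alpha, beta, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]               # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total words per topic
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k, v = assignments[d][n], word_id[w]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Hypothetical preprocessed user documents (lists of words).
docs = [
    ["music", "concert", "guitar", "music"],
    ["stock", "market", "bank", "stock"],
    ["music", "guitar", "bank"],
]
theta = lda_gibbs(docs, num_topics=2, alpha=0.5, beta=0.01, iters=50)
```

Each row of `theta` is the topic probability distribution of one user's text document, which is the quantity passed on to the next step.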
S5, taking the document topic probability distribution of each tested user as the sample input and the OCEAN model of that user as the sample output, training a BP neural network to establish a mapping model between a user's document topic distribution and the user's OCEAN model, and using the mapping model to predict the OCEAN model of social network site users;
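As an illustration only, a BP (backpropagation) neural network that maps a topic distribution to five OCEAN scores can be sketched as follows. The network structure is not specified above, so the single hidden layer, the sigmoid units, the learning rate, the scaling of OCEAN scores to the range 0 to 1, the sample values, and the names `BPNetwork` and `mse` are all assumptions made for this sketch.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNetwork:
    """One-hidden-layer network trained with plain backpropagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w1]
        y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.w2]
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        h, y = self.forward(x)
        # Output-layer error terms for squared error with sigmoid units.
        d_out = [(y[o] - target[o]) * y[o] * (1 - y[o]) for o in range(len(y))]
        # Hidden-layer error terms, propagated back through w2 before it is updated.
        d_hid = [
            h[j] * (1 - h[j]) * sum(d_out[o] * self.w2[o][j] for o in range(len(y)))
            for j in range(len(h))
        ]
        for o, row in enumerate(self.w2):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * d_out[o] * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * d_hid[j] * v


def mse(net, samples):
    """Mean over samples of the summed squared error across the five outputs."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(net.predict(x), target))
        for x, target in samples
    ) / len(samples)


# Hypothetical training pairs: (topic distribution, OCEAN scores scaled to 0..1).
samples = [
    ([0.7, 0.2, 0.1], [0.8, 0.6, 0.4, 0.7, 0.3]),
    ([0.1, 0.3, 0.6], [0.2, 0.5, 0.9, 0.3, 0.6]),
]
net = BPNetwork(n_in=3, n_hidden=4, n_out=5)
error_before = mse(net, samples)
for _ in range(2000):
    for x, target in samples:
        net.train_step(x, target)
error_after = mse(net, samples)
```

After training, `net.predict` takes the topic distribution of a new user and returns five values that play the role of that user's predicted OCEAN model.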
S6, user clustering
Based on the OCEAN models of the social network site users, dividing the users into K user groups with different personalities by using a K-means clustering algorithm;
In this embodiment, the k-means clustering algorithm is chosen because it is efficient, is widely used for clustering large-scale data, and works well on low-dimensional data sets.
Let k be the input parameter of the k-means algorithm, representing the number of clusters the algorithm outputs after partitioning the data set. The data set consists of m data points, one for each user. The inputs are the number of clusters k and the users' OCEAN model data. The specific algorithm is as follows:
1) Define the five-dimensional data set of a user's OCEAN model as $I = \{i_1, i_2, \ldots, i_5\}$;
2) Retrieve all m users and record them as the set $U = \{u_1, u_2, \ldots, u_m\}$;
3) Manually select, from the m users, K users with high visit counts and different labels as the initial cluster centers, denoted $\{W_1, W_2, \ldots, W_K\}$;
4) Iterate over the user vectors: assign each vector to the nearest cluster center, calculate the mean of the objects in each cluster, and update the cluster centers until they no longer change.
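As an illustration only, the four steps above can be sketched as a k-means routine that accepts manually chosen initial centers instead of random ones. The squared Euclidean distance, the handling of empty clusters, the OCEAN score values, and the names `kmeans` and `squared_distance` are assumptions made for this sketch.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iter=100):
    """k-means with caller-supplied initial centers instead of random ones."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical five-dimensional OCEAN vectors, one per user.
users = [
    [4.5, 3.0, 4.0, 3.5, 2.0],
    [4.4, 3.1, 4.2, 3.4, 2.1],
    [2.0, 4.5, 2.0, 4.0, 3.5],
    [2.1, 4.4, 2.2, 4.1, 3.4],
]
# Manually selected users serve as the initial cluster centers.
labels, centers = kmeans(users, initial_centers=[users[0], users[2]])
```

With these values the first two users fall into one cluster and the last two into the other.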
S7, performing personalized recommendation for a target user according to the category to which the target user belongs
When a target user appears, first determining the cluster in which the target user is located; then taking all microblogs posted by each user in that cluster as a candidate set item, performing random text feature extraction on each candidate set item using term frequency-inverse document frequency (TF-IDF), and constructing an n-dimensional vector as the attribute data of each candidate set item, wherein each microblog is extracted as a one-dimensional vector;
For example, denote the set of all collected microblog candidate set items as $D = \{d_1, d_2, \ldots, d_N\}$ and the set of words appearing in all microblogs as $T = \{t_1, t_2, \ldots, t_n\}$. That is, there are N candidate set items to be processed, and these items contain n different words. Each item is ultimately represented by a vector; the j-th item is denoted $d_j = \{w_{1j}, w_{2j}, \ldots, w_{nj}\}$, where $w_{1j}$ denotes the weight of the 1st word $t_1$ in item j, and a larger value indicates greater importance. To represent the j-th item, the value of each component of $d_j$ must therefore be computed, using term frequency-inverse document frequency (TF-IDF), which is commonly used in information retrieval. The TF-IDF of the k-th word of the dictionary in the j-th microblog is:
$$\text{TF-IDF}(t_k, d_j) = \mathrm{TF}(t_k, d_j) \cdot \log \frac{N}{n_k}$$
where $\mathrm{TF}(t_k, d_j)$ is the number of times the k-th word appears in candidate set item j, and $n_k$ is the number of microblogs that include the k-th word.
The final weight of the k-th word in microblog j is obtained by the following formula:
$$w_{kj} = \frac{\text{TF-IDF}(t_k, d_j)}{\sqrt{\sum_{s=1}^{n} \text{TF-IDF}(t_s, d_j)^2}}$$
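As an illustration only, the TF-IDF weighting can be sketched as follows. The sketch assumes the common form in which the term frequency is multiplied by log(N / n_k) and each vector is then normalized to unit length; the toy documents and the name `tfidf_vectors` are likewise assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return the vocabulary and one length-normalized TF-IDF vector per document."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # n_k: number of documents that contain word k.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[w] * math.log(n_docs / doc_freq[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors


# Hypothetical candidate set items, already split into words.
docs = [["cat", "dog"], ["cat", "fish"], ["bird"]]
vocab, vectors = tfidf_vectors(docs)
```

A word that appears in fewer items receives a larger weight: in the first item, "dog" (found in one item) outweighs "cat" (found in two).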
assembling a text document from the microblog data of the target user, and performing random text feature extraction on the text document of the target user using TF-IDF to construct an m-dimensional vector as the preference data of the target user;
and calculating, according to the cosine similarity formula, the similarity between the preference data of the target user and the attribute data of each candidate set item, and recommending the candidate set items with the highest similarity to the target user as the recommendation set.
The cosine similarity formula is as follows. The scores of a user U and a candidate item I in the n-dimensional item space are expressed as vectors with components $U_a$ and $I_a$ respectively; the similarity $\cos(U, I)$ is then:
$$\cos(U, I) = \frac{\sum_{a=1}^{n} U_a I_a}{\sqrt{\sum_{a=1}^{n} U_a^2}\,\sqrt{\sum_{a=1}^{n} I_a^2}}$$
where $U_a$ is the preference value of the target user U for the a-th item, i.e., the value corresponding to the a-th item in the preference data, and $I_a$ is the value corresponding to the a-th item in the candidate set item.
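As an illustration only, computing the cosine similarity and selecting the recommendation set can be sketched as follows. The vectors, the item names, the `top_n` parameter, and the function names are assumptions made for this sketch, and both vectors are assumed to have the same dimension.

```python
import math


def cosine_similarity(u, i):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, i))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_i = math.sqrt(sum(b * b for b in i))
    if norm_u == 0 or norm_i == 0:
        return 0.0
    return dot / (norm_u * norm_i)


def recommend(user_vector, candidates, top_n=1):
    """Rank candidate items (name -> vector) by similarity to the user's preference vector."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(user_vector, candidates[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical preference data and candidate set item attribute data.
user_vector = [1.0, 0.0, 1.0]
candidates = {
    "d1": [1.0, 0.0, 1.0],
    "d2": [0.0, 1.0, 0.0],
    "d3": [1.0, 1.0, 0.0],
}
recommendation_set = recommend(user_vector, candidates, top_n=2)
```

Here "d1" points in the same direction as the user's preference vector and ranks first, "d3" shares one component and ranks second, and "d2" is orthogonal and is left out.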
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the invention as defined by the appended claims, all inventions that make use of the inventive concept are protected.