CN104503973A

CN104503973A - Recommendation method based on singular value decomposition and classifier combination

Info

Publication number: CN104503973A
Application number: CN201410648324.6A
Authority: CN
Inventors: 贝毅君; 郑丽梦; 刘智新; 刘二腾
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-11-14
Filing date: 2014-11-14
Publication date: 2015-04-08

Abstract

The invention discloses a fusion recommendation method based on singular value decomposition (SVD) and a classifier. Data preprocessing yields the average score and the rating probability distribution of each item. A singular value decomposition model is trained by stochastic gradient descent. For each user's set of rated items, the user's entropy set over the item categories is calculated, and a critical value of item uncertainty is determined. Whether the classifier is applied is decided by comparing the uncertainty of the item being predicted with the critical value, and a Top-N method recommends the N highest-scoring items among all of the user's unrated items. The invention analyzes users' historical rating data to generate personalized recommendations: the SVD algorithm gives the predicted score of a specified item i, the information entropy of the item with respect to each user determines whether to classify, and the classifier gives the final predicted score of the item, thereby improving the accuracy of the recommendation method.

Description

A Fusion Recommendation Method Based on Singular Value Decomposition and Classifier

Technical Field

The invention belongs to the technical field of recommendation algorithms, and in particular relates to a fusion recommendation method based on singular value decomposition and a classifier.

Background

Whether in e-commerce or on online music and video sites, information overload is one of the problems people face today, and recommender systems are a way to address it. By mining massive data and analyzing users' browsing histories and historical item ratings, a recommender system learns users' preferences in a given domain and thereby delivers personalized recommendations.

Commonly used recommendation algorithms currently fall into three categories: content-based recommendation, collaborative filtering, and hybrid recommendation. Content-based algorithms recommend items by extracting, modeling, and comparing features of the items in a user's history, but have difficulty handling complex attributes. Collaborative filtering recommends on the assumption that similar users share the same tastes, or that users give similar items the same scores, but it performs poorly with respect to new users, sparsity, and scalability. Hybrid algorithms mostly combine content-based and collaborative filtering algorithms and achieve better results under suitable fusion conditions.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a fusion recommendation method based on singular value decomposition and a classifier, intended to solve the problems of existing recommendation algorithms: poor performance for new users, low recommendation accuracy, and difficulty with complex attributes.

The embodiments of the present invention are implemented as a fusion recommendation method based on singular value decomposition and a classifier, comprising the following steps:

Step 1: preprocess the data and calculate the average score and probability distribution of each item.

Step 2: for each user, calculate the entropy of all rated items over the categories to which the items belong. Every item has categories to which it belongs. The user's preference for an item is taken as the target variable, and its value range is determined. From the user's set of rated items, the user's entropy on each category is calculated with the entropy formula, where {0,…,k} is the value range of the target variable and $P_{k,u}$ is the probability that user u's preference for an item i in a given category takes the value k. If user u has no entropy for some category, the maximum entropy over all categories is used as the entropy of that category. In the natural sciences, entropy expresses the degree of disorder of a system; accordingly, the larger the entropy of a category, the greater the uncertainty of the user's ratings of items in that category.

From user u's entropy set over the item categories, $E = \{E_1, E_2, \ldots, E_N\}$, the uncertainty $e_{i,u}$ of any item i of user u is obtained from E, where n is the number of categories of item i. The uncertainty of every element of user u's item rating subset is calculated, and the critical value of user u is found as $e_u = \min(e_{i,u})$. For all users, their item category entropy sets and critical values are obtained respectively.

Step 3: decide whether to use the classifier by comparing the uncertainty of the item being predicted with the critical value, and use the Top-N method to recommend the N highest-scoring items among all of the user's unrated items.

Further, the data preprocessing of step 1 specifically includes the following. The data set consists of a user set U, an item set I, and a rating matrix R, where ratings take values in {1,2,3,4,5}. Given the dimension f, each item i ∈ I has a vector $q_i$ that measures the extent to which the item possesses each influence factor, and each user u ∈ U has a vector $p_u$ that measures the user's interest in items with respect to those influence factors. Like the entries of $q_i$, the entries of $p_u$ may be positive or negative. The inner product $q_i^T p_u$ captures the relationship between user u and item i. Compared with the most basic scoring formula $R = \mu + b_u + b_i + q_i^T p_u$, the singular value decomposition algorithm used here removes the average score μ over all item ratings and replaces it with the average score $a_i$ of each item's ratings. The average score of every item must therefore be calculated before the learning algorithm for the singular value decomposition starts.

Further, the specific steps for calculating the average score of each item are as follows:

First, for an item with user rating records, collect the item's ratings, calculate their average, and calculate the probability distribution over the rating range [1,2,3,4,5].

Second, for an item without user rating records, use the average of all ratings as the item's average score, and set the probability of each value in the rating range to 0.2.
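The two preprocessing cases can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `item_statistics` and the dictionary layout are assumptions made here, and the ratings are those of Table 1(a) in the embodiment, with the 0 entries treated as "not rated" and therefore omitted.

```python
from collections import Counter

RATING_SCALE = [1, 2, 3, 4, 5]

def item_statistics(ratings, items):
    """ratings maps (user, item) -> score; returns (global mean, item means, item distributions)."""
    all_scores = list(ratings.values())
    global_mean = sum(all_scores) / len(all_scores)
    means, dists = {}, {}
    for item in items:
        scores = [s for (_, it), s in ratings.items() if it == item]
        if scores:
            # Case 1: the item has rating records.
            counts = Counter(scores)
            means[item] = sum(scores) / len(scores)
            dists[item] = [counts[v] / len(scores) for v in RATING_SCALE]
        else:
            # Case 2: no rating records, so fall back to the global mean
            # and a uniform distribution (0.2 per value on a 5-point scale).
            means[item] = global_mean
            dists[item] = [1 / len(RATING_SCALE)] * len(RATING_SCALE)
    return global_mean, means, dists

# Ratings from Table 1(a); unrated (0) cells are left out.
ratings = {
    ("u1", "i1"): 5, ("u1", "i2"): 2, ("u1", "i4"): 3,
    ("u2", "i1"): 4, ("u2", "i2"): 5, ("u2", "i3"): 4,
    ("u3", "i1"): 2, ("u3", "i2"): 4, ("u3", "i4"): 5,
}
# "i5" is a hypothetical item with no ratings, added to exercise case 2.
mu, a, P = item_statistics(ratings, ["i1", "i2", "i3", "i4", "i5"])
```

With these inputs the global mean rounds to 3.78 and the item means to 3.67, 3.67, 4 and 4, which agrees with the values quoted in the worked example.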

Further, to calculate the predicted score $r_{ui}$, the following must be computed: $\min_{p, q, b} \sum_{(u, i) \in R} (r_{ui} - b_u - b_i - a_i - p_u^T q_i)^2 + \lambda (\lVert p_u \rVert^2 + \lVert q_i \rVert^2 + \lVert b_u \rVert^2 + \lVert b_i \rVert^2)$, where λ is the regularization parameter, and the squared norms of $p_u$, $q_i$, $b_i$, and $b_u$ are added to prevent overfitting. The values of $b_u$, $b_i$, $q_i$, and $p_u$ can be learned iteratively by stochastic gradient descent or by alternating least squares; stochastic gradient descent is adopted here.
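As an illustration of how stochastic gradient descent applies to this objective, the per-rating updates can be written out as below. This is a sketch under stated assumptions rather than text from the patent: the error symbol $e_{ui}$ is introduced here, and the learning rate $\alpha$ is assumed to absorb the constant factor 2 that arises from differentiating the squared terms. The $b_u$ rule has the same form as the example update given elsewhere in the description.

```latex
\begin{align*}
e_{ui} &= r_{ui} - \left(a_i + b_u + b_i + q_i^{T} p_u\right) \\
b_u &\leftarrow b_u + \alpha \left(e_{ui} - \lambda b_u\right) \\
b_i &\leftarrow b_i + \alpha \left(e_{ui} - \lambda b_i\right) \\
p_u &\leftarrow p_u + \alpha \left(e_{ui}\, q_i - \lambda p_u\right) \\
q_i &\leftarrow q_i + \alpha \left(e_{ui}\, p_u - \lambda q_i\right)
\end{align*}
```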

Further, step 3 specifically includes the following. Using the probability distribution of every item over [1,2,3,4,5] obtained in step 1 and the item category entropy sets and critical values obtained in step 2, for any user u, compare the uncertainty $e_{i,u}$ of item i with respect to user u against user u's critical value $e_u$. If $e_{i,u} \ge e_u$, the item's uncertainty is large, so no classification is performed and the predicted score of the unrated item i is calculated directly by singular value decomposition. Otherwise, the probabilities of the rounded-up and rounded-down scores are obtained from the item's probability distribution and its SVD predicted score, and the value with the larger probability is taken as user u's predicted score for item i. Finally, user u's scores for all unrated items are calculated, and the N highest-scoring items are recommended to the user.

Further, the specific implementation steps by which the fusion recommendation method based on singular value decomposition and a classifier realizes personalized recommendation are as follows:

Step 1: first, preprocess the user data. The user set is U = {u₁,u₂,u₃}, the item set is I = {i₁,i₂,i₃,i₄}, and the category set is C = {C1,C2,C3,C4}. Calculate the average score μ over all items, the average score $a_i$ of each item, and each item's probability distribution over the rating range [1,2,3,4,5]; items without rating records are filled with μ. This gives μ = 3.78, $a_i$ = {3.67, 3.67, 4, 4}, and the probability distributions P₁ = {0,0.33,0,0.33,0.33}, P₂ = {0,0.33,0,0.33,0.33}, P₃ = {0,0,0,1,0}, P₄ = {0,0,0.5,0,0.5}.

Step 2: given the dimension f, the learning rate, and the number of iterations, use the users' historical rating data and the loss function $\min_{p, q, b} \sum_{(u, i) \in R} (r_{ui} - b_u - b_i - a_i - p_u^T q_i)^2 + \lambda (\lVert p_u \rVert^2 + \lVert q_i \rVert^2 + \lVert b_u \rVert^2 + \lVert b_i \rVert^2)$ to calculate the values of $b_u$, $b_i$, $p_u$, and $q_i$ by stochastic gradient descent. The procedure is as follows: given the current parameters $b_u$, $b_i$, $p_u$, and $q_i$, first take the partial derivative of the loss function with respect to each parameter, which yields updates such as $b_u \leftarrow b_u + \alpha (r - r_{ui} - \lambda b_u)$, where r is the true score user u gave item i, α is the learning rate, and λ is the regularization parameter. The parameters are updated with these partial derivatives, finally giving the SVD model $R = a_i + b_u + b_i + q_i^T p_u$.

Step 3: use the information entropy formula to calculate each user's entropy set over the item categories {C1,C2,C3,C4}. Take user u₁ as an example. The target variable takes values in {-1,0,1}, where -1 means the rating is less than or equal to 2 and below the item's average score (dislike), 0 means the rating equals 3 (neutral), and 1 means the rating is greater than 3 (like). This gives $E(C_u)$ = {0, 1.43, 0.60, 0}. From $E(C_u)$ and the item uncertainty formula, the uncertainty critical point $e_u$ is the minimum uncertainty over all items in user u's item rating subset, namely 0.68. The critical values of all users are obtained in the same way.

To predict u₁'s rating of i₃, first obtain a preliminary predicted score from the trained singular value decomposition model. The uncertainty of item i₃ with respect to user u₁ is 0 < $e_u$, so the classifier is needed. If the preliminary predicted score is 3.21, the score rounds up to 4 and down to 3. The probability of a score of 3 is P(3) = (4-3.21)*0 = 0, and the probability of a score of 4 is P(4) = (3.21-3)*1 = 0.21, where the factors 0 and 1 are the probabilities of the ratings 3 and 4 in the distribution P₃. Since P(4) > P(3), the predicted score is classified as 4; u₁'s final score for i₃ is 4, and i₃ is recommended to the user (in this example, N = 1).

The fusion recommendation method based on singular value decomposition and a classifier provided by the present invention analyzes users' historical rating data and finally generates personalized recommendations. The singular value decomposition algorithm gives the predicted score of a specified item i; the information entropy of the item with respect to each user is calculated, and whether to classify is decided according to how much the item's ratings fluctuate (the magnitude of the information entropy); the classifier then gives the item's final predicted score. The use of information entropy effectively excludes the influence on the classifier of items whose ratings fluctuate widely, which improves the classification and hence the accuracy of the predicted scores, thereby improving the accuracy of the recommendation method.

Brief Description of the Drawings

Fig. 1 is a flow chart of the fusion recommendation method based on singular value decomposition and a classifier provided by an embodiment of the present invention;

Fig. 2 is a schematic flow chart of Embodiment 1 of the fusion recommendation method based on singular value decomposition and a classifier provided by an embodiment of the present invention.

Detailed Description

In order to make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

The application principle of the present invention is further described below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the fusion recommendation method based on singular value decomposition and a classifier of the embodiment of the present invention comprises the following steps:

S101: preprocess the data and calculate the average score and probability distribution of each item.

S102: train the singular value decomposition model by stochastic gradient descent; for each user's set of rated items, calculate the user's entropy set over the item categories, and determine the critical value of item uncertainty.

S103: decide whether to use the classifier by comparing the uncertainty of the item being predicted with the critical value, and use the Top-N method to recommend the N highest-scoring items among all of the user's unrated items.

The specific steps of the present invention are as follows, with reference to Fig. 2:

First step: data preprocessing. The data set consists of a user set U, an item set I, and a rating matrix R, where ratings take values in {1,2,3,4,5}. Given the dimension f, each item i ∈ I has a vector $q_i$ that measures the extent to which the item possesses each influence factor, and each user u ∈ U has a vector $p_u$ that measures the user's interest in items with respect to those influence factors. Like the entries of $q_i$, the entries of $p_u$ may be positive or negative. The inner product $q_i^T p_u$ captures the relationship between user u and item i. Considering that in real life users are guided by the mode and the average of existing ratings when they rate, the present singular value decomposition algorithm, compared with the most basic scoring formula $R = \mu + b_u + b_i + q_i^T p_u$, removes the average score μ over all item ratings and replaces it with the average score $a_i$ of each item's ratings. The average score of every item must therefore be calculated before the learning algorithm for the singular value decomposition starts. The specific steps are as follows:

Step 1: for an item with user rating records, collect the item's ratings, calculate their average, and calculate the probability distribution over the rating range [1,2,3,4,5].

Step 2: for an item without user rating records, use the average of all ratings as the item's average score, and set the probability of each value in the rating range to 0.2.

Second step: to calculate the predicted score $r_{ui}$, the following must be computed: $\min_{p, q, b} \sum_{(u, i) \in R} (r_{ui} - b_u - b_i - a_i - p_u^T q_i)^2 + \lambda (\lVert p_u \rVert^2 + \lVert q_i \rVert^2 + \lVert b_u \rVert^2 + \lVert b_i \rVert^2)$, where λ is the regularization parameter, and the squared norms of $p_u$, $q_i$, $b_i$, and $b_u$ are added to prevent overfitting. The values of $b_u$, $b_i$, $q_i$, and $p_u$ can be learned iteratively by stochastic gradient descent (SGD) or by alternating least squares (ALS). Because SGD is faster than ALS, stochastic gradient descent is adopted.

Third step: for each user u, calculate the entropy of all of the user's rated items over the categories to which the items belong. Every item i ∈ I has categories to which it belongs, drawn from the set $C = \{C_1, C_2, C_3, \ldots, C_N\}$, where N is the number of categories. The user's preference for the item is taken as the target variable, and its value range is determined. From user u's set of rated items $I_u$, the user's entropy on each of the N categories is calculated with the entropy formula, where {0,…,k} is the value range of the target variable and $P_{k,u}$ is the probability that user u's preference for an item i in a given category takes the value k. If user u has no entropy for some category, the maximum entropy over all categories is used as the entropy of that category. In the natural sciences, entropy expresses the degree of disorder of a system; accordingly, the larger the entropy of a category, the greater the uncertainty of the user's ratings of items in that category.

Fourth step: the third step gives user u's entropy set over the item categories, $E = \{E_1, E_2, \ldots, E_N\}$. For any item i of user u, the uncertainty $e_{i,u}$ of the item is obtained from E, where n is the number of categories of item i. The uncertainty of every element of user u's item rating subset is calculated, and the critical value of user u is found as $e_u = \min(e_{i,u})$. For all users, their item category entropy sets and critical values are obtained respectively.

Fifth step: using the probability distribution of every item over [1,2,3,4,5] obtained in the first step and the item category entropy sets and critical values obtained in the fourth step, for any user u, compare the uncertainty $e_{i,u}$ of item i with respect to user u against user u's critical value $e_u$. If $e_{i,u} \ge e_u$, the item's uncertainty is large, so no classification is performed and the predicted score of the unrated item i is calculated directly by SVD. Otherwise, the probabilities of the rounded-up and rounded-down scores are obtained from the item's probability distribution and its SVD predicted score, and the value with the larger probability is taken as user u's predicted score for item i. Finally, user u's scores for all unrated items are calculated, and the N highest-scoring items are recommended to the user.
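The gating rule and the Top-N selection of the fifth step can be sketched as below. The names `final_score` and `top_n` are assumptions made for this illustration, the classifier is passed in as a callable so the sketch stays independent of how it is implemented, and the numbers in the demonstration are hypothetical.

```python
import heapq

def final_score(svd_score, uncertainty, threshold, classify):
    """Return the SVD score as-is for high-uncertainty items, else the classified score."""
    if uncertainty >= threshold:
        return svd_score          # uncertainty too large: skip the classifier
    return classify(svd_score)    # otherwise refine the score with the classifier

def top_n(scores, n):
    """scores maps item -> final predicted score for one user's unrated items."""
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical inputs: item -> (SVD score, item uncertainty for this user).
candidates = {"a": (3.21, 0.0), "b": (4.4, 0.9), "c": (2.7, 0.3)}
threshold = 0.68
# Stand-in classifier for the demonstration only (plain rounding).
stand_in = lambda s: float(round(s))

scores = {item: final_score(s, e, threshold, stand_in)
          for item, (s, e) in candidates.items()}
recommended = top_n(scores, 2)
```

Passing the classifier as a parameter is a design choice of this sketch; it keeps the threshold comparison separate from the rounding logic described in the same step.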

The stochastic gradient descent algorithm is used, with the following specific steps:

Step 1: fit the value of $b_u$ in the scoring equation $R = a_i + b_u + b_i + q_i^T p_u$ with the update $b_u \leftarrow b_u + \alpha (r - r_{ui} - \lambda b_u)$; $b_i$, $q_i$, and $p_u$ are fitted in the same way. Set the learning rate α, the regularization parameter λ, the maximum number of iterations, the initial values of $b_u$, $b_i$, $q_i$, and $p_u$, and the initial value of the root mean square error (RMSE).

Step 2: load the training data, obtain the predicted scores from the scoring equation $R = a_i + b_u + b_i + q_i^T p_u$, and update the values of $b_u$, $b_i$, $q_i$, and $p_u$ with their partial-derivative update equations.

Step 3: load the test data and calculate the RMSE, that is, the square root of the mean squared difference between the predicted score $R_{ui}$ and the true score, where (u, i) ranges over the test data and $n_{ui}$ is the number of ratings.

Step 4: compare the current RMSE with the initial or previous RMSE. If the current RMSE is greater than the initial or previous RMSE, or the number of iterations has reached the maximum, exit the iteration; otherwise, keep the current RMSE as the reference value for the next iteration and return to step 2.
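The four steps above can be sketched as a small training loop. This is an illustrative sketch, not the patent's code: the function names, the default hyperparameters, the zero-initialized biases, the small random initial factors, and computing the initial RMSE from the initial parameters are all assumptions made here. Every user and item in the evaluation data is assumed to appear in the training data or in the item-mean table.

```python
import math
import random

def predict(a, bu, bi, p, q, u, i):
    # Scoring equation R = a_i + b_u + b_i + q_i^T p_u
    return a[i] + bu[u] + bi[i] + sum(x * y for x, y in zip(q[i], p[u]))

def rmse(data, a, bu, bi, p, q):
    squared = sum((r - predict(a, bu, bi, p, q, u, i)) ** 2 for u, i, r in data)
    return math.sqrt(squared / len(data))

def train(train_data, test_data, a, f=2, alpha=0.01, lam=0.05, max_iters=100, seed=0):
    """a maps item -> item mean; returns (parameters, list of RMSE values per pass)."""
    rng = random.Random(seed)
    users = {u for u, _, _ in train_data}
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in a}
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(f)] for u in users}
    q = {i: [rng.uniform(-0.1, 0.1) for _ in range(f)] for i in a}
    history = [rmse(test_data, a, bu, bi, p, q)]      # step 1: initial RMSE
    for _ in range(max_iters):
        for u, i, r in train_data:                    # step 2: one SGD pass
            e = r - predict(a, bu, bi, p, q, u, i)
            bu[u] += alpha * (e - lam * bu[u])
            bi[i] += alpha * (e - lam * bi[i])
            for k in range(f):
                pu, qi = p[u][k], q[i][k]             # keep old values for both updates
                p[u][k] += alpha * (e * qi - lam * pu)
                q[i][k] += alpha * (e * pu - lam * qi)
        history.append(rmse(test_data, a, bu, bi, p, q))  # step 3: evaluate
        if history[-1] > history[-2]:                 # step 4: stop when RMSE rises
            break
    return (bu, bi, p, q), history

# Table 1(a) ratings, reused as both training and evaluation data for brevity only.
data = [("u1", "i1", 5), ("u1", "i2", 2), ("u1", "i4", 3),
        ("u2", "i1", 4), ("u2", "i2", 5), ("u2", "i3", 4),
        ("u3", "i1", 2), ("u3", "i2", 4), ("u3", "i4", 5)]
item_means = {"i1": 11 / 3, "i2": 11 / 3, "i3": 4.0, "i4": 4.0}
params, history = train(data, data, item_means)
```

In practice the evaluation data would be held out from training; reusing the same nine ratings here only keeps the example short.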

With the probability distribution of every item over [1,2,3,4,5] and the item category entropy sets and critical values thus obtained, for any user u, compare the uncertainty $e_{i,u}$ of item i with respect to user u against user u's critical value $e_u$. If $e_{i,u} \ge e_u$, the item's uncertainty is large, so no classification is performed and the predicted score of the unrated item i is calculated directly with the singular value decomposition formula $R = a_i + b_u + b_i + q_i^T p_u$. Otherwise, the probabilities of the rounded-up and rounded-down scores are obtained from the item's probability distribution and its SVD predicted score, and the value with the larger probability is taken as user u's predicted score for item i. Finally, user u's scores for all unrated items are calculated, and the N highest-scoring items are recommended to the user.

With the recommendation algorithm based on the fusion of singular value decomposition and a classifier provided by the present invention, personalized recommendation can be realized. Taking the rating matrix and the item category table in Table 1 as an example ((a) rating matrix; (b) item category table), the specific implementation steps are as follows:

Step 1: first, preprocess the user data; the rating matrix is given in Table 1(a). The user set is U = {u₁,u₂,u₃}, the item set is I = {i₁,i₂,i₃,i₄}, and the category set is C = {C1,C2,C3,C4}. Following the first step of the claims, calculate the average score μ over all items, the average score $a_i$ of each item, and each item's probability distribution over the rating range [1,2,3,4,5]; items without rating records are filled with μ. This gives μ = 3.78, $a_i$ = {3.67, 3.67, 4, 4}, and the probability distributions P₁ = {0,0.33,0,0.33,0.33}, P₂ = {0,0.33,0,0.33,0.33}, P₃ = {0,0,0,1,0}, P₄ = {0,0,0.5,0,0.5}.

Step 2: given the dimension f, the learning rate, and the number of iterations, use the users' historical rating data and the loss function $\min_{p, q, b} \sum_{(u, i) \in R} (r_{ui} - b_u - b_i - a_i - p_u^T q_i)^2 + \lambda (\lVert p_u \rVert^2 + \lVert q_i \rVert^2 + \lVert b_u \rVert^2 + \lVert b_i \rVert^2)$ to calculate the values of $b_u$, $b_i$, $p_u$, and $q_i$ by stochastic gradient descent. The main procedure is as follows: given the current parameters $b_u$, $b_i$, $p_u$, and $q_i$, first take the partial derivative of the loss function with respect to each parameter, which yields updates such as $b_u \leftarrow b_u + \alpha (r - r_{ui} - \lambda b_u)$, where r is the true score user u gave item i, α is the learning rate, and λ is the regularization parameter. The parameters are updated with these partial derivatives, finally giving the SVD model $R = a_i + b_u + b_i + q_i^T p_u$.

Step 3: following Table 1(b) and the third of the specific steps of the present invention, use the information entropy formula to calculate each user's entropy set over the item categories {C1,C2,C3,C4}. Take user u₁ as an example. In this example the target variable takes values in {-1,0,1}, where -1 means the rating is less than or equal to 2 and below the item's average score (dislike), 0 means the rating equals 3 (neutral), and 1 means the rating is greater than 3 (like). This gives $E(C_u)$ = {0, 1.43, 0.60, 0}. From $E(C_u)$ and the item uncertainty formula, the uncertainty critical point $e_u$ is the minimum uncertainty over all items in user u's item rating subset, namely 0.68. The critical values of all users are obtained in the same way.

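Table 1

|    | i₁ | i₂ | i₃ | i₄ |
|----|----|----|----|----|
| u₁ | 5  | 2  | 0  | 3  |
| u₂ | 4  | 5  | 4  | 0  |
| u₃ | 2  | 4  | 0  | 5  |

(a) Rating matrix (ratings range from 1 to 5; 0 marks an item the user has not rated)

| Item | C          |
|------|------------|
| i₁   | {C1,C2}    |
| i₂   | {C2,C3,C4} |
| i₃   | {C4}       |
| i₄   | {C2,C3}    |

(b) Item category table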
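The entropy set and critical value quoted for user u₁ can be checked against Table 1 with the sketch below. The entropy and uncertainty formulas themselves are not reproduced in the text, so two things here are assumptions inferred from the quoted numbers rather than statements of the patent's formulas: the values 1.43 and 0.60 are matched by summing $-\log_{10} p$ over the observed preference values (standard Shannon entropy, available via `shannon=True`, gives different values), and the critical value 0.68 is matched by averaging the category entropies of an item. The mapping of ratings below 3 to -1 is also a simplification of the stated rule, which additionally compares with the item average.

```python
import math
from collections import Counter

def preference(rating):
    """Target variable of the worked example: 1 like, 0 neutral, -1 dislike (simplified)."""
    if rating > 3:
        return 1
    if rating == 3:
        return 0
    return -1

def category_entropies(user_ratings, item_categories, categories, shannon=False):
    """user_ratings maps item -> rating for one user's rated items."""
    entropies = {}
    for c in categories:
        prefs = [preference(r) for it, r in user_ratings.items()
                 if c in item_categories[it]]
        if not prefs:
            continue
        probs = [n / len(prefs) for n in Counter(prefs).values()]
        if shannon:
            entropies[c] = sum(-p * math.log2(p) for p in probs)
        else:
            # Assumed form that reproduces the quoted 1.43 and 0.60.
            entropies[c] = sum(-math.log10(p) for p in probs)
    # A category with no rated items takes the maximum entropy of the others.
    fallback = max(entropies.values(), default=0.0)
    return {c: entropies.get(c, fallback) for c in categories}

def item_uncertainty(item, entropies, item_categories):
    # Assumed form: mean of the entropies of the item's n categories.
    cats = item_categories[item]
    return sum(entropies[c] for c in cats) / len(cats)

categories = ["C1", "C2", "C3", "C4"]
item_categories = {"i1": {"C1", "C2"}, "i2": {"C2", "C3", "C4"},
                   "i3": {"C4"}, "i4": {"C2", "C3"}}
u1_ratings = {"i1": 5, "i2": 2, "i4": 3}   # row u₁ of Table 1(a), unrated i₃ omitted

entropies = category_entropies(u1_ratings, item_categories, categories)
threshold = min(item_uncertainty(i, entropies, item_categories) for i in u1_ratings)
i3_uncertainty = item_uncertainty("i3", entropies, item_categories)
```

Under these assumptions the unrated item i₃, which belongs only to C4, has uncertainty 0, which is below the critical value, so its prediction is passed to the classifier.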

To predict u₁'s rating of i₃, first obtain a preliminary predicted score from the trained SVD model. The uncertainty of item i₃ with respect to user u₁ is 0 < $e_u$, so the classifier is needed. If the preliminary predicted score is 3.21, the score rounds up to 4 and down to 3. The probability of a score of 3 is P(3) = (4-3.21)*0 = 0, and the probability of a score of 4 is P(4) = (3.21-3)*1 = 0.21, where the factors 0 and 1 are the probabilities of the ratings 3 and 4 in the distribution P₃. Since P(4) > P(3), the predicted score is classified as 4; u₁'s final score for i₃ is 4, and i₃ is recommended to the user (in this example, N = 1).
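The rounding classifier used in this example can be sketched as follows. The function name `classify` is an assumption, and so are two behaviors the text does not specify: ties are resolved toward the lower score, and predictions outside the rating scale are clamped to it before rounding.

```python
import math

def classify(svd_score, distribution, scale=(1, 2, 3, 4, 5)):
    """Pick the rounded-down or rounded-up score, weighted by the item's rating distribution."""
    # Assumption: clamp out-of-range predictions to the rating scale.
    score = min(max(svd_score, scale[0]), scale[-1])
    lo, hi = math.floor(score), math.ceil(score)
    if lo == hi:
        return lo
    # Weight each candidate by its closeness to the SVD score and by the
    # probability of that rating in the item's distribution.
    p_lo = (hi - score) * distribution[scale.index(lo)]
    p_hi = (score - lo) * distribution[scale.index(hi)]
    # Assumption: ties go to the lower score.
    return hi if p_hi > p_lo else lo

P3 = [0, 0, 0, 1, 0]          # distribution of i₃ over [1,2,3,4,5]
predicted = classify(3.21, P3)
```

For the inputs of the worked example, the weights are 0.79 × 0 for a score of 3 and 0.21 × 1 for a score of 4, so the function returns 4.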

The present invention uses the singular value decomposition algorithm to obtain the predicted score of a specified item i, decides whether to classify by calculating the information entropy of the item with respect to each user, and finally obtains the item's final predicted score through the classifier, thereby improving the accuracy of the recommendation method.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A recommendation method based on the fusion of singular value decomposition and a classifier, characterized in that the method comprises the following steps:

Step 1: preprocess the data and calculate the average score and the rating probability distribution of each item;

Step 2: for each user, calculate the entropy, over the item categories, of all items the user has rated. Each item belongs to one or more categories, and the user's preference for an item is set as the target variable with a defined value range. Based on the user's set of rated items, the user's entropy on each category is calculated with the information entropy formula, where {0, ..., k} is the value range of the target variable and P_{k,u} is the probability that, within a given category, the preference value of user u for an item i is k. If user u has no entropy for some category, the maximum of the entropies over all categories is used as the entropy of that category;

From the user's entropy set on the item categories, E = {E₁, E₂, ..., E_n}, the uncertainty e_{i,u} of any item i for the user is obtained according to the entropy set E, where n is the number of categories of item i. The uncertainty of every element of the user's rated-item subset is calculated, and the user's lower critical value e_u = min(e_{i,u}) is found. The category entropy set and the critical value are obtained in this way for every user;

Step 3: decide whether to use the classifier by comparing the uncertainty of the item to be predicted with the critical value, and use the Top-N method to recommend the N highest-scored items among all items the user has not rated.
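The decision rule and the Top-N selection of step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `recommend`, the item names "x" and "y", their scores and uncertainties, and the stand-in classifier passed as `classify` are all hypothetical and do not come from the claims.

```python
import heapq
import math


def recommend(svd_scores, uncertainties, e_u, classify, n):
    """Return the n highest-scored unrated items of one user.

    svd_scores:    item -> preliminary SVD score of an unrated item
    uncertainties: item -> uncertainty e_{i,u} of that item for the user
    e_u:           critical value of the user
    classify:      callable (item, score) -> classified score
    """
    final = {}
    for item, score in svd_scores.items():
        if uncertainties[item] >= e_u:
            final[item] = score  # uncertainty at or above e_u: keep the SVD score
        else:
            final[item] = classify(item, score)  # below e_u: use the classifier
    return heapq.nlargest(n, final, key=final.get)


# Hypothetical data; the lambda is only a stand-in for the real classifier.
scores = {"x": 3.21, "y": 3.9}
unc = {"x": 0.0, "y": 1.0}
print(recommend(scores, unc, 0.68, lambda item, s: float(math.ceil(s)), 1))  # ['x']
```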

2. The recommendation method based on the fusion of singular value decomposition and a classifier according to claim 1, characterized in that the data preprocessing of step 1 specifically comprises: the data set consists of a user set U, an item set I, and a rating matrix R, where the ratings in R take values in {1, 2, 3, 4, 5}. Given a dimension f, each item i ∈ I has a vector q_i that measures the extent to which the item possesses each latent factor, and each user u ∈ U has a vector p_u that measures the user's interest in items with respect to those factors; like the values in q_i, its values may be positive or negative. The inner product q_i^T p_u captures the relationship between user u and item i. Compared with the most basic rating formula R = μ + b_u + b_i + q_i^T p_u, the singular value decomposition algorithm used here removes the average μ of all item ratings and replaces it with the average rating a_i of each item. The average score of each item is calculated before the learning algorithm of the singular value decomposition is started.

3. The recommendation method based on the fusion of singular value decomposition and a classifier according to claim 2, characterized in that the average score of each item is calculated as follows:

First step: for an item that has user rating records, collect the ratings of the item and calculate its average score as a_i = \frac{1}{n_i} \sum_{j=1}^{n_i} r_j, where n_i is the total number of users who have rated item i and r_j is the rating given to item i by each of those users; also calculate the probability distribution of the item's ratings over the rating range [1, 2, 3, 4, 5];

Second step: for an item that has no user rating records, use the average of all ratings as the item's average score, and set the probability of each value in the rating range to 0.2.
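The two steps of claim 3 can be sketched in Python as follows, using the ratings of Table 1(a). The sketch assumes that an unrated entry is encoded as 0, as Table 1(a) appears to do; the function name `item_statistics` is illustrative only.

```python
def item_statistics(ratings, scale=(1, 2, 3, 4, 5)):
    """ratings: one row per user, one column per item, 0 = unrated (assumption).

    Returns (mu, averages, distributions).
    """
    all_scores = [r for row in ratings for r in row if r != 0]
    mu = sum(all_scores) / len(all_scores)  # average of all ratings
    averages, distributions = [], []
    for i in range(len(ratings[0])):
        scores = [row[i] for row in ratings if row[i] != 0]
        if scores:  # first step: the item has rating records
            averages.append(sum(scores) / len(scores))
            distributions.append([scores.count(s) / len(scores) for s in scale])
        else:  # second step: no records, use mu and a uniform distribution
            averages.append(mu)
            distributions.append([1 / len(scale)] * len(scale))
    return mu, averages, distributions


table_1a = [
    [5, 2, 0, 3],  # u1
    [4, 5, 4, 0],  # u2
    [2, 4, 0, 5],  # u3
]
mu, averages, distributions = item_statistics(table_1a)
print(round(mu, 2), [round(a, 2) for a in averages])  # 3.78 [3.67, 3.67, 4.0, 4.0]
```

With this input the distribution of i₄ comes out as [0, 0, 0.5, 0, 0.5], which agrees with P₄ in the worked example of claim 6.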

4. The recommendation method based on the fusion of singular value decomposition and a classifier according to claim 1, characterized in that computing the predicted score r_ui requires solving

\min_{p, q, b} \sum_{(u, i) \in R} (r_{ui} - b_u - b_i - a_i - p_u^T q_i)^2 + \lambda (\|p_u\|^2 + \|q_i\|^2 + \|b_u\|^2 + \|b_i\|^2)

where λ is the regularization parameter. The values of b_u, b_i, q_i, and p_u are learned iteratively by stochastic gradient descent or by alternating least squares; stochastic gradient descent is adopted here, with the following steps:

Step 1: use the update b_u ← b_u + α * (r - r_ui - λ * b_u) to fit the value of b_u in the rating equation R = a_i + b_u + b_i + q_i^T p_u, and fit b_i, q_i, and p_u in the same manner; set the learning rate α, the regularization parameter λ, the maximum number of iterations, the initial values of b_u, b_i, q_i, and p_u, and the initial root mean square error (RMSE);

Step 2: load the training data, obtain the predicted scores from the rating equation R = a_i + b_u + b_i + q_i^T p_u, and update the values of b_u, b_i, q_i, and p_u using their partial-derivative equations;

Step 3: load the test data and calculate the RMSE between the predicted scores R_ui and the true scores, where u and i belong to the test data and n_ui is the number of ratings;

Step 4: compare the previous RMSE with the current RMSE. If the current RMSE is greater than the initial or previous RMSE, or the number of iterations has reached the maximum, exit the iteration; otherwise, carry the current RMSE forward to the next iteration and return to step 2.
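The four steps above can be sketched in Python as follows. The claim spells out only the update for b_u; the updates for b_i, p_u, and q_i below are the usual regularized matrix-factorization updates and are an assumption about what "in the same manner" means. The hyperparameter values, the random initialization, and the reuse of the training ratings as test data in the usage lines are likewise illustrative choices.

```python
import math
import random


def predict(a, bu, bi, p, q, u, i):
    """Rating equation R = a_i + b_u + b_i + q_i^T p_u."""
    return a[i] + bu[u] + bi[i] + sum(x * y for x, y in zip(q[i], p[u]))


def rmse(data, a, bu, bi, p, q):
    """Root mean square error over (user, item, rating) triples."""
    se = sum((r - predict(a, bu, bi, p, q, u, i)) ** 2 for u, i, r in data)
    return math.sqrt(se / len(data))


def train_svd(train, test, a, n_users, n_items, f=2, alpha=0.01, lam=0.05,
              max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initial parameters and initial RMSE.
    bu, bi = [0.0] * n_users, [0.0] * n_items
    p = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    history = [rmse(test, a, bu, bi, p, q)]
    for _ in range(max_iter):
        # Step 2: one pass of stochastic gradient descent over the training data.
        for u, i, r in train:
            err = r - predict(a, bu, bi, p, q, u, i)
            bu[u] += alpha * (err - lam * bu[u])
            bi[i] += alpha * (err - lam * bi[i])  # assumed analogue of the b_u update
            for k in range(f):
                pu, qi = p[u][k], q[i][k]
                p[u][k] += alpha * (err * qi - lam * pu)  # assumed update
                q[i][k] += alpha * (err * pu - lam * qi)  # assumed update
        # Step 3: RMSE on the test data.
        history.append(rmse(test, a, bu, bi, p, q))
        # Step 4: stop as soon as the RMSE gets worse.
        if history[-1] > history[-2]:
            break
    return bu, bi, p, q, history


# Ratings of Table 1(a) as (user, item, rating) triples, unrated entries omitted.
ratings = [(0, 0, 5), (0, 1, 2), (0, 3, 3), (1, 0, 4), (1, 1, 5), (1, 2, 4),
           (2, 0, 2), (2, 1, 4), (2, 3, 5)]
item_avg = [11 / 3, 11 / 3, 4.0, 4.0]
bu, bi, p, q, history = train_svd(ratings, ratings, item_avg, n_users=3, n_items=4)
print(len(history) - 1, "passes run")
```

The returned parameters are those of the last pass, including the pass on which the RMSE rose; a variant that keeps the best parameters seen so far would need a copy of them per pass.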

5. The recommendation method based on the fusion of singular value decomposition and a classifier according to claim 1 or 2, characterized in that step 3 specifically comprises: using the probability distribution of every item over [1, 2, 3, 4, 5] obtained in step 1 and the item category entropy sets and critical values obtained in step 2, for any user u, compare the uncertainty e_{i,u} of item i for user u with the critical value e_u of user u. If e_{i,u} >= e_u, the uncertainty of the item is relatively large, so no classification is performed and the singular value decomposition formula R = a_i + b_u + b_i + q_i^T p_u is used directly to calculate the predicted score of the unrated item i. Otherwise, the probabilities of the rounded-up and rounded-down values are obtained from the item's probability distribution and its SVD predicted score, and the value with the larger probability is taken as the predicted score of user u for item i. Finally, the score of user u for every unrated item is calculated, and the N highest-scored items are recommended to the user.

6. The recommendation method based on the fusion of singular value decomposition and a classifier according to claim 1 or 2, characterized in that the specific steps by which the method realizes personalized recommendation are as follows:

Step 1: first preprocess the user data. The user set U = {u₁, u₂, u₃}, the item set I = {i₁, i₂, i₃, i₄}, and the category set C = {C1, C2, C3, C4} are known. Calculate the average score μ over all items, the average score a_i of each item, and the probability distribution of each item over the rating range [1, 2, 3, 4, 5]; items without rating records are filled with μ. This gives μ = 3.78, a_i = {3.67, 3.67, 4, 4}, and the probability distributions P₁ = {0, 0.33, 0, 0.33, 0.33}, P₂ = {0, 0.33, 0, 0.33, 0.33}, P₃ = {0, 0, 0, 1, 0}, P₄ = {0, 0, 0.5, 0, 0.5};

Step 2: given the dimension f, the learning rate, and the number of iterations, use the users' historical rating data and the loss equation

\min_{p, q, b} \sum_{(u, i) \in R} (r_{ui} - b_u - b_i - a_i - p_u^T q_i)^2 + \lambda (\|p_u\|^2 + \|q_i\|^2 + \|b_u\|^2 + \|b_i\|^2)

to calculate the values of b_u, b_i, p_u, and q_i by stochastic gradient descent. The procedure is as follows: for the given parameters b_u, b_i, p_u, and q_i, first take the partial derivative of the loss equation with respect to each parameter, for example b_u ← b_u + α * (r - r_ui - λ * b_u), where r is the true score that user u gave to item i, α is the learning rate, and λ is the regularization parameter; then update each parameter with its partial derivative, finally obtaining the SVD model R = a_i + b_u + b_i + q_i^T p_u;

Step 3: use the information entropy formula to calculate each user's set of entropies over the item categories {C1, C2, C3, C4}. For user u₁, for example, the target variable takes values in {-1, 0, 1}, where -1 means the rating is less than or equal to 2 and below the item's average score (dislike), 0 means the rating equals 3 (neutral), and 1 means the rating is greater than 3 (like). Thus E(C_u) = {0, 1.43, 0.60, 0}. From E(C_u) and the item uncertainty formula, the uncertainty critical value e_u is the minimum uncertainty over all items in the rated-item subset of user u, namely 0.68. The critical values of all users are obtained in the same way;

To predict the rating of u₁ for i₃, first obtain a preliminary predicted score from the trained SVD model. The uncertainty of item i₃ for user u₁ is 0, which is less than e_u, so the prediction must pass through the classifier. If the preliminary predicted score is 3.21, rounding it up gives 4 and rounding it down gives 3; the probability for a score of 3 is P(3) = (4 - 3.21) * 0 = 0, and the probability for a score of 4 is P(4) = (3.21 - 3) * 1 = 0.21. Since P(4) > P(3), the predicted score is classified as 4. The final rating of u₁ for i₃ is therefore 4, and the item is recommended to the user, with N = 1.