CN102750360B

CN102750360B - Mining method of computer data for recommendation systems

Info

Publication number: CN102750360B
Application number: CN201210193229.2A
Authority: CN
Inventors: 王建民; 丁贵广; 龙明盛; 姜晓伟
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2012-06-12
Filing date: 2012-06-12
Publication date: 2014-05-28
Anticipated expiration: 2032-06-12
Also published as: CN102750360A

Abstract

本发明涉及一种用于推荐系统的计算机数据挖掘方法，属于计算机数据处理技术领域。首先在计算机的主服务器中初始化用户偏好矩阵和服务项目偏好矩阵，将用户输入的偏好矩阵的行向量分发给计算机中的多个映射器，各映射器分别计算用户偏好矩阵和服务项目偏好矩阵的梯度方向的子方向，并将计算结果发送给计算机中的化简器，化简器对接收的梯度方向的子方向进行累加，并根据用户偏好矩阵和服务项目偏好矩阵的梯度方向矩阵，对用户偏好矩阵和服务项目偏好矩阵进行更新。本发明方法对已有的PMF算法进行了改进，提高了大规模数据处理能力；采用键值对的数据存储结构储存偏好矩阵，使得占用的储存空间更小，数据读取速度更快。The invention relates to a computer data mining method used in a recommendation system, belonging to the technical field of computer data processing. First, the user preference matrix and service item preference matrix are initialized in the main server of the computer, and the row vectors of the preference matrix input by the user are distributed to multiple mappers in the computer, and each mapper calculates the user preference matrix and service item preference matrix respectively. The sub-direction of the gradient direction, and send the calculation result to the reducer in the computer, the reducer accumulates the received sub-direction of the gradient direction, and according to the gradient direction matrix of the user preference matrix and the service item preference matrix, the user The preference matrix and service item preference matrix are updated. The method of the present invention improves the existing PMF algorithm and improves the large-scale data processing capability; the data storage structure of the key-value pair is used to store the preference matrix, so that the occupied storage space is smaller and the data reading speed is faster.

Description

A Computer Data Mining Method for Recommender System

技术领域 technical field

本发明涉及一种用于推荐系统的计算机数据挖掘方法，属于计算机数据处理技术领域。The invention relates to a computer data mining method used in a recommendation system, belonging to the technical field of computer data processing.

背景技术 Background technique

随着互联网的飞速发展，电子商务规模的不断扩大，商品个数和种类快速增长，顾客需要花费越来越多的时间才能从大量的商品中找到自己想买的商品。这种浏览大量无关信息和商品的过程使得被信息过载问题困扰的消费者不断流失。为了解决这些问题，个性化推荐系统应运而生。个性化推荐系统建立在海量数据挖掘技术的基础之上，通过分析用户的数据，如兴趣和偏好等，提供完全个性化的决策支持和信息服务，提高了电子商务网站的流量转换率。另外，随着互联网信息的爆炸式增长，资讯类网站用户也面临着被海量资讯淹没的困扰。推荐系统在新闻推荐、影视作品推荐和图书推荐等领域的使用提升了用户体验，增加了网站粘性，因此也有较为广泛的应用。With the rapid development of the Internet, the scale of e-commerce continues to expand, and the number and types of commodities increase rapidly, customers need to spend more and more time to find the commodities they want to buy from a large number of commodities. This process of browsing a large amount of irrelevant information and products makes consumers who are plagued by information overload continue to lose. In order to solve these problems, personalized recommendation system came into being. The personalized recommendation system is based on massive data mining technology. By analyzing user data, such as interests and preferences, it provides fully personalized decision support and information services, and improves the traffic conversion rate of e-commerce websites. In addition, with the explosive growth of Internet information, users of information websites are also faced with the trouble of being overwhelmed by massive amounts of information. The use of recommendation systems in the fields of news recommendation, film and television works recommendation, and book recommendation has improved user experience and increased website stickiness, so it has a wider range of applications.

常见的推荐系统算法有基于关联规则（Association Rules）的推荐、基于内容的推荐和协同过滤（Collaborative Filtering，CF）三种。其中协同过滤技术在推荐系统中应用最为广泛。但仍然有许多问题亟待解决。整体而言，数据稀疏性（Sparsity）问题和系统延展性（Scalability）问题是所有协同过滤算法所面临的共同挑战。Common recommendation system algorithms include association rule-based recommendation, content-based recommendation, and collaborative filtering (Collaborative Filtering, CF). Among them, collaborative filtering technology is the most widely used in recommender systems. But there are still many problems to be solved. Overall, data sparsity and system scalability are common challenges faced by all collaborative filtering algorithms.

已有技术中，“R.Salakhutdinov and A.Mnih,“Probabilistic MatrixFactorization”,Advances in Neural Information Processing Systems 20(NIPS’07),pp.1257–1264,2008”，公开了一种用于计算机数据挖掘的概率矩阵分解算法（以下简称PMF，Probabilistic Matrix Factorization，）这是近年来提出的基于贝叶斯理论的矩阵分解算法。PMF模型可用于基于评分（ratings）的协同过滤，不但具有良好的系统扩展性（与用户增长成线性比关系），而且在稀疏数据中表现优秀。作为一种线性因子模型，PMF模型中用户的偏好可以表示为由用户对服务项目的偏好因子的线性组合形成的向量。用户对N×M的偏好矩阵R可表示为一个N×D的用户偏好矩阵U^T和一个D×M的商品因子矩阵V的乘积，即R=U^TV。训练这样一个模型等同于在给定损失函数的条件下寻找目标矩阵R的最佳D维近似。其次，PMF是基于贝叶斯理论的矩阵分解算法。通过给U、V矩阵赋予合理的先验概率分布，以及根据R≈U^TV关系得到R矩阵的似然函数，可以计算参数矩阵U、V的后验概率分布，最后使用最大后验概率估计法求得U、V矩阵。In the prior art, "R.Salakhutdinov and A.Mnih, "Probabilistic Matrix Factorization", Advances in Neural Information Processing Systems 20 (NIPS'07), pp.1257–1264, 2008", discloses a method for computer data mining The Probabilistic Matrix Factorization algorithm (hereinafter referred to as PMF, Probabilistic Matrix Factorization,) is a matrix factorization algorithm based on Bayesian theory proposed in recent years. The PMF model can be used for collaborative filtering based on ratings. It not only has good system scalability (linearly proportional to user growth), but also performs well in sparse data. As a linear factor model, the user's preference in the PMF model can be expressed as a vector formed by the linear combination of the user's preference factors for service items. The user's preference matrix R for N×M can be expressed as the product of an N×D user preference matrix U ^T and a D×M commodity factor matrix V, that is, R= ^UT V. Training such a model is equivalent to finding the best D-dimensional approximation of the target matrix R given a loss function. Second, PMF is a matrix factorization algorithm based on Bayesian theory. By assigning a reasonable prior probability distribution to the U and V matrices, and obtaining the likelihood function of the R matrix according to the R≈U ^T V relationship, the posterior probability distribution of the parameter matrices U and V can be calculated, and finally estimated using the maximum posterior probability The U and V matrices can be obtained by this method.

举例说明：假设有N个用户和M个电影，用户对电影的评分值为从1到K的整数。R_ij表示用户i对电影j的评分，U∈R^D×N和V∈R^D×M是用户和电影的潜在特征矩阵。定义可观察的所有评分数据R相对于潜在特征U、V的似然函数为：For example: Assume that there are N users and M movies, and the ratings of the movies by users are integers from 1 to K. R _ij represents the rating of movie j by user i, and U ∈ R ^D×N and V ∈ R ^D×M are the latent feature matrices of users and movies. Define the likelihood function of all observable scoring data R relative to latent features U, V as:

$p p = = ((R R | | U u,, V V,, {σ σ}^{22})) = = {Π Π}_{i i = = 11}^{N N} {Π Π}_{j j = = 11}^{M m} {[[N N (({R R}_{ij ij} | | {U u}_{i i}^{T T} {V V}_{j j},, {σ σ}^{22}))]]}^{{I I}_{ij ij}}$

公式1Formula 1

其中N(x|μ，σ²)表示均值为μ、方差为σ²的高斯分布的概率密度函数。I_ij是表示用户i是否已对电影j评分的示性函数。假设用户特征矩阵U和电影特征矩阵V的先验分布满足均值为0的高斯分布：Where N(x|μ, σ ² ) represents the probability density function of a Gaussian distribution with mean μ and variance σ ² . I _ij is an indicative function denoting whether user i has rated movie j. Assume that the prior distribution of user feature matrix U and movie feature matrix V satisfies a Gaussian distribution with a mean of 0:

$p p ((U u | | {σ σ}_{U u}^{22})) = = {Π Π}_{i i = = 11}^{N N} N N (({U u}_{i i} | | 00,, {σ σ}_{U u}^{22} I I)),, p p ((V V | | {σ σ}_{V V}^{22})) = = {Σ Σ}_{j j = = 11}^{M m} N N (({V V}_{j j} | | 00,, {σ σ}_{V V}^{22} I I))$

公式2Formula 2

则可求出U、V矩阵的后验概率分布的对数表达式为：Then the logarithmic expression of the posterior probability distribution of the U and V matrices can be obtained as:

$ln ln p p ((U u,, V V | | R R,, {σ σ}^{22},, {σ σ}_{V V}^{22},, {σ σ}_{U u}^{22})) = = - - \frac{11}{22 {σ σ}^{22}} {Σ Σ}_{i i = = 11}^{N N} {Σ Σ}_{j j = = 11}^{M m} {I I}_{ij ij} {(({R R}_{ij ij} - - {U u}_{i i}^{T T} {V V}_{j j}))}^{22} - - \frac{11}{22 {σ σ}_{U u}^{22}} {Σ Σ}_{i i = = 11}^{N N} {U u}_{i i}^{T T} {U u}_{i i} - - \frac{11}{22 {σ σ}_{V V}^{22}} {Σ Σ}_{j j = = 11}^{M m} {V V}_{j j}^{T T} {V V}_{j j} + + C C$

公式3Formula 3

最大化后验概率等价于以下函数的最优化问题：Maximizing the posterior probability is equivalent to the optimization problem of the following function:

$E E. = = \frac{11}{22} {Σ Σ}_{i i = = 11}^{N N} {Σ Σ}_{j j = = 11}^{M m} {I I}_{ij ij} {(({R R}_{ij ij} - - {U u}_{i i}^{T T} {V V}_{j j}))}^{22} + + \frac{{λ λ}_{U u}}{22} {Σ Σ}_{i i = = 11}^{N N} {| | | | {U u}_{i i} | | | |}_{Fro Fro}^{22} + + \frac{{λ λ}_{V V}}{22} {Σ Σ}_{j j = = 11}^{M m} {| | | | {V V}_{j j} | | | |}_{Fro Fro}^{22}$

公式4Formula 4

此处λ_U=σ²/σ_U ²,λ_V=σ²/σ_V ²，取决于用户输入。‖·‖_Fro表示弗罗宾尼斯范数。Here λ _U =σ ² /σ _U ² , λ _V =σ ² /σ _V ² , depending on user input. ‖·‖ _Fro means the Frobenius norm.

这样，PMF模型参数的推导就简化为常见的无约束最优化问题，通过梯度下降法，可以很容易的求出U、V矩阵。In this way, the derivation of the parameters of the PMF model is simplified to a common unconstrained optimization problem, and the U and V matrices can be easily obtained through the gradient descent method.

随着使用推荐系统的网站规模的不断扩大，单机环境下的PMF算法对服务器的存储能力和计算能力的要求越来越高，已经无法满足现实需要。With the continuous expansion of the website scale using the recommendation system, the PMF algorithm in the stand-alone environment has higher and higher requirements for the storage capacity and computing power of the server, which has been unable to meet the actual needs.

单机环境下的PMF算法需要把评分矩阵R整个读取到内存中进行计算，用梯度下降法进行迭代求解时，在判断求得的U、V矩阵是否已满足误差要求，需要计算U和V的乘积矩阵。这说明PMF算法对内存的要求至少应该可以放下两个R矩阵。这在数据规模不大时是可行的。而在现实应用场景中，大型商业网站的用户特征矩阵和商品特征矩阵的维度N、M往往是百万、甚至千万量级。此时，普通的服务器的内存已经不能满足PMF算法的对内存的要求。The PMF algorithm in a stand-alone environment needs to read the entire scoring matrix R into the memory for calculation. When using the gradient descent method for iterative solution, it is necessary to calculate the U and V matrix when judging whether the obtained U and V matrices meet the error requirements. product matrix. This shows that the memory requirements of the PMF algorithm should be able to accommodate at least two R matrices. This is feasible when the data size is not large. In actual application scenarios, the dimensions N and M of the user feature matrix and product feature matrix of large commercial websites are often on the order of millions or even tens of millions. At this point, the memory of a common server can no longer meet the memory requirements of the PMF algorithm.

另外，PMF是一种迭代算法，而且每次迭代都需要进行矩阵乘法来检验是否达到迭代边界，所以计算的时间复杂度很高。设梯度下降的迭代次数为k次，则PMF算法的时间复杂度为O(kNMD)。在用户、商品特征矩阵变得较大时，单机环境下的PMF算法需要的时间太长。In addition, PMF is an iterative algorithm, and each iteration requires matrix multiplication to check whether the iteration boundary is reached, so the time complexity of the calculation is very high. Assuming that the number of iterations of gradient descent is k times, the time complexity of the PMF algorithm is O(kNMD). When the user and product feature matrices become larger, the PMF algorithm in a stand-alone environment takes too long.

发明内容 Contents of the invention

本发明的目的是提出一种用于推荐系统的计算机数据挖掘方法，考虑到偏好矩阵R的规模很大，将R矩阵放置于外存，并考虑到R矩阵规模很大时，R矩阵通常是稀疏阵，将输入的R矩阵处理后，采用更为紧凑的数据结构来存储R矩阵，以缩短计算时间，增大数据的处理规模。The purpose of the present invention is to propose a computer data mining method for recommendation systems. Considering that the scale of the preference matrix R is very large, the R matrix is placed in the external memory, and considering that the scale of the R matrix is very large, the R matrix is usually Sparse array, after processing the input R matrix, adopts a more compact data structure to store the R matrix, so as to shorten the calculation time and increase the scale of data processing.

本发明提出的用于推荐系统的计算机数据挖掘方法，包括以下步骤：The computer data mining method that the present invention proposes for recommendation system comprises the following steps:

（1）设定一个N×M的偏好矩阵R，其中N为偏好矩阵R的行数，N等于用户个数，M为偏好矩阵R的列数，M等于为用户服务的项目个数；(1) Set an N×M preference matrix R, where N is the number of rows of the preference matrix R, N is equal to the number of users, M is the number of columns of the preference matrix R, and M is equal to the number of items serving users;

（2）向计算机输入文件，将输入文件转换成映射化简模型中的序列文件，使序列文件中的每一行为偏好矩阵R的一个行向量，偏好矩阵R的每一行的数据结构为：行向量下标和键值对数组组成，其中键值对数组包括服务项目编号和用户对该服务项目的偏好；(2) Input the file to the computer, and convert the input file into a sequence file in the mapping simplified model, so that each row in the sequence file is a row vector of the preference matrix R, and the data structure of each row of the preference matrix R is: row Composed of vector subscripts and key-value pair arrays, where the key-value pair array includes the service item number and the user's preference for the service item;

（3）将偏好矩阵R表示为R=U^TV，其中U^T为N×D的用户偏好矩阵的转置，N等于用户个数，D为用户服务项目偏好因子个数，V为D×M的服务项目偏好矩阵，M为服务项目个数；(3) Express the preference matrix R as R=U ^T V, where U ^T is the transposition of the N×D user preference matrix, N is equal to the number of users, D is the number of user service item preference factors, and V is D×D M's service item preference matrix, where M is the number of service items;

（4）在计算机的主服务器中生成用户偏好矩阵U和服务项目偏好矩阵V，其中用户偏好矩阵U的行为用户编号，列为用户偏好因子，初始化时用户偏好因子为任意实数，服务项目偏好矩阵V的行为服务项目编号，列为服务项目偏好因子，并设初始化时服务项目偏好因子为任意实数；(4) Generate a user preference matrix U and a service item preference matrix V in the main server of the computer, where the behavior user number of the user preference matrix U is listed as the user preference factor, and the user preference factor is any real number during initialization, and the service item preference matrix V's behavioral service item number is listed as the service item preference factor, and the service item preference factor is set to be any real number during initialization;

（5）将上述偏好矩阵R的行向量分发给计算机中的多个映射器，各映射器根据读取的偏好矩阵R的行向量，分别根据公式：(5) Distribute the row vectors of the above preference matrix R to multiple mappers in the computer, and each mapper according to the read row vectors of the preference matrix R, respectively according to the formula:

$&dtri; U_{ik} = λ_{U} U_{ik} + Σ_{j = 1}^{M} I_{ij} V_{j} (R_{ij} - g (U_{i}^{T} V_{j})) g^{'} (U_{i}^{T} V_{j}),$ 计算用户偏好矩阵U中每个元素的梯度方向▽U_ik， $&dtri; u_{ik} = λ_{u} u_{ik} + Σ_{j = 1}^{m} I_{ij} V_{j} (R_{ij} - g (u_{i}^{T} V_{j})) g^{'} (u_{i}^{T} V_{j}),$ Calculate the gradient direction ▽U _ik of each element in the user preference matrix U,

根据公式： $Δ (&dtri; V_{ik}) = I_{ij} U_{i} (R_{ij} - g (U_{i}^{T} V_{j})) g^{'} (U_{i}^{T} V_{j}),$ 计算服务项目偏好矩阵V中每个元素的梯度方向的子方向Δ(▽V_ik)，According to the formula: $Δ (&dtri; V_{ik}) = I_{ij} u_{i} (R_{ij} - g (u_{i}^{T} V_{j})) g^{'} (u_{i}^{T} V_{j}),$ Calculate the sub-direction Δ(▽V _ik ) of the gradient direction of each element in the service item preference matrix V,

其中,

表示用户偏好矩阵U的转置的第i个行向量，V_j表示服务项目偏好矩阵V的第j个行向量，λ_U是用户指定的用户偏好程度参数，λ_U为正实数，U_ik为用户偏好矩阵U的第i行、第k列的元素，I为指示函数矩阵，若I_ij等于0，则表示用户i未对服务项目j产生偏好，若I_ij等于1，则表示用户i对服务项目j产生偏好，g是罗杰斯特函数，g’是g函数的一阶导函数：in,

Represents the i-th row vector of the transposition of the user preference matrix U, V _j represents the j-th row vector of the service item preference matrix V, λ _U is the user preference degree parameter specified by the user, λ _U is a positive real number, U _ik is The element in row i and column k of user preference matrix U, I is an indicator function matrix, if I _ij is equal to 0, it means that user i has no preference for service item j, if I _ij is equal to 1, it means that user i has no preference for service item j. Service item j produces preference, g is the Rodgers function, and g' is the first derivative of the g function:

$g g ((x x)) = = \frac{11}{11 + + {e e}^{- - x x}}$

各映射器将梯度方向▽U_ik和梯度方向子方向Δ(▽V_ik)分别发送给计算机中的化简器；Each mapper sends the gradient direction ▽U _ik and the gradient direction sub-direction Δ(▽V _ik ) to the reducer in the computer respectively;

（6）化简器根据接收的服务项目偏好矩阵V中每个元素的梯度方向的子方向Δ(▽V_ik)进行累加，得到矩阵V的每个元素的梯度方向▽V_ik，

其中λ_V是用户指定的服务项目偏好程度参数，λ_V是正实数；(6) The reducer accumulates according to the subdirection Δ(▽V _ik ) of the gradient direction of each element in the received service item preference matrix V, and obtains the gradient direction ▽V _ik of each element of the matrix V,

Among them, λ _V is the service item preference degree parameter specified by the user, and λ _V is a positive real number;

（7）化简器构建一个用户偏好矩阵U的梯度方向矩阵▽U，梯度方向矩阵▽U中第i行、第k列的值为步骤（5）计算得到的▽U_ik，构建一个服务项目偏好矩阵V的梯度方向矩阵▽V，梯度方向矩阵▽V中第i行、第k列的值为步骤（5）计算得到的▽V_ik；(7) The simplifier builds a gradient direction matrix ▽U of the user preference matrix U, and the value of the i-th row and the k-th column in the gradient direction matrix ▽U is ▽U _ik calculated in step (5), constructing a service item The gradient direction matrix ▽V of the preference matrix V, the value of the i-th row and the k-th column in the gradient direction matrix ▽V is ▽V _ik calculated in step (5);

并根据用户偏好矩阵U的梯度方向矩阵▽U和服务项目偏好矩阵V的梯度方向矩阵▽V，对用户偏好矩阵U和服务项目偏好矩阵V进行更新，使：And according to the gradient direction matrix ▽U of the user preference matrix U and the gradient direction matrix ▽V of the service item preference matrix V, the user preference matrix U and the service item preference matrix V are updated, so that:

$\begin{matrix} U = U - &dtri; U \\ V = V - &dtri; V \end{matrix},$ 完成一次迭代； $\begin{matrix} u = u - &dtri; u \\ V = V - &dtri; V \end{matrix},$ Complete an iteration;

（8）用户设定一个最大迭代次数，若迭代次数大于或等于最大迭代次数，则得到用户偏好矩阵U和服务项目偏好矩阵V，结束计算；若迭代次数小于最大迭代次数，则重复步骤（3）～步骤（8）。(8) The user sets a maximum number of iterations. If the number of iterations is greater than or equal to the maximum number of iterations, the user preference matrix U and service item preference matrix V are obtained, and the calculation ends; if the number of iterations is less than the maximum number of iterations, repeat the step (3 ) ~ step (8).

本发明提出的用于推荐系统的计算机数据挖掘方法，其优点是：The computer data mining method that the present invention proposes is used for recommendation system, and its advantage is:

1、本发明方法对已有的PMF算法进行了改进，提高了大规模数据处理能力。本发明方法将偏好矩阵R存储于外存的文件中，基于映射化简(MapReduce)框架对数据进行分布式并发计算，对计算系统中单个计算机的计算能力和存储能力的要求低，有效适应大规模矩阵的概率分解。1. The method of the present invention improves the existing PMF algorithm and improves the large-scale data processing capability. The method of the present invention stores the preference matrix R in the file of the external storage, and performs distributed concurrent calculation on the data based on the MapReduce framework, which has low requirements on the computing power and storage capacity of a single computer in the computing system, and effectively adapts to large-scale Probabilistic decomposition of scale matrices.

2、本发明方法中，采用键值对的数据存储结构储存偏好矩阵R，使得R占用的储存空间更小，数据读取速度更快。2. In the method of the present invention, a key-value pair data storage structure is used to store the preference matrix R, so that the storage space occupied by R is smaller and the data reading speed is faster.

3、本发明方法对用户偏好矩阵U、服务项目偏好矩阵V的梯度方向采用并发计算方法，在梯度下降法的每一次迭代中，R的每一个行向量只需要被读取一次，即可并发地计算出U、V的梯度方向，所以对R矩阵的读取次数少，数据挖掘速度快。3. The method of the present invention adopts a concurrent calculation method for the gradient direction of the user preference matrix U and the service item preference matrix V. In each iteration of the gradient descent method, each row vector of R only needs to be read once to be concurrent. The gradient direction of U and V can be accurately calculated, so the number of readings to the R matrix is small, and the data mining speed is fast.

具体实施方式 Detailed ways

偏好矩阵R的每一行的数据结构如下表所示，其中用户i一共对m个服务项目产生偏好：The data structure of each row of the preference matrix R is shown in the following table, where user i has a total preference for m service items:

上表中，对偏好矩阵R的第i行R_i，首先存储行向量下标i。然后对于用户i偏好的每一个服务项目j，服务项目在偏好矩阵R中的列下标即为键值对中的键K，用户i对服务项目j的偏好即为对应的值V。In the above table, for the i-th row R _i of the preference matrix R, the row vector subscript i is stored first. Then, for each service item j preferred by user i, the subscript of the service item in the preference matrix R is the key K in the key-value pair, and user i's preference for service item j is the corresponding value V.

其中,

$g g ((x x)) = = \frac{11}{11 + + {e e}^{- - x x}}$

各映射器将梯度方向▽U_ik和梯度方向子方向Δ(▽V_ik)分别发送给计算机中的化简器；（6）化简器根据接收的服务项目偏好矩阵V中每个元素的梯度方向的子方向Δ(▽V_ik)进行累加，得到矩阵V的每个元素的梯度方向▽V_ik，

其中λ_V是用户指定的服务项目偏好程度参数，λ_V是正实数；Each mapper sends the gradient direction ▽U _ik and the gradient direction sub-direction Δ(▽V _ik ) to the reducer in the computer respectively; (6) The reducer receives the gradient of each element in the service item preference matrix V The sub-direction Δ(▽V _ik ) of the direction is accumulated to obtain the gradient direction ▽V _ik of each element of the matrix V,

Claims

1. A computer data mining method for recommendation system, characterized in that the method comprises the following steps:

(1) Set an N×M preference matrix R, where N is the number of rows of the preference matrix R, N is equal to the number of users, M is the number of columns of the preference matrix R, and M is equal to the number of items serving users;

(2) Input the file to the computer, and convert the input file into a sequence file in the mapping simplified model, so that each row in the sequence file is a row vector of the preference matrix R, and the data structure of each row of the preference matrix R is: row Composed of vector subscripts and key-value pair arrays, where the key-value pair array includes the service item number and the user's preference for the service item;

(3) Express the preference matrix R as R= ^UT V, where U ^T is the transposition of the N×D user preference matrix, N is equal to the number of users, D is the number of user service item preference factors, and V is D×D M's service item preference matrix, where M is the number of service items;

(4) Generate a user preference matrix U and a service item preference matrix V in the main server of the computer, where the behavior user number of the user preference matrix U is listed as the user preference factor, and the user preference factor is any real number during initialization, and the service item preference matrix V's behavioral service item number is listed as the service item preference factor, and the service item preference factor is set to be any real number during initialization;

(5) Distribute the row vectors of the above preference matrix R to multiple mappers in the computer, and each mapper according to the read row vectors of the preference matrix R, respectively according to the formula:

{&dtri; u}_{ik} = λ_{u} u_{ik} + Σ_{j = 1}^{m} I_{ij} V_{j} (R_{ij} - g (u_{i}^{T} V_{j})) g^{'} (u_{i}^{T} V_{j}),

Calculate the gradient direction ▽U _ik of each element in the user preference matrix U,

According to the formula:

Δ ({&dtri; V}_{ik}) = I_{ij} u_{i} (R_{ij} - g (u_{i}^{T} V_{j})) g^{'} (u_{i}^{T} V_{j}),

Calculate the sub-direction Δ(▽V _ik ) of the gradient direction of each element in the service item preference matrix V,

in,

g g ((x x)) = = \frac{11}{11 + + {e e}^{- - x x}};;

Each mapper sends the gradient direction ▽U _ik and the gradient direction sub-direction Δ(▽V _ik ) to the reducer in the computer respectively;

(6) The reducer accumulates according to the subdirection Δ(▽V _ik ) of the gradient direction of each element in the received service item preference matrix V, and obtains the gradient direction ▽V _ik of each element of the matrix V, Among them, λ _V is the service item preference degree parameter specified by the user, and λ _V is a positive real number;

(7) The simplifier builds a gradient direction matrix ▽U of the user preference matrix U, and the value of the i-th row and the k-th column in the gradient direction matrix ▽U is ▽U _ik calculated in step (5), constructing a service item The gradient direction matrix ▽V of the preference matrix V, the value of the i-th row and the k-th column in the gradient direction matrix ▽V is ▽V _ik calculated in step (5);

And according to the gradient direction matrix ▽U of the user preference matrix U and the gradient direction matrix ▽V of the service item preference matrix V, the user preference matrix U and the service item preference matrix V are updated, so that:

\begin{matrix} u = u - &dtri; u \\ V = V - &dtri; V \end{matrix},

Complete an iteration;

(8) The user sets a maximum number of iterations. If the number of iterations is greater than or equal to the maximum number of iterations, the user preference matrix U and service item preference matrix V are obtained, and the calculation ends; if the number of iterations is less than the maximum number of iterations, repeat the step (3 ) ~ step (8).