CN103902690B

CN103902690B - Method for improving the accuracy of the influence of user-generated content (UGC) information in social networks

Info

Publication number: CN103902690B
Application number: CN201410119194.7A
Authority: CN
Inventors: 李蕾; 林鑫; 王博远
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2014-03-27
Filing date: 2014-03-27
Publication date: 2017-03-22
Anticipated expiration: 2034-03-27
Also published as: CN103902690A

Abstract

The invention discloses a method for improving the accuracy of the influence of content information generated by social network users. The user-generated content (UGC) contains M keywords, and N users in total participate in the UGC. The method includes: establishing a member participation mechanism for social network UGC; constructing an unweighted directed graph of the fan network from the user-fan relationships of the UGC and dividing it into communities; constructing a weighted undirected graph of the interest network from the user reply relationships of the UGC and dividing it into communities; calculating the social influence U_X of user X from the degree of correlation among the influencing factors of the member participation mechanism; calculating the social influence S_KX of user X publishing keyword K, where m is the number of times keyword K propagates at user X and S_KX = 0 if m = 0; calculating the comprehensive social influence of keyword K in the UGC; and summing the comprehensive social influences of the M keywords in the UGC to obtain the information influence INF of the UGC.

Description

A method for improving the accuracy of the influence of content information generated by social network users

Technical field

The invention relates to information monitoring technology, and in particular to a method for improving the accuracy of the influence of content information generated by social network users.

Background art

The Internet has entered the Web 2.0 era, in which every user can express opinions freely. Much important content and news first appears as user-generated content (UGC), then spreads widely through social networks, and eventually has a large influence within a particular social circle or even across society as a whole. Research on UGC influence therefore plays a very important role in information collection, monitoring, and prediction. However, because the volume of UGC is enormous and grows very quickly, it is difficult to process all of it; UGC of good quality and high influence must be screened out for study and use. As a result, research on evaluating the quality and information influence of UGC is receiving increasing attention.

Current research on information influence mainly relies on the Influence Diffusion Model (IDM) and its improved variants, such as the Influence Diffusion Probability Model (IDPM). IDM, which is based on text conversations, uses the reply-chain structure of a conversation and computes the similarity between texts from word frequencies to obtain the influence diffusion capability of the source; the sum of the influence diffused by each reply is the influence diffusion capability of the text. Since it was proposed, this model has become a cornerstone of information influence research, and most later work improves on it. IDPM defines the propagation-probability influence of a single keyword over the entire interest space to address two problems of IDM: structural breaks in influence transmission and the false influence propagation caused by spam posting. It also considers the effective keywords in each sentence to address content breaks in influence transmission in IDM.

However, these models have some obvious defects: for example, every comment or reply carries the same weight, and the relationships between users are not considered. Take a post on a BBS as an example of UGC, as shown in Figure 1:

User 1 is the information publisher, and users 2 to 5 reply to user 1. A, B, C, D, E, and F are the keywords contained in the post. A thick solid line indicates the propagation of the post's influence among users, and its direction is the direction of propagation. A dash-dot line indicates a fan relationship between users, a dashed line indicates that users belong to the same community in the interest network, and a thin solid line indicates that users belong to the same community in the fan network.

In Figure 1, users 2 to 5 all replied to user 1's post. However, user 2 is a fan of user 1; user 3 belongs to the same interest network community as user 1; user 4 belongs to the same fan network community as user 1 (but is not a fan of user 1); and user 5 is a new user who may previously have had almost no relationship with user 1.

This shows that because the IDPM model does not weight the keywords of a UGC individually, its calculation of the UGC's information influence is biased.

Summary of the invention

In view of this, the present invention proposes a method for improving the accuracy of the influence of content information generated by social network users. It effectively remedies the defect in the prior art that the keywords of a UGC are not weighted individually, which biases the calculated information influence of the UGC. The technical solution proposed by the invention is:

A method for improving the accuracy of the influence of content information generated by social network users, the method comprising:

A. Establishing a member participation mechanism for social network UGC and determining the path coefficients between the influencing factors of the mechanism, each path coefficient being the degree of correlation between two influencing factors;

B. Constructing an unweighted directed graph of the fan network from the user-fan relationships of the UGC and dividing it into communities; constructing a weighted undirected graph of the interest network from the user reply relationships of the UGC and dividing it into communities;

C. Calculating the social influence U_X of user X from the degree of correlation between the influencing factors of the member participation mechanism;

D. Calculating the social influence S_KX of user X publishing keyword K from U_X and m, where m is the number of times keyword K propagates at user X; if m = 0, then S_KX = 0;

E. Calculating the comprehensive social influence S_K of keyword K in the UGC according to the formula $S_K = \sum_{X=1}^{N} S_{KX}$;

F. Calculating the sum of the comprehensive social influences of the M keywords in the UGC to obtain the information influence INF of the UGC.

In the above scheme, the member participation mechanism comprises four influencing factors: information quality, group identity, value perception, and participation. The path coefficient between information quality and group identity is a₁, that between information quality and value perception is a₂, that between value perception and group identity is a₃, and that between participation and group identity is a₄.

In the above scheme, step C further comprises:

Calculating the social influence of user X in the UGC according to the formula

$$U_X = \left\{1 + \ln\left[1 + \sum_{i=1}^{b} C_{1,i}\, C_{2,i}\right]\right\} \times f$$

where b is the number of times user X is directly replied to in the UGC, and C_{1,i} and C_{2,i} are the values of C₁ and C₂ for the i-th direct replier; if user X has no direct replier, then U_X = 0;

if user X and the direct replier belong to the same interest network community, C₁ = a₁; otherwise C₁ = 1;

if user X and the direct replier belong to the same fan network community, C₂ = a₂ × a₃; otherwise C₂ = 1;

if user X is a fan of the information publisher of the UGC, f = a₂; otherwise f = 1.

In the above scheme, step F further comprises:

Calculating the information influence of the UGC according to the formula $INF = \sum_{K=1}^{M} S_K$.

In the above scheme, the path coefficient between information quality and group identity is a₁ = 0.333, that between information quality and value perception is a₂ = 0.824, that between value perception and group identity is a₃ = 0.624, and that between participation and group identity is a₄ = 0.437.

In summary, the invention proposes a method for improving the accuracy of the influence of content information generated by social network users. It applies the Sociability-based Influence Diffusion Probability Model (S-IDPM) to calculate UGC information influence, mainly using the users' social networks (the fan network and the interest network) and the reply-chain structure to weight the replies of different users. The keywords of the UGC are thereby weighted individually, which improves the accuracy of the calculated influence.

Brief description of the drawings

Figure 1 is a diagram of the relationships between users and their posts.

Figure 2 is a diagram of the fan network.

Figure 3 is a diagram of the interest network.

Figure 4 shows the user member participation mechanism.

Figure 5 is a structure diagram of the UGC in Method Embodiment 1.

Figure 6 is a flow chart of Method Embodiment 1.

Figure 7 compares the cumulative containment rates of manually labeled featured posts in Method Embodiment 2.

Figure 8 is a class versus feature-value chart for Method Embodiment 2.

Figure 9 compares the cumulative containment rates of machine-labeled featured posts in Method Embodiment 2.

Detailed description

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.

The technical solution of one embodiment of the invention is as follows:

The technical solution adds user factors to the calculation of the influence of user-generated content information, and divides all users of social networks such as BBS, Weibo, and Renren into information publishers and information repliers. An unweighted directed graph of the fan network is built from the fan relationships of the users participating in the UGC: as shown in Figure 2, if user 1 is a fan of user 2, there is an edge pointing from user 1 to user 2. A weighted undirected graph of the interest network is built from the reply relationships of the users participating in the UGC: as shown in Figure 3, if user 1 and user 2 have both taken part in discussions of information posted by 7 information publishers, there is an undirected edge of weight 7 between user 1 and user 2.
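The two graphs described above can be sketched with plain dictionaries. This is only an illustrative sketch: the function names, the input shapes (a list of fan pairs and a list of per-discussion participant sets), and the choice to count one unit of weight per shared discussion are assumptions made for the example and are not taken from the patent.

```python
from collections import defaultdict
from itertools import combinations


def build_fan_graph(fan_pairs):
    """Unweighted directed graph: one edge from each fan to the user they follow."""
    graph = defaultdict(set)
    for fan, followed in fan_pairs:
        graph[fan].add(followed)
    return dict(graph)


def build_interest_graph(discussions):
    """Weighted undirected graph: the weight of (u, v) counts the discussions
    in which both u and v took part (assumed reading of the edge weight)."""
    weights = defaultdict(int)
    for participants in discussions:
        # Sort so that each undirected edge is stored under one key (u < v).
        for u, v in combinations(sorted(set(participants)), 2):
            weights[(u, v)] += 1
    return dict(weights)


# User 1 is a fan of user 2, as in Figure 2.
fan_graph = build_fan_graph([(1, 2)])
# Users 1 and 2 share 7 discussions, as in Figure 3.
interest_graph = build_interest_graph([{1, 2}] * 7)
```

Community division would then run on these structures using whichever existing algorithm is preferred; the patent treats that step as prior art.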

The fan network and the interest network are divided into communities using existing techniques, and the resulting communities are numbered. Users in the same community share the same community number, and users with the same community number (the same fan network community number or the same interest network community number) hold similar values to some extent. User community division is prior art and is not described in detail here. Table 1 lists the fan network communities obtained by dividing the network of Figure 2, and Table 2 gives an example of interest network communities after division. Table 1 shows that users 1 and 3 belong to the same fan network community, and Table 2 shows that users 1 and 2 belong to the same interest network community.

Table 1

User | Fan network community number f
1 | 1
2 | 2
3 | 1
4 | 3
5 | 4
6 | 5

Table 2

User | Interest network community number r
1 | 3
2 | 3
3 | 1
4 | 2
5 | 4
6 | 5

The technical solution constructs a user member participation mechanism, as shown in Figure 4.

Information quality is the information influence of the UGC in the social network; it reflects the stability of the UGC, the correctness, timeliness, and novelty of its information, and the quality of service.

Value perception is represented by the fan relationships between users in the social network. For an information publisher U1, if a fan U2 of U1 replies to U1, the reply is considered to be driven not only by approval of U1 but also, to some degree, by the wish to maintain an interpersonal relationship. In this case it is not only the influence of the information published by U1 that drives U2's participation; the interpersonal relationship between the users also plays a part. In calculating the influence of social network UGC information, the weight of content replied by fans is therefore reduced accordingly.

Group identity represents the effect on UGC information influence of the community division among users and of reply evaluation. Community division is the division of the fan network and the interest network described above. If the replier and the user being replied to belong to the same community (the same fan network community and/or the same interest network community), the corresponding weight is reduced; otherwise it is increased. If the content of a UGC is of high quality or has the potential to produce some influence, most users are willing to participate in it, so reply evaluation becomes an important factor in calculating UGC influence. The more times the keywords in the replies propagate within the UGC, the greater the influence of the UGC.

The path coefficients in the user member participation mechanism of Figure 4 measure the degree of correlation between two variables. They are denoted a₁, a₂, a₃, and a₄, with 0 < a₁ < 1, 0 < a₂ < 1, 0 < a₃ < 1, and 0 < a₄ < 1.

The social influence U_X of a user X participating in the social network UGC is calculated from the degree of correlation between the influencing factors of the member participation mechanism:

$$U_X = \left\{1 + \ln\left[1 + \sum_{i=1}^{b} C_{1,i}\, C_{2,i}\right]\right\} \times f$$

where b is the number of times user X is directly replied to in a UGC, and C_{1,i} and C_{2,i} are the values of C₁ and C₂ for the i-th direct replier. If user X is not directly replied to, U_X = 0.

C₁ indicates whether a replier of user X belongs to the same interest network community as user X. If so, because information quality and group identity are correlated to degree a₁, C₁ = a₁; otherwise C₁ = 1. C₁ = 1 reflects that attracting replies from users in different interest network communities means stronger influence.

C₂ indicates whether a replier of user X belongs to the same fan network community as user X. If so, because information quality and value perception are correlated to degree a₂, and value perception and group identity are correlated to degree a₃, C₂ = a₂ × a₃; otherwise C₂ = 1. C₂ = 1 reflects that attracting replies from users in different fan network communities means stronger influence.

f indicates whether user X is a fan of the information publisher of the UGC. If so, because information quality and value perception are correlated to degree a₂, f = a₂; otherwise f = 1. If user X is a fan of the information publisher, user X's reply is not only an endorsement of the content but is also motivated by maintaining a social relationship, so its weight is reduced.
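The rules for C₁, C₂, and f can be sketched in a few lines of Python. The closed form used here, U_X = {1 + ln[1 + Σ C₁·C₂]} × f, is inferred from the worked calculation of U₁ in Method Embodiment 1 and should be read as an assumption rather than the patent's literal formula. The function name, argument layout, and dictionary inputs are hypothetical; the coefficient values are those of Embodiment 1.

```python
import math

# Path coefficients as assigned in Method Embodiment 1.
A1, A2, A3 = 0.333, 0.824, 0.624


def user_influence(user, repliers, interest, fan, is_fan_of_publisher):
    """U_X = (1 + ln(1 + sum_i C1_i * C2_i)) * f; 0 when nobody replied directly.

    interest / fan map a user id to its interest / fan community number.
    """
    if not repliers:
        return 0.0
    total = 0.0
    for r in repliers:
        c1 = A1 if interest[r] == interest[user] else 1.0   # same interest community
        c2 = A2 * A3 if fan[r] == fan[user] else 1.0        # same fan community
        total += c1 * c2
    f = A2 if is_fan_of_publisher else 1.0                  # replier is the publisher's fan
    return (1.0 + math.log(1.0 + total)) * f


# Community numbers r and f from Method Embodiment 1.
interest = {1: 1, 2: 1, 3: 2, 4: 3}
fan = {1: 1, 2: 2, 3: 1, 4: 3}

u1 = user_influence(1, [2, 3], interest, fan, is_fan_of_publisher=False)
u2 = user_influence(2, [4], interest, fan, is_fan_of_publisher=True)
u3 = user_influence(3, [], interest, fan, is_fan_of_publisher=False)
```

Rounded to three decimals, `u1` and `u2` agree with the values 1.614 and 1.395 reported in Method Embodiment 1, which is the main reason to believe the inferred form is the intended one.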

Once the social influence U_X of user X in a UGC has been determined, the social influence S_KX of keyword K published by user X can be determined in turn from U_X and m.

Here m is the number of times keyword K propagates at user X, that is, the number of direct repliers of user X who also post keyword K; if m = 0, then S_KX = 0.
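The propagation count m is fully specified by the definition above, so it can be sketched independently of the formula for S_KX. In this sketch the function name and the two dictionaries are hypothetical; the data are the reply structure and keywords of Method Embodiment 1.

```python
def propagation_count(user, keyword, direct_repliers, keywords_of):
    """m: how many direct repliers of `user` also posted `keyword`."""
    return sum(
        1 for replier in direct_repliers.get(user, ())
        if keyword in keywords_of[replier]
    )


# Reply structure and keywords from Method Embodiment 1.
direct_repliers = {1: [2, 3], 2: [4]}
keywords_of = {
    1: {"A", "B", "C"},
    2: {"A", "C", "D"},
    3: {"B", "F"},
    4: {"C", "F"},
}

m_c1 = propagation_count(1, "C", direct_repliers, keywords_of)
m_c2 = propagation_count(2, "C", direct_repliers, keywords_of)
m_c3 = propagation_count(3, "C", direct_repliers, keywords_of)
```

For keyword C this gives one propagation at user 1 and one at user 2, and none at user 3, matching the counts used in step 602 of the embodiment.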

The comprehensive social influence of keyword K in the whole UGC is the sum of the social influences of keyword K as published by all users of the UGC (the information publisher and the repliers):

$$S_K = \sum_{X=1}^{N} S_{KX}$$

where N is the number of users participating in the UGC (the number of information publishers plus the number of repliers).

The information influence of the UGC is the sum of the comprehensive social influences of all keywords in the UGC:

$$INF = \sum_{K=1}^{M} S_K$$

where M is the number of keywords contained in the UGC.
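The two sums can be sketched as follows, taking the per-user values S_KX as given inputs, since this text does not state how S_KX depends on m beyond S_KX = 0 for m = 0. The function names and the `(keyword, user)` dictionary layout are hypothetical; the sample values are those stated for keyword C in Method Embodiment 1.

```python
def keyword_influence(s_kx, keyword):
    """Comprehensive influence of one keyword: sum of S_KX over all users."""
    return sum(value for (k, _user), value in s_kx.items() if k == keyword)


def ugc_influence(s_kx):
    """INF: sum of the comprehensive influences of all keywords."""
    keywords = {k for (k, _user) in s_kx}
    return sum(keyword_influence(s_kx, k) for k in keywords)


# Keyword C in Method Embodiment 1: S_C1 = U_1, S_C2 = U_2, S_C3 = S_C4 = 0.
s_kx = {("C", 1): 1.614, ("C", 2): 1.395, ("C", 3): 0.0, ("C", 4): 0.0}
s_c = keyword_influence(s_kx, "C")
inf_c_only = ugc_influence(s_kx)
```

With only keyword C in the table, both results are approximately 3.009; a full calculation would add the entries for the remaining keywords to `s_kx`.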

The technical solution is further described below by way of embodiments.

Method Embodiment 1

Figure 5 is a structure diagram of a UGC in this embodiment. As shown in Figure 5, the UGC involves four users: user 1, user 2, user 3, and user 4. User 1 is the information publisher and posts keywords A, B, and C. Users 2 and 3 each reply directly to user 1; user 2 posts keywords A, C, and D, and user 3 posts keywords B and F. User 4 replies directly to user 2 and posts keywords C and F. Interest network community numbers are denoted r: r₁ = 1, r₂ = 1, r₃ = 2, r₄ = 3. Fan network community numbers are denoted f: f₁ = 1, f₂ = 2, f₃ = 1, f₄ = 3. Users 2 and 4 are fans of the information publisher, user 1. In this embodiment the path coefficients between the factors of the member participation mechanism are set to a₁ = 0.333, a₂ = 0.824, a₃ = 0.624, and a₄ = 0.437. Figure 6 is the flow chart of this embodiment, which comprises the following steps:

Step 601: Calculate the social influence of each user in the UGC.

The social influence of each user is calculated according to the formula for U_X. The calculation for user 1 is given as an example; users 2 to 4 are handled in the same way and are not described again.

User 1 has 2 direct repliers, so b = 2. User 2 is in the same interest network community as user 1 (r₂ = r₁ = 1) but in a different fan network community (f₂ = 2, f₁ = 1), so C₁ = a₁ = 0.333 and C₂ = 1. User 3 is in a different interest network community from user 1 (r₃ = 2, r₁ = 1) but in the same fan network community (f₃ = f₁ = 1), so C₁ = 1 and C₂ = a₂ × a₃ = 0.514. User 1 is not a fan of itself, so f = 1. Therefore

U₁ = {1 + ln[1 + a₁ × 1 + 1 × a₂ × a₃]} × 1 = 1.614

Similarly, U₂ = 1.395, U₃ = 0, and U₄ = 0.
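As a check on U₂, the same expression used for U₁ can be worked through by hand. The general form is inferred from the U₁ line rather than quoted from the patent. User 2 has a single direct replier, user 4, who is in a different interest community (r₄ = 3, r₂ = 1) and a different fan community (f₄ = 3, f₂ = 2), so C₁ = C₂ = 1; user 2 is a fan of the publisher, so f = a₂.

```latex
\begin{aligned}
U_2 &= \{1 + \ln[1 + C_1 \times C_2]\} \times f \\
    &= \{1 + \ln[1 + 1 \times 1]\} \times a_2 \\
    &= (1 + 0.693) \times 0.824 \\
    &\approx 1.395
\end{aligned}
```

Users 3 and 4 have no direct repliers, which gives U₃ = U₄ = 0 directly from the rule for b.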

Step 602: Calculate the comprehensive social influence of each keyword in the UGC.

The calculation for keyword C is given as an example; the comprehensive social influences of keywords A, B, D, and F are calculated in the same way and are not described again.

Keyword C is posted by users 1, 2, and 4. For user 1, keyword C propagates only once (user 2, a direct replier of user 1, posts keyword C), so S_C1 = U₁. For user 2, keyword C also propagates only once (user 4, the direct replier of user 2, posts keyword C), so S_C2 = U₂. For users 3 and 4, keyword C does not propagate (neither has a direct replier who posts keyword C), so S_C3 = 0 and S_C4 = 0. The social influence of keyword C is therefore S_C1 + S_C2 + S_C3 + S_C4 = U₁ + U₂ = 1.614 + 1.395 = 3.009.

The comprehensive social influences of the other keywords are obtained in the same way.

Step 603: Calculate the information influence of the UGC.

The information influence of the UGC is calculated as the sum of the comprehensive social influences of all its keywords.

Method Embodiment 2

Method Embodiment 1 used a UGC with few participating users to show how the technical solution calculates the influence of social network UGC information. This embodiment further illustrates the solution using 2012 user and post data from the Zatan (miscellany) board of the Tianya forum.

The user data cover 181,841 users: their IDs, the IDs of their fans, the IDs of the posts they published on the board, and the IDs of the posts they replied to on the board. The post data cover 43,609 posts: the post ID and, for each floor (reply position) in the post, its sequence number, author ID, and content. By checking whether a post carries the forum administrators' featured mark, 827 posts were selected as the manually labeled featured set, and the remaining posts were treated as the non-featured set. Because the data set is very large, 9,173 posts were drawn at random from the non-featured set and mixed with the 827 manually labeled featured posts to form a sample of 10,000 posts, which was used to compare, analyze, and evaluate S-IDPM, IDM, and IDPM. In addition, the posts were machine-labeled with a clustering method based on statistical information, yielding a machine-labeled featured set on which S-IDPM, IDM, and IDPM were likewise compared and analyzed.

Table 3 gives the comparative results of the three methods. Because the number of posts is large, only the top 5 posts are listed.

Table 3

Table 3 shows that IDM and S-IDPM differ mainly on posts 2894103 and 2366245. Inspection of the corpus shows that post 2894103 is an advertisement-solicitation post: its author published an advertisement template, and all users had to reply in that fixed format. Because IDM computes influence mainly from co-occurring words, this post scores very high under IDM. However, the title shows that the post attracted a group of users who like cars, and the computed user interest network shows that many of these users had previously replied to the same posts. They are therefore in the same interest network community and share the same interest network community number, which indicates that the post spread only within a small circle. Its rank under S-IDPM is accordingly not very high, and it does not reach the top 5. Post 2366245, by contrast, attracted wide attention, with 1476757 floors of replies in total. Its users show no obvious large fan or interest network and are spread across scattered circles, which indicates that the post drew wide attention from many user groups in the community. It therefore ranks high in post influence under S-IDPM.

IDPM and S-IDPM differ mainly on posts 2510082 and 2713599. Inspection of the corpus shows that the author of post 2713599 uses the user name "我是日系车主" ("I am a Japanese car owner"), which closely resembles the post's title. The author's user page shows no fans and no followed users, no replies to any post, and only this one post. This strongly suggests that the account is a sock puppet with no social or interest network ties to anyone. The post drew 14,944 floors of replies. Post 2510082 drew 28,211 floors, more than post 2713599, but the fan and interest networks of the users replying to post 2713599 are more scattered, so post 2713599 ranks higher in influence under S-IDPM.

The qualitative analysis above shows that S-IDPM can, to some extent, handle problems that IDM and IDPM do not take into account.

Next, the effectiveness of the three methods in calculating post influence is analyzed quantitatively.

First, taking the manually labeled featured posts (labeled by forum administrators) as the standard, this embodiment compares the cumulative containment rates of featured posts under IDM, IDPM, and S-IDPM, as shown in Figure 7. Figure 7 shows that S-IDPM is the fastest to reach a cumulative containment rate of about 70%, doing so within the top 3,000 posts. In other words, in the S-IDPM influence ranking the top 30% of posts cover 70% of the featured posts, and at the top 10% and 20% its rate is also higher than those of IDM and IDPM. The influence computed by S-IDPM is therefore better and agrees more closely with the manual labels.
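The cumulative containment rate is not defined formally in the text. One plausible reading, based on the statement that the top 30% of the ranking covers 70% of the featured posts, is the share of all featured posts that fall within the top k positions of a ranking. The sketch below implements that assumed definition; the function name and the toy data are hypothetical.

```python
def cumulative_containment(ranked_ids, featured_ids, k):
    """Share of all featured posts found among the top-k posts of a ranking
    (assumed definition of the cumulative containment rate)."""
    featured = set(featured_ids)
    if not featured:
        return 0.0
    hits = sum(1 for post_id in ranked_ids[:k] if post_id in featured)
    return hits / len(featured)


# Toy ranking of five posts by descending influence; posts 10 and 13 are featured.
ranking = [10, 11, 12, 13, 14]
featured = {10, 13}

rate_top1 = cumulative_containment(ranking, featured, 1)
rate_top4 = cumulative_containment(ranking, featured, 4)
```

Evaluating the function at increasing k for each model's ranking would trace out curves of the kind compared in Figures 7 and 9.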

接下来，将现有技术中基于聚类的意见领袖发现算法利用到本发明的帖子影响力分析中，实现利用统计信息聚类方法发现精品帖的算法。Next, the cluster-based opinion leader discovery algorithm in the prior art is used in the post influence analysis of the present invention, and an algorithm for discovering high-quality posts by using the statistical information clustering method is realized.

选用帖子的楼数F，持续时间T，回复人数P，每小时的回复楼数表示为F/T，平均每楼的词数为W/F，以及非楼主回复数与楼主回复数之差D作为特征值，N表示类的成员数。利用子类数量选取方法和聚类算法（均为现有技术）得到8个类，如图8所示。Choose the floor number F of the post, the duration T, the number of replies P, the number of replies per hour is expressed as F/T, the average number of words per floor is W/F, and the difference between the number of replies from non-hosts and the number of replies from hosts is D As a feature value, N represents the number of members of the class. Eight classes are obtained by using the method for selecting the number of subclasses and the clustering algorithm (both existing technologies), as shown in Figure 8.

将基于聚类的意见领袖发现算法中的筛选条件调整为：类成员数较少，类成员特征值均值较大的类中的成员作为机器标注的精品帖。因此，5号和7号类中的成员作为接下来实验的精品帖，5号和7号类共有1001名成员，其中与论坛管理员标注的827个精品帖只有291篇帖子是相同的，所以，这与图7所示实验不同。接下来利用这1001篇精品帖来比较IDM,IDPM以及S-IDPM精品帖累积含有率对比图，如图9所示：The filter conditions in the clustering-based opinion leader discovery algorithm are adjusted as follows: the members in the class with fewer members and larger average feature value of class members are regarded as high-quality posts marked by machines. Therefore, the members in categories 5 and 7 are the high-quality posts for the next experiment. There are 1,001 members in categories 5 and 7, and only 291 of them are the same as the 827 high-quality posts marked by the forum administrator. Therefore, , which is different from the experiment shown in Fig. 7. Next, use these 1001 high-quality posts to compare the cumulative content ratio of IDM, IDPM and S-IDPM high-quality posts, as shown in Figure 9:

As Figure 9 shows, the cumulative coverage curve of S-IDPM again stays in the middle range throughout, and it also exceeds a cumulative coverage rate of 85% within the top 2,000 posts. This indicates that S-IDPM still computes post influence well when the quality posts are labeled by machine from statistical information.

Finally, the quality-post accuracy of the three algorithms is compared under manual labeling and under machine labeling. As shown in Table 4, S-IDPM achieves a higher quality-post accuracy than the other two models in both cases.

Table 4

| Model | Pt₀ | Pt₁ |
| --- | --- | --- |
| IDM | 28.1% | 68.1% |
| IDPM | 30.2% | 67.3% |
| S-IDPM | 32.4% | 68.4% |

Taking together the comparisons of the cumulative coverage rate of quality posts and of quality-post accuracy under manual and machine labeling, the experimental results above show that S-IDPM computes post influence more accurately than the IDM and IDPM methods.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for improving the accuracy of the influence of content information generated by social network users, applied to user-generated content (UGC) of a social network, the UGC comprising M keywords, with a total of N users participating in the UGC, characterized in that the method comprises:

A. establishing a member participation mechanism for the social network UGC and determining the path coefficients between the influencing factors of the member participation mechanism, the path coefficients being the degrees of correlation between the influencing factors of the member participation mechanism;

B. constructing an unweighted directed graph of the fan network according to the user-fan relationships of the UGC, and performing community division on the unweighted directed graph of the fan network; constructing a weighted undirected graph of the interest network according to the user reply relationships of the UGC, and performing community division on the weighted undirected graph of the interest network;

C. calculating the social influence U_X of user X according to the degrees of correlation between the influencing factors of the member participation mechanism;

D. calculating, according to the formula [formula omitted], the social influence S_KX of user X publishing keyword K, where m is the number of times keyword K is propagated on user X, and S_KX = 0 if m = 0;

E. calculating, according to the formula [formula omitted], the comprehensive social influence of keyword K in the UGC;

F. calculating the sum of the comprehensive social influences of the M keywords in the UGC to obtain the information influence INF of the UGC.

2. The method according to claim 1, characterized in that the member participation mechanism comprises four influencing factors: information quality, group identity, value perception and participation; the path coefficient between information quality and group identity is a₁, the path coefficient between information quality and value perception is a₂, the path coefficient between value perception and group identity is a₃, and the path coefficient between participation and group identity is a₄.

3. The method according to claim 2, characterized in that step C further comprises:

calculating the social influence of user X in the UGC according to the formula [formula omitted],

wherein b is the number of times user X is directly replied to in the UGC, and U_X = 0 if user X has no direct replier;

if a direct replier and user X belong to the same interest network community, C₁ = a₁; otherwise C₁ = 1;

if a direct replier and user X belong to the same fan network community, C₂ = a₂ × a₃; otherwise C₂ = 1;

if user X is a fan of the publisher of the UGC information, f = a₂; otherwise f = 1.

4. The method according to claim 1, characterized in that step F further comprises:

calculating the information influence INF of the UGC according to the formula [formula omitted].

5. The method according to claim 2, characterized in that the path coefficient between information quality and group identity is a₁ = 0.333, the path coefficient between information quality and value perception is a₂ = 0.824, the path coefficient between value perception and group identity is a₃ = 0.624, and the path coefficient between participation and group identity is a₄ = 0.437.
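As an illustration of how the coefficient rules in claim 3 combine with the values in claim 5, the selection of C₁, C₂ and f can be sketched as below. This is only a sketch: the function name `reply_coefficients` and its boolean parameters are hypothetical, and because the formula that combines these coefficients into U_X is not reproduced in this text, the code stops at the coefficients.

```python
# Path coefficients with the values stated in claim 5.
A1, A2, A3, A4 = 0.333, 0.824, 0.624, 0.437


def reply_coefficients(same_interest_community: bool,
                       same_fan_community: bool,
                       is_fan_of_publisher: bool) -> tuple[float, float, float]:
    """Return (C1, C2, f) for one direct replier of user X.

    same_interest_community: replier and user X share an interest network community
    same_fan_community:      replier and user X share a fan network community
    is_fan_of_publisher:     user X is a fan of the publisher of the UGC
    """
    c1 = A1 if same_interest_community else 1.0
    c2 = A2 * A3 if same_fan_community else 1.0
    f = A2 if is_fan_of_publisher else 1.0
    return c1, c2, f


# A replier in both of user X's communities, with X a fan of the publisher.
c1, c2, f = reply_coefficients(True, True, True)
```

A₄ is defined for completeness but is not used by the three rules stated in claim 3.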