CN104281882B

CN104281882B - Method and system for predicting social network information popularity based on user characteristics

Info

Publication number: CN104281882B
Application number: CN201410472689.8A
Authority: CN
Inventors: 李歌; 胡玥; 于延宇; 李丹
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2014-09-16
Filing date: 2014-09-16
Publication date: 2017-09-15
Anticipated expiration: 2034-09-16
Also published as: CN104281882A

Abstract

The invention provides a method for predicting the popularity of social network information based on user characteristics. The method includes: obtaining user data and information data from the social network; extracting a subset of the user attribute features, and the user behavior features, from the user data; classifying the user data according to the user behavior features; obtaining the user propagation features corresponding to the information data according to the information data and the user categories; and obtaining a social network information popularity prediction model according to the user propagation features, then using the model to predict information popularity. The system for predicting the popularity of social network information based on user characteristics provided by the invention includes an acquisition module, a feature extraction module, a classification module, a processing module, a prediction model module and a prediction module. By taking user behavior features into account, the invention predicts information dissemination on social networks more accurately and addresses lagging hotspot discovery and the difficulty of guaranteeing real-time information push and online public opinion monitoring.

Description

Method and system for predicting social network information popularity based on user characteristics

Technical field

The invention relates to the technical field of network security, and in particular to a method and system for predicting the popularity of social network information based on user characteristics.

Background

The Internet has become an important way to obtain information. In particular, the rapid rise of social networking sites has made information easier to obtain and faster to spread. Social networks have formed huge online communities and built close online interpersonal relationships. Information dissemination on social networks differs from traditional dissemination through letters, word of mouth, or newspapers, and has the following prominent features. First, it is strongly real-time: with advances in technology, the sender of a message can easily spread a major event they have witnessed in the shortest possible time. Second, it has a strong group character: publishing on social networks is unconstrained, some people post inflammatory messages for their own purposes, and the wide spread of such messages can trigger group behavior. Third, the information update cycle becomes shorter: because information is published in large volumes and from ever more sources, a message is gradually displaced by newer messages as it spreads.

Predicting the popularity of information dissemination, combined with the characteristics of dissemination on social networks, can effectively solve many problems. Detecting changes in dissemination early and predicting popularity as early as possible has become a major part of real-time information push and social network public opinion monitoring. At present, both information push and public opinion dissemination rely on monitoring: a threshold is set, and when certain parameters of a message exceed it, the message is classified as push information or public opinion information. These methods are relatively crude, and the timeliness of the information is hard to guarantee.

Summary of the invention

To address the defects of the prior art, the present invention provides a method for predicting the popularity of social network information based on user characteristics. By taking user behavior features into account, it predicts information dissemination on social networks more accurately and addresses lagging hotspot discovery and the difficulty of guaranteeing real-time information push and online public opinion monitoring.

In a first aspect, the present invention provides a method for predicting the popularity of social network information based on user characteristics, the method comprising:

acquiring information data in the social network within a preset time and user data corresponding to the information data, the user data including a plurality of user attribute features;

extracting a subset of the user attribute features from the user data, and obtaining user behavior features from the user data;

classifying the user data according to the user attribute features and the user behavior features to obtain the categories of the users in the user data;

obtaining the user propagation features corresponding to the information data according to the information data and the categories of the users in the user data;

determining a prediction model of social network information popularity according to the user propagation features;

using the prediction model to analyze the information data generated within a period of time and predict the popularity of the information.

Preferably, after the step of acquiring the information data in the social network within the preset time and the user data corresponding to the information data, the method further includes:

storing the user data and the information data in a database.

Preferably, acquiring the information data in the social network within the preset time and the user data corresponding to the information data includes:

using a web crawler to obtain the user data and information data of forum-type social networks;

using an application programming interface (API) to obtain the user data and information data of microblog-type social networks;

using a web crawler to obtain the user data of community-type social networks, and using the users' clipboards to obtain the information data of community-type social networks.

Preferably, classifying the user data according to the user attribute features and the user behavior features to obtain the categories of the users in the user data includes:

normalizing the user attribute features and the user behavior features to obtain user features;

classifying the user data with a clustering algorithm according to the user features to obtain the categories of the users in the user data.
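The normalization step can be illustrated with a short sketch. The text does not say which normalization is used, so min-max scaling is an assumption here, and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(rows):
    """Scale every feature column to [0, 1] (assumed scheme: min-max scaling).

    rows: list of equal-length numeric feature vectors, one per user.
    A column whose values are all identical is mapped to 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]
```

Scaling matters for the clustering step that follows, because a Euclidean distance would otherwise be dominated by whichever feature has the largest raw range (for example, login counts versus a coefficient between 0 and 1).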

Preferably, classifying the user data with a clustering algorithm includes:

dividing the user data into two categories and computing the distance between the category centers; if the distance between the category centers is less than a preset value, merging the two categories into one;

continuing to classify the user data into more categories and computing the distances between the category centers, stopping when three categories of user data merge into one category, thereby obtaining the user categories.

Preferably, determining the prediction model of social network information popularity according to the user propagation features includes:

establishing a multivariate linear model based on user characteristics;

using the user propagation information as a training set to train the linear model and obtain the social network information popularity prediction model.

In a second aspect, the present invention provides a system for predicting the popularity of social network information based on user characteristics, the system comprising:

an acquisition module, configured to acquire information data in the social network within a preset time and user data corresponding to the information data, the user data including a plurality of user attribute features;

a feature extraction module, configured to extract a subset of the user attribute features from the user data and to obtain user behavior features from the user data;

a classification module, configured to classify the user data according to the user attribute features and the user behavior features to obtain the categories of the users in the user data;

a processing module, configured to obtain the user propagation features corresponding to the information data according to the information data and the categories of the users in the user data;

a prediction model module, configured to determine a prediction model of social network information popularity according to the user propagation features;

a prediction module, configured to use the prediction model to analyze the information data generated within a period of time and predict the popularity of the information.

Preferably, the system further includes:

a storage module, configured to store the user data and the information data in a database.

Preferably, the classification module includes:

a normalization sub-module, configured to normalize the user attribute features and the user behavior features to obtain user features;

a division sub-module, configured to classify the user data with a clustering algorithm according to the user features to obtain the categories of the users in the user data.

Preferably, the prediction model module includes:

a model building sub-module, configured to establish a multivariate linear model based on user characteristics;

a training sub-module, configured to use the user propagation information as a training set to train the linear model and obtain the social network information popularity prediction model.

Based on the above technical solution, the method provided by the present invention for predicting the popularity of social network information based on user characteristics fully considers the real-time nature of social network information and the influence of user characteristics on information dissemination. It describes the dissemination pattern through popularity prediction, so dissemination can be predicted as early as possible, which reduces the lag of traditional methods and supports timely information push and timely control of public opinion on social networks. In addition, the system of the invention has a low memory cost at runtime, is highly efficient, and is self-contained and portable. Overall, the invention makes early predictions of information popularity, which greatly helps both timely information push and timely control of online public opinion.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for predicting social network information popularity based on user characteristics provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method for acquiring user data and information data provided by another embodiment of the present invention;

Fig. 3 is a structural diagram of a system for predicting social network information popularity based on user characteristics provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a classification module provided by another embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a prediction model module provided by another embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Fig. 1 shows a method for predicting social network information popularity based on user characteristics provided by an embodiment of the present invention. The method includes the following steps:

Step 101: Acquire the information data in the social network within a preset time and the user data corresponding to the information data. The user data includes a plurality of user attribute features.

In this embodiment, the user data and information data acquired from the social network are stored in a database.

Different acquisition methods are used for different types of social network. Forum-type social networking sites use posts as the carrier of information, so a web crawler is suitable for obtaining post data.

Microblog-type social networks spread messages as short-text microblogs, so the application programming interface (API) provided by the microblog platform can be used to obtain the information data and user data.

For community-type social networks, user data can be obtained with a web crawler, and information data can then be obtained through these users' clipboards.

Step 102: Extract a subset of the user attribute features from the user data, and obtain the user behavior features from the user data.

Specifically, the acquired user data is divided into user attribute features and user behavior features.

User attribute features are the information a user provides when registering a social network account, such as name, age and gender. Valid attribute features that may affect information dissemination are retained, and invalid features that do not affect it, such as phone number and postal code, are removed.

User behavior features are generated by a user's activity on the social network, such as the number of friends and the number of replies. Some user behavior data cannot be obtained directly through an API or a web crawler and must be derived by calculation, for example the age of the user's account on the social network and the clustering coefficient. The user features referred to below consist of the valid user attribute features and the complete set of user behavior features.

Step 103: Classify the user data according to the user attribute features and the user behavior features to obtain the categories of the users in the user data.

In this embodiment, this step includes: normalizing the user attribute features and the user behavior features to obtain user features; and classifying the user data with a clustering algorithm, here the CLA algorithm, according to the user features.

Specifically, the CLA algorithm is a clustering algorithm that does not require the number of categories to be specified in advance; instead, it finds a suitable number under certain conditions. First, CLA divides the users into two categories and computes the distance between the category centers. When this distance is below a given value, the two categories are considered to belong together and are merged into one. The number of categories is then increased and the users are reclassified in the same way, until three categories merge into one for the first time, at which point the algorithm stops. The users are thereby divided into a suitable set of categories.

Step 104: Obtain the user propagation features corresponding to the information data according to the information data and the categories of the users in the user data.

Using the user classification above, the users who took part in disseminating each item of information data in the database are counted by category. The number of users in each category is the user propagation feature of that item of information.

Step 105: Determine a prediction model of social network information popularity according to the user propagation features.

Specifically, this step includes: establishing a multivariate linear model based on user characteristics; and using the user propagation information as a training set to train the linear model and obtain the social network information popularity prediction model.

In this embodiment, the user propagation features of the information are used as the training set, and linear regression yields a model that predicts information popularity. During the calculation, gradient descent can be used to quickly obtain the weight with which each category of users influences information dissemination.
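Step 105 can be sketched as a small linear regression fitted by batch gradient descent. This is a minimal illustration under stated assumptions, not the patented implementation: the loss is the ordinary mean squared error, the model has no intercept term, and the function names, learning rate and iteration count are all hypothetical choices.

```python
def predict(weights, features):
    """Linear prediction: weighted sum of the per-category user counts."""
    return sum(w * x for w, x in zip(weights, features))


def fit_linear_model(samples, targets, learning_rate=0.1, iterations=2000):
    """Fit the weights by batch gradient descent on the mean squared error.

    samples: list of user propagation feature vectors (one per message).
    targets: observed final popularity of each message.
    Returns one weight per user category.
    """
    n = len(samples)
    weights = [0.0] * len(samples[0])
    for _ in range(iterations):
        # Prediction error of the current weights on every training sample.
        errors = [predict(weights, x) - y for x, y in zip(samples, targets)]
        # Gradient of the mean squared error with respect to each weight.
        gradient = [
            2.0 / n * sum(e * x[j] for e, x in zip(errors, samples))
            for j in range(len(weights))
        ]
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Toy data: two user categories, popularity = 2 * x1 + 3 * x2.
samples = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
targets = [2, 3, 5, 7, 8]
weights = fit_linear_model(samples, targets)
```

The learning rate that converges depends on the scale of the features, so real reply counts would normally be rescaled before fitting, or the rate reduced.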

Step 106: Use the resulting prediction model to analyze the information data generated within a period of time and predict the popularity of the information.

The method provided by this embodiment for predicting the popularity of social network information based on user characteristics fully considers the real-time nature of social network information and the influence of user characteristics on information dissemination. It describes the dissemination pattern through popularity prediction, so dissemination can be predicted as early as possible, which reduces the lag of traditional methods and supports timely information push and timely control of public opinion on social networks. In addition, the system of the invention has a low memory cost at runtime, is highly efficient, and is self-contained and portable. Overall, the invention makes early predictions of information popularity, which greatly helps both timely information push and timely control of online public opinion.

Another embodiment of the present invention, described below, uses Tianya Forum as an example to illustrate the method for predicting the popularity of social network information based on user characteristics. The method includes:

Step 1: Obtain the information data and user data.

The specific process of this step is as follows:

Because Tianya Forum does not provide an API for obtaining data effectively, this embodiment obtains the information data and user data by writing web crawlers.

Platform environment: a Microsoft SQL Server 2008 database is installed and configured on a 32-bit Windows 7 platform, and the web crawler program PostCrawler is written in Microsoft Visual Studio 2010. Fig. 2 is a flowchart of the method for acquiring user data and information data provided in this embodiment; the specific procedure is described under the crawler programs PostCrawler and UserCrawler below.

1) The crawler PostCrawler is run by setting up a pool of Uniform Resource Locators (URLs) on the host. Each post has a unique post ID, from which the post's URL can be derived; a URL pool can be built from consecutive post IDs so that information data and user data can be fetched continuously. However, because some posts have been deleted by the site, some post URLs do not return post information, so regular expressions are used to filter for valid information data before the data is collected.

PostCrawler is defined as follows:

2) The crawler UserCrawler can be configured to obtain user data from the IDs of the users who replied to a post. Each user has a user ID, from which the URL of the corresponding user page can be found. Through that URL, the user page can be visited to read the user's basic information and activity history and store them in the database.

UserCrawler is defined as follows:

3) Design of the SQL Server 2008 database. The fields for the information data and user data are as follows:

Information data: ID (post ID), hostID (ID of the posting user), click (number of clicks), reply (number of replies), time (posting time), userIDList (list of replying user IDs).

List of replying user IDs: userID (ID of the replying user), replyTime (reply time).

User data: ID (user ID), fans (number of fans), follows (number of users followed), posts (number of posts), replyPosts (number of replies), registerDate (registration date), lastLoginDate (time of last login), score (community points), logins (number of logins), topic (number of boards participated in), age (length of time using Tianya Forum), clusteringCoefficient (clustering coefficient), reciprocity (reciprocity coefficient), userType (user category).

Here, the functions PostCrawler() and UserCrawler() may be implemented in any existing programming language on any existing operating system platform to obtain the information data and user data.
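The URL-pool and filtering logic described for the crawlers can be sketched as follows. Everything site-specific here is hypothetical: the URL template, the rule that a valid post page has a non-empty `<title>`, and the `data-user-id` attribute are placeholders rather than the real forum's markup, and the network fetch itself is left out.

```python
import re

# Hypothetical URL template; the forum's real URL scheme is not given in the text.
POST_URL_TEMPLATE = "https://forum.example.com/post-{post_id}.shtml"
# Hypothetical validity rule: a page counts as a live post if its title is non-empty.
TITLE_PATTERN = re.compile(r"<title>([^<]*)</title>", re.IGNORECASE)
# Hypothetical markup carrying the ID of each replying user.
REPLY_USER_PATTERN = re.compile(r'data-user-id="(\d+)"')


def build_url_pool(first_post_id, count):
    """Build a pool of post URLs from consecutive post IDs."""
    return [
        POST_URL_TEMPLATE.format(post_id=post_id)
        for post_id in range(first_post_id, first_post_id + count)
    ]


def is_valid_post(html):
    """Filter out pages for deleted posts using a regular expression."""
    match = TITLE_PATTERN.search(html)
    return bool(match and match.group(1).strip())


def extract_reply_user_ids(html):
    """Return the replying user IDs in order of first appearance, without duplicates."""
    seen = []
    for user_id in REPLY_USER_PATTERN.findall(html):
        if user_id not in seen:
            seen.append(user_id)
    return seen
```

In a real crawler, each URL from the pool would be fetched, checked with `is_valid_post`, and the extracted user IDs handed to the user crawler to look up the matching user pages.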

Step 2: Extract the valid user attribute features from the user data, and compute the user behavior features.

User attribute features are the basic user information a user is asked to fill in when registering an account; some of this information can serve as user features for classifying users. Tianya Forum does not require this information at registration, but for other social networks these data can be obtained through APIs and web crawlers. Invalid user attribute features can be deleted before the data is stored in the database; alternatively, all user attribute features can be stored and only the valid ones selected before the classification step.

User behavior features are generated by a user's activity on the social network. Some can be obtained directly through web crawlers and APIs, while others must be computed.

For Tianya Forum users, the following can be computed: age (length of time using Tianya Forum), clusteringCoefficient (clustering coefficient), and reciprocity (reciprocity coefficient).

Age is the user behavior feature that describes how long a user has been active on Tianya Forum, that is, the time from registration to the user's last login. It is computed as:

age = lastLoginDate - registerDate

ClusteringCoefficient is the user behavior feature that measures the interconnectivity between a user and their neighbors. If user A follows user B and user C, then clusteringCoefficient is the probability that a following relationship exists between user B and user C. Let C denote clusteringCoefficient, let G_Δ denote the case where following relationships exist among all of user A, user B and user C, and let G_Λ denote the case where following relationships exist only between user A and user B and between user A and user C. The calculation formula is then obtained:

Reciprocity is the probability that users follow each other, that is, the ratio of the number of users who mutually follow user i to the number of all users that user i follows. Let R denote reciprocity, A the number of users who mutually follow user i, and B the number of all users followed by user i. The calculation formula is:

R = A/B

All user features are obtained in this way and stored in the database.
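The three computed features can be sketched as below. The follower graph is assumed to be a dictionary mapping each user ID to the set of users they follow, and the function names are hypothetical. The clustering coefficient uses one common local definition, the fraction of pairs of followed users that are linked in either direction; this is an assumption and may not match the patent's own formula.

```python
from datetime import date
from itertools import combinations


def account_age_days(register_date, last_login_date):
    """age: days from registration to the last login."""
    return (last_login_date - register_date).days


def reciprocity(user, follows):
    """R = A / B: mutual follows divided by all users the given user follows."""
    followed = follows.get(user, set())
    if not followed:
        return 0.0
    mutual = sum(1 for other in followed if user in follows.get(other, set()))
    return mutual / len(followed)


def clustering_coefficient(user, follows):
    """Fraction of pairs of followed users linked by a follow in either direction.

    Assumed local definition; returns 0.0 when fewer than two users are followed.
    """
    followed = sorted(follows.get(user, set()))
    pairs = list(combinations(followed, 2))
    if not pairs:
        return 0.0
    linked = sum(
        1
        for b, c in pairs
        if c in follows.get(b, set()) or b in follows.get(c, set())
    )
    return linked / len(pairs)


# Toy follower graph: A follows B, C and D; B and D follow A back; B follows C.
follows = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": set(), "D": {"A"}}
```

For user A in this toy graph, two of the three followed users follow back, and one of the three pairs of followed users is linked.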

Step 3: Use the CLA algorithm to classify the users according to their user features. The CLA algorithm is expressed as follows:

Within the CLA algorithm, K-means, the most widely used partition-based clustering algorithm, classifies users by computing the Euclidean distance between user features. d(C₁, C₂) denotes the distance between the cluster center C₁ of category 1 and the cluster center C₂ of category 2, namely

$$d(C_1, C_2) = \sqrt{\sum_{f=1}^{F} \left(C_{1f} - C_{2f}\right)^2}$$

where C_{1f} and C_{2f} are the f-th feature components of the two centers and F is the number of selected user features. T is a threshold parameter of the algorithm that must be set: if the distance between two cluster centers is less than T, the categories of those two cluster centers are merged.

The CLA algorithm yields a suitable number of categories K and K cluster centers. With these cluster centers, users can be divided into K categories according to their user features.
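One possible reading of the CLA procedure is sketched below. The patent's own listing is not reproduced in this text, so the details are assumptions: K-means uses a deterministic, evenly spaced initialization, centers closer than T are grouped transitively, the loop stops the first time a group contains at least three centers, and each group is summarized by the mean of its centers. All function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=50):
    """Plain K-means with evenly spaced initial centers (assumed initialization)."""
    step = len(points) / k
    centers = [list(points[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: euclidean(p, centers[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def merge_close_centers(centers, threshold):
    """Group centers closer than the threshold; return merged centers and largest group size."""
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(len(centers)), 2):
        if euclidean(centers[i], centers[j]) < threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i, center in enumerate(centers):
        groups.setdefault(find(i), []).append(center)
    merged = [[sum(col) / len(g) for col in zip(*g)] for g in groups.values()]
    return merged, max(len(g) for g in groups.values())


def cla(points, threshold, max_k=10):
    """Increase k from 2 and stop the first time three or more centers merge into one."""
    merged = []
    for k in range(2, min(max_k, len(points)) + 1):
        merged, largest = merge_close_centers(kmeans(points, k), threshold)
        if largest >= 3:
            break
    return merged


# Three well-separated groups of users in a two-feature space.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
centers = cla(points, threshold=3.0)
```

On this toy input the procedure settles on three merged centers, one per group. The threshold T has to be chosen relative to the scale of the normalized features.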

步骤四：对于天涯论坛来说，参与信息传播的用户就是对帖子进行回复的用户，通过数据库中帖子的回复用户ID列表，可以得到所有回贴用户的种类：Step 4: For Tianya Forum, the users who participate in information dissemination are the users who reply to the posts. Through the reply user ID list of the posts in the database, the types of all reply users can be obtained:

usertype:＝argmin_j｜｜user_i-C_j｜｜²(1)usertype:＝argmin _j ｜｜user _i -C _j ｜｜ ² (1)

其中user_i是指用户，C_j表示用户类别j的聚类中心，通过计算用户特征与中心点距离最短的聚类中心所对应的就是用户的userType。Among them, user _i refers to the user, C _j represents the cluster center of user category j, and the cluster center with the shortest distance between the user characteristics and the center point corresponds to the user's userType.

Count the number of replying users in each category, and use these counts as the user propagation characteristics of the post.
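The counting step can be sketched as follows. The sketch assumes the reply user ID list holds one entry per reply and that a mapping from user ID to category is already available; `propagation_features`, `user_types` and the `distinct` flag are hypothetical names. With `distinct=False` each reply is counted, and with `distinct=True` each replying user is counted once.

```python
from collections import Counter


def propagation_features(reply_user_ids, user_types, k, distinct=False):
    """Count replies (or distinct repliers) per user category for one post.

    reply_user_ids: list of user IDs, one entry per reply to the post.
    user_types:     dict mapping user ID -> category index in range(k).
    Users without a known category are skipped.
    """
    ids = set(reply_user_ids) if distinct else reply_user_ids
    counts = Counter(user_types[uid] for uid in ids if uid in user_types)
    return [counts.get(j, 0) for j in range(k)]
```

The returned list has one entry per category and can serve directly as the feature vector of the post.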

Step 5: The model is first defined as predicting the total number of replies to a post v within $t_t$ days ($t_t > t_r$) from the history of user replies to v during its first $t_r$ days. The new model considers not only the number of replies within $t_r$ days but also which users made those replies. Using the user labels computed in the previous step, all users who replied to the post are grouped by category. We build a multivariate linear model based on user characteristics to capture both the strong linear relationship between early and later reply volumes and the influence of user behavior on reply volume. Define $x_i(v, t_r)$ as the total number of replies within $t_r$ days made by users whose category label userType is i. This gives the feature vector $X_k(v, t_r) = (x_1(v, t_r), x_2(v, t_r), \ldots, x_k(v, t_r))$. The prediction model for the total reply volume of the post within $t_t$ days is then:

$$\hat{N}(v, t_t) = \Gamma_{(t_r, t_t)} \cdot X_k(v, t_r) \qquad (2)$$

where $\Gamma_{(t_r, t_t)} = (\gamma_1, \gamma_2, \ldots, \gamma_k)$ is the parameter vector of the model and $\gamma_i$ is the weight with which the number of replies from users of category i influences the prediction. The parameter vector depends on $t_r$, $t_t$ and k.
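As an illustration, and assuming that the prediction of a multivariate linear model is the dot product of the parameter vector and the feature vector, the prediction step can be sketched as below. The name `predict_replies` and the example numbers are hypothetical.

```python
def predict_replies(gamma, x):
    """Predicted total reply count for one post.

    gamma: list of k weights, one per user category.
    x:     list of k reply counts observed within t_r days,
           one per user category.
    """
    if len(gamma) != len(x):
        raise ValueError("gamma and x must have the same length k")
    return sum(g * xi for g, xi in zip(gamma, x))
```

With weights `[2.0, 0.5, 1.0]` and early counts `[10, 4, 0]`, the sketch predicts 22.0 replies in total.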

To compute the parameter vector of the model, we introduce the mean Relative Squared Error (mRSE), an important metric for evaluating prediction models: the smaller the mRSE, the better the model performs. mRSE is expressed as:

$$\mathrm{mRSE} = \frac{1}{|C|} \sum_{v \in C} \left( \frac{\hat{N}(v, t_t)}{N(v, t_t)} - 1 \right)^2 \qquad (3)$$

where C is the sample set and $N(v, t_t)$ is the actual number of replies to post v within $t_t$ days.
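A minimal sketch of the metric follows, assuming the usual definition of mRSE as the mean over posts of (predicted / actual − 1)². The function name `mrse` is hypothetical, and every actual reply count is assumed to be greater than zero.

```python
def mrse(predicted, actual):
    """Mean relative squared error over a set of posts.

    predicted: list of predicted reply counts, one per post.
    actual:    list of actual reply counts (all > 0), same order.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p / a - 1.0) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```

Because the error is relative, a post with 20 actual replies that is predicted at 30 contributes the same error as a post with 2000 actual replies predicted at 3000.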

Substituting formula (2) into formula (3) gives:

$$\mathrm{mRSE} = \frac{1}{|C|} \sum_{v \in C} \left( \frac{\Gamma_{(t_r, t_t)} \cdot X_k(v, t_r)}{N(v, t_t)} - 1 \right)^2$$

Given a training sample set C and chosen values of $t_r$ and $t_t$, gradient descent can be used to find the value of the parameter vector $\Gamma_{(t_r, t_t)}$ at which mRSE reaches its minimum. The problem thus becomes:

$$\Gamma_{(t_r, t_t)} = \arg\min_{\Gamma} \frac{1}{|C|} \sum_{v \in C} \left( \frac{\Gamma \cdot X_k(v, t_r)}{N(v, t_t)} - 1 \right)^2$$
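The training step can be sketched with batch gradient descent. This is an illustration under assumptions: the objective is taken to be the mean over training posts of (γ·x / N − 1)², and the learning rate `lr`, the number of `epochs` and the zero initialization are hypothetical choices, not values given in the text.

```python
def train(samples, k, lr=0.5, epochs=2000):
    """Fit the weight vector gamma by batch gradient descent on mRSE.

    samples: list of (x, total) pairs, where x is the list of k
             per-category reply counts within t_r days and total is
             the actual reply count within t_t days (total > 0).
    """
    gamma = [0.0] * k
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0] * k
        for x, total in samples:
            # Relative error of the current prediction for this post.
            err = sum(g * xi for g, xi in zip(gamma, x)) / total - 1.0
            for i in range(k):
                # d/d gamma_i of (gamma . x / total - 1)^2, averaged over posts.
                grad[i] += 2.0 * err * x[i] / total / n
        gamma = [g - lr * d for g, d in zip(gamma, grad)]
    return gamma
```

For example, training on the three posts `([1, 0], 2)`, `([0, 1], 3)` and `([1, 1], 5)` drives the weights toward 2 and 3, since those weights reproduce all three totals exactly.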

The future reply volume of a post can then be predicted by substituting the trained parameter vector into the prediction model described by formula (2).

FIG. 3 is a structural diagram of a system for predicting social network information popularity based on user characteristics according to an embodiment of the present invention. The system includes an acquisition module 301, a feature extraction module 302, a classification module 303, a processing module 304, a prediction model module 305 and a prediction module 306.

The acquisition module 301 is configured to obtain information data in a social network within a preset time and user data corresponding to the information data, the user data including a plurality of user attribute characteristics.

The feature extraction module 302 is configured to extract some of the user attribute characteristics from the user data and to obtain, from the user data, the user behavior characteristics of the user data.

The classification module 303 is configured to classify the user data according to the user attribute characteristics and the user behavior characteristics, obtaining the categories of the users in the user data.

The processing module 304 is configured to obtain the user propagation characteristics corresponding to the information data according to the information data and the categories of the users in the user data.

The prediction model module 305 is configured to determine a prediction model of social network information popularity according to the user propagation characteristics.

The prediction module 306 is configured to use the prediction model to analyze the information data generated within a period of time and to predict information popularity.

Further, the system also includes a storage module configured to store the user data and the information data in a database.

Specifically, as shown in FIG. 4, the classification module 303 includes a normalization submodule 401 and a division submodule 402.

The normalization submodule 401 is configured to normalize the user attribute characteristics and the user behavior characteristics to obtain the user characteristics.

The division submodule 402 is configured to classify the user data according to the user characteristics using a clustering algorithm, obtaining the categories of the users in the user data.

Specifically, as shown in FIG. 5, the prediction model module 305 includes a model building submodule 501 and a training submodule 502.

The model building submodule 501 is configured to build a multivariate linear model based on user characteristics.

The training submodule 502 is configured to train the linear model using the user propagation characteristics as a training set, obtaining the social network information popularity prediction model.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for predicting social network information popularity based on user characteristics, characterized in that the method includes:

obtaining information data in a social network within a preset time and user data corresponding to the information data, the user data including a plurality of user attribute characteristics;

extracting from the user data the user attribute characteristics that effectively influence information dissemination, and obtaining, according to the user data, the user behavior characteristics of the user data;

classifying the user data according to the extracted user attribute characteristics and the user behavior characteristics, to obtain the categories of the users in the user data;

obtaining user propagation characteristics corresponding to the information data according to the information data and the categories of the users in the user data;

determining a prediction model of social network information popularity according to the user propagation characteristics;

analyzing the information data generated within a period of time using the prediction model, and predicting information popularity;

wherein obtaining the user propagation characteristics corresponding to the information data according to the information data and the categories of the users in the user data includes:

counting, by user category, the users who participate in information dissemination in the information data in the database, and using the counted number of users in each category as the user propagation characteristics corresponding to the information data;

wherein determining the prediction model of social network information popularity according to the user propagation characteristics includes:

building a multivariate linear model based on user characteristics;

training the linear model using the user propagation characteristics as a training set, to obtain the social network information popularity prediction model;

wherein the user characteristics are obtained by normalizing the extracted user attribute characteristics and the user behavior characteristics.

2. The method according to claim 1, characterized in that, after the step of obtaining the information data in the social network within the preset time and the user data corresponding to the information data, the method further includes:

storing the user data and the information data in a database.

3. The method according to claim 1, characterized in that obtaining the information data in the social network within the preset time and the user data corresponding to the information data includes:

obtaining the user data and information data of forum-type social networks using a web crawler;

obtaining the user data and information data of microblog-type social networks using an application programming interface (API);

obtaining the user data of community-type social networks using a web crawler, and obtaining the information data of community-type social networks using the clipbook of the user.

4. The method according to claim 1, characterized in that classifying the user data according to the extracted user attribute characteristics and the user behavior characteristics, to obtain the categories of the users in the user data, includes:

normalizing the extracted user attribute characteristics and the user behavior characteristics to obtain the user characteristics;

classifying the user data according to the user characteristics using a clustering algorithm, to obtain the categories of the users in the user data.

5. The method according to claim 4, characterized in that classifying the user data using the clustering algorithm includes:

dividing the user data into two categories and calculating the distance between the category centers, and, if the distance between the category centers is less than a preset value, merging the two categories into one category;

continuing to classify the user data of each category and calculating the distances between the centers of the categories, and stopping classification when the user data of three categories is merged into one category, to obtain the categories of the users.

6. A system for predicting social network information popularity based on user characteristics, characterized in that the system includes:

an acquisition module, configured to obtain information data in a social network within a preset time and user data corresponding to the information data, the user data including a plurality of user attribute characteristics;

a feature extraction module, configured to extract from the user data the user attribute characteristics that effectively influence information dissemination, and to obtain, according to the user data, the user behavior characteristics of the user data;

a classification module, configured to classify the user data according to the extracted user attribute characteristics and the user behavior characteristics, to obtain the categories of the users in the user data;

a processing module, configured to obtain user propagation characteristics corresponding to the information data according to the information data and the categories of the users in the user data;

a prediction model module, configured to determine a prediction model of social network information popularity according to the user propagation characteristics;

a prediction module, configured to analyze the information data generated within a period of time using the prediction model, and to predict information popularity;

wherein the processing module is specifically configured to:

wherein the prediction model module includes:

a model building submodule, configured to build a multivariate linear model based on user characteristics;

a training submodule, configured to train the linear model using the user propagation characteristics as a training set, to obtain the social network information popularity prediction model.

7. The system according to claim 6, characterized in that the system further includes:

a storage module, configured to store the user data and the information data in a database.

8. The system according to claim 6, characterized in that the classification module includes:

a normalization submodule, configured to normalize the extracted user attribute characteristics and the user behavior characteristics to obtain the user characteristics;

a division submodule, configured to classify the user data according to the user characteristics using a clustering algorithm, to obtain the categories of the users in the user data.