CN114692011A

CN114692011A - Social media data feature selection method fusing L1 regularization and link attributes

Info

Publication number: CN114692011A
Application number: CN202210258834.7A
Authority: CN
Inventors: 潘晓光; 令狐彬; 张娜; 张雅娜; 陈智娇
Original assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Current assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date: 2022-03-16
Filing date: 2022-03-16
Publication date: 2022-07-01

Abstract

The invention belongs to the field of data analysis and specifically relates to a social media data feature selection method fusing L1 regularization and link attributes, comprising the following steps: S1, inputting social media data, in which rows are samples and columns are the features of the corresponding social content; S2, normalizing the representation of the social media data; S3, extracting four common link relationships from the social media data; S4, combining L1 regularization to perform feature selection under each link relationship; and S5, taking the union of the feature subsets obtained under the different link relationships and outputting it as the final feature set. The method achieves feature selection for social media data with link attributes, handles the association characteristics of social media data that traditional feature selection methods cannot, and lays a solid foundation for subsequent operations such as dimensionality reduction and important-feature analysis of large-scale social media data.

Description

A Feature Selection Method for Social Media Data Fusing L1 Regularization and Link Attributes

Technical field

The invention belongs to the field of data analysis, and in particular relates to a social media data feature selection method fusing L1 regularization and link attributes.

Background

The growth of countless social media services, such as Weibo, lets people communicate and express themselves conveniently and easily. The widespread use of social media generates massive data at an unprecedented rate; for example, hundreds of millions of microblogs are posted and reposted every day. Massive, high-dimensional social media data poses new challenges for data mining tasks such as classification and clustering. Feature selection is widely used in mining high-dimensional data. Traditional feature selection methods, such as L1 regularization, aim to select relevant features from high-dimensional data to obtain a concise and accurate data representation, which alleviates the curse of dimensionality, speeds up learning, and improves the generalization ability of the learned model. Social media data consists mainly of (1) traditional high-dimensional attribute-value data (such as posts, comments, and images) and (2) link data describing the relationships between social media users and who published each post. This characteristic brings new challenges to feature selection, because traditional feature selection methods cannot exploit the additional information in link data. In addition, social media data is by nature massive, noisy, and incomplete, which makes the already challenging problem of feature selection for linked social media data even harder.

Technical problem: The vast majority of existing feature selection algorithms operate on "flat" data (each sample is represented by the values of multiple features or attributes), which consists of uniform attribute-value data points that are usually assumed to be independent and identically distributed. Taking the properties of link data in social media into account, the present invention extracts distinct relationships from the link data and incorporates them into a traditional feature selection method, thereby achieving feature selection for social media data with link attributes.

Summary of the invention

In view of the above problems, the present invention provides a social media data feature selection method fusing L1 regularization and link attributes.

The social media data feature selection method fusing L1 regularization and link attributes disclosed in the present application comprises the following steps:

S1. Input social media data, in which rows are samples and columns are the features of the corresponding social content;

S2. Normalize the representation of the social media data;

S3. Extract four common link relationships from the social media data;

S4. Combine L1 regularization to perform feature selection under each link relationship;

S5. Take the union of the feature subsets obtained under the different link relationships and output it as the final feature set.

Further, in step S1, the social media data containing link relationships is given a normalized representation.

Further, in step S2, guided by social correlation theory, various link relationships can be extracted from the link data. The following four link relationships can be combined with an existing feature selection method and modeled as new feature selection criteria, thereby achieving feature selection for linked data in social media. The link relationships are: 1) Co-post: multiple posts come from the same user, that is, social media instances from the same user are more similar; 2) Co-following: if two users follow the same user, the posts produced by those two users are more similar; 3) Co-followed: if two users are followed by the same user, their posts are likely to be similar; 4) Following: if one user follows another, the two users may share interests, so their posts may be similar.

Further, in step S3, once the link relationships have been constructed, L1 regularization is used as the base feature selection model and combined with the above link relationships to construct new feature selection objectives. The optimization objective of L1 regularization is as follows:

[formula image not reproduced]

where W denotes the feature weights and the parameter α controls the sparsity of W; let L(X, Y) denote [formula image not reproduced]

Further, step S3 also includes constructing the corresponding optimization problem for each link relationship. For the Co-post link relationship:

[formula image not reproduced]

where β adjusts the contribution of Co-post and T(f_i) = W^T f_i. For the Co-following link relationship:

[formula image not reproduced]

where β adjusts the contribution of Co-following, [formula image not reproduced]

For the Co-followed link relationship, first introduce the indicator matrix

[formula image not reproduced]

whose entries indicate whether u_j is the author of p_i; and FE = sign(SS^T), where sign(x) = 1 if x > 0 and sign(x) = 0 otherwise. L_FE denotes the Laplacian matrix of FE. The optimization objective is then expressed as:

[formula image not reproduced]

where B = XX^T + βXH L_FE H^T X^T and E = Y^T X^T. For the Following link relationship:

[formula image not reproduced]

where B = XX^T + βXH L_S H^T X^T and E = Y^T X^T; the difference from the Co-followed link relationship is that L_S is the Laplacian matrix of S.

Further, in step S4, feature selection modeling of the social media data is carried out using the link relationship feature selection criteria obtained above; the feature subsets obtained for the individual relationships are combined by taking their union to give the final feature set, which is then used for subsequent social media data analysis.

The present invention has the following advantages: it proposes fusing known association relationships in social media data to solve the feature selection problem for this type of data. Four types of relationships are extracted from the link data and their constraints are integrated into L1 regularization, a commonly used feature selection method. This achieves feature selection for social media data with link attributes, handles the association characteristics of social media data that traditional feature selection methods cannot, and lays a solid foundation for subsequent operations such as dimensionality reduction and important-feature analysis of large-scale social media data.

Description of drawings

Figure 1 is a schematic diagram of social media data;

Figure 2 is the feature selection framework for social media data with link attributes;

Figure 3 is a flowchart of the social media data feature selection method fusing L1 regularization and link attributes.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

The social media data feature selection method fusing L1 regularization and link attributes disclosed in the present application comprises the following steps:

S1. First, the social media data containing link relationships must be given a normalized representation. Because the data points or instances of social media data are inherently interconnected, Figure 1 gives, without loss of generality, a simple example of social media data with two data representations. Figure 1(a) has four users (u1, …, u4); each user follows some other users (for example, u1 follows u2 and u4) and has some posts or blogs (for example, user u1 wrote posts p1 and p2). Figure 1(b) is the conventional attribute-value representation: rows are posts, each representing one microblog or post published on social media, and columns are features or terms of the text, which is the main content carrier of social media data. In the social media context there is also additional information in the form of link data, as shown in Figure 1(c), including relationships such as who published a post and who follows whom.

Let p = {p1, p2, …, pn} denote the posts or blogs in the social media data, where n is the number of posts or blogs; let f = {f1, f2, …, fm} denote the feature attributes describing the posts or blogs, where m is the number of features. X denotes the feature data matrix formed by all samples (p) and their feature attribute sets, of size m*n. c = {c1, c2, …, ck} denotes the labels assigned to each post or sample, where k is the number of label classes. Y denotes the matrix formed by all samples (p) and their labels, of size n*k. u = {u1, u2, …, ut} denotes the set of users, where t is the number of users. The following relationships between users are modeled as a graph with adjacency matrix S, where S(i,j) = 1 indicates a following relationship from uj to ui (user j follows user i) and S(i,j) = 0 otherwise; in Figure 1, for example, S(:,1) = [0,1,0,1]^T. Likewise, Q(i,j) = 1 indicates that post pj was published by user ui and Q(i,j) = 0 otherwise; in Figure 1, for example, Q(1,:) = [1,1,0,0,0,0,0,0]. Traditional supervised feature selection therefore selects a subset of the m features based on the mapping {X, Y}, whereas feature selection for social media data with link relationships selects a subset of the m features based on the mapping {X, Y} while also taking into account the user-user relationship S and the user-post relationship Q. S and Q can be regarded as constraints specific to feature selection on social media data.
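The notation above can be made concrete with a minimal sketch that encodes the Figure 1 example as NumPy arrays. Only the links named in the text are included (u1 follows u2 and u4, u3 follows u4, and the authorship of p1 to p8); any further links shown in the figure are not reproduced here, and the variable names `authors` and `follows` are illustrative assumptions.

```python
import numpy as np

n_users, n_posts = 4, 8

# Q[i, j] = 1 if post p_(j+1) was published by user u_(i+1) (0-based indices).
# Authorship as stated in the text: u1 -> p1,p2; u2 -> p3,p4,p5; u3 -> p6,p7; u4 -> p8.
authors = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}
Q = np.zeros((n_users, n_posts), dtype=int)
for user, posts in authors.items():
    Q[user, posts] = 1

# S[i, j] = 1 if user u_(j+1) follows user u_(i+1).
# Only the follow links mentioned in the text are encoded.
follows = [(0, 1), (0, 3), (2, 3)]  # (follower, followee)
S = np.zeros((n_users, n_users), dtype=int)
for follower, followee in follows:
    S[followee, follower] = 1

print(S[:, 0])  # column of u1, i.e. S(:,1) in the text
print(Q[0, :])  # row of u1, i.e. Q(1,:) in the text
```

With this encoding, the column S(:,1) and the row Q(1,:) match the values quoted in the text.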

S2. Once the social media data has been represented, the distinct relationships among different types of samples must be integrated to model the feature selection process. Guided by social correlation theory, various relationships can be extracted from the link data, namely several common, known link relationships between users and social media:

1) Co-post: multiple posts come from the same user, that is, social media instances from the same user are more similar. In other words, one user's posts are more similar in topic (for example, "sports" or "music") than randomly selected posts, such as {p3, p4, p5} in Figure 1(a);

2) Co-following: if two users follow the same user, the posts produced by those two users are more similar. In Figure 1(a), u1 and u3 both follow u4, so their posts {p1, p2} and {p6, p7} are likely to be on similar topics;

3) Co-followed: if two users are followed by the same user, their posts are likely to be similar. For example, in Figure 1(a), users u2 and u4 are both followed by user u1, so their posts {p3, p4, p5} and {p8} may have similar topics;

4) Following: if one user follows another, the two users may share interests, so their posts may be similar. For example, in Figure 1(a), u1 follows u2, so their posts {p1, p2} and {p3, p4, p5} may be similar.

Based on these four link relationships, an existing feature selection method can be extended and modeled as new feature selection criteria, thereby achieving feature selection for linked data in social media.
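The four relationships above can be read off the matrices S and Q. The following is a minimal sketch, assuming the convention S(i,j) = 1 when user j follows user i. The text gives sign(SS^T) for Co-followed; using sign(S^T S) for Co-following is an assumption made by analogy, and the function name `link_relations` is illustrative.

```python
import numpy as np

def link_relations(S, Q):
    """Return 0/1 adjacency matrices for the four link relationships.

    S[i, j] = 1 if user j follows user i; Q[i, j] = 1 if user i wrote post j.
    """
    S = np.asarray(S)
    Q = np.asarray(Q)
    co_post = (Q.T @ Q > 0).astype(int)       # posts x posts: same author
    co_following = (S.T @ S > 0).astype(int)  # users x users: follow a common user
    co_followed = (S @ S.T > 0).astype(int)   # users x users: followed by a common user
    following = S.copy()                      # users x users: direct follow links
    for A in (co_post, co_following, co_followed):
        np.fill_diagonal(A, 0)                # drop trivial self-pairs
    return co_post, co_following, co_followed, following

# Links named in the text for the Figure 1 example.
S = np.zeros((4, 4), dtype=int)
S[1, 0] = S[3, 0] = S[3, 2] = 1               # u1 -> u2, u1 -> u4, u3 -> u4
Q = np.zeros((4, 8), dtype=int)
for user, posts in {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}.items():
    Q[user, posts] = 1

co_post, co_following, co_followed, following = link_relations(S, Q)
```

On the example data this marks u1 and u3 as a Co-following pair (both follow u4) and u2 and u4 as a Co-followed pair (both followed by u1), in line with the examples in items 2) and 3).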

S3. Once the link relationships have been constructed, L1 regularization (Lasso regression) is used as the base feature selection model and combined with the four link relationships above to construct new feature selection objectives. The optimization objective of L1 regularization is as follows:

[formula image not reproduced]

where W denotes the feature weights and the parameter α controls the sparsity of W; let L(X, Y) denote [formula image not reproduced]

The optimization problems corresponding to the four relationships above are then:
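For orientation, a standard Lasso-style objective that is consistent with the definitions given earlier (X of size m*n, Y of size n*k, so W of size m*k) could be written as below. This is a hedged sketch of the usual form, not the formula from the patent, which is not reproduced in this text; the patent may use a different loss or a row-sparse norm.

```latex
\min_{W}\; \lVert X^{T} W - Y \rVert_{F}^{2} \;+\; \alpha \,\lVert W \rVert_{1}
```

Under this assumed form, expanding the squared loss gives tr(W^T XX^T W) - 2 tr(Y^T X^T W) plus a constant, which agrees with the terms B = XX^T + ... and E = Y^T X^T that appear in the relationship-specific objectives.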

For the Co-post link relationship:

[formula image not reproduced]

where β adjusts the contribution of Co-post and T(f_i) = W^T f_i.
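A Co-post term of this kind can be sketched as a penalty on the distance between T(f_i) and T(f_j) for posts by the same user. The exact formula is not reproduced in this text, so the version below (a sum of squared distances over ordered pairs of same-author posts) is an assumption. The sketch also checks the standard identity that lets such a sum be written with a graph Laplacian, which is the form the B matrices in the later objectives take. Sizes m = 5 and k = 2 are hypothetical.

```python
import numpy as np

def co_post_penalty(W, X, Q):
    """Sum ||W^T f_i - W^T f_j||^2 over ordered pairs of posts by the same user."""
    Z = W.T @ X  # column i is T(f_i) = W^T f_i
    total = 0.0
    for user_row in Q:
        posts = np.flatnonzero(user_row)
        for i in posts:
            for j in posts:
                total += float(np.sum((Z[:, i] - Z[:, j]) ** 2))
    return total

def co_post_penalty_laplacian(W, X, Q):
    """Same quantity written as 2 * tr(W^T X L X^T W), with L the Laplacian of Q^T Q."""
    A = (Q.T @ Q).astype(float)          # n x n co-post adjacency
    L = np.diag(A.sum(axis=1)) - A       # graph Laplacian D - A
    return 2.0 * float(np.trace(W.T @ X @ L @ X.T @ W))

# Authorship from the Figure 1 example.
Q = np.zeros((4, 8))
for user, posts in {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}.items():
    Q[user, posts] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # hypothetical: m = 5 features, n = 8 posts
W = rng.normal(size=(5, 2))  # hypothetical: k = 2 classes
direct = co_post_penalty(W, X, Q)
via_laplacian = co_post_penalty_laplacian(W, X, Q)
```

The two values agree up to floating-point error, so the pairwise penalty can be folded into a quadratic term in W.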

For the Co-following link relationship:

[formula image not reproduced]

where β adjusts the contribution of Co-following, [formula image not reproduced]

For the Co-followed link relationship, first introduce the indicator matrix

[formula image not reproduced]

whose entries indicate whether u_j is the author of p_i; and FE = sign(SS^T), where sign(x) = 1 if x > 0 and sign(x) = 0 otherwise. L_FE denotes the Laplacian matrix of FE. The optimization objective is then expressed as:

[formula image not reproduced]

where B = XX^T + βXH L_FE H^T X^T and E = Y^T X^T;

Similarly, for the Following link relationship:

[formula image not reproduced]

where B = XX^T + βXH L_S H^T X^T and E = Y^T X^T; the difference from the Co-followed link relationship is that L_S is the Laplacian matrix of S. By combining the four link relationships common in social media data with the L1-regularized feature selection method, feature selection for social media data containing link relationships can be achieved. The principle of the algorithm is shown in Figure 2.
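The objectives built from B and E can be sketched as follows, under several stated assumptions, since the formulas themselves are not reproduced in this text: the objective is taken to be tr(W^T B W - 2EW) + α‖W‖_1; H is taken to be H(i,j) = 1/|F_j| when u_j wrote p_i (|F_j| being that user's post count); the user graph is symmetrized before forming the Laplacian; and the solver is plain proximal gradient descent (ISTA), which the patent does not specify. All function names are illustrative.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian D - A of a user-user adjacency (symmetrized first)."""
    A = np.asarray(A, dtype=float)
    A = np.maximum(A, A.T)
    return np.diag(A.sum(axis=1)) - A

def author_matrix(Q):
    """H[i, j] = 1/|F_j| if user j wrote post i, else 0 (assumed form of H)."""
    Q = np.asarray(Q, dtype=float)
    counts = np.maximum(Q.sum(axis=1, keepdims=True), 1.0)
    return (Q / counts).T                     # n_posts x n_users

def build_B_E(X, Y, Q, A_users, beta):
    XH = X @ author_matrix(Q)                 # m x t: mean feature vector per user
    B = X @ X.T + beta * XH @ laplacian(A_users) @ XH.T
    E = Y.T @ X.T
    return B, E

def objective(W, B, E, alpha):
    return float(np.trace(W.T @ B @ W - 2.0 * E @ W) + alpha * np.abs(W).sum())

def ista_l1(B, E, alpha, n_iter=500):
    """Proximal gradient descent on tr(W^T B W - 2 E W) + alpha * ||W||_1."""
    step = 1.0 / (2.0 * np.linalg.eigvalsh(B).max() + 1e-12)  # 1 / Lipschitz constant
    W = np.zeros((B.shape[0], E.shape[0]))
    for _ in range(n_iter):
        V = W - step * 2.0 * (B @ W - E.T)    # gradient step on the smooth part
        W = np.sign(V) * np.maximum(np.abs(V) - step * alpha, 0.0)  # soft threshold
    return W

# Figure 1 example links plus hypothetical features and labels.
Q = np.zeros((4, 8))
for user, posts in {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}.items():
    Q[user, posts] = 1.0
S = np.zeros((4, 4))
S[1, 0] = S[3, 0] = S[3, 2] = 1.0
FE = (S @ S.T > 0).astype(float)              # Co-followed graph, FE = sign(SS^T)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                   # hypothetical: m = 6 features
Y = np.eye(2)[[0, 0, 1, 1, 1, 0, 0, 1]]       # hypothetical one-hot labels, n x k
B, E = build_B_E(X, Y, Q, FE, beta=1.0)
W = ista_l1(B, E, alpha=0.5)
```

Passing S instead of FE as `A_users` gives the Following variant with L_S. Larger α drives more entries of W to exactly zero, and a sufficiently large α leaves W entirely zero.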

S4. Based on the link relationship feature selection criteria obtained above, feature selection modeling of the social media data can be carried out. The feature subsets obtained for the individual relationships can be combined by taking their union to give the final feature set, which is then used for subsequent social media data analysis such as classification.
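The union step can be sketched as below. The rule for deciding that a feature is "selected" (its row of W has a non-negligible absolute sum) and the tolerance `tol` are assumptions, as the text does not state the selection rule; the toy weight matrices are hypothetical.

```python
import numpy as np

def selected_features(W, tol=1e-8):
    """Indices of features whose weight row in W is not (numerically) zero."""
    W = np.asarray(W, dtype=float)
    return set(np.flatnonzero(np.abs(W).sum(axis=1) > tol).tolist())

def union_of_subsets(weight_matrices, tol=1e-8):
    """Union of the feature subsets selected under each link relationship."""
    final = set()
    for W in weight_matrices:
        final |= selected_features(W, tol)
    return sorted(final)

def toy_weights(rows, m=5, k=2):
    """Hypothetical m x k weight matrix with the given rows set to non-zero."""
    W = np.zeros((m, k))
    W[rows] = 1.0
    return W

# One hypothetical weight matrix per relationship:
# Co-post, Co-following, Co-followed, Following.
weights = [toy_weights([0, 2]), toy_weights([2]), toy_weights([4]), toy_weights([0])]
final_features = union_of_subsets(weights)
```

Taking the union keeps any feature that is useful under at least one relationship, so the final set is at least as large as the largest individual subset.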

S5. After the classifiers have been constructed, the attributes of unknown samples to be evaluated can be predicted by inputting their social media feature data.

Only the preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to these embodiments. Within the knowledge of those of ordinary skill in the art, various changes can be made without departing from the spirit of the present invention, and all such changes fall within the protection scope of the present invention.

Claims

1. A social media data feature selection method fusing L1 regularization and link attributes, characterized by comprising the following steps:

S1. Input social media data, in which rows are samples and columns are the features of the corresponding social content;

S2. Normalize the representation of the social media data;

S3. Extract four common link relationships from the social media data;

S4. Combine L1 regularization to perform feature selection under each link relationship;

S5. Take the union of the feature subsets obtained under the different link relationships and output it as the final feature set.

2. The social media data feature selection method fusing L1 regularization and link attributes according to claim 1, characterized in that: in step S1, the social media data containing link relationships is given a normalized representation.

3. The social media data feature selection method fusing L1 regularization and link attributes according to claim 2, characterized in that: in step S2, guided by social correlation theory, various link relationships are extracted from the link data; the following four link relationships are combined with an existing feature selection method and modeled as new feature selection criteria, thereby achieving feature selection for linked data in social media; the link relationships are: 1) Co-post: multiple posts come from the same user, that is, social media instances from the same user are more similar; 2) Co-following: if two users follow the same user, the posts produced by those two users are more similar; 3) Co-followed: if two users are followed by the same user, their posts are likely to be similar; 4) Following: if one user follows another, the two users may share interests, so their posts may be similar.

4. The social media data feature selection method fusing L1 regularization and link attributes according to claim 3, characterized in that: in step S3, once the link relationships have been constructed, L1 regularization is used as the base feature selection model and combined with the above link relationships to construct new feature selection objectives; the optimization objective of L1 regularization is as follows:

[formula image not reproduced]

5. The social media data feature selection method fusing L1 regularization and link attributes according to claim 4, characterized in that: step S3 also includes constructing the corresponding optimization problem for each link relationship: for the Co-post link relationship:

[formula image not reproduced]

where β adjusts the contribution of Co-following, [formula image not reproduced]

For the Co-followed link relationship, first introduce the indicator matrix

[formula image not reproduced]

where B = XX^T + βXH L_FE H^T X^T and E = Y^T X^T; for the Following link relationship:

[formula image not reproduced]

6. The social media data feature selection method fusing L1 regularization and link attributes according to claim 5, characterized in that: in step S4, feature selection modeling of the social media data is carried out using the obtained link relationship feature selection criteria; the feature subsets obtained for the individual relationships are combined by taking their union to give the final feature set, which is then used for subsequent social media data analysis.