CN114692011A - Social media data feature selection method fusing L1 regularization and link attributes - Google Patents

Social media data feature selection method fusing L1 regularization and link attributes

Info

Publication number
CN114692011A
Authority
CN
China
Prior art keywords
social media
media data
link
feature selection
regularization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210258834.7A
Other languages
Chinese (zh)
Inventor
潘晓光
令狐彬
张娜
张雅娜
陈智娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Sanyouhe Smart Information Technology Co Ltd
Original Assignee
Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority to CN202210258834.7A
Publication of CN114692011A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data analysis, and in particular relates to a social media data feature selection method fusing L1 regularization and link attributes, which comprises the following steps: S1, inputting social media data, in which rows are samples and columns are the features of the corresponding social content; S2, representing the social media data in a normalized form; S3, extracting four common link relations of social media data; S4, combining each link relation with L1 regularization to perform feature selection under that relation; and S5, taking the union of the feature subsets obtained under the different link relations and outputting it as the final feature set. The method achieves feature selection for social media data with link attributes, handles the linked nature of social media data that traditional feature selection methods cannot, and lays a foundation for subsequent operations such as dimensionality reduction and important-feature analysis of large-scale social media data.

Description

A Feature Selection Method for Social Media Data Fusing L1 Regularization and Link Attributes

Technical Field

The invention belongs to the field of data analysis, and in particular relates to a feature selection method for social media data that fuses L1 regularization and link attributes.

Background

The growth of countless social media services, such as Weibo, lets people communicate and express themselves conveniently. The widespread use of social media generates massive amounts of data at an unprecedented rate; for example, hundreds of millions of microblog posts are sent and reposted every day. Such massive, high-dimensional social media data poses new challenges for data mining tasks such as classification and clustering. Feature selection is widely used in mining high-dimensional data. Traditional feature selection methods, such as L1 regularization, aim to select relevant features from high-dimensional data to obtain a concise and accurate data representation; this mitigates the curse of dimensionality, speeds up learning, and improves the generalization ability of the learned model. Social media data consists mainly of (1) traditional high-dimensional attribute-value data (such as posts, comments, and images) and (2) link data describing the relationships between social media users and who published which posts. This property brings new challenges to feature selection, because traditional feature selection methods cannot exploit the additional information in link data. In addition, social media data is by nature massive, noisy, and incomplete, which makes the already challenging problem of feature selection for linked social media data even harder.

Technical problem: the vast majority of existing feature selection algorithms work with "flat" data (each sample is represented by the values of multiple features or attributes), which consists of uniform attribute-value data points that are usually assumed to be independent and identically distributed. Taking the properties of link data in social media into account, the present invention extracts distinct relations from the link data and incorporates them into a traditional feature selection method, thereby achieving feature selection for social media data with link attributes.

Summary of the Invention

In view of the above problems, the present invention provides a feature selection method for social media data that fuses L1 regularization and link attributes.

The social media data feature selection method fusing L1 regularization and link attributes disclosed in the present application comprises the following steps:

S1. Input social media data, in which rows are samples and columns are the features of the corresponding social content;

S2. Represent the social media data in a normalized form;

S3. Extract four common link relations of social media data;

S4. Combine each link relation with L1 regularization to perform feature selection under that relation;

S5. Take the union of the feature subsets obtained under the different link relations and output it as the final feature set.

Further, in step S1, the social media data containing link relations is represented in a normalized form.

Further, in step S2, guided by social correlation theory, various link relations can be extracted from the link data. The following four link relations can be combined with an existing feature selection method and modeled as new feature selection criteria, thereby achieving feature selection for link data in social media. The link relations are: 1) Co-post: posts by the same user, that is, social media instances from the same user, are more similar; 2) Co-following: if two users follow the same user, the posts of those two users are more similar; 3) Co-followed: if two users are followed by the same user, their posts are likely to be similar; 4) Following: if one user follows another, the two users may share interests, so their posts may be similar.

Further, in step S3, after the link relations have been constructed, L1 regularization is taken as the base feature selection model and combined with the above link relations to construct new feature selection optimization objectives. The optimization objective of L1 regularization is as follows:

Figure BDA0003549833780000031

where $W$ denotes the feature weights and the parameter $\alpha$ controls the sparsity of $W$. Let $L(X, Y)$ denote

Figure BDA0003549833780000032

Further, step S3 also includes the corresponding optimization for each link relation. For the Co-post link relation:

Figure BDA0003549833780000033

where $\beta$ adjusts the contribution of Co-post and $T(f_i) = W^T f_i$. For the Co-following link relation:

Figure BDA0003549833780000034

where $\beta$ adjusts the contribution of Co-following and

Figure BDA0003549833780000035

For the Co-followed link relation, first introduce the indicator matrix

Figure BDA0003549833780000036

which indicates whether $u_j$ is the author of $p_i$, and let $FE = \mathrm{sign}(SS^T)$, where $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = 0$ otherwise. With $L_{FE}$ denoting the Laplacian matrix of $FE$, the optimization objective is:

Figure BDA0003549833780000037

where $B = XX^T + \beta X H L_{FE} H^T X^T$ and $E = Y^T X^T$. For the Following link relation:

Figure BDA0003549833780000038

where $B = XX^T + \beta X H L_S H^T X^T$ and $E = Y^T X^T$; the difference from the Co-followed link relation is that $L_S$ is the Laplacian matrix of $S$.

Further, in step S4, feature selection for social media data is modeled using the link-relation feature selection criteria obtained above; the feature subsets obtained for the individual relations are combined by taking their union to give the final feature set, which is then used for subsequent social media data analysis.

The present invention has the following advantages: it uses known association relations in social media data to solve the feature selection problem for this type of data. Four types of relations are extracted from the link data and integrated as constraints into L1 regularization, a commonly used feature selection method. This achieves feature selection for social media data with link attributes, handles the linked nature of social media data that traditional feature selection methods cannot, and lays a foundation for subsequent operations such as dimensionality reduction and important-feature analysis of large-scale social media data.

Description of the Drawings

Figure 1 is a schematic diagram of social media data;

Figure 2 shows the feature selection framework for social media data with link attributes;

Figure 3 is a flowchart of the social media data feature selection method fusing L1 regularization and link attributes.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

The social media data feature selection method fusing L1 regularization and link attributes disclosed in the present application comprises the following steps:

S1. First, the social media data containing link relations must be represented in a normalized form. The data points or instances of social media data are inherently interconnected. Without loss of generality, Figure 1 gives a simple example of social media data with two kinds of data representation. Figure 1(a) has four users (u1, …, u4); each user follows some other users (for example, u1 follows u2 and u4) and has some posts or blogs (for example, user u1 published posts p1 and p2). Figure 1(b) is the conventional representation of attribute-value data: each row is a post, that is, one published microblog or post, and each column is a feature or term of the text, which is the main content carrier of social media data. In the social media setting there is additional information in the form of link data, as shown in Figure 1(c), covering relations such as who published which post and who follows whom.

Let p = {p1, p2, …, pn} denote the posts or blogs in the social media data, where n is the number of posts or blogs; let f = {f1, f2, …, fm} denote the feature attributes describing the posts or blogs, where m is the number of features. X is the feature data matrix formed by all samples (p) and their feature attributes, of size m*n. c = {c1, c2, …, ck} denotes the labels of the posts or samples, where k is the number of label classes. Y is the matrix formed by all samples (p) and their labels, of size n*k. u = {u1, u2, …, ut} denotes the set of users, where t is the number of users.

The follow relations among users are modeled as a graph with adjacency matrix S, where S(i,j) = 1 indicates a follow relation from uj to ui (user j follows user i) and S(i,j) = 0 otherwise; in Figure 1, for example, S(:,1) = [0,1,0,1]^T. Likewise, Q(i,j) = 1 indicates that post pj was published by user ui and Q(i,j) = 0 otherwise; in Figure 1, for example, Q(1,:) = [1,1,0,0,0,0,0,0]. Traditional supervised feature selection thus selects a subset of the m features based on the mapping {X, Y}, whereas feature selection for social media data with link relations selects a subset of the m features based on the mapping {X, Y} while also taking into account the user-user relation S and the user-post relation Q. S and Q can be viewed as constraints specific to feature selection on social media data.
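As a concrete illustration of this notation, the following Python sketch builds Q and S for the Figure 1 example using only the authorship and follow relations stated in the text. The 0-based indexing, the helper names `build_Q` and `build_S`, and the omission of any follow edges that the text does not mention are assumptions of this sketch, not part of the patent.

```python
import numpy as np

# Users u1..u4 and posts p1..p8 from the Figure 1 example (0-based in code).
N_USERS, N_POSTS = 4, 8

# Authorship stated in the text: u1 wrote {p1, p2}, u2 wrote {p3, p4, p5},
# u3 wrote {p6, p7}, u4 wrote {p8}.
AUTHORS = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}

# Follow edges stated in the text, as (follower, followee):
# u1 follows u2 and u4; u3 follows u4. Other edges in Figure 1 are unknown here.
FOLLOWS = [(0, 1), (0, 3), (2, 3)]


def build_Q(authors, n_users, n_posts):
    """User-post matrix: Q[i, j] = 1 if post j was published by user i."""
    Q = np.zeros((n_users, n_posts), dtype=int)
    for user, posts in authors.items():
        Q[user, posts] = 1
    return Q


def build_S(follows, n_users):
    """User-user matrix: S[i, j] = 1 if user j follows user i."""
    S = np.zeros((n_users, n_users), dtype=int)
    for follower, followee in follows:
        S[followee, follower] = 1
    return S


Q = build_Q(AUTHORS, N_USERS, N_POSTS)
S = build_S(FOLLOWS, N_USERS)
print(S[:, 0])  # column of u1: the users that u1 follows
print(Q[0, :])  # row of u1: the posts that u1 published
```

The first column of `S` and the first row of `Q` correspond to the S(:,1) and Q(1,:) examples given in the text.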

S2. Once the social media data has been represented, the distinct relations among different types of samples must be integrated to model the feature selection process. Guided by social correlation theory, various relations can be extracted from the link data, namely several common, known link relations between users and posts in social media data:

1) Co-post: posts by the same user, that is, social media instances from the same user, are more similar. In other words, one user's posts are more similar in topic (for example "sports" or "music") than randomly selected posts, such as {p3, p4, p5} in Figure 1(a);

2) Co-following: if two users follow the same user, the posts of those two users are more similar. In Figure 1(b), u1 and u3 follow u4, so their posts {p1, p2} and {p6, p7} are likely to have similar topics;

3) Co-followed: if two users are followed by the same user, their posts are likely to be similar. For example, in Figure 1(a), users u2 and u4 are both followed by user u1, so their posts {p3, p4, p5} and {p8} may have similar topics;

4) Following: if one user follows another, the two users may share interests, so their posts may be similar. For example, in Figure 1(a), u1 follows u2, so their posts {p1, p2} and {p3, p4, p5} may be similar.

Based on these four link relations, they can be combined with an existing feature selection method and modeled as new feature selection criteria, thereby achieving feature selection for link data in social media.
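The four relations can be read off the matrices S and Q with a few matrix products. The sketch below is one way to do that in Python under stated assumptions: the follow edges are only those mentioned in the text, `sign01` and `link_relations` are hypothetical helper names, and the products used for Co-post (Q^T Q) and Co-following (S^T S) are assumed by analogy with the FE = sign(SS^T) definition that the patent gives for Co-followed.

```python
import numpy as np

# Q[i, j] = 1 if post j was published by user i (Figure 1 example).
Q = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],   # u1: p1, p2
    [0, 0, 1, 1, 1, 0, 0, 0],   # u2: p3, p4, p5
    [0, 0, 0, 0, 0, 1, 1, 0],   # u3: p6, p7
    [0, 0, 0, 0, 0, 0, 0, 1],   # u4: p8
])

# S[i, j] = 1 if user j follows user i; only edges stated in the text.
S = np.zeros((4, 4), dtype=int)
S[1, 0] = S[3, 0] = 1           # u1 follows u2 and u4
S[3, 2] = 1                     # u3 follows u4


def sign01(M):
    """sign(x) as defined in the text: 1 if x > 0, otherwise 0."""
    return (M > 0).astype(int)


def link_relations(S, Q):
    """Return 0/1 matrices for the four link relations (diagonals not removed)."""
    return {
        # posts x posts: 1 if two posts share an author
        "co_post": sign01(Q.T @ Q),
        # users x users: 1 if users a and b follow a common user
        "co_following": sign01(S.T @ S),
        # users x users: 1 if users a and b share a common follower
        "co_followed": sign01(S @ S.T),
        # users x users: the follow graph itself
        "following": S.copy(),
    }


rel = link_relations(S, Q)
print(rel["co_following"][0, 2])  # u1 and u3 both follow u4
print(rel["co_followed"][1, 3])   # u2 and u4 are both followed by u1
```

The diagonal entries are left in place because they cancel when a graph Laplacian D - A is formed from any of these matrices.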

S3. After the link relations have been constructed, L1 regularization (Lasso regression) is taken as the base feature selection model and combined with the above four link relations to construct new feature selection optimization objectives. The optimization objective of L1 regularization is as follows:

Figure BDA0003549833780000061

where $W$ denotes the feature weights and the parameter $\alpha$ controls the sparsity of $W$. Let $L(X, Y)$ denote

Figure BDA0003549833780000062

Then the optimization problems corresponding to the four relations are as follows.

For the Co-post link relation:

Figure BDA0003549833780000063

where $\beta$ adjusts the contribution of Co-post and $T(f_i) = W^T f_i$.

For the Co-following link relation:

Figure BDA0003549833780000064

where $\beta$ adjusts the contribution of Co-following and

Figure BDA0003549833780000065

For the Co-followed link relation, first introduce the indicator matrix

Figure BDA0003549833780000066

which indicates whether $u_j$ is the author of $p_i$, and let $FE = \mathrm{sign}(SS^T)$, where $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = 0$ otherwise. With $L_{FE}$ denoting the Laplacian matrix of $FE$, the optimization objective is:

Figure BDA0003549833780000067

where $B = XX^T + \beta X H L_{FE} H^T X^T$ and $E = Y^T X^T$.

Similarly, for the Following link relation:

Figure BDA0003549833780000068

where $B = XX^T + \beta X H L_S H^T X^T$ and $E = Y^T X^T$; the difference from the Co-followed link relation is that $L_S$ is the Laplacian matrix of $S$. By combining the four common link relations of social media data with the L1-regularized feature selection method, feature selection can be performed on social media data that contains link relations. The algorithm is illustrated in Figure 2.
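The objective formulas appear in this text only as figure references, but the stated definitions of $B$ and $E$ are exactly what results from expanding a squared-error loss plus a graph-Laplacian penalty. The derivation below is a sketch under that assumption: the loss $\|X^T W - Y\|_F^2$, the penalty $\beta\,\mathrm{tr}(W^T X H L H^T X^T W)$, and the generic sparsity term $R(W)$ are assumed forms chosen to be consistent with $B$ and $E$, not a transcription of the patent figures. Here $L$ stands for $L_{FE}$ or $L_S$.

```latex
\begin{aligned}
&\|X^{T}W - Y\|_F^{2}
  + \beta\,\mathrm{tr}\!\left(W^{T} X H L H^{T} X^{T} W\right)
  + \alpha R(W) \\
&\quad= \mathrm{tr}\!\left(W^{T} X X^{T} W\right)
  - 2\,\mathrm{tr}\!\left(Y^{T} X^{T} W\right)
  + \mathrm{tr}\!\left(Y^{T} Y\right)
  + \beta\,\mathrm{tr}\!\left(W^{T} X H L H^{T} X^{T} W\right)
  + \alpha R(W) \\
&\quad= \mathrm{tr}\!\left(W^{T} B W - 2 E W\right) + \alpha R(W) + \mathrm{const},
\qquad B = XX^{T} + \beta X H L H^{T} X^{T},\quad E = Y^{T} X^{T}.
\end{aligned}
```

With $X$ of size m*n, $Y$ of size n*k, and $W$ of size m*k, $B$ is m*m and $E$ is k*m, so both traces are well defined; the constant $\mathrm{tr}(Y^T Y)$ does not depend on $W$ and can be dropped.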

S4. Based on the link-relation feature selection criteria obtained above, feature selection for social media data can be modeled. The feature subsets obtained for the individual relations can be combined by taking their union to give the final feature set, which is then used for subsequent social media data analysis such as classification.
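The per-relation selection and the final union can be sketched as follows. This is a minimal illustration, not the patent's algorithm: it assumes the objective has the form tr(W^T B W - 2EW) + α‖W‖_1 with B and E as defined in the text, solves it with plain proximal gradient descent (ISTA), uses a binary indicator matrix H = Q^T, and symmetrizes each relation matrix before forming its Laplacian. The function names, default parameters, and toy data are all hypothetical.

```python
import numpy as np


def laplacian(A):
    """Unnormalized Laplacian D - A; A is symmetrized first (an assumption)."""
    A = np.maximum(A, A.T).astype(float)
    return np.diag(A.sum(axis=1)) - A


def select_features(X, Y, H, A, alpha=0.1, beta=0.1, n_iter=500, tol=1e-8):
    """ISTA on tr(W^T B W - 2 E W) + alpha * ||W||_1; returns selected row indices."""
    L = laplacian(A)
    XH = X @ H                                   # m x t
    B = X @ X.T + beta * XH @ L @ XH.T           # m x m, symmetric PSD
    E = Y.T @ X.T                                # k x m
    # Step size 1 / Lipschitz constant of the smooth part (gradient 2BW - 2E^T).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(B).max() + 1e-12)
    W = np.zeros((X.shape[0], Y.shape[1]))
    for _ in range(n_iter):
        Z = W - step * 2.0 * (B @ W - E.T)       # gradient step
        W = np.sign(Z) * np.maximum(np.abs(Z) - step * alpha, 0.0)  # soft threshold
    # A feature is kept if its weight row is not (numerically) all zero.
    return set(np.flatnonzero(np.abs(W).sum(axis=1) > tol).tolist())


def union_of_subsets(X, Y, H, relations, **kwargs):
    """Run the selection once per relation and return the sorted union."""
    selected = set()
    for A in relations.values():
        selected |= select_features(X, Y, H, A, **kwargs)
    return sorted(selected)


# Toy data on the Figure 1 example: 3 features, 8 posts, 2 classes.
Q = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
])
S = np.zeros((4, 4), dtype=int)
S[1, 0] = S[3, 0] = S[3, 2] = 1                  # edges stated in the text
relations = {
    "co_following": (S.T @ S > 0).astype(int),
    "co_followed": (S @ S.T > 0).astype(int),
    "following": S,
}
X = np.array([
    [1, 1, 1, 1, 1, 0, 0, 0],                    # feature tied to class 0
    [0, 0, 0, 0, 0, 1, 1, 1],                    # feature tied to class 1
    [0, 0, 0, 0, 0, 0, 0, 0],                    # uninformative feature
], dtype=float)
Y = np.column_stack([X[0], X[1]])                # one-hot labels, n x k
H = Q.T                                          # n x t binary author indicator

selected = union_of_subsets(X, Y, H, relations)
print(selected)
```

The Co-post relation is defined between posts rather than users, so it would need a post-level Laplacian (for instance with H replaced by the identity) and is left out of this user-level sketch. A larger `alpha` yields a sparser W and therefore fewer selected features.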

S5. Once the various classifiers have been built, the attributes of unknown samples to be evaluated can be predicted by inputting methylation data.

Only the preferred embodiments of the present invention have been described in detail above. The present invention is not limited to these embodiments; within the knowledge of those of ordinary skill in the art, various changes can be made without departing from the purpose of the present invention, and all such changes fall within its protection scope.

Claims (6)

1. A social media data feature selection method fusing L1 regularization and link attributes, characterized by comprising: S1. inputting social media data, in which rows are samples and columns are the features of the corresponding social content; S2. representing the social media data in a normalized form; S3. extracting four common link relations of social media data; S4. combining each link relation with L1 regularization to perform feature selection under that relation; S5. taking the union of the feature subsets obtained under the different link relations and outputting it as the final feature set.
2. The social media data feature selection method fusing L1 regularization and link attributes according to claim 1, characterized in that: in step S1, the social media data containing link relations is represented in a normalized form.
3. The social media data feature selection method fusing L1 regularization and link attributes according to claim 2, characterized in that: in step S2, guided by social correlation theory, various link relations can be extracted from the link data; the following four link relations can be combined with an existing feature selection method and modeled as new feature selection criteria, thereby achieving feature selection for link data in social media; the link relations are: 1) Co-post: posts by the same user, that is, social media instances from the same user, are more similar; 2) Co-following: if two users follow the same user, the posts of those two users are more similar; 3) Co-followed: if two users are followed by the same user, their posts are likely to be similar; 4) Following: if one user follows another, the two users may share interests, so their posts may be similar.
4. The social media data feature selection method fusing L1 regularization and link attributes according to claim 3, characterized in that: in step S3, after the link relations have been constructed, L1 regularization is taken as the base feature selection model and combined with the above link relations to construct new feature selection optimization objectives; the optimization objective of L1 regularization is as follows:
Figure FDA0003549833770000021
where $W$ denotes the feature weights and the parameter $\alpha$ controls the sparsity of $W$; let $L(X, Y)$ denote
Figure FDA0003549833770000022
5. The social media data feature selection method fusing L1 regularization and link attributes according to claim 4, characterized in that: step S3 also includes the corresponding optimization for each link relation; for the Co-post link relation:
Figure FDA0003549833770000023
where $\beta$ adjusts the contribution of Co-post and $T(f_i) = W^T f_i$; for the Co-following link relation:
Figure FDA0003549833770000024
where $\beta$ adjusts the contribution of Co-following and
Figure FDA0003549833770000025
for the Co-followed link relation, first introduce the indicator matrix
Figure FDA0003549833770000026
which indicates whether $u_j$ is the author of $p_i$, and let $FE = \mathrm{sign}(SS^T)$, where $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = 0$ otherwise; with $L_{FE}$ denoting the Laplacian matrix of $FE$, the optimization objective is:
Figure FDA0003549833770000027
where $B = XX^T + \beta X H L_{FE} H^T X^T$ and $E = Y^T X^T$; for the Following link relation:
Figure FDA0003549833770000028
where $B = XX^T + \beta X H L_S H^T X^T$ and $E = Y^T X^T$; the difference from the Co-followed link relation is that $L_S$ is the Laplacian matrix of $S$.
6. The social media data feature selection method fusing L1 regularization and link attributes according to claim 5, characterized in that: in step S4, feature selection for social media data is modeled using the obtained link-relation feature selection criteria; the feature subsets obtained for the individual relations are combined by taking their union to give the final feature set, which is then used for subsequent social media data analysis.
CN202210258834.7A 2022-03-16 2022-03-16 Social media data feature selection method fusing L1 regularization and link attributes Pending CN114692011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210258834.7A CN114692011A (en) 2022-03-16 2022-03-16 Social media data feature selection method fusing L1 regularization and link attributes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210258834.7A CN114692011A (en) 2022-03-16 2022-03-16 Social media data feature selection method fusing L1 regularization and link attributes

Publications (1)

Publication Number Publication Date
CN114692011A true CN114692011A (en) 2022-07-01

Family

ID=82138436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210258834.7A Pending CN114692011A (en) 2022-03-16 2022-03-16 Social media data feature selection method fusing L1 regularization and link attributes

Country Status (1)

Country Link
CN (1) CN114692011A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060405B1 (en) * 2004-12-31 2011-11-15 Google Inc. Methods and systems for correlating connections between users and links between articles
US20170212943A1 (en) * 2016-01-22 2017-07-27 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for unsupervised streaming feature selection in social media
CN111241423A (en) * 2020-01-17 2020-06-05 江西财经大学 Deep recommendation method and system integrating trust distrust relation and attention mechanism
CN113988012A (en) * 2021-10-25 2022-01-28 天津大学 An unsupervised social media summarization method that fuses social context and multi-granularity relations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
任永功等: "基于用户相关性的动态网络媒体数据无监督特征选择算法", 计算机学报, vol. 41, no. 7, 31 July 2018 (2018-07-31), pages 1517 - 1535 *

Similar Documents

Publication Publication Date Title
CN108492200B (en) User attribute inference method and device based on convolutional neural network
CN104318340B (en) Information visualization methods and intelligent visible analysis system based on text resume information
CN103440287B (en) A kind of Web question and answer searching system based on product information structure
CN111754345B (en) Bit currency address classification method based on improved random forest
WO2023155508A1 (en) Graph convolutional neural network and knowledge base-based paper correlation analysis method
Zhao et al. Entity identification for heterogeneous database integration—a multiple classifier system approach and empirical evaluation
CN114817557B (en) Enterprise risk detection method and device based on enterprise credit big data knowledge graph
Zhang et al. Online asymmetric active learning with imbalanced data
CN103092975A (en) Detection and filter method of network community garbage information based on topic consensus coverage rate
CN109214454B (en) A Weibo-Oriented Emotional Community Classification Method
CN105389354A (en) Social media text oriented unsupervised method for extracting and sorting events
CN114579833B (en) A visual analysis method of microblog public opinion based on topic mining and sentiment analysis
CN107423820A (en) The knowledge mapping of binding entity stratigraphic classification represents learning method
CN118134529A (en) Big data-based computer data processing method and system
CN117076765A (en) Intelligent recruitment system sentry matching method and system based on heterogeneous graph neural network
CN106445914B (en) Construction method and construction device of microblog emotion classifier
Wang et al. Missing data imputation for machine learning
CN118503450A (en) A method and system for identifying key nodes of network pollution based on knowledge graph
CN116662564A (en) A service recommendation method based on deep matrix factorization and knowledge graph
CN104657422A (en) Classification decision tree-based intelligent content distribution classification method
CN114692011A (en) Social media data feature selection method fusing L1 regularization and link attributes
CN106934423A (en) The construction method and system of a kind of decision tree
CN116680475A (en) Personalized recommendation method, system and electronic device based on heterogeneous graph attention
Venkatesan et al. An ID3 Algorithm for Performance of Decision Tree in Predicting Student's Absenteeism in an Academic Year using Categorical Datasets
SakethNath et al. Emotion Detection using Natural Language Processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination