CN107578136A

CN107578136A - Overlapping Community Discovery Method Based on Random Walk and Seed Expansion

Info

Publication number: CN107578136A
Application number: CN201710830381.XA
Authority: CN
Inventors: 郭昆; 郭文忠; 陈羽中; 牛玉贞; 陈基杰
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2017-09-14
Filing date: 2017-09-14
Publication date: 2018-01-12

Abstract

The present invention relates to a method for discovering overlapping communities based on random walk and seed expansion, which uses random walk to select seed communities, calculates similarity and optimizes adaptive function community expansion to realize community discovery on large-scale social networks, including the following Steps: 1. Read the original data to obtain the network structure and node neighbor information; 2. Obtain the seed community set Seeds according to the node transition probability matrix, scoring matrix and the similarity between the node and the community; 3. Optimize according to the similarity between communities Seed community set Seeds ; 4. Extend the community according to the similarity between the seed and the community and the adaptive fitness function; 5. Process the free nodes in the network and merge similar communities according to the similarity between nodes and communities and the similarity between communities; 6. Output the network Overlapping community structures. This method can efficiently and accurately find overlapping communities.

Description

Overlapping Community Discovery Method Based on Random Walk and Seed Expansion

技术领域technical field

本发明涉及社交网络上的重叠社区发现技术领域，特别是一种基于随机游走与种子扩展的重叠社区发现方法。The invention relates to the technical field of discovering overlapping communities on social networks, in particular to a method for discovering overlapping communities based on random walk and seed expansion.

背景技术Background technique

现实生活中存在很多复杂的网络结构，例如社交网络、科学家合作网络、文献引用网络、蛋白质互相协作网以及在线社交网络等。这些复杂网络最关键的特性就是都可以被抽象为由许多节点和边组成，节点代表网络中的个体，而边代表个体间的联系，边还可以赋予权重值。复杂网络中的社区通常要求一个社区内的点连接紧密，而社区与社区之间的点连接稀疏。社区发现是研究复杂网络结构的关键技术之一。社区发现的目的就是要从复杂的网络结构中高效准确的发现这些紧密的社区。There are many complex network structures in real life, such as social networks, scientist collaboration networks, literature citation networks, protein collaboration networks, and online social networks. The most critical feature of these complex networks is that they can be abstracted to be composed of many nodes and edges. Nodes represent individuals in the network, while edges represent connections between individuals, and edges can also be assigned weight values. Communities in complex networks usually require that the points within a community are closely connected, while the points between communities are sparsely connected. Community discovery is one of the key techniques for studying complex network structures. The purpose of community discovery is to efficiently and accurately discover these tight communities from complex network structures.

社区结构是复杂网络中重要的结构，如何高效准确的挖掘出社区隐含的结构与信息仍是一个难点，近些年吸引很多人对其进行深入的研究。目前已经提出了很多基于网络结构划分的全局社区发现算法，包括层次聚类算法、谱方法、基于团的方法、边聚类、标签传播等。由于全局社区发现算法需要对整个网络结构信息进行整体认知，在规模较大或者不完整的复杂网络中就存在一些缺陷，所以就有了基于种子扩展的局部社区发现算法。基于种子扩展的局部社区发现算法通常是从网络中的种子节点出发，利用社区的局部信息不断的从网络中加入节点来发现社区结构。Community structure is an important structure in a complex network. How to efficiently and accurately mine the hidden structure and information of the community is still a difficult point. In recent years, many people have attracted many people to conduct in-depth research on it. At present, many global community discovery algorithms based on network structure division have been proposed, including hierarchical clustering algorithm, spectral method, clique-based method, edge clustering, label propagation, etc. Since the global community discovery algorithm requires an overall cognition of the entire network structure information, there are some defects in large-scale or incomplete complex networks, so there is a local community discovery algorithm based on seed expansion. The local community discovery algorithm based on seed expansion usually starts from the seed nodes in the network, and uses the local information of the community to continuously add nodes from the network to discover the community structure.

现有的基于种子扩展的局部社区发现算法在社区发现方面已经取得一定成果，但仍然存在以下几个问题：首先算法的时间与空间复杂度相对较高，在处理较大规模网络时存在不足；其次，算法通常只是通过某个简单策略选取节点作为种子节点，从而影响了社区扩展挖掘的精确度；最后，在社区扩展阶段未能从多个方面考虑节点加入社区的紧密度，发现的社区质量不高。The existing local community discovery algorithm based on seed expansion has achieved some results in community discovery, but there are still several problems: first, the time and space complexity of the algorithm is relatively high, and there are deficiencies in dealing with large-scale networks; Secondly, the algorithm usually only selects nodes as seed nodes through a simple strategy, which affects the accuracy of community expansion mining; finally, in the community expansion stage, it fails to consider the closeness of nodes joining the community from multiple aspects, and the quality of the discovered community not tall.

发明内容Contents of the invention

本发明的目的在于提供一种基于随机游走与种子扩展的重叠社区发现方法，该方法可以高效、准确得发现重叠社区。The purpose of the present invention is to provide a method for discovering overlapping communities based on random walk and seed expansion, which can efficiently and accurately discover overlapping communities.

为实现上述目的，本发明的技术方案是：一种基于随机游走与种子扩展的重叠社区发现方法，包括以下步骤：To achieve the above object, the technical solution of the present invention is: a method for discovering overlapping communities based on random walk and seed expansion, comprising the following steps:

步骤1：读取原始数据，获取网络结构及节点近邻信息；Step 1: Read the original data to obtain the network structure and node neighbor information;

步骤2：根据节点的转移概率矩阵、评分矩阵以及节点与社区的相似度，获得种子社区集合Seeds；Step 2: Obtain the seed community set Seeds according to the transition probability matrix of the node, the scoring matrix and the similarity between the node and the community;

步骤3：根据社区间相似度优化种子社区集合Seeds；Step 3: Optimize the seed community set Seeds according to the similarity between communities;

步骤4：根据种子与社区的相似度和自适应fitness函数扩展社区；Step 4: Extend the community according to the similarity between the seed and the community and the adaptive fitness function;

步骤5：根据节点与社区相似度和社区间相似度处理网络中的自由节点并合并相似社区；Step 5: Process free nodes in the network and merge similar communities according to the similarity between nodes and communities and between communities;

步骤6：输出网络重叠社区结构。Step 6: Output network overlapping community structure.

进一步地，在步骤2中，根据节点的转移概率矩阵、评分矩阵以及节点与社区的相似度，获得种子社区集合Seeds的方法如下：Further, in step 2, according to the node transition probability matrix, scoring matrix and the similarity between the node and the community, the method of obtaining the seed community set Seeds is as follows:

步骤2.1：根据网络的邻接矩阵A_uv和节点度k_v，得到转移概率矩阵P_uv：Step 2.1: According to the network adjacency matrix A _uv and node degree k _v , get the transition probability matrix P _uv :

P_uv＝A_uv/k_v (1)P _uv ＝A _uv /k _v (1)

P_uv中元素表示节点u随机走动一步到达节点v的概率；The elements in P _uv represent the probability that node u walks randomly and reaches node v in one step;

进一步得到随机走动t步之后的转移概率矩阵P^t _uv：Further obtain the transition probability matrix P ^t _uv after random walk t steps:

步骤2.2：根据游走t步之后的转移概率矩阵P^t _uv，得到评分矩阵B：Step 2.2: According to the transition probability matrix P ^t _uv after walking for t steps, the scoring matrix B is obtained:

其中，T表示随机游走的步数阈值；Among them, T represents the step threshold of random walk;

评分矩阵B中每个元素表示节点u走动t步到达节点v所得到的评分，用B(u,v)表示，两个节点连接越紧密，则其间的评分越高；Each element in the scoring matrix B represents the score obtained by node u walking t steps to reach node v, represented by B(u,v), the closer the two nodes are connected, the higher the score between them;

步骤2.3：把B(u,v)降序排序，取出前l个评分值较大的节点构成初始种子列表B-list；Step 2.3: Sort B(u,v) in descending order, and take out the first l nodes with higher score values to form the initial seed list B-list;

步骤2.4：计算B-list中每个节点作为初始种子社区与其近邻节点的节点与社区相似度NC_SIM(v,C_i)，当相似度大于设定阈值时就将邻居节点加入到种子社区中，最终得到种子社区集合Seeds；Step 2.4: Calculate the node-community similarity NC_SIM(v,C _i ) between each node in the B-list as the initial seed community and its neighbor nodes. When the similarity is greater than the set threshold, the neighbor node is added to the seed community. Finally, the seed community collection Seeds is obtained;

其中，v表示节点，为节点v与社区C_i关联边的度，k_v为节点v的度；NC_SIM(v,C_i)越大，则表明节点v越有可能属于社区C_i。Among them, v represents a node, is the degree of the edge associated with node v and community C _i , and k _v is the degree of node v; the larger NC_SIM(v,C _i ) is, the more likely node v belongs to community C _i .

进一步地，在步骤3中，根据社区间相似度优化种子社区集合Seeds的方法如下：Further, in step 3, the method of optimizing the seed community set Seeds according to the similarity between communities is as follows:

步骤3.1：遍历种子社区集合Seeds中的每个种子社区，计算社区间的相似度CC_SIM(C_i,C_j)；Step 3.1: traverse each seed community in the seed community set Seeds, and calculate the similarity CC_SIM(C _i , C _j ) between the communities;

其中，|overlap(C_i,C_j)|为社区C_i与社区C_j共有的节点数，|C_i|为社区C_i的节点数，|C_j|为社区C_j的节点数；CC_SIM(C_i,C_j)越大，则表明社区C_i与社区C_j的结构越相近，当超过设定阈值ε时，将两个社区合并为一个社区；Among them, |overlap(C _i , C _j )| is the number of nodes shared by community C _i and community C _j , |C _i | is the number of nodes in community C _i , |C _j | is the number of nodes in community C _j ; CC_SIM The larger (C _i , C _j ) indicates that the structure of the community C _i and the community C _j is more similar. When the set threshold ε is exceeded, the two communities will be merged into one community;

步骤3.2：如果社区间的相似度大于阈值ε时，则将该两个社区合并，获得优化后的种子社区集合Seeds’。Step 3.2: If the similarity between communities is greater than the threshold ε, merge the two communities to obtain the optimized seed community set Seeds'.

进一步地，在步骤4中，根据种子与社区相似度和自适应fitness函数扩展社区的方法如下：Further, in step 4, the method of expanding the community according to the similarity between the seed and the community and the adaptive fitness function is as follows:

步骤4.1：先获得步骤3优化后的种子社区的邻居集NBSet，遍历NBSet，计算其中每个邻居节点与种子社区的节点与社区相似度NC_SIM(v,C_i)；Step 4.1: first obtain the neighbor set NBSet of the seed community optimized in step 3, traverse the NBSet, and calculate the similarity NC_SIM(v,C _i ) between each neighbor node and the seed community;

步骤4.2：取出前l个相似度较大的节点，构成候选节点列表C-list；Step 4.2: Take out the first l nodes with higher similarity to form the candidate node list C-list;

步骤4.3：计算将C-list中的候选节点加入社区中的自适应函数fitness，把能够让fitness增加的节点加入社区，否则置其为自由节点；自适应函数fitness的计算公式如下：Step 4.3: Calculate the adaptive function fitness by adding the candidate nodes in the C-list to the community, and add the nodes that can increase the fitness to the community, otherwise set them as free nodes; the calculation formula of the adaptive function fitness is as follows:

其中，和分别为子图g内部度和外部度的总值，参数α是一个为正的实数，用来控制发现的社区规模；in, with are the total value of the internal degree and external degree of the subgraph g, respectively, and the parameter α is a positive real number, which is used to control the size of the community found;

步骤4.4：更新NBSet，然后重复以上步骤，直到NBSet为空；Step 4.4: Update NBSet, then repeat the above steps until NBSet is empty;

步骤4.5：得到初始网络重叠社区集合C。Step 4.5: Obtain the initial network overlapping community set C.

进一步地，在步骤5中，根据节点与社区相似度和社区间相似度处理网络中的自由节点并合并相似社区的方法如下：Further, in step 5, the method of processing free nodes in the network and merging similar communities according to the similarity between nodes and communities and between communities is as follows:

步骤5.1：计算网络中的自由节点到社区的节点与社区相似度CC_SIM(C_i,C_j)，把自由节点加入到相似度最高的社区中，并更新的重叠社区集合C；Step 5.1: Calculate the node-community similarity CC_SIM(C _i , C _j ) from free nodes to communities in the network, add free nodes to the community with the highest similarity, and update the overlapping community set C;

步骤5.2：计算重叠社区集合C中社区与社区的之间的相似度CC_SIM(C_i,C_j)，合并相似度大于阈值ε的社区，得到新的重叠社区集合C’。Step 5.2: Calculate the similarity CC_SIM(C _i , C _j ) between the communities in the overlapping community set C, merge the communities whose similarity is greater than the threshold ε, and obtain a new overlapping community set C'.

进一步地，在步骤6中，输出网络重叠社区结构的方法为：根据优化后得到新的重叠社区集合C’，集合C’中的每个元素代表一个社区，然后输出生成的网络重叠社区结构。Further, in step 6, the method of outputting the overlapping community structure of the network is: according to a new overlapping community set C' obtained after optimization, each element in the set C' represents a community, and then output the generated overlapping community structure of the network.

本发明的有益效果是将随机游走策略、相似度计算以及优化自适应函数fitness相结合，应用于较大规模网络的社区发现，从而能够有效的得到网络中重叠社区结构划分，并为网络聚类在重叠社区发现方向的发展提供有益补充。The beneficial effect of the present invention is to combine the random walk strategy, similarity calculation and optimized adaptive function fitness, and apply it to the community discovery of a large-scale network, so that the overlapping community structure division in the network can be effectively obtained, and the network aggregation Class development in the direction of overlapping community discovery provides a useful complement.

附图说明Description of drawings

图1是本发明实施例的实现流程图。Fig. 1 is an implementation flow chart of the embodiment of the present invention.

具体实施方式detailed description

下面结合附图及具体实施例对本发明作进一步的详细说明。The present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

本发明基于随机游走与种子扩展的重叠社区发现方法，运用了随机游走选取种子社区、相似度计算与优化自适应函数社区扩展，实现大规模社交网络上的社区发现，如图1所示，包括以下步骤：The present invention is based on the overlapping community discovery method of random walk and seed expansion, uses random walk to select seed communities, similarity calculation and optimized adaptive function community expansion, and realizes community discovery on large-scale social networks, as shown in Figure 1 , including the following steps:

步骤1：读取原始数据，获取网络结构及节点近邻信息。Step 1: Read the original data to obtain the network structure and node neighbor information.

步骤2：根据节点的转移概率矩阵、评分矩阵以及节点与社区的相似度，获得种子社区集合Seeds。具体方法如下：Step 2: Obtain the seed community set Seeds according to the transition probability matrix of the node, the scoring matrix and the similarity between the node and the community. The specific method is as follows:

P_uv＝A_uv/k_v (1)P _uv ＝A _uv /k _v (1)

步骤3：根据社区间相似度优化种子社区集合Seeds。具体方法如下：Step 3: Optimize the seed community set Seeds according to the similarity between communities. The specific method is as follows:

步骤3.2：如果社区间的相似度大于阈值ε时，则将该两个社区合并，获得优化后的种子社区集合Seeds’。在本实施例中，阈值ε取0.5。Step 3.2: If the similarity between communities is greater than the threshold ε, merge the two communities to obtain the optimized seed community set Seeds'. In this embodiment, the threshold ε is 0.5.

步骤4：根据种子与社区的相似度和自适应fitness函数扩展社区。具体方法为：Step 4: Expand the community according to the similarity between the seed and the community and the adaptive fitness function. The specific method is:

步骤4.1：先获得步骤3优化后的种子社区的邻居集NBSet，遍历NBSet，计算其中每个邻居节点与种子社区的节点与社区相似度NC_SIM(v,C_i)，计算公式见式(4)；Step 4.1: First obtain the neighbor set NBSet of the seed community optimized in step 3, traverse the NBSet, and calculate the similarity NC_SIM(v,C _i ) between each neighbor node and the seed community, and the calculation formula is shown in formula (4) ;

步骤5：根据节点与社区相似度和社区间相似度处理网络中的自由节点并合并相似社区。具体方法为：Step 5: Process free nodes in the network and merge similar communities according to the node-community similarity and inter-community similarity. The specific method is:

步骤6：输出网络重叠社区结构。具体方法为：根据优化后得到新的重叠社区集合C’，集合C’中的每个元素代表一个社区，然后输出生成的网络重叠社区结构。Step 6: Output network overlapping community structure. The specific method is: according to the optimization, a new overlapping community set C' is obtained, each element in the set C' represents a community, and then the generated network overlapping community structure is output.

以上是本发明的较佳实施例，凡依本发明技术方案所作的改变，所产生的功能作用未超出本发明技术方案的范围时，均属于本发明的保护范围。The above are the preferred embodiments of the present invention, and all changes made according to the technical solution of the present invention, when the functional effect produced does not exceed the scope of the technical solution of the present invention, all belong to the protection scope of the present invention.

Claims

1. A method for discovering overlapping communities based on random walk and seed expansion, characterized in that, comprising the following steps:

Step 1: Read the original data to obtain the network structure and node neighbor information;

Step 2: Obtain the seed community set Seeds according to the transition probability matrix of the node, the scoring matrix and the similarity between the node and the community;

Step 3: Optimize the seed community set Seeds according to the similarity between communities;

Step 4: Extend the community according to the similarity between the seed and the community and the adaptive fitness function;

Step 5: Process free nodes in the network and merge similar communities according to the similarity between nodes and communities and between communities;

Step 6: Output network overlapping community structure.

2. The overlapping community discovery method based on random walk and seed expansion according to claim 1, characterized in that in step 2, according to the transition probability matrix, scoring matrix and similarity between nodes and communities of nodes, the seeds are obtained The method of collecting Seeds in the community is as follows:

Step 2.1: According to the network adjacency matrix A _uv and node degree k _v , get the transition probability matrix P _uv :

P _uv ＝A _uv /k _v (1)

The elements in P _uv represent the probability that node u walks randomly and reaches node v in one step;

Further obtain the transition probability matrix P ^t _uv after random walk t steps:

Step 2.2: According to the transition probability matrix P ^t _uv after walking for t steps, the scoring matrix B is obtained:

<mrow><mi>B</mi><mo>=</mo><munderover><mo>&Sigma;</mo><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><msubsup><mi>P</mi><mrow><mi>u</mi><mi>v</mi></mrow><mi>t</mi></msubsup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>3</mn><mo>)</mo></mrow></mrow>

Among them, T represents the step threshold of random walk;

Each element in the scoring matrix B represents the score obtained by node u walking t steps to reach node v, represented by B(u,v), the closer the two nodes are connected, the higher the score between them;

Step 2.3: Sort B(u,v) in descending order, and take out the first l nodes with higher score values to form the initial seed list B-list;

Step 2.4: Calculate the node-community similarity NC_SIM(v,C _i ) between each node in the B-list as the initial seed community and its neighbor nodes. When the similarity is greater than the set threshold, the neighbor node is added to the seed community. Finally, the seed community collection Seeds is obtained;

Among them, v represents a node, is the degree of the edge associated with node v and community C _i , and k _v is the degree of node v; the larger NC_SIM(v,C _i ) is, the more likely node v belongs to community C _i .

3. The overlapping community discovery method based on random walk and seed expansion according to claim 2, characterized in that, in step 3, the method of optimizing the seed community set Seeds according to the similarity between communities is as follows:

Step 3.1: traverse each seed community in the seed community set Seeds, and calculate the similarity CC_SIM(C _i , C _j ) between the communities;

Among them, |overlap(C _i , C _j )| is the number of nodes shared by community C _i and community C _j , |C _i | is the number of nodes in community C _i , |C _j | is the number of nodes in community C _j ; CC_SIM The larger (C _i , C _j ) indicates that the structure of the community C _i and the community C _j is more similar. When the set threshold ε is exceeded, the two communities will be merged into one community;

Step 3.2: If the similarity between communities is greater than the threshold ε, merge the two communities to obtain the optimized seed community set Seeds'.

4. The overlapping community discovery method based on random walk and seed expansion according to claim 3, characterized in that, in step 4, the method of expanding community according to seed and community similarity and adaptive fitness function is as follows:

Step 4.1: first obtain the neighbor set NBSet of the seed community optimized in step 3, traverse the NBSet, and calculate the similarity NC_SIM(v,C _i ) between each neighbor node and the seed community;

Step 4.2: Take out the first l nodes with higher similarity to form the candidate node list C-list;

Step 4.3: Calculate the adaptive function fitness by adding the candidate nodes in the C-list to the community, and add the nodes that can increase the fitness to the community, otherwise set them as free nodes; the calculation formula of the adaptive function fitness is as follows:

<mrow><msub><mi>f</mi><mi>g</mi></msub><mo>=</mo><mfrac><msubsup><mi>k</mi><mrow><mi>i</mi><mi>n</mi></mrow><mi>g</mi></msubsup><msup><mrow><mo>(</mo><msubsup><mi>k</mi><mrow><mi>i</mi><mi>n</mi></mrow><mi>g</mi></msubsup><mo>+</mo><msubsup><mi>k</mi><mrow><mi>o</mi><mi>u</mi><mi>t</mi></mrow><mi>g</mi></msubsup><mo>)</mo></mrow><mi>&alpha;</mi></msup></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>6</mn><mo>)</mo></mrow></mrow>

in, with are the total value of the internal degree and external degree of the subgraph g, respectively, and the parameter α is a positive real number, which is used to control the size of the community found;

Step 4.4: Update NBSet, then repeat the above steps until NBSet is empty;

Step 4.5: Obtain the initial network overlapping community set C.

5. The overlapping community discovery method based on random walk and seed expansion according to claim 4, wherein in step 5, the free nodes in the network are processed and merged according to the similarity between nodes and communities and the similarity between communities The methods for similar communities are as follows:

Step 5.1: Calculate the node-community similarity CC_SIM(C _i , C _j ) from free nodes to communities in the network, add free nodes to the community with the highest similarity, and update the overlapping community set C;

Step 5.2: Calculate the similarity CC_SIM(C _i , C _j ) between the communities in the overlapping community set C, merge the communities whose similarity is greater than the threshold ε, and obtain a new overlapping community set C'.

6. The method for discovering overlapping communities based on random walk and seed expansion according to claim 1, characterized in that in step 6, the method of outputting the network overlapping community structure is: according to the new overlapping community set C obtained after optimization ', each element in the set C' represents a community, and then output the generated network overlapping community structure.