CN108833211A

CN108833211A - Unbiased Delayed Sampling Method for Social Networks

Info

Publication number: CN108833211A
Application number: CN201810689711.2A
Authority: CN
Inventors: 刘良桂; 陈炳宪; 贾会玲; 张宇
Original assignee: Zhejiang Sci Tech University ZSTU
Current assignee: Zhejiang Sci Tech University ZSTU
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2018-11-16

Abstract

The invention discloses a data sampling method for social networks, called unbiased-delay sampling. The method follows the Markov convergence criterion and adapts to networks with different degrees of connectivity. On the one hand, it produces sampled networks with better unbiasedness; on the other hand, it reduces the probability that repeated data enters the sample, which improves the method's ability to explore the network.

Description

Unbiased Delayed Sampling Method for Social Networks

Technical Field

The present invention relates to the technical field of social network data sampling, and in particular to an unbiased-delay sampling method for social networks (Unbiased-Delay sampling, UD sampling).

Background

In recent years, online social networks have become a major Internet service. Their rapid growth has attracted many researchers: sociologists want to study the behavior of online users, engineers use social networks to design better network systems, and scientists study the structure and dynamics of these complex networks.

Social networks are usually modeled as social graphs for analysis. The immediate problem researchers face is that the volume of social network data is too large. First, obtaining the complete dataset is impractical: crawling a social graph of this size takes an enormous amount of time and is sometimes impossible, and processing it requires long computation even on a high-performance cluster. Second, because of commercial confidentiality and users' privacy settings, the complete data of a social network is not available. Finally, the number of users grows rapidly and the relationships between users change over time, so a typical large network cannot be crawled in full. How to draw a suitably sized sample from a large network while preserving the properties of the original network has therefore become a basic problem of social network research.

Network sampling techniques in common use today generally rely on breadth-first search (BFS). BFS can collect a large amount of user data quickly, but in production it consumes considerable resources to maintain a de-duplication queue, which greatly reduces extraction efficiency. Moreover, BFS is a typical graph traversal algorithm, and the data it extracts is biased toward high-degree nodes, so it cannot yield reliable user data.
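For illustration, a minimal BFS sampler might look like the sketch below. It is not taken from the patent; the function name `bfs_sample`, the `budget` parameter, and the toy graph are assumptions. The `seen` set plays the role of the de-duplication structure mentioned above: every discovered neighbor has to be checked against it.

```python
from collections import deque

def bfs_sample(adj, start, budget):
    """Collect up to `budget` distinct nodes by breadth-first search."""
    seen = {start}             # de-duplication structure; grows with the crawl
    queue = deque([start])
    sample = []
    while queue and len(sample) < budget:
        v = queue.popleft()
        sample.append(v)
        for w in adj[v]:
            if w not in seen:  # each discovered neighbor is checked against `seen`
                seen.add(w)
                queue.append(w)
    return sample

# Hypothetical toy graph: a hub (node 0) with four neighbors and a short chain.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4, 6], 6: [5]}
first_three = bfs_sample(adj, start=5, budget=3)
```

Because BFS expands every neighbor of every visited node, well-connected nodes tend to be reached early, which is one way to see the bias toward high-degree nodes.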

Summary of the Invention

Existing social network data extraction schemes cannot obtain unbiased data and require a de-duplication queue. To overcome these shortcomings, the present invention provides a novel network sampling method (an unbiased sampling method) that obtains more reliable, unbiased data.

The present invention adopts the following technical solution: an unbiased-delay sampling method for social networks, comprising the following steps:

(1) Convert the real network into a graph G=(E,V), where E is the set of edges in the graph, each edge representing a relationship between users in the real network, and V is the set of nodes in the graph, each node representing a user in the real network.

(2) Initialize the sample set S and the cache space Cache, setting both to empty; randomly select a node v from V; then sample according to the following steps.

(3) Probe 10 neighbors of node v; for a node with fewer than 10 neighbors, probe all of its neighbors. Store the probed neighbors in the cache space Cache.

(4) Randomly select a neighbor w from all neighbors of node v. Determine whether K_v/K_w is greater than or equal to P. If so, make w the current node v, put w into the sample set S, and return to step (3); otherwise continue to the next step. Here P is a random number that follows the uniform distribution on 0 to 1, and K_v denotes the number of neighbors of node v, i.e. the degree of node v.

(5) Determine whether P is less than or equal to the repeated-sampling probability α. If so, keep the current node v unchanged and return to step (3); otherwise continue to the next step.

(6) In the cache space Cache, find all probed nodes that have the same number of neighbors as the current node v, and select from them the node that has been probed the fewest times; if several nodes share the smallest probe count, select one of them at random. Make the selected node the current node v, put it into the sample set, and return to step (3).

Further, in step (5), α=0.2.
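Steps (2) to (6) can be sketched in Python as follows. This is an illustrative reading of the steps, not the inventors' implementation. The names `ud_sample`, `probe_limit`, and `n_samples` are hypothetical, and the sketch makes several assumptions the text leaves open: the graph is undirected with no isolated nodes, the 10 probed neighbors are chosen at random, staying at v in step (5) records v in S again, and when no cached node matches the degree of v the walk simply retries from v.

```python
import random

def ud_sample(adj, n_samples, alpha=0.2, probe_limit=10, rng=None):
    """Sketch of steps (2)-(6); `adj` maps each node to a list of neighbors."""
    rng = rng or random.Random()
    sample = []                      # sample set S
    cache = {}                       # Cache: probed node -> probe count
    v = rng.choice(list(adj))        # step (2): random start node
    while len(sample) < n_samples:
        # Step (3): probe up to `probe_limit` neighbors (random subset: assumption).
        nbrs = adj[v]
        probed = nbrs if len(nbrs) <= probe_limit else rng.sample(nbrs, probe_limit)
        for d in probed:
            cache[d] = cache.get(d, 0) + 1
        # Step (4): propose a random neighbor w; accept if K_v / K_w >= P.
        w = rng.choice(nbrs)
        p = rng.random()
        if len(adj[v]) / len(adj[w]) >= p:
            v = w
            sample.append(v)
            continue
        # Step (5): if P <= alpha, stay at v.
        if p <= alpha:
            sample.append(v)         # assumption: staying samples v again
            continue
        # Step (6): jump to the least-probed cached node with the same degree as v.
        same_degree = [d for d in cache if len(adj[d]) == len(adj[v])]
        if not same_degree:
            continue                 # assumption: no candidate, retry from v
        fewest = min(cache[d] for d in same_degree)
        v = rng.choice([d for d in same_degree if cache[d] == fewest])
        sample.append(v)
    return sample

# Hypothetical toy graph (undirected, every node has at least one neighbor).
toy = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3]}
walk = ud_sample(toy, 50, rng=random.Random(1))
```

With `alpha=1.0` the branch for step (6) is never reached, so under these assumptions the walk reduces to a Metropolis-Hastings style random walk.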

The beneficial effects of the present invention are as follows. First, on the set of distinct sampled nodes, the degree distribution of the sampled network is closer to that of the original network. Second, the method avoids the over-sampling of low-degree nodes in highly connected subnetworks that affects traditional methods, and it improves the ability to explore the network. Third, at low sampling rates, the transitivity and assortativity of the sample are closer to those of the original network.

Brief Description of the Drawings

Figure 1 shows the CDF and NMSE of the degree distribution of distinct sampled nodes for Twitter and Epinions;

Figure 2 shows the transitivity of the networks sampled by different methods;

Figure 3 shows the assortativity of the networks sampled by different methods;

Figure 4 shows the update rate of the sampled nodes;

Figure 5 shows the influence of the parameter α on the sampling repetition rate.

Detailed Description

Step 1: Define the concepts.

Research on social network sampling usually converts the real network into a graph model, in which edges represent relationships between users in the real network and nodes represent the users. The graph is written G=(E,V), where E is the set of edges, V is the set of nodes, and v denotes a node. The sample set is denoted S. Nodes that have been probed during sampling are pushed into a cache space, denoted Cache. A node in the cache space is denoted d, and d^j denotes a probed node, where j is the number of times that node has been probed; the probed nodes that have the same number of neighbors as node v are the candidates used in Step 9. K_v denotes the number of neighbors of node v, i.e. the degree of node v. α denotes the probability of repeated sampling and defaults to 0.2.
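As a minimal sketch of this notation, the graph, K_v, S, and Cache could be represented as below. The helper names `build_graph` and `degree` and the example user names are hypothetical, and treating relationships as undirected is an assumption made for illustration.

```python
from collections import defaultdict

def build_graph(edges):
    """Return an adjacency map for an undirected graph given (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def degree(adj, v):
    """K_v: the number of neighbors of node v."""
    return len(adj[v])

# Hypothetical user relationships.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
adj = build_graph(edges)
S = []       # sample set S, initially empty
Cache = {}   # cache space: probed node d -> number of times probed (j)
```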

Step 2: Initialize the sample set S and the cache space Cache, setting both to empty.

Step 3: Select the initial node v at random from the whole network.

Step 4: Probe 10 neighbors of node v; if v has fewer than 10 neighbors, probe all of them. Store each probed node d^j in the cache space Cache.
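The probing step might be sketched as follows. The function name `probe` and its parameters are hypothetical, and choosing the 10 neighbors uniformly at random is an assumption, since the text does not say which 10 are probed. The cache is modeled as a `Counter` so that each node's value is its probe count j.

```python
import random
from collections import Counter

def probe(adj, v, cache, limit=10, rng=random):
    """Probe up to `limit` neighbors of v and record each probe in `cache`."""
    neighbors = list(adj[v])
    if len(neighbors) <= limit:
        probed = neighbors                      # fewer than `limit`: probe all
    else:
        probed = rng.sample(neighbors, limit)   # assumption: random subset
    cache.update(probed)                        # increments each probe count by 1
    return probed

# Hypothetical graph: one hub with 15 neighbors and one leaf.
adj = {"hub": [f"n{i}" for i in range(15)], "leaf": ["hub"]}
cache = Counter()
got = probe(adj, "hub", cache, rng=random.Random(0))
```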

Step 5: Randomly select a neighbor w from all neighbors of node v.

Step 6: Generate a random number P between 0 and 1, where P follows the uniform distribution on 0 to 1.

Step 7: Determine whether P is less than or equal to K_v/K_w. If so, make the neighbor w the current node v, put w into the sample set S, and go to Step 5; otherwise continue to the next step.
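The test in Steps 6 and 7 can be illustrated with a small sketch; the function names are hypothetical. Because P is uniform on 0 to 1, the move is accepted with probability min(1, K_v/K_w): a move to a neighbor of equal or lower degree is always accepted, while a move to a higher-degree neighbor is accepted only some of the time.

```python
def accept_move(k_v, k_w, p):
    """Step 7: accept the move from v to w when P <= K_v / K_w."""
    return p <= k_v / k_w

def acceptance_probability(k_v, k_w):
    """Probability that a uniform P on [0, 1] satisfies P <= K_v / K_w."""
    return min(1.0, k_v / k_w)

# Moving from a degree-2 node to a degree-8 neighbor succeeds about 1 time in 4.
prob_to_hub = acceptance_probability(2, 8)
```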

Step 8: Determine whether P is less than or equal to α (0.2 by default). If so, keep the current node v unchanged and go to Step 5; otherwise continue to the next step.

Step 9: In the cache space Cache, find the set of probed nodes that have the same number of neighbors as the current node v. From this set, select the node with the smallest probe count J, where J=min(j); if several nodes share the smallest probe count, select one of them at random. Make the selected node the current node v, put it into the sample set, and go to Step 5.
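A possible sketch of the selection in Step 9 follows. The names `pick_least_probed` and `degree_of` are hypothetical, and returning `None` when no cached node has a matching degree is an assumption, since the text does not cover that case.

```python
import random

def pick_least_probed(cache, degree_of, k_v, rng=random):
    """Pick the cached node with degree k_v that has the fewest probes."""
    candidates = [d for d in cache if degree_of[d] == k_v]
    if not candidates:
        return None                              # assumption: case not specified
    fewest = min(cache[d] for d in candidates)   # J = min(j)
    ties = [d for d in candidates if cache[d] == fewest]
    return rng.choice(ties)                      # random choice among ties

# Hypothetical cache: node -> probe count, and node degrees.
cache = {"a": 3, "b": 1, "c": 1, "d": 2}
degree_of = {"a": 2, "b": 2, "c": 5, "d": 2}
chosen = pick_least_probed(cache, degree_of, k_v=2)
```

In this example "c" has the same probe count as "b" but a different degree, so only "b" qualifies.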

The stopping rule can be manual, stopping once enough data has been collected, or automatic, stopping the program after data has been extracted for a set period of time.
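Both stopping rules could be expressed as a single predicate, as in the sketch below. The name `make_stop_rule` and its parameters `max_samples` and `max_seconds` are hypothetical.

```python
import time

def make_stop_rule(max_samples=None, max_seconds=None):
    """Return a predicate that is True once either limit has been reached."""
    start = time.monotonic()

    def should_stop(sample):
        if max_samples is not None and len(sample) >= max_samples:
            return True      # enough data has been collected
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            return True      # the time budget has run out
        return False

    return should_stop

stop_after_three = make_stop_rule(max_samples=3)
```

A sampling loop would call the predicate with the current sample set after each iteration and exit when it returns True.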

Two networks with different degrees of connectivity, Twitter and Epinions, are used to evaluate the sampling performance of the unbiased-delay sampling method. The classic sampling methods used for comparison are BFS, MHRW, and RW.

As the left half of Figure 1 shows, the degree distributions obtained by MHRW and BFS are biased toward high-degree nodes, since their NMSE curves lie above the curve obtained by the UD method. At the same time, the degree distribution CDF obtained by the UD method is closer to that of the original network than the CDFs of the other methods, which likewise shows that the network collected by UD reproduces the original degree distribution better than MHRW and BFS do. In summary, the subnetworks collected by the UD method have better degree distribution properties, even when the sampled network contains no repeated data.
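The two quantities plotted in Figure 1 could be computed along the following lines. The description does not define NMSE; the sketch assumes the normalized root mean square error often used in sampling evaluations, i.e. the square root of the mean squared error of repeated estimates divided by the true value. The function names are hypothetical.

```python
import math
from collections import Counter

def degree_cdf(degrees):
    """Empirical CDF: maps each degree k to the fraction of nodes with degree <= k."""
    n = len(degrees)
    counts = Counter(degrees)
    cdf, running = {}, 0
    for k in sorted(counts):
        running += counts[k]
        cdf[k] = running / n
    return cdf

def nmse(estimates, true_value):
    """Assumed definition: sqrt(mean((estimate - truth)^2)) / truth."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_value

cdf = degree_cdf([1, 1, 2, 4])
```

Here `estimates` would hold the value of one statistic (for example, the fraction of nodes with a given degree) measured over several independent sampling runs.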

In Figure 2, the black horizontal baseline marks the transitivity of the original network. As the sampling rate increases, the transitivity of the networks sampled by the different methods tends toward the baseline value. At small sampling ratios, however, the improved UD method comes closer to the transitivity of the original network than MHRW and RW do.
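The description does not spell out how transitivity is computed. A minimal sketch, assuming the usual global clustering coefficient (closed connected triples divided by all connected triples), is given below; the function name and toy graph are hypothetical.

```python
def transitivity(adj):
    """Closed connected triples / all connected triples for an undirected graph."""
    closed = 0    # triples whose two end nodes are also linked (3 per triangle)
    triples = 0   # all paths of length two, counted at their center node
    for v, nbrs in adj.items():
        nbrs = list(nbrs)
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for i in range(k):
            for j in range(i + 1, k):
                if nbrs[j] in adj[nbrs[i]]:
                    closed += 1
    return closed / triples if triples else 0.0

# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
t = transitivity(adj)
```

The example has one triangle (3 closed triples) among 5 connected triples.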

Figure 3 evaluates the assortativity of the networks extracted by the different sampling methods. As the sampling rate increases, all candidate methods converge fairly quickly to the assortativity of the original network. At lower sampling rates, however, the improved UD method is closer to the baseline, which indicates that it preserves assortativity better than MHRW and RW.
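As a sketch of the metric, assuming assortativity here means the degree assortativity coefficient (the Pearson correlation between the degrees at the two ends of an edge), it could be computed as follows. The function name is hypothetical.

```python
def degree_assortativity(adj):
    """Pearson correlation of end-point degrees over both directions of each edge."""
    xs, ys = [], []
    for v, nbrs in adj.items():
        for w in nbrs:
            xs.append(len(adj[v]))
            ys.append(len(adj[w]))
    n = len(xs)
    mean = sum(xs) / n   # xs and ys contain the same values by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var else float("nan")

# A star graph is maximally disassortative: the hub links only to leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
r_star = degree_assortativity(star)
```

Negative values mean high-degree nodes tend to connect to low-degree nodes, and positive values mean nodes tend to connect to others of similar degree.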

Figure 4 shows the sampling update rate for Twitter and Epinions. The horizontal axis is the number of distinct nodes extracted, and the vertical axis is the ratio of the number of distinct nodes extracted to the actual number of sampled nodes, called the update rate. A higher update rate means fewer repeated nodes, and therefore a better ability to explore the network during sampling. Figure 4 shows that the update rate on the sparse network (Epinions) is lower than on the highly connected network (Twitter), because a walk on a network with low connectivity is more likely to reach nodes it has already visited. For networks of either connectivity, the update rate of the UD method is better than that of MHRW. This shows that the UD sampling method avoids MHRW's over-sampling of low-degree nodes and explores the network better.
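The curve described above can be traced from a sample sequence as in this sketch; the function name `update_rate_curve` is hypothetical.

```python
def update_rate_curve(sample):
    """Return (distinct nodes so far, update rate) after each sampled node."""
    seen = set()
    curve = []
    for total, node in enumerate(sample, start=1):
        seen.add(node)
        curve.append((len(seen), len(seen) / total))
    return curve

# Hypothetical walk in which node "a" is sampled twice.
curve = update_rate_curve(["a", "b", "a", "c"])
```

The repeated visit to "a" lowers the update rate to 2/3 at the third sample; the new node "c" raises it to 3/4.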

In UD sampling, the parameter α controls the self-loop probability of the current node. Figure 5 shows the influence of different values of α on the repetition rate of sampled nodes. The horizontal axis is the value of α, which runs from 0.05 to 1 in steps of 0.05 (α = 0.05, 0.1, 0.15, ..., 1). The vertical axis is the node repetition rate of the sample set (the total number of nodes in the sample divided by the number of distinct nodes in the sample). The UD, MHRW, and RW methods were each used to collect 5% of the Twitter dataset. Figure 5 shows that when α is close to 0, the sample repetition rate of UD sampling is close to that of RW. When α is 1, the sample repetition rate of UD sampling is 6 times that of RW and equal to that of MHRW. In social networks, the parameter α can therefore be used to control the sample repetition rate of the UD sampling method. In particular, when α is between 0.2 and 0.4, the repetition rate incurred by MHRW is reduced considerably.
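The repetition rate defined above and the grid of α values on the horizontal axis of Figure 5 can be written down directly; the names `repetition_rate` and `alphas` are hypothetical.

```python
def repetition_rate(sample):
    """Total number of sampled nodes divided by the number of distinct nodes."""
    return len(sample) / len(set(sample))

# Alpha values from 0.05 to 1 in steps of 0.05, rounded to avoid float drift.
alphas = [round(0.05 * i, 2) for i in range(1, 21)]

rate = repetition_rate(["a", "b", "a", "c"])
```

A rate of 1 means no node was sampled twice; this measure is the reciprocal of the update rate at the end of the walk.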

Claims

1. An unbiased-delay sampling method for social networks, characterized in that it comprises the following steps:

(1) Convert the real network into a graph G=(E,V), where E is the set of edges in the graph, each edge representing a relationship between users in the real network, and V is the set of nodes in the graph, each node representing a user in the real network.

(2) Initialize the sample set S and the cache space Cache, setting both to empty; randomly select a node v from V; then sample according to the following steps.

(3) Probe 10 neighbors of node v; for a node with fewer than 10 neighbors, probe all of its neighbors. Store the probed neighbors in the cache space Cache.

(4) Randomly select a neighbor w from all neighbors of node v. Determine whether K_v/K_w is greater than or equal to P. If so, make w the current node v, put w into the sample set S, and return to step (3); otherwise continue to the next step. Here P is a random number that follows the uniform distribution on 0 to 1, and K_v denotes the number of neighbors of node v, i.e. the degree of node v.

(5) Determine whether P is less than or equal to the repeated-sampling probability α. If so, keep the current node v unchanged and return to step (3); otherwise continue to the next step.

(6) In the cache space Cache, find all probed nodes that have the same number of neighbors as the current node v, and select from them the node that has been probed the fewest times; if several nodes share the smallest probe count, select one of them at random. Make the selected node the current node v, put it into the sample set, and return to step (3).

2. The method according to claim 1, characterized in that, in step (5), α=0.2.