CN104281664A

CN104281664A - Data segmenting method and system of distributed graph calculating system

Info

Publication number: CN104281664A
Application number: CN201410496089.5A
Authority: CN
Inventors: 李博; 宋骐; 李建欣; 于伟仁
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2014-09-24
Filing date: 2014-09-24
Publication date: 2015-01-14
Anticipated expiration: 2034-09-24
Also published as: CN104281664B

Abstract

The present invention provides a method and system for data segmentation in a distributed graph computing system. The method includes: determining the similarity measurement value between each data node in the data to be processed and its own first adjacent nodes; obtaining each first adjacent node The number of occurrences of the label of an adjacent node, and determine whether there are at least two labels with the same number of occurrences; The similarity measurement value between the second adjacent nodes determines the label of the data node; divides the data nodes with the same label into the same community, and stores the data nodes belonging to the same community in the same processing host. It fully considers the similarity characteristics between data nodes and realizes the community division of data nodes based on tags, which saves computing overhead, and closely related data nodes are assigned to the same processing host, reducing the communication overhead between different processing hosts .

Description

Distributed graph computing system data segmentation method and system

技术领域technical field

本发明属于数据处理领域，尤其是涉及一种分布式图计算系统数据切分方法和系统。The invention belongs to the field of data processing, and in particular relates to a data segmentation method and system of a distributed graph computing system.

背景技术Background technique

近年来，以互联网、高能物理、计算生物为代表的众多领域产生的海量数据对数据处理系统提出了更高的要求。在这些海量数据中，由节点和边组成的图结构数据作为一种重要的数据结构，可以有效表示很多不同领域的数据关系，例如，社交网络数据可以表现为一种图结构数据，其中节点代表用户，边代表用户之间的关系，例如两个用户相互关注则表明两个对应节点之间有一条边。同样，英特网的网页数据也可以表现为一种图结构数据，节点代表网页，边代表网页之间的关系，例如网页A上有一个指向网页B的超链接，则对应的节点A和节点B之间有一条边。In recent years, massive data generated in many fields represented by the Internet, high-energy physics, and computational biology have put forward higher requirements for data processing systems. Among these massive data, graph-structured data composed of nodes and edges is an important data structure that can effectively represent data relationships in many different fields. For example, social network data can be represented as a graph-structured data, where nodes represent Users, and edges represent the relationship between users. For example, if two users follow each other, it means that there is an edge between the two corresponding nodes. Similarly, the web page data of the Internet can also be represented as a graph structure data. Nodes represent web pages, and edges represent the relationship between web pages. For example, there is a hyperlink pointing to web page B on web page A, and the corresponding node A and node There is an edge between B.

随着数据比如图结构数据的数据量越来越大，受限于有限的运算和存储能力，单机已经不能满足对大规模图结构数据的高效处理，因而作为大数据处理的有效工具，分布式数据处理设备提供了一个处理海量图结构数据的平台。为了实现数据的分布式存储，分布式数据处理设备一般会采用数据切分。数据切分，简单来说，就是指通过某种特定的条件，将海量数据中的数据分散存放到多个处理主机上，以达到分散单台处理主机负载的效果。With the increasing amount of data such as graph-structured data, limited by limited computing and storage capabilities, a single machine can no longer meet the efficient processing of large-scale graph-structured data. Therefore, as an effective tool for big data processing, distributed Data processing equipment provides a platform for processing massive graph-structured data. In order to realize distributed storage of data, distributed data processing devices generally adopt data segmentation. Data segmentation, in simple terms, refers to the distribution of data in massive data to multiple processing hosts through certain specific conditions, so as to achieve the effect of dispersing the load of a single processing host.

但是，当前图数据处理设备存储数据的模式为：对图数据中的每个数据节点创建一个主节点(它自身)和至少一个副节点(备份节点)，随机将主节点固定存储在某一台处理主机上，同时在该处理主机上保存主节点的相邻数据节点，其中，如果相邻数据节点的主节点在其处理主机上，则存储相邻数据节点的副节点。However, the current data storage mode of the graph data processing device is: for each data node in the graph data, create a primary node (itself) and at least one secondary node (backup node), and randomly store the primary node in a certain On the processing host, the adjacent data nodes of the primary node are stored on the processing host at the same time, wherein, if the primary node of the adjacent data node is on its processing host, the secondary nodes of the adjacent data node are stored.

上述这种以数据节点为单位进行数据切分的图结构数据处理方式中，关系紧密的数据节点往往会被存储在不同的处理主机上，而关系紧密的数据节点实际上非常有可能要被同时访问，如果这些数据节点不在同一个处理主机上，则需要耗费大量的通信资源来从不同的处理主机上获得关系紧密的不同数据节点。而且，上述的随机分配以及数据节点冗余存储的方式，对于一些度数(即邻居节点的个数)很大的数据节点，需要消耗很多的运算资源，从而，将导致较大的运算开销和通信开销，使得数据处理效率较低。In the above-mentioned graph-structured data processing method that divides data in units of data nodes, closely related data nodes are often stored on different processing hosts, and closely related data nodes are actually very likely to be processed simultaneously. Access, if these data nodes are not on the same processing host, it will take a lot of communication resources to obtain different closely related data nodes from different processing hosts. Moreover, the above-mentioned random allocation and redundant storage of data nodes will consume a lot of computing resources for some data nodes with a large degree (that is, the number of neighbor nodes), which will lead to large computing overhead and communication overhead, making data processing less efficient.

发明内容Contents of the invention

针对上述存在的问题，本发明提供一种分布式图计算系统数据切分方法和系统，用以克服现有技术中以数据节点为单位进行数据切分的图结构数据处理方式导致数据处理效率较低的缺陷。In view of the above existing problems, the present invention provides a data segmentation method and system of a distributed graph computing system, which is used to overcome the low data processing efficiency caused by the graph structure data processing method of data segmentation in units of data nodes in the prior art. low defect.

本发明提供了一种分布式图计算系统数据切分方法，包括：The invention provides a data segmentation method of a distributed graph computing system, including:

根据预设算法，确定待处理数据中的每个数据节点与自身的各第一相邻节点间的相似性度量值；According to a preset algorithm, determine the similarity measure value between each data node in the data to be processed and its own first adjacent nodes;

根据所述各第一相邻节点的标签，获得每个所述标签在所述各第一相邻节点中的出现次数，所述标签包括节点标识ID；Obtain the number of occurrences of each of the labels in each of the first adjacent nodes according to the labels of the first adjacent nodes, where the label includes a node identification ID;

根据所述出现次数，确定所述各第一相邻节点的标签中是否存在出现次数相同的至少两个标签；According to the number of occurrences, determine whether there are at least two labels with the same number of occurrences among the labels of the first adjacent nodes;

若存在，则确定与所述至少两个标签分别对应的各第二相邻节点，并根据所述数据节点与所述各第二相邻节点间的相似性度量值，确定所述数据节点的标签；If it exists, determine the second adjacent nodes corresponding to the at least two labels, and determine the data node's Label;

将所述待处理数据中具有同一标签的数据节点划分到同一社区，并将属于同一社区的数据节点存储在同一处理主机中。Divide the data nodes with the same label in the data to be processed into the same community, and store the data nodes belonging to the same community in the same processing host.

本发明提供了一种分布式图计算系统数据切分系统，包括：The present invention provides a distributed graph computing system data segmentation system, including:

第一确定模块，用于根据预设算法，确定待处理数据中的每个数据节点与自身的各第一相邻节点间的相似性度量值；The first determination module is used to determine the similarity measurement value between each data node in the data to be processed and its own first adjacent nodes according to a preset algorithm;

获取模块，用于根据所述各第一相邻节点的标签，获得每个所述标签在所述各第一相邻节点中的出现次数，所述标签包括节点标识ID；An acquisition module, configured to obtain the number of occurrences of each of the labels in each of the first adjacent nodes according to the labels of the first adjacent nodes, where the label includes a node identification ID;

第二确定模块，用于根据所述出现次数，确定所述各第一相邻节点的标签中是否存在出现次数相同的至少两个标签；A second determining module, configured to determine whether there are at least two labels with the same number of occurrences among the labels of the first adjacent nodes according to the number of occurrences;

第三确定模块，用于若存在，则确定与所述至少两个标签分别对应的各第二相邻节点，并根据所述数据节点与所述各第二相邻节点间的相似性度量值，确定所述数据节点的标签；The third determination module is configured to determine the second adjacent nodes respectively corresponding to the at least two labels, if there is one, and according to the similarity measure value between the data node and the second adjacent nodes , to determine the label of the data node;

处理模块，用于将所述待处理数据中具有同一标签的数据节点划分到同一社区，并将属于同一社区的数据节点存储在同一处理主机中。The processing module is used to divide the data nodes with the same label in the data to be processed into the same community, and store the data nodes belonging to the same community in the same processing host.

本发明提供的分布式图计算系统数据切分方法和系统，对于数据规模很大的待处理数据，针对待处理数据中的每个数据节点确定自身与各相邻节点间的相似性度量值，进而在确定其各相邻节点的标签中存在多个出现次数相同的标签时，根据与拥有该相同标签的各相邻节点间的相似性度量值，确定自身的标签，进而将待处理数据中具有同一标签的数据节点划分到同一社区，并将属于同一社区的数据节点存储在同一处理主机中。通过对每个数据节点进行与其相邻节点间的相似性度量，充分考虑了待处理数据的整体结构特性，进而对待处理数据中的每个数据节点基于该相似性特性，结合其相邻节点的标签进行聚类分析，以最终将待处理数据划分为多个社区，每个社区中的数据节点都具有同一标签，而且，每个社区中的数据节点也具有较高相似性，从而以社区为单位进行数据节点在处理主机上的切分存储。由于充分考虑了数据节点间的相似性特征以及基于标签实现了数据节点的社区划分，节省了运算开销，而且使得关系密切的数据节点能够被分配到同一处理主机中，大大减少了在不同处理主机间的通信开销，从而提高了数据处理效率。The data segmentation method and system of the distributed graph computing system provided by the present invention, for the data to be processed with a large data scale, for each data node in the data to be processed, determine the similarity measurement value between itself and each adjacent node, Then, when it is determined that there are multiple labels with the same number of occurrences in the labels of its adjacent nodes, according to the similarity measurement value between the adjacent nodes with the same label, determine its own label, and then convert the data to be processed Data nodes with the same label are divided into the same community, and data nodes belonging to the same community are stored in the same processing host. By measuring the similarity between each data node and its adjacent nodes, the overall structural characteristics of the data to be processed are fully considered, and then each data node in the data to be processed is based on the similarity characteristics, combined with its adjacent nodes. Tags are used for cluster analysis to finally divide the data to be processed into multiple communities, and the data nodes in each community have the same label, and the data nodes in each community also have high similarity, so the community-based The unit performs split storage of data nodes on the processing host. Due to the full consideration of the similarity characteristics between data nodes and the realization of the community division of data nodes based on labels, the calculation cost is saved, and the closely related data nodes can be assigned to the same processing host, which greatly reduces the number of different processing hosts. Inter-communication overhead, thus improving the efficiency of data processing.

附图说明Description of drawings

图1为本发明分布式图计算系统数据切分方法实施例一的流程图；FIG. 1 is a flow chart of Embodiment 1 of the data segmentation method of the distributed graph computing system of the present invention;

图2为本发明分布式图计算系统数据切分方法实施例二的流程图；FIG. 2 is a flow chart of Embodiment 2 of the data segmentation method of the distributed graph computing system of the present invention;

图3为本发明分布式图计算系统数据切分系统实施例一的结构示意图；3 is a schematic structural diagram of Embodiment 1 of the data segmentation system of the distributed graph computing system of the present invention;

图4为本发明分布式图计算系统数据切分系统实施例二的结构示意图。FIG. 4 is a schematic structural diagram of Embodiment 2 of the data segmentation system of the distributed graph computing system of the present invention.

具体实施方式Detailed ways

图1为本发明分布式图计算系统数据切分方法实施例一的流程图，如图1所示，该方法包括：Fig. 1 is a flowchart of Embodiment 1 of the data segmentation method of the distributed graph computing system of the present invention. As shown in Fig. 1, the method includes:

步骤101、根据预设算法，确定待处理数据中的每个数据节点与自身的各第一相邻节点间的相似性度量值；Step 101, according to a preset algorithm, determine the similarity measurement value between each data node in the data to be processed and its own first adjacent nodes;

本实施例提供的所述方法适用于采用分布式数据处理系统对大规模的图结构数据进行数据切分存储的场景，在该分布式数据处理系统中设置有多个处理主机。值得说明的是，该系统在接收到大量待处理数据后，可以采用现有的分布式处理方式将该待处理数据预先分布式存入多个处理主机中，而本实施例提供的所述方法可以针对预存入各处理主机中的待处理数据进行处理，当然，也可以在接收到待处理数据后直接进行处理。本实施例提供的所述方法可以由一处理系统来执行，该处理系统比如为分布式处理系统的管理平台。The method provided in this embodiment is applicable to a scenario where a distributed data processing system is used for data segmentation and storage of large-scale graph-structured data, and multiple processing hosts are set in the distributed data processing system. It is worth noting that after the system receives a large amount of data to be processed, the existing distributed processing method can be used to distribute and store the data to be processed in multiple processing hosts in advance, and the method provided in this embodiment Processing can be performed on the data to be processed pre-stored in each processing host, and of course, processing can also be performed directly after receiving the data to be processed. The method provided in this embodiment may be executed by a processing system, such as a management platform of a distributed processing system.

本实施例中，对于图结构的数据，数据可以表示为数据节点与边的抽象图，处理系统针对待处理数据中的每个数据节点，分别计算其与自身相邻节点的相似度度量值，其中，上述预设算法例如可以是基于数据节点与其相邻节点之间的连接边的权重，或是基于数据节点的度数(其相邻节点的总数)与每个相邻节点的度数，或是基于数据节点的网页排名(PageRank，以下简称PR)值等等。In this embodiment, for graph-structured data, the data can be represented as an abstract graph of data nodes and edges, and the processing system calculates the similarity measure value of each data node in the data to be processed with its own adjacent nodes, Among them, the above-mentioned preset algorithm can be based on the weight of the connection edge between the data node and its adjacent nodes, or based on the degree of the data node (the total number of its adjacent nodes) and the degree of each adjacent node, or Based on the webpage ranking (PageRank, hereinafter referred to as PR) value of the data node and the like.

步骤102、根据所述各第一相邻节点的标签，获得每个所述标签在所述各第一相邻节点中的出现次数，所述标签包括节点标识ID；Step 102. According to the labels of the first adjacent nodes, the number of occurrences of each of the labels in the first adjacent nodes is obtained, and the labels include node identification IDs;

步骤103、根据所述出现次数，确定所述各第一相邻节点的标签中是否存在出现次数相同的至少两个标签，若存在，则执行步骤104，否则，执行步骤105；Step 103, according to the number of occurrences, determine whether there are at least two labels with the same number of occurrences among the labels of the first adjacent nodes, if so, perform step 104, otherwise, perform step 105;

步骤104、确定与所述至少两个标签分别对应的各第二相邻节点，并根据所述数据节点与所述各第二相邻节点间的相似性度量值，确定所述数据节点的标签；Step 104: Determine the second adjacent nodes corresponding to the at least two labels, and determine the label of the data node according to the similarity measurement value between the data node and the second adjacent nodes ;

步骤105、确定所述数据节点的标签与所述各第一相邻节点的标签中出现次数最多的标签相同。Step 105. Determine that the label of the data node is the same as the label with the most occurrences among the labels of the first adjacent nodes.

本实施例中，可以预先为每个数据节点配置一个标签，该标签可以是为每个节点预先设置的一个标识ID。值得说明的是，该标签为节点标识ID，并不意味着标签与数据节点具有唯一的一一对应关系，简单来说，该标签具有分类的含义，将满足一定条件的各个数据节点打上同一个标签。In this embodiment, a label may be pre-configured for each data node, and the label may be an identification ID preset for each node. It is worth noting that the label is the ID of the node, which does not mean that the label has a unique one-to-one correspondence with the data node. Simply put, the label has the meaning of classification, and each data node that meets certain conditions is marked with the same Label.

值得说明的是，本实施例中为每个数据节点确定标签的过程是一个多次迭代的过程。在迭代初始时，可以为每个数据节点配置一个用于区别于其他数据节点的标识ID，在第一次迭代时，以某个数据节点i为例，在计算获得了数据节点i与其各个相邻节点即上述的第一相邻节点的相似度度量值后，假设其与相邻节点j的相似度度量值最大，从而，确定数据节点i的标签为数据节点j的标识ID。也就是说，在第一次迭代时，数据节点i的各相邻节点的标签即为每个相邻节点被初始分配的标识ID，而且，由于此时的每个相邻节点的标签具有唯一性，因此，每个标签的出现次数为1，即数据节点i的各相邻节点的标签中存在出现次数相同的至少两个标签，此时，与所述至少两个标签分别对应的各第二相邻节点即为数据节点i的全部相邻节点。从而，根据数据节点i与其全部相邻节点中的每个相邻数据节点间的相似性度量值，确定数据节点i的标签。比如，假设数据节点i与相邻节点j的相似度度量值最大，从而，确定数据节点i的标签为数据节点j的标识ID，此时，数据节点i的标签不再时初始分配给其的标识ID，而变为数据节点j的标识ID。It should be noted that the process of determining a label for each data node in this embodiment is a process of multiple iterations. At the beginning of the iteration, each data node can be configured with an identification ID used to distinguish it from other data nodes. In the first iteration, taking a certain data node i as an example, after calculating the data node i and its corresponding After the adjacent node, that is, the above-mentioned first adjacent node's similarity measure value, it is assumed that its similarity measure value with the adjacent node j is the largest, so that the label of data node i is determined to be the identification ID of data node j. That is to say, in the first iteration, the label of each adjacent node of data node i is the ID initially assigned to each adjacent node, and, since the label of each adjacent node at this time has a unique Therefore, the number of occurrences of each label is 1, that is, there are at least two labels with the same number of occurrences in the labels of the adjacent nodes of data node i, and at this time, each corresponding to the at least two labels respectively The second adjacent nodes are all adjacent nodes of data node i. Therefore, the label of data node i is determined according to the similarity measurement value between data node i and each adjacent data node in all adjacent nodes. For example, assuming that the similarity measure value between data node i and adjacent node j is the largest, so that the label of data node i is determined to be the identification ID of data node j, at this time, the label of data node i is no longer the initial assignment to it The identification ID becomes the identification ID of data node j.

从而在后续的第二次或者以后的迭代过程中，由于存在上述标签改变的过程，因此后续的迭代过程中有可能出现某数据节点k的某个相邻节点q的标签出现了最多的次数，也就是说，该数据节点q的标识ID在数据节点k的所有相邻节点中被用于作为标签的次数最多。此时，将数据节点k的标签变为数据节点q的标识ID。Therefore, in the subsequent second or subsequent iteration process, due to the above-mentioned label change process, it is possible that the label of a certain adjacent node q of a certain data node k appears the most times in the subsequent iteration process, That is to say, the identification ID of the data node q is used as the label the most times among all adjacent nodes of the data node k. At this point, change the label of data node k to the identification ID of data node q.

步骤106、将所述待处理数据中具有同一标签的数据节点划分到同一社区，并将属于同一社区的数据节点存储在同一处理主机中。Step 106: Divide the data nodes with the same label in the data to be processed into the same community, and store the data nodes belonging to the same community in the same processing host.

在对待处理数据中的每个数据节点都做了上述的确定标签的处理之后，待处理数据中的很多数据节点将具有同样的标签，将具有同一标签的数据节点划分到同一社区，并将属于同一社区的数据节点存储在同一处理主机中，从而完成对待处理数据的数据切分处理，即将待处理数据划分为各个社区，以社区为单位进行数据节点的存储。After each data node in the data to be processed has done the above-mentioned process of determining the label, many data nodes in the data to be processed will have the same label, and the data nodes with the same label will be divided into the same community, and will belong to The data nodes of the same community are stored in the same processing host, so as to complete the data segmentation processing of the data to be processed, that is, the data to be processed is divided into various communities, and the data nodes are stored in the community as a unit.

本实施例中，对于数据规模很大的待处理数据，针对待处理数据中的每个数据节点确定自身与各相邻节点间的相似性度量值，进而在确定其各相邻节点的标签中存在多个出现次数相同的标签时，根据与拥有该相同标签的各相邻节点间的相似性度量值，确定自身的标签，进而将待处理数据中具有同一标签的数据节点划分到同一社区，并将属于同一社区的数据节点存储在同一处理主机中。通过对每个数据节点进行与其相邻节点间的相似性度量，充分考虑了待处理数据的整体结构特性，进而对待处理数据中的每个数据节点基于该相似性特性，结合其相邻节点的标签进行聚类分析，以最终将待处理数据划分为多个社区，每个社区中的数据节点都具有同一标签，而且，每个社区中的数据节点也具有较高相似性，从而以社区为单位进行数据节点在处理主机上的切分存储。由于充分考虑了数据节点间的相似性特征以及基于标签实现了数据节点的社区划分，节省了运算开销，而且使得关系密切的数据节点能够被分配到同一处理主机中，大大减少了在不同处理主机间的通信开销，从而提高了数据处理效率。In this embodiment, for data to be processed with a large data scale, for each data node in the data to be processed, determine the similarity metric value between itself and each adjacent node, and then determine the label of each adjacent node When there are multiple labels with the same number of occurrences, determine its own label according to the similarity measurement value between adjacent nodes with the same label, and then divide the data nodes with the same label in the data to be processed into the same community, And store data nodes belonging to the same community in the same processing host. By measuring the similarity between each data node and its adjacent nodes, the overall structural characteristics of the data to be processed are fully considered, and then each data node in the data to be processed is based on the similarity characteristics, combined with its adjacent nodes. Tags are used for cluster analysis to finally divide the data to be processed into multiple communities, and the data nodes in each community have the same label, and the data nodes in each community also have high similarity, so the community-based The unit performs split storage of data nodes on the processing host. Due to the full consideration of the similarity between data nodes and the realization of the community division of data nodes based on labels, the calculation cost is saved, and the closely related data nodes can be assigned to the same processing host, which greatly reduces the number of different processing hosts. Inter-communication overhead, thus improving the efficiency of data processing.

图2为本发明分布式图计算系统数据切分方法实施例二的流程图，如图2所示，本实施例提供的所述方法包括如下步骤：Fig. 2 is a flowchart of Embodiment 2 of the data segmentation method of the distributed graph computing system of the present invention. As shown in Fig. 2, the method provided in this embodiment includes the following steps:

步骤201、根据公式(1)计算所述待处理数据中的数据节点u与所述数据节点u的相邻节点v间的相似性度量值APS(u,v)；Step 201. Calculate the similarity measure value APS(u,v) between the data node u in the data to be processed and the adjacent node v of the data node u according to the formula (1);

$APS APS ((u u,, v v)) = = {Σ Σ}_{j j = = 11}^{n no} min min ((PG PG (({v v}_{j j},, u u)),, PG PG (({v v}_{j j},, v v)))),, u u,, v v &Element; &Element; V V,, {v v}_{j j} &Element; &Element; ((δ δ ((v v)) \cap \cap δ δ ((u u)))) - - - - - - ((11))$

其中，v为所述数据节点u的各第一相邻节点中的任一个数据节点，δ(v)为数据节点v的各第一相邻节点，V为所述待处理数据中所有数据节点的集合，δ(u)为所述数据节点u的各第一相邻节点，n为同时属于数据节点v的各第一相邻节点和数据节点u的各第一相邻节点的节点的个数，PG(v_j,u)为数据节点v_j传播到数据节点u的网页排名PR值，根据PG(v_j,u)＝PR(u)确定，PG(v_j,v)为数据节点v_j传播到数据节点v的网页排名PR值，根据PG(v_j,v)＝PR(v)确定，其中，PR(u)和PR(v)根据公式(2)确定：Wherein, v is any one of the first adjacent nodes of the data node u, δ(v) is each of the first adjacent nodes of the data node v, and V is all data nodes in the data to be processed , δ(u) is each first adjacent node of the data node u, and n is the number of nodes that belong to each first adjacent node of data node v and each first adjacent node of data node u at the same time PG(v _j ,u) is the PR value of the page ranking transmitted from data node v _j to data node u, determined according to PG(v _j ,u)=PR(u), PG(v _j ,v) is data node The PR value of web page ranking propagated by v _j to data node v is determined according to PG(v _j ,v)=PR(v), where PR(u) and PR(v) are determined according to formula (2):

$PR PR ((w w)) = = ((11 - - γ γ)) + + γ γ {Σ Σ}_{{v v}_{i i} &Element; &Element; in in ((w w))} \frac{PR PR (({v v}_{i i}))}{| | out out ((w w)) | |},, w w &Element; &Element; V V - - - - - - ((22))$

其中，γ为加权系数，in(w)为与所述数据节点w的入边对应的相邻节点集合，out(w)为与数据节点w的出边对应的相邻节点集合，|out(w)|为所述出边对应的相邻节点集合中包含的节点个数。Wherein, γ is a weighting coefficient, in(w) is the set of adjacent nodes corresponding to the incoming edge of the data node w, out (w) is the set of adjacent nodes corresponding to the outgoing edge of the data node w, |out( w)| is the number of nodes contained in the adjacent node set corresponding to the outgoing edge.

本实施例中，可以根据数据节点的PageRank值来计算某个数据节点与其相邻节点的相似程度。PageRank是Google专有的算法，用于衡量特定网页相对于搜索引擎索引中的其他网页而言的重要程度，本实施例中将其应用在图结构数据中，用来计算数据节点在整个图结构数据即待处理数据中的重要程度。每个数据节点通过上述计算方式计算出其与所有相邻节点的相似度度量值，从而每个数据节点可以保存一个结构体，其中存储了它的所有相邻节点的标识ID和与该数据节点的相似度度量值，以便用于后续该数据节点的标签的确定过程中。In this embodiment, the degree of similarity between a certain data node and its adjacent nodes can be calculated according to the PageRank value of the data node. PageRank is a proprietary algorithm of Google, which is used to measure the importance of a specific webpage relative to other webpages in the search engine index. The data is the degree of importance in the data to be processed. Each data node calculates its similarity measure with all adjacent nodes through the above calculation method, so that each data node can save a structure, which stores the identification IDs of all its adjacent nodes and the data node The similarity measure value of , so as to be used in the subsequent determination process of the label of the data node.

步骤202、根据所述各第一相邻节点的标签，获得每个所述标签在所述各第一相邻节点中的出现次数；Step 202. According to the labels of the first adjacent nodes, the number of occurrences of each of the labels in the first adjacent nodes is obtained;

步骤203、根据所述出现次数，确定所述各第一相邻节点的标签中是否存在出现次数相同的至少两个标签，若存在，则执行步骤204，否则，执行步骤205；Step 203, according to the number of occurrences, determine whether there are at least two labels with the same number of occurrences among the labels of the first adjacent nodes, if yes, perform step 204, otherwise, perform step 205;

步骤204、确定与所述至少两个标签分别对应的各第二相邻节点，从所述各第二相邻节点中确定出第三相邻节点，确定所述数据节点的标签与所述第三相邻节点的标签相同，所述第三相邻节点与所述数据节点的相似性度量值大于所述各第二相邻节点中除所述第三相邻节点之外的其他第二相邻节点与所述数据节点的相似性度量值；Step 204: Determine the second adjacent nodes respectively corresponding to the at least two labels, determine the third adjacent node from the second adjacent nodes, determine the label of the data node and the first The labels of the three adjacent nodes are the same, and the similarity metric value between the third adjacent node and the data node is greater than that of other second phases of the second adjacent nodes except the third adjacent node. The similarity measure value of the adjacent node and the data node;

步骤205、确定所述数据节点的标签与所述各第一相邻节点的标签中出现次数最多的标签相同。Step 205. Determine that the label of the data node is the same as the label with the most occurrences among the labels of the first adjacent nodes.

上述步骤202-205与图1所示实施例中一致，不再赘述。The foregoing steps 202-205 are consistent with those in the embodiment shown in FIG. 1 , and will not be repeated here.

步骤206、将所述待处理数据中具有同一标签的数据节点划分到同一社区，将所述待处理数据中的各个社区按照社区规模从大到小的顺序进行排序，所述社区规模为所述社区中包含的数据节点的个数；Step 206: Divide the data nodes with the same label in the data to be processed into the same community, sort each community in the data to be processed in descending order of community size, and the community size is the The number of data nodes contained in the community;

步骤207、根据各处理主机的预设存储顺序以及所述各个社区的排序，分别确定所述各个社区应该存入的处理主机，所述存储顺序是指所述各处理主机存储所述各个社区的先后顺序。Step 207: According to the preset storage order of each processing host and the order of each community, respectively determine the processing hosts that each community should store in, the storage order refers to the storage order of each community stored by each processing host sequence.

本实施例中，在各个处理主机中已经预先存储了待处理数据的情况下，为了使得对待处理数据进行社区划分后的每个社区，与存储该社区中的数据节点的处理主机中的预先存储的数据节点具有较好的匹配，从而进一步降低处理开销，优化处理效果，本实施例在将待处理数据中具有同一标签的数据节点划分到同一社区之后，还进行如下的处理：In this embodiment, when the data to be processed has been pre-stored in each processing host, in order to make each community after the data to be processed is divided into communities, and the pre-stored data in the processing host that stores the data nodes in the community The data nodes have better matching, thereby further reducing the processing overhead and optimizing the processing effect. In this embodiment, after the data nodes with the same label in the data to be processed are divided into the same community, the following processing is also performed:

将各个社区按照社区规模从大到小的顺序进行排序，其中，社区规模为所述社区中包含的数据节点的个数；根据各处理主机的预设存储顺序以及所述各个社区的排序，分别确定所述各个社区应该存入的处理主机，所述存储顺序是指所述各处理主机存储所述各个社区的先后顺序。Sort each community in order of community size from large to small, where the community size is the number of data nodes contained in the community; according to the preset storage order of each processing host and the order of each community, respectively Determine the processing hosts in which the communities should be stored, and the storage sequence refers to the order in which the processing hosts store the communities.

值得说明的是，本实施例中对各个社区按照社区规模从大到小排序仅仅是一种可选的方式，也可以从小到大排序。而且，上述各处理主机的预设存储顺序，简单来说相当于对各个处理主机进行了顺序编号。从而，上述根据各处理主机的预设存储顺序以及所述各个社区的排序，分别确定所述各个社区应该存入的处理主机，具体为：以每个处理主机一次存入一个社区的方式，按照各个社区排序的先后顺序以及各个处理主机存储各社区的先后顺序，确定每个社区应该存入的处理主机。It should be noted that in this embodiment, sorting the communities according to the size of the communities in descending order is only an optional manner, and it can also be sorted in descending order. Moreover, the preset storage order of the above-mentioned processing hosts is simply equivalent to sequentially numbering each processing host. Therefore, according to the preset storage order of each processing host and the sorting of each community, the processing hosts that should be stored in each community are respectively determined, specifically: each processing host is stored in one community at a time, according to The order in which communities are sorted and the order in which each processing host stores each community determines the processing host where each community should be stored.

举例来说，假设待处理数据中有N1个社区，按照社区规模从大到小的顺序依次排列为：社区1、社区2、…、社区N1，并有N2个处理主机，按照预设存储顺序依次为：处理主机1、处理主机2、…、处理主机N2。如果N1小于或者等于N2，那么各个社区与各个处理主机的对应关系为：社区1(即社区1中包含的所有数据节点)存入处理主机1中，社区2存入处理主机2中，依次类推，社区N1存入处理主机N1中；如果N1大于N2，此时说明待存入的社区数量多于处理主机的数量，此时各个社区与各个处理主机的对应关系为：社区1存入处理主机1中，社区2存入处理主机2中，依次类推，社区N2存入处理主机N2中，后续社区则从头开始循环存入，即社区N2+1存入处理主机1中，社区N2+2存入处理主机2中，以此类推。For example, assuming that there are N1 communities in the data to be processed, they are arranged in order of community size from large to small: community 1, community 2, ..., community N1, and there are N2 processing hosts, according to the preset storage order The order is: processing host 1, processing host 2, ..., processing host N2. If N1 is less than or equal to N2, then the corresponding relationship between each community and each processing host is: community 1 (that is, all data nodes contained in community 1) is stored in processing host 1, community 2 is stored in processing host 2, and so on , the community N1 is stored in the processing host N1; if N1 is greater than N2, it means that the number of communities to be stored is more than the number of processing hosts. At this time, the corresponding relationship between each community and each processing host is: Community 1 is stored in the processing host In 1, community 2 is stored in processing host 2, and so on, community N2 is stored in processing host N2, and subsequent communities are stored in a cycle from the beginning, that is, community N2+1 is stored in processing host 1, community N2+2 is stored in into processing host 2, and so on.

从而，在确定了各个社区应该存入的处理主机后，可选的，可以将各个社区依次存入对应的各处理主机中。但是，另一种可选的方式中，为了使得每个社区能够存入预存数据节点与该社区包含的数据节点具有良好匹配程度的处理主机中，本实施例中，将执行如下步骤：Therefore, after the processing hosts that each community should be stored in are determined, optionally, each community can be stored in corresponding processing hosts in turn. However, in another optional way, in order to enable each community to be stored in a processing host with a good matching degree between the pre-stored data nodes and the data nodes contained in the community, in this embodiment, the following steps will be performed:

步骤208、分别以所述各个社区中的每个社区作为待处理社区，确定所述待处理社区中包含的数据节点分别与所述各处理主机中的每个处理主机中预先保存的数据节点间的匹配程度；Step 208: Using each of the communities as the community to be processed, determine the relationship between the data nodes contained in the community to be processed and the data nodes pre-saved in each of the processing hosts. degree of matching;

具体地，所述确定所述待处理社区中包含的数据节点分别与所述各处理主机中的每个处理主机中预先保存的数据节点间的匹配程度，包括：Specifically, the determining the degree of matching between the data nodes contained in the community to be processed and the data nodes pre-saved in each of the processing hosts respectively includes:

确定每个所述匹配程度为所述待处理社区中包含的数据节点与对应的每个处理主机中预先保存的数据节点间包含的相同数据节点的个数。Each matching degree is determined as the number of identical data nodes contained between the data nodes contained in the community to be processed and the corresponding data nodes pre-saved in each processing host.

简单来说，针对每个社区来说，将其包含的数据节点依次与每个处理主机中预先保存的数据节点进行比较，确定分别与每个处理主机中预先保存的数据节点间的匹配程度，即重复的数据节点的个数。To put it simply, for each community, the data nodes contained in it are compared with the data nodes pre-saved in each processing host in turn to determine the degree of matching with the data nodes pre-saved in each processing host. That is, the number of repeated data nodes.

步骤209、从所述各处理主机中确定与所述待处理社区对应的目标处理主机，所述目标处理主机中预先保存的数据节点与所述待处理社区中包含的数据节点的匹配程度最高；Step 209, determining a target processing host corresponding to the community to be processed from the processing hosts, where the pre-saved data nodes in the target processing host have the highest matching degree with the data nodes contained in the community to be processed;

针对每个社区来说，确定分别与每个处理主机中预先保存的数据节点间的匹配程度，即重复的数据节点的个数后，选择数据节点重复个数最多的那个处理主机作为该社区最终将被存入的目标处理主机。For each community, after determining the degree of matching with the pre-saved data nodes in each processing host, that is, the number of repeated data nodes, select the processing host with the largest number of repeated data nodes as the final community. The target processing host to be logged.

步骤210、确定所述目标处理主机与根据所述各处理主机的预设存储顺序以及所述各个社区的排序确定的所述待处理社区应该存入的处理主机是否相同，若相同，则执行步骤211，否则，执行步骤212；Step 210: Determine whether the target processing host is the same as the processing host that should be stored in the community to be processed determined according to the preset storage order of the processing hosts and the ranking of the communities, and if they are the same, execute the step 211, otherwise, execute step 212;

步骤211、将所述待处理社区存入所述应该存入的处理主机中。Step 211, storing the community to be processed in the processing host that should be stored.

步骤212、将所述待处理社区存入所述目标处理主机中。Step 212, storing the community to be processed in the target processing host.

如果确定的目标处理主机与根据社区排序和处理主机排序确定的处理主机相同，那么将该社区存入该处理主机即可，否则，将该社区存入目标处理主机，即相当于将该社区从初始确定的处理主机迁移到了目标处理主机中。If the determined target processing host is the same as the processing host determined according to the community sorting and processing host sorting, then it is sufficient to store the community into the processing host; otherwise, storing the community into the target processing host is equivalent to removing the community from The initially determined processing host was migrated to the target processing host.

本实施例中，通过对每个数据节点进行与其相邻节点间的相似性度量，充分考虑了待处理数据的整体结构特性，进而对待处理数据中的每个数据节点基于该相似性特性，结合其相邻节点的标签进行聚类分析，以最终将待处理数据划分为多个社区，每个社区中的数据节点都具有同一标签，而且，每个社区中的数据节点也具有较高相似性，从而以社区为单位进行数据节点在处理主机上的切分存储。由于充分考虑了数据节点间的相似性特征以及基于标签实现了数据节点的社区划分，节省了运算开销，而且使得关系密切的数据节点能够被分配到同一处理主机中，大大减少了在不同处理主机间的通信开销，从而提高了数据处理效率。此外，在进行社区存入处理主机的过程中，依据社区中包含的数据节点与处理主机中预先存储的数据节点的匹配程度来最终确定各社区将被存入的处理主机，降低了待处理数据中数据节点存储位置的变更，进一步降低了处理开销，提高了处理效率。In this embodiment, by measuring the similarity between each data node and its adjacent nodes, the overall structural characteristics of the data to be processed are fully considered, and then each data node in the data to be processed is based on the similarity characteristics, combined with The labels of its adjacent nodes are clustered to finally divide the data to be processed into multiple communities, and the data nodes in each community have the same label, and the data nodes in each community also have high similarity , so that the data nodes are segmented and stored on the processing host in units of communities. Due to the full consideration of the similarity between data nodes and the realization of the community division of data nodes based on labels, the calculation cost is saved, and the closely related data nodes can be assigned to the same processing host, which greatly reduces the number of different processing hosts. Inter-communication overhead, thus improving the efficiency of data processing. In addition, in the process of storing the community into the processing host, the processing host to be stored in each community is finally determined according to the matching degree between the data nodes contained in the community and the data nodes pre-stored in the processing host, which reduces the number of unprocessed data. The change of the storage location of the data nodes further reduces the processing overhead and improves the processing efficiency.

图3为本发明分布式图计算系统数据切分系统实施例一的结构示意图，如图3所示，该系统包括：Fig. 3 is a schematic structural diagram of Embodiment 1 of the data segmentation system of the distributed graph computing system of the present invention. As shown in Fig. 3, the system includes:

第一确定模块11，用于根据预设算法，确定待处理数据中的每个数据节点与自身的各第一相邻节点间的相似性度量值；The first determination module 11 is configured to determine the similarity measure value between each data node in the data to be processed and its own first adjacent nodes according to a preset algorithm;

获取模块12，用于根据所述各第一相邻节点的标签，获得每个所述标签在所述各第一相邻节点中的出现次数，所述标签包括节点标识ID；An acquisition module 12, configured to obtain the number of occurrences of each of the labels in each of the first adjacent nodes according to the labels of the first adjacent nodes, where the label includes a node identification ID;

第二确定模块13，用于根据所述出现次数，确定所述各第一相邻节点的标签中是否存在出现次数相同的至少两个标签；The second determination module 13 is configured to determine whether there are at least two labels with the same number of occurrences among the labels of the first adjacent nodes according to the number of occurrences;

第三确定模块14，用于若所述第二确定模块13确定存在出现次数相同的至少两个标签，则确定与所述至少两个标签分别对应的各第二相邻节点，并根据所述数据节点与所述各第二相邻节点间的相似性度量值，确定所述数据节点的标签；The third determination module 14 is configured to determine the second adjacent nodes respectively corresponding to the at least two labels if the second determination module 13 determines that there are at least two labels with the same number of occurrences, and according to the The similarity measurement value between the data node and the second adjacent nodes determines the label of the data node;

处理模块15，用于将所述待处理数据中具有同一标签的数据节点划分到同一社区，并将属于同一社区的数据节点存储在同一处理主机中。The processing module 15 is configured to divide the data nodes with the same label in the data to be processed into the same community, and store the data nodes belonging to the same community in the same processing host.

所述第三确定模块14，还用于：The third determination module 14 is also used for:

若所述第二确定模块13确定不存在出现次数相同的至少两个标签，则确定所述数据节点的标签与所述各第一相邻节点的标签中出现次数最多的标签相同。If the second determination module 13 determines that there are no at least two labels with the same number of occurrences, it is determined that the label of the data node is the same as the label with the largest number of occurrences among the labels of the first adjacent nodes.

本实施例的系统可以用于执行图1所示方法实施例的技术方案，其实现原理和技术效果类似，此处不再赘述。The system of this embodiment can be used to execute the technical solution of the method embodiment shown in FIG. 1 , and its implementation principle and technical effect are similar, and will not be repeated here.

图4为本发明分布式图计算系统数据切分系统实施例二的结构示意图，如图4所示，本实施例提供的所述系统在图3所示实施例的基础上，所述第一确定模块11，具体用于：Fig. 4 is a schematic structural diagram of Embodiment 2 of the data segmentation system of the distributed graph computing system of the present invention. As shown in Fig. 4, the system provided in this embodiment is based on the embodiment shown in Fig. 3, and the first Determine module 11, specifically for:

根据公式(1)计算所述待处理数据中的数据节点u与所述数据节点u的相邻节点v间的相似性度量值APS(u,v)：Calculate the similarity measure value APS(u, v) between the data node u in the data to be processed and the adjacent node v of the data node u according to formula (1):

其中，γ为加权系数，in(w)为与所述数据节点w的入边对应的相邻节点集合，out(w)为与数据节点w的出边对应的相邻节点集合，|out(w)|为所述出边对应的相邻节点集合中包含的节点个数；Wherein, γ is a weighting coefficient, in(w) is the set of adjacent nodes corresponding to the incoming edge of the data node w, out (w) is the set of adjacent nodes corresponding to the outgoing edge of the data node w, |out( w)| is the number of nodes contained in the adjacent node set corresponding to the outgoing edge;

所述第三确定模块14，具体用于：The third determination module 14 is specifically used for:

从所述各第二相邻节点中确定出第三相邻节点，所述第三相邻节点与所述数据节点的相似性度量值大于所述各第二相邻节点中除所述第三相邻节点之外的其他第二相邻节点与所述数据节点的相似性度量值；A third adjacent node is determined from the second adjacent nodes, and the similarity metric value between the third adjacent node and the data node is greater than that of the second adjacent nodes except the third adjacent node. The similarity measure value between other second neighboring nodes and the data node except neighboring nodes;

确定所述数据节点的标签与所述第三相邻节点的标签相同；determining that the label of the data node is the same as the label of the third adjacent node;

所述处理模块15，还用于：The processing module 15 is also used for:

将所述待处理数据中的各个社区按照社区规模从大到小的顺序进行排序，所述社区规模为所述社区中包含的数据节点的个数；Sort each community in the data to be processed in descending order of community size, where the community size is the number of data nodes contained in the community;

根据各处理主机的预设存储顺序以及所述各个社区的排序，分别确定所述各个社区应该存入的处理主机，所述存储顺序是指所述各处理主机存储所述各个社区的先后顺序；According to the preset storage order of each processing host and the sorting of each community, respectively determine the processing host that each community should be stored in, and the storage order refers to the order in which each processing host stores the various communities;

将所述各个社区依次存入对应的各处理主机中；Store each community in the corresponding processing hosts in sequence;

所述系统还包括第四确定模块21，用于：The system also includes a fourth determining module 21, configured to:

分别以所述各个社区中的每个社区作为待处理社区，确定所述待处理社区中包含的数据节点分别与所述各处理主机中的每个处理主机中预先保存的数据节点间的匹配程度；Using each community in each of the communities as the community to be processed, determine the matching degree between the data nodes contained in the community to be processed and the data nodes pre-saved in each of the processing hosts respectively ;

从所述各处理主机中确定与所述待处理社区对应的目标处理主机，所述目标处理主机中预先保存的数据节点与所述待处理社区中包含的数据节点的匹配程度最高；Determining a target processing host corresponding to the community to be processed from the processing hosts, where the pre-saved data nodes in the target processing host have the highest matching degree with the data nodes contained in the community to be processed;

确定所述目标处理主机与根据所述各处理主机的预设存储顺序以及所述各个社区的排序确定的所述待处理社区应该存入的处理主机是否相同；determining whether the target processing host is the same as the processing host that should be stored in the communities to be processed determined according to the preset storage order of the processing hosts and the ranking of the communities;

所述处理模块15，还用于：The processing module 15 is also used for:

若所述第四确定模块21确定所述目标处理主机与根据所述各处理主机的预设存储顺序以及所述各个社区的排序确定的所述待处理社区应该存入的处理主机相同，则将所述待处理社区存入所述应该存入的处理主机中；If the fourth determining module 21 determines that the target processing host is the same as the processing host that should be stored in the community to be processed determined according to the preset storage order of the processing hosts and the ranking of the communities, then the The community to be processed is stored in the processing host that should be stored;

若所述第四确定模块21确定所述目标处理主机与根据所述各处理主机的预设存储顺序以及所述各个社区的排序确定的所述待处理社区应该存入的处理主机不相同，则将所述待处理社区存入所述目标处理主机中。If the fourth determination module 21 determines that the target processing host is different from the processing host that should be stored in the community to be processed determined according to the preset storage order of the processing hosts and the ranking of the communities, then The community to be processed is stored in the target processing host.

本实施例的系统可以用于执行图2所示方法实施例的技术方案，其实现原理和技术效果类似，此处不再赘述。The system of this embodiment can be used to implement the technical solution of the method embodiment shown in FIG. 2 , and its implementation principle and technical effect are similar, and will not be repeated here.

本领域普通技术人员可以理解：实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成，前述的程序可以存储于一计算机可读取存储介质中，该程序在执行时，执行包括上述方法实施例的步骤；而前述的存储介质包括：ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。Those of ordinary skill in the art can understand that all or part of the steps for realizing the above-mentioned method embodiments can be completed by hardware related to program instructions, and the aforementioned program can be stored in a computer-readable storage medium. When the program is executed, the It includes the steps of the above-mentioned method embodiments; and the aforementioned storage medium includes: ROM, RAM, magnetic disk or optical disk and other various media that can store program codes.

最后应说明的是：以上各实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述各实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present invention, rather than limiting them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: It is still possible to modify the technical solutions described in the foregoing embodiments, or perform equivalent replacements for some or all of the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the various embodiments of the present invention. scope.

Claims

1. a distributed figure computing system data segmentation method, is characterized in that, comprising:

According to preset algorithm, determine each back end in pending data and the similarity measure values between each first-phase neighbors of self;

According to the label of described each first-phase neighbors, obtain the occurrence number of each described label in described each first-phase neighbors, described label comprises node identification ID;

According to described occurrence number, determine whether there are at least two identical labels of occurrence number in the label of described each first-phase neighbors;

If exist, then determine each second-phase neighbors corresponding respectively with described at least two labels, and according to the similarity measure values between described back end and described each second-phase neighbors, determine the label of described back end;

The back end in described pending data with same label is divided into same community, and the back end belonging to same community is stored in same processing host.

2. method according to claim 1, is characterized in that, described according to preset algorithm, determines each back end in pending data and the similarity measure values between each first-phase neighbors of self, comprising:

Similarity measure values APS (u, v) between the adjacent node v calculating back end u in described pending data and described back end u according to formula (1):

APS (u, v) = Σ_{j = 1}^{n} \min (PG (v_{j}, u), PG (v_{j}, v)), u, v &Element; V, v_{j} &Element; (δ (v) \cap δ (u)) - - - (1)

Wherein, v is any one back end in each first-phase neighbors of described back end u, each first-phase neighbors that δ (v) is back end v, V is the set of all back end in described pending data, each first-phase neighbors that δ (u) is described back end u, n is the number of the node simultaneously belonging to each first-phase neighbors of back end v and each first-phase neighbors of back end u, PG (v _j, u) be back end v _jpropagate into the page rank PR value of back end u, according to PG (v _j, u)=PR (u) determines, PG (v _j, v) be back end v _jpropagate into the page rank PR value of back end v, according to PG (v _j, v)=PR (v) determines, wherein, PR (u) and PR (v) determines according to formula (2):

PR (w) = (1 - γ) + γ Σ_{v_{i} &Element; in (w)} \frac{PR (v_{i})}{| out (w) |}, w &Element; V - - - (2)

Wherein, γ is weighting coefficient, in (w) for described back end w enter adjacent node set corresponding to limit, out (w) for back end w go out adjacent node set corresponding to limit, | out (w) | be the described node number going out to comprise in adjacent node set corresponding to limit.

3. method according to claim 2, is characterized in that, described according to the similarity measure values between described back end and described each second-phase neighbors, determines the label of described back end, comprising:

From described each second-phase neighbors, determine third phase neighbors, the similarity measure values of described third phase neighbors and described back end is greater than the similarity measure values of other second-phase neighbors in described each second-phase neighbors except described third phase neighbors and described back end;

Determine that the label of described back end is identical with the label of described third phase neighbors.

4. method according to claim 1, is characterized in that, described according to described occurrence number, after determining whether there are at least two identical labels of occurrence number in the described label of described each first-phase neighbors, also comprises:

If do not exist, then determine that the label that the label of described back end is maximum with occurrence number in the label of described each first-phase neighbors is identical.

5. method according to any one of claim 1 to 4, is characterized in that, described the back end in described pending data with same label is divided into same community after, also comprise:

Sorted according to community's scale order from big to small each community in described pending data, described community scale is the number of the back end comprised in described community;

Described the back end belonging to same community to be stored in same processing host, to comprise:

According to the default storage order of each processing host and the sequence of each community described, determine respectively each community described should stored in processing host, described storage order refers to the sequencing of each community described in described each process host stores;

By each community described successively stored in each processing host of correspondence.

6. method according to claim 5, is characterized in that, described according to the default storage order of each processing host and the sequence of each community described, determine respectively each community described should stored in processing host after, also comprise:

Respectively using each community in each community described as pending community, determine the matching degree between the back end preserved in advance in the back end that comprises in the described pending community each processing host respectively and in described each processing host;

From described each processing host, determine the target processing host corresponding with described pending community, the matching degree of the back end comprised in the back end preserved in advance in described target processing host and described pending community is the highest;

Determine described target processing host and the described pending community determined according to the default storage order of described each processing host and the sequence of each community described should stored in processing host whether identical;

If identical, then by described pending community stored in described should stored in processing host in.

7. method according to claim 6, it is characterized in that, described determine described target processing host and the described pending community determined according to the default storage order of described each processing host and the sequence of each community described should stored in processing host whether identical after, also comprise:

If not identical, then by described pending community stored in described target processing host.

8. the method according to claim 6 or 7, is characterized in that, the matching degree between the back end preserved in advance in each processing host of the described back end determining to comprise in described pending community respectively and in described each processing host, comprising:

Determine that each described matching degree is the number of back end and the identical data node comprised between the back end preserved in advance in corresponding each processing host comprised in described pending community.

9. a distributed figure computing system data cutting system, is characterized in that, comprising:

First determination module, for according to preset algorithm, determines each back end in pending data and the similarity measure values between each first-phase neighbors of self;

Acquisition module, for the label according to described each first-phase neighbors, obtain the occurrence number of each described label in described each first-phase neighbors, described label comprises node identification ID;

Second determination module, for according to described occurrence number, determines whether there are at least two identical labels of occurrence number in the label of described each first-phase neighbors;

3rd determination module, if for existing, then determines each second-phase neighbors corresponding respectively with described at least two labels, and according to the similarity measure values between described back end and described each second-phase neighbors, determines the label of described back end;

Processing module, for the back end in described pending data with same label is divided into same community, and is stored in the back end belonging to same community in same processing host.

10. system according to claim 9, is characterized in that, described first determination module, specifically for:

APS (u, v) = Σ_{j = 1}^{n} \min (PG (v_{j}, u), PG (v_{j}, v)), u, v &Element; V, v_{j} &Element; (δ (v) \cap δ (u)) - - - (1)

PR (w) = (1 - γ) + γ Σ_{v_{i} &Element; in (w)} \frac{PR (v_{i})}{| out (w) |}, w &Element; V - - - (2)

Wherein, γ is weighting coefficient, in (w) for described back end w enter adjacent node set corresponding to limit, out (w) for back end w go out adjacent node set corresponding to limit, | out (w) | be the described node number going out to comprise in adjacent node set corresponding to limit;

Described 3rd determination module, specifically for:

Determine that the label of described back end is identical with the label of described third phase neighbors;

Described 3rd determination module, also for:

If do not exist, then determine that the label that the label of described back end is maximum with occurrence number in the label of described each first-phase neighbors is identical;

Described processing module, also for:

By each community described successively stored in each processing host of correspondence;

Described system also comprises the 4th determination module, for:

Described processing module, also for:

If identical, then by described pending community stored in described should stored in processing host in;