[go: up one dir, main page]

CN106611037A - Method and device for distributed diagram calculation - Google Patents

Method and device for distributed diagram calculation Download PDF

Info

Publication number
CN106611037A
CN106611037A CN201610818819.8A CN201610818819A CN106611037A CN 106611037 A CN106611037 A CN 106611037A CN 201610818819 A CN201610818819 A CN 201610818819A CN 106611037 A CN106611037 A CN 106611037A
Authority
CN
China
Prior art keywords
nomography
data
distributed
diagram data
equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610818819.8A
Other languages
Chinese (zh)
Inventor
王志平
吕程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Star Link Information Technology (shanghai) Co Ltd
Original Assignee
Star Link Information Technology (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Star Link Information Technology (shanghai) Co Ltd filed Critical Star Link Information Technology (shanghai) Co Ltd
Priority to CN201610818819.8A priority Critical patent/CN106611037A/en
Priority to PCT/CN2017/080845 priority patent/WO2018045753A1/en
Publication of CN106611037A publication Critical patent/CN106611037A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2237Vectors, bitmaps or matrices
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application aims at providing a method and device for distributed diagram calculation. Compared with the prior art, the method provided by the invention comprises the following steps: firstly acquiring original diagram data, and processing the original diagram data according to a diagram algorithm to acquire structured diagram data corresponding to the diagram algorithm, thereby adapting to the diagram algorithms in different types; and then distributing the calculation tasks corresponding to the diagram algorithms to multiple calculation nodes to be executed, wherein a persistence condition should be satisfied in the execution process, the persistence operation is performed, the data dependence is cutoff, the repeated calculation amount is reduced, and the processing efficiency is improved. Furthermore, the application comprises firstly performing a combination operation on the diagram data before an aggregation operation and a connection operation, thereby improving the operation efficiency and relieving the network transmission pressure. Furthermore, the application adopts a data serialization and deserialization method to facilitate the transmission of intermediate data produced in the calculation process between the calculation nodes.

Description

For the method and apparatus that distributed figure is calculated
Technical field
The application is related to computer realm, more particularly to a kind of technology calculated for distributed figure.
Background technology
With the expansion of figure scale, unit, the figure Processing Algorithm of single thread are limited by system resource and calculating time, Algorithm success cannot be ensured and effectively run.Therefore it is the approach of solve problem by figure processing procedure parallelization, distributedization.
The content of the invention
The purpose of the application is to provide a kind of method and apparatus calculated for distributed figure.
According to the one side of the application, there is provided a kind of method calculated for distributed figure, wherein, methods described bag Include:
Obtain original diagram data;
According to nomography, process the original diagram data to obtain the corresponding regular diagram data of the nomography;
The nomography corresponding calculating task is distributed to multiple calculate nodes to perform, wherein, in the process of implementation when Meet persistence condition, carry out persistence operation.
According to further aspect of the application, there is provided a kind of equipment calculated for distributed figure, wherein, the equipment Including:
First device, for obtaining original diagram data;
Second device, for according to nomography, processing the original diagram data corresponding regular to obtain the nomography Diagram data;
3rd device, performs for the nomography corresponding calculating task is distributed to multiple calculate nodes, wherein, When persistence condition is met in implementation procedure, persistence operation is carried out.
Compared with prior art, the application first obtains original diagram data, then according to the nomography process original graph number The corresponding regular diagram data of the nomography is obtained according to this, in order to be adapted to different types of nomography, then by the graphic calculation The corresponding calculating task of method is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out Persistence is operated, and cuts off data dependence, reduces double counting amount, improves treatment effeciency.Further, the application is to diagram data Before carrying out converging operationJu Hecaozuo and attended operation, operation is first merged to which, so as to improve operation efficiency, mitigate network transmission pressure Power.Further, a kind of method that the application adopts Data Serialization and unserializing, in order to the generation in calculating process Intermediate data is transmitted between calculate node.Further, the application is realized and starts nomography by SQL statement, and is led to Cross improvement and process logic so that the data into nomography are complete diagram datas.
Description of the drawings
By reading the detailed description made to non-limiting example made with reference to the following drawings, other of the invention Feature, objects and advantages will become more apparent upon:
Fig. 1 illustrates a kind of method flow diagram calculated for distributed figure according to the application one side;
Fig. 2 illustrate nomography corresponding calculating task is distributed to according to one kind of one preferred embodiment of the application it is multiple The schematic diagram of calculate node;
Fig. 3 illustrates a kind of method flow diagram calculated for distributed figure according to another preferred embodiment of the application;
Fig. 4 illustrates a kind of equipment schematic diagram calculated for distributed figure according to the application other side;
Fig. 5 illustrates a kind of equipment schematic diagram calculated for distributed figure according to another preferred embodiment of the application.
In accompanying drawing, same or analogous reference represents same or analogous part.
Specific embodiment
Below in conjunction with the accompanying drawings the present invention is described in further detail.
In one typical configuration of the application, terminal, the equipment of service network and trusted party include one or more Processor (CPU), input/output interface, network interface and internal memory.
Internal memory potentially includes the volatile memory in computer-readable medium, random access memory (RAM) and/or The forms such as Nonvolatile memory, such as read only memory (ROM) or flash memory (flash RAM).Internal memory is computer-readable medium Example.
Computer-readable medium includes that permanent and non-permanent, removable and non-removable media can be by any method Or technology is realizing information Store.Information can be computer-readable instruction, data structure, the module of program or other data. The example of the storage medium of computer includes, but are not limited to phase transition internal memory (PRAM), static RAM (SRAM), moves State random access memory (DRAM), other kinds of random access memory (RAM), read only memory (ROM), electric erasable Programmable read only memory (EEPROM), fast flash memory bank or other memory techniques, read-only optical disc read only memory (CD-ROM), Digital versatile disc (DVD) or other optical storages, magnetic cassette tape, magnetic disk storage or other magnetic storage apparatus or Any other non-transmission medium, can be used to store the information that can be accessed by a computing device.Define according to herein, computer Computer-readable recording medium does not include non-temporary computer readable media (transitory media), the such as data signal and carrier wave of modulation.
Fig. 1 illustrates a kind of method flow diagram calculated for distributed figure according to the application one side, wherein, it is described Method includes step S11, step S12 and step S13.
Specifically, in step s 11, equipment 1 obtains original diagram data;In step s 12, equipment 1 is according to nomography, place Manage the original diagram data to obtain the corresponding regular diagram data of the nomography;In step s 13, equipment 1 is by the graphic calculation The corresponding calculating task of method is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out Persistence is operated.
Here, the equipment 1 includes but is not limited to user equipment, the network equipment or user equipment and the network equipment passes through The mutually integrated equipment for being constituted of network.The user equipment its include but is not limited to any one can be entered by touch pad with user The mobile electronic product of row man-machine interaction, such as smart mobile phone, panel computer, notebook computer etc., the mobile electronic product Any operating system, such as android operating systems, iOS operating systems etc. can be adopted.Wherein, the network equipment includes one Planting can carry out the electronic equipment of numerical computations and information processing automatically according to the instruction being previously set or store, its hardware bag Include but be not limited to microprocessor, special IC (ASIC), programmable gate array (FPGA), digital processing unit (DSP), embedded Formula equipment etc..The network equipment its include but is not limited to computer, network host, single network server, multiple networks clothes The cloud that business device collection or multiple servers are constituted;Here, cloud by based on cloud computing (Cloud Computing) a large amount of computers or The webserver is constituted, wherein, cloud computing is one kind of Distributed Calculation, be made up of the loosely-coupled computer collection of a group Individual virtual supercomputer.The network includes but is not limited to the Internet, wide area network, Metropolitan Area Network (MAN), LAN, VPN, wireless Self-organizing network (Ad Hoc networks) etc..Preferably, equipment 1 can also be and run on the user equipment, the network equipment or use Family equipment and the network equipment, the network equipment, touch terminal or the network equipment and touch terminal are constituted by network is mutually integrated Shell script on equipment.Certainly, those skilled in the art will be understood that the said equipment 1 is only for example, and other are existing or modern The equipment 1 being likely to occur afterwards is such as applicable to the application, and within also should being included in the application protection domain, and here is with the side of reference Formula is incorporated herein.
In step s 11, equipment 1 obtains original diagram data.
Here, the original graph data include the point data and side data of figure;Wherein, side data can include starting point and The information of the point of arrival, can also include the information needed for any nomography;If weight map, then side data also carry weight number According to.
In step s 12, equipment 1 processes the original diagram data corresponding to obtain the nomography according to nomography Regular diagram data.
For example, nomography generally requires some parameters and carrys out the key messages such as control accuracy, operation times;The nomography Species may have various, and for different nomographys, its parameter is likely to difference.Here, by processing the original diagram data, Corresponding regular diagram data is obtained, to be adapted to different types of nomography.
Preferably, in step s 12, the regular diagram data is also stored in distributed file system by equipment 1.
For example, the distributed file system can include Hadoop distributed file system (Hadoop Distributed File System, HDFS);In order to increase the degree of parallelism of process, in a preferred embodiment, the application will Diagram data is stored in Hadoop distributed file systems.
Certainly, this area will be understood that above-mentioned Hadoop distributed file systems are only for example, and other are existing or from now on The distributed file system being likely to occur such as is applicable to the application, within also should being included in the application protection domain, and here It is incorporated herein by reference.
In a preferred embodiment, employ Hadoop distributed file systems to be stored, also using Hive as friendship Mutual instrument;In practical application scene, in addition to data, generally also need to for the configuration of some Hive to pass to computing node. Here, Hive is the Tool for Data Warehouse based on Hadoop, sql like language can be applied to by big data scene by Hive, one The compatible traditional data application of aspect, on the other hand shields complicated distributed programmed details.Hive supports various computing engines, its Middle Spark possesses abundant computation model and operator as computing engines, can be used for realizing nomography.
Preferably, in step s 12, equipment 1 carries out type inspection to the regular diagram data always according to the nomography Look into.
For example, before data enter nomography, need to carry out type checking, it is to avoid wrong data causes algorithm to malfunction.Tool Body ground, first can carry out field segmentation to the regular diagram data, then enter ranks type checking.In a preferred embodiment, lead to Cross the structure type detector that GraphOperator operators obtain input data from Hive StandardStructObjectInspector, the type detector enumerate the element type detector of each field ObjectInspector。
In step s 13, the nomography corresponding calculating task is distributed to multiple calculate nodes and is performed by equipment 1, its In, in the process of implementation when persistence condition is met, carry out persistence operation.
In a preferred embodiment, during calculating task distribution, in order to improve treatment effeciency, as far as possible each is counted Operator node distributes on the HDFS nodes for having diagram data.In the case where calculating implementation procedure complexity is time-consuming, by lasting Change operation preservation intermediate result and can cut off data dependence, reduce double counting amount.
Preferably, in step s 13, to create multiple calculate nodes by resource management framework described for performing for equipment 1 The corresponding calculating task of nomography.
For example, it is described that Yarn can be included by resource management framework;With reference to Fig. 2, by resource management framework Yarn it is The corresponding calculating task of the nomography creates multiple calculate nodes.
Certainly, this area will be understood that above-mentioned resource management framework Yarn is only for example, and other are existing or from now on may The resource management framework of appearance is such as applicable to the application, and within also should being included in the application protection domain, and here is quoting Mode is incorporated herein.
Preferably, in step s 13, the nomography corresponding calculating task is distributed to Distributed Calculation frame by equipment 1 Multiple calculate nodes in frame are performed.
For example, the distributed computing framework can include Spark;With reference to Fig. 2, using distributed computing framework Spark As computing engines, as the calculating process of data is delayed (lazy) model, it is more beneficial for the high figure of computation complexity and calculates.
Certainly, this area will be understood that above-mentioned distributed computing framework Spark is only for example, and other are existing or from now on may be used Can occur distributed computing framework be such as applicable to the application, within also should being included in the application protection domain, and here with Way of reference is incorporated herein.
Preferably, the persistence condition includes following at least any one:The elasticity distribution of the distributed computing framework The calculating of formula data set is time-consuming to reach corresponding duration threshold value;The elasticity distribution formula data set of the distributed computing framework work as Front dependence length reaches corresponding length threshold.
For example, the elasticity distribution formula number of Spark according to distributed computing framework Spark, can be used in calculating process According to collection (Resilient Distributed Datasets, RDD).When GraphRDD has longer calculating time or dependence (can such as calculate time-consuming corresponding duration threshold value to be set to 10 minutes, 10 points be reached when the calculating of GraphRDD is time-consuming Clock), persistence operation is carried out, data and element type detector ObjectInspector are written in local disk together, And corresponding BlockId is reported to into Spark Driver.When the nomography for needing many wheel interative computations is processed, in order to avoid The loss of data that calculate node failure is caused, persistence operation can also write data into Hadoop distributed file systems (HDFS) in.
Preferably, the persistence operation includes following at least any one:Store current result of calculation;Removing is currently relied upon Relation.
Here, result of calculation can be preserved, dependence is removed by persistence (persist) operation, some quilts are reduced The calculating cost of the complex transformations of Reusability, and fault-tolerance is provided.
Preferably, in step s 13, equipment 1 also carries out converging operationJu Hecaozuo and company to regular diagram data described in key assignments identical Connect operation.
For example, certain string or a few column data using diagram data be used as key assignments (key), be polymerized (groupBy) operation and Connection (join) operation, processes all data of identical key assignments (key) by a calculate node, therefore has between calculate node big The data transfer of amount.Specifically, the specific field of data is chosen by GraphRDD, these field sequences is melted into into key assignments (key) key assignments (key) identical data are merged by converging operationJu Hecaozuo and attended operation, and it is different according to the species of nomography, apply Plus nonidentity operation.In order to reduce network transmission pressure, here, first data are merged in each calculate node by converging operationJu Hecaozuo Once, then by the result for having merged it is delivered in other calculate nodes according to key assignments (key).
In preferred enforcement, in order to improve joint efficiency, it is possible to use a kind of data structure and optimisation strategy of optimization. When two huge GraphRDD of data volume do attended operation, enormous pressure can be produced to internal memory.In the present embodiment, adopted Data structure can be stored in data in disk when memory source is nervous, so as to avoid internal memory overflow problem.Work as data volume When the minimum GraphRDD and great GraphRDD of data volume does attended operation, adopt and the GraphRDD compared with small data quantity is copied The connection optimisation strategy of shellfish to each calculate node, accelerates to also mitigate network pressure while connection speed.
Preferably, it is described converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical also to include:Enter Before the row converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
Here, before carrying out converging operationJu Hecaozuo and attended operation, performing data union operation in current calculate node, can subtracting The transmission quantity of few network data, improves operation efficiency, so as to mitigate network transmission pressure.
Preferably, in step s 13, equipment 1 is first carried out instead to intermediate data when calculate node acquisition intermediate data Serializing, processes the intermediate data after unserializing according to the nomography, then to according to the centre after nomography process Data are serialized.
For example, various intermediate data can be produced in figure calculating process, in order to be able to reduce calculate node parsing data type CPU and memory cost, the present embodiment adopt a kind of Data Serialization and unserializing method based on type checking, by data Type resolver passes to calculate node together in company with data.Specifically, GraphOperator is by initial data and element type Detector ObjectInspector is combined into GraphRDD, used as the input data of each nomography operator.When being related to When Shuffle is operated, each data is serialized using ObjectInspector, and in other calculate node unserializings.
Preferably, with reference to Fig. 3, methods described also includes step S14 ' and step S15 ';In step S14 ' in, equipment 1 is obtained Take pending SQL statement;In step S15 ' in, equipment 1 parses the SQL statement to call corresponding nomography.
In the prior art, because nomography has the calculating process and a large amount of iterationses of complexity, it is impossible to by SQL realities It is existing.
And in the present embodiment, using distributed computing framework Spark as computing engines, will in the way of self-defining function Numerous nomographys are integrated in Hive.It is thus possible to by nomography and other SQL statement organic assemblings, reduce intractability.
Preferably, in step S15 ' in, equipment 1 registers multiple nomographys using self-defining function, wherein, each nomography One registration function of correspondence.
For example, it is possible to use (User Defined Table-Generating Function, user is certainly for the UDTF of Hive Define table generating function) mechanism, the mesh realized class name, reach by SQL statement startup nomography of nomography is registered to Hive 's.Here, UDTF is a kind of interface that Hive makes function by oneself and designs for user's addition, user can pass through UDTF's Process methods, obtain a line input, and be converted in a row or multirow output.
But the model of " a line is input into, multirow output " of UDTF can not meet the demand of figure calculating.In the present embodiment, By adding new process logic on the basis of UDTF so that the data into nomography are complete diagram datas.Here, can be with Using UDTF interfaces, it is that each nomography registers a function.The present embodiment realizes the Operator operators based on UDTF, from And solve the demand that figure is calculated.Specifically, a GraphOperator operator is realized first, as all nomography operators Base class.GraphOperator inherits UDTF interfaces, therefore can pass through FunctionRegistry's RegisterGenericUDTF methods, different nomographys are registered in Hive.Change Hive's in the present embodiment TableScanOperator operators and UDTFOperator operators.UDTFOperator operators are calculated from TableScanOperator The input data for being encapsulated as RDD is obtained at son, and passes to GraphOperator operators.Each inherits GraphOperator Nomography operator can just have access to complete diagram data.
Fig. 4 illustrates a kind of equipment 1 calculated for distributed figure according to the application other side, wherein, it is described to set Standby 1 includes first device 11, second device 12 and 3rd device 13.
Specifically, the first device 11 obtains original diagram data;The second device 12 processes described according to nomography Original diagram data is obtaining the corresponding regular diagram data of the nomography;The 3rd device 13 is by the nomography corresponding meter Calculation task is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out persistence behaviour Make.
Here, the equipment 1 includes but is not limited to user equipment, the network equipment or user equipment and the network equipment passes through The mutually integrated equipment for being constituted of network.The user equipment its include but is not limited to any one can be entered by touch pad with user The mobile electronic product of row man-machine interaction, such as smart mobile phone, panel computer, notebook computer etc., the mobile electronic product Any operating system, such as android operating systems, iOS operating systems etc. can be adopted.Wherein, the network equipment includes one Planting can carry out the electronic equipment of numerical computations and information processing automatically according to the instruction being previously set or store, its hardware bag Include but be not limited to microprocessor, special IC (ASIC), programmable gate array (FPGA), digital processing unit (DSP), embedded Formula equipment etc..The network equipment its include but is not limited to computer, network host, single network server, multiple networks clothes The cloud that business device collection or multiple servers are constituted;Here, cloud by based on cloud computing (Cloud Computing) a large amount of computers or The webserver is constituted, wherein, cloud computing is one kind of Distributed Calculation, be made up of the loosely-coupled computer collection of a group Individual virtual supercomputer.The network includes but is not limited to the Internet, wide area network, Metropolitan Area Network (MAN), LAN, VPN, wireless Self-organizing network (Ad Hoc networks) etc..Preferably, equipment 1 can also be and run on the user equipment, the network equipment or use Family equipment and the network equipment, the network equipment, touch terminal or the network equipment and touch terminal are constituted by network is mutually integrated Shell script on equipment.Certainly, those skilled in the art will be understood that the said equipment 1 is only for example, and other are existing or modern The equipment 1 being likely to occur afterwards is such as applicable to the application, and within also should being included in the application protection domain, and here is with the side of reference Formula is incorporated herein.
The first device 11 obtains original diagram data.
Here, the original graph data include the point data and side data of figure;Wherein, side data can include starting point and The information of the point of arrival, can also include the information needed for any nomography;If weight map, then side data also carry weight number According to.
The second device 12 processes the original diagram data corresponding regular to obtain the nomography according to nomography Diagram data.
For example, nomography generally requires some parameters and carrys out the key messages such as control accuracy, operation times;The nomography Species may have various, and for different nomographys, its parameter is likely to difference.Here, by processing the original diagram data, Corresponding regular diagram data is obtained, to be adapted to different types of nomography.
Preferably, the regular diagram data is also stored in distributed file system by the second device 12.
For example, the distributed file system can include Hadoop distributed file system (Hadoop Distributed File System, HDFS);In order to increase the degree of parallelism of process, in a preferred embodiment, the application will Diagram data is stored in Hadoop distributed file systems.
Certainly, this area will be understood that above-mentioned Hadoop distributed file systems are only for example, and other are existing or from now on The distributed file system being likely to occur such as is applicable to the application, within also should being included in the application protection domain, and here It is incorporated herein by reference.
In a preferred embodiment, employ Hadoop distributed file systems to be stored, also using Hive as friendship Mutual instrument;In practical application scene, in addition to data, generally also need to for the configuration of some Hive to pass to computing node. Here, Hive is the Tool for Data Warehouse based on Hadoop, sql like language can be applied to by big data scene by Hive, one The compatible traditional data application of aspect, on the other hand shields complicated distributed programmed details.Hive supports various computing engines, its Middle Spark possesses abundant computation model and operator as computing engines, can be used for realizing nomography.
Preferably, the second device 12 carries out type checking to the regular diagram data always according to the nomography.
For example, before data enter nomography, need to carry out type checking, it is to avoid wrong data causes algorithm to malfunction.Tool Body ground, first can carry out field segmentation to the regular diagram data, then enter ranks type checking.In a preferred embodiment, lead to Cross the structure type detector that GraphOperator operators obtain input data from Hive StandardStructObjectInspector, the type detector enumerate the element type detector of each field ObjectInspector。
The nomography corresponding calculating task is distributed to multiple calculate nodes and is performed by the 3rd device 13, wherein, In the process of implementation when persistence condition is met, persistence operation is carried out.
In a preferred embodiment, during calculating task distribution, in order to improve treatment effeciency, as far as possible each is counted Operator node distributes on the HDFS nodes for having diagram data.In the case where calculating implementation procedure complexity is time-consuming, by lasting Change operation preservation intermediate result and can cut off data dependence, reduce double counting amount.
Preferably, the 3rd device 13 creates multiple calculate nodes for performing the graphic calculation by resource management framework The corresponding calculating task of method.
For example, it is described that Yarn can be included by resource management framework;With reference to Fig. 2, by resource management framework Yarn it is The corresponding calculating task of the nomography creates multiple calculate nodes.
Certainly, this area will be understood that above-mentioned resource management framework Yarn is only for example, and other are existing or from now on may The resource management framework of appearance is such as applicable to the application, and within also should being included in the application protection domain, and here is quoting Mode is incorporated herein.
Preferably, the 3rd device 13 is distributed to the nomography corresponding calculating task in distributed computing framework Multiple calculate nodes perform.
For example, the distributed computing framework can include Spark;With reference to Fig. 2, using distributed computing framework Spark As computing engines, as the calculating process of data is delayed (lazy) model, it is more beneficial for the high figure of computation complexity and calculates.
Certainly, this area will be understood that above-mentioned distributed computing framework Spark is only for example, and other are existing or from now on may be used Can occur distributed computing framework be such as applicable to the application, within also should being included in the application protection domain, and here with Way of reference is incorporated herein.
Preferably, the persistence condition includes following at least any one:The elasticity distribution of the distributed computing framework The calculating of formula data set is time-consuming to reach corresponding duration threshold value;The elasticity distribution formula data set of the distributed computing framework work as Front dependence length reaches corresponding length threshold.
For example, the elasticity distribution formula number of Spark according to distributed computing framework Spark, can be used in calculating process According to collection (Resilient Distributed Datasets, RDD).When GraphRDD has longer calculating time or dependence (can such as calculate time-consuming corresponding duration threshold value to be set to 10 minutes, 10 points be reached when the calculating of GraphRDD is time-consuming Clock), persistence operation is carried out, data and element type detector ObjectInspector are written in local disk together, And corresponding BlockId is reported to into Spark Driver.When the nomography for needing many wheel interative computations is processed, in order to avoid The loss of data that calculate node failure is caused, persistence operation can also write data into Hadoop distributed file systems (HDFS) in.
Preferably, the persistence operation includes following at least any one:Store current result of calculation;Removing is currently relied upon Relation.
Here, result of calculation can be preserved, dependence is removed by persistence (persist) operation, some quilts are reduced The calculating cost of the complex transformations of Reusability, and fault-tolerance is provided.
Preferably, three device 13 also carries out converging operationJu Hecaozuo and connection behaviour to regular diagram data described in key assignments identical Make.
For example, certain string or a few column data using diagram data be used as key assignments (key), be polymerized (groupBy) operation and Connection (join) operation, processes all data of identical key assignments (key) by a calculate node, therefore has between calculate node big The data transfer of amount.Specifically, the specific field of data is chosen by GraphRDD, these field sequences is melted into into key assignments (key) key assignments (key) identical data are merged by converging operationJu Hecaozuo and attended operation, and it is different according to the species of nomography, apply Plus nonidentity operation.In order to reduce network transmission pressure, here, first data are merged in each calculate node by converging operationJu Hecaozuo Once, then by the result for having merged it is delivered in other calculate nodes according to key assignments (key).
In preferred enforcement, in order to improve joint efficiency, it is possible to use a kind of data structure and optimisation strategy of optimization. When two huge GraphRDD of data volume do attended operation, enormous pressure can be produced to internal memory.In the present embodiment, adopted Data structure can be stored in data in disk when memory source is nervous, so as to avoid internal memory overflow problem.Work as data volume When the minimum GraphRDD and great GraphRDD of data volume does attended operation, adopt and the GraphRDD compared with small data quantity is copied The connection optimisation strategy of shellfish to each calculate node, accelerates to also mitigate network pressure while connection speed.
Preferably, it is described converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical also to include:Enter Before the row converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
Here, before carrying out converging operationJu Hecaozuo and attended operation, performing data union operation in current calculate node, can subtracting The transmission quantity of few network data, improves operation efficiency, so as to mitigate network transmission pressure.
Preferably, the 3rd device 13 first carries out inverted sequence to intermediate data when calculate node acquisition intermediate data Rowization, process the intermediate data after unserializing according to the nomography, then to according to the mediant after nomography process According to being serialized.
For example, various intermediate data can be produced in figure calculating process, in order to be able to reduce calculate node parsing data type CPU and memory cost, the present embodiment adopt a kind of Data Serialization and unserializing method based on type checking, by data Type resolver passes to calculate node together in company with data.Specifically, GraphOperator is by initial data and element type Detector ObjectInspector is combined into GraphRDD, used as the input data of each nomography operator.When being related to When Shuffle is operated, each data is serialized using ObjectInspector, and in other calculate node unserializings.
Preferably, with reference to Fig. 5, the equipment 1 also includes the 4th device 14 ' and the 5th device 15 ';4th device 14 ' obtain pending SQL statement;Five device 15 ' parses the SQL statement to call corresponding nomography.
In the prior art, because nomography has the calculating process and a large amount of iterationses of complexity, it is impossible to by SQL realities It is existing.
And in the present embodiment, using distributed computing framework Spark as computing engines, will in the way of self-defining function Numerous nomographys are integrated in Hive.It is thus possible to by nomography and other SQL statement organic assemblings, reduce intractability.
Preferably, the 5th device 15 ' registers multiple nomographys using self-defining function, wherein, each nomography pair Answer a registration function.
For example, it is possible to use (User Defined Table-Generating Function, user is certainly for the UDTF of Hive Define table generating function) mechanism, the mesh realized class name, reach by SQL statement startup nomography of nomography is registered to Hive 's.Here, UDTF is the interface that Hive makes function by oneself and designs for user's addition, user can pass through the process side of UDTF Method, obtain a line input, and be converted in a row or multirow output.
But the model of " a line is input into, multirow output " of UDTF can not meet the demand of figure calculating.In the present embodiment, By adding new process logic on the basis of UDTF so that the data into nomography are complete diagram datas.Here, can be with Using UDTF interfaces, it is that each nomography registers a function.The present embodiment realizes the Operator operators based on UDTF, from And solve the demand that figure is calculated.Specifically, a GraphOperator operator is realized first, as all nomography operators Base class.GraphOperator inherits UDTF interfaces, therefore can pass through FunctionRegistry's RegisterGenericUDTF methods, different nomographys are registered in Hive.Change Hive's in the present embodiment TableScanOperator operators and UDTFOperator operators.UDTFOperator operators are calculated from TableScanOperator The input data for being encapsulated as RDD is obtained at son, and passes to GraphOperator operators.Each inherits GraphOperator Nomography operator can just have access to complete diagram data.
Compared with prior art, the application first obtains original diagram data, then according to the nomography process original graph number The corresponding regular diagram data of the nomography is obtained according to this, in order to be adapted to different types of nomography, then by the graphic calculation The corresponding calculating task of method is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out Persistence is operated, and cuts off data dependence, reduces double counting amount, improves treatment effeciency.Further, the application is to diagram data Before carrying out converging operationJu Hecaozuo and attended operation, operation is first merged to which, so as to improve operation efficiency, mitigate network transmission pressure Power.Further, a kind of method that the application adopts Data Serialization and unserializing, in order to the generation in calculating process Intermediate data is transmitted between calculate node.Further, the application is realized and starts nomography by SQL statement, and is led to Cross improvement and process logic so that the data into nomography are complete diagram datas.
It should be noted that the application can be carried out in the assembly of software and/or software with hardware, for example, can adopt Realized with special IC (ASIC), general purpose computer or any other similar hardware device.In one embodiment In, the software program of the application can pass through computing device to realize steps described above or function.Similarly, the application Software program (including related data structure) can be stored in computer readable recording medium storing program for performing, for example, RAM memory, Magnetically or optically driver or floppy disc and similar devices.In addition, some steps or function of the application can employ hardware to realize, example Such as, as coordinating so as to perform the circuit of each step or function with processor.
In addition, the part of the application can be applied to computer program, such as computer program instructions, when its quilt When computer is performed, by the operation of the computer, can call or provide according to the present processes and/or technical scheme. And the programmed instruction of the present processes is called, it is possibly stored in fixed or moveable recording medium, and/or passes through Data flow in broadcast or other signal bearing medias and be transmitted, and/or be stored according to described program instruction operation In the working storage of computer equipment.Here, including a device according to one embodiment of the application, the device includes using In the memorizer and the processor for execute program instructions of storage computer program instructions, wherein, when the computer program refers to When order is by the computing device, method and/or skill of the plant running based on aforementioned multiple embodiments according to the application are triggered Art scheme.
It is obvious to a person skilled in the art that the application is not limited to the details of above-mentioned one exemplary embodiment, Er Qie In the case of without departing substantially from spirit herein or basic feature, the application can be realized in other specific forms.Therefore, no matter From the point of view of which point, embodiment all should be regarded as exemplary, and be nonrestrictive, scope of the present application is by appended power Profit is required rather than described above is limited, it is intended that all in the implication and scope of the equivalency of claim by falling Change is included in the application.Any reference in claim should not be considered as and limit involved claim.This Outward, it is clear that " including ", a word was not excluded for other units or step, and odd number is not excluded for plural number.That what is stated in device claim is multiple Unit or device can also be realized by software or hardware by a unit or device.The first, the second grade word is used for table Show title, and be not offered as any specific order.
It is obvious to a person skilled in the art that the invention is not restricted to the details of above-mentioned one exemplary embodiment, Er Qie In the case of spirit or essential attributes without departing substantially from the present invention, the present invention can be realized in other specific forms.Therefore, no matter From the point of view of which point, embodiment all should be regarded as exemplary, and be nonrestrictive, the scope of the present invention is by appended power Profit is required rather than described above is limited, it is intended that all in the implication and scope of the equivalency of claim by falling Change is included in the present invention.Any reference in claim should not be considered as and limit involved claim.This Outward, it is clear that " including ", a word was not excluded for other units or step, and odd number is not excluded for plural number.That what is stated in device claim is multiple Unit or device can also be realized by software or hardware by a unit or device.The first, the second grade word is used for table Show title, and be not offered as any specific order.

Claims (24)

1. it is a kind of for distributed figure calculate method, wherein, methods described includes:
A obtains original diagram data;
B processes the original diagram data to obtain the corresponding regular diagram data of the nomography according to nomography;
The nomography corresponding calculating task is distributed to multiple calculate nodes and is performed by c, wherein, work as satisfaction in the process of implementation Persistence condition, carries out persistence operation.
2. method according to claim 1, wherein, methods described also includes:
Obtain pending SQL statement;
Parse the SQL statement to call corresponding nomography.
3. method according to claim 2, wherein, the parsing SQL statement is calling corresponding nomography to include:
Multiple nomographys are registered using self-defining function, wherein, each nomography one registration function of correspondence.
4. method according to claim 1, wherein, step c also includes:
Converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical.
5. method according to claim 4, wherein, it is described that converging operationJu Hecaozuo is carried out to regular diagram data described in key assignments identical And attended operation also includes:
Before carrying out the converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
6. method according to claim 1, wherein, step c includes:
When the calculate node obtains intermediate data, unserializing is carried out to intermediate data first, it is anti-according to nomography process Intermediate data after serializing, then to being serialized according to the intermediate data after nomography process.
7. method according to claim 1, wherein, step b also includes:
The regular diagram data is stored in into distributed file system.
8. method according to claim 1, wherein, step b also includes:
According to the nomography, type checking is carried out to the regular diagram data.
9. method according to claim 1, wherein, step c includes:
Multiple calculate nodes are created for performing the corresponding calculating task of the nomography by resource management framework.
10. method according to claim 1, wherein, step c includes:
The multiple calculate nodes nomography corresponding calculating task being distributed in distributed computing framework are performed.
11. methods according to claim 10, wherein, the persistence condition includes following at least any one:
The calculating of the elasticity distribution formula data set of the distributed computing framework is time-consuming to reach corresponding duration threshold value;
The present dependency length of the elasticity distribution formula data set of the distributed computing framework reaches corresponding length threshold.
12. methods according to any one of claim 1 to 11, wherein, persistence operation includes following at least appointing One:
Store current result of calculation;
Remove present dependency.
A kind of 13. equipment calculated for distributed figure, wherein, the equipment includes:
First device, for obtaining original diagram data;
Second device, for according to nomography, processing the original diagram data to obtain the corresponding regular figure number of the nomography According to;
3rd device, performs for the nomography corresponding calculating task is distributed to multiple calculate nodes, wherein, performing During when persistence condition is met, carry out persistence operation.
14. equipment according to claim 13, wherein, the equipment also includes:
4th device, for obtaining pending SQL statement;
5th device, for parsing the SQL statement to call corresponding nomography.
15. equipment according to claim 14, wherein, the 5th device is used for:
Multiple nomographys are registered using self-defining function, wherein, each nomography one registration function of correspondence.
16. equipment according to claim 13, wherein, the 3rd device is additionally operable to:
Converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical.
17. equipment according to claim 16, wherein, it is described that polymerization behaviour is carried out to regular diagram data described in key assignments identical Make and attended operation also includes:
Before carrying out the converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
18. equipment according to claim 13, wherein, the 3rd device is used for:
When the calculate node obtains intermediate data, unserializing is carried out to intermediate data first, it is anti-according to nomography process Intermediate data after serializing, then to being serialized according to the intermediate data after nomography process.
19. equipment according to claim 13, wherein, the second device is additionally operable to:
The regular diagram data is stored in into distributed file system.
20. equipment according to claim 13, wherein, the second device is additionally operable to:
According to the nomography, type checking is carried out to the regular diagram data.
21. equipment according to claim 13, wherein, the 3rd device is used for:
Multiple calculate nodes are created for performing the corresponding calculating task of the nomography by resource management framework.
22. equipment according to claim 13, wherein, the 3rd device is used for:
The multiple calculate nodes nomography corresponding calculating task being distributed in distributed computing framework are performed.
23. equipment according to claim 22, wherein, the persistence condition includes following at least any one:
The calculating of the elasticity distribution formula data set of the distributed computing framework is time-consuming to reach corresponding duration threshold value;
The present dependency length of the elasticity distribution formula data set of the distributed computing framework reaches corresponding length threshold.
24. equipment according to any one of claim 13 to 23, wherein, persistence operation includes following at least appointing One:
Store current result of calculation;
Remove present dependency.
CN201610818819.8A 2016-09-12 2016-09-12 Method and device for distributed diagram calculation Pending CN106611037A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610818819.8A CN106611037A (en) 2016-09-12 2016-09-12 Method and device for distributed diagram calculation
PCT/CN2017/080845 WO2018045753A1 (en) 2016-09-12 2017-04-18 Method and device for distributed graph computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610818819.8A CN106611037A (en) 2016-09-12 2016-09-12 Method and device for distributed diagram calculation

Publications (1)

Publication Number Publication Date
CN106611037A true CN106611037A (en) 2017-05-03

Family

ID=58614973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610818819.8A Pending CN106611037A (en) 2016-09-12 2016-09-12 Method and device for distributed diagram calculation

Country Status (2)

Country Link
CN (1) CN106611037A (en)
WO (1) WO2018045753A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729523A (en) * 2017-10-27 2018-02-23 平安科技(深圳)有限公司 Data service method, electronic installation and storage medium
CN109189732A (en) * 2018-08-03 2019-01-11 成都四方伟业软件股份有限公司 A kind of median analysis method and device
CN110427359A (en) * 2019-06-27 2019-11-08 苏州浪潮智能科技有限公司 A kind of diagram data treating method and apparatus
CN110688610A (en) * 2019-09-27 2020-01-14 支付宝(杭州)信息技术有限公司 Weight calculation method and device for graph data and electronic equipment
CN111211993A (en) * 2018-11-21 2020-05-29 百度在线网络技术(北京)有限公司 Incremental persistence method and device for streaming computation
CN111475684A (en) * 2020-06-29 2020-07-31 北京一流科技有限公司 Data processing network system and calculation chart generation method thereof
CN111935026A (en) * 2020-08-07 2020-11-13 腾讯科技(深圳)有限公司 Data transmission method, device, processing equipment and medium
WO2021012497A1 (en) * 2019-07-22 2021-01-28 平安科技(深圳)有限公司 Method, apparatus and device for storing categorical variables for graph calculation, and storage medium
CN113495679A (en) * 2020-04-01 2021-10-12 孟彤 Optimization method for large data storage access and processing based on nonvolatile storage medium
CN113626207A (en) * 2021-10-12 2021-11-09 苍穹数码技术股份有限公司 Map data processing method, device, equipment and storage medium
WO2023083234A1 (en) * 2021-11-11 2023-05-19 支付宝(杭州)信息技术有限公司 Graph state data management

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918199B (en) * 2019-02-28 2023-06-16 中国科学技术大学苏州研究院 GPU-based distributed graph processing system
CN111367936B (en) * 2020-02-28 2023-08-22 中国工商银行股份有限公司 Offline verification method and device for structured query language grammar
CN114925123B (en) * 2022-04-24 2024-06-07 杭州悦数科技有限公司 Data transmission method between distributed graph database and graph computing system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336808A (en) * 2013-06-25 2013-10-02 中国科学院信息工程研究所 System and method for real-time graph data processing based on BSP (Board Support Package) model
CN103793442A (en) * 2012-11-05 2014-05-14 北京超图软件股份有限公司 Spatial data processing method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591709B (en) * 2011-12-20 2014-03-12 南京大学 Shapefile master-slave type parallel writing method based on OGR (open geospatial rule)
CN103970604B (en) * 2013-01-31 2017-05-03 国际商业机器公司 Method and device for realizing image processing based on MapReduce framework
CN104978228B (en) * 2014-04-09 2019-08-30 腾讯科技(深圳)有限公司 A kind of dispatching method and device of distributed computing system
CN105335135B (en) * 2014-07-14 2019-01-08 华为技术有限公司 Data processing method and central node

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793442A (en) * 2012-11-05 2014-05-14 北京超图软件股份有限公司 Spatial data processing method and system
CN103336808A (en) * 2013-06-25 2013-10-02 中国科学院信息工程研究所 System and method for real-time graph data processing based on BSP (Board Support Package) model

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729523A (en) * 2017-10-27 2018-02-23 平安科技(深圳)有限公司 Data service method, electronic installation and storage medium
CN109189732A (en) * 2018-08-03 2019-01-11 成都四方伟业软件股份有限公司 A kind of median analysis method and device
CN111211993B (en) * 2018-11-21 2023-08-11 百度在线网络技术(北京)有限公司 Incremental persistence method, device and storage medium for stream computation
CN111211993A (en) * 2018-11-21 2020-05-29 百度在线网络技术(北京)有限公司 Incremental persistence method and device for streaming computation
CN110427359A (en) * 2019-06-27 2019-11-08 苏州浪潮智能科技有限公司 A kind of diagram data treating method and apparatus
WO2021012497A1 (en) * 2019-07-22 2021-01-28 平安科技(深圳)有限公司 Method, apparatus and device for storing categorical variables for graph calculation, and storage medium
CN110688610B (en) * 2019-09-27 2023-05-09 支付宝(杭州)信息技术有限公司 Weight calculation method and device for graph data and electronic equipment
CN110688610A (en) * 2019-09-27 2020-01-14 支付宝(杭州)信息技术有限公司 Weight calculation method and device for graph data and electronic equipment
CN113495679A (en) * 2020-04-01 2021-10-12 孟彤 Optimization method for large data storage access and processing based on nonvolatile storage medium
CN113495679B (en) * 2020-04-01 2022-10-21 北京大学 Optimization method of big data storage access and processing based on non-volatile storage medium
CN111475684B (en) * 2020-06-29 2020-09-22 北京一流科技有限公司 Data processing network system and calculation chart generation method thereof
CN111475684A (en) * 2020-06-29 2020-07-31 北京一流科技有限公司 Data processing network system and calculation chart generation method thereof
CN111935026A (en) * 2020-08-07 2020-11-13 腾讯科技(深圳)有限公司 Data transmission method, device, processing equipment and medium
CN111935026B (en) * 2020-08-07 2024-02-13 腾讯科技(深圳)有限公司 Data transmission method, device, processing equipment and medium
CN113626207A (en) * 2021-10-12 2021-11-09 苍穹数码技术股份有限公司 Map data processing method, device, equipment and storage medium
WO2023083234A1 (en) * 2021-11-11 2023-05-19 支付宝(杭州)信息技术有限公司 Graph state data management

Also Published As

Publication number Publication date
WO2018045753A1 (en) 2018-03-15

Similar Documents

Publication Publication Date Title
CN106611037A (en) Method and device for distributed diagram calculation
US11797558B2 (en) Generating data transformation workflows
US11907216B2 (en) Multi-language fusion query method and multi-model database system
US11681702B2 (en) Conversion of model views into relational models
CN106897322B (en) A kind of access method and device of database and file system
JP7170638B2 (en) Generating, Accessing, and Displaying Lineage Metadata
JP6144700B2 (en) Scalable analysis platform for semi-structured data
EP3740880A1 (en) Pick and applicator for use with a stringed instrument
CN105930479A (en) Data skew processing method and apparatus
JP2014502762A (en) Filtering query data in the data store
Whitby et al. Geowave: Utilizing distributed key-value stores for multidimensional data
US20190310978A1 (en) Supporting a join operation against multiple nosql databases
CN108037967A (en) A kind of menu loading method and electronic equipment based on more parent-child structures
EP3293644B1 (en) Loading data for iterative evaluation through simd registers
CN117668050A (en) Cross-data-source hybrid engine query method, system, equipment and medium
US9158796B1 (en) Data source modeling methods for heterogeneous data sources and related computer program products and systems
EP3293645A1 (en) Iterative evaluation of data through simd processor registers
Barkhordari et al. ScadiBino: an effective MapReduce-based association rule mining method
US9052956B2 (en) Selecting execution environments
CN116762067B (en) Near Data Processing (NDP) in a network node
US10255316B2 (en) Processing of data chunks using a database calculation engine
US10713244B2 (en) Calculation engine optimizations for join operations utilizing automatic detection of forced constraints
EP2590089A1 (en) Rule type columns in database
US9262492B2 (en) Dividing and combining operations
US12189717B1 (en) Automatic partitioning of machine learning models for training across multiple devices

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170503

RJ01 Rejection of invention patent application after publication