CN106611037A - Method and device for distributed diagram calculation - Google Patents
Method and device for distributed diagram calculation Download PDFInfo
- Publication number
- CN106611037A CN106611037A CN201610818819.8A CN201610818819A CN106611037A CN 106611037 A CN106611037 A CN 106611037A CN 201610818819 A CN201610818819 A CN 201610818819A CN 106611037 A CN106611037 A CN 106611037A
- Authority
- CN
- China
- Prior art keywords
- nomography
- data
- distributed
- diagram data
- equipment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2237—Vectors, bitmaps or matrices
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application aims at providing a method and device for distributed diagram calculation. Compared with the prior art, the method provided by the invention comprises the following steps: firstly acquiring original diagram data, and processing the original diagram data according to a diagram algorithm to acquire structured diagram data corresponding to the diagram algorithm, thereby adapting to the diagram algorithms in different types; and then distributing the calculation tasks corresponding to the diagram algorithms to multiple calculation nodes to be executed, wherein a persistence condition should be satisfied in the execution process, the persistence operation is performed, the data dependence is cutoff, the repeated calculation amount is reduced, and the processing efficiency is improved. Furthermore, the application comprises firstly performing a combination operation on the diagram data before an aggregation operation and a connection operation, thereby improving the operation efficiency and relieving the network transmission pressure. Furthermore, the application adopts a data serialization and deserialization method to facilitate the transmission of intermediate data produced in the calculation process between the calculation nodes.
Description
Technical field
The application is related to computer realm, more particularly to a kind of technology calculated for distributed figure.
Background technology
With the expansion of figure scale, unit, the figure Processing Algorithm of single thread are limited by system resource and calculating time,
Algorithm success cannot be ensured and effectively run.Therefore it is the approach of solve problem by figure processing procedure parallelization, distributedization.
The content of the invention
The purpose of the application is to provide a kind of method and apparatus calculated for distributed figure.
According to the one side of the application, there is provided a kind of method calculated for distributed figure, wherein, methods described bag
Include:
Obtain original diagram data;
According to nomography, process the original diagram data to obtain the corresponding regular diagram data of the nomography;
The nomography corresponding calculating task is distributed to multiple calculate nodes to perform, wherein, in the process of implementation when
Meet persistence condition, carry out persistence operation.
According to further aspect of the application, there is provided a kind of equipment calculated for distributed figure, wherein, the equipment
Including:
First device, for obtaining original diagram data;
Second device, for according to nomography, processing the original diagram data corresponding regular to obtain the nomography
Diagram data;
3rd device, performs for the nomography corresponding calculating task is distributed to multiple calculate nodes, wherein,
When persistence condition is met in implementation procedure, persistence operation is carried out.
Compared with prior art, the application first obtains original diagram data, then according to the nomography process original graph number
The corresponding regular diagram data of the nomography is obtained according to this, in order to be adapted to different types of nomography, then by the graphic calculation
The corresponding calculating task of method is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out
Persistence is operated, and cuts off data dependence, reduces double counting amount, improves treatment effeciency.Further, the application is to diagram data
Before carrying out converging operationJu Hecaozuo and attended operation, operation is first merged to which, so as to improve operation efficiency, mitigate network transmission pressure
Power.Further, a kind of method that the application adopts Data Serialization and unserializing, in order to the generation in calculating process
Intermediate data is transmitted between calculate node.Further, the application is realized and starts nomography by SQL statement, and is led to
Cross improvement and process logic so that the data into nomography are complete diagram datas.
Description of the drawings
By reading the detailed description made to non-limiting example made with reference to the following drawings, other of the invention
Feature, objects and advantages will become more apparent upon:
Fig. 1 illustrates a kind of method flow diagram calculated for distributed figure according to the application one side;
Fig. 2 illustrate nomography corresponding calculating task is distributed to according to one kind of one preferred embodiment of the application it is multiple
The schematic diagram of calculate node;
Fig. 3 illustrates a kind of method flow diagram calculated for distributed figure according to another preferred embodiment of the application;
Fig. 4 illustrates a kind of equipment schematic diagram calculated for distributed figure according to the application other side;
Fig. 5 illustrates a kind of equipment schematic diagram calculated for distributed figure according to another preferred embodiment of the application.
In accompanying drawing, same or analogous reference represents same or analogous part.
Specific embodiment
Below in conjunction with the accompanying drawings the present invention is described in further detail.
In one typical configuration of the application, terminal, the equipment of service network and trusted party include one or more
Processor (CPU), input/output interface, network interface and internal memory.
Internal memory potentially includes the volatile memory in computer-readable medium, random access memory (RAM) and/or
The forms such as Nonvolatile memory, such as read only memory (ROM) or flash memory (flash RAM).Internal memory is computer-readable medium
Example.
Computer-readable medium includes that permanent and non-permanent, removable and non-removable media can be by any method
Or technology is realizing information Store.Information can be computer-readable instruction, data structure, the module of program or other data.
The example of the storage medium of computer includes, but are not limited to phase transition internal memory (PRAM), static RAM (SRAM), moves
State random access memory (DRAM), other kinds of random access memory (RAM), read only memory (ROM), electric erasable
Programmable read only memory (EEPROM), fast flash memory bank or other memory techniques, read-only optical disc read only memory (CD-ROM),
Digital versatile disc (DVD) or other optical storages, magnetic cassette tape, magnetic disk storage or other magnetic storage apparatus or
Any other non-transmission medium, can be used to store the information that can be accessed by a computing device.Define according to herein, computer
Computer-readable recording medium does not include non-temporary computer readable media (transitory media), the such as data signal and carrier wave of modulation.
Fig. 1 illustrates a kind of method flow diagram calculated for distributed figure according to the application one side, wherein, it is described
Method includes step S11, step S12 and step S13.
Specifically, in step s 11, equipment 1 obtains original diagram data;In step s 12, equipment 1 is according to nomography, place
Manage the original diagram data to obtain the corresponding regular diagram data of the nomography;In step s 13, equipment 1 is by the graphic calculation
The corresponding calculating task of method is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out
Persistence is operated.
Here, the equipment 1 includes but is not limited to user equipment, the network equipment or user equipment and the network equipment passes through
The mutually integrated equipment for being constituted of network.The user equipment its include but is not limited to any one can be entered by touch pad with user
The mobile electronic product of row man-machine interaction, such as smart mobile phone, panel computer, notebook computer etc., the mobile electronic product
Any operating system, such as android operating systems, iOS operating systems etc. can be adopted.Wherein, the network equipment includes one
Planting can carry out the electronic equipment of numerical computations and information processing automatically according to the instruction being previously set or store, its hardware bag
Include but be not limited to microprocessor, special IC (ASIC), programmable gate array (FPGA), digital processing unit (DSP), embedded
Formula equipment etc..The network equipment its include but is not limited to computer, network host, single network server, multiple networks clothes
The cloud that business device collection or multiple servers are constituted;Here, cloud by based on cloud computing (Cloud Computing) a large amount of computers or
The webserver is constituted, wherein, cloud computing is one kind of Distributed Calculation, be made up of the loosely-coupled computer collection of a group
Individual virtual supercomputer.The network includes but is not limited to the Internet, wide area network, Metropolitan Area Network (MAN), LAN, VPN, wireless
Self-organizing network (Ad Hoc networks) etc..Preferably, equipment 1 can also be and run on the user equipment, the network equipment or use
Family equipment and the network equipment, the network equipment, touch terminal or the network equipment and touch terminal are constituted by network is mutually integrated
Shell script on equipment.Certainly, those skilled in the art will be understood that the said equipment 1 is only for example, and other are existing or modern
The equipment 1 being likely to occur afterwards is such as applicable to the application, and within also should being included in the application protection domain, and here is with the side of reference
Formula is incorporated herein.
In step s 11, equipment 1 obtains original diagram data.
Here, the original graph data include the point data and side data of figure;Wherein, side data can include starting point and
The information of the point of arrival, can also include the information needed for any nomography;If weight map, then side data also carry weight number
According to.
In step s 12, equipment 1 processes the original diagram data corresponding to obtain the nomography according to nomography
Regular diagram data.
For example, nomography generally requires some parameters and carrys out the key messages such as control accuracy, operation times;The nomography
Species may have various, and for different nomographys, its parameter is likely to difference.Here, by processing the original diagram data,
Corresponding regular diagram data is obtained, to be adapted to different types of nomography.
Preferably, in step s 12, the regular diagram data is also stored in distributed file system by equipment 1.
For example, the distributed file system can include Hadoop distributed file system (Hadoop
Distributed File System, HDFS);In order to increase the degree of parallelism of process, in a preferred embodiment, the application will
Diagram data is stored in Hadoop distributed file systems.
Certainly, this area will be understood that above-mentioned Hadoop distributed file systems are only for example, and other are existing or from now on
The distributed file system being likely to occur such as is applicable to the application, within also should being included in the application protection domain, and here
It is incorporated herein by reference.
In a preferred embodiment, employ Hadoop distributed file systems to be stored, also using Hive as friendship
Mutual instrument;In practical application scene, in addition to data, generally also need to for the configuration of some Hive to pass to computing node.
Here, Hive is the Tool for Data Warehouse based on Hadoop, sql like language can be applied to by big data scene by Hive, one
The compatible traditional data application of aspect, on the other hand shields complicated distributed programmed details.Hive supports various computing engines, its
Middle Spark possesses abundant computation model and operator as computing engines, can be used for realizing nomography.
Preferably, in step s 12, equipment 1 carries out type inspection to the regular diagram data always according to the nomography
Look into.
For example, before data enter nomography, need to carry out type checking, it is to avoid wrong data causes algorithm to malfunction.Tool
Body ground, first can carry out field segmentation to the regular diagram data, then enter ranks type checking.In a preferred embodiment, lead to
Cross the structure type detector that GraphOperator operators obtain input data from Hive
StandardStructObjectInspector, the type detector enumerate the element type detector of each field
ObjectInspector。
In step s 13, the nomography corresponding calculating task is distributed to multiple calculate nodes and is performed by equipment 1, its
In, in the process of implementation when persistence condition is met, carry out persistence operation.
In a preferred embodiment, during calculating task distribution, in order to improve treatment effeciency, as far as possible each is counted
Operator node distributes on the HDFS nodes for having diagram data.In the case where calculating implementation procedure complexity is time-consuming, by lasting
Change operation preservation intermediate result and can cut off data dependence, reduce double counting amount.
Preferably, in step s 13, to create multiple calculate nodes by resource management framework described for performing for equipment 1
The corresponding calculating task of nomography.
For example, it is described that Yarn can be included by resource management framework;With reference to Fig. 2, by resource management framework Yarn it is
The corresponding calculating task of the nomography creates multiple calculate nodes.
Certainly, this area will be understood that above-mentioned resource management framework Yarn is only for example, and other are existing or from now on may
The resource management framework of appearance is such as applicable to the application, and within also should being included in the application protection domain, and here is quoting
Mode is incorporated herein.
Preferably, in step s 13, the nomography corresponding calculating task is distributed to Distributed Calculation frame by equipment 1
Multiple calculate nodes in frame are performed.
For example, the distributed computing framework can include Spark;With reference to Fig. 2, using distributed computing framework Spark
As computing engines, as the calculating process of data is delayed (lazy) model, it is more beneficial for the high figure of computation complexity and calculates.
Certainly, this area will be understood that above-mentioned distributed computing framework Spark is only for example, and other are existing or from now on may be used
Can occur distributed computing framework be such as applicable to the application, within also should being included in the application protection domain, and here with
Way of reference is incorporated herein.
Preferably, the persistence condition includes following at least any one:The elasticity distribution of the distributed computing framework
The calculating of formula data set is time-consuming to reach corresponding duration threshold value;The elasticity distribution formula data set of the distributed computing framework work as
Front dependence length reaches corresponding length threshold.
For example, the elasticity distribution formula number of Spark according to distributed computing framework Spark, can be used in calculating process
According to collection (Resilient Distributed Datasets, RDD).When GraphRDD has longer calculating time or dependence
(can such as calculate time-consuming corresponding duration threshold value to be set to 10 minutes, 10 points be reached when the calculating of GraphRDD is time-consuming
Clock), persistence operation is carried out, data and element type detector ObjectInspector are written in local disk together,
And corresponding BlockId is reported to into Spark Driver.When the nomography for needing many wheel interative computations is processed, in order to avoid
The loss of data that calculate node failure is caused, persistence operation can also write data into Hadoop distributed file systems
(HDFS) in.
Preferably, the persistence operation includes following at least any one:Store current result of calculation;Removing is currently relied upon
Relation.
Here, result of calculation can be preserved, dependence is removed by persistence (persist) operation, some quilts are reduced
The calculating cost of the complex transformations of Reusability, and fault-tolerance is provided.
Preferably, in step s 13, equipment 1 also carries out converging operationJu Hecaozuo and company to regular diagram data described in key assignments identical
Connect operation.
For example, certain string or a few column data using diagram data be used as key assignments (key), be polymerized (groupBy) operation and
Connection (join) operation, processes all data of identical key assignments (key) by a calculate node, therefore has between calculate node big
The data transfer of amount.Specifically, the specific field of data is chosen by GraphRDD, these field sequences is melted into into key assignments
(key) key assignments (key) identical data are merged by converging operationJu Hecaozuo and attended operation, and it is different according to the species of nomography, apply
Plus nonidentity operation.In order to reduce network transmission pressure, here, first data are merged in each calculate node by converging operationJu Hecaozuo
Once, then by the result for having merged it is delivered in other calculate nodes according to key assignments (key).
In preferred enforcement, in order to improve joint efficiency, it is possible to use a kind of data structure and optimisation strategy of optimization.
When two huge GraphRDD of data volume do attended operation, enormous pressure can be produced to internal memory.In the present embodiment, adopted
Data structure can be stored in data in disk when memory source is nervous, so as to avoid internal memory overflow problem.Work as data volume
When the minimum GraphRDD and great GraphRDD of data volume does attended operation, adopt and the GraphRDD compared with small data quantity is copied
The connection optimisation strategy of shellfish to each calculate node, accelerates to also mitigate network pressure while connection speed.
Preferably, it is described converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical also to include:Enter
Before the row converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
Here, before carrying out converging operationJu Hecaozuo and attended operation, performing data union operation in current calculate node, can subtracting
The transmission quantity of few network data, improves operation efficiency, so as to mitigate network transmission pressure.
Preferably, in step s 13, equipment 1 is first carried out instead to intermediate data when calculate node acquisition intermediate data
Serializing, processes the intermediate data after unserializing according to the nomography, then to according to the centre after nomography process
Data are serialized.
For example, various intermediate data can be produced in figure calculating process, in order to be able to reduce calculate node parsing data type
CPU and memory cost, the present embodiment adopt a kind of Data Serialization and unserializing method based on type checking, by data
Type resolver passes to calculate node together in company with data.Specifically, GraphOperator is by initial data and element type
Detector ObjectInspector is combined into GraphRDD, used as the input data of each nomography operator.When being related to
When Shuffle is operated, each data is serialized using ObjectInspector, and in other calculate node unserializings.
Preferably, with reference to Fig. 3, methods described also includes step S14 ' and step S15 ';In step S14 ' in, equipment 1 is obtained
Take pending SQL statement;In step S15 ' in, equipment 1 parses the SQL statement to call corresponding nomography.
In the prior art, because nomography has the calculating process and a large amount of iterationses of complexity, it is impossible to by SQL realities
It is existing.
And in the present embodiment, using distributed computing framework Spark as computing engines, will in the way of self-defining function
Numerous nomographys are integrated in Hive.It is thus possible to by nomography and other SQL statement organic assemblings, reduce intractability.
Preferably, in step S15 ' in, equipment 1 registers multiple nomographys using self-defining function, wherein, each nomography
One registration function of correspondence.
For example, it is possible to use (User Defined Table-Generating Function, user is certainly for the UDTF of Hive
Define table generating function) mechanism, the mesh realized class name, reach by SQL statement startup nomography of nomography is registered to Hive
's.Here, UDTF is a kind of interface that Hive makes function by oneself and designs for user's addition, user can pass through UDTF's
Process methods, obtain a line input, and be converted in a row or multirow output.
But the model of " a line is input into, multirow output " of UDTF can not meet the demand of figure calculating.In the present embodiment,
By adding new process logic on the basis of UDTF so that the data into nomography are complete diagram datas.Here, can be with
Using UDTF interfaces, it is that each nomography registers a function.The present embodiment realizes the Operator operators based on UDTF, from
And solve the demand that figure is calculated.Specifically, a GraphOperator operator is realized first, as all nomography operators
Base class.GraphOperator inherits UDTF interfaces, therefore can pass through FunctionRegistry's
RegisterGenericUDTF methods, different nomographys are registered in Hive.Change Hive's in the present embodiment
TableScanOperator operators and UDTFOperator operators.UDTFOperator operators are calculated from TableScanOperator
The input data for being encapsulated as RDD is obtained at son, and passes to GraphOperator operators.Each inherits GraphOperator
Nomography operator can just have access to complete diagram data.
Fig. 4 illustrates a kind of equipment 1 calculated for distributed figure according to the application other side, wherein, it is described to set
Standby 1 includes first device 11, second device 12 and 3rd device 13.
Specifically, the first device 11 obtains original diagram data;The second device 12 processes described according to nomography
Original diagram data is obtaining the corresponding regular diagram data of the nomography;The 3rd device 13 is by the nomography corresponding meter
Calculation task is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out persistence behaviour
Make.
Here, the equipment 1 includes but is not limited to user equipment, the network equipment or user equipment and the network equipment passes through
The mutually integrated equipment for being constituted of network.The user equipment its include but is not limited to any one can be entered by touch pad with user
The mobile electronic product of row man-machine interaction, such as smart mobile phone, panel computer, notebook computer etc., the mobile electronic product
Any operating system, such as android operating systems, iOS operating systems etc. can be adopted.Wherein, the network equipment includes one
Planting can carry out the electronic equipment of numerical computations and information processing automatically according to the instruction being previously set or store, its hardware bag
Include but be not limited to microprocessor, special IC (ASIC), programmable gate array (FPGA), digital processing unit (DSP), embedded
Formula equipment etc..The network equipment its include but is not limited to computer, network host, single network server, multiple networks clothes
The cloud that business device collection or multiple servers are constituted;Here, cloud by based on cloud computing (Cloud Computing) a large amount of computers or
The webserver is constituted, wherein, cloud computing is one kind of Distributed Calculation, be made up of the loosely-coupled computer collection of a group
Individual virtual supercomputer.The network includes but is not limited to the Internet, wide area network, Metropolitan Area Network (MAN), LAN, VPN, wireless
Self-organizing network (Ad Hoc networks) etc..Preferably, equipment 1 can also be and run on the user equipment, the network equipment or use
Family equipment and the network equipment, the network equipment, touch terminal or the network equipment and touch terminal are constituted by network is mutually integrated
Shell script on equipment.Certainly, those skilled in the art will be understood that the said equipment 1 is only for example, and other are existing or modern
The equipment 1 being likely to occur afterwards is such as applicable to the application, and within also should being included in the application protection domain, and here is with the side of reference
Formula is incorporated herein.
The first device 11 obtains original diagram data.
Here, the original graph data include the point data and side data of figure;Wherein, side data can include starting point and
The information of the point of arrival, can also include the information needed for any nomography;If weight map, then side data also carry weight number
According to.
The second device 12 processes the original diagram data corresponding regular to obtain the nomography according to nomography
Diagram data.
For example, nomography generally requires some parameters and carrys out the key messages such as control accuracy, operation times;The nomography
Species may have various, and for different nomographys, its parameter is likely to difference.Here, by processing the original diagram data,
Corresponding regular diagram data is obtained, to be adapted to different types of nomography.
Preferably, the regular diagram data is also stored in distributed file system by the second device 12.
For example, the distributed file system can include Hadoop distributed file system (Hadoop
Distributed File System, HDFS);In order to increase the degree of parallelism of process, in a preferred embodiment, the application will
Diagram data is stored in Hadoop distributed file systems.
Certainly, this area will be understood that above-mentioned Hadoop distributed file systems are only for example, and other are existing or from now on
The distributed file system being likely to occur such as is applicable to the application, within also should being included in the application protection domain, and here
It is incorporated herein by reference.
In a preferred embodiment, employ Hadoop distributed file systems to be stored, also using Hive as friendship
Mutual instrument;In practical application scene, in addition to data, generally also need to for the configuration of some Hive to pass to computing node.
Here, Hive is the Tool for Data Warehouse based on Hadoop, sql like language can be applied to by big data scene by Hive, one
The compatible traditional data application of aspect, on the other hand shields complicated distributed programmed details.Hive supports various computing engines, its
Middle Spark possesses abundant computation model and operator as computing engines, can be used for realizing nomography.
Preferably, the second device 12 carries out type checking to the regular diagram data always according to the nomography.
For example, before data enter nomography, need to carry out type checking, it is to avoid wrong data causes algorithm to malfunction.Tool
Body ground, first can carry out field segmentation to the regular diagram data, then enter ranks type checking.In a preferred embodiment, lead to
Cross the structure type detector that GraphOperator operators obtain input data from Hive
StandardStructObjectInspector, the type detector enumerate the element type detector of each field
ObjectInspector。
The nomography corresponding calculating task is distributed to multiple calculate nodes and is performed by the 3rd device 13, wherein,
In the process of implementation when persistence condition is met, persistence operation is carried out.
In a preferred embodiment, during calculating task distribution, in order to improve treatment effeciency, as far as possible each is counted
Operator node distributes on the HDFS nodes for having diagram data.In the case where calculating implementation procedure complexity is time-consuming, by lasting
Change operation preservation intermediate result and can cut off data dependence, reduce double counting amount.
Preferably, the 3rd device 13 creates multiple calculate nodes for performing the graphic calculation by resource management framework
The corresponding calculating task of method.
For example, it is described that Yarn can be included by resource management framework;With reference to Fig. 2, by resource management framework Yarn it is
The corresponding calculating task of the nomography creates multiple calculate nodes.
Certainly, this area will be understood that above-mentioned resource management framework Yarn is only for example, and other are existing or from now on may
The resource management framework of appearance is such as applicable to the application, and within also should being included in the application protection domain, and here is quoting
Mode is incorporated herein.
Preferably, the 3rd device 13 is distributed to the nomography corresponding calculating task in distributed computing framework
Multiple calculate nodes perform.
For example, the distributed computing framework can include Spark;With reference to Fig. 2, using distributed computing framework Spark
As computing engines, as the calculating process of data is delayed (lazy) model, it is more beneficial for the high figure of computation complexity and calculates.
Certainly, this area will be understood that above-mentioned distributed computing framework Spark is only for example, and other are existing or from now on may be used
Can occur distributed computing framework be such as applicable to the application, within also should being included in the application protection domain, and here with
Way of reference is incorporated herein.
Preferably, the persistence condition includes following at least any one:The elasticity distribution of the distributed computing framework
The calculating of formula data set is time-consuming to reach corresponding duration threshold value;The elasticity distribution formula data set of the distributed computing framework work as
Front dependence length reaches corresponding length threshold.
For example, the elasticity distribution formula number of Spark according to distributed computing framework Spark, can be used in calculating process
According to collection (Resilient Distributed Datasets, RDD).When GraphRDD has longer calculating time or dependence
(can such as calculate time-consuming corresponding duration threshold value to be set to 10 minutes, 10 points be reached when the calculating of GraphRDD is time-consuming
Clock), persistence operation is carried out, data and element type detector ObjectInspector are written in local disk together,
And corresponding BlockId is reported to into Spark Driver.When the nomography for needing many wheel interative computations is processed, in order to avoid
The loss of data that calculate node failure is caused, persistence operation can also write data into Hadoop distributed file systems
(HDFS) in.
Preferably, the persistence operation includes following at least any one:Store current result of calculation;Removing is currently relied upon
Relation.
Here, result of calculation can be preserved, dependence is removed by persistence (persist) operation, some quilts are reduced
The calculating cost of the complex transformations of Reusability, and fault-tolerance is provided.
Preferably, three device 13 also carries out converging operationJu Hecaozuo and connection behaviour to regular diagram data described in key assignments identical
Make.
For example, certain string or a few column data using diagram data be used as key assignments (key), be polymerized (groupBy) operation and
Connection (join) operation, processes all data of identical key assignments (key) by a calculate node, therefore has between calculate node big
The data transfer of amount.Specifically, the specific field of data is chosen by GraphRDD, these field sequences is melted into into key assignments
(key) key assignments (key) identical data are merged by converging operationJu Hecaozuo and attended operation, and it is different according to the species of nomography, apply
Plus nonidentity operation.In order to reduce network transmission pressure, here, first data are merged in each calculate node by converging operationJu Hecaozuo
Once, then by the result for having merged it is delivered in other calculate nodes according to key assignments (key).
In preferred enforcement, in order to improve joint efficiency, it is possible to use a kind of data structure and optimisation strategy of optimization.
When two huge GraphRDD of data volume do attended operation, enormous pressure can be produced to internal memory.In the present embodiment, adopted
Data structure can be stored in data in disk when memory source is nervous, so as to avoid internal memory overflow problem.Work as data volume
When the minimum GraphRDD and great GraphRDD of data volume does attended operation, adopt and the GraphRDD compared with small data quantity is copied
The connection optimisation strategy of shellfish to each calculate node, accelerates to also mitigate network pressure while connection speed.
Preferably, it is described converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical also to include:Enter
Before the row converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
Here, before carrying out converging operationJu Hecaozuo and attended operation, performing data union operation in current calculate node, can subtracting
The transmission quantity of few network data, improves operation efficiency, so as to mitigate network transmission pressure.
Preferably, the 3rd device 13 first carries out inverted sequence to intermediate data when calculate node acquisition intermediate data
Rowization, process the intermediate data after unserializing according to the nomography, then to according to the mediant after nomography process
According to being serialized.
For example, various intermediate data can be produced in figure calculating process, in order to be able to reduce calculate node parsing data type
CPU and memory cost, the present embodiment adopt a kind of Data Serialization and unserializing method based on type checking, by data
Type resolver passes to calculate node together in company with data.Specifically, GraphOperator is by initial data and element type
Detector ObjectInspector is combined into GraphRDD, used as the input data of each nomography operator.When being related to
When Shuffle is operated, each data is serialized using ObjectInspector, and in other calculate node unserializings.
Preferably, with reference to Fig. 5, the equipment 1 also includes the 4th device 14 ' and the 5th device 15 ';4th device
14 ' obtain pending SQL statement;Five device 15 ' parses the SQL statement to call corresponding nomography.
In the prior art, because nomography has the calculating process and a large amount of iterationses of complexity, it is impossible to by SQL realities
It is existing.
And in the present embodiment, using distributed computing framework Spark as computing engines, will in the way of self-defining function
Numerous nomographys are integrated in Hive.It is thus possible to by nomography and other SQL statement organic assemblings, reduce intractability.
Preferably, the 5th device 15 ' registers multiple nomographys using self-defining function, wherein, each nomography pair
Answer a registration function.
For example, it is possible to use (User Defined Table-Generating Function, user is certainly for the UDTF of Hive
Define table generating function) mechanism, the mesh realized class name, reach by SQL statement startup nomography of nomography is registered to Hive
's.Here, UDTF is the interface that Hive makes function by oneself and designs for user's addition, user can pass through the process side of UDTF
Method, obtain a line input, and be converted in a row or multirow output.
But the model of " a line is input into, multirow output " of UDTF can not meet the demand of figure calculating.In the present embodiment,
By adding new process logic on the basis of UDTF so that the data into nomography are complete diagram datas.Here, can be with
Using UDTF interfaces, it is that each nomography registers a function.The present embodiment realizes the Operator operators based on UDTF, from
And solve the demand that figure is calculated.Specifically, a GraphOperator operator is realized first, as all nomography operators
Base class.GraphOperator inherits UDTF interfaces, therefore can pass through FunctionRegistry's
RegisterGenericUDTF methods, different nomographys are registered in Hive.Change Hive's in the present embodiment
TableScanOperator operators and UDTFOperator operators.UDTFOperator operators are calculated from TableScanOperator
The input data for being encapsulated as RDD is obtained at son, and passes to GraphOperator operators.Each inherits GraphOperator
Nomography operator can just have access to complete diagram data.
Compared with prior art, the application first obtains original diagram data, then according to the nomography process original graph number
The corresponding regular diagram data of the nomography is obtained according to this, in order to be adapted to different types of nomography, then by the graphic calculation
The corresponding calculating task of method is distributed to multiple calculate nodes and performs, wherein, in the process of implementation when persistence condition is met, carry out
Persistence is operated, and cuts off data dependence, reduces double counting amount, improves treatment effeciency.Further, the application is to diagram data
Before carrying out converging operationJu Hecaozuo and attended operation, operation is first merged to which, so as to improve operation efficiency, mitigate network transmission pressure
Power.Further, a kind of method that the application adopts Data Serialization and unserializing, in order to the generation in calculating process
Intermediate data is transmitted between calculate node.Further, the application is realized and starts nomography by SQL statement, and is led to
Cross improvement and process logic so that the data into nomography are complete diagram datas.
It should be noted that the application can be carried out in the assembly of software and/or software with hardware, for example, can adopt
Realized with special IC (ASIC), general purpose computer or any other similar hardware device.In one embodiment
In, the software program of the application can pass through computing device to realize steps described above or function.Similarly, the application
Software program (including related data structure) can be stored in computer readable recording medium storing program for performing, for example, RAM memory,
Magnetically or optically driver or floppy disc and similar devices.In addition, some steps or function of the application can employ hardware to realize, example
Such as, as coordinating so as to perform the circuit of each step or function with processor.
In addition, the part of the application can be applied to computer program, such as computer program instructions, when its quilt
When computer is performed, by the operation of the computer, can call or provide according to the present processes and/or technical scheme.
And the programmed instruction of the present processes is called, it is possibly stored in fixed or moveable recording medium, and/or passes through
Data flow in broadcast or other signal bearing medias and be transmitted, and/or be stored according to described program instruction operation
In the working storage of computer equipment.Here, including a device according to one embodiment of the application, the device includes using
In the memorizer and the processor for execute program instructions of storage computer program instructions, wherein, when the computer program refers to
When order is by the computing device, method and/or skill of the plant running based on aforementioned multiple embodiments according to the application are triggered
Art scheme.
It is obvious to a person skilled in the art that the application is not limited to the details of above-mentioned one exemplary embodiment, Er Qie
In the case of without departing substantially from spirit herein or basic feature, the application can be realized in other specific forms.Therefore, no matter
From the point of view of which point, embodiment all should be regarded as exemplary, and be nonrestrictive, scope of the present application is by appended power
Profit is required rather than described above is limited, it is intended that all in the implication and scope of the equivalency of claim by falling
Change is included in the application.Any reference in claim should not be considered as and limit involved claim.This
Outward, it is clear that " including ", a word was not excluded for other units or step, and odd number is not excluded for plural number.That what is stated in device claim is multiple
Unit or device can also be realized by software or hardware by a unit or device.The first, the second grade word is used for table
Show title, and be not offered as any specific order.
It is obvious to a person skilled in the art that the invention is not restricted to the details of above-mentioned one exemplary embodiment, Er Qie
In the case of spirit or essential attributes without departing substantially from the present invention, the present invention can be realized in other specific forms.Therefore, no matter
From the point of view of which point, embodiment all should be regarded as exemplary, and be nonrestrictive, the scope of the present invention is by appended power
Profit is required rather than described above is limited, it is intended that all in the implication and scope of the equivalency of claim by falling
Change is included in the present invention.Any reference in claim should not be considered as and limit involved claim.This
Outward, it is clear that " including ", a word was not excluded for other units or step, and odd number is not excluded for plural number.That what is stated in device claim is multiple
Unit or device can also be realized by software or hardware by a unit or device.The first, the second grade word is used for table
Show title, and be not offered as any specific order.
Claims (24)
1. it is a kind of for distributed figure calculate method, wherein, methods described includes:
A obtains original diagram data;
B processes the original diagram data to obtain the corresponding regular diagram data of the nomography according to nomography;
The nomography corresponding calculating task is distributed to multiple calculate nodes and is performed by c, wherein, work as satisfaction in the process of implementation
Persistence condition, carries out persistence operation.
2. method according to claim 1, wherein, methods described also includes:
Obtain pending SQL statement;
Parse the SQL statement to call corresponding nomography.
3. method according to claim 2, wherein, the parsing SQL statement is calling corresponding nomography to include:
Multiple nomographys are registered using self-defining function, wherein, each nomography one registration function of correspondence.
4. method according to claim 1, wherein, step c also includes:
Converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical.
5. method according to claim 4, wherein, it is described that converging operationJu Hecaozuo is carried out to regular diagram data described in key assignments identical
And attended operation also includes:
Before carrying out the converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
6. method according to claim 1, wherein, step c includes:
When the calculate node obtains intermediate data, unserializing is carried out to intermediate data first, it is anti-according to nomography process
Intermediate data after serializing, then to being serialized according to the intermediate data after nomography process.
7. method according to claim 1, wherein, step b also includes:
The regular diagram data is stored in into distributed file system.
8. method according to claim 1, wherein, step b also includes:
According to the nomography, type checking is carried out to the regular diagram data.
9. method according to claim 1, wherein, step c includes:
Multiple calculate nodes are created for performing the corresponding calculating task of the nomography by resource management framework.
10. method according to claim 1, wherein, step c includes:
The multiple calculate nodes nomography corresponding calculating task being distributed in distributed computing framework are performed.
11. methods according to claim 10, wherein, the persistence condition includes following at least any one:
The calculating of the elasticity distribution formula data set of the distributed computing framework is time-consuming to reach corresponding duration threshold value;
The present dependency length of the elasticity distribution formula data set of the distributed computing framework reaches corresponding length threshold.
12. methods according to any one of claim 1 to 11, wherein, persistence operation includes following at least appointing
One:
Store current result of calculation;
Remove present dependency.
A kind of 13. equipment calculated for distributed figure, wherein, the equipment includes:
First device, for obtaining original diagram data;
Second device, for according to nomography, processing the original diagram data to obtain the corresponding regular figure number of the nomography
According to;
3rd device, performs for the nomography corresponding calculating task is distributed to multiple calculate nodes, wherein, performing
During when persistence condition is met, carry out persistence operation.
14. equipment according to claim 13, wherein, the equipment also includes:
4th device, for obtaining pending SQL statement;
5th device, for parsing the SQL statement to call corresponding nomography.
15. equipment according to claim 14, wherein, the 5th device is used for:
Multiple nomographys are registered using self-defining function, wherein, each nomography one registration function of correspondence.
16. equipment according to claim 13, wherein, the 3rd device is additionally operable to:
Converging operationJu Hecaozuo and attended operation are carried out to regular diagram data described in key assignments identical.
17. equipment according to claim 16, wherein, it is described that polymerization behaviour is carried out to regular diagram data described in key assignments identical
Make and attended operation also includes:
Before carrying out the converging operationJu Hecaozuo, operation is merged to the regular diagram data in each described calculate node.
18. equipment according to claim 13, wherein, the 3rd device is used for:
When the calculate node obtains intermediate data, unserializing is carried out to intermediate data first, it is anti-according to nomography process
Intermediate data after serializing, then to being serialized according to the intermediate data after nomography process.
19. equipment according to claim 13, wherein, the second device is additionally operable to:
The regular diagram data is stored in into distributed file system.
20. equipment according to claim 13, wherein, the second device is additionally operable to:
According to the nomography, type checking is carried out to the regular diagram data.
21. equipment according to claim 13, wherein, the 3rd device is used for:
Multiple calculate nodes are created for performing the corresponding calculating task of the nomography by resource management framework.
22. equipment according to claim 13, wherein, the 3rd device is used for:
The multiple calculate nodes nomography corresponding calculating task being distributed in distributed computing framework are performed.
23. equipment according to claim 22, wherein, the persistence condition includes following at least any one:
The calculating of the elasticity distribution formula data set of the distributed computing framework is time-consuming to reach corresponding duration threshold value;
The present dependency length of the elasticity distribution formula data set of the distributed computing framework reaches corresponding length threshold.
24. equipment according to any one of claim 13 to 23, wherein, persistence operation includes following at least appointing
One:
Store current result of calculation;
Remove present dependency.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610818819.8A CN106611037A (en) | 2016-09-12 | 2016-09-12 | Method and device for distributed diagram calculation |
| PCT/CN2017/080845 WO2018045753A1 (en) | 2016-09-12 | 2017-04-18 | Method and device for distributed graph computation |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610818819.8A CN106611037A (en) | 2016-09-12 | 2016-09-12 | Method and device for distributed diagram calculation |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN106611037A true CN106611037A (en) | 2017-05-03 |
Family
ID=58614973
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610818819.8A Pending CN106611037A (en) | 2016-09-12 | 2016-09-12 | Method and device for distributed diagram calculation |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN106611037A (en) |
| WO (1) | WO2018045753A1 (en) |
Cited By (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107729523A (en) * | 2017-10-27 | 2018-02-23 | 平安科技(深圳)有限公司 | Data service method, electronic installation and storage medium |
| CN109189732A (en) * | 2018-08-03 | 2019-01-11 | 成都四方伟业软件股份有限公司 | A kind of median analysis method and device |
| CN110427359A (en) * | 2019-06-27 | 2019-11-08 | 苏州浪潮智能科技有限公司 | A kind of diagram data treating method and apparatus |
| CN110688610A (en) * | 2019-09-27 | 2020-01-14 | 支付宝(杭州)信息技术有限公司 | Weight calculation method and device for graph data and electronic equipment |
| CN111211993A (en) * | 2018-11-21 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Incremental persistence method and device for streaming computation |
| CN111475684A (en) * | 2020-06-29 | 2020-07-31 | 北京一流科技有限公司 | Data processing network system and calculation chart generation method thereof |
| CN111935026A (en) * | 2020-08-07 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Data transmission method, device, processing equipment and medium |
| WO2021012497A1 (en) * | 2019-07-22 | 2021-01-28 | 平安科技(深圳)有限公司 | Method, apparatus and device for storing categorical variables for graph calculation, and storage medium |
| CN113495679A (en) * | 2020-04-01 | 2021-10-12 | 孟彤 | Optimization method for large data storage access and processing based on nonvolatile storage medium |
| CN113626207A (en) * | 2021-10-12 | 2021-11-09 | 苍穹数码技术股份有限公司 | Map data processing method, device, equipment and storage medium |
| WO2023083234A1 (en) * | 2021-11-11 | 2023-05-19 | 支付宝(杭州)信息技术有限公司 | Graph state data management |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109918199B (en) * | 2019-02-28 | 2023-06-16 | 中国科学技术大学苏州研究院 | GPU-based distributed graph processing system |
| CN111367936B (en) * | 2020-02-28 | 2023-08-22 | 中国工商银行股份有限公司 | Offline verification method and device for structured query language grammar |
| CN114925123B (en) * | 2022-04-24 | 2024-06-07 | 杭州悦数科技有限公司 | Data transmission method between distributed graph database and graph computing system |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103336808A (en) * | 2013-06-25 | 2013-10-02 | 中国科学院信息工程研究所 | System and method for real-time graph data processing based on BSP (Board Support Package) model |
| CN103793442A (en) * | 2012-11-05 | 2014-05-14 | 北京超图软件股份有限公司 | Spatial data processing method and system |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102591709B (en) * | 2011-12-20 | 2014-03-12 | 南京大学 | Shapefile master-slave type parallel writing method based on OGR (open geospatial rule) |
| CN103970604B (en) * | 2013-01-31 | 2017-05-03 | 国际商业机器公司 | Method and device for realizing image processing based on MapReduce framework |
| CN104978228B (en) * | 2014-04-09 | 2019-08-30 | 腾讯科技(深圳)有限公司 | A kind of dispatching method and device of distributed computing system |
| CN105335135B (en) * | 2014-07-14 | 2019-01-08 | 华为技术有限公司 | Data processing method and central node |
-
2016
- 2016-09-12 CN CN201610818819.8A patent/CN106611037A/en active Pending
-
2017
- 2017-04-18 WO PCT/CN2017/080845 patent/WO2018045753A1/en not_active Ceased
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103793442A (en) * | 2012-11-05 | 2014-05-14 | 北京超图软件股份有限公司 | Spatial data processing method and system |
| CN103336808A (en) * | 2013-06-25 | 2013-10-02 | 中国科学院信息工程研究所 | System and method for real-time graph data processing based on BSP (Board Support Package) model |
Cited By (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107729523A (en) * | 2017-10-27 | 2018-02-23 | 平安科技(深圳)有限公司 | Data service method, electronic installation and storage medium |
| CN109189732A (en) * | 2018-08-03 | 2019-01-11 | 成都四方伟业软件股份有限公司 | A kind of median analysis method and device |
| CN111211993B (en) * | 2018-11-21 | 2023-08-11 | 百度在线网络技术(北京)有限公司 | Incremental persistence method, device and storage medium for stream computation |
| CN111211993A (en) * | 2018-11-21 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Incremental persistence method and device for streaming computation |
| CN110427359A (en) * | 2019-06-27 | 2019-11-08 | 苏州浪潮智能科技有限公司 | A kind of diagram data treating method and apparatus |
| WO2021012497A1 (en) * | 2019-07-22 | 2021-01-28 | 平安科技(深圳)有限公司 | Method, apparatus and device for storing categorical variables for graph calculation, and storage medium |
| CN110688610B (en) * | 2019-09-27 | 2023-05-09 | 支付宝(杭州)信息技术有限公司 | Weight calculation method and device for graph data and electronic equipment |
| CN110688610A (en) * | 2019-09-27 | 2020-01-14 | 支付宝(杭州)信息技术有限公司 | Weight calculation method and device for graph data and electronic equipment |
| CN113495679A (en) * | 2020-04-01 | 2021-10-12 | 孟彤 | Optimization method for large data storage access and processing based on nonvolatile storage medium |
| CN113495679B (en) * | 2020-04-01 | 2022-10-21 | 北京大学 | Optimization method of big data storage access and processing based on non-volatile storage medium |
| CN111475684B (en) * | 2020-06-29 | 2020-09-22 | 北京一流科技有限公司 | Data processing network system and calculation chart generation method thereof |
| CN111475684A (en) * | 2020-06-29 | 2020-07-31 | 北京一流科技有限公司 | Data processing network system and calculation chart generation method thereof |
| CN111935026A (en) * | 2020-08-07 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Data transmission method, device, processing equipment and medium |
| CN111935026B (en) * | 2020-08-07 | 2024-02-13 | 腾讯科技(深圳)有限公司 | Data transmission method, device, processing equipment and medium |
| CN113626207A (en) * | 2021-10-12 | 2021-11-09 | 苍穹数码技术股份有限公司 | Map data processing method, device, equipment and storage medium |
| WO2023083234A1 (en) * | 2021-11-11 | 2023-05-19 | 支付宝(杭州)信息技术有限公司 | Graph state data management |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2018045753A1 (en) | 2018-03-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106611037A (en) | Method and device for distributed diagram calculation | |
| US11797558B2 (en) | Generating data transformation workflows | |
| US11907216B2 (en) | Multi-language fusion query method and multi-model database system | |
| US11681702B2 (en) | Conversion of model views into relational models | |
| CN106897322B (en) | A kind of access method and device of database and file system | |
| JP7170638B2 (en) | Generating, Accessing, and Displaying Lineage Metadata | |
| JP6144700B2 (en) | Scalable analysis platform for semi-structured data | |
| EP3740880A1 (en) | Pick and applicator for use with a stringed instrument | |
| CN105930479A (en) | Data skew processing method and apparatus | |
| JP2014502762A (en) | Filtering query data in the data store | |
| Whitby et al. | Geowave: Utilizing distributed key-value stores for multidimensional data | |
| US20190310978A1 (en) | Supporting a join operation against multiple nosql databases | |
| CN108037967A (en) | A kind of menu loading method and electronic equipment based on more parent-child structures | |
| EP3293644B1 (en) | Loading data for iterative evaluation through simd registers | |
| CN117668050A (en) | Cross-data-source hybrid engine query method, system, equipment and medium | |
| US9158796B1 (en) | Data source modeling methods for heterogeneous data sources and related computer program products and systems | |
| EP3293645A1 (en) | Iterative evaluation of data through simd processor registers | |
| Barkhordari et al. | ScadiBino: an effective MapReduce-based association rule mining method | |
| US9052956B2 (en) | Selecting execution environments | |
| CN116762067B (en) | Near Data Processing (NDP) in a network node | |
| US10255316B2 (en) | Processing of data chunks using a database calculation engine | |
| US10713244B2 (en) | Calculation engine optimizations for join operations utilizing automatic detection of forced constraints | |
| EP2590089A1 (en) | Rule type columns in database | |
| US9262492B2 (en) | Dividing and combining operations | |
| US12189717B1 (en) | Automatic partitioning of machine learning models for training across multiple devices |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170503 |
|
| RJ01 | Rejection of invention patent application after publication |