
CN104580503A - Efficient dynamic load balancing system and method for processing large-scale data - Google Patents

Efficient dynamic load balancing system and method for processing large-scale data

Info

Publication number
CN104580503A
CN104580503A
Authority
CN
China
Prior art keywords
node
central control
computing cluster
data
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510037687.0A
Other languages
Chinese (zh)
Inventor
高永虎
张清
张广勇
沈铂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IEIT Systems Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201510037687.0A priority Critical patent/CN104580503A/en
Publication of CN104580503A publication Critical patent/CN104580503A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1029Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers using data related to the state of servers by a load balancer

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a system and method for processing large-scale data with efficient dynamic load balancing, belonging to the technical field of large-scale data processing. The system comprises a central control system, a computing cluster system, a storage system, and a high-speed network. Nodes in the central control system adopt a hybrid heterogeneous CPU-GPU architecture; nodes in the computing cluster system adopt either a hybrid heterogeneous CPU-GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage: shared-storage nodes adopt a CPU architecture, while local storage holds the data of the central-control or compute node in which it resides. The high-speed network interconnects the central-control nodes, the compute nodes, and the shared-storage nodes to form a centralized system for processing large-scale data. The invention solves the problem that current server computing systems cannot process larger-scale data owing to insufficient network bandwidth and limited memory capacity.

Description

A system and method for processing large-scale data with efficient dynamic load balancing

Technical Field

The present invention relates to the technical field of large-scale data processing, and in particular to a system and method for processing large-scale data with efficient dynamic load balancing.

Background Art

With today's explosion of data, the volume of information keeps growing and the demands on data-processing capability keep rising. High-performance computing is needed not only in oil exploration, weather forecasting, aerospace and defense, and scientific research; demand is also growing rapidly in broader fields such as finance, government informatization, education, enterprise, and online gaming.

Computing speed is particularly important for high-performance computing, which is moving toward multi-core and many-core processors and uses heterogeneous parallelism to accelerate applications. CPU+GPU is by now a mature heterogeneous collaborative computing model, well suited to highly parallel applications and algorithms. However, because the data volumes of some applications remain very large, adding hardware to a single server can no longer meet demand, being limited by network bandwidth and system memory. A method is therefore needed that can process large-scale data on the existing, limited hardware.

Summary of the Invention

The technical task of the present invention is to provide a system and method for processing large-scale data with efficient dynamic load balancing: a CPU+GPU hybrid heterogeneous cluster system with dynamic load balancing that fully exploits device performance, greatly improves the efficiency of the whole system, and solves the problem that current server computing systems cannot process larger-scale data owing to insufficient network bandwidth and limited memory capacity.

The technical task of the present invention is achieved in the following manner.

A system for processing large-scale data with efficient dynamic load balancing is a CPU-GPU hybrid heterogeneous cluster system comprising a central control system, a computing cluster system, a storage system, and a high-speed network. Nodes in the central control system adopt a hybrid heterogeneous CPU-GPU architecture; nodes in the computing cluster system adopt either a hybrid heterogeneous CPU-GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage: shared-storage nodes adopt a CPU architecture, and local storage is provided in the node of the central control system and in every node of the computing cluster system. Shared storage is further divided into primary storage and backup storage, which act as redundant copies holding identical computing data; local storage holds the data of the central-control or compute node in which it resides. The high-speed network interconnects the central-control nodes, the compute nodes, and the shared-storage nodes to form a centralized system for processing large-scale data.
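
Purely as an illustration (the patent does not prescribe any implementation or naming), the topology described above could be modeled with data structures along the following lines; every class and field name here is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ComputeNode:
    """A compute node: CPU-only or CPU+GPU hybrid, with its own local storage."""
    name: str
    gpu_count: int = 0                      # 0 means a CPU-only node
    local_storage_path: str = "/local/cache"

@dataclass
class SharedStorage:
    """Shared storage holding redundant primary and backup copies of the data."""
    primary_path: str
    backup_path: str

@dataclass
class ControlNode:
    """The single central-control node (CPU+GPU hybrid) that drives the cluster."""
    name: str
    local_storage_path: str = "/local/cache"

@dataclass
class Cluster:
    controller: ControlNode
    compute_nodes: List[ComputeNode] = field(default_factory=list)
    shared_storage: Optional[SharedStorage] = None
```

Embodiment 1 below would then correspond to one ControlNode, two ComputeNode instances, and one SharedStorage with a primary and a backup path.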

In this system, the node of the central control system controls the nodes of the computing cluster system and the nodes of the storage system.

The central control system is provided with one node, the shared storage with at least one node, and the computing cluster system with at least two nodes.

A method for processing large-scale data with efficient dynamic load balancing uses any of the above systems to process large-scale data and comprises the following steps:

(1) The central-control node is interconnected with all compute nodes through the high-speed network; it controls each compute node, dynamically assigns computing tasks to the compute nodes, and receives the results they return.

(2) The compute nodes and the shared-storage nodes are interconnected through the high-speed network, as are the central-control node and the shared-storage nodes; the shared-storage nodes send computing-task data to the compute nodes on command from the central-control node.

(3) The compute nodes are responsible for the computing tasks; each compute node contains multiple GPU processors of the same model, which raises the parallelism of the computation and the computing power of a single node, while identical GPUs make it easy to partition the tasks.

(4) The local storage inside the central-control node or a compute node caches the data needed locally.

(5) The shared-storage nodes store the computing data and result data required by the compute nodes and send the computing data to them over the high-speed network; at the same time, the shared-storage nodes use a primary-plus-backup storage scheme, which guarantees data safety.

The central-control node collects the computing-capability information of all compute nodes, dynamically partitions the computing data, and orders the shared-storage nodes to send the data to the selected compute nodes. On that command, the shared-storage nodes first divide the computing data into data blocks and then dynamically send different numbers of blocks to the corresponding compute nodes. The compute nodes receive the data sent by the shared-storage nodes and transmit their result data to the central-control node, which processes the received results uniformly and stores them back to the shared-storage nodes.
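
As a minimal sketch of how such capability-proportional assignment could be computed (the weighting and all names below are assumptions, not taken from the patent), the number of blocks per compute node might be derived like this:

```python
def allocate_blocks(total_blocks: int, capability: dict) -> dict:
    """Split total_blocks among nodes in proportion to a capability score.

    capability maps node name -> a positive score (for example GPU count
    times per-GPU throughput); leftover blocks go to the strongest nodes first.
    """
    total_score = sum(capability.values())
    shares = {node: int(total_blocks * score / total_score)
              for node, score in capability.items()}
    leftover = total_blocks - sum(shares.values())
    for node in sorted(capability, key=capability.get, reverse=True)[:leftover]:
        shares[node] += 1          # hand out remaining blocks one by one
    return shares

# Example: 100 blocks over a 4-GPU node, a 2-GPU node, and a CPU-only node.
print(allocate_blocks(100, {"node1": 4.0, "node2": 2.0, "node3": 0.5}))
```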

While receiving the next data block, a compute node computes the current block and simultaneously sends the previous block whose computation has already finished.
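
One way to realize this three-way overlap is a small software pipeline. The sketch below uses Python threads purely for illustration; receive_block, compute_block, and send_result are hypothetical stand-ins for the network transfers and GPU work described above.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(block_ids, receive_block, compute_block, send_result):
    """Overlap receiving block i+1, computing block i, and sending result i-1."""
    with ThreadPoolExecutor(max_workers=2) as io:       # one receive + one send in flight
        current = receive_block(block_ids[0])           # prefetch the first block
        pending_send = None
        for i in range(len(block_ids)):
            nxt = (io.submit(receive_block, block_ids[i + 1])
                   if i + 1 < len(block_ids) else None)     # start the next receive
            result = compute_block(current)                 # compute the current block
            if pending_send is not None:
                pending_send.result()                       # previous send finished?
            pending_send = io.submit(send_result, result)   # send this result asynchronously
            if nxt is not None:
                current = nxt.result()                      # wait for the next block
        if pending_send is not None:
            pending_send.result()                           # drain the final send
```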

In the method for processing large-scale data with efficient dynamic load balancing, the workflow of the system for processing large-scale data is as follows:

① The central-control node collects the number of GPU cards of each compute node, generates the computing-capability information of every compute node from its card count, and sends this capability information to the shared-storage nodes. The capability information comprises the number of GPU cards of each compute node, the communication capacity of the high-speed network, and the computing power of the GPU cards.

② According to the capability information sent by the central-control node, the shared-storage nodes first divide the data into suitably sized, sendable basic data blocks, then assign a corresponding number of blocks to each compute node and dynamically send the blocks to it.

③ While a compute node receives data and computes, if data arrives faster than it can be computed, the data may be buffered temporarily in local storage; when no data is being transmitted, data is fetched from local storage; if none is available locally either, the node waits.

④ As soon as a compute node finishes a data block it can send the result to the central-control node; if the link is busy, the result is buffered in local storage and sent once the network is idle.

⑤ The central-control node performs the necessary processing on the results received from the compute nodes and then sends them to the shared-storage nodes; during computation it periodically collects the essential information of the compute nodes, caches it in local storage, and stores it to the shared-storage nodes.
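
Putting steps ① through ⑤ together, the central-control node's loop might look like the sketch below. The method names on the node and storage objects are placeholders for messages exchanged over the high-speed network; none of them come from the patent.

```python
import time

def run_controller(compute_nodes, storage, collect_interval=60.0):
    """Illustrative controller loop for workflow steps 1-5 (all APIs assumed)."""
    # Step 1: collect GPU counts and send the derived capability information.
    capability = {node.name: node.report_gpu_count() for node in compute_nodes}
    storage.send_capability_info(capability)

    # Step 2 runs on the shared-storage side: it splits the data into blocks
    # and dispatches them to the compute nodes in proportion to capability.
    storage.dispatch_blocks()

    results, last_collect = [], time.monotonic()
    while not storage.all_blocks_done():
        # Steps 3-4: compute nodes stream back results as blocks complete.
        for node in compute_nodes:
            results.extend(node.drain_results())

        # Step 5: periodically cache node state locally and persist it
        # to shared storage for fault tolerance.
        if time.monotonic() - last_collect >= collect_interval:
            storage.store_node_state({n.name: n.status() for n in compute_nodes})
            last_collect = time.monotonic()

    storage.store_results(process(results))        # unified post-processing

def process(results):
    """Placeholder for the 'necessary processing' of step 5."""
    return results
```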

The system and method for processing large-scale data with efficient dynamic load balancing of the present invention have the following advantages:

1. The cluster is deployed on existing hardware according to the volume of data to be processed, and the dynamic load-balancing method adapts itself to the system platform, making the system reliable and efficient.

2. The load-balancing method adapts to the cluster system, which may be a complex hybrid cluster composed of one or more subsystems: a pure CPU cluster subsystem and/or a CPU+GPU heterogeneous cluster subsystem.

3. Dynamic load balancing is achieved among the compute nodes of the cluster, giving high utilization of the system's equipment; different computing devices such as CPUs and GPUs balance their computation and do not wait for one another, no computing device sits idle, and the whole cluster runs efficiently.

4. Under the constraints of existing hardware, such as limited memory capacity and network bandwidth, transmitting data in blocks and processing computation asynchronously make it possible to handle large-scale data effectively.

5. Running the system achieves high performance: the load-balancing method dynamically assigns different computing tasks to different compute nodes according to the computing power of each node and of the devices within it, so task partitioning is not fixed and the efficiency of dynamic load balancing improves.

6. According to the characteristics of the application algorithm and the differing computing power of CPU and GPU devices, the computing tasks each device dynamically fetches should be set differently.

7. Extended to multiple servers, the invention can process larger-scale data and balance the computing load among the nodes of the cluster and among the devices within a node, thereby making the greatest possible use of the performance of existing equipment, improving overall system efficiency, and greatly shortening program running time.

Brief Description of the Drawings

The present invention is further described below with reference to the accompanying drawings.

Figure 1 is a schematic structural block diagram of a system for processing large-scale data with efficient dynamic load balancing;

Figure 2 is a schematic block diagram of the communication among the nodes of a system for processing large-scale data with efficient dynamic load balancing.

Detailed Description of Embodiments

The system and method for processing large-scale data with efficient dynamic load balancing of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.

Embodiment 1:

The system for processing large-scale data with efficient dynamic load balancing of the present invention is a CPU-GPU hybrid heterogeneous cluster system comprising a central control system, a computing cluster system, a storage system, and a high-speed network. Nodes in the central control system adopt a hybrid heterogeneous CPU-GPU architecture; nodes in the computing cluster system adopt either a hybrid heterogeneous CPU-GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage: shared-storage nodes adopt a CPU architecture, and local storage is provided in the node of the central control system and in every node of the computing cluster system. Shared storage is further divided into primary storage and backup storage, which act as redundant copies holding identical computing data; local storage holds the data of the central-control or compute node in which it resides. The high-speed network interconnects the central-control nodes, the compute nodes, and the shared-storage nodes to form a centralized system for processing large-scale data.

The node of the central control system controls the nodes of the computing cluster system and the nodes of the storage system.

The central control system is provided with one node, the shared storage with one node, and the computing cluster system with two nodes.

Embodiment 2:

The system for processing large-scale data with efficient dynamic load balancing of the present invention is a CPU-GPU hybrid heterogeneous cluster system comprising a central control system, a computing cluster system, a storage system, and a high-speed network. Nodes in the central control system adopt a hybrid heterogeneous CPU-GPU architecture; nodes in the computing cluster system adopt either a hybrid heterogeneous CPU-GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage: shared-storage nodes adopt a CPU architecture, and local storage is provided in the node of the central control system and in every node of the computing cluster system. Shared storage is further divided into primary storage and backup storage, which act as redundant copies holding identical computing data; local storage holds the data of the central-control or compute node in which it resides. The high-speed network interconnects the central-control nodes, the compute nodes, and the shared-storage nodes to form a centralized system for processing large-scale data.

The node of the central control system controls the nodes of the computing cluster system and the nodes of the storage system.

The central control system is provided with one node, the shared storage with two nodes, and the computing cluster system with five nodes.

Embodiment 3:

The method for processing large-scale data with efficient dynamic load balancing of the present invention uses any of the above systems to process large-scale data and comprises the following steps:

(1) The central-control node is interconnected with all compute nodes through the high-speed network; it controls each compute node, dynamically assigns computing tasks to the compute nodes, and receives the results they return.

(2) The compute nodes and the shared-storage nodes are interconnected through the high-speed network, as are the central-control node and the shared-storage nodes; the shared-storage nodes send computing-task data to the compute nodes on command from the central-control node.

(3) The compute nodes are responsible for the computing tasks; each compute node contains multiple GPU processors of the same model, which raises the parallelism of the computation and the computing power of a single node, while identical GPUs make it easy to partition the tasks.

(4) The local storage inside the central-control node or a compute node caches the data needed locally.

(5) The shared-storage nodes store the computing data and result data required by the compute nodes and send the computing data to them over the high-speed network; at the same time, the shared-storage nodes use a primary-plus-backup storage scheme, which guarantees data safety.

The central-control node collects the computing-capability information of all compute nodes, dynamically partitions the computing data, and orders the shared-storage nodes to send the data to the selected compute nodes. On that command, the shared-storage nodes first divide the computing data into data blocks and then dynamically send different numbers of blocks to the corresponding compute nodes. The compute nodes receive the data sent by the shared-storage nodes and transmit their result data to the central-control node, which processes the received results uniformly and stores them back to the shared-storage nodes.

While receiving the next data block, a compute node computes the current block and simultaneously sends the previous block whose computation has already finished.

A centralized control and storage scheme is used: the large data set is divided into blocks that the shared-storage nodes dynamically distribute to the compute nodes. With computation and communication running asynchronously, the central-control node must partition the data into appropriately sized blocks according to the communication capacity of the network and the computing power of the compute nodes, so that computation and transfer mask each other and optimal performance is reached. The asynchronous overlap of transfer and computation not only shortens computing time; because the data is processed block by block, even hardware with little memory and low bandwidth can be applied to large-scale computation.
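
To illustrate the "computation masks transfer" condition, block planning can bound the block size by the node memory needed for the in-flight blocks and check whether the per-block compute time covers the per-block transfer time. The simple linear cost model and every parameter name below are assumptions made for the sake of the sketch.

```python
def plan_blocks(total_bytes, free_mem_bytes, net_bytes_per_s, gpu_bytes_per_s,
                inflight=3, min_block=4 << 20):
    """Pick a block size for the receive/compute/send pipeline.

    inflight: blocks resident at once (one receiving, one computing, one sending).
    Returns (block_size, num_blocks, transfer_masked).
    """
    # Upper bound: all in-flight blocks must fit in the node's free memory.
    block = max(min_block, min(total_bytes, free_mem_bytes // inflight))
    num_blocks = -(-total_bytes // block)          # ceiling division

    # Per-block times under a linear cost model; the transfer is hidden
    # whenever a block takes at least as long to compute as to transfer.
    t_transfer = block / net_bytes_per_s
    t_compute = block / gpu_bytes_per_s
    return block, num_blocks, t_compute >= t_transfer
```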

Embodiment 4:

In the method for processing large-scale data with efficient dynamic load balancing of the present invention, the workflow of the system for processing large-scale data is:

① The central-control node collects the number of GPU cards of each compute node, generates the computing-capability information of every compute node from its card count, and sends this capability information to the shared-storage nodes. The capability information comprises the number of GPU cards of each compute node, the communication capacity of the high-speed network, and the computing power of the GPU cards.

② According to the capability information sent by the central-control node, the shared-storage nodes first divide the data into suitably sized, sendable basic data blocks, then assign a corresponding number of blocks to each compute node and dynamically send the blocks to it.

③ While a compute node receives data and computes, if data arrives faster than it can be computed, the data may be buffered temporarily in local storage; when no data is being transmitted, data is fetched from local storage; if none is available locally either, the node waits.

④ As soon as a compute node finishes a data block it can send the result to the central-control node; if the link is busy, the result is buffered in local storage and sent once the network is idle.

⑤ The central-control node performs the necessary processing on the results received from the compute nodes and then sends them to the shared-storage nodes; during computation it periodically collects the essential information of the compute nodes, caches it in local storage, and stores it to the shared-storage nodes.

In the system for processing large-scale data, computation and communication proceed fully asynchronously: as soon as a data block has been computed, its result can be sent into the network. Hiding communication behind computation shortens the total computing time of the system, and processing the data in blocks allows the system to handle large data even where network bandwidth is low and storage space is limited. Periodic backups provide fault tolerance, effectively preventing a system crash when a node in the system is interrupted.
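
A minimal sketch of such periodic, redundantly stored backups follows, writing the same checkpoint to both the primary and the backup shared storage and restoring from whichever copy is readable; the file layout and names are assumptions, not part of the patent.

```python
import json, pathlib, time

def backup_state(state, primary_dir, backup_dir):
    """Write the same checkpoint to the primary and the backup shared storage."""
    payload = json.dumps({"timestamp": time.time(), "state": state})
    for directory in (primary_dir, backup_dir):
        path = pathlib.Path(directory) / "checkpoint.json"
        tmp = path.with_suffix(".tmp")
        tmp.write_text(payload)
        tmp.replace(path)                          # atomic swap of the checkpoint file

def restore_state(primary_dir, backup_dir):
    """Recover from the primary copy, falling back to the backup copy."""
    for directory in (primary_dir, backup_dir):
        path = pathlib.Path(directory) / "checkpoint.json"
        try:
            return json.loads(path.read_text())["state"]
        except (OSError, ValueError):
            continue                               # this copy is missing or corrupt
    return None                                    # nothing to restore
```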

From the above detailed description, those skilled in the art can readily implement the present invention. It should be understood, however, that the invention is not limited to the specific embodiments described above; on the basis of the disclosed embodiments, those skilled in the art may freely combine different technical features to obtain different technical solutions.

Claims (7)

1. A system for processing large-scale data with efficient dynamic load balancing, characterized in that it is a CPU-GPU hybrid heterogeneous cluster system comprising a central control system, a computing cluster system, a storage system, and a high-speed network; nodes in the central control system adopt a hybrid heterogeneous CPU-GPU architecture; nodes in the computing cluster system adopt either a hybrid heterogeneous CPU-GPU architecture or a CPU architecture; the storage system is divided into shared storage and local storage, the nodes of the shared storage adopt a CPU architecture, the local storage is provided in the node of the central control system and in each node of the computing cluster system, the shared storage is divided into primary storage and backup storage, the primary storage and the backup storage serve as redundant storage and hold identical computing data, and the local storage stores the data of the central-control-system node or computing-cluster-system node in which it resides; the high-speed network interconnects the nodes of the central control system, the nodes of the computing cluster system, and the nodes of the shared storage to form a centralized system for processing large-scale data.
2. The system for processing large-scale data with efficient dynamic load balancing according to claim 1, characterized in that the node of the central control system controls the nodes of the computing cluster system and the nodes of the storage system.
3. The system for processing large-scale data with efficient dynamic load balancing according to claim 1, characterized in that the central control system is provided with one node, the shared storage is provided with at least one node, and the computing cluster system is provided with at least two nodes.
4. A method for processing large-scale data with efficient dynamic load balancing, characterized in that it uses the system for processing large-scale data according to any one of claims 1 to 3 to process large-scale data, and comprises the following steps:
(1) the node of the central control system is interconnected with all nodes of the computing cluster system through the high-speed network, the central-control node controls each compute node, dynamically assigns computing tasks to the compute nodes, and receives the results returned by the compute nodes;
(2) the compute nodes and the shared-storage nodes are interconnected through the high-speed network, and the central-control node and the shared-storage nodes are interconnected through the high-speed network; the shared-storage nodes send computing-task data to the compute nodes according to the commands of the central-control node;
(3) the compute nodes are responsible for the computing tasks, and each compute node has multiple GPU processors of the same model for computation;
(4) the local storage in the node of the central control system or in a node of the computing cluster system caches the data needed locally;
(5) the shared-storage nodes store the computing data and result data required by the compute nodes and send the computing data to the compute nodes through the high-speed network; at the same time, the shared-storage nodes use a primary-plus-backup storage scheme.
5. The method for processing large-scale data with efficient dynamic load balancing according to claim 4, characterized in that the central-control node collects the computing-capability information of all compute nodes, dynamically partitions the computing data, and orders the shared-storage nodes to send the computing data to the selected compute nodes; according to the commands of the central-control node, the shared-storage nodes first divide the computing data into data blocks and then dynamically send different numbers of data blocks to the corresponding compute nodes; the compute nodes receive the computing data sent by the shared-storage nodes and transmit the result data to the central-control node, which processes the received results uniformly and stores them to the shared-storage nodes.
6. The method for processing large-scale data with efficient dynamic load balancing according to claim 5, characterized in that, while receiving the next data block, a compute node computes the current data block and simultaneously sends the previous data block whose computation has been completed.
7. The method for processing large-scale data with efficient dynamic load balancing according to claim 4, characterized in that the workflow of the system for processing large-scale data is:
① the central-control node collects the number of GPU cards of each compute node, generates the computing-capability information of each compute node according to its card count, and sends this capability information to the shared-storage nodes; the capability information comprises the number of GPU cards of each compute node, the communication capacity of the high-speed network, and the computing power of the GPU cards;
② according to the capability information sent by the central-control node, the shared-storage nodes first divide the data into suitable basic data blocks that can be sent, then assign a corresponding number of data blocks to each compute node, and dynamically send the blocks to the compute nodes;
③ while a compute node receives data and computes, if data is transmitted faster than it is computed, the computing data may be stored temporarily in the local storage; when no data is being transmitted, data is fetched from the local storage; if none is available locally either, the node waits;
④ as soon as a compute node finishes computing a data block it can send the result to the central-control node; if the link is busy, the data may be stored temporarily in the local storage and sent to the central-control node when the network is idle;
⑤ the central-control node performs the necessary processing on the results received from the compute nodes and then sends them to the shared-storage nodes; during computation the central-control node periodically collects the essential information of the compute nodes, caches it in the local storage, and stores it to the shared-storage nodes.
CN201510037687.0A 2015-01-26 2015-01-26 Efficient dynamic load balancing system and method for processing large-scale data Pending CN104580503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510037687.0A CN104580503A (en) 2015-01-26 2015-01-26 Efficient dynamic load balancing system and method for processing large-scale data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510037687.0A CN104580503A (en) 2015-01-26 2015-01-26 Efficient dynamic load balancing system and method for processing large-scale data

Publications (1)

Publication Number Publication Date
CN104580503A true CN104580503A (en) 2015-04-29

Family

ID=53095660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510037687.0A Pending CN104580503A (en) 2015-01-26 2015-01-26 Efficient dynamic load balancing system and method for processing large-scale data

Country Status (1)

Country Link
CN (1) CN104580503A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050243094A1 (en) * 2004-05-03 2005-11-03 Microsoft Corporation Systems and methods for providing an enhanced graphics pipeline
CN101751376A (en) * 2009-12-30 2010-06-23 中国人民解放军国防科学技术大学 Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set
CN104301434A (en) * 2014-10-31 2015-01-21 浪潮(北京)电子信息产业有限公司 A cluster-based high-speed communication architecture and method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897148A (en) * 2017-02-28 2017-06-27 郑州云海信息技术有限公司 A kind of system and method for generating micro-downburst
CN107920080A (en) * 2017-11-22 2018-04-17 郑州云海信息技术有限公司 A kind of characteristic acquisition method and system
CN108989398A (en) * 2018-06-27 2018-12-11 郑州云海信息技术有限公司 A kind of virtual shared memory cell and the cluster storage system based on cloud storage
CN108989398B (en) * 2018-06-27 2021-02-02 苏州浪潮智能科技有限公司 A virtual shared storage unit and a cloud storage-based cluster storage system
CN109343791A (en) * 2018-08-16 2019-02-15 武汉元鼎创天信息科技有限公司 A kind of big data all-in-one machine
CN109343791B (en) * 2018-08-16 2021-11-09 武汉元鼎创天信息科技有限公司 Big data all-in-one
CN110333945A (en) * 2019-05-09 2019-10-15 成都信息工程大学 A dynamic load balancing method, system and terminal
CN112511576A (en) * 2019-09-16 2021-03-16 触景无限科技(北京)有限公司 Internet of things data processing system and data processing method
CN113225362A (en) * 2020-02-06 2021-08-06 北京京东振世信息技术有限公司 Server cluster system and implementation method thereof
CN113225362B (en) * 2020-02-06 2024-04-05 北京京东振世信息技术有限公司 Server cluster system and implementation method thereof
CN113094183A (en) * 2021-06-09 2021-07-09 苏州浪潮智能科技有限公司 Training task creating method, device, system and medium of AI (Artificial Intelligence) training platform
CN113094183B (en) * 2021-06-09 2021-09-17 苏州浪潮智能科技有限公司 Training task creating method, device, system and medium of AI (Artificial Intelligence) training platform

Similar Documents

Publication Publication Date Title
CN104580503A (en) Efficient dynamic load balancing system and method for processing large-scale data
CN105159610B (en) Large-scale data processing system and method
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
CN106502792B (en) A Multi-tenant Resource Optimal Scheduling Method for Different Types of Loads
CN108563808B (en) Design Method of Heterogeneous Reconfigurable Graph Computation Accelerator System Based on FPGA
CN107122244B (en) Multi-GPU-based graph data processing system and method
CN101778002B (en) Large-scale cluster system and building method thereof
CN111221624A (en) A container management method for regulating cloud platform based on Docker container technology
CN102929718A (en) Distributed GPU (graphics processing unit) computer system based on task scheduling
CN104239555A (en) MPP (massively parallel processing)-based parallel data mining framework and MPP-based parallel data mining method
CN102135949A (en) Computing network system, method and device based on graphic processing unit
CN103336756B (en) A kind of generating apparatus of data computational node
CN103595780A (en) Cloud computing resource scheduling method based on repeat removing
CN114424174A (en) Parameter caching for neural network accelerators
CN104811503A (en) R statistical modeling system
CN104375882A (en) Multistage nested data drive calculation method matched with high-performance computer structure
CN109254846A (en) The dynamic dispatching method and system of CPU and GPU cooperated computing based on two-level scheduler
CN104618406A (en) Load balancing algorithm based on naive Bayesian classification
CN107463448A (en) A kind of deep learning weight renewing method and system
CN116991590A (en) Resource decoupling system, execution method and device for deep learning applications
CN107301094A (en) The dynamic self-adapting data model inquired about towards extensive dynamic transaction
CN107197039A (en) A kind of PAAS platform service bag distribution methods and system based on CDN
Narantuya et al. Multi-agent deep reinforcement learning-based resource allocation in hpc/ai converged cluster
Li et al. Collm: A collaborative llm inference framework for resource-constrained devices
Li et al. Data prefetching and file synchronizing for performance optimization in Hadoop-based hybrid cloud

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150429

WD01 Invention patent application deemed withdrawn after publication