CN104580503A - Efficient dynamic load balancing system and method for processing large-scale data - Google Patents
Efficient dynamic load balancing system and method for processing large-scale data
- Publication number
- CN104580503A
- Authority
- CN
- China
- Prior art keywords
- node
- central control
- computing cluster
- data
- storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1001—Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
- H04L67/1004—Server selection for load balancing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1001—Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
- H04L67/1029—Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers using data related to the state of servers by a load balancer
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a system and method for processing large-scale data with efficient dynamic load balancing, in the technical field of large-scale data processing. The system comprises a central control system, a computing cluster system, a storage system, and a high-speed network. Nodes in the central control system use a hybrid heterogeneous CPU+GPU architecture; nodes in the computing cluster system use either a hybrid heterogeneous CPU+GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage: shared storage nodes use a CPU architecture, and local storage holds the data of the central control node or computing cluster node on which it resides. The high-speed network interconnects the central control nodes, the computing cluster nodes, and the shared storage nodes to form a centralized system for processing large-scale data. The invention addresses the problem that current server computing systems cannot process large-scale data because of insufficient network bandwidth, small memory capacity, and similar constraints.
Description
Technical Field
The invention relates to the technical field of large-scale data processing, and in particular to a system and method for processing large-scale data with efficient dynamic load balancing.
Background
Society is experiencing a data explosion: the volume of information keeps growing, and so do the demands on the capacity to process it. High-performance computing is needed not only in oil exploration, weather forecasting, aerospace and defense, and scientific research; demand is also growing rapidly in broader fields such as finance, government informatization, education, enterprise, and online gaming.
Computing speed is especially important for high-performance computing. High-performance computing is moving toward multi-core and many-core processors and uses heterogeneous parallelism to accelerate applications. CPU+GPU is currently a very mature heterogeneous cooperative computing model, well suited to highly parallel applications and algorithms. However, some applications operate on consistently large data sets, and because of limits such as network bandwidth and system memory, adding hardware to a single server can no longer meet current needs. A method is therefore needed that can process large-scale data with the limited hardware already available.
Summary of the Invention
The technical task of the invention is to provide a system and method for processing large-scale data with efficient dynamic load balancing: a hybrid heterogeneous CPU+GPU cluster system with dynamic load balancing that makes full use of device performance to substantially improve the efficiency of the whole system, and that solves the problem that current server computing systems cannot process large-scale data because of insufficient network bandwidth, small memory capacity, and similar constraints.
The technical task of the invention is achieved as follows.
A system for processing large-scale data with efficient dynamic load balancing is a hybrid heterogeneous CPU+GPU cluster system comprising a central control system, a computing cluster system, a storage system, and a high-speed network. Nodes in the central control system (central control nodes) use a hybrid heterogeneous CPU+GPU architecture. Nodes in the computing cluster system (compute nodes) use either a hybrid heterogeneous CPU+GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage. Shared storage nodes use a CPU architecture, and local storage is provided in the central control node and in every compute node. Shared storage is divided into primary storage and backup storage, which act as redundant storage and hold the same computing data. Local storage holds the data of the central control node or compute node on which it resides. The high-speed network interconnects the central control nodes, the compute nodes, and the shared storage nodes to form a centralized system for processing large-scale data.
In the system, the central control node controls the compute nodes and the storage system nodes.
In the system, there is one central control node, at least one shared storage node, and at least two compute nodes.
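The node counts and architectures stated above can be written down as a small configuration check. This is a minimal sketch, not part of the patent: the `Node` and `Cluster` classes, their field names, and the convention that `gpu_count == 0` marks a CPU-only node are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    gpu_count: int = 0  # assumption: 0 means a CPU-only node


@dataclass
class Cluster:
    control: List[Node]
    shared_storage: List[Node]
    compute: List[Node] = field(default_factory=list)

    def validate(self) -> None:
        """Check the node counts and architectures stated in the text."""
        if len(self.control) != 1:
            raise ValueError("exactly one central control node is required")
        if self.control[0].gpu_count < 1:
            raise ValueError("the central control node is a CPU+GPU node")
        if len(self.shared_storage) < 1:
            raise ValueError("at least one shared storage node is required")
        if any(n.gpu_count != 0 for n in self.shared_storage):
            raise ValueError("shared storage nodes are CPU-only")
        if len(self.compute) < 2:
            raise ValueError("at least two compute nodes are required")
```

A cluster with one control node, one shared storage node, and two compute nodes passes `validate()`. Compute nodes may be CPU-only or CPU+GPU, so their `gpu_count` is not constrained.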
A method for processing large-scale data with efficient dynamic load balancing uses any of the systems described above to process large-scale data and comprises the following steps:
(1) The central control node is connected to all compute nodes through the high-speed network. It controls each compute node, dynamically assigns computing tasks to the compute nodes, and receives the results they return.
(2) The compute nodes and the shared storage nodes are interconnected through the high-speed network, as are the central control node and the shared storage nodes. On command from the central control node, the shared storage nodes send computing task data to the compute nodes.
(3) The compute nodes carry out the computing tasks. Each compute node contains multiple GPUs of the same model, which increases parallelism and per-node computing power; using identical GPUs also makes computing tasks easier to partition.
(4) The local storage in the central control node or in a compute node caches the data that node needs locally.
(5) The shared storage nodes store the input data required by the compute nodes and the computed result data, and send input data to the compute nodes over the high-speed network. The shared storage nodes use primary and backup storage, which safeguards the data.
In the method, the central control node collects computing capability information from all compute nodes, dynamically partitions the computing data, and commands the shared storage nodes to send the data to the selected compute nodes. Following those commands, the shared storage nodes first divide the computing data into data blocks and then dynamically send a different number of blocks to each corresponding compute node. Each compute node receives the data sent by the shared storage nodes and transmits its result data to the central control node, which processes all received results together and stores them in the shared storage nodes.
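The text does not say how the number of blocks per compute node is derived from the capability information. One plausible reading is a split proportional to a single capability score per node. The sketch below assumes that reading; the function name `allocate_blocks`, the scalar score, and the largest-remainder rounding are illustrative choices, not the patent's method.

```python
from typing import Dict


def allocate_blocks(total_blocks: int, capability: Dict[str, float]) -> Dict[str, int]:
    """Split total_blocks across nodes in proportion to a capability score.

    Rounding uses the largest-remainder rule so the shares sum to
    total_blocks exactly.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    total = sum(capability.values())
    if total <= 0:
        raise ValueError("capability scores must sum to a positive value")

    exact = {node: total_blocks * score / total for node, score in capability.items()}
    shares = {node: int(value) for node, value in exact.items()}

    # Hand the blocks lost to rounding down to the largest fractional parts.
    leftover = total_blocks - sum(shares.values())
    by_fraction = sorted(capability, key=lambda n: exact[n] - shares[n], reverse=True)
    for node in by_fraction[:leftover]:
        shares[node] += 1
    return shares
```

For example, `allocate_blocks(10, {"a": 2, "b": 1})` gives node `a` seven blocks and node `b` three. A fully dynamic scheme could instead hand out blocks one at a time as nodes report they are free; the static split here only illustrates the capability-weighted idea.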
While a compute node receives the next data block, it computes the current block and simultaneously sends the previous, already computed block.
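This receive/compute/send overlap is a three-stage pipeline. The sketch below models it inside one process with Python threads and bounded queues: an in-memory iterable stands in for the network receive, a list stands in for the network send, and `run_pipeline`, `compute`, and `depth` are hypothetical names. Error handling is omitted for brevity.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the block stream


def run_pipeline(blocks: Iterable[bytes],
                 compute: Callable[[bytes], bytes],
                 depth: int = 2) -> List[bytes]:
    """Overlap receiving, computing, and sending of data blocks."""
    inbox: queue.Queue = queue.Queue(maxsize=depth)   # received, not yet computed
    outbox: queue.Queue = queue.Queue(maxsize=depth)  # computed, not yet sent
    sent: List[bytes] = []

    def receiver() -> None:
        for block in blocks:          # stands in for a network receive
            inbox.put(block)
        inbox.put(_DONE)

    def worker() -> None:
        while True:
            block = inbox.get()
            if block is _DONE:
                outbox.put(_DONE)
                return
            outbox.put(compute(block))

    def sender() -> None:
        while True:
            result = outbox.get()
            if result is _DONE:
                return
            sent.append(result)       # stands in for a network send

    threads = [threading.Thread(target=f) for f in (receiver, worker, sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent
```

Because there is a single worker, results leave in arrival order. The bounded queues cap how many blocks are held in memory at once, which is the property the text relies on for nodes with little memory.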
In the method, the workflow of the system for processing large-scale data is as follows:
① The central control node collects the number of GPU cards in each compute node, generates computing capability information for each compute node from these counts, and sends this information to the shared storage nodes. The computing capability information includes the number of GPU cards in each compute node, the communication capability of the high-speed network, and the computing capability of the GPU cards.
② Based on the computing capability information sent by the central control node, the shared storage nodes first divide the data into suitably sized basic blocks that can be sent, then assign each compute node a corresponding number of blocks, and then dynamically send the blocks to the compute nodes.
③ A compute node computes while it receives data. If data arrives faster than it can be computed, the incoming data can be staged temporarily in local storage. When no data is arriving, the node fetches data from local storage; if local storage is also empty, it waits.
④ As soon as a compute node finishes a data block, it can send the result to the central control node. If the network is busy, the result can first be staged in local storage and sent once the network is idle.
⑤ The central control node performs the necessary processing on the results received from the compute nodes and then sends them to the shared storage nodes. During the computation, the central control node periodically collects the necessary information from the compute nodes, caches it in local storage, and stores it in the shared storage nodes.
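Steps ③ and ④ both stage blocks in local storage when one side of the pipeline runs ahead of the other. A minimal sketch of such a staging buffer follows, assuming local storage is a directory on the node's file system; the class `LocalSpillBuffer`, its methods, and the file naming are hypothetical.

```python
import os
import tempfile
from collections import deque
from typing import Deque, List, Optional


class LocalSpillBuffer:
    """FIFO of data blocks staged as files in a local directory."""

    def __init__(self, directory: str) -> None:
        self._dir = directory
        self._paths: Deque[str] = deque()
        self._next_id = 0

    def stash(self, block: bytes) -> None:
        """Write a block to local storage and remember its position."""
        path = os.path.join(self._dir, f"block_{self._next_id:08d}.bin")
        self._next_id += 1
        with open(path, "wb") as fh:
            fh.write(block)
        self._paths.append(path)

    def take(self) -> Optional[bytes]:
        """Return the oldest staged block, or None if nothing is staged."""
        if not self._paths:
            return None  # the caller has to wait for more data
        path = self._paths.popleft()
        with open(path, "rb") as fh:
            block = fh.read()
        os.remove(path)
        return block


def demo() -> List[Optional[bytes]]:
    """Stash two blocks, then drain the buffer in arrival order."""
    with tempfile.TemporaryDirectory() as directory:
        buf = LocalSpillBuffer(directory)
        buf.stash(b"block-0")
        buf.stash(b"block-1")
        return [buf.take(), buf.take(), buf.take()]
```

The same class can serve both directions: incoming blocks that outpace the computation (step ③) and finished results that outpace the network (step ④). A `None` from `take()` corresponds to the "wait" case in step ③.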
The system and method of the invention for processing large-scale data with efficient dynamic load balancing have the following advantages:
1. A cluster is deployed on the existing hardware according to the volume of data to be processed, and the dynamic load balancing method adapts to the system platform, making the system reliable and efficient.
2. The load balancing method adapts to the cluster system, which can be a complex hybrid cluster composed of one or more subsystems, such as CPU-only cluster subsystems and heterogeneous CPU+GPU cluster subsystems.
3. Dynamic load balancing is achieved: load is balanced among the compute nodes, device utilization is high, and different computing devices such as CPUs and GPUs are balanced so that they do not wait on one another. Computing devices do not sit idle, and the whole cluster runs efficiently.
4. Where existing hardware is constrained, for example by limited memory capacity or network bandwidth, block-wise data transfer overlapped asynchronously with computation makes it possible to process large-scale data effectively.
5. The system delivers high performance. The load balancing method dynamically assigns different computing tasks to different compute nodes according to the computing capability of each node and of the different computing devices within it, so the task partition is not fixed and dynamic load balancing becomes more efficient.
6. Depending on the characteristics of the application algorithm and the differing computing capabilities of CPUs and GPUs, the computing tasks that each device acquires dynamically should be sized differently.
7. Extending the invention to multiple servers makes it possible to process larger data sets and to balance the computational load across nodes and across the computing devices within each node, thereby making maximum use of existing hardware, improving overall system efficiency, and greatly shortening program run time.
Description of the Drawings
The invention is further described below with reference to the accompanying drawings.
Figure 1 is a schematic block diagram of the structure of the system for processing large-scale data with efficient dynamic load balancing;
Figure 2 is a schematic block diagram of the communication among the nodes of the system for processing large-scale data with efficient dynamic load balancing.
Detailed Description
The system and method of the invention for processing large-scale data with efficient dynamic load balancing are described in detail below with reference to the drawings and specific embodiments.
Example 1:
The system of the invention for processing large-scale data with efficient dynamic load balancing is a hybrid heterogeneous CPU+GPU cluster system comprising a central control system, a computing cluster system, a storage system, and a high-speed network. The central control node uses a hybrid heterogeneous CPU+GPU architecture. The compute nodes use either a hybrid heterogeneous CPU+GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage. Shared storage nodes use a CPU architecture, and local storage is provided in the central control node and in every compute node. Shared storage is divided into primary storage and backup storage, which act as redundant storage and hold the same computing data. Local storage holds the data of the central control node or compute node on which it resides. The high-speed network interconnects the central control node, the compute nodes, and the shared storage nodes to form a centralized system for processing large-scale data.
The central control node controls the compute nodes and the storage system nodes.
There is one central control node, one shared storage node, and two compute nodes.
Example 2:
The system of the invention for processing large-scale data with efficient dynamic load balancing is a hybrid heterogeneous CPU+GPU cluster system comprising a central control system, a computing cluster system, a storage system, and a high-speed network. The central control node uses a hybrid heterogeneous CPU+GPU architecture. The compute nodes use either a hybrid heterogeneous CPU+GPU architecture or a CPU-only architecture. The storage system is divided into shared storage and local storage. Shared storage nodes use a CPU architecture, and local storage is provided in the central control node and in every compute node. Shared storage is divided into primary storage and backup storage, which act as redundant storage and hold the same computing data. Local storage holds the data of the central control node or compute node on which it resides. The high-speed network interconnects the central control node, the compute nodes, and the shared storage nodes to form a centralized system for processing large-scale data.
The central control node controls the compute nodes and the storage system nodes.
There is one central control node, two shared storage nodes, and five compute nodes.
Example 3:
The method of the invention for processing large-scale data with efficient dynamic load balancing uses any of the systems described above to process large-scale data and comprises the following steps:
(1) The central control node is connected to all compute nodes through the high-speed network. It controls each compute node, dynamically assigns computing tasks to the compute nodes, and receives the results they return.
(2) The compute nodes and the shared storage nodes are interconnected through the high-speed network, as are the central control node and the shared storage nodes. On command from the central control node, the shared storage nodes send computing task data to the compute nodes.
(3) The compute nodes carry out the computing tasks. Each compute node contains multiple GPUs of the same model, which increases parallelism and per-node computing power; using identical GPUs also makes computing tasks easier to partition.
(4) The local storage in the central control node or in a compute node caches the data that node needs locally.
(5) The shared storage nodes store the input data required by the compute nodes and the computed result data, and send input data to the compute nodes over the high-speed network. The shared storage nodes use primary and backup storage, which safeguards the data.
The central control node collects computing capability information from all compute nodes, dynamically partitions the computing data, and commands the shared storage nodes to send the data to the selected compute nodes. Following those commands, the shared storage nodes first divide the computing data into data blocks and then dynamically send a different number of blocks to each corresponding compute node. Each compute node receives the data sent by the shared storage nodes and transmits its result data to the central control node, which processes all received results together and stores them in the shared storage nodes.
While a compute node receives the next data block, it computes the current block and simultaneously sends the previous, already computed block.
Control and storage are centralized. Large data is divided into blocks, which the shared storage nodes dynamically distribute to the compute nodes. With computation and communication running asynchronously, the central control node must divide the data into suitably sized blocks according to the communication capability of the network and the computing capability of the compute nodes, so that computation and transfer hide each other's latency and performance approaches the optimum. Handling transfer and computation asynchronously shortens the computation time, and because the computation is done block by block, hardware with little memory and low bandwidth can also be used for computations on large-scale data.
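The benefit of hiding transfer behind computation can be estimated with the standard timing model for a linear pipeline. The sketch assumes uniform per-block receive, compute, and send times and enough buffering for all three stages to run concurrently; the patent gives no such formula, and the function names are illustrative.

```python
def serial_time(n_blocks: int, recv: float, comp: float, send: float) -> float:
    """Total time when each block is received, computed, and sent in turn."""
    return n_blocks * (recv + comp + send)


def pipelined_time(n_blocks: int, recv: float, comp: float, send: float) -> float:
    """Total time when the three stages overlap across consecutive blocks.

    The first block passes through all stages; each later block adds only
    the time of the slowest stage.
    """
    if n_blocks == 0:
        return 0.0
    return recv + comp + send + (n_blocks - 1) * max(recv, comp, send)
```

With 10 blocks and per-block times of 1.0 s (receive), 2.0 s (compute), and 0.5 s (send), the serial estimate is 35.0 s and the pipelined estimate is 21.5 s. Under this model the overlap pays off most when the stage times are close to one another, which is one way to read the requirement that block size be chosen from both network and compute capability.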
Example 4:
In the method of the invention for processing large-scale data with efficient dynamic load balancing, the workflow of the system for processing large-scale data is as follows:
① The central control node collects the number of GPU cards in each compute node, generates computing capability information for each compute node from these counts, and sends this information to the shared storage nodes. The computing capability information includes the number of GPU cards in each compute node, the communication capability of the high-speed network, and the computing capability of the GPU cards.
② Based on the computing capability information sent by the central control node, the shared storage nodes first divide the data into suitably sized basic blocks that can be sent, then assign each compute node a corresponding number of blocks, and then dynamically send the blocks to the compute nodes.
③ A compute node computes while it receives data. If data arrives faster than it can be computed, the incoming data can be staged temporarily in local storage. When no data is arriving, the node fetches data from local storage; if local storage is also empty, it waits.
④ As soon as a compute node finishes a data block, it can send the result to the central control node. If the network is busy, the result can first be staged in local storage and sent once the network is idle.
⑤ The central control node performs the necessary processing on the results received from the compute nodes and then sends them to the shared storage nodes. During the computation, the central control node periodically collects the necessary information from the compute nodes, caches it in local storage, and stores it in the shared storage nodes.
In the system for processing large-scale data, computation and communication are fully asynchronous: each data block can be sent to the network as soon as its computation completes. Overlapping computation with transfer shortens the total computing time of the system, and processing the data in blocks lets the system handle large data even where network bandwidth is low and storage space is limited. The periodic backup provides fault tolerance and prevents the interruption of a node from crashing the whole system.
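The text does not describe what the periodic backup contains or how it is written. As one possible illustration, the sketch below saves a small progress record as JSON and replaces the previous checkpoint atomically, so an interruption during the write leaves the old checkpoint intact. The functions `save_checkpoint` and `load_checkpoint` and the `done_blocks` field are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_checkpoint(state: Dict[str, Any], path: str) -> None:
    """Write state to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())      # push the bytes to disk before the swap
        os.replace(tmp_path, path)     # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Read back the most recent checkpoint."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

A control node could call `save_checkpoint` on a timer with, for example, the list of block indices already completed, and on restart resume from `load_checkpoint` instead of reprocessing everything.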
With the specific embodiments above, a person skilled in the art can readily implement the invention. It should be understood, however, that the invention is not limited to these specific embodiments. On the basis of the disclosed embodiments, a person skilled in the art may combine different technical features as desired to realize different technical solutions.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510037687.0A CN104580503A (en) | 2015-01-26 | 2015-01-26 | Efficient dynamic load balancing system and method for processing large-scale data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104580503A (en) | 2015-04-29 |
Family
ID=53095660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510037687.0A Pending CN104580503A (en) | 2015-01-26 | 2015-01-26 | Efficient dynamic load balancing system and method for processing large-scale data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104580503A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897148A (en) * | 2017-02-28 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of system and method for generating micro-downburst |
CN107920080A (en) * | 2017-11-22 | 2018-04-17 | 郑州云海信息技术有限公司 | A kind of characteristic acquisition method and system |
CN108989398A (en) * | 2018-06-27 | 2018-12-11 | 郑州云海信息技术有限公司 | A kind of virtual shared memory cell and the cluster storage system based on cloud storage |
CN109343791A (en) * | 2018-08-16 | 2019-02-15 | 武汉元鼎创天信息科技有限公司 | A kind of big data all-in-one machine |
CN110333945A (en) * | 2019-05-09 | 2019-10-15 | 成都信息工程大学 | A dynamic load balancing method, system and terminal |
CN112511576A (en) * | 2019-09-16 | 2021-03-16 | 触景无限科技(北京)有限公司 | Internet of things data processing system and data processing method |
CN113094183A (en) * | 2021-06-09 | 2021-07-09 | 苏州浪潮智能科技有限公司 | Training task creating method, device, system and medium of AI (Artificial Intelligence) training platform |
CN113225362A (en) * | 2020-02-06 | 2021-08-06 | 北京京东振世信息技术有限公司 | Server cluster system and implementation method thereof |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050243094A1 (en) * | 2004-05-03 | 2005-11-03 | Microsoft Corporation | Systems and methods for providing an enhanced graphics pipeline |
CN101751376A (en) * | 2009-12-30 | 2010-06-23 | 中国人民解放军国防科学技术大学 | Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set |
CN104301434A (en) * | 2014-10-31 | 2015-01-21 | 浪潮(北京)电子信息产业有限公司 | A cluster-based high-speed communication architecture and method |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897148A (en) * | 2017-02-28 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of system and method for generating micro-downburst |
CN107920080A (en) * | 2017-11-22 | 2018-04-17 | 郑州云海信息技术有限公司 | A kind of characteristic acquisition method and system |
CN108989398A (en) * | 2018-06-27 | 2018-12-11 | 郑州云海信息技术有限公司 | A kind of virtual shared memory cell and the cluster storage system based on cloud storage |
CN108989398B (en) * | 2018-06-27 | 2021-02-02 | 苏州浪潮智能科技有限公司 | A virtual shared storage unit and a cloud storage-based cluster storage system |
CN109343791A (en) * | 2018-08-16 | 2019-02-15 | 武汉元鼎创天信息科技有限公司 | A kind of big data all-in-one machine |
CN109343791B (en) * | 2018-08-16 | 2021-11-09 | 武汉元鼎创天信息科技有限公司 | Big data all-in-one |
CN110333945A (en) * | 2019-05-09 | 2019-10-15 | 成都信息工程大学 | A dynamic load balancing method, system and terminal |
CN112511576A (en) * | 2019-09-16 | 2021-03-16 | 触景无限科技(北京)有限公司 | Internet of things data processing system and data processing method |
CN113225362A (en) * | 2020-02-06 | 2021-08-06 | 北京京东振世信息技术有限公司 | Server cluster system and implementation method thereof |
CN113225362B (en) * | 2020-02-06 | 2024-04-05 | 北京京东振世信息技术有限公司 | Server cluster system and implementation method thereof |
CN113094183A (en) * | 2021-06-09 | 2021-07-09 | 苏州浪潮智能科技有限公司 | Training task creating method, device, system and medium of AI (Artificial Intelligence) training platform |
CN113094183B (en) * | 2021-06-09 | 2021-09-17 | 苏州浪潮智能科技有限公司 | Training task creating method, device, system and medium of AI (Artificial Intelligence) training platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104580503A (en) | Efficient dynamic load balancing system and method for processing large-scale data | |
CN105159610B (en) | Large-scale data processing system and method | |
CN110619595B (en) | Graph calculation optimization method based on interconnection of multiple FPGA accelerators | |
CN106502792B (en) | A Multi-tenant Resource Optimal Scheduling Method for Different Types of Loads | |
CN108563808B (en) | Design Method of Heterogeneous Reconfigurable Graph Computation Accelerator System Based on FPGA | |
CN107122244B (en) | Multi-GPU-based graph data processing system and method | |
CN101778002B (en) | Large-scale cluster system and building method thereof | |
CN111221624A (en) | A container management method for regulating cloud platform based on Docker container technology | |
CN102929718A (en) | Distributed GPU (graphics processing unit) computer system based on task scheduling | |
CN104239555A (en) | MPP (massively parallel processing)-based parallel data mining framework and MPP-based parallel data mining method | |
CN102135949A (en) | Computing network system, method and device based on graphic processing unit | |
CN103336756B (en) | A kind of generating apparatus of data computational node | |
CN103595780A (en) | Cloud computing resource scheduling method based on repeat removing | |
CN114424174A (en) | Parameter caching for neural network accelerators | |
CN104811503A (en) | R statistical modeling system | |
CN104375882A (en) | Multistage nested data drive calculation method matched with high-performance computer structure | |
CN109254846A (en) | The dynamic dispatching method and system of CPU and GPU cooperated computing based on two-level scheduler | |
CN104618406A (en) | Load balancing algorithm based on naive Bayesian classification | |
CN107463448A (en) | A kind of deep learning weight renewing method and system | |
CN116991590A (en) | Resource decoupling system, execution method and device for deep learning applications | |
CN107301094A (en) | The dynamic self-adapting data model inquired about towards extensive dynamic transaction | |
CN107197039A (en) | A kind of PAAS platform service bag distribution methods and system based on CDN | |
Narantuya et al. | Multi-agent deep reinforcement learning-based resource allocation in hpc/ai converged cluster | |
Li et al. | Collm: A collaborative llm inference framework for resource-constrained devices | |
Li et al. | Data prefetching and file synchronizing for performance optimization in Hadoop-based hybrid cloud |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150429 |