
CN105302526A - Data processing system and method - Google Patents

Data processing system and method

Info

Publication number
CN105302526A
Authority
CN
China
Prior art keywords
master node
node
slave
data
slave nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510680669.4A
Other languages
Chinese (zh)
Other versions
CN105302526B (en)
Inventor
张清 (Zhang Qing)
沈铂 (Shen Bo)
王娅娟 (Wang Yajuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201510680669.4A
Publication of CN105302526A
Application granted
Publication of CN105302526B
Status: Active
Anticipated expiration

Landscapes

  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a data processing system comprising a master node and a plurality of slave nodes. The master node reads to-be-processed data in batches; after each read it distributes the data to the slave nodes, updates the network according to the weights returned by each slave node, sends the updated network parameters to each slave node, and then reads the next batch of to-be-processed data. Each slave node performs a forward-backward computation on the data received from the master node to obtain weights, which it returns to the master node. By adopting this master-slave computation mode, the scheme shortens the processing time of deep learning applications and improves computational efficiency.

Description

A data processing system and method

Technical Field

The present invention relates to the field of computers, and in particular to a data processing method and system.

Background Art

In 2006, Geoffrey Hinton, a professor at the University of Toronto and a leading figure in machine learning, and his student Ruslan Salakhutdinov published a paper in the top journal Science that set off a wave of deep learning in both academia and industry. Since 2006, deep learning has continued to gain momentum in academia, with Stanford University, New York University, and the University of Montreal becoming major centers of deep learning research. In 2010, the U.S. Department of Defense's DARPA program funded deep learning projects for the first time, with participants including Stanford University, New York University, and NEC Laboratories America. One important piece of support for deep learning is that the brain's nervous system does have a rich hierarchical structure; the most famous example is the Hubel-Wiesel model, which won the Nobel Prize in Physiology or Medicine for revealing the mechanism of the visual nervous system. Beyond the bionics perspective, theoretical research on deep learning is still in its infancy, but it has already shown great power in applications. Since 2011, speech recognition researchers at Microsoft Research and Google have used DNN technology to cut speech recognition error rates by 20% to 30%, the biggest breakthrough in the field in more than a decade. In 2012, DNN technology achieved striking results in image recognition, reducing the error rate on the ImageNet evaluation from 26% to 15%. In the same year, DNNs were also applied to a pharmaceutical drug activity prediction problem and achieved the world's best result, an achievement reported by The New York Times.

Today, Google, Microsoft, Baidu, and other well-known high-tech companies that hold big data are racing to invest resources and seize the technical high ground of deep learning, precisely because they have seen that in the big-data era, more complex and more powerful deep models can profoundly reveal the complex and rich information carried in massive data and make more accurate predictions about future or unknown events.

Current deep learning applications include speech recognition, image recognition, natural language processing, and click-through-rate (CTR) estimation for search advertising. These applications involve enormous amounts of computation and require large-scale computing; GPU-based high-performance computing can further improve application processing efficiency, so designing a deep learning system around GPUs is a good choice.

Summary of the Invention

The present invention provides a data processing system and method that improve computational efficiency.

To solve the above technical problem, the present invention provides a data processing system comprising one master node and a plurality of slave nodes.

The master node is configured to read the to-be-processed data in batches; after each read it distributes the to-be-processed data to the slave nodes, updates the network according to the weights returned by each slave node, sends the updated network parameters to each slave node, and then reads the next batch of to-be-processed data.

Each slave node is configured to perform a forward-backward computation on the data received from the master node to obtain weights, and to return the weights to the master node.
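
The patent does not specify how the master combines the weights returned by the slaves or what update rule it applies. The following is a minimal sketch, assuming the returned "weights" behave like gradients, that the master averages them, and that the network is a flat parameter vector; the function name `master_update`, the learning rate, and the sizes are all illustrative, not part of the patent.

```python
import numpy as np

def master_update(params, slave_weights, lr=0.01):
    """Combine the weights returned by the slaves and update the network.

    Assumption: an averaged, gradient-descent-style step. The patent only
    states that the master updates the network from the returned weights.
    """
    avg = np.mean(slave_weights, axis=0)  # aggregate across slave nodes
    return params - lr * avg              # hypothetical update rule

# Example: a 10-parameter network and 4 slave nodes
params = np.zeros(10)
returned = [np.random.randn(10) for _ in range(4)]
params = master_update(params, returned)
```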

Preferably,

the master node includes two CPUs and one GPU;

each slave node includes two CPUs and two GPUs;

the master node and the slave nodes form a hybrid cluster system with a heterogeneous CPU-GPU architecture.

Preferably,

the system further includes parallel distributed Lustre storage;

reading the to-be-processed data in batches by the master node specifically means:

the master node reads data in parallel from the Lustre storage.

Preferably,

the Lustre storage supports multi-process or multi-thread parallel reads and writes.

Preferably,

the master node and the slave nodes receive/send data using remote direct memory access (RDMA).

Preferably,

each slave node is configured with one IB (InfiniBand) network card, and the master node and the slave nodes are interconnected through an IB network;

within the master node and each node, the CPUs and GPUs communicate via the PCIe 3.0 standard.

Preferably,

the number of slave nodes is not greater than 8.

The present invention also provides a data processing method, applied in the system according to any one of claims 1 to 7, the method comprising:

Step S1: the master node reads the to-be-processed data and distributes it to the slave nodes;

Step S2: the master node receives the weights returned by the slave nodes;

Step S3: the master node updates the network according to the weights returned by the slave nodes, and sends the updated network parameters to the slave nodes;

Step S4: after sending the updated network, the master node checks whether there is still data to be processed; if so, it returns to S1.
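
A minimal sketch of this S1-S4 control flow on the master follows. The helpers `read_next_batch`, `send_data`, `recv_weight`, and `send_params` are hypothetical stand-ins for the Lustre reads and IB/RDMA transfers described elsewhere in the patent, and `master_update` is the aggregation sketch above.

```python
def run_master(storage, slaves, params):
    batch = storage.read_next_batch()        # S0/S1: read a batch from storage
    while batch is not None:                 # S4: continue while data remains
        for s in slaves:                     # S1: distribute data to slaves
            s.send_data(batch)
        weights = [s.recv_weight() for s in slaves]  # S2: gather weights
        params = master_update(params, weights)      # S3: update the network
        for s in slaves:                     # S3: push updated parameters
            s.send_params(params)
        batch = storage.read_next_batch()    # read the next batch
    return params
```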

Preferably,

after step S1 and before step S2, the method further includes:

Step S11: the GPUs of the slave nodes perform a forward-backward computation on the data received from the master node according to the network parameters to obtain weights;

and step S2 includes:

the GPU of the master node receiving the weights sent by the slave nodes.
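
To make step S11 concrete, here is a toy forward-backward pass for a single linear layer under a squared-error loss. The real system runs Caffe layers on NVIDIA GPUs; plain NumPy on the CPU is used only to show what "compute forward-backward, then return a weight" means, and the layer, loss, and sizes are assumptions.

```python
import numpy as np

def forward_backward(W, x, y):
    """One forward-backward pass for a linear layer (illustrative only)."""
    pred = x @ W                 # forward pass
    err = pred - y               # error under a squared loss
    grad = x.T @ err / len(x)    # backward pass: gradient w.r.t. W
    return grad                  # the "weight" sent back to the master

W = np.zeros((4, 1))
x, y = np.random.randn(32, 4), np.random.randn(32, 1)
grad = forward_backward(W, x, y)
```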

Preferably,

before step S1, the method further includes:

Step S0: the master node reads data in parallel from the parallel distributed Lustre storage.
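
Step S0 relies on Lustre being a parallel file system that serves concurrent readers. The sketch below shows one way the master could issue multi-threaded reads; the shard paths and file layout are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def read_shard(path):
    """Read one data shard; Lustre handles concurrent reads in parallel."""
    with open(path, "rb") as f:
        return f.read()

def parallel_read(shard_paths, workers=8):
    """Issue multi-threaded reads against the parallel file system."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_shard, shard_paths))

# Hypothetical shard layout on the Lustre mount
batches = parallel_read([f"/lustre/cifar10/shard_{i}.bin" for i in range(8)])
```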

The above scheme adopts a master-slave computation mode, which reduces the processing time of deep learning applications and improves computational efficiency.

Brief Description of the Drawings

Fig. 1 is a schematic structural diagram of the data processing system in Embodiment 1;

Fig. 2 is a flowchart of the data processing method in Embodiment 1;

Fig. 3 is a schematic structural diagram of the data processing system in Embodiment 2;

Fig. 4 is a schematic diagram of the connection between the Lustre storage and the master node in Embodiment 2;

Fig. 5 is a schematic diagram of the logical relationships in the data processing system in Embodiment 2.

Detailed Description

To make the purpose, technical solution, and advantages of the present application clearer, the embodiments of the application are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other arbitrarily.

Embodiment 1

As shown in Fig. 1, the present invention provides a data processing system comprising one master node 11 and a plurality of slave nodes 12.

The master node 11 is configured to read the to-be-processed data in batches; after each read it distributes the to-be-processed data to the slave nodes, updates the network according to the weights returned by each slave node, sends the updated network parameters to each slave node, and then reads the next batch of to-be-processed data.

Each slave node 12 is configured to perform a forward-backward computation on the data received from the master node to obtain weights, and to return the weights to the master node.

In this embodiment, the number of slave nodes may be set to no more than 8. The master node 11 includes two CPUs 121 and one GPU 122; each slave node 12 includes two CPUs 121 and two GPUs 122; the master node and the slave nodes form a hybrid cluster system with a heterogeneous CPU-GPU architecture.

Preferably,

the system further includes parallel distributed Lustre storage 13. The Lustre storage 13 supports multi-process or multi-thread parallel reads and writes, and the master node 11 reads data in parallel from the Lustre storage.

The master node and the slave nodes receive/send data using remote direct memory access (RDMA).

Preferably,

each slave node 12 is configured with one IB network card, and the master node 11 and the slave nodes 12 are interconnected through an IB network; within the master node 11 and each node, the CPUs and GPUs communicate via the PCIe 3.0 standard.

As shown in Fig. 2, the present invention also provides a data processing method, applied in the system according to any one of claims 1 to 7, the method comprising:

Step S1: the master node reads the to-be-processed data and distributes it to the slave nodes;

Step S2: the master node receives the weights returned by the slave nodes;

specifically, the GPU of the master node receives the weights sent by each slave node.

Step S3: the master node updates the network according to the weights returned by the slave nodes, and sends the updated network parameters to the slave nodes.

Step S4: after sending the updated network, the master node checks whether there is still data to be processed; if so, it returns to S1.

Preferably,

after step S1 and before step S2, the method further includes:

Step S11: the GPU of each slave node performs a forward-backward computation on the data received from the master node according to the network parameters to obtain weights.

Preferably,

before step S1, the method further includes:

Step S0: the master node reads data in parallel from the parallel distributed Lustre storage.

Embodiment 2

The technical solution of the present invention is further described below in conjunction with a specific scenario.

As shown in Fig. 3, the data processing system of this embodiment can run the deep learning application Caffe, tested with the CIFAR-10 dataset, and can specifically be implemented with the following architecture:

1. The data processing system can adopt a hybrid cluster system with a heterogeneous CPU+GPU architecture, operating in master-slave mode; the computing nodes of the whole system are divided into 1 master node and 8 slave nodes. In line with the characteristics of deep learning algorithms, parameter update computation, data reading and distribution, and network update computation are performed by the master node, while the time-consuming forward-backward computation is performed by the slave nodes.

The master node and slave nodes of this embodiment are described in detail below:

a) Master node

Within the master node, the CPUs and GPU compute cooperatively, with CPU-GPU communication over the PCIe 3.0 standard. The master node has 2 CPUs and 1 NVIDIA K40 GPU (which supports PCIe 3.0); there is 1 master node. The master node is configured with 2 IB network cards and is interconnected with the storage and the slave nodes through the IB network.

b) Slave node

Within each slave node, the CPUs and GPUs compute cooperatively, with CPU-GPU communication over the PCIe 3.0 standard. Each slave node has 2 CPUs and 2 NVIDIA K40 GPUs (which support PCIe 3.0), with both GPUs plugged into CPU0's slots. There are 8 slave nodes. Each slave node is configured with 1 IB network card and is interconnected with the master node through the IB network.

2. As shown in Fig. 4, the technical solution of this embodiment also requires parallel distributed Lustre storage, which supports multi-process or multi-thread parallel reads and writes with high parallel bandwidth and low latency; the Lustre storage and the master node are interconnected through the IB network.

3. Network design: the data processing system uses Mellanox 56 Gb/s IB high-speed networking to achieve high-speed interconnection of the parallel storage, the master node, and the slave nodes.

As shown in Fig. 5, the working logic of the system components is designed as follows:

(1) The master node reads CIFAR-10 data in parallel from the parallel Lustre storage;

(2) The master node distributes the data to the 8 slave nodes;

(3) The 2 GPUs of each slave node perform the forward-backward computation and transfer the computed weights directly to the GPU of the master node via RDMA;

(4) After receiving the new weights, the master node performs its computation on the GPU, updates the network, and then sends the new network to the slave nodes via RDMA.

The above steps are executed iteratively until all data processing is complete; the logical relationship is shown in Fig. 5.
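
As a rough sketch of how this iteration could be expressed in code: MPI over an InfiniBand fabric typically runs on RDMA transports underneath, so an mpi4py skeleton is used below. Rank 0 plays the master and the remaining ranks play the slaves; the random `grad` is a placeholder for the per-GPU forward-backward computation, and the averaging update rule is an assumption, since the patent only says the master updates the network from the returned weights. Launch with at least two ranks, e.g. `mpirun -np 9 python train.py` for one master and eight slaves.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n_slaves = comm.Get_size() - 1            # rank 0 is the master
params = np.zeros(10, dtype=np.float64)   # toy network parameters

for step in range(100):                   # iterate over data batches
    comm.Bcast(params, root=0)            # master sends the current network
    if rank == 0:
        grad = np.zeros_like(params)      # master contributes nothing
    else:
        grad = np.random.randn(*params.shape)  # placeholder forward-backward
    total = np.zeros_like(params)
    comm.Reduce(grad, total, op=MPI.SUM, root=0)  # slaves return weights
    if rank == 0:
        params -= 0.01 * total / n_slaves  # assumed averaging update
```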

The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection. Those of ordinary skill in the art will understand that all or some of the steps in the above method can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk. Optionally, all or some of the steps of the above embodiments can also be implemented with one or more integrated circuits; correspondingly, each module in the above embodiments can be implemented in the form of hardware or as a software function module. The present application is not limited to any particular combination of hardware and software.

Claims (10)

1. A data processing system, characterized in that the system comprises one master node and a plurality of slave nodes; the master node is configured to read to-be-processed data in batches, and, after each read, to distribute the to-be-processed data to the slave nodes, update the network according to the weights returned by each slave node, send the updated network parameters to each slave node, and then read the next batch of to-be-processed data; each slave node is configured to perform a forward-backward computation on the data received from the master node to obtain weights and to return them to the master node.

2. The system according to claim 1, characterized in that: the master node includes two CPUs and one GPU; each slave node includes two CPUs and two GPUs; and the master node and the slave nodes form a hybrid cluster system with a heterogeneous CPU-GPU architecture.

3. The system according to claim 2, characterized in that the system further includes parallel distributed Lustre storage, and reading the to-be-processed data in batches by the master node specifically means: the master node reads data in parallel from the Lustre storage.

4. The system according to claim 3, characterized in that: the Lustre storage supports multi-process or multi-thread parallel reads and writes.

5. The system according to claim 4, characterized in that: the master node and the slave nodes receive/send data using remote direct memory access (RDMA).

6. The system according to any one of claims 1 to 5, characterized in that: each slave node is configured with one IB network card, and the master node and the slave nodes are interconnected through an IB network; within the master node and each node, the CPUs and GPUs communicate via the PCIe 3.0 standard.

7. The system according to any one of claims 1 to 5, characterized in that: the number of slave nodes is not greater than 8.

8. A data processing method, applied in the system according to any one of claims 1 to 7, characterized in that the method comprises: step S1: the master node reads the to-be-processed data and distributes it to the slave nodes; step S2: the master node receives the weights returned by the slave nodes; step S3: the master node updates the network according to the weights returned by the slave nodes and sends the updated network parameters to the slave nodes; step S4: after sending the updated network, the master node checks whether there is still data to be processed and, if so, returns to S1.

9. The method according to claim 8, characterized in that: after step S1 and before step S2, the method further includes: step S11: the GPUs of the slave nodes perform a forward-backward computation on the data received from the master node according to the network parameters to obtain weights; and step S2 includes: the GPU of the master node receiving the weights sent by the slave nodes.

10. The method according to any one of claims 8 to 9, characterized in that: before step S1, the method further includes: step S0: the master node reads data in parallel from the parallel distributed Lustre storage.
CN201510680669.4A 2015-10-19 2015-10-19 A kind of data processing system and method Active CN105302526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510680669.4A CN105302526B (en) 2015-10-19 2015-10-19 A kind of data processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510680669.4A CN105302526B (en) 2015-10-19 2015-10-19 A kind of data processing system and method

Publications (2)

Publication Number Publication Date
CN105302526A true CN105302526A (en) 2016-02-03
CN105302526B CN105302526B (en) 2019-03-01

Family

ID=55199830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510680669.4A Active CN105302526B (en) 2015-10-19 2015-10-19 A kind of data processing system and method

Country Status (1)

Country Link
CN (1) CN105302526B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956659A (en) * 2016-05-11 2016-09-21 北京比特大陆科技有限公司 Data processing device, data processing system and server
CN107463448A (en) * 2017-09-28 2017-12-12 郑州云海信息技术有限公司 A kind of deep learning weight renewing method and system
CN109166074A (en) * 2018-08-06 2019-01-08 联想(北京)有限公司 computing system
CN110019093A (en) * 2017-12-28 2019-07-16 中国移动通信集团安徽有限公司 Method for writing data, device, equipment and medium
CN113626368A (en) * 2021-06-30 2021-11-09 苏州浪潮智能科技有限公司 Artificial intelligence data processing method and related device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7526634B1 (en) * 2005-12-19 2009-04-28 Nvidia Corporation Counter-based delay of dependent thread group execution
CN102929718A (en) * 2012-09-17 2013-02-13 江苏九章计算机科技有限公司 Distributed GPU (graphics processing unit) computer system based on task scheduling
CN104036451A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Parallel model processing method and device based on multiple graphics processing units
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7526634B1 (en) * 2005-12-19 2009-04-28 Nvidia Corporation Counter-based delay of dependent thread group execution
CN102929718A (en) * 2012-09-17 2013-02-13 江苏九章计算机科技有限公司 Distributed GPU (graphics processing unit) computer system based on task scheduling
CN104036451A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Parallel model processing method and device based on multiple graphics processing units
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shanshan Zhang et al.: "Asynchronous Stochastic Gradient Descent for DNN Training", IEEE International Conference on Acoustics *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956659A (en) * 2016-05-11 2016-09-21 北京比特大陆科技有限公司 Data processing device, data processing system and server
CN105956659B (en) * 2016-05-11 2019-11-22 北京比特大陆科技有限公司 Data processing device and system, server
CN107463448A (en) * 2017-09-28 2017-12-12 郑州云海信息技术有限公司 A kind of deep learning weight renewing method and system
CN110019093A (en) * 2017-12-28 2019-07-16 中国移动通信集团安徽有限公司 Method for writing data, device, equipment and medium
CN109166074A (en) * 2018-08-06 2019-01-08 联想(北京)有限公司 computing system
CN113626368A (en) * 2021-06-30 2021-11-09 苏州浪潮智能科技有限公司 Artificial intelligence data processing method and related device
CN113626368B (en) * 2021-06-30 2023-07-25 苏州浪潮智能科技有限公司 A data processing method and related device for artificial intelligence

Also Published As

Publication number Publication date
CN105302526B (en) 2019-03-01

Similar Documents

Publication Publication Date Title
JP6974270B2 (en) Intelligent high bandwidth memory system and logic die for it
CN105224502A (en) A kind of degree of depth learning method based on GPU and system
CN106297774B (en) A kind of the distributed parallel training method and system of neural network acoustic model
CN113366501B (en) Split network acceleration architecture
CN106951926B (en) Deep learning method and device of hybrid architecture
CN105227669A (en) A kind of aggregated structure system of CPU and the GPU mixing towards degree of depth study
CN105302526A (en) Data processing system and method
CN105956659B (en) Data processing device and system, server
JP2022058328A (en) Distributed model training equipment and methods, electronics, storage media, and computer programs
US8677299B1 (en) Latch clustering with proximity to local clock buffers
CN107346351A (en) For designing FPGA method and system based on the hardware requirement defined in source code
CN111191784A (en) Transposed sparse matrix multiplied by dense matrix for neural network training
CN110059793B (en) Stepwise modification of generative adversarial neural networks
WO2025001229A1 (en) Computing system, model training method and apparatus, and product
Lawande et al. Novo‐G#: a multidimensional torus‐based reconfigurable cluster for molecular dynamics
CN117312215B (en) Server system, job execution method, device, equipment and medium
CN116187464A (en) Blind quantum computing processing method and device and electronic equipment
CN110888824B (en) Multilevel memory hierarchy
CN117785490B (en) A training architecture, method, system and server for a graph neural network model
Wang et al. Enabling efficient large-scale deep learning training with cache coherent disaggregated memory systems
US8458634B2 (en) Latch clustering with proximity to local clock buffers
US20240311668A1 (en) Optimizing quantum computing circuit state partitions for simulation
CN205983537U (en) Data processing device and system, server
CN105718991B (en) Cell Array Computing System
US20240311667A1 (en) Simulating quantum computing circuits using sparse state partitioning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant