CN114579311B - Method, device, equipment and storage medium for executing distributed computing task

Info

Publication number: CN114579311B
Application number: CN202210214311.2A
Authority: CN
Inventors: 奎志清; 李龙; 吴志华
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-03-04
Filing date: 2022-03-04
Publication date: 2023-05-30
Anticipated expiration: 2042-03-04
Also published as: CN114579311A

Abstract

The disclosure provides a method, apparatus, device, storage medium, and program product for executing distributed computing tasks, and relates to the technical field of artificial intelligence, in particular to cloud computing, distributed computing, and related technical fields. The specific implementation is: sending computing data to be processed that is related to a distributed computing task to each first node in a first node set, wherein the hardware information of the first node matches local hardware information; aggregating the first calculation results of each first node for the computing data to be processed to obtain a first aggregation result; sharing the first aggregation result with each second node in a second node set to obtain a shared result, wherein the hardware information of the second node does not match the local hardware information; and determining the shared result as new computing data to be processed, and returning to the operation of sending the computing data to be processed to each node in the first node set, until execution of the distributed computing task is complete.

Description

Method, apparatus, device, and storage medium for executing distributed computing tasks

Technical Field

The present disclosure relates to the technical field of artificial intelligence, and in particular to technical fields such as cloud computing and distributed computing.

Background

Distributed computing is a computing approach that contrasts with centralized computing. As computing technology has developed, some computing tasks have come to require substantial computing power, and completing them with centralized computing takes a long time. Distributed computing decomposes such a task into many small parts and assigns them to multiple nodes for processing, which shortens computation time and improves computing efficiency.

Summary

The present disclosure provides a method, apparatus, device, storage medium, and program product for executing distributed computing tasks.

According to one aspect of the present disclosure, a method for executing a distributed computing task is provided, including: sending computing data to be processed that is related to the distributed computing task to each first node in a first node set, wherein the hardware information of the first node matches local hardware information;

aggregating the first calculation results of each first node for the computing data to be processed to obtain a first aggregation result; sharing the first aggregation result with each second node in a second node set to obtain a shared result, wherein the hardware information of the second node does not match the local hardware information; and determining the shared result as new computing data to be processed, and returning to the operation of sending the computing data to be processed to each node in the first node set, until execution of the distributed computing task is complete.

According to another aspect of the present disclosure, a method for executing a distributed computing task is provided, including: receiving computing data to be processed from a master node; performing a calculation operation according to the computing data to be processed to obtain a first calculation result; and sending the first calculation result to the master node.

According to another aspect of the present disclosure, an apparatus for executing a distributed computing task is provided, including: a first sending module, configured to send computing data to be processed that is related to the distributed computing task to each first node in a first node set, wherein the hardware information of the first node matches local hardware information; an aggregation module, configured to aggregate the first calculation results of each first node for the computing data to be processed to obtain a first aggregation result; a sharing module, configured to share the first aggregation result with each second node in a second node set to obtain a shared result, wherein the hardware information of the second node does not match the local hardware information; and a determining module, configured to determine the shared result as new computing data to be processed, and to return to the operation of sending the computing data to be processed to each node in the first node set, until execution of the distributed computing task is complete.

According to another aspect of the present disclosure, an apparatus for executing a distributed computing task is provided, including: a receiving module, configured to receive computing data to be processed from a master node; a calculation module, configured to perform a calculation operation according to the computing data to be processed to obtain a first calculation result; and a sending module, configured to send the first calculation result to the master node.

Another aspect of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method shown in the embodiments of the present disclosure.

According to another aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to execute the method shown in the embodiments of the present disclosure.

According to another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the steps of the method shown in the embodiments of the present disclosure.

It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:

FIG. 1 is a schematic diagram of an exemplary distributed system for the method, apparatus, electronic device, and storage medium for executing distributed computing tasks according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flowchart of a method for executing a distributed computing task according to an embodiment of the present disclosure;

FIG. 3 schematically shows a flowchart of a method for determining a first node set and a second node set according to an embodiment of the present disclosure;

FIG. 4 schematically shows a flowchart of a method for determining a master node according to an embodiment of the present disclosure;

FIG. 5 schematically shows a diagram of a method for executing a distributed computing task according to another embodiment of the present disclosure;

FIG. 6 schematically shows a block diagram of an apparatus for executing distributed computing tasks according to an embodiment of the present disclosure;

FIG. 7 schematically shows a block diagram of an apparatus for executing distributed computing tasks according to another embodiment of the present disclosure; and

FIG. 8 schematically shows a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. The description includes various details of the embodiments to aid understanding, and these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.

The application scenario of the method and apparatus for executing distributed computing tasks provided by the present disclosure is described below with reference to FIG. 1.

FIG. 1 is a schematic diagram of an exemplary distributed system for the method, apparatus, electronic device, and storage medium for executing distributed computing tasks according to an embodiment of the present disclosure. It should be noted that FIG. 1 shows only an example of a system architecture to which the embodiments of the present disclosure can be applied, in order to help those skilled in the art understand the technical content of the present disclosure. It does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.

As shown in FIG. 1, the distributed system 100 includes multiple nodes, such as nodes 111, 112, 113, 114, 121, 122, 123, 124, 131, 132, 133, and 134.

The multiple nodes may be based on different hardware architectures. In this embodiment, nodes based on the same hardware architecture are called homogeneous nodes, and nodes based on different hardware architectures are called heterogeneous nodes.

For example, nodes 111, 112, 113, and 114 may be based on a GPU (Graphics Processing Unit), nodes 121, 122, 123, and 124 may be based on an NPU (Neural-network Processing Unit, an embedded neural network processor), and nodes 131, 132, 133, and 134 may be based on a CPU (Central Processing Unit). On this basis, nodes 111, 112, 113, and 114 are homogeneous with one another, nodes 121, 122, 123, and 124 are homogeneous with one another, and nodes 131, 132, 133, and 134 are homogeneous with one another. Nodes 111, 112, 113, and 114 are heterogeneous with respect to nodes 121, 122, 123, and 124; nodes 111, 112, 113, and 114 are heterogeneous with respect to nodes 131, 132, 133, and 134; and nodes 121, 122, 123, and 124 are heterogeneous with respect to nodes 131, 132, 133, and 134.

According to an embodiment of the present disclosure, homogeneous nodes may, for example, be grouped into a homogeneous node set. A master node is then determined in each homogeneous node set, and the master nodes of all the homogeneous node sets form a heterogeneous node set. For example, nodes 111, 112, 113, and 114 may form homogeneous node set 10, with node 111 determined as the master node of homogeneous node set 10. Nodes 121, 122, 123, and 124 may form homogeneous node set 20, with node 121 determined as the master node of homogeneous node set 20. Nodes 131, 132, 133, and 134 may form homogeneous node set 30, with node 131 determined as the master node of homogeneous node set 30.

According to an embodiment of the present disclosure, when a distributed computing task is executed, the master node of each homogeneous node set may send the data to be computed to the nodes in its own homogeneous node set. The master node may then aggregate the calculation results of all nodes in its homogeneous node set to obtain an aggregation result.

According to an embodiment of the present disclosure, heterogeneous nodes differ in absolute computing power and therefore in the speed at which they complete computing tasks, so after computation within the homogeneous node sets is complete, data must be synchronized between the homogeneous node sets. To this end, each master node may share its aggregation result with the other master nodes in the heterogeneous node set. Each master node then continues with subsequent computation based on the shared aggregation results. According to an embodiment of the present disclosure, data synchronization between homogeneous node sets may be performed periodically, and the data synchronization period may be determined according to actual needs. It can be understood that, within each data synchronization period, the number of computations performed in each homogeneous node set may differ: a homogeneous node set with greater absolute computing power performs more computations, and a homogeneous node set with less absolute computing power performs fewer.
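The relationship between the synchronization period and the number of computations per node set can be illustrated with a minimal sketch. It assumes, purely for illustration, that each homogeneous node set takes a fixed time per computation; the names `seconds_per_step` and `sync_period` and the example timings are hypothetical and do not come from the disclosure.

```python
def local_steps_per_sync(seconds_per_step: dict, sync_period: float) -> dict:
    """Return how many whole computations each node set finishes in one sync period.

    seconds_per_step maps a (hypothetical) node-set name to the assumed time
    one computation takes on that set's hardware.
    """
    if sync_period <= 0:
        raise ValueError("sync_period must be positive")
    return {name: int(sync_period // step) for name, step in seconds_per_step.items()}


# Assumed timings: the faster set completes more computations per period.
steps = local_steps_per_sync(
    {"gpu_set": 0.5, "npu_set": 1.0, "cpu_set": 4.0},
    sync_period=8.0,
)
```

Under these assumed timings, the set with the shortest time per computation performs the most computations between two synchronizations.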

Exemplarily, an optimal or near-optimal communication method may be used inside a homogeneous node set. If the nodes in homogeneous node set 10 are all GPU-based, the nodes in homogeneous node set 10 may communicate using a GPU-specific collective communication library. If the nodes in homogeneous node set 20 are all NPU-based, the nodes in homogeneous node set 20 may communicate using an NPU-specific collective communication library. In addition, a general-purpose communication method may be used between heterogeneous nodes; for example, master nodes 111, 121, and 131 may use a general-purpose collective communication library suitable for GPUs, CPUs, and NPUs.
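The choice of communication method described above might be sketched as follows. The backend labels are placeholders invented for this example; the disclosure does not name any specific library, and falling back to the general-purpose library for hardware types without a dedicated one is an assumption.

```python
# Hypothetical labels standing in for real collective communication libraries.
DEDICATED_BACKENDS = {
    "GPU": "gpu_collective_lib",
    "NPU": "npu_collective_lib",
}
GENERIC_BACKEND = "generic_collective_lib"


def pick_backend(hardware_types: set) -> str:
    """Pick a communication backend for a group of nodes.

    A group whose members all share one hardware type uses that type's
    dedicated library when one is known; any mixed group (for example the
    master nodes of different node sets) uses the general-purpose library.
    """
    if len(hardware_types) == 1:
        (hardware_type,) = hardware_types
        return DEDICATED_BACKENDS.get(hardware_type, GENERIC_BACKEND)
    return GENERIC_BACKEND


intra_set_10 = pick_backend({"GPU"})                 # nodes 111-114
inter_masters = pick_backend({"GPU", "NPU", "CPU"})  # masters 111, 121, 131
```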

With the method for executing distributed computing tasks according to the embodiments of the present disclosure, nodes can be grouped according to their hardware information, with homogeneous nodes placed in the same homogeneous node set. Communication between homogeneous node sets is carried out by the master nodes of those sets, which makes distributed computing across heterogeneous nodes possible. In addition, a dedicated communication method can be used inside each homogeneous node set, which uses hardware resources more efficiently and improves computing efficiency.

In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the computing data, hardware information, and other data involved all comply with relevant laws and regulations and do not violate public order and good customs.

The method for executing distributed computing tasks provided by the present disclosure is described below with reference to FIG. 2. The method may be executed, for example, by a master node.

FIG. 2 schematically shows a flowchart of a method for executing a distributed computing task according to an embodiment of the present disclosure.

As shown in FIG. 2, the method 200 for executing a distributed computing task includes, in operation S210a, the master node sending the computing data to be processed that is related to the distributed computing task to each first node in the first node set.

According to an embodiment of the present disclosure, the local hardware information may be, for example, the hardware information of the master node. The hardware information may include, for example, a hardware type. A first node may be, for example, a node whose hardware information matches that of the master node. The first node set may include, for example, at least one first node. Exemplarily, in this embodiment, if the hardware information of two nodes matches, the two nodes are homogeneous with each other.

According to an embodiment of the present disclosure, the distributed computing task may include, for example, a training task of a deep learning model, and the computing data to be processed may include, for example, model parameters.

Then, each first node performs the following operations S210b to S230b.

In operation S210b, the computing data to be processed is received from the master node.

In operation S220b, a calculation operation is performed according to the computing data to be processed to obtain a first calculation result.

According to an embodiment of the present disclosure, the calculation operation may include, for example, gradient calculation.

In operation S230b, the first calculation result is sent to the master node.

Next, the master node continues with operations S220a to S240a.

In operation S220a, the first calculation results of each first node for the computing data to be processed are aggregated to obtain a first aggregation result.

According to an embodiment of the present disclosure, the master node may, for example, compute a weighted sum of the first calculation results according to the weight of each first node to obtain the first aggregation result. The weight of each first node can be set according to actual needs.

According to another embodiment of the present disclosure, the master node may instead, for example, average all the first calculation results to obtain the first aggregation result.
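The two aggregation options in operation S220a can be sketched with one function. This is an illustration only: it assumes each first calculation result is a flat list of floats (for example, a flattened gradient), and the function name `aggregate` is hypothetical.

```python
from typing import Optional, Sequence


def aggregate(
    results: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> list:
    """Combine per-node results element by element.

    With explicit weights this is a weighted sum; with weights omitted,
    every node gets weight 1/n, which yields the plain average.
    """
    if not results:
        raise ValueError("at least one result is required")
    if weights is None:
        weights = [1.0 / len(results)] * len(results)
    if len(weights) != len(results):
        raise ValueError("one weight per result is required")
    width = len(results[0])
    return [
        sum(w * r[i] for w, r in zip(weights, results))
        for i in range(width)
    ]


first_results = [[1.0, 2.0], [3.0, 4.0]]                 # from two first nodes
averaged = aggregate(first_results)                      # plain average
weighted = aggregate(first_results, weights=[0.25, 0.75])
```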

In operation S230a, the first aggregation result is shared with each second node in the second node set to obtain a shared result, and the shared result is determined as the new computing data to be processed.

According to an embodiment of the present disclosure, a second node may be, for example, a node whose hardware information does not match that of the master node. The second node set may include, for example, at least one second node. Exemplarily, in this embodiment, if the hardware information of two nodes does not match, the two nodes are heterogeneous with respect to each other.

According to an embodiment of the present disclosure, the average of the first aggregation result and all second aggregation results may, for example, be calculated as the shared result.

According to another embodiment of the present disclosure, the first aggregation result and all second aggregation results may instead, for example, be combined in a weighted sum according to the weight of each master node to obtain the shared result. The weight of each master node can be set according to actual needs.
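A minimal in-process sketch of the sharing step in operation S230a follows. It assumes that every master node contributes its own aggregation result and that all master nodes end up holding the same shared result; the function name `share_among_masters` is hypothetical, and the dictionary stands in for whatever collective communication the masters actually use.

```python
from typing import Optional


def share_among_masters(aggregates: dict, weights: Optional[dict] = None) -> dict:
    """Combine the aggregation results of all master nodes.

    aggregates maps a master node identifier to that master's aggregation
    result. Returns a mapping that gives every master the same shared result:
    a weighted sum when weights are supplied, otherwise the plain average.
    """
    if not aggregates:
        raise ValueError("at least one master node is required")
    masters = sorted(aggregates)
    if weights is None:
        weights = {m: 1.0 / len(masters) for m in masters}
    width = len(aggregates[masters[0]])
    shared = [
        sum(weights[m] * aggregates[m][i] for m in masters)
        for i in range(width)
    ]
    return {m: list(shared) for m in masters}


# Master node identifiers taken from FIG. 1; the values are made up.
per_master = {"111": [1.0, 1.0], "121": [2.0, 4.0], "131": [3.0, 7.0]}
shared = share_among_masters(
    per_master, weights={"111": 0.5, "121": 0.25, "131": 0.25}
)
```

Each master would then treat its entry in `shared` as the new computing data to be processed.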

In operation S240a, it is determined whether execution of the distributed computing task is complete. If the distributed computing task has not been completed, the method returns to operation S220a. If the distributed computing task has been completed, the method ends.

According to an embodiment of the present disclosure, whether execution of the distributed computing task is complete may be determined, for example, by whether the number of execution rounds has reached a predetermined number. If the number of execution rounds has reached the predetermined number, the distributed computing task is determined to be complete. If it has not, the distributed computing task is determined not to be complete.
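Putting the master-side operations together, the loop might look like the sketch below. It is a single-process illustration under several assumptions: first nodes are modeled as plain callables rather than remote machines, aggregation is the plain average, sharing is injected as a callable, and the names `run_master`, `workers`, and `share` are hypothetical.

```python
from typing import Callable, Sequence


def run_master(
    initial: list,
    workers: Sequence[Callable[[list], list]],
    share: Callable[[list], list],
    max_rounds: int,
) -> list:
    """Repeat send, aggregate, and share for a predetermined number of rounds."""
    pending = initial
    for _ in range(max_rounds):  # S240a: stop once the round count is reached
        # S210a plus S210b-S230b: each first node computes on the pending data.
        results = [worker(pending) for worker in workers]
        # S220a: aggregate the first calculation results (plain average here).
        count = len(results)
        aggregated = [sum(column) / count for column in zip(*results)]
        # S230a: share with the other masters; the shared result becomes
        # the new computing data to be processed.
        pending = share(aggregated)
    return pending


# Toy first nodes that add a constant, and a no-op share (one master only).
toy_workers = [
    lambda data: [x + 1.0 for x in data],
    lambda data: [x + 3.0 for x in data],
]
final = run_master([0.0], toy_workers, share=lambda data: data, max_rounds=3)
```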

With the method for executing distributed computing tasks according to the embodiments of the present disclosure, heterogeneous nodes can be used for distributed computing, hardware resources can be used more efficiently, and computing efficiency can be improved.

According to another embodiment of the present disclosure, the master node may also perform the calculation operation on the computing data to be processed to obtain a second calculation result. In this case, after receiving the first calculation result from each first node, the master node may compute a weighted sum of the second calculation result and the first calculation results of the first nodes to obtain the first aggregation result. The weights of the master node and of each first node can be set according to actual needs. According to another embodiment of the present disclosure, the master node may instead, for example, average the second calculation result and all the first calculation results to obtain the first aggregation result.

The method for determining the first node set and the second node set provided by the present disclosure is described below with reference to FIG. 3. The method for executing a distributed computing task may be executed, for example, by a master node.

FIG. 3 schematically shows a flowchart of a method for determining a first node set and a second node set according to an embodiment of the present disclosure.

As shown in FIG. 3, the method for determining the first node set and the second node set may include, in operation S310, acquiring local hardware information and the hardware information of multiple nodes.

According to an embodiment of the present disclosure, the hardware information may include, for example, a hardware type. The hardware type may include, for example, GPU, CPU, and NPU.

In operation S320, the nodes among the multiple nodes whose hardware information matches the local hardware information are determined as first nodes, yielding the first node set.

According to an embodiment of the present disclosure, nodes with the same hardware type as the master node may, for example, be determined as first nodes. All the first nodes together form the first node set.

In operation S330, among the nodes whose hardware information does not match the local hardware information, the master nodes are determined as second nodes, yielding the second node set.

According to an embodiment of the present disclosure, nodes whose hardware type differs from that of the master node may, for example, be identified first. The master nodes among these nodes are then determined as second nodes. All the second nodes together form the second node set.
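Operations S320 and S330 can be sketched as two filters over the other nodes. The sketch assumes that "matching" means equal hardware type and that it is already known which of the other nodes are master nodes; the `NodeInfo` type and its `is_master` flag are hypothetical constructs for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    node_id: str
    hardware_type: str
    is_master: bool = False  # assumed to be known in advance for this sketch


def split_node_sets(local: NodeInfo, others: list) -> tuple:
    """Return (first node set, second node set) as seen from the local master."""
    # S320: same hardware type as the local node -> first nodes.
    first = [n for n in others if n.hardware_type == local.hardware_type]
    # S330: different hardware type and a master node -> second nodes.
    second = [
        n for n in others
        if n.hardware_type != local.hardware_type and n.is_master
    ]
    return first, second


# A subset of the FIG. 1 example, seen from master node 111.
local = NodeInfo("111", "GPU", is_master=True)
others = [
    NodeInfo("112", "GPU"), NodeInfo("113", "GPU"), NodeInfo("114", "GPU"),
    NodeInfo("121", "NPU", is_master=True), NodeInfo("122", "NPU"),
    NodeInfo("131", "CPU", is_master=True), NodeInfo("132", "CPU"),
]
first_set, second_set = split_node_sets(local, others)
```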

The method for determining the master node provided by the present disclosure is described below with reference to FIG. 4. The method for determining the master node may be performed, for example, by a first node.

FIG. 4 schematically shows a flowchart of a method for determining a master node according to an embodiment of the present disclosure.

As shown in FIG. 4, the method for determining a master node may include, in operation S410, acquiring the hardware information of multiple nodes.

According to an embodiment of the present disclosure, the hardware information may include, for example, a hardware type and a node identifier.

In operation S420, a master node is determined among the multiple nodes according to the hardware information of the multiple nodes.

According to an embodiment of the present disclosure, the nodes among the multiple nodes that have the same hardware type as the first node, that is, the homogeneous nodes, may be identified. The master node among these homogeneous nodes is then determined according to their node identifiers. Exemplarily, a node identifier may include a number, and the master node among the homogeneous nodes may be determined according to the magnitude of that number. For example, the homogeneous node with the smallest number serves as the master node.
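The smallest-number rule can be sketched as follows. The sketch assumes that the node identifier contains one run of digits (as in "a0" or "111") and that the hardware information of all nodes, including the local one, is available as a mapping; the helper names are hypothetical.

```python
import re


def numeric_part(node_id: str) -> int:
    """Extract the first run of digits from a node identifier."""
    match = re.search(r"\d+", node_id)
    if match is None:
        raise ValueError(f"node identifier {node_id!r} contains no number")
    return int(match.group())


def elect_master(local_type: str, hardware_by_node: dict) -> str:
    """Return the homogeneous node with the smallest number in its identifier.

    hardware_by_node maps every node identifier (the local node included)
    to its hardware type.
    """
    peers = [n for n, hw in hardware_by_node.items() if hw == local_type]
    if not peers:
        raise ValueError(f"no node of hardware type {local_type!r}")
    return min(peers, key=numeric_part)


nodes = {"b2": "NPU", "a1": "GPU", "b0": "NPU", "a0": "GPU", "b1": "NPU"}
npu_master = elect_master("NPU", nodes)
gpu_master = elect_master("GPU", nodes)
```

Because every homogeneous node applies the same rule to the same information, each of them would arrive at the same master without further coordination.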

According to an embodiment of the present disclosure, after the first node set is determined, the master node may establish a first communication domain with each first node in the first node set. The master node may then, for example, broadcast the computing data to be processed in the first communication domain, so that the computing data to be processed is sent to each first node in the first node set. Correspondingly, each first node may also establish the first communication domain with the master node and then send its first calculation result to the master node through the first communication domain.

According to another embodiment of the present disclosure, the nodes may also be grouped in advance. Nodes whose hardware information matches form a homogeneous node set. A master node is then determined in each homogeneous node set, and the master nodes of all the homogeneous node sets form a heterogeneous node set. For each master node, the other nodes in the homogeneous node set to which it belongs constitute the first node set, and the other master nodes in the heterogeneous node set to which it belongs constitute the second node set.

The method for executing distributed computing tasks shown above is further described below with reference to FIG. 5 and in conjunction with a specific embodiment. Those skilled in the art will understand that the following exemplary embodiment is provided only to aid understanding of the present disclosure, and the present disclosure is not limited thereto.

FIG. 5 schematically shows a diagram of a method for executing a distributed computing task according to another embodiment of the present disclosure.

As shown in FIG. 5, the node identifiers may include a0, a1, a2, b0, b1, b2, b3, c0, c1, c2, c3, d0, d1, d2, and d3. The nodes may each start independently and use a built-in program to obtain their own hardware type, node identifier, network address, and other information. The network address may include, for example, an IP address.

Each node sends its hardware type, node identifier, and network address to a registration center so that the registration center can register the node. After successful registration, each node obtains global information from the registration center; the global information includes the hardware types, node identifiers, and network addresses of all nodes. Exemplarily, the registration center may be implemented with a pre-declared master (control) node or with a service such as etcd, which is a distributed key-value database.
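The registration flow can be illustrated with an in-memory stand-in for the registration center. This is not an etcd client and not the disclosure's implementation: the `InMemoryRegistry` class, its method names, and the example addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    node_id: str
    hardware_type: str
    address: str


class InMemoryRegistry:
    """Toy registration center that keeps all registrations in a dictionary."""

    def __init__(self) -> None:
        self._nodes = {}

    def register(self, registration: Registration) -> None:
        # Registering the same node identifier again overwrites the old entry.
        self._nodes[registration.node_id] = registration

    def global_info(self) -> list:
        """Return every registered node, ordered by node identifier."""
        return sorted(self._nodes.values(), key=lambda r: r.node_id)


registry = InMemoryRegistry()
registry.register(Registration("b0", "NPU", "192.0.2.20"))
registry.register(Registration("a0", "GPU", "192.0.2.10"))
registry.register(Registration("a1", "GPU", "192.0.2.11"))
info = registry.global_info()
```

In a real deployment each node would query the registration center over the network; here `global_info` simply returns the stored entries.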

According to an embodiment of the present disclosure, each node uses its own hardware information and the hardware information of the other nodes to determine which of the other nodes are homogeneous nodes, whose hardware information matches its own, and which are heterogeneous nodes, whose hardware information does not match its own. Mutually homogeneous nodes form a homogeneous node set. For example, nodes a0, a1, a2 and a3 form homogeneous node set g1, nodes b0, b1, b2 and b3 form homogeneous node set g2, nodes c0, c1, c2 and c3 form homogeneous node set g3, and nodes d0, d1, d2 and d3 form homogeneous node set g4.

According to an embodiment of the present disclosure, each node may sort all nodes by node identifier. By way of example, in this embodiment the master node may be determined from the numeric index in the node identifier: the node numbered 0 in each homogeneous node set serves as the master node, so a0, b0, c0 and d0 are the master nodes of homogeneous node sets g1, g2, g3 and g4, respectively. Nodes a0, b0, c0 and d0 form heterogeneous node set g5.
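The grouping and master selection described above can be sketched as follows. This is an illustration under stated assumptions: hardware information is reduced to a single string and "matching" means string equality, and the helper names (`numeric_index`, `group_by_hardware`, `elect_masters`, `node_sets_for`) are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# (node identifier, hardware type) pairs, as a node might read them from the registry.
GLOBAL_INFO: List[Tuple[str, str]] = [
    (f"{prefix}{i}", f"type-{prefix}") for prefix in "abcd" for i in range(4)
]


def numeric_index(node_id: str) -> int:
    """Extract the trailing number from an identifier such as 'a0'."""
    match = re.search(r"(\d+)$", node_id)
    if match is None:
        raise ValueError(f"no numeric index in {node_id!r}")
    return int(match.group(1))


def group_by_hardware(info: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build homogeneous node sets: nodes with equal hardware type share a set."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node_id, hardware_type in info:
        groups[hardware_type].append(node_id)
    return {hw: sorted(ids, key=numeric_index) for hw, ids in groups.items()}


def elect_masters(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick the node with the smallest numeric index in each set as its master."""
    return {hw: min(ids, key=numeric_index) for hw, ids in groups.items()}


def node_sets_for(master: str, groups: Dict[str, List[str]],
                  masters: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (first node set, second node set) from the point of view of a master."""
    own_group = next(ids for ids in groups.values() if master in ids)
    first_set = [n for n in own_group if n != master]
    second_set = sorted(m for m in masters.values() if m != master)
    return first_set, second_set


groups = group_by_hardware(GLOBAL_INFO)
masters = elect_masters(groups)
heterogeneous_set = sorted(masters.values())
first_set, second_set = node_sets_for("a0", groups, masters)
```

Because every node runs the same deterministic rule on the same global information, all nodes arrive at the same master nodes without further coordination.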

Each node then starts its own worker process and establishes communication domains using the obtained network addresses. A master node establishes both a homogeneous communication domain and a heterogeneous communication domain, while a non-master node establishes a homogeneous communication domain. Each master node broadcasts first data in its homogeneous communication domain, so that the first data is sent to every homogeneous node in that domain.

Next, each node performs a predetermined calculation operation on the first data to obtain a calculation result.

Whenever a synchronization period arrives, each non-master node sends the calculation result it has computed to the master node of the homogeneous node set to which it belongs.

Each master node receives the calculation results from the nodes in its homogeneous node set. It then aggregates its own calculation result with the calculation results of the other nodes in that set to obtain aggregated data. Next, it broadcasts the aggregated data in the heterogeneous communication domain, so that the data is sent to the master nodes of the other homogeneous node sets.

Each master node receives the aggregated data sent by the other master nodes. Based on the aggregated data of the other master nodes and its own aggregated data, it determines the updated computing data to be processed. It then sends this computing data to every node in its own homogeneous node set, so that the next round of calculation is performed on the new computing data to be processed.
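The synchronization round described above can be simulated in a single process as a rough sketch. Several things here are assumptions made for illustration: the data is a single float, the per-node calculation `toy_compute` is invented, aggregation inside a homogeneous set uses equal weights (a plain mean), and the shared result is the average of the masters' aggregated data. No real communication library is used.

```python
from statistics import fmean
from typing import Callable, Dict, List

# Homogeneous node sets from the example; the first entry of each list is the master.
GROUPS: Dict[str, List[str]] = {
    "g1": ["a0", "a1", "a2", "a3"],
    "g2": ["b0", "b1", "b2", "b3"],
    "g3": ["c0", "c1", "c2", "c3"],
    "g4": ["d0", "d1", "d2", "d3"],
}


def toy_compute(node_id: str, data: float) -> float:
    """Hypothetical per-node calculation: add the node's numeric index to the data."""
    return data + int(node_id[1:])


def run_round(data: float, compute: Callable[[str, float], float]) -> float:
    """Simulate one synchronization round and return the new data to be processed."""
    aggregated: Dict[str, float] = {}
    for members in GROUPS.values():
        master = members[0]
        # The master broadcasts the data in its homogeneous domain; every node,
        # including the master itself, computes on it.
        results = [compute(node, data) for node in members]
        # The master aggregates the results of its own homogeneous node set.
        aggregated[master] = fmean(results)
    # Masters exchange aggregated data in the heterogeneous domain; each master
    # derives the same shared result, which becomes the next round's input.
    return fmean(list(aggregated.values()))


data = 0.0
for _ in range(3):  # three synchronization rounds
    data = run_round(data, toy_compute)
print(data)
```

Only the masters take part in the cross-hardware exchange, so heterogeneous communication involves one node per hardware type rather than every node.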

In the related art, collective communication is designed for homogeneous hardware and does not support communication between heterogeneous hardware.

FIG. 6 schematically shows a block diagram of an apparatus for executing distributed computing tasks according to an embodiment of the present disclosure.

As shown in FIG. 6, the apparatus 600 for executing distributed computing tasks includes a first sending module 610, an aggregation module 620, a sharing module 630 and a determining module 640.

The first sending module 610 is configured to send computing data to be processed related to the distributed computing task to each first node in the first node set, wherein the hardware information of the first node matches the local hardware information.

The aggregation module 620 is configured to aggregate the first calculation result of each first node for the computing data to be processed, to obtain a first aggregation result.

The sharing module 630 is configured to share the first aggregation result with each second node in the second node set to obtain a shared result, wherein the hardware information of the second node does not match the local hardware information.

The determining module 640 is configured to determine the shared result as new computing data to be processed and to return to the operation of sending the computing data to be processed to each first node in the first node set, until execution of the distributed computing task is complete.
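One iteration carried out by modules 610 to 640 can be written compactly as follows. This is only a notational sketch: it assumes the weighted summation and the averaging recited in the claims, and the symbols are introduced here rather than taken from the original. $x^{(t)}$ is the computing data to be processed in round $t$, $K$ is the number of homogeneous node sets, $r_{k,i}^{(t)}$ is the calculation result of node $i$ in set $k$ (with $i = 0$ denoting the master node), $f_{k,i}$ is that node's calculation operation, and $w_{k,i}$ are the aggregation weights.

```latex
\begin{aligned}
r_{k,i}^{(t)} &= f_{k,i}\bigl(x^{(t)}\bigr), && i = 0, \dots, n_k, \\
A_k^{(t)} &= \sum_{i=0}^{n_k} w_{k,i}\, r_{k,i}^{(t)}, && k = 1, \dots, K, \\
x^{(t+1)} &= \frac{1}{K} \sum_{k=1}^{K} A_k^{(t)}.
\end{aligned}
```

The first line corresponds to the calculation on each node after the first sending module has distributed $x^{(t)}$, the second to the aggregation module, and the third to the sharing and determining modules.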

FIG. 7 schematically shows a block diagram of an apparatus for executing distributed computing tasks according to another embodiment of the present disclosure.

As shown in FIG. 7, the apparatus 700 for executing distributed computing tasks includes a receiving module 710, a calculation module 720 and a sending module 730.

The receiving module 710 is configured to receive computing data to be processed from the master node.

The calculation module 720 is configured to perform a calculation operation on the computing data to be processed to obtain a first calculation result.

The sending module 730 is configured to send the first calculation result to the master node.
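The receive, calculate and send cycle of a first (non-master) node can be sketched with in-process queues standing in for the communication domain. This is an illustration only: the function name `first_node_worker`, the `None` shutdown sentinel and the squaring operation are all assumptions, and a real node would use a network transport instead of `queue.Queue`.

```python
import queue
import threading
from typing import Callable


def first_node_worker(inbox: queue.Queue, outbox: queue.Queue,
                      compute: Callable[[float], float]) -> None:
    """Loop of a first node: receive data, compute, send the result back."""
    while True:
        data = inbox.get()         # role of the receiving module
        if data is None:           # hypothetical shutdown sentinel
            break
        result = compute(data)     # role of the calculation module
        outbox.put(result)         # role of the sending module


# The two queues stand in for the link between the master node and this node.
inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()
worker = threading.Thread(
    target=first_node_worker, args=(inbox, outbox, lambda x: x * x)
)
worker.start()

inbox.put(3.0)                     # master sends computing data to be processed
result = outbox.get(timeout=5)     # master receives the first calculation result
inbox.put(None)                    # ask the worker to stop
worker.join()
```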

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

FIG. 8 schematically shows a block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the disclosure described and/or claimed herein.

As shown in FIG. 8, the device 800 includes a computing unit 801, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802 and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Multiple components of the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or an optical disc; and a communication unit 809, such as a network card, a modem or a wireless communication transceiver. The communication unit 809 allows the device 800 to exchange information and data with other devices over a computer network such as the Internet and/or various telecommunication networks.

The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller or microcontroller. The computing unit 801 executes the methods and processes described above, for example the method for executing distributed computing tasks. For example, in some embodiments, the method for executing distributed computing tasks may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the method for executing distributed computing tasks described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured in any other appropriate way (for example, by means of firmware) to perform the method for executing distributed computing tasks.

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions and operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer that has a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory or tactile feedback), and input from the user may be received in any form (including acoustic, speech or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN) and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.

The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.

It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of the present disclosure can be achieved; no limitation is imposed herein.

The specific implementations described above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims

1. A method for performing distributed computing tasks, applied to a master node, the method comprising:

sending computing data to be processed related to the distributed computing task to each first node in a first node set through a first communication domain, wherein hardware information of the first node matches local hardware information, the first communication domain is established between the master node and each first node in the first node set, and the local hardware information is hardware information of the master node;

aggregating, through the first communication domain, a first calculation result of each first node for the computing data to be processed, to obtain a first aggregation result;

sharing the first aggregation result with each second node in a second node set through a second communication domain to obtain a shared result, wherein hardware information of the second node does not match the local hardware information, and the second communication domain is established between the master node and each second node in the second node set; and

determining the shared result as new computing data to be processed, and returning to the operation of sending the computing data to be processed related to the distributed computing task to each first node in the first node set, until execution of the distributed computing task is complete.

2. The method of claim 1, further comprising:

obtaining the local hardware information and hardware information of a plurality of nodes;

determining a node, among the plurality of nodes, whose hardware information matches the local hardware information as a first node, to obtain the first node set; and

determining a master node, among the plurality of nodes, whose hardware information does not match the local hardware information as a second node, to obtain the second node set.

3. The method of claim 1 or 2, further comprising:

establishing a first communication domain with each first node in the set of first nodes;

wherein the sending of the computing data to be processed related to the distributed computing task to each first node in the first node set comprises:

broadcasting the computing data to be processed in the first communication domain.

4. The method according to claim 1, wherein the aggregating of the first calculation result of each first node for the computing data to be processed to obtain the first aggregation result comprises:

performing a calculation operation on the computing data to be processed to obtain a second calculation result;

receiving a first calculation result from each of said first nodes; and

performing a weighted summation of the second calculation result and the first calculation result of each first node to obtain the first aggregation result.

5. The method of claim 1 or 2, further comprising:

establishing a second communication domain with each second node in the set of second nodes;

wherein the sharing of the first aggregation result with each second node in the second node set to obtain the shared result comprises:

broadcasting the first aggregation result in the second communication domain;

receiving a second aggregation result from each of the second nodes; and

determining the shared result according to the first aggregation result and the second aggregation result.

6. The method according to claim 5, wherein the determining of the shared result according to the first aggregation result and the second aggregation result comprises:

calculating an average value of the first aggregation result and the second aggregation result as the shared result.

7. A method for performing a distributed computing task, applied to a first node, the method comprising:

receiving computing data to be processed from a master node through a first communication domain, wherein the master node and the first node belong to the same node set, hardware information of the nodes in the node set matches, and the first communication domain is established between the master node and one or more first nodes in the node set other than the master node;

performing calculation operations according to the calculation data to be processed to obtain a first calculation result; and

sending the first calculation result to the master node through the first communication domain;

wherein the master node is the master node according to any one of claims 1-6.

8. The method of claim 7, further comprising:

obtaining hardware information and node identifiers of a plurality of nodes; and

determining the master node among the plurality of nodes according to the hardware information and node identifiers of the plurality of nodes.

9. The method of claim 8, further comprising:

establishing a first communication domain with the master node;

wherein the sending of the first calculation result to the master node comprises:

sending the first calculation result to the master node through the first communication domain.

10. An apparatus for performing distributed computing tasks, applied to a master node, comprising:

a first sending module, configured to send computing data to be processed related to the distributed computing task to each first node in a first node set through a first communication domain, wherein hardware information of the first node matches local hardware information, the first communication domain is established between the master node and each first node in the first node set, and the local hardware information is hardware information of the master node;

an aggregation module, configured to aggregate, through the first communication domain, a first calculation result of each first node for the computing data to be processed, to obtain a first aggregation result;

a sharing module, configured to share the first aggregation result with each second node in a second node set through a second communication domain to obtain a shared result, wherein hardware information of the second node does not match the local hardware information, and the second communication domain is established between the master node and each second node in the second node set; and

a determining module, configured to determine the shared result as new computing data to be processed, and to return to the operation of sending the computing data to be processed related to the distributed computing task to each first node in the first node set, until execution of the distributed computing task is complete.

11. An apparatus for performing distributed computing tasks, applied to a first node, comprising:

a receiving module, configured to receive computing data to be processed from a master node through a first communication domain, wherein the master node and the first node belong to the same node set, hardware information of the nodes in the node set matches, and the first communication domain is established between the master node and one or more first nodes in the node set other than the master node;

a calculation module, configured to perform a calculation operation on the computing data to be processed to obtain a first calculation result; and

a sending module, configured to send the first calculation result to the master node through the first communication domain;

wherein the master node is the master node according to claim 10.

12. An electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-9.

13. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method according to any one of claims 1-9.