
CN110609744B - Method, apparatus and computer program product for processing computing tasks - Google Patents


Info

Publication number
CN110609744B
Authority
CN
China
Prior art keywords
computing
computing resources
group
processing
resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810621213.4A
Other languages
Chinese (zh)
Other versions
CN110609744A (en)
Inventor
赵军平
王鲲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
EMC Corp
Original Assignee
EMC IP Holding Co LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by EMC IP Holding Co LLC filed Critical EMC IP Holding Co LLC
Priority to CN201810621213.4A priority Critical patent/CN110609744B/en
Priority to US16/399,625 priority patent/US11314557B2/en
Publication of CN110609744A publication Critical patent/CN110609744A/en
Application granted granted Critical
Publication of CN110609744B publication Critical patent/CN110609744B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 9/5077: Logical partitioning of resources; Management or configuration of virtualized resources
    • G06F 9/3877: Concurrent instruction execution, e.g. pipeline or look ahead, using a slave processor, e.g. coprocessor
    • G06F 9/4887: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues, involving deadlines, e.g. rate based, periodic
    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5072: Grid computing
    • G06T 1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G06F 2209/501: Performance criteria (indexing scheme relating to G06F 9/50)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Implementations of the present disclosure relate to methods, apparatuses, and computer program products for processing computing tasks. According to one exemplary implementation of the present disclosure, a method for processing computing tasks is provided. The method comprises: dividing a plurality of computing resources into a plurality of groups based on topology information describing connection relationships between the computing resources; selecting at least one computing resource from at least one of the plurality of groups; determining a processing performance for processing the computing task with the selected at least one computing resource; and allocating at least one computing resource for processing the computing task based on the processing performance. According to exemplary implementations of the present disclosure, an apparatus and a computer program product for processing computing tasks are also provided. With exemplary implementations of the present disclosure, multiple computing resources may be leveraged to process computing tasks with better processing performance.

Description

Method, apparatus and computer program product for processing computing tasks
Technical Field
Implementations of the present disclosure relate generally to computing systems that include dedicated computing resources and, more particularly, relate to methods, apparatuses, and computer program products for processing computing tasks in the computing systems utilizing dedicated computing resources.
Background
Applications on clients may be designed to utilize computing resources, such as processing and storage resources, to accomplish a variety of processing or analysis tasks. As the demands and complexity of tasks such as machine learning, deep learning, data mining, etc., continue to increase, a significant and/or variable amount of computing resources is required to meet the execution of the respective applications. This may be accomplished by a machine or system having multiple dedicated computing resources, wherein an application may be scheduled to run on one or more of the dedicated computing resources of the machine or system. For example, cloud-based computing systems have been developed that include machines with one or more dedicated computing resources. Different clients may lease computing resources (e.g., dedicated computing resources) of the system as needed to run respective applications.
With the development of computer technology, the variety of computing resources has become increasingly rich and is no longer limited to traditional computing resources such as central processing units. For example, current Graphics Processing Units (GPUs) are increasingly computationally powerful. Because of its unique nature, the GPU is particularly well suited for performing computing tasks in Deep Learning, High Performance Computing, and Machine Learning. However, for common client devices as well as conventional cloud computing devices, the performance of their graphics processing units is often limited, lacking high-performance processing capabilities. Thus, how to utilize (e.g., remotely) the computing power of the graphics processing units of other devices to handle computing tasks has become a focus of research.
However, some current solutions do not determine in an efficient manner which remote computing resource(s) (e.g., computing resources in a computing resource pool) to select to handle a computing task. It is therefore desirable to be able to provide a solution for processing computing tasks using multiple computing resources in a resource pool in a simple and efficient manner.
Disclosure of Invention
Implementations of the present disclosure provide methods, apparatus, and corresponding computer program products for processing computing tasks.
According to a first aspect of the present disclosure, a method for processing a computing task is provided. The method comprises: dividing a plurality of computing resources into a plurality of groups based on topology information describing connection relationships between the computing resources; selecting at least one computing resource from at least one of the plurality of groups; determining a processing performance for processing the computing task with the selected at least one computing resource; and allocating at least one computing resource for processing the computing task based on the processing performance.
According to a second aspect of the present disclosure, an apparatus for processing computing tasks is provided. The apparatus includes: at least one processor; a volatile memory; and a memory coupled to the at least one processor, the memory having instructions stored therein which, when executed by the at least one processor, cause the apparatus to perform actions. The actions include: dividing a plurality of computing resources into a plurality of groups based on topology information describing connection relationships between the computing resources; selecting at least one computing resource from at least one of the plurality of groups; determining a processing performance for processing the computing task with the selected at least one computing resource; and allocating at least one computing resource for processing the computing task based on the processing performance.
According to a third aspect of the present disclosure, a computer program product is provided. The computer program product is tangibly stored on a computer-readable medium and includes machine-executable instructions that, when executed, cause a machine to perform a method according to the first aspect.
The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary implementations of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary implementations of the disclosure.
FIG. 1 schematically illustrates a block diagram of an exemplary computing system suitable for implementing implementations of the present disclosure;
FIG. 2 schematically illustrates a block diagram of a process for processing a neural network model-based computational task, according to one aspect;
FIG. 3 schematically illustrates a block diagram for processing computing tasks according to one exemplary implementation of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method for processing computing tasks according to one exemplary implementation of the present disclosure;
FIG. 5A schematically illustrates a block diagram of one example topology of a plurality of computing resources according to one example implementation of the present disclosure, and FIG. 5B schematically illustrates a block diagram of another example topology of a plurality of computing resources according to another example implementation of the present disclosure;
FIG. 6A schematically illustrates a block diagram of dividing a plurality of computing resources connected according to the topology illustrated in FIG. 5A into different groupings, according to one exemplary implementation of the present disclosure; FIG. 6B schematically illustrates a block diagram of dividing a plurality of computing resources connected according to the topology illustrated in FIG. 5B into different groupings, according to one exemplary implementation of the present disclosure;
FIG. 7A schematically illustrates a block diagram of allocation of multiple computing resources to be connected according to the topology illustrated in FIG. 5A, according to one exemplary implementation of the present disclosure; FIG. 7B schematically illustrates a block diagram of allocation of multiple computing resources to be connected according to the topology illustrated in FIG. 5B, according to one exemplary implementation of the present disclosure;
FIG. 8 schematically illustrates a block diagram for acquiring parameter data associated with a neural network, according to one exemplary implementation of the present disclosure; and
FIG. 9 schematically illustrates a block diagram of an apparatus for processing computing tasks according to one exemplary implementation of the present disclosure.
Detailed Description
Preferred implementations of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred implementations of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the implementations set forth herein. Rather, these implementations are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example implementation" and "one implementation" mean "at least one example implementation". The term "another implementation" means "at least one additional implementation". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As above, dedicated computing resources may be local to the client or may be provided by a remote machine or system. In some examples, a cloud-based computing system may be deployed, including a plurality of machines with one or more dedicated computing resources. The dedicated computing resources of the computing system may be used by different clients as needed to schedule the corresponding applications to run on the available dedicated computing resources.
FIG. 1 illustrates a schematic diagram of an example computing system 100 in which implementations of the present disclosure may be implemented. A plurality of servers for application execution are deployed in the computing system 100, including server 110-1, server 110-2, server 110-3, …, server 110-U (hereinafter collectively or individually referred to as server 110, where U is a natural number greater than 1). The computing system 100 also includes dedicated computing resource 160-1, dedicated computing resource 160-2, dedicated computing resource 160-3, …, dedicated computing resource 160-V (hereinafter collectively or individually referred to as dedicated computing resource 160, where V is a natural number greater than 1). Each server 110 may have one or more dedicated computing resources 160 thereon.
In the example of FIG. 1, server 110-1 has dedicated computing resource 160-1, server 110-2 has dedicated computing resource 160-2, and server 110-U has dedicated computing resource 160-V. It will be appreciated that a server is not limited to a single computing resource; one server may have one or more computing resources. Thus, the values of U and V may be unequal here.
In the context of the present disclosure, examples of dedicated computing resources 160 may include, but are not limited to, Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), and the like. For ease of discussion, certain implementations will be described with a GPU as an example of a dedicated computing resource. In addition to dedicated computing resources 160, the server 110 may also include one or more general-purpose processing units (not shown), such as a Central Processing Unit (CPU).
FIG. 1 also shows a plurality of clients 120-1, 120-2, …, 120-P (hereinafter collectively or individually referred to as clients 120, where P is a natural number greater than 1) having applications 150-1, 150-2, …, 150-Q (hereinafter collectively or individually referred to as applications 150, where Q is a natural number greater than 1) to be run, respectively. The application 150 may be any application executable on a machine, and may be designed to perform corresponding data processing or analysis tasks. By way of example, the application 150 may perform data processing or analysis tasks related to neural networks. It will be appreciated that a client is not limited to a single application; a client may have one or more applications. Thus, the values of P and Q may not be equal here.
To enable quick and efficient running of these applications and/or to conserve local computing resources, the client 120 may request dedicated computing resources 160 on the server 110 to run these applications 150. In such an implementation, the client 120 may connect to one or more servers 110 over the interconnection network 130 and hand the applications 150 over to one or more dedicated computing resources 160 of the servers 110 to run. Depending on the interfaces supported by the client 120, the server 110, and/or the dedicated computing resources 160, the interconnection network 130 may support different types of wired or wireless connections based on various network transport technologies, such as Remote Direct Memory Access (RDMA) and Transmission Control Protocol (TCP).
It should be understood that the apparatus and/or arrangement shown in FIG. 1 is merely one example. In other examples, the computing system 100 may include any suitable number of servers 110 and clients 120. Each server 110 may be installed with any suitable number of dedicated computing resources 160, and each client 120 may have a plurality of applications 150 to be run. Further, while shown separately, the scheduler 140 may in practice be implemented by a device independent of the servers 110, or partially or wholly on one or more servers 110.
For clarity and brevity of description, example implementations of the present disclosure will be described in detail primarily with GPU cores as an example. As is known, a GPU acts as a special-purpose processor whose powerful computing power arises from its large number of cores and high-bandwidth memory. In GPU hardware architecture, one GPU typically has a large number of GPU cores, e.g., 5120 or nearly 10000 cores. A GPU core, as a kind of dedicated computing resource, is the most basic processing unit, also known as a Stream Processor (SP). Both instructions and tasks are ultimately processed on GPU cores, and multiple GPU cores execute instructions in parallel, realizing the parallel computation of the GPU. Multiple SPs, plus some other resources such as registers and shared memory, may constitute a Streaming Multiprocessor (SM).
However, it should be understood that the GPU is merely one exemplary dedicated computing resource and is not intended to limit the scope of the present disclosure. The spirit and principles described herein may be applied to other specialized computing resources, such as computing resources in accelerators like Field-Programmable Gate Arrays (FPGAs), whether currently known or developed in the future, and are not limited to GPU cores alone.
With the development of cloud computing, technical solutions for processing computing tasks based on a cloud architecture have been proposed. For example, an application 150 on a client 120 may request computing resources 160 in a server 110. It should be noted that, due to their complexity, computing tasks typically require invoking multiple computing resources 160. In the following, implementation details of the present disclosure will be described with the specific example of computing tasks based on neural network models. FIG. 2 schematically illustrates a block diagram 200 of a process for processing a neural network model-based computational task 210, according to one aspect. As shown in FIG. 2, the computing task 210 may be a computing task based on a neural network model, where the neural network may involve multiple layers, such as layer 1, layer 2, …, layer N, shown at reference numerals 212, 214, …, 216. It will be appreciated that each of layers 1 through N involves a number of parameters defining the neural network model, such as gradients, weights, and biases. The amount of parameter data varies greatly across layers; for example, the number of parameters may range from tens to millions or more. Thus, how to allocate a set of computing resources from among the multiple computing resources (e.g., computing resources 160-1 through 160-V) to handle the computing task 210 is a challenge.
It will be appreciated that solutions for handling computational tasks based on neural network models have been provided. In one solution, a set of computing resources with a lighter workload may be selected according to the configuration of the multiple computing resources. However, this solution does not give an effective assessment of the processing performance of the selected set of computing resources.
In another solution, it may be determined during a pilot run which computing resources 160 to select to process the computing task 210. During the pilot run, one or more computing resources may be selected in multiple rounds by permutation and combination, and the processing performance associated with each selection determined. In turn, the one or more computing resources associated with optimal processing performance may be allocated to process the computing task 210. However, when a large number of computing resources are included in the resource pool, a large number of combinations will arise, making it difficult to determine which computing resources to select in a short time. Based on these deficiencies in the prior art, the present disclosure proposes a method for processing computing tasks.
FIG. 3 schematically illustrates a block diagram 300 of processing computing tasks according to one exemplary implementation of the present disclosure. As shown in FIG. 3, topology information 320 for a plurality of computing resources 160-1 through 160-V in a resource pool 310 may be obtained. The plurality of computing resources 160-1 through 160-V may be partitioned into a plurality of groups 322 based on the topology information 320. At least one computing resource may then be selected from at least one of the plurality of groups 322, and a processing performance 324 associated with the selected at least one computing resource determined. Selections may be made over multiple rounds, with the corresponding processing performance 324 obtained for each round. By comparing the processing performance 324 associated with each of the multiple rounds, the computing resources associated with the optimal processing performance may be allocated for processing the computing task 210.
FIG. 4 schematically illustrates a flow chart of a method 400 for processing a computing task 210 according to one exemplary implementation of the present disclosure. As shown in fig. 4, at block 410, a plurality of computing resources may be divided into a plurality of groups based on topology information describing a connection relationship between the plurality of computing resources 160. It will be appreciated that since there are a variety of ways in which the computing resources 160 may be connected, there may be a variety of different topologies. Hereinafter, two typical topologies will be schematically shown with reference to fig. 5A and 5B.
FIG. 5A schematically illustrates a block diagram 500A of one example topology of a plurality of computing resources according to one example implementation of the present disclosure. As shown in FIG. 5A, a PCIe connection is established between computing resources 160-1 and 160-2 based on PCIe switch 510A, and a PCIe connection is established between computing resources 160-3 and 160-4 based on PCIe switch 520A. A Quick Path Interconnect (QPI) connection is established between PCIe switches 510A and 520A based on sockets 512A and 522A.
FIG. 5B schematically illustrates a block diagram 500B of another example topology of a plurality of computing resources 160 according to another example implementation of the present disclosure. As shown in FIG. 5B, taking NVIDIA GPUs as an example, computing resources 160-1, 160-2, 160-3, and 160-4 may have NVLink connections between them, as shown by solid lines, which support 72 GB/s data transfer. Further, there is a PCIe-based connection, shown by the dashed line, established between the plurality of computing resources 160 via PCIe switch 510B.
In this implementation, relevant topology information may be collected from the topology as shown in fig. 5A and 5B. It will be appreciated that only two exemplary topologies are schematically shown in fig. 5A and 5B. In other application environments, more or fewer computing resources 160 may be included, and other ways of connecting between the computing resources 160 may exist. For example, in one application environment, 8 computing resources may be included, with one set of 4 computing resources connected as shown in FIG. 5A and another set of 4 computing resources connected as shown in FIG. 5B. The two sets of computing resources may be connected in QPI fashion.
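On NVIDIA systems, one common way to obtain this kind of topology information is the `nvidia-smi topo -m` command, which prints a matrix of link types between devices. The sketch below parses a matrix of that general shape; the sample string is hard-coded to mirror the FIG. 5A topology rather than captured from a live system, and the exact output format varies by driver version.

```python
# Parse an nvidia-smi-style topology matrix into {(gpu_a, gpu_b): link_type}.
# The sample mirrors the FIG. 5A topology; on a real host the text would
# come from running `nvidia-smi topo -m`.
SAMPLE = """\
      GPU0  GPU1  GPU2  GPU3
GPU0   X    PIX   SOC   SOC
GPU1  PIX    X    SOC   SOC
GPU2  SOC   SOC    X    PIX
GPU3  SOC   SOC   PIX    X
"""

def parse_topology(text):
    lines = text.strip().splitlines()
    cols = lines[0].split()
    topo = {}
    for line in lines[1:]:
        row, *cells = line.split()
        for col, cell in zip(cols, cells):
            if cell != "X":            # "X" marks a device against itself
                topo[(row, col)] = cell
    return topo

print(parse_topology(SAMPLE)[("GPU0", "GPU1")])   # -> PIX
```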
Returning to fig. 4, at block 420, at least one computing resource may be selected from at least one of the plurality of groups. In this implementation, computing resources may be progressively selected from multiple groups. For example, in the first stage, one computing resource may be selected from a group. In the second phase, one computing resource may be selected from each of the two groups, and so on. It will be appreciated that any one of the computing resources may be selected from among the plurality of computing resources included in the group herein.
At block 430, a processing performance is determined for processing the computing task with the selected at least one computing resource. In this implementation, operations associated with computing task 210 may be deployed at the selected at least one computing resource and the processing performance of the operations performed by the selected at least one computing resource is tested. At block 440, at least one computing resource is allocated for processing the computing task based on the processing performance. In this implementation, computing resources associated with higher processing performance may be preferentially selected for processing computing tasks 210.
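To make the four blocks of method 400 concrete, the following toy sketch walks through them end to end; the group division, the bandwidth figures, and the performance metric are all made-up stand-ins, not values from the disclosure.

```python
from itertools import product

# Toy end-to-end sketch of method 400; all numbers are invented.
groups = [["gpu0", "gpu1"], ["gpu2", "gpu3"]]       # block 410: topology-based groups

# Block 430 stand-in: pretend the processing performance of a pair is the
# (made-up) bandwidth in GB/s of the link between the two selections.
fake_bandwidth = {("gpu0", "gpu2"): 16, ("gpu0", "gpu3"): 12,
                  ("gpu1", "gpu2"): 12, ("gpu1", "gpu3"): 16}

candidates = list(product(*groups))                 # block 420: one per group
best = max(candidates, key=lambda pair: fake_bandwidth[pair])
print("allocate:", best)                            # block 440: allocate the best
```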
With the above-described exemplary implementation, by dividing the plurality of computing resources 160 into a plurality of groups, one computing resource may be selected from each of the plurality of groups during subsequent selection, which reduces the number of combinations that can occur. In this way, the time and other resource overhead of the pilot-run phase may be greatly reduced, thereby increasing the efficiency of processing the computing task 210.
Specifically, assuming that there are 4 computing resources in the resource pool 310, and that the 4 computing resources can be divided into two different groupings according to implementations of the present disclosure, then computing resources only need to be selected from the 2 different groupings, respectively. According to one exemplary implementation of the present disclosure, the number of combinations is C_2^1 × C_2^1 = 4, compared with C_4^2 = 6 combinations in test runs based on conventional technical solutions, so the various overheads involved in the pilot run according to implementations of the present disclosure will be greatly reduced. It will be appreciated that in the above, the formula C_x^y represents the number of combinations that may occur when selecting y elements from x elements.
According to one exemplary implementation of the present disclosure, the topology information describes the type of connection between the multiple computing resources. To divide the multiple computing resources into multiple groups based on the topology information, a distance between the computing resources may be determined based on the connection type, and the computing resources may then be partitioned into groups based on the distance.
During processing of the computing task 210, the type of connection between the multiple computing resources becomes a critical factor affecting the processing performance of the computing task 210, because data needs to be transferred frequently between the multiple computing resources 160. With the above-described exemplary implementation, the distance between the various computing resources 160 may be determined based simply on the type of connection, and the multiple computing resources 160 may thus be partitioned into different groups in a simple and efficient manner. In Table 1 below, the types of connections that may exist are schematically shown.
Table 1 Examples of connection types

(The table body was rendered as images in the source; the entries recoverable from the surrounding text and figures are listed below.)

Connection type    Description
NV (NVLink)        Direct NVLink connection between two computing resources
PIX                Connection via a single PCIe switch
SOC                Connection traversing a socket-level link such as QPI
Based on the connection types shown in table 1, topology information between the plurality of computing resources 160 may be represented as shown in table 2 below. It will be appreciated that table 2 shows topology information related to the topology shown in fig. 5A, and that one skilled in the art may obtain topology information associated with other topologies based on the example of table 2.
Table 2 Examples of topology information

(Reconstructed for the FIG. 5A topology; the table body was rendered as images in the source. Empty cells denote a device against itself.)

         160-1   160-2   160-3   160-4
160-1            PIX     SOC     SOC
160-2    PIX             SOC     SOC
160-3    SOC     SOC             PIX
160-4    SOC     SOC     PIX
The types of connections between the multiple computing resources 160-1, 160-2, 160-3, 160-4 are shown in Table 2. It will be appreciated that the intersection between a row and a column represents the type of connection between the associated two computing resources. Specifically, the intersection of the second row and the third column represents: the type of connection between computing resources 160-1 and 160-2 is a PIX connection. The intersection of the second row and the second column represents: there is no connection between the computing resource 160-1 and itself, and is thus shown as empty.
According to one exemplary implementation of the present disclosure, to determine a distance between a plurality of computing resources based on a type, a bandwidth of a connection between the plurality of computing resources may be determined based on the type; and determining a distance based on the bandwidth, the distance being inversely proportional to the bandwidth. It will be appreciated that the higher the bandwidth of the connection between two computing resources 160, the shorter the time required to transmit the same amount of data. The bandwidth can thus be simply considered an indicator of the distance between two computing resources 160, and higher bandwidths represent closer distances. With the above exemplary implementations, distances between multiple computing resources may be quickly and efficiently determined.
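A minimal sketch of this inverse relationship follows; only the 72 GB/s NVLink figure comes from the text above, while the PIX and SOC bandwidths are placeholder assumptions for illustration.

```python
# Distance inversely proportional to link bandwidth (GB/s). Only the
# NVLink figure (72) appears in the text; the others are assumptions.
LINK_BANDWIDTH = {"NV": 72.0, "PIX": 16.0, "SOC": 10.0}

def distance(link_type, k=72.0):
    """Unitless distance score; higher bandwidth means shorter distance."""
    return k / LINK_BANDWIDTH[link_type]

print(distance("NV"), distance("PIX"), distance("SOC"))   # 1.0 4.5 7.2
```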
For example, the distance between two computing resources 160 having a connection relationship may be determined based on the bandwidth between the two computing resources 160. If there is a direct connection between two computing resources 160, the distance may be determined based on the bandwidth of the direct connection. If there is an indirect connection between two computing resources 160, the distance may be determined based on the bandwidths of multiple direct connections in the indirect connection.
According to one exemplary implementation of the present disclosure, connection types may simply be mapped to different distance scores. It will be appreciated that the distance score schematically represents the relative distance between two computing resources 160, not an exact distance value. According to one exemplary implementation of the present disclosure, the distance scores associated with the respective types may be determined using the content shown in Table 3.
Table 3 Example of distance scoring

(The table body was rendered as images in the source; the scores recoverable from Tables 4 and 5 are listed below, with longer paths assumed to map to higher scores.)

Connection type    Distance score
NV (NVLink)        1
PIX                2
SOC                higher than 2 (exact value not recoverable)
Based on the determined type of connection between the plurality of computing resources and the mapping between the types and distance scores described in Table 3 above, distances between the plurality of computing resources connected according to different topologies may be determined. Specifically, the distances between the plurality of computing resources 160 connected in accordance with the topology shown in fig. 5A are shown in table 4 below, and the distances between the plurality of computing resources 160 connected in accordance with the topology shown in fig. 5B are shown in table 5 below.
Table 4 Examples of distance tables

(Reconstructed for the FIG. 5A topology; the table body was rendered as images in the source. "s" denotes the higher cross-socket (SOC) score, whose exact value is not recoverable from the text.)

         160-1   160-2   160-3   160-4
160-1    0       2       s       s
160-2    2       0       s       s
160-3    s       s       0       2
160-4    s       s       2       0
The distance scores between the plurality of computing resources 160-1, 160-2, 160-3, 160-4 are shown in Table 4. It will be appreciated that the intersection between a row and a column represents a distance score between the associated two computing resources. Specifically, the intersection of the second row and the third column represents: the distance between computing resources 160-1 and 160-2 is scored as 2. The intersection of the second row and the second column represents: the distance between the computing resource 160-1 and itself scores 0.
Table 5 Examples of distance tables

(The table body was rendered as images in the source. From the text below, the distance score between computing resources 160-1 and 160-2 is 1 and the diagonal is 0; per FIG. 6B, the pairs 160-1/160-4 and 160-2/160-3 have the smallest scores.)
The distance scores between the plurality of computing resources 160-1, 160-2, 160-3, 160-4 are shown in Table 5. It will be appreciated that the intersection between a row and a column represents a distance score between the associated two computing resources. Specifically, the intersection of the second row and the third column represents: the distance between computing resources 160-1 and 160-2 is scored as 1. The intersection of the second row and the second column represents: the distance between the computing resource 160-1 and itself scores 0.
According to one exemplary implementation of the present disclosure, the plurality of computing resources 160 may be partitioned into different groupings by the value of the distance score. Fig. 6A schematically illustrates a block diagram 600A of dividing a plurality of computing resources connected according to the topology illustrated in fig. 5A into different groupings according to one exemplary implementation of the present disclosure. As shown in FIG. 6A, computing resources 160-1 and 160-2 may be partitioned into a first group 620A and computing resources 160-3 and 160-4 may be partitioned into a second group 622A based on a distance score 610A between the plurality of computing resources 160.
According to one exemplary implementation of the present disclosure, the grouping may be divided in different ways. For example, a graph (graph) describing distances between multiple computing resources may be constructed based on what is shown in table 4, where nodes in the graph represent computing resources, and weights of edges in the graph represent distances between two computing resources. Then, the partitioning may be performed based on the principle of graph theory, which will not be described in detail herein.
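As a simpler alternative to the graph-theoretic partitioning mentioned above, the division can also be sketched with a small union-find that merges any pair whose distance score falls at or below a threshold; the score matrix below mirrors the FIG. 5A example, with an assumed value of 5 standing in for the cross-socket score.

```python
from collections import defaultdict

# Merge pairs whose distance score is small; PIX pairs score 2 as in
# Table 4, while the cross-socket score (5) is an assumed placeholder.
scores = {("160-1", "160-2"): 2, ("160-3", "160-4"): 2,
          ("160-1", "160-3"): 5, ("160-1", "160-4"): 5,
          ("160-2", "160-3"): 5, ("160-2", "160-4"): 5}

parent = {r: r for pair in scores for r in pair}

def find(r):
    while parent[r] != r:          # walk up to the set representative
        r = parent[r]
    return r

for (a, b), score in scores.items():
    if score <= 2:                 # threshold: merge only "close" pairs
        parent[find(a)] = find(b)

groups = defaultdict(list)
for r in parent:
    groups[find(r)].append(r)
print(list(groups.values()))       # [['160-1', '160-2'], ['160-3', '160-4']]
```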
Fig. 6B schematically illustrates a block diagram 600B of dividing a plurality of computing resources connected according to the topology illustrated in fig. 5B into different groupings according to one exemplary implementation of the present disclosure. As shown in FIG. 6B, computing resources 160-1 and 160-4 may be partitioned into a first group 620B and computing resources 160-2 and 160-3 may be partitioned into a second group 622B based on a distance score 610B between the plurality of computing resources 160.
How the plurality of computing resources 160 are partitioned into different groupings has been described above with reference to fig. 6A and 6B. It will be appreciated that while only 4 computing resources 160 are partitioned into two different groupings in the above, in other application environments, other numbers of computing resources 160 may be partitioned into other numbers of groupings. For example, 16 computing resources may be divided into 4 different groupings.
According to one exemplary implementation of the present disclosure, at least a portion of the operations may be selected from the multiple operations associated with a computing task, and the selected operations performed with the selected at least one computing resource to obtain the processing performance. With the above-described exemplary implementation, only a portion of the operations of the computing task need be performed during the pilot run; by deploying this portion to run on the selected computing resources, the processing performance of deploying all of the operations of the computing task 210 on those resources can be predicted. For example, assuming that the computing task 210 involves hundreds of thousands of operations, only 5000 (or some other number of) operations need be selected. In this way, the time and other overhead of the pilot-run phase can be greatly reduced.
According to one exemplary implementation of the present disclosure, a measure of a time or a number of operations performed per unit time at which at least a portion of the operations are performed using the selected at least one computing resource may be determined. For example, continuing with the example above, the time to run the selected 5000 operations may be measured, or the number of operations processed per second may also be measured while running the selected 5000 operations. Processing performance may then be determined based on the obtained measurements. It will be appreciated that a shorter run time indicates a higher processing performance, and that a greater number of operations performed per unit time indicates a higher processing performance. With the above exemplary implementation, the processing performance associated with a selected computing resource may be determined in a quantitative manner, thereby measuring the level of processing performance in a more accurate and efficient manner.
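Either metric can be obtained with straightforward timing; in the sketch below, a dummy CPU-side loop stands in for the operations that would actually be deployed on the selected computing resources.

```python
import time

def dummy_op():
    sum(range(1000))               # stand-in for one deployed operation

N_OPS = 5000                       # the 5000-operation slice from the text
start = time.perf_counter()
for _ in range(N_OPS):
    dummy_op()
elapsed = time.perf_counter() - start

print(f"time for {N_OPS} ops: {elapsed:.3f} s")
print(f"throughput: {N_OPS / elapsed:.0f} ops/s")  # higher is better
```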
According to one exemplary implementation of the present disclosure, the first and second sets of computing resources may be selected from different numbers of groups in different phases, respectively. For example, a first set of computing resources may be selected from a first number of first groups and a second set of computing resources may be selected from a second number of second groups. In order to allocate at least one computing resource for processing a computing task based on the processing performance, a second set of computing resources is allocated for processing the computing task in response to the first processing performance of processing the computing task with the first set of computing resources being no higher than the second processing performance of processing the computing task with the second set of computing resources.
According to one exemplary implementation of the present disclosure, selecting at least one computing resource from at least one of a plurality of groups may include a plurality of stages. For example, in the first stage, one computing resource may be selected from one of the plurality of groups. Assuming that the plurality of computing resources 160 has been divided into 4 groups, in the first stage one computing resource may be selected from the first of the 4 groups and an associated processing performance 1-1 determined. In turn, one computing resource may be selected from each of the second, third, and fourth groups, respectively, and the associated processing performance 1-2, performance 1-3, and performance 1-4 determined in a similar manner. By comparing the values of performance 1-1, performance 1-2, performance 1-3, and performance 1-4, the computing resources associated with the best processing performance may be allocated to process the computing task 210.
According to one exemplary implementation of the present disclosure, after the first stage has been completed, the number of selected computing resources 160 may be gradually increased in subsequent pilot runs. Continuing with the example above, in the second stage, one computing resource may be selected from each of 2 of the 4 groups, and the associated processing performance determined. The computing resources associated with the optimal processing performance may then be allocated to process the computing task 210 by comparing the determined values of the multiple processing performances. With the above-described exemplary implementation, the number of computing resources used to process the computing task 210 may be increased continually, ensuring that the candidate combinations of selected computing resources cover a representative variety.
It will be appreciated that the greater the number of computing resources 160 selected, the greater the overhead of data transfer between the various computing resources 160 selected. Thus, as the number increases, a decrease in processing performance may result. In other words, the processing performance associated with a larger number of computing resources may be lower than the processing performance associated with a smaller number of computing resources.
According to one exemplary implementation of the present disclosure, a third computing resource is selected from a third group of the plurality of groups in response to the processing performance of processing the computing task with the first computing resource and the second computing resource being not lower than the processing performance of processing the computing task with the first computing resource alone. With the above-described exemplary implementation, when a decrease in processing performance associated with a subsequent stage is found, the best processing performance is deemed to have been found, and the selection process can stop.
Continuing with the example above, assuming that the processing performance associated with 2 computing resources is lower than the processing performance associated with 1 computing resource, the selection process ends at this point. Assuming that the processing performance associated with 2 computing resources is higher than the processing performance associated with 1 computing resource, the third stage of selection may begin at this point. Specifically, in the third stage, one computing resource may be selected from 3 groups out of 4 groups, respectively, and the associated processing performance is determined. The computing resources associated with the optimal processing performance may then be allocated to process the computing task 210 by comparing the determined values of the plurality of processing performances. Assuming that the processing performance associated with 3 computing resources is lower than the processing performance associated with 2 computing resources, then the selection process ends at this point. In this implementation, the processing performance associated with the 2 computing resources 160 is the best processing performance, and thus the combination of computing resources used to process the computing task 210 may be determined based on the computing resources that obtained the best performance.
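The staged search and its early-stop rule can be sketched as follows; perf() is a stand-in whose values simply follow the shape of the example above (performance peaks at 2 resources and declines afterwards), not a real measurement.

```python
from itertools import combinations, product

groups = [["g0", "g1"], ["g2", "g3"], ["g4", "g5"], ["g6", "g7"]]

def perf(selection):
    # Stand-in metric shaped like the example in the text.
    return {1: 10.0, 2: 14.0, 3: 12.0, 4: 9.0}[len(selection)]

best_sel, best_perf = None, float("-inf")
for n in range(1, len(groups) + 1):        # stage n: pick from n groups
    stage_best = max(
        (pick for chosen in combinations(groups, n)
              for pick in product(*chosen)),
        key=perf)
    if perf(stage_best) < best_perf:       # performance dropped: stop
        break
    best_sel, best_perf = stage_best, perf(stage_best)

print(best_sel, best_perf)                 # a 2-resource selection wins
```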
Hereinafter, how to determine the combination of computing resources for processing the computing task 210 will be described with reference to FIGS. 7A and 7B. Assume that, according to the rules for determining processing performance described above, the processing performance associated with two computing resources is determined to be optimal; one computing resource is then selected from each of the two groups. FIG. 7A schematically illustrates a block diagram 700A of allocation of multiple computing resources to be connected according to the topology illustrated in FIG. 5A, according to one exemplary implementation of the present disclosure. As shown in FIG. 7A, one computing resource may be selected from each of the first group 620A and the second group 622A, and thus the 4 combinations shown at 710A may occur:
(computing resource 160-1, computing resource 160-3)
(computing resource 160-2, computing resource 160-3)
(computing resource 160-1, computing resource 160-4)
(computing resource 160-2, computing resource 160-4)
FIG. 7B schematically illustrates a block diagram 700B of allocation of multiple computing resources to be connected according to the topology illustrated in FIG. 5B, according to one exemplary implementation of the present disclosure. As shown in FIG. 7B, one computing resource may be selected from each of the first group 620B and the second group 622B, and thus the 4 combinations shown at 710B may occur:
(computing resource 160-1, computing resource 160-2)
(computing resource 160-4, computing resource 160-2)
(computing resource 160-1, computing resource 160-3)
(computing resource 160-4, computing resource 160-3)
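Both candidate lists above are simply the cross product of the two groups; for the FIG. 7A grouping, for instance:

```python
from itertools import product

group_a = ["160-1", "160-2"]       # first group 620A
group_b = ["160-3", "160-4"]       # second group 622A

for pair in product(group_a, group_b):
    print(pair)                    # the 4 candidate combinations of FIG. 7A
```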
According to one exemplary implementation of the present disclosure, the plurality of computing resources are a plurality of graphics processing units. It will be appreciated that while various exemplary implementations are described herein with a graphics processing unit as an example, in other application environments the computing resources may also include, but are not limited to, other computing resources such as field-programmable gate arrays. According to one exemplary implementation of the present disclosure, the computing tasks are neural network model-based computing tasks. After having determined which computing resources 160 to utilize to process the computing task 210, operations associated with the computing task 210 may be deployed to the determined computing resources 160.
Fig. 8 schematically illustrates a block diagram 800 for acquiring parameter data associated with a neural network, according to one exemplary implementation of the present disclosure. As shown in fig. 8, reference numeral 810 schematically illustrates configuration information of a neural network model according to one example implementation. The multiple layers included in the neural network model and the parameters involved in each layer are defined in this configuration information 810. By parsing the configuration information 810, parameter data 830 about the neural network model can be obtained.
As shown in fig. 8, parameter data 830 is a specific example of parameter data according to one exemplary implementation of the present disclosure. As shown in parameter data 830, the neural network model may include a plurality of layers, and wherein the field "Param-size" in each row defines the number of parameters associated with each layer. As shown in line 820 of parameter data 830, a layer may include 23232 parameters; as shown in line 822 in parameter data 830, one layer may include 37748736 parameters; etc. It will be appreciated that the manner in which the parameter data 830 is obtained is not limited in the context of the present disclosure. Rather, those skilled in the art may obtain the parameter data 830 according to various technical schemes that have been developed in the prior art or will be developed in the future.
According to one exemplary implementation of the present disclosure, the number of parameters involved in each layer may first be counted in order to preferentially allocate computing resources to layers with a larger number of parameters. For example, based on the parameter data 830 as shown in fig. 8, a corresponding number of parameters associated with at least a portion of the layers may be determined. In this example, by extracting the value of the field Param-size portion in the parameter data 830, the number of parameters associated with each layer may be expressed as: [23232, 64, 307200, 192, 663552, 384, 1327104, 384, 884736, 256, 37748736, 4096, 16777216, 4096, 4100096, 1001]. In this example, the determined parameters associated with the neural network model may be deployed at the assigned computing resources 160, respectively.
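A sketch of extracting and ranking the per-layer counts follows; the sample lines are abbreviated stand-ins for the real configuration, with only the Param-size field (which does appear in FIG. 8) taken from the text and the layer names invented.

```python
import re

# Abbreviated stand-in for the parameter data 830 of FIG. 8; the layer
# names are hypothetical, only the "Param-size" field matters here.
sample = """\
layer conv1  Param-size: 23232
layer bias1  Param-size: 64
layer fc6    Param-size: 37748736
"""

param_sizes = [int(m) for m in re.findall(r"Param-size:\s*(\d+)", sample)]
print(param_sizes)                      # -> [23232, 64, 37748736]

# Allocate computing resources to the heaviest layers first.
heaviest_first = sorted(range(len(param_sizes)),
                        key=lambda i: param_sizes[i], reverse=True)
print(heaviest_first)                   # layer indices ordered by size
```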
Examples of methods according to the present disclosure have been described in detail hereinabove with reference to FIGS. 2 to 8; the methods described above may also be implemented in a corresponding apparatus. According to one exemplary implementation of the present disclosure, an apparatus for processing computing tasks is provided. The apparatus includes: a partitioning module configured to divide a plurality of computing resources into a plurality of groups based on topology information describing connection relationships between the computing resources; a selection module configured to select at least one computing resource from at least one of the plurality of groups; a determination module configured to determine a processing performance for processing a computing task with the selected at least one computing resource; and an allocation module configured to allocate at least one computing resource for processing the computing task based on the processing performance.
FIG. 9 schematically illustrates a block diagram of an apparatus for processing computing tasks according to one exemplary implementation of the present disclosure. As shown, the device 900 includes a Central Processing Unit (CPU) 901, which can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 902 or loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The various procedures and processing described above, such as method 400, may be performed by the processing unit 901. For example, in some implementations, the method 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some implementations, some or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the CPU 901, one or more steps of the method 400 described above may be performed. Alternatively, in other implementations, the CPU 901 may be configured in any other suitable manner to implement the above-described processes/methods.
According to one exemplary implementation of the present disclosure, there is provided an apparatus for processing computing tasks, comprising: at least one processor; a volatile memory; and a memory coupled to the at least one processor, the memory having instructions stored therein which, when executed by the at least one processor, cause the apparatus to perform actions. The actions include: dividing a plurality of computing resources into a plurality of groups based on topology information describing connection relationships between the computing resources; selecting at least one computing resource from at least one of the plurality of groups; determining a processing performance for processing the computing task with the selected at least one computing resource; and allocating at least one computing resource for processing the computing task based on the processing performance.
According to one exemplary implementation of the present disclosure, the topology information describes a type of connection between the plurality of computing resources, and dividing the plurality of computing resources into a plurality of groups based on the topology information includes: determining a distance between a plurality of computing resources based on the type; and dividing the plurality of computing resources into a plurality of groups based on the distance, a distance between a first computing resource included in a first group of the plurality of groups and other resources in the first group being less than a distance between the first computing resource and other resources in a second group of the plurality of groups.
According to one exemplary implementation of the present disclosure, determining a distance between a plurality of computing resources based on a type includes: determining a bandwidth of a connection between the plurality of computing resources based on the type; and determining a distance based on the bandwidth, the distance being inversely proportional to the bandwidth.
According to one exemplary implementation of the present disclosure, selecting at least one computing resource from at least one of a plurality of groups comprises: a first computing resource is selected from a first group of the plurality of groups.
According to one exemplary implementation of the present disclosure, selecting at least one computing resource from at least one of the plurality of groups further comprises: a second computing resource is selected from a second group of the plurality of groups.
According to one exemplary implementation of the present disclosure, selecting at least one computing resource from at least one of the plurality of groups further comprises: in response to processing the computing task with the first computing resource and the second computing resource having a processing performance not less than processing the computing task with the first computing resource, a third computing resource is selected from a third group of the plurality of groups.
According to one exemplary implementation of the present disclosure, determining processing performance for processing a computing task with the selected at least one computing resource includes: selecting at least a portion of operations from a plurality of operations associated with a computing task; and performing at least a portion of the selected operations with the selected at least one computing resource to obtain processing performance.
According to one exemplary implementation of the present disclosure, obtaining the processing performance includes: determining a measure of a time taken to perform the at least a portion of the operations, or of a number of operations performed per unit time, using the selected at least one computing resource; and determining the processing performance based on the measure.
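In other words, the performance may be estimated by executing only a sample of the task's operations and timing that sample; throughput is then the number of operations per unit time. A sketch using only the Python standard library, with placeholder operations standing in for work dispatched to the selected computing resources:

    import time

    def estimated_throughput(operations, sample_size=10):
        sample = operations[:sample_size]    # select at least a portion of the operations
        start = time.perf_counter()
        for op in sample:
            op()                             # execute on the selected resources
        elapsed = time.perf_counter() - start
        return len(sample) / elapsed         # operations per unit time

    ops = [lambda: sum(range(10_000))] * 100
    print(f"{estimated_throughput(ops):.1f} ops/s")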
According to one exemplary implementation of the present disclosure, selecting at least one computing resource from at least one of a plurality of groups comprises: selecting a first set of computing resources from a first number of first groups and a second set of computing resources from a second number of second groups; allocating at least one computing resource for processing a computing task based on processing performance includes: the second set of computing resources is allocated for processing the computing task in response to the first processing performance of processing the computing task with the first set of computing resources being not higher than the second processing performance of processing the computing task with the second set of computing resources.
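The comparison then reduces to keeping the higher-scoring candidate set, with ties resolved in favor of the second set (the "not higher than" condition). A one-function sketch, where the two performance values would come from the benchmarking step above:

    def choose(first_set, perf_first, second_set, perf_second):
        # Allocate the second set whenever the first set's performance is not higher.
        return second_set if perf_first <= perf_second else first_set

    print(choose(["gpu0"], 1.2, ["gpu2", "gpu3"], 1.2))  # tie -> ['gpu2', 'gpu3']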
According to one exemplary implementation of the present disclosure, the plurality of computing resources is a plurality of graphics processing units, and wherein the computing task is a neural network model-based computing task.
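For the GPU case, the sampled operations could be, for example, a few forward passes of the neural network model. The sketch below uses PyTorch purely as one possible framework; the disclosure does not mandate it, and the model and tensor sizes are invented for illustration:

    import time
    import torch

    def bench_forward(device="cuda:0", iters=20):
        # Hypothetical micro-benchmark of one GPU selection.
        model = torch.nn.Linear(1024, 1024).to(device)
        x = torch.randn(64, 1024, device=device)
        torch.cuda.synchronize(device)       # settle pending work before timing
        start = time.perf_counter()
        for _ in range(iters):
            _ = model(x)
        torch.cuda.synchronize(device)       # wait for all kernels to finish
        return iters / (time.perf_counter() - start)

    if torch.cuda.is_available():
        print(f"{bench_forward():.1f} forward passes/s")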
According to one exemplary implementation of the present disclosure, a computer program product is provided. The computer program product is tangibly stored on a computer-readable medium and includes machine-executable instructions that, when executed, cause a machine to perform a method according to the present disclosure.
According to one exemplary implementation of the present disclosure, a computer-readable medium is provided. The computer-readable medium has stored thereon machine-executable instructions that, when executed by at least one processor, cause the at least one processor to implement a method according to the present disclosure.
The present disclosure may be methods, apparatus, systems, and/or computer program products. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network interface card or network interface in each computing/processing device receives the computer readable program instructions from the network and forwards them for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some implementations, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer-readable program instructions, the electronic circuitry then executing the computer-readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the implementations of the present disclosure has been presented for purposes of illustration; it is not exhaustive and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the implementations described. The terminology used herein was chosen to best explain the principles of the implementations, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (17)

1. A method for processing computing tasks, comprising:
dividing a plurality of computing resources into a first group of groups, including a first group and a second group, based on topology information describing a connection relationship between the plurality of computing resources;
selecting a first set of computing resources from the first group and a second set of computing resources from the second group;
determining a processing performance for processing the computing task with the selected first set of computing resources and the second set of computing resources;
generating a second group of groups based on the determined processing performance, wherein each group of the second group of groups includes a respective computing resource from the first set of computing resources and a respective computing resource from the second set of computing resources, and wherein a total number of groups in the second group of groups is the number of combinations of each computing resource from the first set of computing resources with each computing resource from the second set of computing resources; and
allocating one of the second group of groups for processing the computing task based on the processing performance.
2. The method of claim 1, wherein the topology information describes a type of connection between the plurality of computing resources, and dividing the plurality of computing resources into the first group of groups based on the topology information comprises:
determining a distance between the plurality of computing resources based on the type; and
dividing the plurality of computing resources into the first group of groups based on the distances, a distance between a first computing resource included in the first group of the first group of groups and other resources in the first group being less than a distance between the first computing resource and other resources in the second group of the first group of groups.
3. The method of claim 2, wherein determining a distance between the plurality of computing resources based on the type comprises:
determining a bandwidth of the connection between the plurality of computing resources based on the type; and
determining the distance based on the bandwidth, the distance being inversely proportional to the bandwidth.
4. The method of claim 1, wherein the method further comprises:
in response to a processing performance of processing the computing task with the first set of computing resources and the second set of computing resources being not lower than a processing performance of processing the computing task with the first set of computing resources alone, selecting a third set of computing resources from a third group of the first group of groups.
5. The method of claim 1, wherein determining processing performance of processing the computing task with the selected first and second sets of computing resources comprises:
selecting at least a portion of operations from a plurality of operations associated with the computing task; and
performing the selected at least a portion of the operations with the selected first and second sets of computing resources to obtain the processing performance.
6. The method of claim 5, wherein obtaining the processing performance comprises:
determining a measure of a time taken to perform the at least a portion of the operations, or of a number of operations performed per unit time, using the selected first set of computing resources and the second set of computing resources; and
determining the processing performance based on the measure.
7. The method according to claim 1, wherein:
allocating one of the second group of groups for processing the computing task based on the processing performance includes: allocating the second set of computing resources for processing the computing task in response to a first processing performance of processing the computing task with the first set of computing resources being not higher than a second processing performance of processing the computing task with the second set of computing resources.
8. The method of claim 1, wherein the plurality of computing resources are a plurality of graphics processing units, and wherein the computing task is a neural network model-based computing task.
9. An apparatus for processing computing tasks, comprising:
at least one processor;
a volatile memory; and
a memory coupled with the at least one processor, the memory having instructions stored therein, which when executed by the at least one processor, cause the apparatus to perform actions comprising:
dividing a plurality of computing resources into a first group of groups, including a first group and a second group, based on topology information describing a connection relationship between the plurality of computing resources;
selecting a first set of computing resources from the first group and a second set of computing resources from the second group;
determining a processing performance for processing the computing task with the selected first set of computing resources and the second set of computing resources;
generating a second group of groups based on the determined processing performance, wherein each group of the second group of groups includes a respective computing resource from the first set of computing resources and a respective computing resource from the second set of computing resources, and wherein a total number of groups in the second group of groups is the number of combinations of each computing resource from the first set of computing resources with each computing resource from the second set of computing resources; and
allocating one of the second group of groups for processing the computing task based on the processing performance.
10. The apparatus of claim 9, wherein the topology information describes a type of connection between the plurality of computing resources, and dividing the plurality of computing resources into the first group of groups based on the topology information comprises:
determining a distance between the plurality of computing resources based on the type; and
dividing the plurality of computing resources into the first group of groups based on the distances, a distance between a first computing resource included in a first group of the first group of groups and other resources in the first group being less than a distance between the first computing resource and other resources in a second group of the first group of groups.
11. The apparatus of claim 10, wherein determining a distance between the plurality of computing resources based on the type comprises:
determining a bandwidth of the connection between the plurality of computing resources based on the type; and
determining the distance based on the bandwidth, the distance being inversely proportional to the bandwidth.
12. The apparatus of claim 9, wherein a third set of computing resources is selected from a third group of the first group of groups in response to a processing performance of processing the computing task with the first set of computing resources and the second set of computing resources being not lower than a processing performance of processing the computing task with the first set of computing resources alone.
13. The apparatus of claim 9, wherein determining processing performance of processing the computing task with the selected first and second sets of computing resources comprises:
selecting at least a portion of operations from a plurality of operations associated with the computing task; and
performing the selected at least a portion of the operations with the selected first and second sets of computing resources to obtain the processing performance.
14. The apparatus of claim 13, wherein obtaining the processing performance comprises:
determining a measure of a time taken to perform the at least a portion of the operations, or of a number of operations performed per unit time, using the selected first set of computing resources and the second set of computing resources; and
determining the processing performance based on the measure.
15. The apparatus of claim 9, wherein:
allocating one of the second group of groups for processing the computing task based on the processing performance includes: allocating the second set of computing resources for processing the computing task in response to a first processing performance of processing the computing task with the first set of computing resources being not higher than a second processing performance of processing the computing task with the second set of computing resources.
16. The apparatus of claim 9, wherein the plurality of computing resources are a plurality of graphics processing units, and wherein the computing task is a neural network model-based computing task.
17. A computer readable storage medium having stored thereon computer readable instructions which, when executed, cause a computer to perform the method of any of claims 1 to 8.