WO2022047632A1 - Data computation method and device

Info

Publication number: WO2022047632A1
Application number: PCT/CN2020/112901
Authority: WO
Inventors: 石达清
Original assignee: 华为技术有限公司
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2022-03-10
Also published as: CN115989478A

Abstract

The present application provides a data computation method and device, relating to the technical field of communications, and capable of reducing delays in MPI operations and improving MPI execution efficiency. The method is applicable to a network card coupled to a memory via a bus, and comprises: receiving a first packet, the first packet comprising computation instruction information and first data; determining, according to the computation instruction information, that data computation in an MPI operation needs to be performed on the first data; acquiring second data from the memory, the second data being local data for the data computation in the MPI operation; and performing data computation on the first data and the second data, so as to obtain a first computation result.

Description

A data computing method and device

Technical field

The present application relates to the field of communication technologies, and in particular, to a data computing method and device.

Background

With the rapid development of scenarios such as high performance computing (HPC) and artificial intelligence (AI), the execution efficiency of message passing interface (MPI) collective communication functions is becoming increasingly important. The MPI collective communication functions include reduce-type functions such as MPI_reduce and MPI_allreduce. MPI reduce functions account for a large proportion, about 40%, of MPI application scenarios, so improving their execution efficiency brings a significant gain in the operating efficiency of MPI applications. An MPI reduce function can be decomposed into three parts: calculation, synchronization, and communication. The present application optimizes the calculation part of the MPI reduce function.

In the prior art, the computing part of an MPI operation is usually realized by offloading send queue (SQ) tasks to the external network card of the server. Specifically, as shown in FIG. 1, the method includes: S1. The network card writes data A1 from a network message into a dynamic random access memory (DRAM); S2. The network card schedules the SQs, and when the task of the selected SQ is to perform a reduce operation with a specified operation on data A1, the network card reads data A1 of the message from the DRAM; S3. The network card reads local data A2 from the DRAM; S4. The network card completes the operation on data A1 and data A2, and writes the operation result into the DRAM.

In the above method, the network card can perform the data operation only when the SQ corresponding to the computing task is scheduled. The larger the network scale, the more SQs the network card has to schedule, and the larger the resulting delay of the MPI operation.

Summary of the invention

The present application provides a data operation method and device, which are used to reduce the delay of MPI operation and improve the execution efficiency of MPI operation.

To achieve the above object, the embodiments of the present application adopt the following technical solutions:

A first aspect provides a data operation method in an MPI operation. The method is applied to a network card coupled to a memory through a bus, and includes: receiving a first message, where the first message may be sent by another server in the network that performs the message passing interface (MPI) operation, and the first message includes operation indication information and first data; determining, according to the operation indication information, that a data operation of the MPI operation needs to be performed on the first data; obtaining second data from the memory, where the second data is local data for the data operation in the MPI operation; completing the data operation (for example, addition or multiplication) of the MPI operation on the first data and the second data to obtain a first operation result; and writing the first operation result into the memory.

In the above technical solution, when the network card receives the first message and obtains the operation indication information and the first data from it, the network card can determine, according to the operation indication information, that the data operation of the MPI operation needs to be performed on the first data. The network card then directly obtains the local data for that data operation, that is, the second data, from the memory, and completes the data operation on the first data and the second data to obtain the first operation result. Compared with the prior art, the network card does not need to write the first data into the memory first; it obtains the second data as soon as it has the first data, so the first data and the second data are computed in the data path. This reduces the number of memory reads and writes, thereby reducing the delay of the MPI operation and improving the execution efficiency of the MPI operation.
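The saving in memory accesses can be illustrated with a small counting model. The sketch below is not the network card's implementation; the class `CountingMemory`, the function names, and the addresses `"A2"` and `"staging"` are hypothetical, and writing the result back to the location of the local data is an assumption made only to keep the two paths comparable.

```python
class CountingMemory:
    """Dict-backed stand-in for DRAM that counts reads and writes."""

    def __init__(self, initial):
        self.cells = dict(initial)
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.cells[addr]

    def write(self, addr, value):
        self.writes += 1
        self.cells[addr] = value


def prior_art_reduce(mem, first_data, local_addr, staging_addr):
    """Prior-art path of FIG. 1: the packet data is staged in DRAM first."""
    mem.write(staging_addr, first_data)   # S1: write A1 from the message
    a1 = mem.read(staging_addr)           # S2: read A1 back once the SQ is scheduled
    a2 = mem.read(local_addr)             # S3: read local data A2
    mem.write(local_addr, a1 + a2)        # S4: write the result (address is an assumption)


def in_path_reduce(mem, first_data, local_addr):
    """Proposed path: the first data never touches DRAM before the operation."""
    a2 = mem.read(local_addr)             # obtain the second data
    mem.write(local_addr, first_data + a2)  # write the first operation result


old = CountingMemory({"A2": 5})
new = CountingMemory({"A2": 5})
prior_art_reduce(old, 3, "A2", "staging")
in_path_reduce(new, 3, "A2")
print(old.reads + old.writes, new.reads + new.writes)  # prints "4 2"
```

Under these assumptions both paths leave the same result in memory, but the in-path variant needs two memory accesses per packet instead of four.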

In a possible implementation manner of the first aspect, the method includes: receiving a first packet, where the first packet includes first data; when the first packet carries operation indication information, determining that a data operation of a message passing interface (MPI) operation needs to be performed on the first data, and obtaining second data from the memory, where the second data is local data for the data operation of the MPI operation; and completing the data operation of the MPI operation on the first data and the second data to obtain a first operation result. It should be understood that the solution may further include: writing the first operation result into the memory.

It should be understood that the operation indication information may be carried in the packet header of the first packet. For example, the header of an existing packet is extended, and the operation indication information is carried in the resulting extension header.

In a possible implementation manner of the first aspect, the first packet further includes a storage address of the second data, and acquiring the second data from the memory includes: reading the second data from the memory according to the storage address. Further, writing the first operation result into the memory includes: according to the storage address of the second data, storing the first operation result in the storage location where the second data is located, so as to overwrite the second data. In the above possible implementation manner, useless data is prevented from occupying storage space in the memory, thereby improving memory utilization.

In a possible implementation manner of the first aspect, a network card, a memory and a bus are integrated in a system-on-a-chip SoC. In the above possible implementation manner, by integrating the network card, the memory and the bus in the SoC, the end-to-end transmission delay can be reduced, and the execution efficiency of the data operation in the MPI operation can be further improved.

In a possible implementation manner of the first aspect, the operation indication information includes an operation type and a data type. In the above possible implementation manner, the operation type and the data type make it possible to determine that the data operation of the MPI operation needs to be performed on the first data. When the network card obtains this information, it therefore does not need to write the first data into the memory; instead, it directly obtains the second data and performs the data operation on the first data and the second data, thereby reducing the number of memory reads and writes, reducing the delay of the MPI operation, and improving the execution efficiency of the MPI operation.

In a possible implementation manner of the first aspect, the operation indication information is carried in a packet header of the first packet. In the above possible implementation manners, a simple and effective manner of carrying the operation indication information is provided.

In a possible implementation manner of the first aspect, the MPI operation includes: an MPI_reduce operation, or an MPI_allreduce operation. In the above possible implementation manners, the delay of the MPI_reduce operation or the MPI_allreduce operation can be reduced, thereby improving the execution efficiency of the MPI_reduce operation or the MPI_allreduce operation.

In a possible implementation manner of the first aspect, the network card is further coupled to the processor through a bus, and the method further includes: sending notification information to the processor, where the notification information is used to indicate that the data operation is completed. In the above possible implementation manner, by sending notification information to the processor, the state of the MPI operation recorded by the processor can be consistent with the state of the actual MPI operation, thereby ensuring the orderly and efficient execution of the MPI operation.

In a second aspect, a data computing device is provided. The device is a network card or a chip built into the network card, and the network card is coupled to a memory through a bus. The device includes: a receiving unit, configured to receive a first message from a network; a processing unit, configured to parse the first message to obtain operation indication information and first data included in the first message, where the operation indication information is used to indicate that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, where the second data is local data for the data operation in the MPI operation. The processing unit is further configured to complete the data operation on the first data and the second data to obtain a first operation result. Further, the device includes a writing unit, configured to write the first operation result into the memory.

In a possible implementation manner of the second aspect, the apparatus includes: a receiving unit, configured to receive a first packet, where the first packet includes first data; a processing unit, configured to determine, when the first packet carries operation indication information, that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, where the second data is local data for the data operation of the MPI operation. The processing unit is further configured to complete the data operation of the MPI operation on the first data and the second data to obtain a first operation result. It should be understood that the solution may further include: writing the first operation result into the memory. It should also be understood that the operation indication information may be carried in the packet header of the first packet. For example, the header of an existing packet is extended, and the operation indication information is carried in the resulting extension header.

In a possible implementation manner of the second aspect, the first packet further includes a storage address of the second data, and the obtaining unit is further configured to read the second data from the memory according to the storage address. Further, the writing unit is further configured to store, according to the storage address of the second data, the first operation result in the storage location where the second data is located, so as to overwrite the second data.

In a possible implementation manner of the second aspect, the network card, the memory and the bus are integrated in a system-on-a-chip SoC.

In a possible implementation manner of the second aspect, the operation indication information includes: an operation type and a data type.

In a possible implementation manner of the second aspect, the operation indication information is carried in a packet header of the first packet.

In a possible implementation manner of the second aspect, the MPI operation corresponding to the first data includes: an MPI_reduce operation or an MPI_allreduce operation.

In a possible implementation manner of the second aspect, the network card is further coupled to the processor through a bus, and the apparatus further includes: a sending unit, configured to send notification information to the processor, where the notification information is used to indicate that the data operation is completed.

In a third aspect, a data computing device is provided. The device is a network card or a chip built into the network card, and the network card is coupled to a memory through a bus. Code and data are stored in the memory, and the network card runs the code in the memory so that the device executes the data operation method provided by the first aspect or any possible implementation manner of the first aspect.

In another aspect of the present application, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the data operation method provided by the first aspect or any possible implementation manner of the first aspect.

In another aspect of the present application, a computer program product is provided. When the computer program product runs on a device, the device is caused to execute the data operation method provided by the first aspect or any possible implementation manner of the first aspect.

It can be understood that any data computing device, computer storage medium, or computer program product provided above is used to execute the corresponding method provided above. Therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of the corresponding method provided above, which are not repeated here.

Description of drawings

FIG. 1 is a schematic diagram of the execution of an MPI operation;

FIG. 2 is a schematic diagram of an MPI operation provided by an embodiment of the present application;

FIG. 3a is a schematic structural diagram of a server provided by an embodiment of the present application;

FIG. 3b is a schematic structural diagram of another server provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a data operation method provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of another data operation method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of an MPI operation provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data computing device according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of another data computing apparatus provided by an embodiment of the present application.

Detailed description

In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships are possible. For example, A and/or B may indicate that A exists alone, that both A and B exist, or that B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c may be single or multiple. In addition, in the embodiments of the present application, words such as "first" and "second" are used to distinguish between identical or similar items whose functions and effects are basically the same. For example, the first threshold and the second threshold are used only to distinguish different thresholds, and do not limit their order. Those skilled in the art can understand that words such as "first" and "second" do not limit quantity or execution order.

It should be noted that, in this application, words such as "exemplary" or "for example" are used to represent examples, illustrations, or explanations. Any embodiment or design described in this application as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the related concepts in a specific manner.

Before introducing the embodiments of the present application, the related technical terms involved in the embodiments of the present application are first introduced and explained.

The message passing interface (MPI) is a message passing programming interface and provides a multi-language function library that implements a series of MPI interfaces. The MPI standard defines a set of functions that enable applications to send messages from one MPI process to another.

MPI collective communication refers to communication with different functions implemented through MPI, and the functions that implement it may be referred to as MPI collective communication functions. The MPI collective communication functions include reduce-type functions such as MPI_reduce and MPI_allreduce, both of which are collective communication functions defined by the standard. The difference between the two is that with MPI_reduce only one process node in the communication domain obtains the final calculation result, whereas with MPI_allreduce every process node in the communication domain obtains the final calculation result.
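The difference between the two reduce functions can be modelled without any MPI library. The sketch below only mimics the result that each process node ends up with; the function names `simulate_reduce` and `simulate_allreduce` are hypothetical and are not MPI APIs.

```python
from functools import reduce as fold
import operator


def simulate_reduce(values, root, op=operator.add):
    """Per-node results of a reduce: only the root node holds the final value."""
    total = fold(op, values)
    return [total if rank == root else None for rank in range(len(values))]


def simulate_allreduce(values, op=operator.add):
    """Per-node results of an allreduce: every node holds the final value."""
    total = fold(op, values)
    return [total] * len(values)


values = [1, 2, 3, 4]  # one input value per process node
print(simulate_reduce(values, root=0))  # [10, None, None, None]
print(simulate_allreduce(values))       # [10, 10, 10, 10]
```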

An MPI collective communication can also be referred to as an MPI operation and can usually be decomposed into three parts: synchronization, calculation, and communication. Synchronization may refer to synchronization and information exchange between different computing processes, or synchronization between different steps and tasks in the same process; calculation may refer to performing a specified operation on the input data in each process; and communication may refer to data transfer between different nodes in the communication domain. For convenience of description, MPI collective communication is collectively referred to as MPI operation herein.

Exemplarily, as shown in FIG. 2, assume that an MPI_allreduce operation is performed in a communication domain with a network scale of 8 nodes (denoted P0 to P7) using a recursive doubling algorithm. Each node in the communication domain then needs only three rounds of send and receive communication, and the MPI_allreduce operation is complete when all nodes have finished these three rounds. The specific implementation may include the following steps S01 to S03.

S01. Nodes at a distance of 1 exchange 1/8 of the data with each other and perform a reduction operation, so that each node obtains the reduction result of 1/4 of the data. For example, as shown in Table 1 below, data A and B are exchanged between P0 and P1, data C and D between P2 and P3, data E and F between P4 and P5, and data G and H between P6 and P7. Each node then performs the addition, so that P0 and P1 obtain A+B, P2 and P3 obtain C+D, P4 and P5 obtain E+F, and P6 and P7 obtain G+H.

S02. Nodes at a distance of 2 exchange 1/4 of the data with each other and perform a reduction operation, so that each node obtains the reduction result of 1/2 of the data. For example, as shown in Table 1 below, data A+B and C+D are exchanged between P0 and P2 and between P1 and P3, and data E+F and G+H are exchanged between P4 and P6 and between P5 and P7. Each node then performs the addition, so that P0 to P3 all obtain A+B+C+D, and P4 to P7 all obtain E+F+G+H.

S03. Nodes at a distance of 4 exchange 1/2 of the data with each other and perform a reduction operation, so that each node obtains the reduction result of all the data. For example, as shown in Table 1 below, data A+B+C+D and E+F+G+H are exchanged between P0 and P4, between P1 and P5, between P2 and P6, and between P3 and P7. Each node then performs the addition, so that P0 to P7 all obtain A+B+C+D+E+F+G+H.
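Steps S01 to S03 can be sketched as a single-process simulation. The code below is a minimal model, assuming a power-of-two node count and addition as the reduction; the function name `recursive_doubling_allreduce` is hypothetical, and the exchange between nodes is modelled by reading the partner's value from a list.

```python
def recursive_doubling_allreduce(values):
    """Simulate recursive doubling; returns the final values and the per-round history."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two node count")
    current = list(values)
    history = []
    distance = 1
    while distance < n:
        # In each round, node `rank` exchanges data with node `rank XOR distance`
        # (distance 1, 2, 4, ...) and both nodes add what they receive.
        current = [current[rank] + current[rank ^ distance] for rank in range(n)]
        history.append(list(current))
        distance *= 2
    return current, history


# P0..P7 hold the values 1..8 in place of the data A..H.
final, history = recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(len(history))  # 3 rounds for 8 nodes
print(history[0])    # [3, 3, 7, 7, 11, 11, 15, 15]
print(final)         # [36, 36, 36, 36, 36, 36, 36, 36]
```

The three history entries correspond to the states after S01, S02, and S03: pairs of nodes agree after the first round, groups of four after the second, and all eight after the third.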

Table 1

FIG. 3a and FIG. 3b are schematic structural diagrams of two exemplary servers provided in the embodiments of this application. Each server may include a memory 301, a processor 302, a network card 303, and a bus 304. The memory 301, the processor 302, and the network card 303 are interconnected through the bus 304.

The memory 301 can be used to store data, software programs, and modules, and mainly includes a program storage area and a data storage area. The program storage area can store an operating system, an application program required by at least one function, and the like; the data storage area can store data created during use of the server, and the like. For example, the operating system may include a Linux operating system, a Unix operating system, or a Windows operating system; the application program (APP) required by the at least one function may include an artificial intelligence (AI) related APP, a high performance computing (HPC) related APP, a deep learning related APP, or a computer graphics (CG) related APP. In a possible example, the memory 301 includes but is not limited to static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), or high-speed random access memory. Further, the memory 301 may also include other non-volatile memories, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The processor 302 is used to control and manage the operation of the server, for example, by running or executing the software programs and/or modules stored in the memory 301 and calling the data stored in the memory 301 to perform various functions of the server and process data. In a possible example, the processor 302 includes, but is not limited to, a central processing unit (CPU), a network processing unit (NPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a logic circuit, or any combination thereof. The processor 302 may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with this disclosure. The processor 302 may also be a combination that implements computing functions, such as a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.

The network card 303 may be used to implement communication between the server and an external network. For example, the network card 303 may be a smart network interface card (smart NIC). In some feasible embodiments, the network card 303 may support remote direct memory access (RDMA). For example, the network card 303 receives packets from the network through RDMA and sends packets to other network devices in the network through RDMA. The network card 303 may store a received packet in the memory 301 by means of RDMA.

The bus 304 may include an extended industry standard architecture (EISA) bus, and/or a peripheral component interconnect express (PCIe) bus, or the like. The bus 304 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in Figures 3a and 3b, but it does not mean that there is only one bus or one type of bus.

In this embodiment of the present application, as shown in FIG. 3a, the memory 301, the processor 302, and the network card 303 may all be integrated in a system on chip (SoC) of the server. Alternatively, as shown in FIG. 3b, the memory 301 and the processor 302 may be integrated in the SoC of the server, and the network card 303 is an external network card connected to the SoC through an external bus.

FIG. 4 is a schematic flowchart of a data computing method provided by an embodiment of the present application. The method may be executed by a network card in the server provided above, and the method includes the following steps.

S401: The network card receives a first packet, where the first packet includes operation indication information and first data, and determines, according to the operation indication information, that a data operation of an MPI operation needs to be performed on the first data.

The server may be any server in a communication domain including multiple servers, and the multiple servers may be jointly used to perform MPI operations. The multiple servers may send packets to each other through a network (Ethernet), and the packets may include data used for data operations in the MPI operation. For example, the multiple servers include a first server and a second server; the server may be the first server, which may receive the first packet sent by the second server and may also send a second packet to the second server. The first packet and the second packet may have the same format but carry different data; the following description takes the first packet as an example.

In addition, the first packet may be a packet based on the RDMA over Converged Ethernet (RoCE) protocol. The first packet may include a packet header and a payload; the operation indication information may be carried in the packet header of the first packet, and the first data may be carried in the payload of the first packet. Exemplarily, an extension header is added to the existing RoCE protocol packet, and the operation indication information is carried in the extension header. For example, a 4-bit extension header reduce_eth is added to the standard RDMA transport header field, in which 1 bit can be used to indicate the specific data type reduce_type (for example, the data type can include int8, int16, int32, uint8, uint16, uint32, FP16, or FP32), another 1 bit is used to indicate the specific operation type reduce_code (for example, the operation type includes max, min, or sum), and the remaining 2 bits can be reserved. Correspondingly, when the operation indication information is carried in the extension header, a corresponding work request (WR) type is also added to the RDMA-based interface between software and hardware, and is used for reading or writing the extension header.
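As an illustration of how such an extension header might be encoded and parsed, the sketch below packs reduce_type and reduce_code into a small fixed-size header. The field widths are an assumption: the text gives them in bits, but a single bit cannot distinguish the eight listed data types, so this sketch uses one byte for each field and two reserved bytes purely for illustration. The numeric codes, the class names, and the helper functions are likewise hypothetical and are not part of the RoCE protocol.

```python
import struct
from enum import IntEnum


class ReduceType(IntEnum):
    """Data types named in the text; the numeric codes are hypothetical."""
    INT8 = 0
    INT16 = 1
    INT32 = 2
    UINT8 = 3
    UINT16 = 4
    UINT32 = 5
    FP16 = 6
    FP32 = 7


class ReduceCode(IntEnum):
    """Operation types named in the text; the numeric codes are hypothetical."""
    MAX = 0
    MIN = 1
    SUM = 2


# Assumed layout in network byte order:
# reduce_type (1 byte), reduce_code (1 byte), reserved (2 bytes).
REDUCE_ETH = struct.Struct("!BBH")


def pack_reduce_eth(reduce_type, reduce_code):
    """Build the extension header with the reserved field set to zero."""
    return REDUCE_ETH.pack(reduce_type, reduce_code, 0)


def unpack_reduce_eth(header):
    """Parse the extension header and return (data type, operation type)."""
    reduce_type, reduce_code, _reserved = REDUCE_ETH.unpack(header)
    return ReduceType(reduce_type), ReduceCode(reduce_code)


header = pack_reduce_eth(ReduceType.FP32, ReduceCode.SUM)
data_type, op_type = unpack_reduce_eth(header)
print(len(header), data_type.name, op_type.name)  # 4 FP32 SUM
```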

Furthermore, the MPI operation corresponding to the first data may be any MPI operation including a data operation, for example, the MPI operation may be an MPI_reduce operation, an MPI_allreduce operation, or the like.

Optionally, the operation indication information may include the operation type and data type of the data operation in the MPI operation. For example, the operation type may be addition, subtraction, or multiplication, and the data type may be half-precision floating-point numbers, single-precision floating-point numbers, double-precision floating-point numbers, or integers. For the specific operation types and data types, and for further description of the above MPI operations, reference may be made to the related art; details are not described in this embodiment of the present application.

Specifically, when the server is performing an MPI operation, a processor (for example, a CPU) in the server may send a data operation task to the network card. Subsequently, the network card of the server may receive the first packet sent by another server in the network and parse the first packet to obtain the operation indication information and the first data included in the first packet. When the network card obtains the operation indication information by parsing, the network card may determine, according to the operation indication information, that the data operation in the MPI operation needs to be performed on the first data. For example, if the operation indication information includes the operation type and data type of the MPI_reduce operation, the network card may determine, according to the operation type and the data type, that the data operation of the MPI_reduce operation needs to be performed on the first data.

S402: The network card obtains the second data from the memory.

The memory may include main memory, which may be a dynamic random access memory (DRAM). The second data may be local data stored in the DRAM for the data operation of the MPI operation. The data type of the second data may be the same as the data type of the first data. For example, the data type of the first data and the data type of the second data are both the data type indicated by the operation indication information in the first packet.

In addition, the storage address corresponding to the second data may be carried in the first packet. Specifically, after the network card receives and parses the first packet, the network card can obtain the storage address of the second data from the first packet, and can then obtain the second data from the memory of the server based on that storage address.

S403: The network card completes the data operation of the first data and the second data, and obtains a first operation result.

When the network card obtains the first data and the second data, the network card may perform a data operation on the first data and the second data based on the operation indication information to obtain a first operation result. For example, if the operation type indicated by the operation indication information is addition and the data type is a floating-point number, the network card can add the first data and the second data based on the addition rule for floating-point numbers to obtain the first operation result. Alternatively, if the operation type indicated by the operation indication information is multiplication and the data type is a floating-point number, the network card may multiply the first data and the second data based on the multiplication rule for floating-point numbers to obtain the first operation result.
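The selection of an operation by operation type and data type can be sketched in software as follows. This is only an illustrative model of S403, not the network card's hardware logic; the tables `FORMATS` and `OPERATIONS`, the string keys, the little-endian layout, and the function `reduce_payloads` are assumptions introduced for the example.

```python
import operator
import struct

# Hypothetical mapping from data type names to struct format characters.
FORMATS = {"int32": "i", "fp16": "e", "fp32": "f"}
# Hypothetical mapping from operation type names to element-wise operations.
OPERATIONS = {"sum": operator.add, "prod": operator.mul, "max": max, "min": min}


def reduce_payloads(first, second, data_type, op_type):
    """Apply the indicated operation element-wise to two raw byte payloads."""
    fmt = FORMATS[data_type]
    size = struct.calcsize("<" + fmt)
    if len(first) != len(second) or len(first) % size:
        raise ValueError("payloads must have equal, element-aligned lengths")
    layout = f"<{len(first) // size}{fmt}"   # little-endian is an assumption
    a = struct.unpack(layout, first)         # first data, from the packet
    b = struct.unpack(layout, second)        # second data, from local memory
    op = OPERATIONS[op_type]
    return struct.pack(layout, *(op(x, y) for x, y in zip(a, b)))


first = struct.pack("<3f", 1.0, 2.0, 3.0)
second = struct.pack("<3f", 0.5, 0.5, 0.5)
print(struct.unpack("<3f", reduce_payloads(first, second, "fp32", "sum")))
# (1.5, 2.5, 3.5)
```

Integer overflow and floating-point rounding are not handled here; a real implementation would have to define them for each data type.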

Further, as shown in FIG. 5, after S403, the method further includes: S404.

S404: The network card stores the first operation result in the memory.

Specifically, when the network card obtains the first operation result, the network card may store the first operation result in the memory of the server, for example, in the DRAM included in the memory. Optionally, the storage address of the first operation result can be the same as the storage address of the second data. That is, the network card can store the first operation result in the storage location of the second data according to the storage address of the second data, so as to overwrite the second data.

Optionally, as shown in FIG. 5, after S404, the method further includes: S405.

S405: The network card sends notification information to the processor, where the notification information is used to indicate that the data operation is completed.

Specifically, after the network card stores the first operation result in the memory, the network card may send notification information to the processor, where the notification information is used to indicate that the data operation is completed. When the processor receives the notification information, the processor can determine that the data operation is completed, and can then synchronize the relevant state information of the MPI operation to ensure that the actual state of the MPI operation is consistent with the recorded state. Optionally, the processor may also send the next task to the network card, so that the network card continues to perform the corresponding task.

Further, the processor may divide the data operation in the MPI operation into multiple data operation tasks and send them to the network card one at a time in order; that is, after the previous data operation task is completed, the next data operation task is sent to the network card, until all of the data operation tasks are completed. For each of the multiple data operation tasks, the network card may execute the method provided above.

For example, for the MPI operation shown in FIG. 2, the data operation in the MPI operation may include three data operation tasks. Taking server P0 as an example, the network card can perform the MPI operation by performing three data operations in succession. Specifically, first, the processor sends the data operation task of computing A+B to the network card, and the network card performs the A+B operation according to the above S401-S405 and reports it; second, the processor sends the data operation task of computing A+B+C+D to the network card, and the network card performs the A+B+C+D operation according to the above S401-S405 and reports it; finally, the processor sends the data operation task of computing A+B+C+D+E+F+G+H to the network card, and the network card performs the A+B+C+D+E+F+G+H operation according to the above S401-S405 and reports it.
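The three successive tasks on server P0 can be sketched as a toy model. The class `NicModel`, the function `run_tasks_in_order`, and the numeric values assigned to A through H are hypothetical. The sketch also assumes that P0 holds A locally and receives B, then C+D, then E+F+G+H as partial results from other servers; this grouping is an assumption about FIG. 2 and is not stated in the text above.

```python
from collections import deque


class NicModel:
    """Toy model of a network card that reduces into local memory."""

    def __init__(self, local_value: float):
        self.memory = local_value      # second data (local partial sum)
        self.notifications = deque()   # completion notices for the processor

    def run_task(self, incoming: float) -> None:
        # S401-S404: combine the received data with the local data and store it back.
        self.memory = incoming + self.memory
        # S405: tell the processor that this data operation is complete.
        self.notifications.append(self.memory)


def run_tasks_in_order(nic: NicModel, incoming_values: list) -> list:
    """Issue one task at a time, consuming each notification before the next task."""
    reported = []
    for value in incoming_values:
        nic.run_task(value)
        reported.append(nic.notifications.popleft())  # processor sees completion
    return reported


# Assumed sample values: A..H = 1..8.
A, B, C, D, E, F, G, H = range(1, 9)
partials = run_tasks_in_order(NicModel(A), [B, C + D, E + F + G + H])
```

Each entry of `partials` corresponds to one reported task: A+B, then A+B+C+D, then A+B+C+D+E+F+G+H.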

Exemplarily, as shown in FIG. 6, the following description takes as an example the case where the processor, memory, network card and bus in the server are all integrated in the SoC of the server, the memory is a DRAM, and the processor is a CPU. In the data operation method provided in the embodiment of the present application, from the time when the network card receives the first packet to the time when the first operation result is stored in the memory, the network card only needs to perform one read operation and one write operation; that is, the data operation can be completed by reading the second data from the memory and writing the first operation result into the memory. As shown in FIG. 6, step 1 means that the network card reads the data in the DDR of the server into the network card, where it waits to participate in the operation with the data from the network; step 2 means that the network card receives the data from the network, identifies through the relevant information in the packet header that a reduce calculation needs to be performed, completes the calculation while processing the receive queue work queue element (RQ_WQE), and writes the calculation result back to the memory of the server; step 3 means that the software reads the completion queue element (CQE) by interrupt or polling. Specifically, the method includes: the network card receives and parses the first packet from the network to obtain the operation indication information and the first data, where the operation indication information is carried in a packet header of the first packet; the network card determines, according to the operation indication information, that the data operation in the MPI operation needs to be performed on the first data; the network card reads the second data A1 from the memory of the server into the network card, completes the calculation of the first data and the second data while processing the RQ_WQE, and writes the calculation result back to the memory of the server; after that, the processor reads the CQE by interrupt or polling, that is, the processor receives the notification information sent by the network card, so as to complete the information synchronization of the MPI operation. In FIG. 6, the second data is represented as A1, and the first operation result is represented as R1. It should be understood that after an RDMA packet is received from the network, a local RQ_WQE needs to be consumed, and the RQ_WQE indicates a piece of local DDR space. After the first packet is received, it is first determined, based on the operation indication information carried in the extended packet header, that the MPI operation needs to be performed on the first data. Therefore, after the first data is obtained, an on-path operation is first performed on the first data, and after the calculation result is obtained, the calculation result is written into the memory space indicated by the RQ_WQE. The corresponding calculation result can then be obtained from that memory space.
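The flow of FIG. 6 can be approximated by the following sketch, which counts memory accesses to show that one read and one write suffice. The names `Memory` and `handle_packet`, the use of a plain address in place of an RQ_WQE, and the use of a `deque` as a completion queue are simplifying assumptions; an actual RDMA network card processes these structures in hardware.

```python
import struct
from collections import deque


class Memory:
    """Byte-addressable memory that counts read and write accesses."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.reads = 0
        self.writes = 0

    def read(self, address: int, length: int) -> bytes:
        self.reads += 1
        return bytes(self.data[address:address + length])

    def write(self, address: int, payload: bytes) -> None:
        self.writes += 1
        self.data[address:address + len(payload)] = payload


def handle_packet(memory, completion_queue, wqe_address, needs_reduce, first_data):
    """Process one received packet against the buffer named by an RQ_WQE."""
    if needs_reduce:
        # Read the local second data and add it to the first data on the path.
        second = memory.read(wqe_address, len(first_data))
        count = len(first_data) // 4  # float32 elements assumed
        a = struct.unpack(f"<{count}f", first_data)
        b = struct.unpack(f"<{count}f", second)
        first_data = struct.pack(f"<{count}f", *(x + y for x, y in zip(a, b)))
    memory.write(wqe_address, first_data)   # write the result back
    completion_queue.append(wqe_address)    # CQE for the processor to poll


ddr = Memory(64)
ddr.data[16:24] = struct.pack("<2f", 10.0, 20.0)  # A1, preloaded without counting
cq = deque()
handle_packet(ddr, cq, 16, True, struct.pack("<2f", 1.0, 2.0))
r1 = struct.unpack_from("<2f", ddr.data, 16)      # R1, read back for inspection
```

After the call, the counters `ddr.reads` and `ddr.writes` each hold 1, and the completion queue holds a single entry for the processor to consume.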

In the above execution process, the CPU does not perceive the calculation process at all and only handles the interrupt reported after the calculation is completed, which greatly reduces the operating system (OS) noise of the CPU and improves the execution efficiency of the CPU. The whole process needs only one DDR read and one DDR write, so the total delay consists of one DDR read delay, the data calculation delay in the RDMA network card, and one DDR write delay.

In the embodiment of the present application, when the network card receives the first packet and obtains the operation indication information and the first data in the first packet, the network card can directly obtain the second data from the memory and, according to the operation indication information, complete the data operation of the first data and the second data to obtain the first operation result. Compared with the prior art, the network card does not need to write the first data into the memory; instead, it obtains the second data directly once the first data is obtained, that is, it performs on-path calculation on the first data and the second data. This reduces the number of memory reads and writes, reduces the delay of the MPI operation, and improves the execution efficiency of the MPI. In addition, when the network card, processor and memory in the server are all integrated in the SoC of the server, the end-to-end transmission delay can also be reduced, and the execution efficiency of the MPI operation can be further improved.
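The reduction in memory accesses can be made concrete with a simple counting model. The on-path figures follow the description above. The baseline figures are an assumption about a store-then-compute flow, in which the first data is written to memory on arrival, both operands are read back for the computation, and the result is written; the text above does not give these numbers. The names `AccessCount`, `on_path_accesses` and `store_then_compute_accesses` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCount:
    reads: int
    writes: int

    @property
    def total(self) -> int:
        return self.reads + self.writes


def on_path_accesses() -> AccessCount:
    # Read the second data once, write the first operation result once.
    return AccessCount(reads=1, writes=1)


def store_then_compute_accesses() -> AccessCount:
    # Assumed baseline: write the first data on arrival, read both operands
    # back for the computation, then write the result.
    return AccessCount(reads=2, writes=2)


saved = store_then_compute_accesses().total - on_path_accesses().total
```

Under these assumptions the on-path approach saves two memory accesses per data operation.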

The above mainly introduces the data computing method in the MPI operation provided by the embodiments of the present application from the perspective of the server. It can be understood that, in order to realize the above-mentioned functions, the server includes corresponding hardware structures and/or software modules for executing each function. Those skilled in the art should easily realize that the present application can be implemented in hardware or a combination of hardware and computer software with reference to the network elements and algorithm steps of each example described in the embodiments disclosed herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In this embodiment of the present application, the data computing device in the MPI operation can be divided into functional modules according to the above method examples. For example, each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module. The above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software functional modules. It should be noted that the division of modules in the embodiments of the present application is schematic and is only a logical function division; there may be other division manners in actual implementation.

In the case where each functional module is divided according to each function, FIG. 7 shows a possible schematic structural diagram of the data computing device involved in the above embodiment. The device is a network card or a chip built into the network card, and the network card is coupled to the memory through the bus. The device includes: a receiving unit 501, a processing unit 502 and an obtaining unit 503. The receiving unit 501 is used to support the device in receiving the first packet from the network; the processing unit 502 is used to support the device in parsing the first packet to obtain the operation indication information and the first data included in the first packet, where the operation indication information is used to indicate that the data operation in the message passing interface (MPI) operation needs to be performed on the first data; the obtaining unit 503 is used to support the device in obtaining the second data from the memory, where the second data is the local data of the data operation in the MPI operation; the processing unit 502 is further used to support the device in completing the data operation of the first data and the second data in the MPI operation to obtain the first operation result. Further, the device further includes: a writing unit 504 and a sending unit 505. The writing unit 504 is used to support the device in writing the first operation result into the memory; the sending unit 505 is used to support the device in sending notification information to the processor, where the notification information is used to indicate that the data operation is completed.

It should be noted that, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.

Based on hardware implementation, the processing unit 502 and the writing unit 504 in the present application may be part of the functions of the processor of the device, and the receiving unit 501, the obtaining unit 503 and the sending unit 505 may be the transceiver functions of the device. The transceiver may generally include a transmitter and a receiver, and a specific transceiver may also be referred to as a communication interface.

FIG. 8 shows another possible schematic structural diagram of the data computing device involved in the above embodiment. The device is a network card or a chip built into the network card, the network card is coupled to the memory through a bus, and the device includes: a processor 602 and a communication interface 603. The processor 602 is used to control and manage the actions of the device. For example, the processor 602 can be used to support the device in performing S401 to S405 in the above-mentioned embodiments through the communication interface 603, and/or other processes for the technology described herein. In addition, the device can also include a memory 601 and a bus 604. The processor 602, the communication interface 603 and the memory 601 are connected to each other through the bus 604; the communication interface 603 is used to support the device in communicating; the memory 601 is used to store the program code and data of the device.

The processor 602 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array, or another programmable logic device, transistor logic device, hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure. The processor may also be a combination that performs computing functions, such as a combination comprising one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The bus 604 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and the like. For convenience of representation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus.

In another embodiment of the present application, a readable storage medium is also provided, where computer-executable instructions are stored in the readable storage medium. When a device or a processor executes the computer-executable instructions, the steps of the network card in the method provided by the above method embodiments are performed. The aforementioned readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, an optical disk, or other media that can store program code.

In another embodiment of the present application, a computer program product is also provided. The computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium. At least one processor of a device can read the computer-executable instructions from the computer-readable storage medium, and the at least one processor executes the computer-executable instructions to cause the device to perform the steps of the network card in the method provided by the above method embodiments.

Finally, it should be noted that the above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or replacements within the technical scope disclosed in the present application should be covered by the protection scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

1. A data operation method, characterized in that it is applied to a network card, the network card is coupled to a memory through a bus, and the method comprises:

receiving a first message, where the first message includes operation indication information and first data;

determining, according to the operation indication information, that a data operation in a message passing interface (MPI) operation needs to be performed on the first data;

acquiring second data from the memory, where the second data is local data of the data operation in the MPI operation; and

completing the data operation of the first data and the second data in the MPI operation to obtain a first operation result.
2. The method according to claim 1, wherein the network card, the memory and the bus are integrated in a system-on-a-chip (SoC).
3. The method according to claim 1 or 2, wherein the operation indication information comprises: an operation type and a data type.
4. The method according to any one of claims 1-3, wherein the operation indication information is carried in a packet header of the first message.
5. The method according to any one of claims 1-4, wherein the MPI operation comprises: an MPI_reduce operation or an MPI_allreduce operation.
6. The method according to any one of claims 1-5, wherein the first message further includes a storage address of the second data, and the acquiring the second data from the memory includes:

acquiring the second data from the memory according to the storage address of the second data.
7. The method according to claim 6, wherein the method further comprises:

storing, according to the storage address of the second data, the first operation result in the storage location where the second data is located, so as to overwrite the second data.
8. The method according to any one of claims 1-7, wherein the network card is further coupled to a processor through the bus, and the method further comprises:

sending notification information to the processor, where the notification information is used to indicate that the data operation is completed.
9. A data computing device, characterized in that the device is a network card or a chip built into the network card, the network card is coupled to a memory through a bus, and the device comprises:

a receiving unit, configured to receive a first message, where the first message includes operation indication information and first data;

a processing unit, configured to determine, according to the operation indication information, that a data operation in a message passing interface (MPI) operation needs to be performed on the first data;

an acquisition unit, configured to acquire second data from the memory, where the second data is local data of the data operation in the MPI operation;

the processing unit is further configured to complete the data operation of the first data and the second data in the MPI operation to obtain a first operation result.
10. The device according to claim 9, wherein the network card, the memory and the bus are integrated in a system-on-a-chip (SoC).
11. The device according to claim 9 or 10, wherein the operation indication information comprises: an operation type and a data type.
12. The device according to any one of claims 9-11, wherein the operation indication information is carried in a packet header of the first message.
13. The device according to any one of claims 9-12, wherein the MPI operation comprises: an MPI_reduce operation or an MPI_allreduce operation.
14. The device according to any one of claims 9-13, wherein the first message further includes a storage address of the second data, and the acquisition unit is further configured to:

acquire the second data from the memory according to the storage address of the second data.
15. The device according to claim 14, wherein the device further comprises:

a writing unit, configured to store, according to the storage address of the second data, the first operation result in the storage location where the second data is located, so as to overwrite the second data.
16. The device according to any one of claims 9-15, wherein the network card is further coupled to a processor through the bus, and the device further comprises:

a sending unit, configured to send notification information to the processor, where the notification information is used to indicate that the data operation is completed.
17. A data computing device, characterized in that the device is a network card or a chip built into the network card, the network card is coupled to a memory through a bus, the memory stores code and data, and the network card runs the code in the memory so that the device executes the data operation method according to any one of claims 1-8.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the data operation method according to any one of claims 1-8.
19. A computer program product, characterized in that, when the computer program product runs on a device, the device is caused to execute the data operation method according to any one of claims 1-8.