WO2022047632A1 - Data computation method and device - Google Patents
Data computation method and device
- Publication number
- WO2022047632A1 (PCT/CN2020/112901)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- mpi
- network card
- memory
- indication information
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
Definitions
- the present application relates to the field of communication technologies, and in particular, to a data computing method and device.
- the MPI collective communication functions include reduce function types such as MPI_reduce and MPI_allreduce.
- the MPI reduce function accounts for a large proportion of MPI application scenarios, about 40%. Improving the execution efficiency of the MPI reduce function therefore improves the operating efficiency of the MPI application.
- the MPI reduce function can be decomposed into three parts, namely calculation, synchronization and communication. This application optimizes the calculation part of the MPI reduce function.
- the computing part in the MPI operation is usually realized by offloading the send queue (SQ) task of the external network card of the server.
- the method includes: S1. the network card writes data A1 from the network message into a dynamic random access memory (DRAM); S2. the network card schedules the SQs, and when the task of the selected SQ is to perform a reduce operation of the specified operation on data A1, the network card reads data A1 of the message from the DRAM; S3. the network card reads the local data A2 from the DRAM; S4. the network card completes the operation on data A1 and data A2, and writes the operation result into the DRAM.
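A minimal sketch of the S1 to S4 flow, assuming a toy DRAM model that only counts accesses. The class `CountingDram`, the function `baseline_reduce` and the address names are hypothetical and are not taken from the application; the sketch only illustrates that this flow touches the DRAM four times per reduce step.

```python
class CountingDram:
    """Toy DRAM model (hypothetical) that counts read and write accesses."""

    def __init__(self):
        self.cells = {}
        self.reads = 0
        self.writes = 0

    def write(self, addr, value):
        self.writes += 1
        self.cells[addr] = value

    def read(self, addr):
        self.reads += 1
        return self.cells[addr]


def baseline_reduce(dram, a1_from_network, staging_addr, local_addr, result_addr, op):
    """SQ-offload style flow: the received data is staged in DRAM first."""
    dram.write(staging_addr, a1_from_network)  # S1: store A1 from the message
    a1 = dram.read(staging_addr)               # S2: read A1 back once the SQ is scheduled
    a2 = dram.read(local_addr)                 # S3: read the local data A2
    result = op(a1, a2)                        # S4: compute ...
    dram.write(result_addr, result)            # ... and write the result
    return result


baseline_dram = CountingDram()
baseline_dram.cells["local"] = 5  # preload A2 without counting it as traffic
baseline_result = baseline_reduce(
    baseline_dram, 3, "staging", "local", "result", lambda x, y: x + y
)
```

In this model one reduce step costs two DRAM reads and two DRAM writes, in addition to whatever time the task waits for its SQ to be scheduled.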
- the network card can only perform the data operation when it is scheduled to the SQ corresponding to the computing task.
- when the number of SQs corresponding to the network card is larger, the scheduling delay is also larger, resulting in a larger delay of the MPI operation.
- the present application provides a data operation method and device, which are used to reduce the delay of MPI operation and improve the execution efficiency of MPI operation.
- a first aspect provides a data operation method in an MPI operation, which is applied to a network card, where the network card is coupled to a memory through a bus, and the method includes: receiving a first message, where the first message may be a message from the network.
- the first message includes operation indication information and first data; according to the operation indication information, it is determined that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; second data is obtained from the memory, where the second data is the local data of the data operation in the MPI operation; the data operation (for example, addition or multiplication) of the first data and the second data in the MPI operation is completed to obtain a first operation result; and the first operation result is written into the memory.
- when the network card receives the first message and obtains the operation indication information and the first data in it, the network card can determine, according to the operation indication information, that a data operation of the MPI operation needs to be performed on the first data. It therefore directly obtains the local data of that data operation, that is, the second data, from the memory, and completes the data operation of the first data and the second data to obtain the first operation result.
- the network card does not need to write the first data into the memory, but directly obtains the second data when the first data is obtained, that is, the first data and the second data are computed on-path. This reduces the number of reads and writes of the memory, thereby reducing the delay of the MPI operation and improving the execution efficiency of the MPI operation.
- the method includes: receiving a first packet, where the first packet includes first data; when the first packet carries operation indication information, determining that a data operation of a message passing interface (MPI) operation needs to be performed on the first data, and obtaining second data from the memory, where the second data is the local data of the MPI operation; and completing the MPI operation on the first data and the second data to obtain a first operation result.
- the solution may further include: writing the first operation result into the memory.
- the operation indication information may be carried in the packet header of the first packet, for example, the packet header of the existing packet is extended, and the operation indication information is carried in the packet header obtained by the expansion.
- the first packet further includes a storage address of the second data
- acquiring the second data from the memory includes: reading the second data from the memory according to the storage address.
- writing the first operation result into the memory includes: according to the storage address of the second data, storing the first operation result in the storage location where the second data is located, so as to overwrite the second data.
- a network card, a memory and a bus are integrated in a system-on-a-chip (SoC).
- the operation indication information includes: an operation type and a data type.
- the operation type and the data type make it possible to determine that a data operation of the MPI operation needs to be performed on the first data, so that when the network card obtains this information, it does not need to write the first data into the memory, but directly obtains the second data and performs the data operation on the first data and the second data, thereby reducing the number of times of reading and writing the memory, reducing the delay of the MPI operation, and improving the execution efficiency of the MPI operation.
- the operation indication information is carried in a packet header of the first packet.
- a simple and effective manner of carrying the operation indication information is provided.
- the MPI operation includes: an MPI_reduce operation, or an MPI_allreduce operation.
- the delay of the MPI_reduce operation or the MPI_allreduce operation can be reduced, thereby improving the execution efficiency of the MPI_reduce operation or the MPI_allreduce operation.
- the network card is further coupled to the processor through a bus, and the method further includes: sending notification information to the processor, where the notification information is used to indicate that the data operation is completed.
- by sending notification information to the processor, the state of the MPI operation recorded by the processor can be kept consistent with the actual state of the MPI operation, thereby ensuring the orderly and efficient execution of the MPI operation.
- a data computing device is provided, where the device is a network card or a chip built into the network card, the network card is coupled to a memory through a bus, and the device includes: a receiving unit for receiving a first message from a network; a processing unit for parsing the first message to obtain operation indication information and first data included in the first message, where the operation indication information is used to indicate that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; and an obtaining unit for obtaining second data from the memory, where the second data is local data of the data operation in the MPI operation; the processing unit is further configured to complete the data operation of the first data and the second data to obtain a first operation result. Further, the apparatus further includes: a writing unit for writing the first operation result into the memory.
- the apparatus includes: a receiving unit, configured to receive a first packet, where the first packet includes first data; a processing unit, configured to determine, when the first packet carries operation indication information, that a data operation of an MPI operation needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, where the second data is the local data of the MPI operation; the processing unit is further configured to complete the MPI operation on the first data and the second data to obtain a first operation result.
- the solution may further include: writing the first operation result into the memory.
- the operation indication information may be carried in the packet header of the first packet, for example, the packet header of the existing packet is extended, and the operation indication information is carried in the packet header obtained by the expansion.
- the first packet further includes a storage address of the second data
- the obtaining unit is further configured to: read the second data from the memory according to the storage address.
- the writing unit is further configured to: according to the storage address of the second data, store the first operation result in the storage location where the second data is located, so as to overwrite the second data.
- the network card, the memory and the bus are integrated in a system-on-a-chip (SoC).
- the operation indication information includes: an operation type and a data type.
- the operation indication information is carried in a packet header of the first packet.
- the MPI operation corresponding to the first data includes: an MPI_reduce operation or an MPI_allreduce operation.
- the network card is further coupled to the processor through a bus
- the apparatus further includes: a sending unit, configured to send notification information to the processor, where the notification information is used to indicate that the data operation is completed.
- a data computing device is provided, where the device is a network card or a built-in chip of the network card, the network card is coupled to a memory through a bus, code and data are stored in the memory, and the network card runs the code in the memory so that the device executes the data operation method provided by the first aspect or any possible implementation manner of the first aspect.
- a computer-readable storage medium is provided, having instructions stored therein which, when run on a computer, cause the computer to perform the data operation method provided by the first aspect or any possible implementation manner of the first aspect.
- a computer program product is provided, where, when the computer program product runs on a device, the device is caused to execute the data operation method provided by the first aspect or any possible implementation manner of the first aspect.
- any data computing device, computer storage medium or computer program product provided above is used to execute the corresponding method provided above; therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of the corresponding method provided above, which are not repeated here.
- FIG. 1 is a schematic diagram of the execution of an MPI operation.
- FIG. 2 is a schematic diagram of an MPI operation provided by an embodiment of the present application.
- FIG. 3a is a schematic structural diagram of a server provided by an embodiment of the present application.
- FIG. 3b is a schematic structural diagram of another server provided by an embodiment of the present application.
- FIG. 4 is a schematic flowchart of a data operation method provided by an embodiment of the present application.
- FIG. 5 is a schematic flowchart of another data operation method provided by an embodiment of the present application.
- FIG. 6 is a schematic diagram of an MPI operation provided by an embodiment of the present application.
- FIG. 7 is a schematic structural diagram of a data computing device according to an embodiment of the present application.
- FIG. 8 is a schematic structural diagram of another data computing apparatus provided by an embodiment of the present application.
- “At least one” means one or more
- “plural” means two or more.
- “And/or” describes the association relationship of the associated objects and indicates that three kinds of relationships are possible. For example, “A and/or B” can indicate: A alone, both A and B, or B alone, where A and B can be singular or plural.
- the character “/” generally indicates that the associated objects are in an “or” relationship.
- “At least one of the following item(s)” or similar expressions refer to any combination of these items, including any combination of a single item or plural items.
- for example, at least one of a, b, or c may represent: a, b, c, ab, ac, bc, or abc, where a, b, and c may each be single or multiple.
- words such as “first” and “second” are used to distinguish identical or similar items that have basically the same functions and effects.
- for example, the first threshold and the second threshold are only used to distinguish different thresholds, and their order is not limited. Those skilled in the art can understand that words such as “first” and “second” do not limit the quantity or the execution order.
- the message passing interface is a message passing programming interface and provides a multi-language function library that implements a series of MPI interfaces.
- the MPI standard defines a set of functions that enable applications to send messages from one MPI process to another.
- MPI collective communication may refer to implementing communication of different functions through MPI, and the function may be referred to as an MPI collective communication function.
- the MPI collective communication functions include reduce function types such as MPI_reduce and MPI_allreduce; that is, MPI_reduce and MPI_allreduce are both standard collective communication functions. The difference between the two is that with MPI_Reduce only one designated process node in the communication domain obtains the final calculation result, whereas with MPI_Allreduce every process node in the communication domain obtains the final calculation result.
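The difference between the two functions can be illustrated with a small simulation in plain Python. This is only a sketch of the semantics, not an MPI program: `simulate_reduce` and `simulate_allreduce` are hypothetical helpers, each list index stands for one process node (rank), and `None` marks a node that does not hold the final result.

```python
from functools import reduce
import operator


def simulate_reduce(contributions, root, op=operator.add):
    """MPI_Reduce semantics: only the root rank ends up with the result."""
    total = reduce(op, contributions)
    return [total if rank == root else None for rank in range(len(contributions))]


def simulate_allreduce(contributions, op=operator.add):
    """MPI_Allreduce semantics: every rank ends up with the result."""
    total = reduce(op, contributions)
    return [total] * len(contributions)


reduce_view = simulate_reduce([1, 2, 3, 4], root=0)
allreduce_view = simulate_allreduce([1, 2, 3, 4])
```

With four ranks contributing 1, 2, 3 and 4, only rank 0 holds the sum in `reduce_view`, while all four ranks hold it in `allreduce_view`.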
- the MPI collective communication can also be commonly referred to as MPI operation, and can usually be decomposed into three parts, namely synchronization, calculation and communication.
- the synchronization may refer to the synchronization and information exchange between different computing processes, or the synchronization between different steps and tasks in the same process; the calculation may refer to performing the specified operation on the input data in each process; the communication may refer to the data transfer between different nodes in the communication domain.
- the MPI collective communication is collectively referred to as MPI operation herein.
- the server may include: a memory 301, a processor 302, a network card 303, and a bus 304.
- the memory 301, the processor 302, and the network card 303 are interconnected through the bus 304.
- the memory 301 can be used to store data, software programs and modules, mainly including a storage program area and a storage data area.
- the storage program area can store an operating system, an application program required by at least one function, and the like; the storage data area can store data created during use of the server, and the like.
- the operating system may include a Linux operating system, a Unix operating system, or a Windows operating system, etc.
- the application program (application, APP) required by the at least one function may include an artificial intelligence (AI) related APP, a high-performance computing (HPC) related APP, a deep learning related APP, a computer graphics (CG) related APP, etc.
- the memory 301 includes but is not limited to static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), high-speed random access memory, etc. Further, the memory 301 may also include other non-volatile memories, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
- the processor 302 is used to control and manage the operation of the server, for example, by running or executing the software programs and/or modules stored in the memory 301 and calling the data stored in the memory 301 to execute various functions of the server and process data.
- the processor 302 includes, but is not limited to, a central processing unit (CPU), a network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, transistor logic device, logic circuit or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor 302 may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
- the network card 303 may be used to implement communication between the server and an external network, for example, the network card 303 may be a smart network interface card (smart NIC).
- the network card 303 may support remote direct memory access (RDMA); for example, the network card 303 receives packets from the network through RDMA, and sends packets to other network devices in the network through RDMA.
- the network card 303 may store the received message in the memory 301 by means of RDMA.
- the bus 304 may include an extended industry standard architecture (EISA) bus, and/or a peripheral component interconnect express (PCIe) bus, or the like.
- the bus 304 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in Figures 3a and 3b, but it does not mean that there is only one bus or one type of bus.
- the memory 301, the processor 302 and the network card 303 may all be integrated in a system on chip (SoC) of the server.
- the memory 301 and the processor 302 may be integrated in a system on chip (SoC) of the server, and the network card 303 is an external network card connected to the SoC through an external bus.
- FIG. 4 is a schematic flowchart of a data computing method provided by an embodiment of the present application. The method may be executed by a network card in the server provided above, and the method includes the following steps.
- the network card receives a first packet, where the first packet includes operation indication information and first data, and determines, according to the operation indication information, that a data operation of an MPI operation needs to be performed on the first data.
- the server may be any server in a communication domain including multiple servers, and the multiple servers may be jointly used to perform MPI operations.
- the multiple servers may send messages to each other through a network (for example, Ethernet), and the messages may include data used for data operations in the MPI operation.
- the multiple servers include a first server and a second server, the server may be the first server, the server may receive the first packet sent by the second server, and the server may also send the second packet to the second server.
- the first message and the second message may be messages in the same format, but the data included in the messages are different, and the following description takes the first message as an example.
- the first packet may be a packet based on the RDMA over Converged Ethernet (RoCE) protocol.
- the first packet may include a packet header and a payload, the operation indication information may be carried in a packet header of the first packet, and the first data may be carried in a payload of the first packet.
- an extension header is added to carry the operation indication information; for example, 4 bits are added to the standard RDMA transport header field as the extension header reduce_eth, of which 1 bit can be used to indicate the specific data type reduce_type (for example, the data type can include int8, int16, int32, uint8, uint16, uint32, FP16 or FP32), another 1 bit indicates the specific operation type reduce_code (for example, the operation type includes max, min or sum), and the remaining 2 bits can be reserved.
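A minimal sketch of how such an extension header might be packed and parsed. The field names reduce_type and reduce_code come from the text; everything else is an assumption made for illustration. In particular, the sketch widens each field to one byte (with two reserved bytes), because a single bit cannot distinguish the eight data types listed, and the numeric codes in `ReduceType` and `ReduceCode` are hypothetical.

```python
import struct
from enum import IntEnum


class ReduceType(IntEnum):
    """Hypothetical numeric codes for the data types named in the text."""
    INT8 = 0
    INT16 = 1
    INT32 = 2
    UINT8 = 3
    UINT16 = 4
    UINT32 = 5
    FP16 = 6
    FP32 = 7


class ReduceCode(IntEnum):
    """Hypothetical numeric codes for the operation types named in the text."""
    MAX = 0
    MIN = 1
    SUM = 2


# Assumed layout: reduce_type (1 byte), reduce_code (1 byte), reserved (2 bytes),
# all in network byte order.
REDUCE_ETH = struct.Struct("!BBH")


def pack_reduce_eth(reduce_type, reduce_code):
    """Build the extension header bytes that precede the payload."""
    return REDUCE_ETH.pack(reduce_type, reduce_code, 0)


def parse_reduce_eth(header):
    """Recover the data type and operation type from the header bytes."""
    raw_type, raw_code, _reserved = REDUCE_ETH.unpack(header[:REDUCE_ETH.size])
    return ReduceType(raw_type), ReduceCode(raw_code)


wire = pack_reduce_eth(ReduceType.FP32, ReduceCode.SUM)
parsed = parse_reduce_eth(wire)
```

A receiver that finds this header present would treat the payload as first data for a reduce step; a packet without it would follow the ordinary receive path.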
- the RDMA-based interface between software and hardware also adds a corresponding work request (WR) type, which is used for reading or writing the extended message header.
- the MPI operation corresponding to the first data may be any MPI operation including a data operation, for example, the MPI operation may be an MPI_reduce operation, an MPI_allreduce operation, or the like.
- the operation indication information may include the operation type and data type of the data operation in the MPI operation.
- the operation type may be addition, subtraction, or multiplication, and the data type may be half-precision floating-point numbers, single-precision floating-point numbers, double-precision floating-point numbers, integers, etc.
- a processor (e.g., a CPU) in the server may send a data computing task to the network card.
- the network card of the server may receive the first packet sent from other servers in the network, and the network card may parse the first packet to obtain the operation indication information and the first data included in the first packet.
- the network card may determine, according to the operation indication information, that the data operation in the MPI operation needs to be performed on the first data. For example, if the operation indication information includes the operation type and data type of the MPI_reduce operation, the network card may determine, according to the operation type and the data type, that the data operation of the MPI_reduce operation needs to be performed on the first data.
- the network card obtains the second data from the memory.
- the memory may include a dynamic random access memory (DRAM).
- the second data may be local data stored in the DRAM for data operations of the MPI operation.
- the data type of the second data may be the same as the data type of the first data.
- the data type of the first data and the data type of the second data are both the data types indicated by the operation indication information in the first packet.
- the storage address corresponding to the second data may be carried in the first packet. Specifically, after the network card receives and parses the first packet, it can obtain the storage address of the second data from the first packet and then read the second data from the memory of the server based on that storage address.
- the network card completes the data operation of the first data and the second data, and obtains a first operation result.
- the network card may perform a data operation on the first data and the second data based on the operation indication information to obtain a first operation result. For example, if the operation type indicated by the operation indication information is addition and the data type is a floating-point number, the network card can add the first data and the second data based on the addition rule corresponding to the floating-point number to obtain the first operation result; Alternatively, if the operation type indicated by the operation indication information is multiplication and the data type is a floating point number, the network card may multiply the first data and the second data based on the multiplication rule corresponding to the floating point number to obtain the first operation result.
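The selection of an arithmetic rule from the operation type and the data type can be sketched as a pair of lookup tables. This is an illustration under stated assumptions rather than the network card's actual logic: the table keys, the little-endian element layout and the function `reduce_payload` are hypothetical.

```python
import operator
import struct

# Hypothetical lookup tables; the text names the categories, not these keys.
OPERATIONS = {"sum": operator.add, "prod": operator.mul, "max": max, "min": min}
FORMATS = {"int32": "i", "fp32": "f", "fp64": "d"}


def reduce_payload(op_name, type_name, first_data, second_data):
    """Combine two equally sized payloads element by element."""
    if len(first_data) != len(second_data):
        raise ValueError("payloads must have the same length")
    fmt = FORMATS[type_name]
    width = struct.calcsize("<" + fmt)
    if len(first_data) % width:
        raise ValueError("payload is not a whole number of elements")
    layout = "<%d%s" % (len(first_data) // width, fmt)
    op = OPERATIONS[op_name]
    a = struct.unpack(layout, first_data)   # first data, from the packet
    b = struct.unpack(layout, second_data)  # second data, local
    return struct.pack(layout, *(op(x, y) for x, y in zip(a, b)))


first = struct.pack("<2f", 1.5, 2.0)
second = struct.pack("<2f", 0.5, 4.0)
summed = struct.unpack("<2f", reduce_payload("sum", "fp32", first, second))
```

Because both payloads are decoded with the same format, the sketch also reflects the requirement that the first data and the second data share the data type indicated in the packet.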
- the method further includes: S404.
- S404 The network card stores the first operation result in the memory.
- the network card may store the first operation result in the memory of the server, for example, the network card stores the first operation result in the DRAM included in the memory.
- the storage address of the first operation result can be the same as the storage address of the second data, that is, the network card can store the first operation result in the storage location of the second data according to the storage address of the second data, thereby overwriting the second data.
- the method further includes: S405.
- S405 The network card sends notification information to the processor, where the notification information is used to indicate that the data operation is completed.
- the network card may send notification information to the processor, where the notification information is used to indicate that the data operation is completed.
- the processor can determine that the data operation is completed, thereby synchronizing the relevant state information of the MPI operation to ensure that the actual state of the MPI operation is consistent with the recorded state.
- the processor may also send the next task to the network card, so that the network card continues to perform the corresponding task.
- the processor may divide the data operation in the MPI operation into multiple data operation tasks, and send the multiple data operation tasks to the network card in sequence, that is, after the previous data operation task is completed, the next data operation task is sent to the network card, until the multiple data operation tasks are completed.
- the network card may execute the method provided above.
- the data operation in the MPI operation may include three data operation tasks.
- the network card can perform the MPI operation by successively performing three data operations. Specifically, first, the processor sends the task of computing A+B to the network card, and the network card performs the A+B operation according to the above S401-S405 and reports it; second, the processor sends the task of computing A+B+C+D to the network card, and the network card performs the A+B+C+D operation according to the above S401-S405 and reports it; finally, the processor sends the task of computing A+B+C+D+E+F+G+H to the network card, and the network card performs the A+B+C+D+E+F+G+H operation according to the above S401-S405 and reports it.
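The three-task sequence can be sketched as a loop in which each task combines the value accumulated locally with a partial result arriving from a peer. That reading (B, then C+D, then E+F+G+H arriving as the first data of successive tasks) is an assumption made for illustration; the text does not state how the partial sums are distributed. The function `run_staged_reduction` is hypothetical.

```python
def run_staged_reduction(local_value, incoming_partials):
    """Run one data operation task per incoming partial, in order."""
    notifications = []
    for step, partial in enumerate(incoming_partials, start=1):
        # S401-S404: the network card combines the received partial with the
        # local value and stores the result in place of the local value.
        local_value = local_value + partial
        # S405: the network card reports completion before the next task is sent.
        notifications.append(step)
    return local_value, notifications


A, B, C, D, E, F, G, H = range(1, 9)
total, completed_tasks = run_staged_reduction(A, [B, C + D, E + F + G + H])
```

Each iteration starts only after the previous one has been reported, which mirrors the processor sending the next task after the notification for the previous one.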
- the following description takes as an example the case in which the processor, memory, network card and bus in the server are all integrated in the SoC of the server, the memory is a DRAM, and the processor is a CPU.
- from the time when the network card receives the first message to the time when the first operation result is stored in the memory, the network card only needs to perform one read operation and one write operation on the memory; that is, the data operation can be completed by reading the second data from the memory and writing the first operation result into the memory.
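A minimal sketch of the on-path flow, using a toy memory model that counts accesses. The names `CountingMemory` and `on_path_reduce` are hypothetical; the sketch only illustrates that the received first data never passes through the memory, so one read and one write are enough.

```python
class CountingMemory:
    """Toy memory model (hypothetical) that counts read and write accesses."""

    def __init__(self):
        self.cells = {}
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.cells[addr]

    def write(self, addr, value):
        self.writes += 1
        self.cells[addr] = value


def on_path_reduce(memory, first_data, local_addr, op):
    """Combine received data with local data without staging it in memory."""
    second_data = memory.read(local_addr)  # the only read
    result = op(first_data, second_data)   # computed inside the network card
    memory.write(local_addr, result)       # the only write, overwriting the second data
    return result


on_path_memory = CountingMemory()
on_path_memory.cells["local"] = 5  # preload the second data without counting it
on_path_result = on_path_reduce(on_path_memory, 3, "local", lambda x, y: x + y)
```

Writing the result back to the address of the second data follows the in-place storage option described for S404.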
- step 1 means that the network card reads the data in the DDR of the server into the network card, where it waits to participate in the operation with the data from the network;
- step 2 means that the network card receives the data from the network, identifies from the relevant information in the message header that a reduce calculation is needed, completes the calculation while processing the receive queue work queue element (RQ_WQE), and writes the calculation result back to the memory of the server;
- step 3 means that the software reads the completion queue element (CQE) by interrupt or polling.
- the method includes: the network card receives and parses the first packet from the network, and obtains operation indication information and first data, where the operation indication information is carried in a packet header of the first packet;
- the network card determines, according to the operation indication information, that a data operation of the MPI operation needs to be performed on the first data;
- the network card reads the second data A1 from the storage (for example, the memory) of the server into the network card, completes the calculation of the first data and the second data while processing the RQ_WQE, and writes the calculation result back to the memory of the server; after that, the processor reads the CQE by interrupt or polling, that is, the processor receives the notification information sent by the network card, to complete the information synchronization of the MPI operation.
- the second data is represented as A1, and the first operation result is represented as R1.
- a local RQ_WQE needs to be consumed, and the RQ_WQE indicates a piece of local DDR space.
- the network card first determines, based on the operation indication information carried in the extended packet header, that the MPI operation needs to be performed on the first data. Therefore, after the first data is obtained, an on-path operation is first performed on the first data, and after the calculation result is obtained, the calculation result is written into the memory space indicated by the RQ_WQE, so that the corresponding calculation result can be obtained from that memory space.
- the CPU does not perceive the calculation process, and only processes the reported interrupt after the calculation is completed, thereby greatly reducing the operating system (OS) noise of the CPU and improving the execution efficiency of the CPU.
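The interplay of the RQ_WQE, the on-path calculation and the CQE can be sketched with two queues. This is a toy model under stated assumptions, not an RDMA implementation: `ToyNic`, its methods and `poll_completion` are hypothetical, a dictionary stands in for the DDR, and addition stands in for whichever operation the header indicates.

```python
from collections import deque


class ToyNic:
    """Toy network card (hypothetical) with a receive queue and a completion queue."""

    def __init__(self, memory):
        self.memory = memory              # dict of address -> value, stands in for DDR
        self.receive_queue = deque()      # posted RQ_WQEs, each naming a buffer address
        self.completion_queue = deque()   # CQEs waiting to be read by software

    def post_receive(self, addr):
        """Software posts an RQ_WQE that points at a piece of local memory."""
        self.receive_queue.append(addr)

    def on_packet(self, has_reduce_header, first_data):
        """Consume one RQ_WQE for the arriving packet and report a CQE."""
        addr = self.receive_queue.popleft()
        if has_reduce_header:
            # On-path calculation: combine with the local data at that address.
            result = self.memory[addr] + first_data
        else:
            # Ordinary receive: the payload is stored as it is.
            result = first_data
        self.memory[addr] = result
        self.completion_queue.append(addr)


def poll_completion(nic):
    """Software side: return the next CQE, or None if nothing has completed."""
    return nic.completion_queue.popleft() if nic.completion_queue else None


ddr = {0x1000: 10}
nic = ToyNic(ddr)
nic.post_receive(0x1000)
nic.on_packet(has_reduce_header=True, first_data=5)
cqe = poll_completion(nic)
```

The software side only posts the receive buffer and later reads the completion entry; it takes no part in the calculation itself.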
- the whole process needs only one DDR write and one DDR read, and the entire delay consists of one DDR read delay, the data calculation delay of the RDMA network card, and one DDR write delay.
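Assuming the stages run back to back, this delay accounting can be written as a simple sum. The symbols are introduced here for illustration only. The second expression is an inference from steps S1 to S4 of the SQ-offload flow described in the background (two reads, two writes, plus the wait for the SQ to be scheduled) and is not a formula given in the application.

```latex
\begin{align*}
T_{\text{on-path}} &\approx T_{\text{rd}} + T_{\text{calc}} + T_{\text{wr}} \\
T_{\text{SQ-offload}} &\approx T_{\text{sched}} + 2\,T_{\text{rd}} + T_{\text{calc}} + 2\,T_{\text{wr}}
\end{align*}
```

Here $T_{\text{rd}}$ and $T_{\text{wr}}$ denote one DDR read and one DDR write, $T_{\text{calc}}$ the calculation delay in the network card, and $T_{\text{sched}}$ the SQ scheduling wait.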
- when the network card receives the first message and obtains the operation indication information and the first data in the first message, the network card can directly obtain the second data from the memory and, according to the operation indication information, complete the data operation of the first data and the second data to obtain the first operation result.
- the network card does not need to write the first data into the memory, but directly obtains the second data when the first data is obtained, that is, it performs on-path calculation on the first data and the second data, thereby reducing the number of times of reading and writing the memory, reducing the delay of the MPI operation, and improving the execution efficiency of the MPI operation.
- the network card, processor and memory in the server are all integrated in the SoC of the server, the end-to-end transmission delay can also be reduced, and the execution efficiency of the MPI operation can be further improved.
- the server includes corresponding hardware structures and/or software modules for executing each function.
- the present application can be implemented in hardware or a combination of hardware and computer software with reference to the network elements and algorithm steps of each example described in the embodiments disclosed herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
- the data computing device in the MPI operation can be divided into functional modules according to the above method examples.
- each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 7 shows a possible schematic structural diagram of the data computing device involved in the above embodiment; the device is a network card or a chip built in the network card, and the network card is coupled to the memory through a bus.
- the apparatus includes: a receiving unit 501 , a processing unit 502 and an obtaining unit 503 .
- the receiving unit 501 is used to support the device to receive the first message from the network;
- the processing unit 502 is used to support the device to parse the first message to obtain the operation indication information and the first data included in the first message.
- the operation indication information is used to indicate the data operation of the message passing interface (MPI) operation that needs to be performed on the first data; the obtaining unit 503 is used to support the device to obtain the second data from the memory, where the second data is the local data of the data operation in the MPI operation; the processing unit 502 is further configured to support the device to complete the data operation of the first data and the second data in the MPI operation to obtain the first operation result. Further, the apparatus further includes: a writing unit 504 and a sending unit 505. The writing unit 504 is used to support the device to write the first operation result into the memory; the sending unit 505 is used to support the device to send notification information to the processor, and the notification information is used to indicate that the data operation is completed.
- the processing unit 502 and the writing unit 504 in the present application may be part of the functions of the processor of the device, and the receiving unit 501, the obtaining unit 503 and the sending unit 505 may be the transceiver functions of the device.
- the transceiver may generally include a transmitter and a receiver, and a specific transceiver may also be referred to as a communication interface.
- FIG. 8 shows another possible schematic structural diagram of the data computing device involved in the above embodiment; the device is a network card or a chip built in the network card, the network card is coupled to the memory through a bus, and the device includes: a processor 602 and a communication interface 603.
- the processor 602 is used to control and manage the actions of the device.
- the processor 602 can be used to support the device to perform S401 to S405 in the above-mentioned embodiments through the communication interface 603, and/or other processes of the technology described herein.
- the device can also include a memory 601 and a bus 604; the processor 602, the communication interface 603 and the memory 601 are connected to each other through the bus 604; the communication interface 603 is used to support the device to communicate; the memory 601 is used to store the program code and data of the device.
- the processor 602 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor may also be a combination that performs computing functions, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
- the bus 604 may be a peripheral component interconnect (PCI) bus or an Extended industry standard architecture (EISA) bus or the like.
- a readable storage medium is also provided, where computer-executable instructions are stored in the readable storage medium; when a device executes the computer-executable instructions, the device performs the steps of the network card in the method provided by the above method embodiments.
- the aforementioned readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, an optical disk, or other media that can store program code.
- in another embodiment, a computer program product is provided; the computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium;
- at least one processor of the device reads the computer-executable instructions from the computer-readable storage medium and executes them, causing the device to perform the steps of the network card in the method provided by the above method embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Computer And Data Communications (AREA)
Abstract
The present application provides a data computation method and device, relating to the technical field of communications, and capable of reducing delays in MPI operations and improving MPI execution efficiency. The method is applicable to a network card coupled to a memory via a bus, and comprises: receiving a first packet, the first packet comprising computation instruction information and first data; determining, according to the computation instruction information, that data computation in an MPI operation needs to be performed on the first data; acquiring second data from the memory, the second data being local data for the data computation in the MPI operation; and performing data computation on the first data and the second data, so as to obtain a first computation result.
Description
The present application relates to the field of communication technologies, and in particular, to a data computation method and device.
With the rapid development of scenarios such as high performance computing (HPC) and artificial intelligence (AI), the execution efficiency of message passing interface (MPI) collective communication functions is becoming increasingly important. MPI collective communication functions include reduce-type functions such as MPI_reduce and MPI_allreduce. MPI reduce functions account for a large proportion, about 40%, of MPI application scenarios, so improving their execution efficiency considerably improves the running efficiency of MPI applications. An MPI reduce function can be decomposed into three parts: calculation, synchronization, and communication. The present application optimizes the calculation part of the MPI reduce function.
In the prior art, the calculation part of an MPI operation is usually implemented through send queue (SQ) task offloading on an external network card of the server. Specifically, as shown in FIG. 1, the method includes: S1. the network card writes data A1 from a network packet into a dynamic random access memory (DRAM); S2. the network card schedules SQs, and when the task of the selected SQ is a reduce operation that performs a specified operation on data A1, the network card reads data A1 of the packet from the DRAM; S3. the network card reads local data A2 from the DRAM; S4. the network card completes the operation on data A1 and data A2, and writes the operation result into the DRAM.
In the above method, the network card can perform the data operation only after it schedules the SQ corresponding to the computing task. The larger the network scale, the more SQs the network card has, and the longer the delay before the network card schedules the SQ corresponding to the computing task, which results in a large delay for the MPI operation.
SUMMARY OF THE INVENTION
The present application provides a data computation method and device, which are used to reduce the delay of MPI operations and improve the execution efficiency of MPI operations.
To achieve the above object, the embodiments of the present application adopt the following technical solutions:
According to a first aspect, a data operation method in an MPI operation is provided, applied to a network card that is coupled to a memory through a bus. The method includes: receiving a first packet, where the first packet may be sent by another server in the network that performs the message passing interface (MPI) operation, and the first packet includes operation indication information and first data; determining, according to the operation indication information, that a data operation of the MPI operation needs to be performed on the first data; obtaining second data from the memory, where the second data is local data of the data operation in the MPI operation; and completing the data operation (for example, addition or multiplication) of the first data and the second data in the MPI operation to obtain a first operation result. Further, the first operation result is written into the memory.
In the above technical solution, when the network card receives the first packet and obtains the operation indication information and the first data in it, the network card can determine, according to the operation indication information, that a data operation of the MPI operation needs to be performed on the first data. It then obtains the local data of the data operation in the MPI operation, that is, the second data, directly from the memory, and completes the data operation of the first data and the second data to obtain the first operation result. Compared with the prior art, the network card does not need to write the first data into the memory; instead, it obtains the second data as soon as the first data is obtained, that is, it performs on-path calculation on the first data and the second data. This reduces the number of memory reads and writes, thereby reducing the delay of the MPI operation and improving the execution efficiency of MPI.
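The difference in memory traffic between the prior-art flow (S1 to S4) and the on-path flow can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: `CountingMemory`, `sq_offload_reduce`, and `on_path_reduce` are hypothetical names, the memory is a Python dictionary rather than real DRAM, and SQ scheduling delay is not modeled.

```python
import operator


class CountingMemory:
    """Toy stand-in for server memory that counts reads and writes (hypothetical)."""

    def __init__(self):
        self.cells = {}
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.cells[addr]

    def write(self, addr, value):
        self.writes += 1
        self.cells[addr] = value


def sq_offload_reduce(mem, first_data, buf_addr, local_addr, result_addr, op):
    """Prior-art flow: stage the packet data in memory, then read both operands."""
    mem.write(buf_addr, first_data)             # S1: packet data written to memory
    first = mem.read(buf_addr)                  # S2: packet data read back
    second = mem.read(local_addr)               # S3: local data read
    mem.write(result_addr, op(first, second))   # S4: result written


def on_path_reduce(mem, first_data, local_addr, result_addr, op):
    """On-path flow: the packet data is never staged in memory."""
    second = mem.read(local_addr)                    # obtain the second (local) data
    mem.write(result_addr, op(first_data, second))   # write the first operation result


prior = CountingMemory()
prior.cells["local"] = 5
sq_offload_reduce(prior, 3, "buf", "local", "out", operator.add)

on_path = CountingMemory()
on_path.cells["local"] = 5
on_path_reduce(on_path, 3, "local", "out", operator.add)

print(prior.reads, prior.writes, on_path.reads, on_path.writes)
```

In this model both flows produce the same result, but the prior-art flow costs two reads and two writes while the on-path flow costs one read and one write.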
In a possible implementation of the first aspect, the method includes: receiving a first packet, where the first packet includes first data; when the first packet carries operation indication information, determining that a data operation of a message passing interface (MPI) operation needs to be performed on the first data, and obtaining second data from the memory, where the second data is local data of the operation of the MPI operation; and completing the MPI operation on the first data and the second data to obtain a first operation result. It should be understood that the solution may further include: writing the first operation result into the memory.
It should be understood that the operation indication information may be carried in the packet header of the first packet. For example, the packet header of an existing packet is extended, and the operation indication information is carried in the extended packet header.
In a possible implementation of the first aspect, the first packet further includes a storage address of the second data, and obtaining the second data from the memory includes: reading the second data from the memory according to the storage address. Further, writing the first operation result into the memory includes: according to the storage address of the second data, storing the first operation result at the storage location of the second data so as to overwrite the second data. This implementation prevents useless data from occupying storage space in the memory, thereby improving memory utilization.
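The overwrite behavior can be sketched as follows. This is a minimal illustration, assuming the memory is modeled as a Python list, the storage address is a list index, and the data is a vector reduced element by element; `reduce_in_place` is a hypothetical helper, not a function defined by the application.

```python
import operator


def reduce_in_place(memory, address, first_data, op):
    """Read the second data at `address`, combine it with the packet data,
    and store the result at the same location, overwriting the second data."""
    second_data = memory[address:address + len(first_data)]
    result = [op(x, y) for x, y in zip(first_data, second_data)]
    memory[address:address + len(result)] = result
    return result


memory = [0, 10, 20, 30, 0]          # local data occupies indices 1 to 3
result = reduce_in_place(memory, 1, [1, 2, 3], operator.add)
print(result, memory)
```

Because the result reuses the location of the second data, no separate result buffer is needed in this model.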
In a possible implementation of the first aspect, the network card, the memory, and the bus are integrated in a system on chip (SoC). Integrating the network card, the memory, and the bus in the SoC reduces the end-to-end transmission delay and further improves the execution efficiency of the data operation in the MPI operation.
In a possible implementation of the first aspect, the operation indication information includes an operation type and a data type. The operation type and the data type make it possible to determine that a data operation of the MPI operation needs to be performed on the first data. When the network card obtains this information, it does not need to write the first data into the memory; instead, it obtains the second data directly and performs the data operation of the first data and the second data. This reduces the number of memory reads and writes, reduces the delay of the MPI operation, and improves the execution efficiency of MPI.
In a possible implementation of the first aspect, the operation indication information is carried in the packet header of the first packet. This implementation provides a simple and effective way of carrying the operation indication information.
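One way to picture operation indication information that consists of an operation type and a data type and travels in a packet header is sketched below. The header layout (one byte for the operation code, one byte for the data type code, two bytes for the element count), the code values, and the function names are all assumptions made for illustration; the application does not specify a wire format.

```python
import operator
import struct

# Hypothetical code tables; the real encoding is not given in the source.
OPS = {1: operator.add, 2: operator.mul, 3: max, 4: min}
DTYPES = {1: "i", 2: "d"}  # struct format characters: 32-bit int, 64-bit float

HEADER = struct.Struct("!BBH")  # operation code, data type code, element count


def build_packet(op_code, dtype_code, values):
    """Prepend the assumed extended header to the packed payload."""
    payload_format = "!%d%s" % (len(values), DTYPES[dtype_code])
    header = HEADER.pack(op_code, dtype_code, len(values))
    return header + struct.pack(payload_format, *values)


def parse_packet(packet):
    """Return the operation callable and the decoded first data."""
    op_code, dtype_code, count = HEADER.unpack_from(packet)
    payload_format = "!%d%s" % (count, DTYPES[dtype_code])
    values = struct.unpack_from(payload_format, packet, HEADER.size)
    return OPS[op_code], list(values)


packet = build_packet(1, 1, [1, 2, 3])       # addition on 32-bit integers
op, first_data = parse_packet(packet)
second_data = [10, 20, 30]                   # local data, assumed already fetched
first_result = [op(x, y) for x, y in zip(first_data, second_data)]
print(len(packet), first_data, first_result)
```

Keeping the operation and data type in the header lets the receiver decide how to combine the payload before any of it is stored, which is the property the text relies on.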
In a possible implementation of the first aspect, the MPI operation includes an MPI_reduce operation or an MPI_allreduce operation. This implementation reduces the delay of the MPI_reduce operation or the MPI_allreduce operation, thereby improving its execution efficiency.
In a possible implementation of the first aspect, the network card is further coupled to a processor through the bus, and the method further includes: sending notification information to the processor, where the notification information is used to indicate that the data operation is completed. Sending notification information to the processor keeps the state of the MPI operation recorded by the processor consistent with the actual state of the MPI operation, thereby ensuring that MPI operations are executed in an orderly and efficient manner.
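The notification step can be pictured as a completion queue that the network card posts to and the processor polls. The sketch below assumes a polling model and uses a hypothetical `CompletionQueue` class with dictionary entries; interrupt-driven delivery and the real entry format are not modeled.

```python
from collections import deque


class CompletionQueue:
    """Toy completion queue: the network card posts, the processor polls (hypothetical)."""

    def __init__(self):
        self._entries = deque()

    def post(self, entry):
        # Called on the network card side once the result has been written.
        self._entries.append(entry)

    def poll(self):
        # Called on the processor side; returns None when nothing has completed.
        return self._entries.popleft() if self._entries else None


cq = CompletionQueue()
before = cq.poll()                               # nothing completed yet
cq.post({"status": "done", "result_addr": 1})    # notification that the operation finished
after = cq.poll()
print(before, after)
```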
According to a second aspect, a data computation device is provided. The device is a network card or a chip built into a network card, and the network card is coupled to a memory through a bus. The device includes: a receiving unit, configured to receive a first packet from a network; a processing unit, configured to parse the first packet to obtain operation indication information and first data included in the first packet, where the operation indication information is used to indicate that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, where the second data is local data of the data operation in the MPI operation. The processing unit is further configured to complete the data operation of the first data and the second data to obtain a first operation result. Further, the device also includes: a writing unit, configured to write the first operation result into the memory.
In a possible implementation of the second aspect, the device includes: a receiving unit, configured to receive a first packet, where the first packet includes first data; a processing unit, configured to determine, when the first packet carries operation indication information, that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, where the second data is local data of the operation of the MPI operation. The processing unit is further configured to complete the MPI operation on the first data and the second data to obtain a first operation result. It should be understood that the solution may further include: writing the first operation result into the memory. It should be understood that the operation indication information may be carried in the packet header of the first packet. For example, the packet header of an existing packet is extended, and the operation indication information is carried in the extended packet header.
In a possible implementation of the second aspect, the first packet further includes a storage address of the second data, and the obtaining unit is further configured to read the second data from the memory according to the storage address. Further, the writing unit is further configured to store, according to the storage address of the second data, the first operation result at the storage location of the second data so as to overwrite the second data.
In a possible implementation of the second aspect, the network card, the memory, and the bus are integrated in a system on chip (SoC).
In a possible implementation of the second aspect, the operation indication information includes an operation type and a data type.
In a possible implementation of the second aspect, the operation indication information is carried in the packet header of the first packet.
In a possible implementation of the second aspect, the MPI operation corresponding to the first data includes an MPI_reduce operation or an MPI_allreduce operation.
In a possible implementation of the second aspect, the network card is further coupled to a processor through the bus, and the device further includes: a sending unit, configured to send notification information to the processor, where the notification information is used to indicate that the data operation is completed.
According to a third aspect, a data computation device is provided. The device is a network card or a chip built into a network card, the network card is coupled to a memory through a bus, the memory stores code and data, and the network card runs the code in the memory so that the device performs the data computation method provided by the first aspect or any possible implementation of the first aspect.
In another aspect of the present application, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the data computation method provided by the first aspect or any possible implementation of the first aspect.
In another aspect of the present application, a computer program product is provided, characterized in that, when the computer program product runs on a device, the device is caused to perform the data computation method provided by the first aspect or any possible implementation of the first aspect.
It can be understood that any data computation device, computer storage medium, or computer program product provided above is used to perform the corresponding method provided above. Therefore, for the beneficial effects that it can achieve, refer to the beneficial effects of the corresponding method provided above; details are not repeated here.
FIG. 1 is a schematic diagram of the execution of an MPI operation;
FIG. 2 is a schematic diagram of an MPI operation provided by an embodiment of the present application;
FIG. 3a is a schematic structural diagram of a server provided by an embodiment of the present application;
FIG. 3b is a schematic structural diagram of another server provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data computation method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of another data computation method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an MPI operation provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data computation device provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another data computation device provided by an embodiment of the present application.
In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible. For example, A and/or B can indicate: A alone, both A and B, or B alone, where A and B can each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, the embodiments of the present application use words such as "first" and "second" to distinguish between identical or similar items that have basically the same functions and effects. For example, a first threshold and a second threshold are named only to distinguish different thresholds, and their order is not limited. Those skilled in the art can understand that words such as "first" and "second" do not limit quantity or execution order.
It should be noted that, in this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described in this application as "exemplary" or "for example" should not be construed as being preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the related concepts in a concrete manner.
Before the embodiments of the present application are introduced, the related technical terms involved in the embodiments are first explained.
The message passing interface (MPI) is a message-passing programming interface, and also provides multi-language function libraries that implement a series of MPI interfaces. The MPI standard defines a set of functions that enable an application to send messages from one MPI process to another MPI process.
MPI collective communication may refer to communication of different functions implemented through MPI. Such a function may be referred to as an MPI collective communication function, and MPI collective communication functions include reduce-type functions such as MPI_reduce and MPI_allreduce. That is, MPI_reduce and MPI_allreduce are both defined standard collective communication functions. The difference between the two is that with MPI_Reduce, one process node in the communication domain obtains the final calculation result, whereas with MPI_Allreduce, every process node in the communication domain obtains the final calculation result.
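The difference between the two functions can be shown with a small simulation that does not use an MPI library. `simulate_reduce` and `simulate_allreduce` are hypothetical helpers that only model which process ends up holding the result; they are a sketch of the semantics, not of any MPI implementation.

```python
import operator
from functools import reduce


def simulate_reduce(values, op, root=0):
    """Only the root process holds the combined result; the others hold nothing."""
    total = reduce(op, values)
    return [total if rank == root else None for rank in range(len(values))]


def simulate_allreduce(values, op):
    """Every process holds the combined result."""
    total = reduce(op, values)
    return [total] * len(values)


contributions = [1, 2, 3, 4]   # one value per process
reduce_view = simulate_reduce(contributions, operator.add, root=2)
allreduce_view = simulate_allreduce(contributions, operator.add)
print(reduce_view, allreduce_view)
```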
MPI collective communication is also commonly referred to as an MPI operation, and can usually be decomposed into three parts: synchronization, calculation, and communication. Synchronization may refer to synchronization and information exchange between different computing processes, or synchronization between tasks of different steps within the same process. Calculation may refer to performing a specified operation on the input data within each process. Communication may refer to data transfer between different nodes in the communication domain. For ease of description, MPI collective communication is collectively referred to as an MPI operation herein.
示例性的,如图2所示,假设在一个组网规模为8个节点(分别表示为P0至P7)的通信域中执行一个MPI_allreduce操作,且使用递归倍增(recursive doubling)算法,则该通信域内的各节点仅需3次收发通信,当所有节点都完成3次收发通信时,该MPI_allreduce操作完成。具体实现步骤可以包括如下步骤S01至S03。Exemplarily, as shown in Figure 2, it is assumed that an MPI_allreduce operation is performed in a communication domain with a network scale of 8 nodes (represented as P0 to P7), and a recursive doubling algorithm is used, then the communication Each node in the domain only needs to send and receive communications three times, and the MPI_allreduce operation is completed when all nodes complete the three times of sending and receiving communications. The specific implementation steps may include the following steps S01 to S03.
S01.距离为1的节点相互交换1/8数据,并作reduction运算,结果是每个节点得 到1/4数据的reduction结果。比如,如下表1所示,P0与P1之间交换数据A和B,P2与P3之间交换数据C和D,P4与P5之间交换数据E和F,P6与P7之间交换数据G和H,每个节点分别作加法运算,则P0和P1得到A+B,P2和P3得到C+D,P4和P5得到E+F,P6和P7得到G+H。S01. Nodes with a distance of 1 exchange 1/8 data with each other and perform a reduction operation. The result is that each node gets a reduction result of 1/4 data. For example, as shown in Table 1 below, data A and B are exchanged between P0 and P1, data C and D are exchanged between P2 and P3, data E and F are exchanged between P4 and P5, and data G and F are exchanged between P6 and P7. H, each node is added separately, then P0 and P1 get A+B, P2 and P3 get C+D, P4 and P5 get E+F, P6 and P7 get G+H.
S02.距离为2的节点相互交换1/4数据,并作reduction运算,结果是每个节点得到1/2数据的reduction结果。比如,如下表1所示,P0与P2之间、以及P1与P3之间分别交换数据A+B和C+D,P4与P6之间、以及P5与P7之间分别交换数据E+F和G+H,每个节点分别作加法运算,则P0至P3均得到A+B+C+D,P4至P7均得到E+F+G+H。S02. Nodes with a distance of 2 exchange 1/4 data with each other and perform a reduction operation. The result is that each node gets a reduction result of 1/2 data. For example, as shown in Table 1 below, data A+B and C+D are exchanged between P0 and P2, and between P1 and P3, respectively, and data E+F and P7 are exchanged between P4 and P6, and between P5 and P7, respectively. G+H, each node is added separately, then P0 to P3 all get A+B+C+D, P4 to P7 all get E+F+G+H.
S03.距离为4的节点相互交换1/2数据,并作reduction运算,结果是每个节点得到所有数据的reduction结果。比如,如下表1所示,P0与P4之间、P1与P5之间、P2与P6之间、以及P3与P7之间分别交换数据A+B+C+D和E+F+G+H,每个节点分别作加法运算,则P0至P7均得到A+B+C+D+E+F+G+H。S03. Nodes with a distance of 4 exchange 1/2 data with each other and perform a reduction operation. The result is that each node gets the reduction result of all data. For example, as shown in Table 1 below, data A+B+C+D and E+F+G+H are exchanged between P0 and P4, between P1 and P5, between P2 and P6, and between P3 and P7, respectively. , and each node performs addition operation respectively, then P0 to P7 all get A+B+C+D+E+F+G+H.
Table 1
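Steps S01 to S03 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `recursive_doubling_allreduce` is hypothetical, the node count is assumed to be a power of two, and string concatenation stands in for the addition so that the partial results stay readable.

```python
def recursive_doubling_allreduce(values, op):
    """Simulate MPI_allreduce over len(values) nodes with recursive doubling.

    Illustrative only: a real implementation exchanges packets between
    nodes; here every node's value lives in one Python list.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two node count"
    current = list(values)
    rounds = 0
    distance = 1
    while distance < n:
        nxt = []
        for i in range(n):
            # Node i exchanges with the node at the current distance (i XOR distance).
            partner = i ^ distance
            lo, hi = min(i, partner), max(i, partner)
            # Both partners reduce the same pair, so they end with the same value.
            nxt.append(op(current[lo], current[hi]))
        current = nxt
        distance *= 2
        rounds += 1
    return current, rounds


# Eight nodes P0..P7 holding A..H.
results, rounds = recursive_doubling_allreduce(list("ABCDEFGH"), lambda a, b: a + b)
```

With eight nodes the loop runs for distances 1, 2, and 4, which corresponds to the three exchanges described in S01 to S03.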
Figures 3a and 3b are schematic structural diagrams of two exemplary servers provided in the embodiments of this application. The server may include a memory 301, a processor 302, a network card 303, and a bus 304, where the memory 301, the processor 302, and the network card 303 are connected to one another through the bus 304.
The memory 301 may be used to store data, software programs, and modules, and mainly includes a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created during use of the device, and the like. For example, the operating system may include a Linux operating system, a Unix operating system, or a Windows operating system; the application program (application, APP) required by the at least one function may include an APP related to artificial intelligence, high performance computing (HPC), deep learning, or computer graphics (CG). In a possible example, the memory 301 includes but is not limited to a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), or a high-speed random access memory. Further, the memory 301 may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
In addition, the processor 302 is used to control and manage the operation of the server, for example by running or executing the software programs and/or modules stored in the memory 301 and calling the data stored in the memory 301, so as to perform the various functions of the server and process data. In a possible example, the processor 302 includes but is not limited to a central processing unit (CPU), a network processing unit (NPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a logic circuit, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor 302 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
The network card 303 may be used for communication between the server and an external network; for example, the network card 303 may be a smart network interface card (smart NIC). In some feasible embodiments, the network card 303 may support remote direct memory access (RDMA); for example, the network card 303 receives packets from the network through RDMA and sends packets to other devices in the network through RDMA. The network card 303 may store a received packet in the memory 301 through RDMA.
The bus 304 may include an extended industry standard architecture (EISA) bus and/or a peripheral component interconnect express (PCIe) bus, and the like. The bus 304 may be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in Figures 3a and 3b, but this does not mean that there is only one bus or only one type of bus.
In the embodiments of this application, as shown in Figure 3a, the memory 301, the processor 302, and the network card 303 may all be integrated in a system on chip (SoC) of the server. Alternatively, as shown in Figure 3b, the memory 301 and the processor 302 may be integrated in the SoC of the server, and the network card 303 is an external network card connected to the SoC through an external bus.
Figure 4 is a schematic flowchart of a data operation method provided in an embodiment of this application. The method may be performed by the network card in the server described above and includes the following steps.
S401: The network card receives a first packet, where the first packet includes operation indication information and first data, and determines according to the operation indication information that a data operation of an MPI operation needs to be performed on the first data.
The server may be any server in a communication domain that includes multiple servers, and the multiple servers may jointly perform the MPI operation. The multiple servers may send packets to one another through a network (Ethernet), and a packet may include data used for the data operation in the MPI operation. For example, the multiple servers include a first server and a second server. If the server is the first server, the server may receive a first packet sent by the second server, and the server may also send a second packet to the second server. The first packet and the second packet may have the same format and differ only in the data they carry. The following description takes the first packet as an example.
In addition, the first packet may be a packet based on the RDMA over Converged Ethernet (RoCE) protocol. The first packet may include a packet header and a payload; the operation indication information may be carried in the packet header of the first packet, and the first data may be carried in the payload of the first packet. For example, an extension header is added to the existing RoCE protocol packet, and the operation indication information is carried in that extension header. For instance, a 4-bit extension header reduce_eth is added to the standard RDMA transport header field, where 1 bit may be used to indicate the specific data type reduce_type (for example, the data type may include int8, int16, int32, uint8, uint16, uint32, FP16, or FP32), another 1 bit is used to indicate the specific operation type reduce_code (for example, the operation type includes max, min, or sum), and the remaining 2 bits may be reserved. Correspondingly, when the operation indication information is carried in the extension header, a corresponding write read (WR) type is also added to the RDMA-based software/hardware interface for reading or writing the extension header.
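The 4-bit reduce_eth field described above can be sketched as a pair of pack/unpack helpers. The bit positions and the meaning assigned to each bit value are assumptions made for illustration; the description only states that one bit indicates reduce_type, one bit indicates reduce_code, and two bits are reserved. Because the description lists more data types and operation types than a single bit can distinguish, the sketch distinguishes only two of each.

```python
# Hypothetical layout (not specified in the description):
#   bit 3 = reduce_type, bit 2 = reduce_code, bits 1..0 = reserved (zero).
REDUCE_TYPE_NAMES = {0: "int32", 1: "fp32"}  # illustrative values only
REDUCE_CODE_NAMES = {0: "sum", 1: "max"}     # illustrative values only


def pack_reduce_eth(reduce_type, reduce_code):
    """Build the 4-bit field from two single-bit values."""
    if reduce_type not in (0, 1) or reduce_code not in (0, 1):
        raise ValueError("each field is a single bit")
    return (reduce_type << 3) | (reduce_code << 2)


def unpack_reduce_eth(field):
    """Split the 4-bit field into (reduce_type, reduce_code, reserved)."""
    if not 0 <= field <= 0xF:
        raise ValueError("reduce_eth is 4 bits wide")
    return (field >> 3) & 1, (field >> 2) & 1, field & 0b11
```

A receiver following this assumed layout would call `unpack_reduce_eth` on the extension header and look up the names in the two tables before choosing the arithmetic to apply.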
Furthermore, the MPI operation corresponding to the first data may be any MPI operation that includes a data operation; for example, the MPI operation may be an MPI_reduce operation or an MPI_allreduce operation.
Optionally, the operation indication information may include the operation type and the data type of the data operation in the MPI operation. For example, the operation type may be addition, subtraction, or multiplication, and the data type may be a half-precision floating-point number, a single-precision floating-point number, a double-precision floating-point number, or an integer. For the specific operation types and data types, and for the MPI operation itself, reference may be made to the descriptions in the related art; they are not described in the embodiments of this application.
Specifically, when the server performs an MPI operation, the processor (for example, a CPU) in the server may send a data operation task to the network card. The network card of the server may subsequently receive the first packet sent by another server in the network, and the network card may parse the first packet to obtain the operation indication information and the first data included in it. Once the operation indication information has been parsed, the network card may determine from it that the data operation in the MPI operation needs to be performed on the first data. For example, if the operation indication information includes the operation type and data type of an MPI_reduce operation, the network card may determine from the operation type and data type that the data operation of the MPI_reduce operation needs to be performed on the first data.
S402: The network card obtains second data from the memory.
The memory may include main memory, and the main memory may be a dynamic random access memory (DRAM). The second data may be local data stored in the DRAM and used for the data operation of the MPI operation. The data type of the second data may be the same as that of the first data; for example, the data type of the first data and the data type of the second data are both the data type indicated by the operation indication information in the first packet.
In addition, the storage address of the second data may be carried in the first packet. Specifically, after the network card receives and parses the first packet, the network card may obtain the storage address of the second data from the first packet and then obtain the second data from the memory of the server based on that storage address.
S403: The network card completes the data operation of the first data and the second data to obtain a first operation result.
When the network card has obtained the first data and the second data, the network card may perform the data operation on the first data and the second data based on the operation indication information to obtain the first operation result. For example, if the operation type indicated by the operation indication information is addition and the data type is floating-point, the network card may add the first data and the second data according to the floating-point addition rule to obtain the first operation result. Alternatively, if the operation type indicated is multiplication and the data type is floating-point, the network card may multiply the first data and the second data according to the floating-point multiplication rule to obtain the first operation result.
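The type-dependent arithmetic in S403 can be sketched in software as an element-wise reduction over two byte buffers. This is a host-side illustration, not the network card's hardware logic: the function name `reduce_payloads`, the little-endian byte order, and the string keys used for data types and operation types are all assumptions.

```python
import struct

# struct format characters for the data types named in the description.
FORMATS = {
    "int8": "b", "int16": "h", "int32": "i",
    "uint8": "B", "uint16": "H", "uint32": "I",
    "fp16": "e", "fp32": "f",
}

# Operation types mentioned in the description, keyed by hypothetical names.
OPERATIONS = {
    "sum": lambda a, b: a + b,
    "prod": lambda a, b: a * b,
    "max": max,
    "min": min,
}


def reduce_payloads(first, second, data_type, op_type):
    """Element-wise reduction of two equally sized little-endian buffers."""
    fmt = FORMATS[data_type]
    size = struct.calcsize("<" + fmt)
    if len(first) != len(second) or len(first) % size:
        raise ValueError("buffers must have equal length and hold whole elements")
    count = len(first) // size
    a = struct.unpack(f"<{count}{fmt}", first)
    b = struct.unpack(f"<{count}{fmt}", second)
    op = OPERATIONS[op_type]
    # struct.pack raises struct.error if an integer result overflows its type.
    return struct.pack(f"<{count}{fmt}", *(op(x, y) for x, y in zip(a, b)))
```

The result has the same length and layout as the inputs, which is what allows it to be written back over the second data in the later storing step.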
Further, as shown in Figure 5, after S403 the method further includes S404.
S404: The network card stores the first operation result in the memory.
Specifically, when the network card obtains the first operation result, the network card may store the first operation result in the memory of the server; for example, the network card stores the first operation result in the DRAM included in the memory. Optionally, the storage address of the first operation result may be the same as the storage address of the second data; that is, the network card may, based on the storage address of the second data, store the first operation result at the storage location of the second data so as to overwrite the second data.
Optionally, as shown in Figure 5, after S404 the method further includes S405.
S405: The network card sends notification information to the processor, where the notification information is used to indicate that the data operation is complete.
Specifically, after the network card stores the first operation result in the memory, the network card may send notification information to the processor, where the notification information is used to indicate that the data operation is complete. When the processor receives the notification information, the processor can determine that the data operation is complete and synchronize the relevant state information of the MPI operation, so that the recorded state of the MPI operation matches its actual state. Optionally, the processor may also send the next task to the network card so that the network card continues with the corresponding task.
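Steps S401 to S405 can be traced end to end with a toy model. Everything in this sketch is hypothetical: the class name `NicSketch`, the dictionary that stands in for DRAM, the packet represented as a dictionary, and the queue used for notifications are illustrative stand-ins and do not describe the actual hardware interface.

```python
from collections import deque

# Hypothetical operation names mapped to Python callables.
OPS = {"sum": lambda a, b: a + b, "max": max, "min": min}


class NicSketch:
    """Toy model of the network card's handling of one first packet."""

    def __init__(self, memory):
        self.memory = memory        # dict: address -> value, stands in for DRAM
        self.completions = deque()  # notifications waiting for the processor

    def on_packet(self, packet):
        # S401: parse the operation indication information and the first data.
        op = OPS[packet["op"]]
        first = packet["data"]
        address = packet["address"]
        # S402: read the local (second) data from memory.
        second = self.memory[address]
        # S403: compute immediately; the first data is never stored in memory.
        result = op(first, second)
        # S404: overwrite the second data with the first operation result.
        self.memory[address] = result
        # S405: notify the processor that the data operation is complete.
        self.completions.append(address)
        return result


nic = NicSketch({0x100: 5})
outcome = nic.on_packet({"op": "sum", "data": 7, "address": 0x100})
```

After the call, the result replaces the local value at the same address and one notification is queued, mirroring the optional overwrite in S404 and the notification in S405.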
Further, the processor may divide the data operation in the MPI operation into multiple data operation tasks and send them to the network card one after another in their order; that is, the next data operation task is sent to the network card only after the previous one has been completed, until all of the data operation tasks have been completed. The network card may perform each of these data operation tasks according to the method provided above.
For example, the data operation in the MPI operation shown in Figure 2 may include three data operation tasks. Taking the server P0 as an example, the network card can complete the MPI operation by performing three data operations in succession. Specifically, the processor first sends the task of computing A+B to the network card, and the network card performs the A+B operation according to S401 to S405 above and reports it. Next, the processor sends the task of computing A+B+C+D to the network card, and the network card performs the A+B+C+D operation according to S401 to S405 and reports it. Finally, the processor sends the task of computing A+B+C+D+E+F+G+H to the network card, and the network card performs the A+B+C+D+E+F+G+H operation according to S401 to S405 and reports it.
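The one-task-at-a-time ordering for P0 can be sketched as a simple loop. The function name `run_tasks_in_order` and the `execute` callback are hypothetical; the callback stands in for the network card performing S401 to S405 and reporting completion, and string concatenation stands in for addition.

```python
def run_tasks_in_order(local_value, incoming_values, execute):
    """Issue data operation tasks one at a time.

    Each call to execute() returns only when the task is reported complete,
    so the next task is never issued before the previous one has finished.
    """
    history = []
    for incoming in incoming_values:
        local_value = execute(incoming, local_value)
        history.append(local_value)
    return history


# P0 holds A and receives B, then C+D, then E+F+G+H.
history = run_tasks_in_order(
    "A", ["B", "CD", "EFGH"], lambda first, second: second + first
)
```

The three entries of `history` correspond to the three reported results A+B, A+B+C+D, and A+B+C+D+E+F+G+H.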
For example, as shown in Figure 6, the description below assumes that the processor, the memory, the network card, and the bus of the server are all integrated in the SoC of the server, that the memory is a DRAM, and that the processor is a CPU. In the data operation method provided in the embodiments of this application, from receiving the first packet to storing the first operation result in the memory, the network card needs to perform only one read operation and one write operation to complete the data operation, namely reading the second data from the memory and writing the first operation result into the memory. As shown in Figure 6, step 1 indicates that the network card reads the data in the DDR on the server side into the network card, where it waits to be combined with the data from the network; step 2 indicates that the network card receives the data from the network, identifies from the relevant information in the packet header that a reduce calculation is required, completes the calculation while processing the work queue element of the receive queue (receive queue_working queue element, RQ_WQE), and writes the calculation result back to the memory on the server side; step 3 indicates that software reads the completion queue element (CQE) by interrupt or by polling. Specifically, the method includes the following. The network card receives and parses the first packet from the network to obtain the operation indication information and the first data, where the operation indication information is carried in the packet header of the first packet. The network card determines, according to the operation indication information in the packet header, that the data operation of the MPI operation needs to be performed on the first data. The network card reads the second data A1 from the memory of the server (for example, main memory) into the network card, completes the calculation on the first data and the second data while processing the RQ_WQE, and writes the calculation result back to the memory of the server. The processor then reads the CQE by interrupt or by polling; that is, the processor receives the notification information sent by the network card and completes the information synchronization of the MPI operation. In Figure 6, the second data is denoted A1 and the first operation result is denoted R1. It should be understood that receiving an RDMA packet from the network consumes one local RQ_WQE, and the RQ_WQE indicates a region of local DDR space. After the first packet is received, it is first determined, based on the operation indication information carried in the extension header, that the MPI operation needs to be performed on the first data. Therefore, once the first data is obtained, the on-path calculation is performed on it first, and the calculation result is then written into the memory space indicated by the RQ_WQE, from which the CPU can subsequently fetch the result.
During the above process, the CPU is not aware of the calculation itself and only handles the reported interrupt after the calculation is complete, which greatly reduces the operating system (OS) noise on the CPU and improves the execution efficiency of the CPU. The whole process requires only one DDR write and one DDR read, and the total latency consists of the DDR read latency, the latency of the data calculation in the RDMA network card, and the latency of one DDR write operation.
In the embodiments of this application, when the network card receives the first packet and obtains the operation indication information and the first data from it, the network card can obtain the second data directly from the memory and, according to the operation indication information, complete the data operation of the first data and the second data to obtain the first operation result. Compared with the prior art, the network card does not need to write the first data into the memory; instead, it obtains the second data as soon as the first data is available, that is, it performs an on-path calculation on the first data and the second data. This reduces the number of memory reads and writes, lowers the latency of the MPI operation, and improves the execution efficiency of the MPI operation. In addition, when the network card, the processor, and the memory of the server are all integrated in the SoC of the server, the end-to-end transmission latency can also be reduced, further improving the execution efficiency of the MPI operation.
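The reduction in memory accesses can be made concrete by counting reads and writes for the two flows. This is a counting sketch only: the class `CountingMemory` and both function names are hypothetical, addition stands in for the data operation, and writing the prior-art result back to the local address is an assumption, since the background section of this document only says that the result is written into DRAM.

```python
class CountingMemory:
    """Dictionary-backed memory that counts read and write accesses."""

    def __init__(self):
        self.cells = {}
        self.reads = 0
        self.writes = 0

    def read(self, address):
        self.reads += 1
        return self.cells[address]

    def write(self, address, value):
        self.writes += 1
        self.cells[address] = value


def store_then_reduce(memory, first, local_address, staging_address):
    """Prior-art flow from the background section (steps S1 to S4)."""
    memory.write(staging_address, first)   # S1: stage the received data
    a1 = memory.read(staging_address)      # S2: read the received data back
    a2 = memory.read(local_address)        # S3: read the local data
    memory.write(local_address, a1 + a2)   # S4: write the result (address assumed)


def on_path_reduce(memory, first, local_address):
    """Flow of this embodiment: reduce the received data as it arrives."""
    second = memory.read(local_address)
    memory.write(local_address, first + second)
```

Running both functions on memories preloaded with the same local value gives the same result, with two reads and two writes for the first flow against one read and one write for the second.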
The above mainly describes, from the perspective of the server, the data operation method in an MPI operation provided in the embodiments of this application. It can be understood that, to implement the above functions, the server includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should readily appreciate that, in combination with the network elements and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the embodiments of this application, the data operation apparatus for an MPI operation may be divided into functional modules according to the above method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of this application is schematic and is merely a division by logical function; other divisions are possible in actual implementations.
In the case where each functional module corresponds to one function, Figure 7 shows a possible schematic structural diagram of the data operation apparatus involved in the above embodiments. The apparatus is a network card or a chip built into a network card, the network card is coupled to a memory through a bus, and the apparatus includes a receiving unit 501, a processing unit 502, and an obtaining unit 503. The receiving unit 501 is used to support the apparatus in receiving the first packet from the network. The processing unit 502 is used to support the apparatus in parsing the first packet to obtain the operation indication information and the first data included in the first packet, where the operation indication information is used to indicate that a data operation of a message passing interface (MPI) operation needs to be performed on the first data. The obtaining unit 503 is used to support the apparatus in obtaining the second data from the memory, where the second data is the local data for the data operation in the MPI operation. The processing unit 502 is further used to support the apparatus in completing the data operation of the first data and the second data in the MPI operation to obtain the first operation result. Further, the apparatus also includes a writing unit 504 and a sending unit 505. The writing unit 504 is used to support the apparatus in writing the first operation result into the memory; the sending unit 505 is used to support the apparatus in sending notification information to the processor, where the notification information is used to indicate that the data operation is complete.
It should be noted that all relevant content of the steps in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules and is not repeated here.
In a hardware implementation, the processing unit 502 and the writing unit 504 in this application may be part of the functions of the processor of the apparatus, and the receiving unit 501, the obtaining unit 503, and the sending unit 505 may together make up the transceiver functions of the apparatus. The transceiver may generally include a transmitter and a receiver, and a specific transceiver may also be referred to as a communication interface.
Figure 8 shows another possible schematic structural diagram of the data operation apparatus involved in the above embodiments. The apparatus is a network card or a chip built into a network card, the network card is coupled to a memory through a bus, and the apparatus includes a processor 602 and a communication interface 603. The processor 602 is used to control and manage the actions of the apparatus; for example, the processor 602 may, through the communication interface 603, support the apparatus in performing S401 to S405 in the above embodiments and/or other processes of the techniques described herein. In addition, the apparatus may further include a memory 601 and a bus 604, where the processor 602, the communication interface 603, and the memory 601 are connected to one another through the bus 604; the communication interface 603 is used to support communication by the apparatus; and the memory 601 is used to store the program code and data of the apparatus.
The processor 602 may be a central processing unit, a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The bus 604 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in Figure 8, but this does not mean that there is only one bus or only one type of bus.
Another embodiment of this application further provides a readable storage medium that stores computer-executable instructions. When a device (which may be a single-chip microcomputer, a chip, or the like) executes the computer-executable instructions, the device performs the steps of the network card in the method provided in the above method embodiments. The readable storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.
Another embodiment of this application further provides a computer program product that includes computer-executable instructions stored in a computer-readable storage medium. At least one processor of a device can read the computer-executable instructions from the computer-readable storage medium, and execution of the computer-executable instructions by the at least one processor causes the device to perform the steps of the network card in the method provided in the above method embodiments.
Finally, it should be noted that the foregoing descriptions are merely specific embodiments of the present application, and the protection scope of the present application is not limited thereto. Any variation or replacement within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (19)
- A data operation method, applied to a network card, wherein the network card is coupled to a memory through a bus, and the method comprises: receiving a first packet, wherein the first packet includes operation indication information and first data; determining, according to the operation indication information, that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; obtaining second data from the memory, wherein the second data is local data for the data operation in the MPI operation; and completing the data operation of the MPI operation on the first data and the second data to obtain a first operation result.
- The method according to claim 1, wherein the network card, the memory, and the bus are integrated in a system-on-a-chip (SoC).
- The method according to claim 1 or 2, wherein the operation indication information comprises an operation type and a data type.
- The method according to any one of claims 1-3, wherein the operation indication information is carried in a packet header of the first packet.
- The method according to any one of claims 1-4, wherein the MPI operation comprises an MPI_reduce operation or an MPI_allreduce operation.
- The method according to any one of claims 1-5, wherein the first packet further includes a storage address of the second data, and the obtaining second data from the memory comprises: obtaining the second data from the memory according to the storage address of the second data.
- The method according to claim 6, wherein the method further comprises: storing, according to the storage address of the second data, the first operation result in the storage location where the second data is located, so as to overwrite the second data.
- The method according to any one of claims 1-7, wherein the network card is further coupled to a processor through the bus, and the method further comprises: sending notification information to the processor, wherein the notification information indicates that the data operation is completed.
- A data operation apparatus, wherein the apparatus is a network card or a chip built into a network card, the network card is coupled to a memory through a bus, and the apparatus comprises: a receiving unit, configured to receive a first packet, wherein the first packet includes operation indication information and first data; a processing unit, configured to determine, according to the operation indication information, that a data operation of a message passing interface (MPI) operation needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, wherein the second data is local data for the data operation in the MPI operation; wherein the processing unit is further configured to complete the data operation of the MPI operation on the first data and the second data to obtain a first operation result.
- The apparatus according to claim 9, wherein the network card, the memory, and the bus are integrated in a system-on-a-chip (SoC).
- The apparatus according to claim 9 or 10, wherein the operation indication information comprises an operation type and a data type.
- The apparatus according to any one of claims 9-11, wherein the operation indication information is carried in a packet header of the first packet.
- The apparatus according to any one of claims 9-12, wherein the MPI operation comprises an MPI_reduce operation or an MPI_allreduce operation.
- The apparatus according to any one of claims 9-13, wherein the first packet further includes a storage address of the second data, and the obtaining unit is further configured to: obtain the second data from the memory according to the storage address of the second data.
- The apparatus according to claim 14, wherein the apparatus further comprises: a writing unit, configured to store, according to the storage address of the second data, the first operation result in the storage location where the second data is located, so as to overwrite the second data.
- The apparatus according to any one of claims 9-15, wherein the network card is further coupled to a processor through the bus, and the apparatus further comprises: a sending unit, configured to send notification information to the processor, wherein the notification information indicates that the data operation is completed.
- A data operation apparatus, wherein the apparatus is a network card or a chip built into a network card, the network card is coupled to a memory through a bus, the memory stores code and data, and the network card runs the code in the memory so that the apparatus performs the data operation method according to any one of claims 1-8.
- A computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the data operation method according to any one of claims 1-8.
- A computer program product, wherein when the computer program product runs on a device, the device is caused to perform the data operation method according to any one of claims 1-8.
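The method of claims 1-8 can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the implementation disclosed in the application: the header layout, the operation and data type codes, and the names `build_packet` and `nic_handle_packet` are all hypothetical, and a `bytearray` stands in for the memory coupled to the network card.

```python
import struct

# Hypothetical header layout (an assumption, not taken from the application):
# operation type, data type, element count, byte address of the second data.
HEADER = struct.Struct("<BBHI")

OPS = {0: lambda a, b: a + b, 1: max, 2: min}  # assumed codes: SUM, MAX, MIN
DTYPES = {0: "i", 1: "f"}                      # assumed codes: int32, float32


def build_packet(op, dtype, values, addr):
    """Pack the operation indication information and the first data into one packet."""
    payload = struct.pack(f"<{len(values)}{DTYPES[dtype]}", *values)
    return HEADER.pack(op, dtype, len(values), addr) + payload


def nic_handle_packet(packet, memory):
    """Model the claimed steps; `memory` is a bytearray standing in for the memory."""
    # Read the operation indication information from the packet header.
    op, dtype, count, addr = HEADER.unpack_from(packet, 0)
    fmt = f"<{count}{DTYPES[dtype]}"
    # The first data is carried in the packet itself.
    first = struct.unpack_from(fmt, packet, HEADER.size)
    # The second (local) data is read from memory at the carried storage address.
    second = struct.unpack_from(fmt, memory, addr)
    # Element-wise reduction of the first data and the second data.
    result = [OPS[op](a, b) for a, b in zip(first, second)]
    # The result overwrites the second data in place.
    struct.pack_into(fmt, memory, addr, *result)
    # Return value models the notification sent to the processor.
    return {"done": True, "addr": addr, "count": count}


# Usage: local data [10, 20, 30, 40] sits at byte offset 16; the packet carries [1, 2, 3, 4].
memory = bytearray(64)
struct.pack_into("<4i", memory, 16, 10, 20, 30, 40)
note = nic_handle_packet(build_packet(0, 0, [1, 2, 3, 4], 16), memory)
print(struct.unpack_from("<4i", memory, 16))  # (11, 22, 33, 44)
```

In this model the first data never passes through the memory before the reduction: it is taken directly from the received packet, and only the local operand is read from, and the result written back to, the memory.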
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/112901 WO2022047632A1 (en) | 2020-09-01 | 2020-09-01 | Data computation method and device |
CN202080103371.6A CN115989478A (en) | 2020-09-01 | 2020-09-01 | A data computing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/112901 WO2022047632A1 (en) | 2020-09-01 | 2020-09-01 | Data computation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022047632A1 (en) | 2022-03-10 |
Family
ID=80492116
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/112901 WO2022047632A1 (en) | 2020-09-01 | 2020-09-01 | Data computation method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115989478A (en) |
WO (1) | WO2022047632A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116155712A (en) * | 2022-12-26 | 2023-05-23 | 超聚变数字技术有限公司 | A network card configuration method, network card and computing device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102279728A (en) * | 2011-08-10 | 2011-12-14 | 北京百度网讯科技有限公司 | Data storage equipment and method for computing data |
US8108876B2 (en) * | 2007-08-28 | 2012-01-31 | International Business Machines Corporation | Modifying an operation of one or more processors executing message passing interface tasks |
CN105183531A (en) * | 2014-06-18 | 2015-12-23 | 华为技术有限公司 | Distributed development platform and calculation method of same |
CN107391402A (en) * | 2017-07-21 | 2017-11-24 | 郑州云海信息技术有限公司 | A kind of data operating method, device and a kind of data operation card |
US20180219797A1 (en) * | 2017-01-30 | 2018-08-02 | Intel Corporation | Technologies for pooling accelerator over fabric |
CN111078286A (en) * | 2018-10-19 | 2020-04-28 | 上海寒武纪信息科技有限公司 | Data communication method, computing system and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100552518B1 (en) * | 2004-01-16 | 2006-02-14 | 삼성전자주식회사 | ECMPU implementation in network processor |
US8695102B2 (en) * | 2006-05-01 | 2014-04-08 | International Business Machines Corporation | Controlling execution of executables between partitions in a multi-partitioned data processing system |
US8065503B2 (en) * | 2006-12-15 | 2011-11-22 | International Business Machines Corporation | Iteratively processing data segments by concurrently transmitting to, processing by, and receiving from partnered process |
CN109426574B (en) * | 2017-08-31 | 2022-04-05 | 华为技术有限公司 | Distributed computing system, data transmission method and device in distributed computing system |
CN111382390B (en) * | 2018-12-28 | 2022-08-12 | 上海寒武纪信息科技有限公司 | Computing method, device and related products |
- 2020-09-01: WO application PCT/CN2020/112901 (WO2022047632A1), active, Application Filing
- 2020-09-01: CN application CN202080103371.6A (CN115989478A), active, Pending
Also Published As
Publication number | Publication date |
---|---|
CN115989478A (en) | 2023-04-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20951880; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20951880; Country of ref document: EP; Kind code of ref document: A1 |