WO2024093112A1 - Computing engine communication method and apparatus, electronic device, and storage medium - Google Patents
- Publication number
- WO2024093112A1 (PCT/CN2023/084813)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- communication
- computing
- address
- host
- computing engine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
Definitions
- the present application relates to the field of computing engine communication technology, and in particular to computing engine communication methods and devices, electronic devices, and storage media.
- the engine chips, PCIe (Peripheral Component Interconnect Express) or local interconnect buses, and network cards of various computing nodes are independent of each other; the efficiency of mutual communication and the computing performance are low, and the degree of freedom and flexibility are limited.
- the current large-scale computing power network not only has the problem of computing power performance of computing devices, but also faces the bottleneck caused by inefficient communication between computing engines of different architectures.
- the existing communication schemes between different computing engines are usually divided into communication schemes within nodes and communication schemes between different nodes.
- communication tasks usually require the host-side computing engine to allocate them, so the host side still bears a heavy workload.
- the purpose of this application is to provide a computing engine communication method and device, electronic device, and storage medium, which can avoid the host-side computing engine from allocating communication tasks, release the computing power of the host-side computing engine, and adapt to the communication scenarios of different computing engines in the same node and computing engines between different nodes.
- the specific scheme is as follows:
- the present application discloses a computing engine communication method, which is applied to a first device end, comprising:
- obtaining device-side program data sent by the host-side computing engine; wherein the device-side program data is program data obtained by compiling program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements;
- the device-side computing engine is used to execute the device-side program data, the target data is calculated based on the computing task function to obtain the calculation result, and the calculation result is sent to the target receiving end based on the communication function to realize the communication between the device-side computing engine and the computing engine in the target receiving end.
- obtaining device-side program data sent by a host-side computing engine includes:
- the RDMA module is used to obtain the device-side program data sent by the host-side computing engine;
- the RDMA module and the device-side computing engine are integrated into the same chip;
- sending the calculation result to the target receiving end based on the communication function includes: sending the calculation result to the target receiving end based on the communication function and through the RDMA module.
- obtaining device-side program data sent by a host-side computing engine through an RDMA module includes: obtaining the target data, parameter information and device-side program data sent by the host-side computing engine through the RDMA module, and writing the target data into the memory based on the parameter information.
- the device-side computing engine is used to execute the device-side program data, and the target data is calculated based on the computing task function to obtain the calculation result, including:
- the device-side computing engine is used to execute the device-side program data, the target data is read from the memory based on the parameter information, and the target data is calculated based on the computing task function to obtain the calculation result.
- the calculation result is sent to the target receiving end through the RDMA module, including:
- the calculation results are sent to the target receiving end based on the IP address through the RDMA module.
- it also includes:
- the write address and data length are obtained from the communication table through the table parsing engine, and the write address, data length and IP address are written into the memory in the RDMA module.
- sending the calculation result to the target receiving end based on the IP address through the RDMA module includes:
- the memory is detected through the RDMA module, and when the memory is not empty, the write address, data length and IP address are read from the memory, the calculation result is read from the memory according to the write address and data length, and the calculation result is sent to the target receiving end based on the IP address.
- obtaining the IP address corresponding to the identification information includes: obtaining, from the IP identification comparison table, the IP address corresponding to the identification information.
- it also includes:
- through the RDMA module, a discovery message is sent to the host end, so that the host end allocates an IP address and identification information to each device end based on the discovery message of each device end;
- the IP address and identification information of each device end, the IP address and identification information of the host end are saved in the IP identification comparison table.
- the method further includes:
- obtain, through the RDMA module, the IP address and identification information assigned by the host end to the first device end, and reply confirmation information to the host end, so that after receiving the confirmation information replied by each device end, the host end saves the IP address and identification information of each device end, and sends the IP address and identification information of each device end and the IP address and identification information of the host end to each device end;
- the confirmation information carries the IP address and identification information allocated by the host to the corresponding device.
- the target receiving end is a host end or a second device end
- the identification information of the target receiving end is written into the communication table based on the communication function, including:
- the identification information of the plurality of second device terminals is written into the communication table.
- the second device end receives the calculation result through its own RDMA module and stores the calculation result in its own memory.
- the method further includes:
- the transmission end information is sent to the second device end, so that the RDMA module of the second device end returns the first address and length information of the calculation result in the memory to the calculation engine of the second device end after receiving the transmission end information.
- it also includes:
- the device waits to receive the data from the third device side, and after receiving the data from the third device side, continues to execute subsequent operations.
- the present application discloses a computing engine communication method, which is applied to a host side, comprising:
- the device-side program data is program data obtained by compiling program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements.
- sending device-side program data to the device-side through a host-side computing engine includes:
- the target data, parameter information and device-side program data are sent to the device-side through the host-side computing engine; wherein the parameter information includes the first address and length information of the target data written into the device-side memory, so that the device-side writes the target data into the memory based on the parameter information.
- it also includes:
- the IP address and identification information of each device end and the IP address and identification information of the host end are sent to each device end, so that each device end saves the IP address and identification information of each device end and the IP address and identification information of the host end into the IP identification comparison table.
- sending the IP address and identification information of each device end and the IP address and identification information of the host end to each device end further includes:
- the IP address and identification information of each device end are saved, and the IP address and identification information of each device end and the IP address and identification information of the host end are sent to each device end.
- the present application discloses a computing engine communication device, applied to a first device end, comprising:
- the communication engine is used to obtain the device-side program data sent by the host-side computing engine; wherein the device-side program data is program data compiled from program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements;
- the device-side computing engine is used to execute the device-side program data, calculate the target data based on the computing task function to obtain the calculation result, and send the calculation result to the target receiving end based on the communication function.
- the present application discloses a computing engine communication device, applied to a host side, comprising:
- the host-side computing engine is used to send the device-side program data to the device-side, so that the device-side computing engine of the device-side executes the device-side program data, calculates the target data based on the computing task function to obtain a computing result, and sends the computing result to the target receiving end based on the communication function;
- the device-side program data is program data obtained by compiling program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements.
- an electronic device including:
- the processor is used to execute a computer program to implement the computing engine communication method as described above.
- the present application discloses a non-volatile computer-readable storage medium, in which a computer program is stored.
- the computer program is loaded and executed by a processor, the computing engine communication method as described above is implemented.
- the present application first obtains the device-side program data sent by the host-side computing engine; wherein the device-side program data is the program data obtained by compiling the program code containing the communication function and the computing task function, and the communication function is a function created based on the communication demand, and then the device-side computing engine is used to execute the device-side program data, and the target data is calculated based on the computing task function to obtain the calculation result, and the calculation result is sent to the target receiving end based on the communication function, so as to realize the communication between the device-side computing engine and the computing engine in the target receiving end.
- a communication function is created based on the communication requirement, and the program code containing the communication function and the computing task function is compiled to obtain the device-side program data. After the device side obtains the device-side program data sent by the host-side computing engine, the device-side computing engine completes both the computing task and the communication task while executing the device-side program data, and sends the calculation result to the target receiving end. This realizes communication between the computing engine on the host side and the computing engine on the device side, and between the computing engine on the device side and the computing engine in the target receiving end, without requiring the host-side computing engine to allocate the communication task, thereby releasing the computing power of the host-side computing engine and adapting to communication scenarios between different computing engines in the same node and between computing engines in different nodes.
- FIG1 is a flow chart of a computing engine communication method disclosed in the present application.
- FIG2 is a schematic diagram of a specific distributed communication engine architecture disclosed in this application.
- FIG3 is a flow chart of another computing engine communication method disclosed in the present application.
- FIG4 is a schematic diagram of the structure of a computing engine communication device disclosed in the present application.
- FIG5 is a schematic diagram of the structure of another computing engine communication device disclosed in the present application.
- FIG6 is a structural block diagram of an electronic device disclosed in the present application.
- FIG7 is a structural block diagram of a non-volatile computer-readable storage medium disclosed in the present application.
- When data needs to be transferred between the GPU memory space of device A and the GPU memory space of device B, the data must first be copied from the GPU memory space of device A to the system kernel space of device A, and then sent from the system kernel space of device A to the network device space of device A. After receiving the data, device B needs to move it from the network device space of device B to the system kernel space of device B, and then copy it from the system kernel space of device B to the GPU memory space of device B. Under normal circumstances, four data transmissions and copies are required. It can be seen that the communication architecture of traditional computing networks is an important factor in the inefficiency and low speed of heterogeneous computing networks.
- NVIDIA launched GPU Direct Shared Memory (i.e., direct shared memory) technology in 2010, and gradually developed GPU Direct P2P (Peer-to-Peer), NVLink, and the latest NVSwitch technology.
- NVLink is a set of bus protocols developed by NVIDIA to solve the limitation of PCIe transmission rate when transmitting data between GPUs in a single node.
- NVSwitch is an independent communication chip designed by NVIDIA.
- NVSwitch is a dedicated data chip based on NVLink, so the limitations of NVLink still exist on NVSwitch.
- the node machines are completely customized, and the nodes are connected in close proximity, so they cannot be deployed in a distributed manner.
- the current solutions are basically implemented through RDMA network.
- through an RDMA network, the hardware network card can directly read the data to be sent in user space without going through the system kernel space. The number of data copies is reduced, thus greatly reducing network latency and the CPU overhead of data copying, and increasing transmission rates.
- RDMA: Remote Direct Memory Access.
- NVIDIA released the latest GPU Direct RDMA Async technology, which allows direct synchronization between GPU and third-party devices, while the CPU does not participate in the key communication path of GPU applications.
- RDMA NIC: Network Interface Controller.
- the working process is as follows: 1) The CPU dispatches computing and communication tasks to the GPU through NVIDIA's CUDA API (Application Programming Interface); 2) After the GPU completes the computing task, it automatically executes the communication task and directly triggers the communication operation on the RDMA NIC; 3) The RDMA NIC copies data directly from the GPU's memory or the host's memory into the RDMA NIC over the RDMA network; 4) The RDMA NIC sends the data. From this process it can be seen that, apart from the stage where the host distributes the computing and communication tasks, the host does not need to participate.
- the data is only copied once.
- the problem with this solution is that the GPU and RDMA NIC are two independent devices.
- the GPU needs to trigger the data movement of the RDMA NIC through PCIe.
- PCIe makes the GPU and RDMA NIC have certain physical distance restrictions and bindings.
- although this data transmission reduces the number of temporary-storage steps, the physical PCIe cross-node signal transmission path does not change. Moreover, as a commercial company's closed-source solution, this technical solution is difficult to modify according to actual conditions.
- In addition to the GPU, another common heterogeneous computing engine is the FPGA.
- Intel has developed IKL (Inter-Kernel Links) to achieve heterogeneous communication between FPGAs within a node or between nodes.
- User Kernel is a module written by the user in OpenCL language to implement specific computing tasks.
- IKL uses write_channel_intel(channel_id, data) and read_channel_intel(channel_id) to achieve communication between different FPGAs. These two functions send the data from the User Kernel through the IKL I/O Channel (i.e., the Inter-Kernel Logic RTL IP) and the Ethernet Switch (i.e., the network card).
- Although IKL realizes communication between FPGAs, it has the following main disadvantages: 1) The write_channel_intel and read_channel_intel functions used to send and receive data in OpenCL are based on channel_id, which is cumbersome to configure, and each channel only supports fixed point-to-point communication; 2) To achieve reliable communication, the Inter-Kernel Logic RTL IP needs to implement complex control functions such as timeout retransmission, fragmentation, and packet-loss retransmission similar to the TCP/IP protocol stack, which occupies a large amount of FPGA resources, and the occupied resources increase with the number of channel_ids.
- the maximum number of channel_ids supported is 48 to 256;
- This communication method only supports network communication between FPGAs, but not between FPGAs and the host. When communicating between FPGAs, it only supports exchanging data on the FPGA board, and cannot exchange data in the host memory;
- an embodiment of the present application discloses a computing engine communication method, which is applied to a first device end and includes:
- Step S11: Obtain device-side program data sent by the host-side computing engine; wherein the device-side program data is program data obtained by compiling program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements.
- the host-side computing engine is the CPU. In the embodiment of the present application, a specific computing task is described in the OpenCL language to obtain a computing task function.
- the difference between the device-side program data in the embodiment of the present application and the ordinary device-side OpenCL program is that a custom communication function library is integrated.
- the functions in the custom communication function library add corresponding communication functions to the OpenCL code of the cl file according to the communication requirements of the task, and are compiled together with the cl file to form the final device-side binary code, that is, the device-side program data.
- the embodiment of the present application can obtain the device-side program data sent by the host-side computing engine through the RDMA module; wherein the RDMA module and the device-side computing engine are integrated in the same chip.
- the device-side computing engine is an xPU (i.e., a general computing engine), a general term for various computing engines, including GPU, NPU (i.e., Neural-network Processing Unit), DPU (i.e., Data Processing Unit), and IPU (i.e., Infrastructure Processing Unit).
- the embodiment of the present application can integrate the RDMA module and the computing engine in the same FPGA chip.
- the target data, parameter information, and device-side program data sent by the host-side computing engine can be obtained through the RDMA module; wherein the parameter information includes the first address and length information of the target data written into the memory of the first device; and the target data is written into the memory based on the parameter information.
- the memory can be DDR (i.e., DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory).
- Step S12: Utilize the device-side computing engine to execute the device-side program data, calculate the target data based on the computing task function to obtain the calculation result, and send the calculation result to the target receiving end based on the communication function to realize the communication between the device-side computing engine and the computing engine in the target receiving end.
- the device-side computing engine can be used to execute the device-side program data, read the target data from the memory based on the parameter information, and calculate the target data based on the computing task function to obtain the calculation result. Based on the communication function, the calculation result is sent to the target receiving end through the RDMA module.
- the calculation result is sent to the target receiving end through the RDMA module, which specifically includes the following steps:
- Step 00: Write the identification information of the target receiving end into the communication table based on the communication function, and notify the table parsing engine.
- the table parsing engine and the RDMA module are both underlying hardware IP (i.e., intellectual property).
- a custom IP can be written in HDL (i.e., Hardware Description Language) to obtain the table parsing engine and the RDMA module.
- HDL Hardware Description Language
- the custom communication library of the upper-layer software can be implemented based on the common C/C++ language, while the underlying hardware IP is implemented through HDL code.
- the solution provided in the embodiment of this application has good versatility and portability.
- when creating a communication function, a corresponding target receiving end can be specified; specifically, the target receiving end can be specified based on the identification information of the device.
- Step 01: Obtain the identification information of the target receiving end from the communication table through the table parsing engine, and obtain the IP address corresponding to the identification information.
- the embodiment of the present application also writes the calculation result into the data area in the memory, and writes the write address and data length into the communication table; obtains the write address and data length from the communication table through the table parsing engine, and writes the write address, data length and IP address into the memory in the RDMA module.
- the memory can be a FIFO (First In, First Out) memory.
- the embodiment of the present application can obtain the IP address corresponding to the identification information from the IP identification comparison table.
- a discovery message can be sent to the host side through the RDMA module, so that the host side can assign IP addresses and identification information to each device side based on the discovery message of each device side; obtain the IP address and identification information of each device side sent by the host side, and the IP address and identification information of the host side; save the IP address and identification information of each device side, and the IP address and identification information of the host side to the IP identification comparison table.
- the IP address and identification information assigned by the host side to the first device side can be obtained through the RDMA module, and confirmation information can be replied to the host side, so that after receiving the confirmation information replied by each device side, the host side saves the IP address and identification information of each device side, and sends the IP address and identification information of each device side, and the IP address and identification information of the host side to each device side.
- the confirmation information carries the IP address and identification information assigned by the host side to the corresponding device side.
- when powered on, each device end is actually a DHCP (Dynamic Host Configuration Protocol) client; that is, the RDMA module will send DHCP discovery information to find a DHCP server, i.e., send specific broadcast information to the broadcast IP address 255.255.255.255.
- the host side, as a DHCP server, will receive the DHCP discovery messages from each device side and assign an IP address, a device-side identification and other information to each device side.
- after receiving the information assigned to it by the host side, the RDMA module of each device side will reply with a confirmation message, which also contains all the information assigned to it by the host side, in order to declare to the host side that the device side will use this information for communication.
- after receiving the confirmation information from each device side, the host side will send the information assigned to all device sides, together with the IP and identification information of the host side, in a custom format to each device side, and will also keep a copy on the host side.
- after receiving the communication information of all devices sent by the host side, the RDMA module of each device side saves it in the IP identification comparison table.
- a node may include a host end and at least one device end. If the same network includes multiple nodes, the host end in a node can be determined to allocate IP addresses and identification information, that is, allocate IP addresses and identification information of other host ends and all device ends in the network. Ultimately, each host end and each device end has an IP identification comparison table that includes the IP addresses and identification information of all host ends and all device ends in the entire network.
- any device in the same network can automatically discover and obtain the communication information of all devices in the network without human intervention. It is scalable and flexible, and provides a physical basis for realizing the expansion of arbitrary distributed computing power networks with high degrees of freedom.
- the addresses of the IP identification comparison table and the information table are determined, and the device-side program data and the host-side program data are written, wherein the host-side program data is the program data executed on the host side.
- the host-side program is mainly written in C/C++ and uses the C/C++ standard library and the host-side API functions of OpenCL to obtain the final host-side program data.
- Step 02: Send the calculation results to the target receiving end based on the IP address through the RDMA module.
- the memory can be detected through the RDMA module, and when the memory is not empty, the write address, data length and IP address are read from the memory, the calculation result is read from the memory according to the write address and data length, and the calculation result is sent to the target receiving end based on the IP address.
- the target receiving end is the host end or the second device end; if the target receiving end is multiple second device ends, the identification information of the multiple second device ends is written into the communication table based on the communication function, and the calculation results are sent to the multiple second device ends through the table parsing engine.
- the second device end receives the calculation result through its own RDMA module and saves the calculation result in its own memory.
- a transmission end message can also be sent to the second device end, so that after receiving the transmission end message, the RDMA module of the second device end returns the first address and length information of the calculation result in the memory to the calculation engine of the second device end, so that the calculation engine of the second device end can perform subsequent operations according to the first address and length information of the calculation result in the memory.
- when the first device end executes the device-side program data to the target program instruction, the first device end waits to receive the data from the third device end, and after receiving the data from the third device end, continues to execute subsequent operations.
- the target program instruction is a program instruction compiled based on the communication function. It is understandable that the multiple second device also executes the corresponding program instruction and waits to receive the calculation result of the first device.
- the embodiments of the present application can provide a communication solution for active interconnection between computing engines of any different architectures, eliminate the distinction between inter-node and intra-node communications, and do not require the CPU to allocate communication tasks, and have one-to-many communication capabilities. Whether it is between XPUs or between XPUs and the host, data only needs to be moved once during the data transmission process, which has higher efficiency and effectiveness.
- FIG. 2 is a schematic diagram of a specific distributed communication engine architecture provided by an embodiment of the present application.
- the acceleration card is an FPGA board that includes an FPGA chip; the RDMA module and the GPU are integrated into the chip. In addition to the FPGA chip, there are a series of peripherals, such as network ports and DDR, and there is no master-slave relationship between the host side and the device side.
- the host-side program is a program mainly written in C/C++, in which the C/C++ standard library and the API functions of OpenCL on the host side are used; the device-side program is mainly a specific computing task described in the OpenCL language, and a custom communication function library is integrated.
- when writing code, developers use the functions in the custom communication function library: the corresponding communication function is added to the OpenCL code in the .cl file according to the communication requirements of the task and compiled together with the .cl file to form the final device-side binary code.
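The pattern described above, a computing task function paired with a call into the communication function library and compiled into one device program, can be sketched as follows. This is an illustrative Python model, not the patent's actual library: the function name `comm_write` and the table layout are hypothetical, and the real device code would be OpenCL C.

```python
# Hypothetical sketch of the device-side pattern: a computing task function
# plus a communication function from a custom library, compiled together
# into one device program. All names here are illustrative.

communication_table = []  # each entry: {"addr", "len", "peer"}

def comm_write(peer_id, address, length):
    """Hypothetical communication function: it does not move data itself,
    it only records (address, length, peer ID) in the communication table;
    the table parsing engine and RDMA module perform the actual transfer."""
    communication_table.append({"addr": address, "len": length, "peer": peer_id})

def device_program(target_data, result_addr, peer_id):
    """Computing task function followed by the communication function."""
    result = [x * x for x in target_data]          # the computing task
    comm_write(peer_id, result_addr, len(result))  # hand off to the comm engine
    return result
```

The point this illustrates is that the computing engine never blocks on the network: the communication function only enqueues a descriptor for the hardware to act on.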
- the upper-layer OpenCL compiler and the xPU chip jointly determine the addresses of the communication table and the IP identification comparison table for both parties, and the above binary code is written accordingly.
- the addresses of the two-way communication table and IP identification comparison table required for communication between the host and the device are determined in advance, and the host program data and the device program data are written.
- custom IP blocks, including the RDMA module and the table parsing engine, are written in an HDL; together with the custom communication function library implemented in upper-layer C/C++ and some standard protocol libraries, they realize arbitrary interconnection between any computing engines over the public Ethernet protocol, without requiring the host to control the communication process.
- the main working process of the above distributed communication engine architecture includes:
- the RDMA module implements the functions of an RDMA NIC and automatically completes the following tasks: a) at power-on, each accelerator card acts as a DHCP client; that is, its RDMA module sends a DHCP discovery message to find a DHCP server by broadcasting a specific message to the broadcast IP address 255.255.255.255; b) the host side acts as the DHCP server, receives the DHCP discovery message from each accelerator card, and assigns each accelerator card its IP address, device ID, and other information; c) after receiving the information assigned to it by the host, the RDMA module of each accelerator card replies with a confirmation message, which also contains all the information assigned in b), to declare to the host that the accelerator card will use this information for communication; d) after receiving the confirmation from each accelerator card, the host sends the information assigned to all accelerator cards, together with the host's own IP and ID information, in a custom format to each accelerator card, and also keeps a copy on the host side.
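The DHCP-like exchange in steps a) through d) can be modeled as below. This is a sketch only: the address pool, the host's IP/ID, and the roster format are invented for illustration, and the patent's custom message format is not specified here.

```python
# Illustrative simulation of steps a)-d): cards broadcast discovery, the
# host assigns an IP and device ID to each, and after confirmations the
# host distributes the full roster (including its own IP/ID) to every card.

HOST_INFO = {"id": 0, "ip": "192.168.1.1"}  # assumed host address and ID

def assign(discovery_msgs):
    """Host side (DHCP-server role): one IP/ID per discovery message."""
    roster = {HOST_INFO["id"]: HOST_INFO["ip"]}
    for n, _msg in enumerate(discovery_msgs, start=1):
        roster[n] = f"192.168.1.{n + 1}"    # assumed address pool
    return roster

def distribute(roster, num_cards):
    """After all confirmations, every card (and the host) ends up holding
    the same ID -> IP identification comparison table."""
    return {card: dict(roster) for card in range(1, num_cards + 1)}
```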
- the host downloads relevant data to the accelerator card:
- the compiled host binary program is executed on the host.
- a series of OpenCL API functions in the program will perform a series of tasks such as searching for devices and initializing device parameters.
- the program will also send the following three types of data to the accelerator card: the initial data required for calculation; some parameters (such as the first address and length of the initial data in the accelerator card's DDR; the addresses at which these parameters themselves reside in the accelerator card's DDR are fixed); and the compiled device binary program data on disk (the first address of this data in the accelerator card's DDR is also fixed and independent of the specific task).
- the host sends this data to the accelerator card through RDMA Ethernet.
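The three kinds of downloaded data and the fixed parameter addresses can be sketched as a simple memory map. The concrete offsets below are assumptions for illustration; the patent only states that the parameter area and the device-binary base address are fixed.

```python
# Illustrative DDR layout on the accelerator card, modeled as a dict keyed
# by address. Offsets are invented; only their fixedness matters.

PARAM_BASE = 0x0000_1000   # fixed: where (first address, length) pairs live
BINARY_BASE = 0x0010_0000  # fixed: where the device binary is written
ddr = {}

def host_download(initial_data, data_base):
    """Host sends initial data, the parameters describing it, and the
    device binary, all over RDMA Ethernet (modeled here as dict writes)."""
    ddr[data_base] = list(initial_data)
    ddr[PARAM_BASE] = (data_base, len(initial_data))  # first address + length
    ddr[BINARY_BASE] = b"\x00device-binary"           # placeholder binary

def device_read_params():
    """The device computing engine reads the fixed parameter area to
    locate its input data, with no further help from the host."""
    addr, length = ddr[PARAM_BASE]
    return ddr[addr][:length]
```

Because the parameter and binary addresses are fixed, the device can bootstrap itself from DDR alone, which is why the accelerator card normally never needs to read back from the host.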
- the accelerator card reads data from the host: Usually, in the above (2), the host has downloaded the relevant data to the accelerator card. The data required for the GPU calculation in the accelerator card has been downloaded to the DDR of the accelerator card, and the GPU does not need to read data from the host.
- the accelerator card writes data to the host: after the GPU in the accelerator card completes the calculation, it writes the calculation result into the data area in the accelerator card's DDR, and saves the written address, the length, and the ID of the other end (i.e., the target receiving end) in the communication table; the GPU then notifies the table parsing engine, which takes the information out of the communication table and marks the retrieved entries to prevent repeated retrieval; the table parsing engine queries the IP identification comparison table for the IP address corresponding to the other end's ID in the retrieved information; once the table parsing engine has obtained that IP address, it writes this information into a FIFO in the RDMA module; when the RDMA module detects that the FIFO is not empty, it continuously takes out the information in it, reads the corresponding data from the accelerator card's DDR according to the DDR address and length in the information, and then sends the data to the other end according to the IP address; after the data is sent, the accelerator card also sends a transmission-end message to the other end.
- the reading and writing between accelerator cards is different from the reading and writing between the accelerator card and the host.
- the former occurs in pairs, that is, when one end performs a write operation, the other end or multiple ends must perform a corresponding read operation:
- accelerator card 2 and accelerator card 3 need to read the data of accelerator card 1, that is, after the GPUs in accelerator card 2 and accelerator card 3 perform tasks to a certain stage, they may need the data of accelerator card 1.
- the fundamental reason for this situation is that the GPUs in these two accelerator cards have executed the program instructions compiled by the read function in the custom communication function library.
- after their respective GPUs execute the instructions generated by the compiled read function, accelerator card 2 and accelerator card 3 wait for the device whose ID is given in those instructions to send data.
- when the RDMA modules of accelerator card 2 and accelerator card 3 receive the data sent by accelerator card 1, they save it in their respective DDRs.
- the GPUs can perform subsequent operations based on the first address and length of the data.
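The paired read side can be sketched the same way: a GPU that has executed a compiled read instruction blocks until its RDMA module reports that data from the expected device ID has landed in DDR, then resumes using the delivered first address and length. The mailbox mechanism below is an illustrative assumption, not the patent's exact signaling.

```python
# Illustrative model of the paired read operation on a receiving card.

ddr = {}
mailbox = {}   # device ID -> (first address, length), filled by RDMA module

def rdma_receive(src_id, addr, data):
    """RDMA module stores incoming data in DDR; on transmission end it
    hands the first address and length back for the waiting GPU."""
    ddr[addr] = data
    mailbox[src_id] = (addr, len(data))

def gpu_read(expected_src):
    """The compiled read instruction: wait for the given device ID, then
    continue subsequent operations using the delivered address/length."""
    if expected_src not in mailbox:
        return None                      # still waiting, in real hardware
    addr, length = mailbox.pop(expected_src)
    return ddr[addr][:length]
```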
- in a conventional architecture where accelerator card 1 and accelerator card 2 cannot communicate directly, accelerator card 1 sends the intermediate result it has calculated back to a certain address on the host side, and the host side then notifies accelerator card 2 to read the data at that address on the host side; since accelerator cards can communicate with each other directly in this communication architecture, this situation does not arise.
- taking a physical FPGA chip as an example, a custom communication function library in the upper-layer software, written in a high-level language, is combined with the underlying xPU hardware IP written in an HDL to realize an efficient and universal communication engine between computing engines of different architectures.
- a custom communication function library is written in a standard high-level language, combined with the RDMA module, table parsing engine, etc.
- the communication engine integrates the module responsible for RDMA network communication and the computing engine into one chip, further accelerating the connection between the computing engine and the communication engine, and realizes active interconnection between computing engines of any different architectures within the same RDMA network, completely eliminating CPU participation in communication control while providing one-to-many communication capability; whether between accelerator cards or between an accelerator card and the host, data only needs to be moved once during transmission, effectively solving the problem of inefficient, low-speed communication between computing engines.
- the solution provided in this application is applicable to various xPU computing engines, such as GPU, NPU, IPU, DPU, VPU, etc.
- an embodiment of the present application discloses a computing engine communication method, which is applied to a host side and includes:
- Step S31 sending device-side program data to the device side through the host-side computing engine, so that the device-side computing engine of the device side executes the device-side program data, calculates the target data based on the computing task function to obtain a calculation result, and sends the calculation result to the target receiving end based on the communication function;
- the device-side program data is program data obtained by compiling program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements.
- sending device-side program data to the device-side through a host-side computing engine includes:
- the target data, parameter information and device-side program data are sent to the device-side through the host-side computing engine; wherein the parameter information includes the first address and length information of the target data written into the device-side memory, so that the device-side writes the target data into the memory based on the parameter information.
- it also includes:
- the IP address and identification information of each device end and the IP address and identification information of the host end are sent to each device end, so that each device end saves the IP address and identification information of each device end and the IP address and identification information of the host end into the IP identification comparison table.
- sending the IP address and identification information of each device end and the IP address and identification information of the host end to each device end further includes:
- a communication function is created based on the communication needs, and the program code containing the communication function and the computing task function is compiled to obtain the device-side program data.
- the device-side computing engine completes the computing task and the communication task in the process of executing the device-side program code, and sends the computing result to the target receiving end.
- the communication between the computing engine in the host side and the computing engine in the device side, and between the computing engine in the device side and the computing engine in the target receiving end is realized.
- there is no need for the host-side computing engine to allocate communication tasks, which releases the computing power of the host-side computing engine, and the solution is suitable for communication scenarios between different computing engines in the same node and between computing engines in different nodes.
- an embodiment of the present application discloses a computing engine communication device, which is applied to a first device end and includes:
- the communication engine 11 is used to obtain the device-side program data sent by the host-side computing engine; wherein the device-side program data is program data compiled from program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements;
- the device-side computing engine 12 is used to execute device-side program data, calculate the target data based on the computing task function to obtain the calculation result, and send the calculation result to the target receiving end based on the communication function.
- the present application first obtains the device-side program data sent by the host-side computing engine, where the device-side program data is obtained by compiling program code that contains the communication function and the computing task function, and the communication function is a function created based on the communication demand. The device-side computing engine then executes the device-side program data, calculates the target data based on the computing task function to obtain the calculation result, and sends the calculation result to the target receiving end based on the communication function, thereby realizing communication between the device-side computing engine and the computing engine in the target receiving end.
- a communication function is created based on the communication demand, and the program code containing the communication function and the computing task function is compiled to obtain the device-side program data. After the device side obtains the device-side program data sent by the host-side computing engine, the device-side computing engine completes the computing task and the communication task while executing the device-side program code and sends the calculation result to the target receiving end. This realizes communication between the computing engine on the host side and the computing engine on the device side, and between the computing engine on the device side and the computing engine in the target receiving end, without requiring the host-side computing engine to allocate communication tasks, thereby releasing the computing power of the host-side computing engine, and it is suitable for communication scenarios between different computing engines in the same node and between computing engines in different nodes.
- the communication engine 11 is an RDMA module, and the RDMA module and the device-side computing engine are integrated in the same chip;
- the device-side computing engine 12 is specifically used to send the computing result to the target receiving end based on the communication function and through the RDMA module.
- the device-side computing engine 12 is specifically used to execute device-side program data using the device-side computing engine, read target data from the memory based on parameter information, and calculate the target data based on the computing task function to obtain a calculation result.
- the RDMA module is also used to write the calculation result into the data area in the memory, and write the write address and data length into the communication table; correspondingly, the device-side calculation engine 12 is used to obtain the write address and data length from the communication table through the table parsing engine, and write the write address, data length and IP address into the memory in the RDMA module.
- the RDMA module is used to detect its internal memory; when that memory is not empty, the RDMA module reads the write address, data length, and IP address from it, reads the calculation result from the device-side memory according to the write address and data length, and sends the calculation result to the target receiving end based on the IP address.
- obtaining the IP address corresponding to the identification information includes: obtaining the IP address corresponding to the identification information from an IP identification comparison table.
- the RDMA module is also used to send a discovery message to the host side, so that the host side can assign an IP address and identification information to each device side based on the discovery message of each device side; obtain the IP address and identification information of each device side and the IP address and identification information of the host side sent by the host side; save the IP address and identification information of each device side and the IP address and identification information of the host side to the IP identification comparison table.
- the RDMA module is also used to obtain the IP address and identification information assigned by the host side to the first device side after sending a discovery message to the host side, and reply confirmation information to the host side, so that after the host side receives the confirmation information replied by each device side, it saves the IP address and identification information of each device side, and sends the IP address and identification information of each device side and the IP address and identification information of the host side to each device side.
- the target receiving end is a host end or a second device end
- the device-end computing engine writes the identification information of the multiple second device ends into the communication table based on the communication function.
- the second device end receives the calculation result through its own RDMA module and stores the calculation result in its own memory.
- the RDMA module is also used to send transmission end information to the second device end after sending the calculation result to the second device end based on the communication function, so that after receiving the transmission end information, the RDMA module on the second device end returns the starting address and length information of the calculation result in the memory to the computing engine on the second device end.
- when executing the device-side program data, if the device-side computing engine executes the target program instruction, it waits to receive data from the third device end and continues subsequent operations after receiving that data.
- an embodiment of the present application discloses a computing engine communication device, which is applied to a host side and includes:
- the host-side computing engine 51 is used to send the device-side program data to the device-side, so that the device-side computing engine of the device-side executes the device-side program data, calculates the target data based on the computing task function to obtain the calculation result, and sends the calculation result to the target receiving end based on the communication function;
- the device-side program data is program data obtained by compiling program codes including communication functions and computing task functions, and the communication function is a function created based on communication requirements.
- the host-side computing engine is specifically used to send target data, parameter information and device-side program data to the device-side; wherein the parameter information includes the starting address and length information of the target data written into the device-side memory, so that the device-side writes the target data into the memory based on the parameter information.
- the apparatus further comprises:
- a discovery message acquisition module is used to acquire discovery messages sent by each device end;
- a configuration module used to configure an IP address and identification information for each device end based on the discovery message sent by each device end;
- the configuration information sending module is used to send the IP address and identification information of each device end and the IP address and identification information of the host end to each device end, so that each device end saves the IP address and identification information of each device end and the IP address and identification information of the host end to the IP identification comparison table.
- the configuration information sending module further includes:
- the first sending module is used to send the IP address and identification information assigned by the host end to each device end;
- the second sending module is used to save the IP address and identification information of each device end after receiving the confirmation information replied by each device end, and send the IP address and identification information of each device end and the IP address and identification information of the host end to each device end.
- an embodiment of the present application further provides an electronic device, including: a memory 601 for storing a computer program; and a processor 602 for executing the computer program to implement the steps of the computing engine communication method of any of the above embodiments.
- an embodiment of the present application further provides a non-volatile computer-readable storage medium 7 on which a computer program 701 is stored.
- a computer program 701 is executed by a processor, the steps of the computing engine communication method of any of the above embodiments are implemented.
- each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments.
- the same or similar parts between the embodiments can be referred to each other.
- the description is relatively simple, and the relevant parts can be referred to the method part.
- the steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly using hardware, a software module executed by a processor, or a combination of the two.
- the software module may be placed in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to a Chinese patent application filed with the China Patent Office on October 31, 2022, with application number 202211347106.X and the application name "Computing Engine Communication Method and Device", the entire contents of which are incorporated into this application by reference.
The present application relates to the field of computing engine communication technology, and in particular to computing engine communication methods and devices, electronic devices, and storage media.
With the development of traditional Moore's Law, computing power has advanced from increasing the clock frequency of a single processor, to increasing the number of processors in a single chip, to increasing the number of chips in a single computing network (such as a large-scale cloud computing network). At present, the dividends brought by chip process scaling have almost disappeared, and simply stacking more chips has run into a series of problems such as mismatched computing power and high energy consumption. For today's diversified, very large computing tasks, it has become an industry consensus and development trend to deploy computing engines of different architectures in the same system, such as the CPU (Central Processing Unit), GPU (Graphics Processing Unit), and FPGA (Field Programmable Gate Array); according to the characteristics of each architecture, these engines either serve different stages of the same task or serve different tasks, so as to maximize the utilization of computing power.
Due to the historical legacy of network software protocol stacks and network hardware devices in traditional computer architectures, in a system with multiple heterogeneous computing engines, the engine chips, PCIe (Peripheral Component Interconnect Express) or local interconnect buses, and network cards of the various computing nodes are independent of one another; mutual communication and computing become very inefficient, and the degree of freedom and flexibility is low. As a result, today's large-scale computing power networks not only face the computing performance limits of the devices themselves, but also the bottleneck indirectly caused by inefficient communication between computing engines of different architectures. Existing communication schemes between different computing engines are usually divided into intra-node schemes and inter-node schemes; moreover, communication tasks usually require the host-side computing engine, so the host side still bears a considerable workload.
Summary of the Invention
In view of this, the purpose of this application is to provide a computing engine communication method and apparatus, an electronic device, and a storage medium, which avoid having the host-side computing engine allocate communication tasks, release the computing power of the host-side computing engine, and are suitable for communication scenarios between different computing engines in the same node and between computing engines in different nodes. The specific scheme is as follows:
In a first aspect, the present application discloses a computing engine communication method, applied to a first device end, comprising:
obtaining device-side program data sent by the host-side computing engine; wherein the device-side program data is program data obtained by compiling program code including a communication function and a computing task function, and the communication function is a function created based on communication requirements;
using the device-side computing engine to execute the device-side program data, calculating the target data based on the computing task function to obtain a calculation result, and sending the calculation result to the target receiving end based on the communication function, so as to realize communication between the device-side computing engine and the computing engine in the target receiving end.
In some embodiments, obtaining the device-side program data sent by the host-side computing engine includes:
obtaining the device-side program data sent by the host-side computing engine through an RDMA module; wherein the RDMA module and the device-side computing engine are integrated in the same chip;
相应的,基于通信函数将计算结果发送至目标接收端,包括:基于通信函数,并通过RDMA模块将计算结果发送至目标接收端。Correspondingly, sending the calculation result to the target receiving end based on the communication function includes: sending the calculation result to the target receiving end based on the communication function and through the RDMA module.
在一些实施例中,通过RDMA模块获取主机端计算引擎发送的设备端程序数据,包括:In some embodiments, obtaining device-side program data sent by a host-side computing engine through an RDMA module includes:
通过RDMA模块获取主机端计算引擎发送的目标数据、参数信息以及设备端程序数据;其中,参数信息包括目标数据写入第一设备端的内存的首地址以及长度信息;Obtaining target data, parameter information, and device-side program data sent by the host-side computing engine through the RDMA module; wherein the parameter information includes the first address and length information of the target data written into the memory of the first device;
基于参数信息将目标数据写入内存;Write target data into memory based on parameter information;
相应的,利用设备端计算引擎执行设备端程序数据,基于计算任务函数对目标数据进行计算以得到计算结果,包括:Accordingly, the device-side computing engine is used to execute the device-side program data, and the target data is calculated based on the computing task function to obtain the calculation result, including:
利用设备端计算引擎执行设备端程序数据,基于参数信息从内存中读取目标数据,并基于计算任务函数对目标数据进行计算以得到计算结果。The device-side computing engine is used to execute the device-side program data, the target data is read from the memory based on the parameter information, and the target data is calculated based on the computing task function to obtain the calculation result.
在一些实施例中,基于通信函数,并通过RDMA模块将计算结果发送至目标接收端,包括:In some embodiments, based on the communication function, the calculation result is sent to the target receiving end through the RDMA module, including:
基于通信函数将目标接收端的标识信息写入通信表,并通知表解析引擎;Write the identification information of the target receiving end into the communication table based on the communication function, and notify the table parsing engine;
通过表解析引擎从通信表中获取目标接收端的标识信息,并获取标识信息对应的IP地址;Obtain identification information of the target receiving end from the communication table through the table parsing engine, and obtain the IP address corresponding to the identification information;
通过RDMA模块基于IP地址将计算结果发送至目标接收端。The calculation results are sent to the target receiving end based on the IP address through the RDMA module.
在一些实施例中,还包括:In some embodiments, it also includes:
将计算结果写入内存中的数据区域,并将写入地址、数据长度写入通信表;Write the calculation result into the data area in the memory, and write the write address and data length into the communication table;
通过表解析引擎从通信表中获取写入地址、数据长度,并将写入地址、数据长度以及IP地址写入RDMA模块中的存储器。The write address and data length are obtained from the communication table through the table parsing engine, and the write address, data length and IP address are written into the memory in the RDMA module.
在一些实施例中,通过RDMA模块基于IP地址将计算结果发送至目标接收端,包括:In some embodiments, sending the calculation result to the target receiving end based on the IP address through the RDMA module includes:
通过RDMA模块检测存储器,并在存储器不为空时,从存储器中读取出写入地址、数据长度以及IP地址,根据写入地址和数据长度从内存中读取计算结果,基于IP地址将计算结果发送至目标接收端。The memory is detected through the RDMA module, and when the memory is not empty, the write address, data length and IP address are read from the memory, the calculation result is read from the memory according to the write address and data length, and the calculation result is sent to the target receiving end based on the IP address.
在一些实施例中,获取标识信息对应的IP地址,包括:In some embodiments, obtaining the IP address corresponding to the identification information includes:
从IP标识对照表中获取标识信息对应的IP地址。Obtain the IP address corresponding to the identification information from the IP identification comparison table.
在一些实施例中,还包括:In some embodiments, it also includes:
通过RDMA模块,向主机端发送发现消息,以便主机端基于各设备端的发现消息为各设备端分别分配IP地址以及标识信息;Through the RDMA module, a discovery message is sent to the host end, so that the host end allocates an IP address and identification information to each device end based on the discovery message of each device end;
获取主机端发送的各设备端的IP地址以及标识信息、主机端的IP地址以及标识信息;Obtain the IP address and identification information of each device end and the IP address and identification information of the host end sent by the host end;
将各设备端的IP地址以及标识信息、主机端的IP地址以及标识信息保存至IP标识对照表。The IP address and identification information of each device end, the IP address and identification information of the host end are saved in the IP identification comparison table.
In some embodiments, after sending the discovery message to the host end through the RDMA module, the method further includes:
obtaining, through the RDMA module, the IP address and identification information allocated by the host end to the first device end, and replying with confirmation information to the host end, so that the host end, after receiving the confirmation information replied by each device end, saves the IP address and identification information of each device end and sends the IP address and identification information of each device end and of the host end to each device end;
wherein the confirmation information carries the IP address and identification information allocated by the host end to the corresponding device end.
In some embodiments, the target receiving end is the host end or a second device end;
if the target receiving end is a plurality of second device ends, writing the identification information of the target receiving end into the communication table based on the communication function includes:
writing the identification information of each of the plurality of second device ends into the communication table based on the communication function.
In some embodiments, if the target receiving end is a second device end, the second device end receives the calculation result through its own RDMA module and stores the calculation result in its own memory.
In some embodiments, after sending the calculation result to the second device end based on the communication function, the method further includes:
sending end-of-transmission information to the second device end, so that the RDMA module of the second device end, after receiving the end-of-transmission information, returns the first address and length of the calculation result in memory to the computing engine of the second device end.
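A minimal sketch of the completion record that the receiving RDMA module could hand back to its computing engine (field names are assumptions):

```c
#include <stdint.h>

/* Hypothetical end-of-transmission completion record: carries the first
 * address and the length of the calculation result in the receiver's
 * own memory, as described above. */
typedef struct {
    uint64_t first_addr;
    uint64_t length;
} rx_completion;

/* One-past-the-end address of the received result, e.g. for bounds checks. */
uint64_t rx_end(const rx_completion *c)
{
    return c->first_addr + c->length;
}
```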
In some embodiments, the method further includes:
during execution of the device-side program data, if a target program instruction is reached, waiting to receive data from a third device end, and continuing with subsequent operations after the data from the third device end is received.
In a second aspect, the present application discloses a computing engine communication method applied to a host end, including:
sending device-side program data to a device end through a host-side computing engine, so that the device-side computing engine of the device end executes the device-side program data, calculates target data based on a computing task function to obtain a calculation result, and sends the calculation result to a target receiving end based on a communication function;
wherein the device-side program data is program data obtained by compiling program code containing the communication function and the computing task function, and the communication function is a function created based on communication requirements.
In some embodiments, sending the device-side program data to the device end through the host-side computing engine includes:
sending the target data, parameter information, and the device-side program data to the device end through the host-side computing engine; wherein the parameter information includes the first address and length of the region of the device end's memory into which the target data is written, so that the device end writes the target data into its memory based on the parameter information.
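The parameter information above can be sketched as a simple descriptor, with the device end placing the received target data into its memory accordingly; the names and the flat-memory model are assumptions made for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Assumed parameter block: first address and length of the region of
 * device memory that the target data should occupy. */
typedef struct {
    uint64_t first_addr;
    uint64_t length;
} xfer_params;

/* Simulate the device end writing the received target data into its
 * memory at the position given by the parameter information. */
void place_target_data(uint8_t *dev_mem, const xfer_params *p,
                       const uint8_t *data)
{
    memcpy(dev_mem + p->first_addr, data, (size_t)p->length);
}
```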
In some embodiments, the method further includes:
obtaining the discovery message sent by each device end;
configuring an IP address and identification information for each device end based on the discovery message sent by that device end;
sending the IP address and identification information of each device end and of the host end to each device end, so that each device end saves the IP address and identification information of each device end and of the host end into its IP-identification comparison table.
In some embodiments, sending the IP address and identification information of each device end and of the host end to each device end further includes:
sending to each device end the IP address and identification information that the host end has allocated to it;
after receiving the confirmation information replied by each device end, saving the IP address and identification information of each device end, and sending the IP address and identification information of each device end and of the host end to each device end.
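One possible allocation scheme for the host end, assumed here since the application does not fix one, is to hand out consecutive identifications and IP addresses as discovery messages arrive:

```c
#include <stdint.h>

/* Illustrative allocator state; the actual allocation policy is an
 * assumption, not specified by the application. */
typedef struct {
    uint32_t next_id;
    uint32_t next_ip;
} id_ip_allocator;

void allocator_init(id_ip_allocator *a, uint32_t first_ip)
{
    a->next_id = 1;        /* e.g. identification 0 reserved for the host */
    a->next_ip = first_ip;
}

/* Called once per received discovery message. */
void allocate(id_ip_allocator *a, uint32_t *id, uint32_t *ip)
{
    *id = a->next_id++;
    *ip = a->next_ip++;
}
```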
In a third aspect, the present application discloses a computing engine communication apparatus applied to a first device end, including:
a communication engine configured to obtain device-side program data sent by a host-side computing engine; wherein the device-side program data is program data obtained by compiling program code containing a communication function and a computing task function, and the communication function is a function created based on communication requirements; and
a device-side computing engine configured to execute the device-side program data, calculate target data based on the computing task function to obtain a calculation result, and send the calculation result to a target receiving end based on the communication function.
In a fourth aspect, the present application discloses a computing engine communication apparatus applied to a host end, including:
a host-side computing engine configured to send device-side program data to a device end, so that the device-side computing engine of the device end executes the device-side program data, calculates target data based on a computing task function to obtain a calculation result, and sends the calculation result to a target receiving end based on a communication function;
wherein the device-side program data is program data obtained by compiling program code containing the communication function and the computing task function, and the communication function is a function created based on communication requirements.
In a fifth aspect, the present application discloses an electronic device, including:
a memory configured to store a computer program; and
a processor configured to execute the computer program to implement the computing engine communication method described above.
In a sixth aspect, the present application discloses a non-volatile computer-readable storage medium storing a computer program which, when loaded and executed by a processor, implements the computing engine communication method described above.
It can be seen that the present application first obtains the device-side program data sent by the host-side computing engine, where the device-side program data is program data obtained by compiling program code containing a communication function and a computing task function, the communication function being created based on communication requirements; the device-side computing engine then executes the device-side program data, calculates the target data based on the computing task function to obtain a calculation result, and sends the calculation result to the target receiving end based on the communication function, thereby realizing communication between the device-side computing engine and the computing engine at the target receiving end.
That is, in the present application, a communication function is created based on the communication requirements, and the program code containing the communication function and the computing task function is compiled to obtain the device-side program data. After the device end obtains the device-side program data sent by the host-side computing engine, the device-side computing engine completes both the computing task and the communication task while executing that program data and sends the calculation result to the target receiving end. This realizes communication between the computing engine at the host end and the computing engine at the device end, as well as between the computing engine at the device end and the computing engine at the target receiving end, without the host-side computing engine having to dispatch communication tasks, thereby freeing the computing power of the host-side computing engine; moreover, the scheme suits communication scenarios both between different computing engines within the same node and between computing engines on different nodes.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flowchart of a computing engine communication method disclosed in the present application;
FIG. 2 is a schematic diagram of a specific distributed communication engine architecture disclosed in the present application;
FIG. 3 is a flowchart of another computing engine communication method disclosed in the present application;
FIG. 4 is a schematic structural diagram of a computing engine communication apparatus disclosed in the present application;
FIG. 5 is a schematic structural diagram of another computing engine communication apparatus disclosed in the present application;
FIG. 6 is a structural block diagram of an electronic device disclosed in the present application;
FIG. 7 is a structural block diagram of a non-volatile computer-readable storage medium disclosed in the present application.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Due to historical legacy issues of network software protocol stacks and network hardware devices in traditional computer architectures, in a system with multiple heterogeneous computing engines the engine chips, PCIe or local interconnect buses, and network cards of the various computing nodes are independent of one another. Communication between them is inefficient, computing efficiency suffers, and the degree of freedom and flexibility is low. As a result, today's large computing-power networks face not only the raw performance limits of the computing devices themselves, but also the bottleneck indirectly caused by inefficient communication between computing engines of different architectures.
Take the common, traditional communication between a CPU and a GPU as an example (other xPUs are similar). Within a single device, when data must move between the application memory space and the GPU memory space, it has to cross the system kernel space and the PCIe bus, i.e. it must be copied one extra time, which is generally done by the CPU. When data must travel between the application memory spaces of device A and device B, it is first copied from device A's application memory space to device A's system kernel space, then sent from device A's system kernel space to device A's network device space; after device B receives the data, it moves from device B's network device space to device B's system kernel space and is then copied from device B's system kernel space to device B's application memory space. When data must travel between the GPU memory space of device A and the GPU memory space of device B, it must at a minimum first be copied from device A's GPU memory space to device A's system kernel space, then sent from device A's system kernel space to device A's network device space; after device B receives the data, it moves from device B's network device space to device B's system kernel space and is then copied from device B's system kernel space to device B's GPU memory space. In the ordinary case, four data transfers and copies are generally required. The communication architecture of traditional computing networks is thus an important cause of the inefficiency and low speed of heterogeneous computing-power networks. In these traditional solutions, repeated and unnecessary data copies between computing engines of different architectures are what cause the inefficiency.
Moreover, although a single device contains multiple computing engines, data-transfer control is entirely dominated by the CPU, and the GPU merely acts as a coprocessor of the system that shares part of the computing work. There are two keys to solving these problems: first, reduce the number of unnecessary cross-chip data copies as far as possible; second, give full play to each computing engine's own initiative in transferring data, so that no computing engine remains a mere coprocessor, and each computing engine can freely and autonomously communicate with any other type of computing engine, freeing the CPU's computing power to serve the real core business.
At present, some technical solutions have emerged to address the above problems:
For example, for communication between GPU and CPU within a single node, NVIDIA introduced GPU Direct Shared Memory technology starting in 2010 and gradually developed GPU Direct P2P (Peer-to-Peer), NVLink, and most recently NVSwitch. NVLink is a bus protocol NVIDIA developed to overcome the PCIe transfer-rate limit when moving data between GPUs within a single node, together with a corresponding hardware bus implementation for the physical connection. However, because this approach mainly targets one-to-one high-speed interconnection between GPUs inside a single node and depends heavily on custom circuitry, it can do nothing for communication between GPUs across nodes; and as its implementation shows, NVLink's data transfer relies on a direct physical circuit link binding GPU to GPU and is entirely incapable of long-distance communication. NVSwitch is an independent communication chip designed by NVIDIA that aggregates multiple NVLinks to provide many-to-many GPU communication within a single node at NVLink speed, further improving interconnect performance; for example, 12 NVSwitches can connect 16 GPUs with zero-hop arbitrary interconnection among them.
Because NVSwitch is a dedicated data chip built on NVLink, NVLink's limitations carry over to it: the node machines are fully custom, the nodes must be connected at close range, and distributed deployment is impossible.
For communication between GPUs on different nodes, current solutions are basically built on RDMA networks. With an RDMA network, the hardware network card can read the data to be sent directly from user space without passing through the system kernel space, reducing the number of data copies; this greatly lowers network latency and the CPU overhead of copying and raises the transfer rate. Among RDMA-based (Remote Direct Memory Access) multi-node GPU communication solutions, NVIDIA released GPU Direct RDMA Async in 2017, which allows direct synchronization between the GPU and third-party devices: the CPU stays out of the GPU application's critical communication path, data is sent directly from GPU memory to the RDMA NIC (Network Interface Controller), and the peer RDMA NIC delivers it directly into GPU memory on receipt, reducing CPU involvement and the number of data copies.
The workflow is as follows: 1) the CPU dispatches computing and communication tasks to the GPU through NVIDIA's CUDA API (Application Programming Interface); 2) after the GPU completes the computing task, it automatically executes the communication task and triggers the communication operation directly on the RDMA NIC; 3) the RDMA NIC copies the data directly from GPU memory or host memory into the RDMA NIC; 4) the RDMA NIC sends the data. As this workflow shows, apart from the initial stage in which the host distributes the computing and communication tasks, no host participation is required, and in step 3 the data is copied only once. The problem with this scheme is that the GPU and the RDMA NIC are two independent devices: the GPU must trigger the NIC's data movement over PCIe, and PCIe, as a physical link, imposes a physical-distance constraint and binding between GPU and NIC. Although the number of staging copies is reduced, the physical cross-node signal path over PCIe is unchanged. Furthermore, as a commercial company's closed-source solution, it is difficult to adapt to particular circumstances.
Besides the GPU, another common heterogeneous computing engine is the FPGA. For FPGA-based computing engines, Intel developed IKL (Inter-Kernel Links) to enable heterogeneous communication between FPGAs within a node or across nodes. A User Kernel is a module the user writes in the OpenCL language to implement a specific computing task; when writing one, two main functions, write_channel_intel(channel_id, data) and read_channel_intel(channel_id), are used to implement communication between different FPGAs. These functions pass the data to be sent from the User Kernel through an IKL I/O Channel to the Inter-Kernel Logic RTL IP, which packs the data and finally forwards it to the Ethernet Switch, i.e. the network card.
The way IKL implements inter-FPGA communication has the following main drawbacks: 1) the write_channel_intel and read_channel_intel functions used in OpenCL to send and receive data operate on a channel_id, which is cumbersome to configure, and each channel supports only fixed point-to-point communication; 2) to achieve reliable communication, the Inter-Kernel Logic RTL IP must implement complex control functions similar to a TCP/IP protocol stack, such as timeout retransmission, fragmentation, and retransmission of lost packets, which consumes a large amount of FPGA resources, and the consumption grows with the number of channel_ids; depending on the FPGA board model, the maximum number of channel_ids currently supported is 48 to 256; 3) this communication mode supports only FPGA-to-FPGA network communication, not network communication between an FPGA and the host end, and when FPGAs communicate with each other, only data on the FPGA boards can be exchanged, not data in host memory.
From the above, the main existing technical solutions have the following principal shortcomings: 1) communication between computing engines inside a single node is realized over traditional PCIe or NVLink-style links with physical-distance limits, so large-scale distributed deployment is impossible; 2) for communication between computing engines on different nodes, even GPU Direct RDMA Async requires the host to assign the communication task in the first step; moreover, it fails to fully decouple the physical link between the GPU and the RDMA NIC and must still trigger data movement over PCIe, so deploying such GPUs demands considerable physical space and higher energy consumption; Intel's FPGA-based IKL, meanwhile, leans toward point-to-point transfers between devices of the same kind, and the channel_id limit severely restricts how many peers a device can exchange data with, making transfers constrained and very inefficient; in addition, that scheme does not allow the NIC within a node to read the memory of the node's own host end directly and send it to the peer; 3) whether based on GPU or FPGA computing engines, the great majority of these solutions are closed-source, specialized solutions that are difficult to modify and lack generality.
As shown in FIG. 1, an embodiment of the present application discloses a computing engine communication method applied to a first device end, including:
Step S11: obtain device-side program data sent by a host-side computing engine; wherein the device-side program data is program data obtained by compiling program code containing a communication function and a computing task function, and the communication function is a function created based on communication requirements.
Here the host-side computing engine is a CPU. In this embodiment, the specific computing task is described in the OpenCL language to obtain the computing task function. What distinguishes the device-side program data of this embodiment from an ordinary device-side OpenCL program is the integration of a custom communication function library: when writing the code, communication functions from this library are added, according to the task's communication requirements, to the OpenCL code of the .cl file and compiled together with it, forming the final device-side binary code, i.e. the device-side program data.
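As a sketch of what such a .cl file might look like, the kernel below mixes a computing task with a call into a hypothetical communication-library function, `send_to()`; the function name, its signature, and the stub that records its arguments are all assumptions made for illustration, and the OpenCL qualifiers are defined away so the sketch also builds as plain C:

```c
/* Stub out OpenCL address-space qualifiers for plain-C illustration. */
#define __kernel
#define __global

/* Stub of the assumed communication-library call: in the real design it
 * would write the target's identification information and the result's
 * address and length into the communication table and notify the table
 * parsing engine; here it merely records its arguments. */
unsigned last_dest, last_len;
void send_to(unsigned dest_id, __global const float *buf, unsigned len)
{
    (void)buf;
    last_dest = dest_id;
    last_len = len;
}

/* Computing task function plus communication function in one kernel. */
__kernel void vec_add(__global const float *a, __global const float *b,
                      __global float *out, unsigned n, unsigned dest_id)
{
    for (unsigned i = 0; i < n; i++)
        out[i] = a[i] + b[i];
    send_to(dest_id, out, n);   /* added per the task's communication need */
}
```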
Further, in this embodiment the device-side program data sent by the host-side computing engine can be obtained through an RDMA module, where the RDMA module and the device-side computing engine are integrated in the same chip. Integrating the module responsible for RDMA network communication and the computing engine into one chip further tightens the link between the computing engine and the communication engine, removes the physical-link distance limits of the prior art, and facilitates large-scale distributed deployment. The device-side computing engine is an xPU (general computing engine), a collective term for computing engines including the GPU, NPU (Neural-network Processing Unit), DPU (Data Processing Unit), and IPU (Infrastructure Processing Unit). In this embodiment, the RDMA module and the computing engine can be integrated in the same FPGA chip.
Moreover, in one implementation, the target data, parameter information, and device-side program data sent by the host-side computing engine can be obtained through the RDMA module, where the parameter information includes the first address and length of the region of the first device end's memory into which the target data is written; the target data is then written into the memory based on the parameter information. The memory may be DDR (DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory).
Step S12: use the device-side computing engine to execute the device-side program data, calculate the target data based on the computing task function to obtain a calculation result, and send the calculation result to the target receiving end based on the communication function, thereby realizing communication between the device-side computing engine and the computing engine at the target receiving end.
In one implementation, the device-side computing engine executes the device-side program data, reads the target data from memory based on the parameter information, and calculates it based on the computing task function to obtain the calculation result; based on the communication function, the calculation result is then sent to the target receiving end through the RDMA module.
Sending the calculation result to the target receiving end through the RDMA module based on the communication function specifically includes the following steps:
Step 00: write the identification information of the target receiving end into the communication table based on the communication function, and notify the table parsing engine.
The table parsing engine and the RDMA module are both underlying hardware IP (intellectual property) blocks. In one implementation, custom IP blocks may be written in an HDL (Hardware Description Language) to obtain the table parsing engine and the RDMA module. Thus, in the present application, the custom communication library of the upper-layer software can be implemented in common C/C++, while the underlying hardware IP is implemented in HDL code; the solution provided by the embodiments of the present application therefore has good versatility and portability.
Furthermore, in the embodiments of the present application, the corresponding target receiving end can be specified when the communication function is created; specifically, it can be specified based on the identification information of the device.
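The communication-table handoff in Step 00 can be sketched in software as follows. This is a minimal in-process model, not the patent's actual hardware interface: the entry layout (`dev_id`, `addr`, `len`, `consumed`) and class names are assumptions for illustration; in the real design the table lives at a fixed DDR address and the consumer is the HDL table parsing engine.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the communication table written in Step 00.
struct CommEntry {
    uint16_t dev_id;   // identification information of the target receiving end
    uint64_t addr;     // write address of the result in device memory (DDR)
    uint32_t len;      // data length in bytes
    bool     consumed; // marked by the table parsing engine to avoid re-reads
};

class CommTable {
public:
    // The computing engine appends one entry per target receiver.
    void write_entry(uint16_t dev_id, uint64_t addr, uint32_t len) {
        entries_.push_back({dev_id, addr, len, false});
    }
    // The table parsing engine takes the next unconsumed entry, if any,
    // and marks it so the same entry is never taken twice.
    bool take_next(CommEntry &out) {
        for (auto &e : entries_) {
            if (!e.consumed) {
                e.consumed = true;
                out = e;
                return true;
            }
        }
        return false;
    }
private:
    std::vector<CommEntry> entries_;
};
```

The "consumed" flag mirrors the behavior described later in step (3)b), where the table parsing engine marks retrieved entries to prevent repeated retrieval.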
Step 01: Obtain the identification information of the target receiving end from the communication table through the table parsing engine, and obtain the IP address corresponding to the identification information.
In addition, in the embodiments of the present application, the calculation result is also written into a data area in memory, and the write address and data length are written into the communication table; the table parsing engine obtains the write address and data length from the communication table, and writes the write address, data length, and IP address into a memory in the RDMA module. This memory may be a FIFO (First In First Out) memory.
The embodiments of the present application may obtain the IP address corresponding to the identification information from an IP identification comparison table.
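The IP identification comparison table is, functionally, a mapping from a device identifier to an IP address. A minimal software sketch of that lookup, with invented type names, could look like this:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of the IP identification comparison table used in
// Step 01: it maps a device/host identifier to an IP address.
class IpIdTable {
public:
    void save(uint16_t id, const std::string &ip) { table_[id] = ip; }
    // Returns true and fills `ip` when the identifier is known.
    bool lookup(uint16_t id, std::string &ip) const {
        auto it = table_.find(id);
        if (it == table_.end()) return false;
        ip = it->second;
        return true;
    }
private:
    std::map<uint16_t, std::string> table_;
};
```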
Further, in a specific implementation, during the driver initialization phase, a discovery message may be sent to the host end through the RDMA module, so that the host end assigns an IP address and identification information to each device end based on the discovery messages of the respective device ends; the IP addresses and identification information of the respective device ends, together with the IP address and identification information of the host end, as sent by the host end, are then obtained and saved into the IP identification comparison table. After the discovery message is sent to the host end through the RDMA module, the IP address and identification information assigned by the host end to the first device end may be obtained through the RDMA module, and a confirmation message may be returned to the host end, so that, after receiving the confirmation messages returned by the respective device ends, the host end saves the IP addresses and identification information of the respective device ends and sends them, together with the host end's own IP address and identification information, to each device end. The confirmation message carries the IP address and identification information assigned by the host end to the corresponding device end.
It should be pointed out that, upon power-on, each device end effectively acts as a DHCP (Dynamic Host Configuration Protocol) client: its RDMA module sends a DHCP discovery message to look for a DHCP server, that is, it sends a specific broadcast message to the broadcast IP address 255.255.255.255. The host end, acting as the DHCP server, receives the DHCP discovery messages from the respective device ends and assigns each device end its IP address, device identifier, and other information. After receiving the information assigned to it by the host end, the RDMA module of each device end replies with a confirmation message that also contains all the information assigned to it, declaring to the host end that the device end will use this information for communication. After receiving the confirmation messages from the respective device ends, the host end sends the information assigned to all device ends, together with the host end's own IP and identification information, in a custom format to every device end, while also keeping a copy on the host end. After receiving the communication information of all devices sent by the host end, the RDMA module of each device end saves it in the IP identification comparison table.
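The discovery/assignment/acknowledgment handshake above can be modeled in-process as a simple state machine on the host side. This is an illustrative sketch only: the ID numbering, the address format, and the reserved host address are assumptions, and real traffic would travel over RDMA Ethernet rather than function calls.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct Assignment {
    uint16_t id;
    std::string ip;
};

// Host-side view of the DHCP-like initialization handshake.
class HostEnd {
public:
    // b) host receives a discovery message and assigns an ID and IP.
    Assignment on_discovery() {
        Assignment a{next_id_, "192.168.0." + std::to_string(next_id_)};
        ++next_id_;
        return a;
    }
    // d) host records an assignment once the device acknowledges it.
    void on_ack(const Assignment &a) { table_[a.id] = a.ip; }
    // The full table that is distributed to every device; the host is
    // given ID 0 here by assumption.
    std::map<uint16_t, std::string> full_table() const {
        auto t = table_;
        t[0] = "192.168.0.1"; // assumed host address
        return t;
    }
private:
    uint16_t next_id_ = 2; // assumed: IDs 0/1 reserved for the host
    std::map<uint16_t, std::string> table_;
};
```

Each device would store the distributed `full_table()` result as its IP identification comparison table, so that later ID-to-IP lookups need no host involvement.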
Further, in the embodiments of the present application, one node may include a host end and at least one device end. If the same network includes multiple nodes, the host end of one node can be designated to perform the allocation of IP addresses and identification information, that is, to allocate the IP addresses and identification information of the other host ends and of all device ends in the network. Ultimately, every host end and every device end holds an IP identification comparison table containing the IP addresses and identification information of all host ends and all device ends in the entire network.
In this way, any device in the same network can automatically discover and obtain the communication information of all devices in the network without human intervention. The scheme is scalable and flexible, and provides a physical basis for extending arbitrary distributed computing-power networks with a high degree of freedom.
Furthermore, in the embodiments of the present application, the addresses of the IP identification comparison table and of the information table are both fixed, and are written into the device-side program data and the host-side program data, where the host-side program data is the program data executed on the host end. In one implementation, a program written mainly in C/C++, using the C/C++ standard library and the host-side API functions of OpenCL, yields the final host-side program data.
Step 02: Send the calculation result to the target receiving end through the RDMA module based on the IP address.
In a specific implementation, the RDMA module may monitor the memory and, when the memory is not empty, read out the write address, data length, and IP address from it, read the calculation result from memory according to the write address and data length, and send the calculation result to the target receiving end based on the IP address.
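The Step 02 consumer loop can be sketched as draining a FIFO of descriptors, each naming a DDR region and a destination IP. The descriptor fields are assumptions, the DDR is simulated as a byte vector, and the actual RDMA send is stubbed out as collecting packets:

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical descriptor written into the RDMA module's FIFO by the
// table parsing engine.
struct Descriptor {
    uint64_t addr;   // write address of the result in DDR
    uint32_t len;    // data length in bytes
    std::string ip;  // resolved destination IP
};

struct Packet {
    std::string ip;
    std::vector<uint8_t> payload;
};

// While the FIFO ("memory") is not empty, take a descriptor, read the
// bytes it describes out of DDR, and "send" them to the resolved IP.
std::vector<Packet> drain_fifo(std::queue<Descriptor> &fifo,
                               const std::vector<uint8_t> &ddr) {
    std::vector<Packet> sent;
    while (!fifo.empty()) {
        Descriptor d = fifo.front();
        fifo.pop();
        Packet p{d.ip, {}};
        p.payload.assign(ddr.data() + d.addr,
                         ddr.data() + d.addr + d.len);
        sent.push_back(p); // stand-in for the RDMA Ethernet send
    }
    return sent;
}
```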
The target receiving end is the host end or a second device end; if the target receiving end comprises multiple second device ends, the identification information of each of the multiple second device ends is written into the communication table based on the communication function, and the calculation result is sent to the multiple second device ends via the table parsing engine.
Furthermore, if the target receiving end is a second device end, the second device end receives the calculation result through its own RDMA module and saves it in its own memory. Moreover, after the first device end has sent the calculation result to the second device end based on the communication function, it may also send an end-of-transmission message to the second device end, so that, upon receiving the end-of-transmission message, the RDMA module of the second device end returns the start address and length of the calculation result in memory to the computing engine of the second device end, which can then carry out subsequent operations based on that start address and length.
Furthermore, while executing the device-side program data, if the first device end reaches a target program instruction, it waits to receive data from a third device end and continues with subsequent operations only after that data has been received. The target program instruction is a program instruction compiled from the communication function. It will be appreciated that the multiple second device ends likewise execute up to the corresponding program instruction and wait to receive the calculation result of the first device end.
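The blocking-read semantics of the compiled instruction can be modeled as polling a mailbox keyed by source device ID: until data from the named source has arrived, the engine must not proceed to subsequent instructions. The mailbox abstraction and its names are invented for illustration; the real mechanism is the RDMA module returning the start address and length after the end-of-transmission message.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Hypothetical model of "wait to receive data from Device ID = N".
class Mailbox {
public:
    // Called when the local RDMA module has finished receiving a
    // transfer from the given source device.
    void deliver(uint16_t src_id, std::vector<uint8_t> data) {
        box_[src_id] = std::move(data);
    }
    // Returns the data if it has arrived; std::nullopt means the engine
    // must keep waiting before executing subsequent instructions.
    std::optional<std::vector<uint8_t>> try_read(uint16_t src_id) {
        auto it = box_.find(src_id);
        if (it == box_.end()) return std::nullopt;
        auto data = std::move(it->second);
        box_.erase(it);
        return data;
    }
private:
    std::map<uint16_t, std::vector<uint8_t>> box_;
};
```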
It will be appreciated that the embodiments of the present application can provide a communication scheme for active interconnection between computing engines of arbitrarily different architectures, eliminating the boundary between inter-node and intra-node communication, requiring no CPU to allocate communication tasks, and offering one-to-many communication capability. Whether between XPUs or between an XPU and the host end, data only needs to be moved once during transmission, yielding high efficiency and performance.
For example, referring to FIG. 2, FIG. 2 is a schematic diagram of a specific distributed communication engine architecture provided by an embodiment of the present application. Taking a fully software-programmable FPGA device as an example, two computing engines exist in this architecture: an ordinary CPU on the host end, and a GPU in the accelerator card (i.e., the device end) derived from extensions to a RISC-V processor. The accelerator card is an FPGA board containing an FPGA chip into which the RDMA module and the GPU are integrated; besides the FPGA chip, the board also carries a series of peripherals such as network ports and DDR, and there is no master-slave relationship between the host end and the device end. The host-side program is written mainly in C/C++, using the C/C++ standard library and the host-side API functions of OpenCL. The device-side program mainly consists of the specific computing tasks described in the OpenCL language and integrates a custom communication function library: when writing the code, the communication functions corresponding to the task's communication requirements are added to the OpenCL code of the cl file and compiled together with it, forming the final device-side binary code.
The upper-layer CL compiler and the xPU chip jointly determine the addresses of the communication tables of both parties and of the IP identification comparison table, and these are written into the above binary code. That is, the addresses of the mutual communication tables and of the IP identification comparison table required for communication between the host end and the device end are determined in advance and written into the host-side program data and the device-side program data. In addition, custom IP blocks, including the RDMA module and the table parsing engine, are written in an HDL; together with the custom communication function library implemented in upper-layer C/C++ and some standard protocol libraries, this realizes arbitrary interconnection between any computing engines based on the common Ethernet protocol, with no need at all for the host to control the communication process.
Further, the main working process of the above distributed communication engine architecture includes:
(1) Driver initialization phase: the RDMA module implements the functions of an RDMA NIC and automatically completes the following tasks: a) upon power-on, each accelerator card effectively acts as a DHCP client, i.e., its RDMA module sends a DHCP discovery message to look for a DHCP server by sending a specific broadcast message to the broadcast IP address 255.255.255.255; b) the host end, acting as the DHCP server, receives the DHCP discovery messages from the accelerator cards and assigns each card its IP address, device ID, and other information; c) after receiving the information assigned to it by the host end, the RDMA module of each accelerator card replies with a confirmation message, which also contains all the information assigned to it in b), declaring to the host end that the card will use this information for communication; d) after receiving the confirmation messages from the accelerator cards, the host end sends the information assigned to all cards, together with the host end's own IP and ID information, in a custom format to every card, while also keeping a copy on the host end; e) after receiving the communication information of all devices sent by the host end in the previous step, the RDMA module of each accelerator card saves it in the IP identification comparison table.
(2) The host end downloads relevant data to the accelerator card: the compiled host-side binary program is executed on the host end, and a series of OpenCL API functions in the program carries out tasks such as finding devices and initializing device parameters. In addition, the program sends the following three categories of data to the accelerator card: the initial data required for computation; some parameters (such as the start address and length of the initial data in the card's DDR; the addresses of these parameters themselves in the card's DDR are fixed); and the compiled device-side binary program data on disk (whose start address in the card's DDR is also fixed and independent of the specific task). The host end sends these data to the accelerator card over RDMA Ethernet.
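The fixed-address convention in step (2) can be illustrated with a small sketch: the parameter block lives at a known DDR offset, so the device-side program can locate the initial data without any runtime negotiation. The offsets, field names, and block layout below are invented for illustration; the patent only states that the addresses are fixed, not what they are.

```cpp
#include <cstdint>
#include <cstring>

// Assumed fixed DDR offsets (illustrative only).
constexpr uint64_t PARAM_BLOCK_ADDR = 0x1000;   // where the parameters live
constexpr uint64_t PROGRAM_ADDR     = 0x100000; // where the device binary lives

struct ParamBlock {
    uint64_t data_addr; // start address of the initial data in DDR
    uint64_t data_len;  // length of the initial data in bytes
};

// Host side writes the block at the fixed offset of the (simulated) DDR;
// the device side reads it back from the same offset.
void write_params(uint8_t *ddr, const ParamBlock &p) {
    std::memcpy(ddr + PARAM_BLOCK_ADDR, &p, sizeof p);
}

ParamBlock read_params(const uint8_t *ddr) {
    ParamBlock p;
    std::memcpy(&p, ddr + PARAM_BLOCK_ADDR, sizeof p);
    return p;
}
```

Because both sides agree on `PARAM_BLOCK_ADDR` (and `PROGRAM_ADDR`) at compile time, the device binary can be task-independent in where it looks for its inputs.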
(3) The two scenarios of communication between the accelerator card and the host end are as follows:
a) The accelerator card reads data from the host end: usually, the host end has already downloaded the relevant data to the accelerator card in (2) above, so the data required by the GPU in the card for computation is already in the card's DDR and the GPU does not need to read data from the host end again.
b) The accelerator card writes data to the host end: after the GPU in the accelerator card completes its computation, it writes the calculation result into the data area of the card's DDR and saves the write address, length, peer ID (i.e., the ID of the target receiving end), and so on in the communication table; the GPU then notifies the table parsing engine, which takes the information out of the communication table and marks the retrieved entries to prevent them from being taken again; using the peer ID in the retrieved information, the table parsing engine queries the IP identification comparison table for the corresponding IP address; having obtained it, the engine writes this information into a FIFO in the RDMA module; when the RDMA module detects that the FIFO is not empty, it continually takes out the entries, reads the corresponding data from the card's DDR according to the DDR address and length in each entry, and then sends the data to the peer according to the IP address; after the data transmission completes, the accelerator card also sends a transmission-completion message to the host end.
(4) Reads and writes between accelerator cards differ from reads and writes between an accelerator card and the host end: the former occur in pairs, i.e., when one end performs a write, another end (or several) must perform the corresponding read. Suppose accelerator card 2 and accelerator card 3 need to read data from accelerator card 1, i.e., after the GPUs in cards 2 and 3 have executed their tasks to a certain stage, they may need card 1's data. The fundamental reason is that the GPUs in these two cards have both reached the program instruction compiled from the read function of the custom communication function library, which indicates that the GPU must wait to receive data from Device ID = 1 (i.e., the accelerator card whose device ID is 1) before it can continue with subsequent tasks.
a) In accelerator card 1, after the GPU completes its computation, it writes the calculation result into the data area of card 1's DDR and saves the write address, length, peer IDs, and so on in the communication table, which resides at a fixed DDR address. Since there are two device ends to send to, two entries with different peer IDs are written into the communication table; the GPU then notifies the table parsing engine, which takes these two entries out of the table; using the two peer IDs in the retrieved information (device IDs 2 and 3), the table parsing engine queries the IP identification comparison table for the corresponding IP addresses, and after obtaining them writes the two entries together with their IPs into a FIFO in the RDMA module; when the RDMA module detects that the FIFO is not empty, it continually takes out the entries, reads the corresponding data from the card's DDR according to the DDR address and length in each entry, and sends the data to the peer devices (accelerator cards 2 and 3) according to their IP addresses; after the data has been sent, an end-of-transmission message is additionally sent to accelerator card 2 and accelerator card 3 respectively;
b) In accelerator cards 2 and 3, once their GPUs reach the instructions compiled from the read function, they wait for the device ID given in the instruction to send data. When the RDMA modules of cards 2 and 3 receive the data sent by card 1, they save it in their respective DDRs; after receiving the end-of-transmission message from card 1, they return the start address and length of the received data in DDR to their respective GPUs, which can then carry out subsequent operations based on that start address and length.
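The one-to-many send in (4)a) reduces to: one communication-table entry per peer ID, ID-to-IP resolution against the IP identification comparison table, and one FIFO descriptor per peer for the RDMA module to consume. A compact sketch, with invented structure names and no real hardware interface:

```cpp
#include <cstdint>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Entry {          // one communication-table entry per peer
    uint16_t peer_id;
    uint64_t addr;
    uint32_t len;
};

struct Desc {           // one FIFO descriptor per resolved peer
    std::string ip;
    uint64_t addr;
    uint32_t len;
};

// Model of the table parsing engine: resolve each entry's peer ID to an
// IP address and queue a descriptor for the RDMA module. Entries whose
// ID is unknown are skipped.
std::queue<Desc> parse_table(const std::vector<Entry> &table,
                             const std::map<uint16_t, std::string> &ip_of) {
    std::queue<Desc> fifo;
    for (const auto &e : table) {
        auto it = ip_of.find(e.peer_id);
        if (it != ip_of.end())
            fifo.push({it->second, e.addr, e.len});
    }
    return fifo;
}
```

Writing two entries for device IDs 2 and 3 yields two descriptors pointing at the same DDR region but different destination IPs, which is exactly the one-to-many behavior described above.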
It should be pointed out that in other communication engine architectures the following situation may arise: because accelerator card 1 and accelerator card 2 cannot communicate directly, card 1 first sends its computed intermediate result back to some address on the host end, and the host end then notifies card 2 to read the data at that address. Since in the present communication architecture the accelerator cards can communicate with each other directly, this situation does not arise.
In this way, with the custom communication function library of the upper-layer software written in a high-level language and the underlying xPU hardware IP written in an HDL, an efficient and universal communication engine between computing engines of different architectures is realized, taking a physical FPGA chip as an example. According to the specific type of the underlying xPU (in this example, a GPU extended from RISC-V), a custom communication function library is written in a standard high-level language and combined with the RDMA module, the table parsing engine, and other blocks written in HDL. By integrating the module responsible for RDMA network communication with the computing engine into a single chip, this communication engine further tightens the coupling between the computing engine and the communication engine, and realizes active interconnection between computing engines of arbitrarily different architectures within the same RDMA network, entirely without CPU participation in communication control, while providing one-to-many communication capability. Whether between accelerator cards or between an accelerator card and the host end, data only needs to be moved once during transmission, effectively solving the problem of inefficient, low-speed communication between computing engines. The solution provided by the present application is applicable to all kinds of xPU computing engines, such as GPUs, NPUs, IPUs, DPUs, and VPUs.
Further, referring to FIG. 3, an embodiment of the present application discloses a computing engine communication method, applied to a host end, including:
Step S31: sending device-side program data to the device end through the host-side computing engine, so that the device-side computing engine of the device end executes the device-side program data, calculates the target data based on the computing task function to obtain a calculation result, and sends the calculation result to the target receiving end based on the communication function;
wherein the device-side program data is program data obtained by compiling program code containing a communication function and a computing task function, the communication function being a function created based on communication requirements.
In some embodiments, sending the device-side program data to the device end through the host-side computing engine includes:
sending the target data, parameter information, and device-side program data to the device end through the host-side computing engine; the parameter information includes the start address and length information for writing the target data into the memory of the device end, so that the device end writes the target data into memory based on the parameter information.
In some embodiments, the method further includes:
obtaining the discovery messages sent by the respective device ends;
configuring an IP address and identification information for each device end based on the discovery messages sent by the respective device ends;
sending the IP addresses and identification information of the respective device ends, together with the IP address and identification information of the host end, to each device end, so that each device end saves them into the IP identification comparison table.
In some embodiments, sending the IP addresses and identification information of the respective device ends, together with the IP address and identification information of the host end, to each device end further includes:
sending to each device end the IP address and identification information that the host end has assigned to it;
after receiving the confirmation messages returned by the respective device ends, saving the IP addresses and identification information of the respective device ends, and sending the IP addresses and identification information of the respective device ends, together with the IP address and identification information of the host end, to each device end.
It can be seen that, in the present application, a communication function is created based on communication requirements, and the program code containing the communication function and the computing task function is compiled to obtain the device-side program data. After the device end obtains the device-side program data sent by the host-side computing engine, the device-side computing engine completes the computing task and the communication task while executing the device-side program data and sends the calculation result to the target receiving end. This realizes communication between the computing engine on the host end and the computing engine on the device end, and between the computing engine on the device end and the computing engine in the target receiving end, without the host-side computing engine having to allocate communication tasks, thereby freeing up the computing power of the host-side computing engine; moreover, it is suited to communication scenarios between different computing engines within the same node as well as between computing engines in different nodes.
Referring to FIG. 4, an embodiment of the present application discloses a computing engine communication apparatus, applied to a first device end, including:
a communication engine 11, configured to obtain device-side program data sent by a host-side computing engine, where the device-side program data is program data obtained by compiling program code containing a communication function and a computing task function, the communication function being a function created based on communication requirements;
a device-side computing engine 12, configured to execute the device-side program data, calculate target data based on the computing task function to obtain a calculation result, and send the calculation result to a target receiving end based on the communication function.
It can be seen that the present application first obtains the device-side program data sent by the host-side computing engine, where the device-side program data is program data obtained by compiling program code containing a communication function and a computing task function, the communication function being a function created based on communication requirements; it then uses the device-side computing engine to execute the device-side program data, calculates the target data based on the computing task function to obtain the calculation result, and sends the calculation result to the target receiving end based on the communication function, thereby realizing communication between the device-side computing engine and the computing engine in the target receiving end.
That is, in the present application, a communication function is created based on communication requirements, and the program code containing the communication function and the computing task function is compiled to obtain the device-side program data. After the device end obtains the device-side program data sent by the host-side computing engine, the device-side computing engine completes the computing task and the communication task while executing the device-side program data and sends the calculation result to the target receiving end, realizing communication between the computing engine on the host end and the computing engine on the device end, and between the computing engine on the device end and the computing engine in the target receiving end, without the host-side computing engine having to allocate communication tasks, thereby freeing up its computing power, and suiting communication scenarios between different computing engines within the same node as well as between computing engines in different nodes.
The communication engine 11 is an RDMA module, and the RDMA module is integrated in the same chip as the device-side computing engine.
Correspondingly, the device-side computing engine 12 is specifically configured to send the calculation result to the target receiving end through the RDMA module based on the communication function.
Further, the RDMA module is specifically configured to obtain the target data, the parameter information, and the device-side program data sent by the host-side computing engine, where the parameter information includes the first address and the length of the target data as written into the memory of the first device end, and to write the target data into the memory based on the parameter information.
Correspondingly, the device-side computing engine 12 is specifically configured to execute the device-side program data, read the target data from the memory based on the parameter information, and calculate the target data based on the computing task function to obtain the calculation result.
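The parameter-information contract in the two paragraphs above can be illustrated with a small sketch, using a byte array as a stand-in for device-side memory. The function names and the 64-byte memory size are illustrative assumptions, not part of the application.

```python
MEMORY = bytearray(64)  # stand-in for device-side memory

def rdma_write(memory, data, first_addr):
    """RDMA module: write the target data into device memory at first_addr
    and report the (first address, length) parameter information."""
    memory[first_addr:first_addr + len(data)] = data
    return {"first_addr": first_addr, "length": len(data)}

def engine_read(memory, params):
    """Device-side computing engine: locate the target data purely from
    the (first address, length) parameter information."""
    start = params["first_addr"]
    return bytes(memory[start:start + params["length"]])

params = rdma_write(MEMORY, b"\x01\x02\x03\x04", first_addr=16)
data = engine_read(MEMORY, params)
# data == b"\x01\x02\x03\x04"
```

The (first address, length) pair is the only coordination needed between writer and reader, which is why the host can hand it over once alongside the data.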
Further, the device-side computing engine 12 writes the identification information of the target receiving end into a communication table based on the communication function and notifies a table parsing engine; the table parsing engine obtains the identification information of the target receiving end from the communication table and obtains the IP address corresponding to the identification information; and the RDMA module is configured to send the calculation result to the target receiving end based on the IP address.
Further, the RDMA module is also configured to write the calculation result into a data area in the memory and to write the write address and the data length into the communication table. Correspondingly, the device-side computing engine 12 is configured to obtain the write address and the data length from the communication table through the table parsing engine and to write the write address, the data length, and the IP address into a memory within the RDMA module. The RDMA module monitors this internal memory and, when it is not empty, reads out the write address, the data length, and the IP address, reads the calculation result from the memory according to the write address and the data length, and sends the calculation result to the target receiving end based on the IP address.
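The send path just described resembles a doorbell-style descriptor queue. The following hedged sketch models it with plain Python containers; every name (`rdma_stage`, `engine_ring_doorbell`, `rdma_poll`, the dictionaries and queue) is an illustrative assumption.

```python
from collections import deque

data_area = {}        # address -> payload (stand-in for the memory data area)
comm_table = {}       # communication table
rdma_queue = deque()  # the RDMA module's internal descriptor memory
wire = []             # what actually goes out on the network

def rdma_stage(result, addr):
    """RDMA module: stage the result and record (address, length)
    in the communication table."""
    data_area[addr] = result
    comm_table["write_addr"], comm_table["length"] = addr, len(result)

def engine_ring_doorbell(ip):
    """Device-side engine: via the table parsing engine, read back the
    (address, length) entry, add the resolved IP, and push the descriptor
    into the RDMA module's internal memory."""
    rdma_queue.append((comm_table["write_addr"], comm_table["length"], ip))

def rdma_poll():
    """RDMA module: whenever the internal memory is not empty, drain a
    descriptor, fetch the payload, and transmit to the target IP."""
    while rdma_queue:
        addr, length, ip = rdma_queue.popleft()
        wire.append((ip, data_area[addr][:length]))

rdma_stage([7, 8, 9], addr=0x1000)
engine_ring_doorbell("10.0.0.2")
rdma_poll()
# wire == [("10.0.0.2", [7, 8, 9])]
```

Splitting staging from transmission this way lets the computing engine return to compute work as soon as the descriptor is queued, which matches the application's goal of offloading communication from the engines.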
Further, obtaining the IP address corresponding to the identification information includes obtaining the IP address corresponding to the identification information from an IP identification comparison table.
Correspondingly, the RDMA module is also configured to send a discovery message to the host side, so that the host side assigns an IP address and identification information to each device side based on the discovery message of each device side; to obtain, from the host side, the IP address and identification information of each device side and the IP address and identification information of the host side; and to save the IP address and identification information of each device side and of the host side into the IP identification comparison table.
The RDMA module is also configured to, after sending the discovery message to the host side, obtain the IP address and identification information that the host side has assigned to the first device side and reply with confirmation information, so that the host side, after receiving the confirmation information replied by each device side, saves the IP address and identification information of each device side and sends the IP address and identification information of each device side, together with those of the host side, to each device side.
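The discovery handshake above can be condensed into a short sketch: devices announce themselves, the host assigns addresses and identifiers, collects confirmations, and then distributes the complete comparison table. The address scheme and function name are illustrative assumptions.

```python
def discovery(device_names, host_ip="10.0.0.1", host_id=0):
    """Model of the discovery/assignment handshake. Returns the
    IP identification comparison table as held by each device."""
    table = {host_id: host_ip}  # host's own entry
    confirmations = set()
    for n, name in enumerate(device_names, start=1):
        # discovery message arrives; host assigns IP + identification
        table[n] = f"10.0.0.{n + 1}"
        # device receives its assignment and replies with confirmation
        confirmations.add(name)
    # host proceeds only once every device has confirmed
    assert confirmations == set(device_names)
    # host sends the complete comparison table to every device
    return {name: dict(table) for name in device_names}

tables = discovery(["devA", "devB"])
# every device now holds the same IP identification comparison table:
# {0: "10.0.0.1", 1: "10.0.0.2", 2: "10.0.0.3"}
```

The confirmation step matters: the host distributes the table only after all devices have acknowledged their assignments, so every party ends up with a consistent view.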
Further, the target receiving end is the host side or a second device end.
If the target receiving end consists of multiple second device ends, the device-side computing engine writes the identification information of all of the second device ends into the communication table based on the communication function.
Moreover, if the target receiving end is a second device end, the second device end receives the calculation result through its own RDMA module and stores the calculation result in its own memory.
Further, the RDMA module is also configured to send transmission-end information to the second device end after sending the calculation result to the second device end based on the communication function, so that the RDMA module of the second device end, upon receiving the transmission-end information, returns the first address and the length of the calculation result in memory to the computing engine of the second device end.
Further, when executing the device-side program data, if the device-side computing engine reaches the target program instruction, it waits to receive data from a third device end and, after receiving that data, continues with the subsequent operations.
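The blocking behavior of the target program instruction can be modeled with a queue that the engine waits on until the third device end's data arrives. The use of `queue.Queue` and all names here are illustrative assumptions, not the application's mechanism.

```python
import queue

def run_program(inbox, steps_before, steps_after, log):
    """Model of device-side execution: steps before the target instruction
    run immediately; the target instruction blocks until data arrives from
    the third device end; then the subsequent operations continue."""
    for step in steps_before:
        log.append(step)
    data = inbox.get()  # target program instruction: block until data arrives
    log.append(("received", data))
    for step in steps_after:
        log.append(step)

inbox = queue.Queue()
inbox.put([42])  # data sent by the third device end
log = []
run_program(inbox, ["step1"], ["step2"], log)
# log == ["step1", ("received", [42]), "step2"]
```

Because the wait is encoded in the program data itself, no host-side coordination is needed to synchronize the two devices.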
Further, referring to FIG. 5, an embodiment of the present application discloses a computing engine communication apparatus applied to the host side, including:
a host-side computing engine 51, configured to send the device-side program data to the device side, so that the device-side computing engine on the device side executes the device-side program data, calculates the target data based on the computing task function to obtain a calculation result, and sends the calculation result to the target receiving end based on the communication function;
where the device-side program data is obtained by compiling program code that contains a communication function and a computing task function, and the communication function is created based on the communication demand.
In some embodiments, the host-side computing engine is specifically configured to send the target data, the parameter information, and the device-side program data to the device side, where the parameter information includes the first address and the length of the target data as written into the device-side memory, so that the device side writes the target data into the memory based on the parameter information.
In some embodiments, the apparatus further includes:
a discovery message acquisition module, configured to acquire the discovery message sent by each device end;
a configuration module, configured to configure an IP address and identification information for each device end based on the discovery message sent by that device end;
a configuration information sending module, configured to send the IP address and identification information of each device end and the IP address and identification information of the host end to each device end, so that each device end saves them into the IP identification comparison table.
In some embodiments, the configuration information sending module further includes:
a first sending module, configured to send to each device end the IP address and identification information assigned to it by the host end;
a second sending module, configured to, after receiving the confirmation information replied by each device end, save the IP address and identification information of each device end and send the IP address and identification information of each device end, together with those of the host end, to each device end.
Referring to FIG. 6, an embodiment of the present application further provides an electronic device, including: a memory 601 for storing a computer program; and a processor 602 for executing the computer program to implement the steps of the computing engine communication method of any of the above embodiments.
Referring to FIG. 7, an embodiment of the present application further provides a non-volatile computer-readable storage medium 7 on which a computer program 701 is stored. When the computer program 701 is executed by a processor, the steps of the computing engine communication method of any of the above embodiments are implemented.
Since the embodiments of the electronic device and of the non-volatile computer-readable storage medium correspond to the embodiments of the computing engine communication method, for details of the former, refer to the description of the method embodiments; they are not repeated here.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The computing engine communication method and apparatus, electronic device, and storage medium provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. For those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (21)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211347106.X | 2022-10-31 | | |
| CN202211347106.XA CN116028238A (en) | 2022-10-31 | 2022-10-31 | Computing engine communication method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024093112A1 true WO2024093112A1 (en) | 2024-05-10 |
Family
ID=86071157
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/084813 Ceased WO2024093112A1 (en) | 2022-10-31 | 2023-03-29 | Computing engine communication method and apparatus, electronic device, and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116028238A (en) |
| WO (1) | WO2024093112A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119576585A (en) * | 2025-02-06 | 2025-03-07 | 山东浪潮科学研究院有限公司 | A memory management method and heterogeneous computing system |
| CN120295958A (en) * | 2025-06-13 | 2025-07-11 | 山东云海国创云计算装备产业创新中心有限公司 | Data transmission method, device, medium and product |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117834709B (en) * | 2024-01-04 | 2024-08-06 | 天津大学 | Method for directly transferring data between functions of server-oriented non-perception computing scene |
| CN119996539A (en) * | 2025-01-14 | 2025-05-13 | 广西电网有限责任公司 | Efficient parsing and forwarding method of RDMA protocol based on RISC-V architecture |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109828843A (en) * | 2019-01-30 | 2019-05-31 | 郑州云海信息技术有限公司 | Method, system and the electronic equipment that data are transmitted between a kind of calculate node |
| CN111966504A (en) * | 2020-10-23 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Task processing method in graphics processor and related equipment |
| CN113849293A (en) * | 2021-11-30 | 2021-12-28 | 湖北芯擎科技有限公司 | Data processing method, device, system and computer readable storage medium |
| CN114003392A (en) * | 2021-12-28 | 2022-02-01 | 苏州浪潮智能科技有限公司 | Data accelerated computing method and related device |
| WO2022105736A1 (en) * | 2020-11-20 | 2022-05-27 | 深圳前海微众银行股份有限公司 | Data processing method and apparatus, device, computer storage medium, and program |
- 2022-10-31: CN application CN202211347106.XA filed (published as CN116028238A; status: active, Pending)
- 2023-03-29: PCT application PCT/CN2023/084813 filed (published as WO2024093112A1; status: not active, Ceased)
Also Published As
| Publication number | Publication date |
|---|---|
| CN116028238A (en) | 2023-04-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2024093112A1 (en) | Computing engine communication method and apparatus, electronic device, and storage medium | |
| US7788334B2 (en) | Multiple node remote messaging | |
| JP3836840B2 (en) | Multiprocessor system | |
| WO2025001317A1 (en) | Server system and communication method therefor | |
| CN112540941A (en) | Data forwarding chip and server | |
| EP4029219A1 (en) | Methods and apparatus for network interface fabric send/receive operations | |
| CN116627888A (en) | Hardware computing module, device, method, electronic device and storage medium | |
| CN117591450B (en) | Data processing system, method, equipment and medium | |
| CN114546913A (en) | Method and device for high-speed data interaction among multiple hosts based on PCIE interface | |
| CN107957970A (en) | The means of communication and solid-state hard disk controller of a kind of heterogeneous polynuclear | |
| CN118606079B (en) | A communication method and system based on socket interface | |
| CN107209725A (en) | Method, processor and the computer of processing write requests | |
| CN114445260A (en) | Distributed GPU communication method and device based on FPGA | |
| CN118519753B (en) | A computing resource aggregation method and system based on pooled memory | |
| CN111488308A (en) | A system and method for supporting multiprocessor expansion of different architectures | |
| CN112817774A (en) | System and method for transaction broadcasting in network on chip | |
| CN118827819A (en) | A protocol conversion device, method, equipment, medium and product | |
| CN110519242A (en) | Data transmission method and device | |
| WO2025112837A1 (en) | Server system, job execution method and apparatus, device, and medium | |
| WO2025138694A1 (en) | Data transmission method and device, and system | |
| CN119597489A (en) | P2P communication method and system between IO devices based on PCIe-NTB | |
| WO2024244557A1 (en) | Inter-node communication method and apparatus, electronic device, and storage medium | |
| CN118708368B (en) | Data processing method and device for distributed memory computing engine cluster | |
| CN117971135B (en) | Storage device access method and device, storage medium and electronic device | |
| CN119988005A (en) | A data processing method and related equipment |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23884055; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 23884055; Country of ref document: EP; Kind code of ref document: A1 |