CN118642856B - Data processing method and system, storage medium and electronic equipment - Google Patents

Data processing method and system, storage medium and electronic equipment

Info

Publication number
CN118642856B
Authority
CN
China
Prior art keywords
port
write
data processing
write request
completion signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410866971.8A
Other languages
Chinese (zh)
Other versions
CN118642856A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mole Thread Intelligent Technology Beijing Co ltd
Moore Threads Technology Co Ltd
Original Assignee
Mole Thread Intelligent Technology Beijing Co ltd
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mole Thread Intelligent Technology Beijing Co ltd, Moore Threads Technology Co Ltd
Priority to CN202410866971.8A
Publication of CN118642856A
Application granted
Publication of CN118642856B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/54 - Interprogram communication
    • G06F9/544 - Buffers; Shared memory; Pipes

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The disclosure relates to a data processing method and system, a storage medium, and electronic equipment. In the data processing method, a request initiating processor of a first data processing board sends a write request to a memory of a second data processing board in a computing cluster and then transmits a system fence instruction to the local port that sent the write request; the local port transmits the system fence instruction to each downstream port that received the write request; each downstream port in the computing cluster executes the system fence instruction to collect acknowledgement signals for the write request and returns a system fence completion signal to its upstream port when collection is completed; and the local port of the first data processing board, when it has collected the acknowledgement signals of the write request, returns a system fence completion signal to the request initiating processor to indicate that execution of all the write requests is completed. Embodiments of the present disclosure can achieve global ordering and consistency of memory operations across multiple processors in a computing cluster.

Description

Data processing method and system, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and system, a storage medium, and an electronic device.
Background
In modern computing systems, the consistency and synchronization of memory operations are critical to ensuring data accuracy and system stability. Memory fences are widely used in multiprocessor systems as a synchronization mechanism to maintain the ordering of memory operations. However, prior-art memory fences focus on write operations to local storage and have significant shortcomings in synchronization and traceability for remote devices across multiple levels of interconnect.
In a multiprocessor cluster environment, processors communicate and exchange data with each other through a high-speed interconnect. Conventional implementations of memory fences rely on the cache coherency protocol within a processor to ensure that write operations to memory complete in the desired order. However, when remote memory access is involved, the prior art fails to achieve consistency of memory operations across a multiprocessor cluster.
Disclosure of Invention
The present disclosure proposes a data processing technique.
According to an aspect of the present disclosure, there is provided a data processing system for use in a computing cluster including a plurality of data processing boards, comprising:
a request initiating processor in a first data processing board card transmits a system fence instruction to a local port in the first data processing board card for transmitting a write request after transmitting the write request to a memory of a second data processing board card in the computing cluster;
The local port of the first data processing board card transmits the system fence instruction to each downstream port for receiving the write request;
The downstream port in the computing cluster executes the system fence instruction under the condition that the system fence instruction sent by the upstream port is received, so as to collect the acknowledgement signal of the write request, and returns a system fence completion signal to the upstream port under the condition that collection is completed;
And the local port of the first data processing board card returns a system fence completion signal to the request initiating processor under the condition that the confirmation signal of the write request is collected, so as to indicate that all the write requests are executed.
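For illustration only, the following minimal Python sketch simulates the fence flow summarized above: write requests fan out through a local port, a system fence instruction is then propagated to exactly the downstream ports that received writes, and completion signals are aggregated back to the request initiating processor. The Port and Memory classes, their method names, and the single-threaded structure are assumptions made for this sketch and are not the patented hardware implementation.

```python
# Minimal single-threaded sketch of the system fence flow described above.
# Class and method names are illustrative assumptions, not the patent's interfaces.

class Memory:
    def __init__(self, name):
        self.name = name
        self.cells = {}

    def write(self, addr, value):
        self.cells[addr] = value
        return True  # acknowledgement signal: the write operation was performed


class Port:
    def __init__(self, name, local_memory=None):
        self.name = name
        self.local_memory = local_memory   # memory behind this port, if any
        self.routes = {}                   # address prefix -> downstream Port
        self.touched_downstream = set()    # downstream ports that received writes since the last fence
        self.local_acks = 0                # acknowledgements collected from the local memory
        self.local_writes = 0              # write requests handed to the local memory

    def send_write(self, addr, value):
        """Route a write request either to a downstream port or to the local memory."""
        for prefix, downstream in self.routes.items():
            if addr.startswith(prefix):
                self.touched_downstream.add(downstream)
                downstream.send_write(addr, value)
                return
        self.local_writes += 1
        if self.local_memory.write(addr, value):
            self.local_acks += 1           # collect the memory's acknowledgement signal

    def system_fence(self):
        """Execute the fence: local acks plus fence completions from touched downstream ports."""
        local_done = self.local_acks == self.local_writes
        downstream_done = all(p.system_fence() for p in self.touched_downstream)
        self.touched_downstream.clear()    # start a new interval for the next fence
        self.local_acks = self.local_writes = 0
        return local_done and downstream_done   # the "system fence completion signal"


# Request initiating processor on board 1 writes into board 2's memory, then issues a fence.
board2_port = Port("board2_port", local_memory=Memory("board2_ddr"))
local_port = Port("board1_local_port")
local_port.routes["board2/"] = board2_port

local_port.send_write("board2/0x10", 42)
local_port.send_write("board2/0x18", 7)
assert local_port.system_fence()           # all write requests confirmed complete
print("fence completed, board2 memory:", board2_port.local_memory.cells)
```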
In one possible implementation, the acknowledgement signal of the write request includes an acknowledgement signal that the local memory performed a write operation for the write request, and/or a system fence completion signal returned by the downstream port indicating that the downstream port has collected the acknowledgement signal of the corresponding write request.
In one possible implementation, the collecting the acknowledgement signal of the write request includes:
the downstream port in the computing cluster, in case of receiving the system fence instruction sent by the upstream port, if the local memory corresponding to the port performs processing on the write request, collects a confirmation signal of the local memory performing writing operation on the write request, and/or,
If the port forwards the write request to the downstream port, forwarding the system fence instruction to the downstream port receiving the write request, and collecting a system fence completion signal returned by the downstream port receiving the write request.
In one possible implementation, the information recorded in the core exit of the request initiating processor includes port information of the local port that issued the write request in a time interval from the issuance of the last system barrier instruction to the issuance of the current system barrier instruction;
the request initiating processor transmits the system fence instruction to the corresponding local port based on the recorded port information.
In one possible implementation manner, the information recorded in the port of the data processing board card of the computing cluster, which has sent the write request, includes port information of a downstream port accessed by the issued write request in a time interval from the issuance of the last system barrier instruction to the issuance of the current system barrier instruction;
and transmitting the system fence instruction to a downstream port corresponding to the port information by the port which sends the write-over request in the computing cluster based on the recorded port information.
In one possible implementation manner, in a case that a write request sent by an upstream port is received by a downstream port in the computing cluster, after the write request is sent to a local memory or the write request is forwarded to the downstream port, the collection is determined to be completed, and a system fence completion signal is returned to the upstream port.
In one possible implementation, the request initiating processor adds an advance acknowledgement identifier to the write request and sends the advance acknowledgement identifier to the downstream port, where the advance acknowledgement identifier is used to instruct the downstream port to return a system fence completion signal to the upstream port after the write request is sent to the local memory or the write request is forwarded to the downstream port.
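A possible way to picture the advance acknowledgement identifier is sketched below. The early_ack field, the handle_write helper, and the callback are illustrative assumptions, not the patent's signalling format; the sketch only shows how the flag changes when the completion signal is reported.

```python
# Illustrative sketch: an advance acknowledgement identifier on a write request decides
# whether the downstream port reports completion as soon as it hands the request on,
# or only after the memory confirms the write. Names are assumptions, not the patent's format.

def handle_write(request, memory, report_completion):
    """Downstream port behaviour for a single write request."""
    if request.get("early_ack"):
        # Advance acknowledgement: return the completion signal once the request has been
        # sent to the local memory (or forwarded); the actual write finishes afterwards.
        report_completion("early")
        memory[request["addr"]] = request["value"]
    else:
        # Conservative mode: wait for the memory's own acknowledgement before reporting.
        memory[request["addr"]] = request["value"]
        report_completion("after-write")

memory, completions = {}, []
handle_write({"addr": 0x10, "value": 1, "early_ack": True}, memory, completions.append)
handle_write({"addr": 0x18, "value": 2, "early_ack": False}, memory, completions.append)
print(completions, memory)   # ['early', 'after-write'] {16: 1, 24: 2}
```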
In one possible implementation, there are multiple data paths between the first data processing board and the second data processing board;
In each data path, after receiving a system fence completion signal, a port of the data processing board directly connected with the second data processing board sends a write completion signal to a port of the second data processing board, wherein each data path corresponds to one write completion signal;
After receiving the write completion signal, the port of the second data processing board sends the write completion signal to the mark address of the local memory upon receiving the signal returned by the local memory indicating that the data was written successfully;
the memory of the second data processing board card updates a mark in the memory based on the writing completion signal, wherein the mark is used for indicating the number of the received writing completion signals;
The data consumption processor determines whether the data writing is completed based on the flag, and reads the data written in the memory in the case that the data writing is determined to be completed.
In one possible implementation manner, after receiving a system fence completion signal returned by a downstream port, an upstream port in each data path transmits a write completion signal to the downstream port, wherein when the upstream port corresponds to a plurality of downstream ports, the upstream port respectively transmits the write completion signal to each downstream port, and when the plurality of upstream ports corresponds to one downstream port, the downstream port collects each write completion signal transmitted by the upstream port, so that each data path corresponds to one write completion signal;
the local memory performs an atomic operation on the mark in the memory to update the mark in the memory when receiving a write-in completion signal;
And the data consumption processor reads the data written in the memory under the condition that the mark in the memory is determined to indicate that K writing completion signals are received, wherein K is the number of the data paths, and K is a positive integer.
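The flag-based hand-off described above can be sketched as follows. A threading.Lock stands in for the memory-side atomic operation, and the value of K, the state dictionary, and all names are assumptions made for this illustration; it is not the patented mechanism itself.

```python
# Sketch of the flag-based hand-off: each of K data paths delivers one write completion
# signal, the memory updates a counter with an atomic-style operation, and the data
# consuming processor reads the payload only after seeing K updates.

import threading

K = 3                                      # assumed number of data paths
flag_lock = threading.Lock()               # models the atomicity of the flag update
state = {"flag": 0, "payload": None}

def deliver_write_completion():
    """Port of the second board: atomically bump the flag at the mark address."""
    with flag_lock:
        state["flag"] += 1

def producer():
    state["payload"] = [1, 2, 3]           # data written to the remote memory
    for _ in range(K):                     # one write completion signal per data path
        deliver_write_completion()

def consumer():
    while True:                            # the data consuming processor polls the flag
        with flag_lock:
            if state["flag"] >= K:
                break
    print("consumer reads:", state["payload"])

reader = threading.Thread(target=consumer)
reader.start()
producer()
reader.join()
```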
In one possible implementation, there is a data path in the data paths that does not send a write request;
A local port in the first data processing board card sends a system fence instruction to a port in a data path which does not send a write request;
After receiving a system fence instruction, a port in the data path which does not send a write request sends a write completion signal to a downstream port, and the write completion signal is recursively transferred to a port of a second data processing board card;
and after receiving a write-in completion signal transmitted by a data channel which does not transmit a write request, the port of the second data processing board card transmits the write-in completion signal to a mark address of the local memory.
In one possible implementation manner, the write request includes a group identifier, and the local port sends a system fence instruction to each downstream port that receives the write request based on the group identifier, where the system fence instruction includes the same group identifier;
And under the condition that a system fence instruction sent by an upstream port is received, the downstream port in the computing cluster collects a confirmation signal of a write request corresponding to a group identifier based on the group identifier in the system fence instruction, and under the condition that collection is completed, returns a system fence completion signal containing the group identifier to the upstream port.
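The per-group collection described in the two paragraphs above can be pictured with the short sketch below. The dictionary-based bookkeeping and the function names are assumptions made for this example only.

```python
# Illustrative sketch of per-group collection: a fence carrying group identifier "g1"
# only waits for the write requests tagged "g1".

from collections import defaultdict

pending = defaultdict(int)            # group identifier -> outstanding write acknowledgements

def receive_write(group_id):
    pending[group_id] += 1            # downstream port records a write request for this group

def ack_write(group_id):
    pending[group_id] -= 1            # local memory acknowledged one write of this group

def execute_fence(group_id):
    """Return a fence completion signal carrying the same group identifier, if ready."""
    if pending[group_id] == 0:
        return {"type": "system_fence_complete", "group": group_id}
    return None                       # collection for this group is not finished yet

receive_write("g1"); receive_write("g2")
ack_write("g1")
print(execute_fence("g1"))            # completes: every "g1" write is acknowledged
print(execute_fence("g2"))            # None: "g2" still has an outstanding write
```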
In one possible implementation manner, after receiving a data writing failure signal returned by the local memory, a downstream port in the computing cluster returns a system fence failure signal to an upstream port;
And the local port of the first data processing board card returns a system fence failure signal to the request initiating processor under the condition of receiving the system fence failure signal.
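A minimal sketch of the failure path follows: when the local memory reports a failed write, the downstream port returns a system fence failure signal instead of a completion signal, and the local port surfaces it to the request initiating processor. The exception type and function names are assumptions, not the patent's signalling format.

```python
# Sketch of the failure path: a failed memory write turns into a system fence failure
# signal that propagates back to the request initiating processor.

class SystemFenceFailure(Exception):
    pass

def downstream_fence(memory_write_ok):
    if not memory_write_ok:
        raise SystemFenceFailure("data write failed in local memory")
    return "system_fence_complete"

def local_port_fence(downstream_write_results):
    try:
        return [downstream_fence(ok) for ok in downstream_write_results]
    except SystemFenceFailure as failure:
        return f"report to request initiating processor: {failure}"

print(local_port_fence([True, True]))    # all writes succeeded
print(local_port_fence([True, False]))   # a failed write surfaces as a fence failure
```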
According to an aspect of the present disclosure, there is provided a data processing method applied to a computing cluster including a plurality of data processing boards, including:
After a request initiating processor in a first data processing board card sends a write request to a memory of a second data processing board card in the computing cluster, transmitting a system fence instruction to a local port in the first data processing board card for sending the write request, so as to transmit the system fence instruction to each downstream port for receiving the write request through the local port of the first data processing board card;
Under the condition that a downstream port in the computing cluster receives a system fence instruction sent by an upstream port, acquiring the system fence instruction, executing the system fence instruction to collect a confirmation signal of the write request, and returning a system fence completion signal to the upstream port under the condition that collection is completed;
And under the condition that the local port of the first data processing board card collects the confirmation signal of the write request, returning a system fence completion signal to the request initiating processor so as to indicate that all the write requests are executed.
According to an aspect of the present disclosure, there is provided a data processing apparatus applied to a computing cluster including a plurality of data processing boards, including:
a sending module, configured to transmit a system fence instruction to the local port of a first data processing board that sent a write request, after a request initiating processor in the first data processing board sends the write request to a memory of a second data processing board in the computing cluster, so that the system fence instruction is transmitted, through the local port of the first data processing board, to each downstream port that received the write request;
a collecting module, configured to, under the condition that a downstream port in the computing cluster receives a system fence instruction sent by an upstream port, acquire the system fence instruction and execute it to collect a confirmation signal of the write request, and to return a system fence completion signal to the upstream port under the condition that collection is completed;
and the return module is used for returning a system fence completion signal to the request initiating processor under the condition that the local port of the first data processing board card collects the confirmation signal of the write request, so as to indicate that all the write requests are executed.
According to an aspect of the disclosure, there is provided an electronic device comprising a processor, a memory for storing processor-executable instructions, wherein the processor is configured to invoke the instructions stored in the memory to implement the above system.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described system.
In the embodiment of the disclosure, a request initiating processor in a first data processing board transmits a system fence instruction to a local port of the first data processing board for transmitting a write request after transmitting the write request to a memory of a second data processing board in the computing cluster, the local port of the first data processing board transmits the system fence instruction to each downstream port for receiving the write request, the downstream port in the computing cluster executes the system fence instruction under the condition that the system fence instruction transmitted by the upstream port is received to collect a confirmation signal of the write request, and returns a system fence completion signal to the upstream port under the condition that the collection is completed, and the local port of the first data processing board returns a system fence completion signal to the request initiating processor under the condition that the confirmation signal of the write request is collected to indicate that all the write requests are executed. Therefore, after the request initiating processor sends the write request, the system fence instruction is triggered, the instruction is transmitted to all downstream ports receiving the write request through the local port, after all write operations are completed, the downstream ports return system fence completion signals, and after the local port collects the signals, the request initiating processor confirms that all the write requests are completed, so that the global sequence and consistency of the memory operation are realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
FIG. 1 illustrates a block diagram of a data processing system according to an embodiment of the present disclosure.
Fig. 2 is a flowchart of a data processing method provided by the present disclosure in which an acknowledgement signal is not returned in advance.
Fig. 3 is a flowchart of a data processing method provided by the present disclosure in which an acknowledgement signal is returned in advance.
Fig. 4 illustrates a block diagram of a computing cluster provided by the present disclosure.
Fig. 5 shows a flow chart of a data processing method according to an embodiment of the present disclosure.
Fig. 6 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
Fig. 7 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean that a exists alone, while a and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
FIG. 1 illustrates a block diagram of a data processing system according to an embodiment of the present disclosure. As shown in FIG. 1, the system is applied to a computing cluster including a plurality of data processing boards and includes:
A request initiating processor 101-1 in a first data processing board 101 transmits a system fence instruction to a local port 101-2 in the first data processing board that transmits a write request after transmitting the write request to a memory 102-1 of a second data processing board 102 in the computing cluster;
the local port 101-2 of the first data processing board 101 transmits the system fence instruction to each downstream port that receives the write request;
The downstream port in the computing cluster executes the system fence instruction under the condition that the system fence instruction sent by the upstream port is received, so as to collect the acknowledgement signal of the write request, and returns a system fence completion signal to the upstream port under the condition that collection is completed;
And the local port 101-2 of the first data processing board 101 returns a system fence completion signal to the request initiating processor 101-1 to indicate that all the write requests are executed when the acknowledgement signals of the write requests are collected.
The system is applied to a computing cluster comprising a plurality of data processing boards, where the computing cluster is a parallel computing system formed by the plurality of data processing boards (also called nodes), which cooperate with one another through a network or other connections to jointly execute complex computing tasks.
The data processing board here is a hardware device capable of performing data processing, and may be a graphics card, for example. It should be noted that, in one data processing board, a module such as a processor, a memory, a port, etc. may be included, and a module located in the same data processing board is described as a local module, for example, a local port, a local memory, etc. The write request in the present disclosure is sent by the processor, and can be transferred between different data processing boards, sent via the local port of the first data processing board, and finally transmitted to the port of the destination data processing board, and then written into the memory thereof.
The processors herein may include CPUs or GPUs, and may also include other types of processors that perform data processing such as computation, which is not limited by this disclosure; they may be used to execute write requests and system fence instructions in a computing cluster. Taking CPUs and GPUs as examples: for a CPU, the write requests and the execution of system fence instructions follow the conventional processor instruction set architecture (ISA) and memory hierarchy, whereas for a GPU, write requests may need to be issued through CUDA or another GPU programming model.
The request initiating processor is the processor in the computing cluster that initiates a write request, which refers to a request that the processor of one data processing card (i.e., the request initiating processor) wishes to write data into a memory area of another data processing card. For example, assume that there is one computing cluster that includes three data processing cards A, B and C. If data processing board a needs to update a data set and this data set is also stored in data processing board B, the processor in data processing board a needs to initiate a write request to write the new value of the data into the corresponding memory area of data processing board B. In this process, the processor in the data processing board a is the request initiating processor, which is responsible for generating write requests and sending these requests to the storage area of the data processing board B via a network or other communication mechanism, and when the storage area of the data processing board B receives the write request and successfully performs a write operation, a confirmation signal may be sent to the request initiating processor to indicate that the write operation is completed.
And the different data processing boards in the computing cluster are communicated through ports, and the local ports of the data processing boards are used for sending write requests and system fence instructions.
For convenience of description, an interface in a data processing board card of a computing cluster that receives a write request and/or a system fence instruction is referred to as a downstream port, and an interface in a data processing board card of a computing cluster that sends a write request and/or a system fence instruction is referred to as an upstream port. It should be noted that, the upstream port and the downstream port are a relative expression, which is used to express the relationship between ports in the time sequence flow during the transmission process of the write request and/or the system fence instruction.
In the disclosed embodiment, a system fence instruction is used to collect local write request acknowledge signals and receive system fence completion signals from other downstream ports to determine the completion of write requests by the local memory and the memory of the downstream port, e.g., to ensure that both local and downstream write requests have been acknowledged. For example, for each port that receives a write request, an upstream port may receive and execute a system fence instruction, or for ports on all paths that can reach the memory that performs the write operation, an upstream port may receive and execute a system fence instruction.
The information recorded by the core exit of the request initiating processor (for example, the last level cache exit in the processor) comprises port information of a local port which issues a write request in a time interval from the last system fence instruction to the current system fence instruction, and the request initiating processor transmits the system fence instruction to the corresponding local port based on the recorded port information.
And the request initiating processor transmits the system fence instruction to a local port for sending the write request according to the recorded port information, and broadcasts the system fence instruction to all downstream ports accessed by the previous write request through the local port.
The information recorded in the port of the data processing board card of the computing cluster, which is used for sending the write request, comprises port information of a downstream port accessed by the sent write request in a time interval from the sending of the last system fence instruction to the sending of the current system fence instruction, and the port of the computing cluster, which is used for sending the write request, is used for transmitting the system fence instruction to the downstream port corresponding to the port information based on the recorded port information.
These downstream ports, upon receiving the system fence instruction, execute the system fence instruction to perform corresponding operations, including collecting write request acknowledgement signals from the local memory and receiving system fence completion signals from other downstream ports. After the collection is completed, the downstream port returns a system fence completion signal to the upstream port, where the upstream port is a port that sends a system fence instruction, including the port through which the request initiating processor sends the system fence instruction and ports that forward the system fence instruction. When returning a system fence completion signal, the downstream port returns it only to the upstream port that sent the system fence instruction to that port.
After collecting all necessary system fence completion signals, the local port of the first data processing board card returns a system fence completion signal to the request initiating processor to indicate that all relevant write requests have been executed. In this way, the system ensures that all necessary write operations have been confirmed before continuing to perform subsequent operations, thereby improving the stability and reliability of the system.
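The port-information bookkeeping described in the preceding paragraphs can be pictured with the short sketch below: between two fences, the core exit (or a forwarding port) records which ports its write requests went through, and the next fence is broadcast only to those ports. The FenceBookkeeper class and its names are assumptions for illustration only.

```python
# Sketch of the recording interval: the fence targets exactly the ports that write
# requests passed through since the previous fence, then the interval resets.

class FenceBookkeeper:
    def __init__(self):
        self.ports_since_last_fence = set()

    def record_write(self, port_name):
        # Called whenever a write request is issued through a port.
        self.ports_since_last_fence.add(port_name)

    def issue_fence(self):
        # The fence is broadcast only to the ports touched since the previous fence,
        # then the recording interval starts over.
        targets = sorted(self.ports_since_last_fence)
        self.ports_since_last_fence.clear()
        return targets

book = FenceBookkeeper()
book.record_write("local_port_0")
book.record_write("local_port_2")
print(book.issue_fence())   # ['local_port_0', 'local_port_2']
print(book.issue_fence())   # [] -- nothing has been written since the previous fence
```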
In the embodiment of the disclosure, a request initiating processor in a first data processing board transmits a system fence instruction to a local port of the first data processing board for transmitting a write request after transmitting the write request to a memory of a second data processing board in the computing cluster, the local port of the first data processing board transmits the system fence instruction to each downstream port for receiving the write request, the downstream port in the computing cluster executes the system fence instruction under the condition that the system fence instruction transmitted by the upstream port is received to collect a confirmation signal of the write request, and returns a system fence completion signal to the upstream port under the condition that the collection is completed, and the local port of the first data processing board returns a system fence completion signal to the request initiating processor under the condition that the confirmation signal of the write request is collected to indicate that all the write requests are executed. Therefore, after the request initiating processor sends the write request, the system fence instruction is triggered, the instruction is transmitted to all downstream ports receiving the write request through the local port, after all write operations are completed, the downstream ports return system fence completion signals, and after the local port collects the signals, the request initiating processor confirms that all the write requests are completed, so that the global sequence and consistency of the memory operation are realized.
The data processing system described by the present application ensures that all relevant write requests in a computing cluster have been validated before continuing to perform subsequent operations by introducing a system fence mechanism. This greatly improves the stability and reliability of the system and reduces errors caused by data inconsistencies or losses. Meanwhile, through a clear signal transmission and collection mechanism, the system can manage and schedule resources more efficiently, and the overall processing efficiency is improved.
In one possible implementation, the acknowledgement signal of the write request includes an acknowledgement signal that the local memory performed a write operation for the write request, and/or a system fence completion signal returned by the downstream port indicating that the downstream port has collected the acknowledgement signal of the corresponding write request.
When a write request is sent to a downstream port, the write request may be processed by a local memory corresponding to the downstream port and written to a local memory location or register, or the write request may need to be executed by a memory of another data processing board card, so that the downstream port needs to forward the request and forward the downstream port further to the downstream port.
For both cases, when the local memory performs a write operation on a write request, i.e., when a write request is sent to the local memory, the local memory (e.g., memory system) performs the corresponding write operation to write data to the specified memory location or register. When the write operation is completed, an acknowledgement signal is generated indicating that the write request has been successfully performed, which is returned to the local port that sent the write request. The acknowledge signal is the acknowledge signal of the write operation performed by the local memory on the write request.
When the write request is transmitted to the downstream port by the port, if the memory corresponding to the downstream port finishes processing the write request, the downstream port also generates a system fence completion signal after collecting the acknowledgement signal of the write request, and returns the system fence completion signal to the upstream port.
In this way, all ports receiving the write requests aggregate the acknowledgement signals of all the write requests to the local ports of the first data processing board by passing back the acknowledgement signals step by step, so as to coordinate the data write operation across multiple data processing boards, and maintain the consistency of the memory operation in the whole computing cluster.
In one possible implementation manner, the collecting the acknowledgement signal of the write request includes that in the case that a system fence instruction sent by an upstream port is received by a downstream port in the computing cluster, if a local memory corresponding to the port performs processing on the write request, the acknowledgement signal of the local memory performing a write operation on the write request is collected, and/or if the port forwards the write request to the downstream port, the system fence instruction is forwarded to the downstream port receiving the write request, and a system fence completion signal returned by the downstream port receiving the write request is collected.
For the same data processing card, three processing modes exist for a plurality of received write requests, namely: (1) processing the write requests and writing the data into a local memory; (2) forwarding the write requests to a downstream port; (3) processing some of the write requests while forwarding the rest.
In this implementation, for any downstream port in the computing cluster, when the downstream port receives the system fence instruction sent by the upstream port, if the local memory corresponding to the port has performed processing (i.e. performed a write operation) on the write request, the port will collect the acknowledgement signal returned by the local memory.
If a downstream port does not process a write request directly, but forwards it to a further downstream port, the downstream port will perform the following operations:
forwarding the system fence instruction to a further downstream port receiving the write request so that the further downstream port executes the system fence instruction, thereby collecting and returning an acknowledgement signal of the write request;
waiting for and collecting the system fence completion signal returned by the downstream port receiving the write request, which indicates that the downstream port has completed processing the write request and has collected all necessary acknowledgement signals (possibly from its own local memory or from a port even further downstream).
Using both mechanisms in combination, the compute cluster can ensure that the acknowledgement signal of a write request is properly collected and propagated. The method is particularly suitable for the situation that write requests can be transmitted and processed among a plurality of data processing boards in a distributed system or network environment, and the system fence instruction ensures that write operations across the plurality of data processing boards can be synchronously completed in a cluster, so that the consistency of memory operations is maintained. Meanwhile, the mechanism also allows flexible processing of the write request, whether the write request is directly written into a local memory or forwarded to a port of other data processing boards, and can ensure that the completion of the write operation is confirmed and recorded.
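The three cases above for a single port can be condensed into the small recursive check sketched below. The tuple-based representation of a port is an assumption made for this example; the fence at a port completes only once local acknowledgements and downstream fence completions are both in.

```python
# Sketch of the three cases: writes handled by the local memory, writes forwarded
# downstream, or a mixture of both.

def port_fence(local_acks_expected, local_acks_received, downstream_ports):
    """Return True when this port can issue a system fence completion signal."""
    local_done = local_acks_received >= local_acks_expected
    # Forward the fence to every further-downstream port that received a write request
    # and wait for their completion signals.
    downstream_done = all(port_fence(*p) for p in downstream_ports)
    return local_done and downstream_done

# Case 1: all write requests handled by the local memory.
print(port_fence(2, 2, []))                        # True
# Case 2: all write requests forwarded to one downstream port.
print(port_fence(0, 0, [(1, 1, [])]))              # True
# Case 3: mixed, with one downstream acknowledgement still missing.
print(port_fence(1, 1, [(2, 1, [])]))              # False
```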
In one possible implementation manner, in a case that a write request sent by an upstream port is received by a downstream port in the computing cluster, after the write request is sent to a local memory or the write request is forwarded to the downstream port, the collection is determined to be completed, and a system fence completion signal is returned to the upstream port.
In this implementation, the downstream port returns a system fence complete signal to the upstream port immediately after sending or forwarding the write request, rather than waiting for the actual write operation to complete. This allows the upstream port to continue its execution flow without being blocked on completion of the write operation.
In the embodiment of the disclosure, under the condition that the downstream port receives the write request sent by the upstream port, after the write request is sent to the local memory or forwarded to the downstream port, the collection completion is determined, that is, a system fence completion signal can be returned to the upstream port, so that the waiting time can be reduced, the resources can be more effectively utilized, the performance can be improved, and the cost for recording the request waiting for completion is reduced.
In one possible implementation, the request initiating processor adds an advance acknowledgement identifier to the write request and sends the advance acknowledgement identifier to the downstream port, where the advance acknowledgement identifier is used to instruct the downstream port to return a system fence completion signal to the upstream port after the write request is sent to the local memory or the write request is forwarded to the downstream port.
In this implementation, the request initiating processor may add an advance acknowledgement identifier to the write request, where the identifier enables the processor to control whether the downstream port needs to return to the system fence completion signal in advance after sending the write request, or needs to wait for the write request to actually arrive at the memory and then return to the system fence completion signal when sending the write request.
Specifically, when the processor is sending a write request, an advance acknowledge flag is added to the request. The identification is a control signal that indicates the behavior of the downstream port in processing the write request.
When the downstream port receives the write request with the advance confirmation mark, the system fence completion signal is returned to the upstream port immediately after the write request is sent to the local memory or forwarded to the downstream port. So that the upstream processor continues its execution flow without having to wait for the actual write operation to complete.
If the write request does not have the advance confirmation mark, or according to the configuration or strategy of the system, the downstream port waits for the write request to reach the memory and successfully execute, and then returns a completion signal to the upstream port.
In the embodiment of the disclosure, the system is more flexible by adding the advance confirmation mark, and the processor can select whether to enable the advance confirmation mechanism according to the needs so that the system can perform performance optimization or data consistency assurance according to specific application scenes and needs.
In one possible implementation manner, a plurality of data paths exist between the first data processing board and the second data processing board. In each data path, after receiving a system fence completion signal, the port of the data processing board directly connected with the second data processing board sends a write completion signal to a port of the second data processing board, wherein each data path corresponds to one write completion signal. After receiving the write completion signal, the port of the second data processing board sends the write completion signal to the mark address of the local memory upon receiving the signal returned by the local memory indicating that the data was written successfully. The memory of the second data processing board updates a mark in the memory based on the write completion signal, where the mark is used to indicate the number of write completion signals received. The data consumption processor determines, based on the mark, whether the data writing is completed, and reads the data written in the memory when it determines that the data writing is completed.
In this implementation, the write complete signal is a signal triggered by the system fence instruction to signal the data consuming processor that the data has been successfully written to the memory of the second data processing board card.
When an upstream port in the cluster receives a system fence completion signal returned by a downstream port, the downstream port indicates that the downstream port has asserted that the data processing card has completed processing a write request (completed forwarding the write request or storing data to local memory), and therefore, the upstream port may send a write completion signal to the downstream port that returns the system fence completion signal. For the case that a plurality of data processing boards are arranged in one data path, each data processing board can send a write-in completion signal to a downstream port according to the logic, so that the recursion transfer of the write-in completion signal is realized.
And the downstream port waits for a signal that the data returned by the local memory is successfully written after receiving the writing completion signal, and when the signal that the data returned by the local memory is successfully written is received, the processing of the writing request is truly completed by the local memory, and then the writing completion signal can be sent to the marking address of the local memory to inform the data consumption processor that the data is successfully written into the local memory.
Illustratively, assume that there is one data path whose connection order between the data processing boards is 0->3->1->2, where 0 is the first data processing board and 2 is the second data processing board. The recursive transfer of the write completion signal is then:
when the data processing board 0 confirms that all the write requests have been sent to the data processing board 3 (a system fence completion signal from the board 3 is received, indicating that the board 3 has completed processing the write requests), the data processing board 0 will send a write completion signal to the port of the data processing board 3;
After receiving the write completion signal from the upstream (i.e., the data processing board 0), the data processing board 3 waits for all write requests to reach the data processing board 1 downstream (receiving the system fence completion signal from the board 1, indicating that the board 1 has completed processing the write request), and after receiving the system fence completion signal of the board 1, the data processing board 3 transmits the write completion signal to the data processing board 1;
After receiving the writing completion signal from the data processing board card 3, the data processing board card 1 waits until all writing requests reach the downstream data processing board card 2, and receives the system fence completion signal from the board card 2, and the board card 1 further transmits the writing completion signal to a port of the data processing board card 2;
Finally, after the data processing card 2 receives the write complete signal from the card 1, it waits until all the previous write requests have successfully entered its local memory. Upon confirming that all write requests have been stored, the data processing card 2 will write a write complete signal to its local memory, indicating that the write operation is complete on the entire data path.
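The recursive hand-off on this 0->3->1->2 path can be shown with a trivial sketch; the list-based chain and the print statements are illustrative assumptions, and the per-hop wait for the downstream fence completion is only noted in a comment.

```python
# Sketch of the recursive hand-off: each board forwards the write completion signal to
# its downstream neighbour, and the last board stores it at the flag address.

chain = ["board0", "board3", "board1", "board2"]

def pass_write_completion(chain):
    for upstream, downstream in zip(chain, chain[1:]):
        # Each hop waits for its downstream fence completion before forwarding (omitted here).
        print(f"{upstream} -> {downstream}: write completion signal")
    print(f"{chain[-1]}: write completion signal stored at the flag address in local memory")

pass_write_completion(chain)
```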
In the case that there are multiple data paths between the first data processing board and the second data processing board, multiple parallel paths trigger the second data processing board to execute write operations; for example, the memory may execute multiple write operations in parallel, and the ports of the data processing boards directly connected to the second data processing board in these parallel paths each send a write completion signal to the memory, that is, multiple write completion signals are received by the memory. In this implementation manner, each data path corresponds to one write completion signal; the specific implementation may refer to the possible implementations provided in the present disclosure and is not described herein again.
To facilitate counting the number of received write completion signals, the number of received write completion signals may be indicated by a flag in the memory. Then, when the memory receives the write completion signal, the flag in the memory may be updated based on the write completion signal. The updating of the tag may be an atomic operation to ensure data consistency in parallel writing, which may be, for example, an atomic plus one, an atomic multiply, an atomic plus two, etc. operation, which is not limiting of the present disclosure.
The data consuming processor periodically checks the tag to determine whether the data has been completely written; when the value of the tag reaches the agreed updated value (e.g., the original value plus the number of paths involved in the write), the consuming processor considers the data to have been completely written and can safely read the data in memory.
In the computing clusters provided by the present disclosure, the processing of data follows a "producer-consumer" model, the producer of the data generates and writes the data to the remote memory, then confirms that the data has arrived at the remote memory by a write completion signal in the system fence instruction, and maintains a flag in the remote memory indicating the number of write completion signals received. The consumer of the data then reads the value of the tag periodically until the value of the tag is updated to the agreed value, and the consumer confirms that the data can be safely read.
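A minimal sketch of this producer-consumer agreement follows: the producer commits the data and then the agreed number of write completion signals, and the consumer keeps re-reading the flag until it reaches the agreed value. The names, the agreed value, and the single-threaded ordering are assumptions made for illustration.

```python
# Producer writes data, then signals completion once per path; consumer polls the flag.

remote_memory = {"flag": 0, "data": None}
AGREED_VALUE = 2                        # original flag value plus the number of paths written

def producer():
    remote_memory["data"] = "payload"   # data written to the remote memory first
    remote_memory["flag"] += 1          # write completion signal via path A
    remote_memory["flag"] += 1          # write completion signal via path B

def consumer():
    while remote_memory["flag"] < AGREED_VALUE:
        pass                            # the consumer polls the flag periodically
    return remote_memory["data"]

producer()
print(consumer())                       # "payload" -- safe to read once the flag reaches 2
```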
As described above, the downstream port may return a system fence completion signal to the upstream port after sending the write request to the local memory or forwarding it to a further downstream port. For this implementation, because the starting point and each forwarding node along the way allow the acknowledgement signal to be returned before the write request reaches its real end point, the starting point and each forwarding node can reduce the amount of resources used to track outstanding write requests, and communication time can be saved significantly.
Specifically, fig. 2 shows a flowchart of the data processing method provided in the present disclosure in which an acknowledgement signal is not returned in advance. The specific process is as follows: a processing unit (multiprocessor, MP) of GPU0 sends a write request to the interconnect port; the interconnect port forwards the write request to GPU1 to write into the local memory (DDR) of GPU1; after the write is completed, the memory returns a confirmation signal to the interconnect port; the interconnect port returns a system fence completion signal to the GPU0 MP; the GPU0 MP sends a write completion signal to the interconnect port; the interconnect port sends the write completion signal to the GPU1 DDR; and the GPU1 DDR updates the mark. This process takes a total of 6 time periods, during which the data consuming processor waits and periodically checks the tag to determine whether the write request is complete.
Fig. 3 is a flowchart of the data processing method provided in the present disclosure in which an acknowledgement signal is returned in advance. The specific process is as follows: the GPU0 MP sends a write request to the interconnect port; the interconnect port forwards the write request to GPU1 to write into the local memory (DDR) of GPU1, and returns a system fence completion signal to the GPU0 MP while the forwarding completes, i.e. steps ② and ③ are performed almost simultaneously and may be a parallel operation; the GPU0 MP sends a write completion signal to the interconnect port, and the memory system returns a confirmation signal to the interconnect port after the data has been written to the GPU1 DDR, i.e. steps ④ and ⑤ are performed almost simultaneously; the interconnect port sends the write completion signal to the GPU1 DDR; and the GPU1 DDR updates the flag. This process takes a total of 4 time periods.
Comparing the two flows above, in the embodiment of the present disclosure, since the acknowledgement signal of the write operation may be returned in advance, "the port sends the write request to the memory of the second data processing board and writes the data" and "the port returns the system fence completion signal" can be performed synchronously, and "the upstream port sends the write completion signal to the downstream port" and "the memory returns the write request acknowledgement signal to the port" can be performed simultaneously. The time consumption of the processing flow is therefore reduced, the performance of remote access is significantly improved, and the overhead of the data consumption processor waiting for write requests to complete is reduced.
In one possible implementation manner, after receiving a system fence completion signal returned by a downstream port, an upstream port in each data path transmits a write completion signal to the downstream port. When the upstream port corresponds to a plurality of downstream ports, the upstream port transmits the write completion signal to each downstream port respectively; when a plurality of upstream ports correspond to one downstream port, the downstream port collects each write completion signal transmitted by the upstream ports, so that each data path corresponds to one write completion signal. The local memory performs an atomic operation on the mark in the memory to update the mark each time it receives a write completion signal. The data consumption processor reads the data written in the memory when it determines that the mark in the memory indicates that K write completion signals have been received, where K is the number of data paths and K is a positive integer.
As described above, the data processing board in each data path recursively transfers the write completion signal through the port, and there may be an intersection between the plurality of data paths, for example, when the upstream port corresponds to the plurality of downstream ports, the upstream port may send the write completion signal to each of the downstream ports, so that each of the downstream paths has one write completion signal, and when the plurality of upstream ports corresponds to one of the downstream ports, the downstream port collects each write completion signal sent by the upstream port and recursively transfers the collected write completion signals to the port of the second data processing board, so that each of the data paths corresponds to one write completion signal.
A plurality of write-completion signals are received in a local memory of the second data processing board card, and the local memory performs an atomic operation on a specific mark in the memory after receiving the write-completion signals. An atomic operation is an uninterruptible operation, i.e., it is not interrupted by other threads or processors during execution, thereby ensuring consistency and correctness of data. The atomic operation may be an add operation to the tag, or any other operation that reflects the number of write complete signals, so long as the tag can be used to record the number of write complete signals received.
In this implementation, the data consumption processor will check the value of this flag to determine if all write complete signals have been received. Where K represents the number of data paths from the first data processing board to the second data processing board. Since there may be multiple data paths that can complete the transfer of data, the system needs to ensure that write complete signals are received on all data paths.
When the data consuming processor checks that the flag in the memory indicates that K write completion signals have been received, the data can be considered to have been written and the data in the memory can be safely read. The data is considered complete and consistent only when the write complete signal is received on all paths. Thus, the data consuming processor can accurately know when the data can be safely read, thereby avoiding the problem of data inconsistency.
In one possible implementation manner, there is a data path among the data paths that does not send a write request. A local port in the first data processing board sends a system fence instruction to a port in the data path that does not send a write request. After receiving the system fence instruction, the port in that data path sends a write completion signal to its downstream port, and the write completion signal is recursively transferred to a port of the second data processing board. After receiving the write completion signal transferred by the data path that did not send a write request, the port of the second data processing board sends the write completion signal to the mark address of the local memory.
It should be noted that the write request may reach the memory through only a subset of the data paths. Since the memory cannot tell over which data path a given write request was transmitted, the request initiating processor sends the system fence instruction to the ports of all data paths, so that every data path yields one write completion signal. Once the memory has collected the write completion signals of all data paths, the data consumption processor can read the data written into the memory.
For a data path that has not sent a write request, the local port in the first data processing board can likewise send the system fence instruction to a port in that path. After receiving the system fence instruction, that port can directly send a write completion signal to its downstream port, and the write completion signal is recursively transferred to the port of the second data processing board.
Since the port of the second data processing board on such a path has not received a write request either, it can send the write completion signal directly to the tag address of the local memory after receiving the write completion signal transferred over the data path that did not send a write request.
In this way, each data path that has not sent a write request still contributes one write completion signal, so the number of write completion signals received in the memory of the second data processing board matches the number of data paths. The completeness and consistency of the data can then be judged from the number of write completion signals, and the data consumption processor knows exactly when the data can be safely read, avoiding data inconsistency.
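As a rough sketch of this per-path rule, the C++ fragment below shows a port that has forwarded no write requests reacting to the system fence instruction; the structure PortState and the callback name send_write_complete_downstream are hypothetical and stand in for whatever hardware mechanism actually passes the signal toward the second board.

```cpp
#include <functional>

// Illustrative per-port state on a data path that carried no write requests.
struct PortState {
    int outstanding_write_acks = 0;                         // acks still expected here
    std::function<void()> send_write_complete_downstream;   // hop toward the second board's port
};

// On receiving the system fence instruction: if nothing was ever written
// through this path, forward a write completion signal immediately so the
// path still contributes exactly one of the K signals expected at the tag.
void on_system_fence_instruction(PortState& port) {
    if (port.outstanding_write_acks == 0 && port.send_write_complete_downstream) {
        port.send_write_complete_downstream();
    }
    // Otherwise the normal collect-then-complete logic applies (not shown).
}
```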
In addition, there are multiple ways to deliver the write completion signal over a data path that has not sent a write request. In another example, after receiving the system fence instruction, any data processing board in such a path may continue to transfer the system fence instruction to its downstream port. Because that downstream port has not received a write request, it returns a system fence completion signal to the upstream port as soon as it receives the system fence instruction. The upstream port then sends a write completion signal to the downstream port based on the system fence completion signal, so the write completion signal is transferred hop by hop and eventually reaches the port of the second data processing board. Since that port has not received a write request, it can send the write completion signal directly to the memory.
In one possible implementation manner, the write request contains a group identifier. Based on the group identifier, the local port sends a system fence instruction containing the same group identifier to each downstream port that received the write request. When a downstream port in the computing cluster receives a system fence instruction sent by an upstream port, it collects, based on the group identifier in the instruction, the acknowledgement signals of the write requests corresponding to that group identifier, and returns a system fence completion signal containing the group identifier to the upstream port once collection is complete.
In a virtualized or multitasking parallel environment, multiple mutually independent groups of write requests may be in flight on the system bus and interconnect ports at the same time. A grouping concept (the group identifier) can therefore be used to manage and synchronize the write requests and system fence instructions of different groups.
In particular, the write request may include a group identifier, where the group identifier is used to distinguish between different groups of write requests. In this way, the system is able to manage and synchronize write requests for a particular group based on the group identification.
Based on the group identifier in the write request, the local port sends the system fence instruction to all downstream ports that received write requests of that group. The system fence instruction carries the same group identifier so that the downstream ports can correctly identify and process it.
After receiving the system fence instruction sent by the upstream port, the downstream port collects, based on the group identifier in the instruction, the acknowledgement signals of the write requests corresponding to that group identifier. Only when all acknowledgement signals for the write requests associated with the group identifier have been collected will the downstream port return a system fence completion signal containing the same group identifier to the upstream port.
Where multiple groups are defined, the system fence processing logic in the interconnect logic needs a corresponding group-recording function. One or more data structures (e.g., tables or hash tables) may be maintained specifically to record the status of each group identifier (e.g., the number of write requests received and the number of acknowledgement signals collected). In this way, the system can correctly handle and manage different groups of write requests and system fence instructions based on the group identifier.
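A minimal sketch of such per-group bookkeeping is given below in C++, assuming a hash table keyed by the group identifier; the names GroupFenceState, on_write_request and on_write_ack are illustrative assumptions, and real interconnect logic would keep an equivalent record in hardware tables rather than a software map.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative per-group bookkeeping for the fence-processing logic.
struct GroupFenceState {
    uint32_t writes_forwarded = 0;    // write requests seen with this group id
    uint32_t acks_collected   = 0;    // acknowledgement signals received so far
    bool     fence_pending    = false; // a system fence instruction is waiting
};

std::unordered_map<uint32_t, GroupFenceState> group_table;

// Record a forwarded write request belonging to a group.
void on_write_request(uint32_t group_id) {
    ++group_table[group_id].writes_forwarded;
}

// Returns true when the fence for this group may be completed, i.e. every
// write request carrying the group id has been acknowledged.
bool on_write_ack(uint32_t group_id) {
    GroupFenceState& g = group_table[group_id];
    ++g.acks_collected;
    return g.fence_pending && g.acks_collected == g.writes_forwarded;
}
```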
For example, in a virtualized environment, there may be multiple virtual machines running simultaneously and generating a large number of write requests. By using group identification, the system can group write requests belonging to the same virtual machine and apply a specific system fence instruction only to the group, thereby improving synchronization efficiency and performance.
Furthermore, in a multitasking parallel environment, multiple tasks may access a shared resource at the same time and generate write requests. By assigning different group identifications to different tasks, the system can ensure data synchronization and consistency for each task while avoiding unnecessary synchronization overhead.
In the embodiment of the disclosure, by introducing the group identifier, grouping requests according to the group identifier, and executing system fence instructions per group, multiple mutually independent groups of write requests can be effectively managed and synchronized in complex environments such as virtualization and multitasking parallelism. Data synchronization and consistency are guaranteed for each task in such environments, and the performance and reliability of the system are improved.
In one possible implementation manner, after receiving a data write failure signal returned by the local memory, a downstream port in the computing cluster returns a system fence failure signal to its upstream port, and the local port of the first data processing board returns the system fence failure signal to the request initiating processor when it receives such a signal.
Since a downstream port may return the acknowledgement signal of a write request in advance, immediately after forwarding it, errors must still be reported if the write request subsequently fails.
Specifically, if an unrecoverable error (e.g., a bit error or packet loss) is encountered while attempting to write data to the local memory, a system fence failure signal is sent to the downstream port that issued the write request. This signal contains information about the write failure and, if the grouping mechanism is used, the corresponding group identifier.
After receiving the system fence failure signal, the downstream port forwards it to its upstream port. Each upstream port, acting as a relay on the write request's path, continues to pass the failure signal upstream until it reaches the original local port of the first data processing board.
Upon receiving the system fence failure signal, the local port of the first data processing board recognizes that it is a failure signal associated with a particular write request or group of write requests, and returns it directly to the request initiating processor so that the processor can learn the details and cause of the write failure.
In the embodiment of the disclosure, after a downstream port receives a data write failure signal returned by the local memory, it returns a system fence failure signal to its upstream port, and the local port of the first data processing board returns the system fence failure signal to the request initiating processor when it receives such a signal. Write errors can therefore be reported level by level: even in a complex computing cluster, an error occurring at any hop is effectively propagated back to the request initiating processor, triggering the corresponding error handling flow. This improves the robustness and reliability of the system and allows timely responses and countermeasures when errors occur.
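Purely as an illustration of this hop-by-hop relay, the C++ sketch below walks a chain of port links toward the request initiating processor; PortLink, notify_processor and propagate_fence_result are hypothetical names, and the disclosed system performs the equivalent relay in interconnect hardware rather than in a software loop.

```cpp
// Illustrative relay of a fence result (success or failure) hop by hop
// toward the request initiating processor.
enum class FenceResult { Complete, Failed };

struct PortLink {
    PortLink* upstream = nullptr;                      // next hop toward the initiator
    void (*notify_processor)(FenceResult) = nullptr;   // set only on the local port
};

void propagate_fence_result(PortLink* port, FenceResult result) {
    while (port != nullptr) {
        if (port->notify_processor != nullptr) {       // reached the local port
            port->notify_processor(result);            // e.g. FenceResult::Failed
            return;
        }
        port = port->upstream;                         // relay one level upstream
    }
}
```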
An application scenario of the embodiments of the present disclosure is described below. Fig. 4 illustrates a block diagram of a computing cluster provided by the present disclosure. The application scenario is a 4-card computing cluster in which each card comprises a plurality of processors and 4 interconnect ports; the links between the cards are shown in the figure.
Card 0 needs to send a set of data to card 2. There are three paths from card 0 to card 2: card 0 connects directly to card 2, card 0 forwards through card 1 to card 2, and card 0 forwards through card 3 to card 2.
The write requests of the processor in card 0 arrive at card 2 via the direct communication path and via the path forwarded by card 3.
The following procedure is performed on card 0:
1. The processor of card 0 issues write requests, which reach card 2 through interconnect port 1 and card 3 through interconnect port 2;
2. Interconnect port 1 and interconnect port 2 return acknowledgement signals of the write requests to the processor of card 0;
3. The processor issues a system fence instruction, which is sent to card 2 and card 3 according to the recorded port information;
4. After processing on card 2 and card 3 is complete, the system fence completion signals return to interconnect ports 1 and 2 of card 0;
5. The processor of card 0 determines that the issued system fence instruction has completed and returns a system fence completion signal to the processor core, where the processor core includes the processing unit (multiprocessor, MP) of a GPU or a processor core in a CPU.
The following procedure is performed on card 3:
1. The write request reaches card 3 and is forwarded to card 2;
2. Card 3 receives the system fence instruction from card 0 and detects that write requests have been sent to card 2 through interconnect port 1;
3. The system fence instruction is sent to card 2 through interconnect port 1;
4. After the system fence completion signal from card 2 returns via interconnect port 1, card 3 returns a system fence completion signal to card 0 through interconnect port 0.
The following procedure is performed on card 2:
1. card 0 direct path:
a) A direct write request from interconnect port 1 of card 0 to interconnect port 1 of card 2;
b) The system fence instruction from interconnect port 1 of card 0 reaches interconnect port 1 of card 2;
c) The interconnect port 1 of card 2 receives a confirmation of the write request (all received from card 0) from the local memory;
d) The interconnect port 1 of card 2 returns a system fence complete signal to card 0.
2. Card 3 forward path:
a) A direct write request from interconnect port 1 of card 3 to interconnect port 2 of card 2;
b) System fence instruction from interconnect port 1 of card 3 reaches interconnect port 2 of card 2;
c) The interconnect port 2 of card 2 receives the acknowledgement of the write request (all received from card 3) from the local memory;
d) Interconnect port 2 of card 2 returns a system fence completion signal to card 3 in response to the system fence instruction received from card 3;
The two processing paths of card 2 are parallel.
The implementation for updating the tag in the memory based on the write completion signal includes the following steps:
Based on the same topology as above, card 0 needs to send a set of data to card 2. There are three paths from card 0 to card 2: card 0 connects directly to card 2, card 0 forwards through card 1 to card 2, and card 0 forwards through card 3 to card 2. The write data of card 0 arrives at card 2 through the direct path and through the path forwarded by card 3.
The following procedure is performed on card 0:
1. The card 0 sends out a write request to reach the card 2 through the interconnection port 1 and to reach the card 3 through the interconnection port 2;
2. The interconnection port 1 and the interconnection port 2 return acknowledgement signals of the write requests to the processor;
3. Card 0 issues a system fence instruction; according to the detected topology and the fact that the address targeted by the system fence instruction belongs to card 2, the instruction is broadcast to ports 0, 1 and 2, each of which may reach card 2;
4. The processor in card 0 collects the system fence completion signals from ports 0, 1 and 2, and after collection is complete, sends a write completion signal to port 0 of card 1, port 1 of card 2 and port 0 of card 3;
5. The processor in card 0 determines that all issued system fence instructions have returned system fence completion signals, and returns a system fence completion signal to the processor core.
The following procedure is performed on card 1:
1. Card 1 receives the system fence instruction from card 0 and forwards it to card 2 through interconnect port 1, because the address targeted by the system fence instruction is on card 2;
2. After the system fence completion signal from card 2 returns via interconnect port 1, card 1 returns a system fence completion signal to card 0 through interconnect port 0;
3. The write completion signal from card 0 is received through port 0; since card 1 did not receive any write request, the signal is forwarded directly through port 1 to port 0 of card 2.
The following procedure is performed on card 3:
1. The write request reaches card 3 and is forwarded to card 2;
2. Card 3 receives the system fence instruction from card 0 and forwards it to card 2 through interconnect port 1, because the forwarded write request targets an address on card 2;
3. After the system fence completion signal from card 2 returns via interconnect port 1, card 3 returns a system fence completion signal to card 0 through interconnect port 0;
4. The write completion signal from card 0 is received through port 0; since the system fence completion signal returned from port 2 of card 2 has already been received, the write completion signal is transmitted through port 1 to port 2 of card 2.
The following procedure is performed on card 2:
1. card 0 direct path:
a) A direct write request from interconnect port 1 of card 0 to interconnect port 1 of card 2;
b) The system fence instruction from interconnect port 1 of card 0 reaches interconnect port 1 of card 2;
c) The interconnect port 1 of card 2 receives a confirmation of the write request (all received from card 0) from the local memory;
d) The write completion signal received from interconnect port 1 of card 0 is sent to the destination tag address on card 2;
e) Interconnect port 1 of card 2 returns a system fence completion signal to card 0 in response to the system fence instruction received from card 0.
2. Card 1 forwarding path:
a) The system fence instruction and write completion signal from interconnect port 1 of card 1 reach interconnect port 0 of card 2;
b) Because no write request was forwarded from card 1, interconnect port 0 of card 2 directly sends the write completion signal to the destination tag address on card 2;
c) Interconnect port 0 of card 2 returns a system fence completion signal to card 1 in response to the system fence instruction received from card 1.
3. Card 3 forward path:
a) A direct write request from interconnect port 1 of card 3 to interconnect port 2 of card 2;
b) System fence instruction from interconnect port 1 of card 3 reaches interconnect port 2 of card 2;
c) The interconnection port 2 of the card 2 receives all the acknowledgements of write requests received from the card 3;
d) The write completion signal received from interconnect port 1 of card 3 is sent to the destination tag address on card 2;
e) Interconnect port 2 of card 2 returns a system fence completion signal to card 3 in response to the system fence instruction received from card 3.
Consumer process on card 2:
After steps 1. d), 2. b) and 3. d) on card 2 have all completed, card 2 reads the value at the tag address; when the expected value K=3 is read, the consumer process on card 2 can begin its read/write operations.
The present disclosure also provides a data processing method. Fig. 5 shows a flowchart of the data processing method according to an embodiment of the present disclosure. As shown in fig. 5, the method is applied to a computing cluster comprising a plurality of data processing boards and includes:
in step S51, after the request initiating processor in the first data processing board sends a write request to the memory of the second data processing board in the computing cluster, transmitting a system fence instruction to a local port that sends the write request, so as to transmit the system fence instruction to each downstream port that receives the write request through the local port;
In step S52, if a downstream port in the computing cluster receives the system fence instruction sent by an upstream port, acquiring the system fence instruction, executing the system fence instruction to collect the acknowledgement signals of the write request, and, if collection is completed, returning a system fence completion signal to the upstream port;
In step S53, when the local port of the first data processing board collects the acknowledgement signal of the write request, a system fence completion signal is returned to the request initiating processor to indicate that all write requests are performed.
The execution body of the method may be a module in the processor, or a module in the interconnect coherence processing logic responsible for communication among the interconnect ports.
For example, for an implementation in which the modules execute within the processor, one module within the processor may be responsible for receiving write requests from the various ports and recording the status of those requests. When the conditions of the system fence are met (e.g., all relevant write requests are completed), the module generates a system fence complete signal and sends it to the corresponding port via the processor's communication mechanism. If an error occurs during the execution of the write request, the module is also responsible for generating a system fence failure signal and forwarding the signal to the corresponding port.
For an implementation as a module inside the interconnect coherence processing logic, that logic is responsible for communication among the interconnect ports, including the sending and receiving of write requests and system fence instructions. When a processor issues a write request to the interconnect coherence processing logic, the internal module sends the write request to the corresponding ports, sends the system fence instruction to those ports, and generates a system fence completion signal once all relevant write requests have completed. If an error occurs during the execution of a write request, the interconnect coherence processing logic generates a system fence failure signal and sends it to all relevant processors over the interconnect network.
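As a rough sketch of the bookkeeping such a module needs, the C++ fragment below records which downstream ports received forwarded write requests since the previous fence, fans the fence out to exactly those ports, and reports completion once every one of them has answered; the class name FenceTracker and its methods are assumptions introduced for illustration, not the disclosed implementation.

```cpp
#include <cstdint>
#include <set>

// Illustrative tracker kept per board/port by the fence-processing logic.
struct FenceTracker {
    std::set<uint32_t> ports_written_since_last_fence;  // downstream ports hit by writes
    std::set<uint32_t> ports_awaiting_completion;       // still owe a completion signal

    // Record every downstream port a write request was forwarded to.
    void on_write_request_forwarded(uint32_t downstream_port) {
        ports_written_since_last_fence.insert(downstream_port);
    }

    // On receiving the system fence instruction: fan it out to exactly the
    // ports recorded since the previous fence, then start a new interval.
    std::set<uint32_t> on_system_fence_instruction() {
        ports_awaiting_completion = ports_written_since_last_fence;
        ports_written_since_last_fence.clear();
        return ports_awaiting_completion;  // send the fence to these ports
    }

    // Each returned system fence completion signal removes one pending port;
    // true means the fence can now be reported complete to the upstream side.
    bool on_fence_completion(uint32_t downstream_port) {
        ports_awaiting_completion.erase(downstream_port);
        return ports_awaiting_completion.empty();
    }
};
```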
In one possible implementation, the acknowledgement signal of the write request includes an acknowledgement signal that the local memory performed a write operation on the write request, and/or a system fence completion signal returned by the downstream port that characterizes the downstream port has collected the acknowledgement signal of the corresponding write request.
In one possible implementation, the collecting the acknowledgement signal of the write request includes:
if the local memory corresponding to the port processes the write request, collecting an acknowledgement signal of the local memory performing the write operation for the write request; and/or, if the write request is forwarded to a downstream port, forwarding the system fence instruction to the downstream port receiving the write request, and collecting a system fence completion signal returned by the downstream port receiving the write request.
In one possible implementation, the information recorded in the core exit of the request initiating processor includes port information of the local ports that issued the write request in the time interval from the issuance of the previous system fence instruction to the issuance of the current system fence instruction, and the request initiating processor transmits the system fence instruction to the corresponding local ports based on the recorded port information.
In one possible implementation manner, the information recorded in a port of a data processing board of the computing cluster that has sent the write request includes port information of the downstream ports accessed by the issued write requests in the time interval from the issuance of the previous system fence instruction to the issuance of the current system fence instruction;
and the port in the computing cluster that has sent the write request transmits the system fence instruction, based on the recorded port information, to the downstream port corresponding to that port information.
In one possible implementation, the method further includes, upon receiving a write request sent by an upstream port, returning a system fence completion signal to the upstream port after sending the write request to local memory or forwarding the write request to a downstream port.
In one possible implementation manner, transmitting the system fence instruction to each downstream port receiving the write request includes adding an advance acknowledgement identifier to the write request and sending it to the downstream port, where the advance acknowledgement identifier instructs the downstream port to determine that collection is complete, and to return a system fence completion signal to the upstream port, after the write request has been sent to local memory or forwarded to a further downstream port.
In one possible implementation, the write request includes a group identifier, and the transmitting the system fence instruction to each downstream port that receives the write request includes:
based on the group identifier, a system fence instruction is sent to each downstream port receiving the write request, wherein the system fence instruction contains the same group identifier;
The acquiring the system fence instruction, executing the system fence instruction to collect the acknowledgement signal of the write request, and returning a system fence completion signal to an upstream port if collection is completed, including:
and under the condition that a system fence instruction sent by an upstream port is received, collecting a confirmation signal of a write request corresponding to a group identifier based on the group identifier in the system fence instruction, and under the condition that the collection is completed, returning a system fence completion signal containing the group identifier to the upstream port.
In one possible implementation, the method further includes:
after receiving a data write failure signal returned by the local memory, a downstream port in the computing cluster returns a system fence failure signal to the upstream port; and the local port of the first data processing board returns the system fence failure signal to the request initiating processor when it receives the system fence failure signal.
In addition, the disclosure further provides a data processing apparatus, an electronic device, a computer-readable storage medium, and a program, each of which can be used to implement any of the data processing systems provided in the disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.
FIG. 6 illustrates a block diagram of a data processing apparatus according to an embodiment of the present disclosure, as shown in FIG. 6, applied to a computing cluster including a plurality of data processing boards, the apparatus 60 comprising:
a sending module 61, configured to, after a request initiating processor in a first data processing board sends a write request to a memory of a second data processing board in the computing cluster, transmit a system fence instruction to a local port in the first data processing board, where the write request is sent, so as to transmit the system fence instruction to each downstream port that receives the write request through the local port of the first data processing board;
A collecting module 62, configured to obtain the system fence instruction when a downstream port in the computing cluster receives the system fence instruction sent by an upstream port, execute the system fence instruction to collect the acknowledgement signals of the write request, and return a system fence completion signal to the upstream port when collection is completed;
and the return module 63 is configured to return a system fence completion signal to the request initiating processor to indicate that all write requests are performed completely, when the local port of the first data processing board collects a confirmation signal of the write request.
In one possible implementation, the acknowledgement signal of the write request includes an acknowledgement signal that the local memory performed a write operation on the write request, and/or a system fence completion signal returned by the downstream port that characterizes the downstream port has collected the acknowledgement signal of the corresponding write request.
In one possible implementation, the collecting module is configured to:
if the local memory corresponding to the port processes the write request, collect an acknowledgement signal of the local memory performing the write operation for the write request; and/or, if the write request is forwarded to a downstream port, forward the system fence instruction to the downstream port receiving the write request, and collect a system fence completion signal returned by the downstream port receiving the write request.
In one possible implementation, the information recorded in the core exit of the request initiating processor includes port information of the local ports that issued the write request in the time interval from the issuance of the previous system fence instruction to the issuance of the current system fence instruction;
the request initiating processor transmits the system fence instruction to the corresponding local port based on the recorded port information.
The information recorded in a port of a data processing board of the computing cluster that has sent the write request includes port information of the downstream ports accessed by the issued write requests in the time interval from the issuance of the previous system fence instruction to the issuance of the current system fence instruction;
and the port in the computing cluster that has sent the write request transmits the system fence instruction, based on the recorded port information, to the downstream port corresponding to that port information.
In one possible implementation, the apparatus further includes an upstream return module configured to, when a write request sent by an upstream port is received, determine that collection is complete after the write request has been sent to local memory or forwarded to a downstream port, and return a system fence completion signal to the upstream port.
In one possible implementation, the sending module is configured to add an advance acknowledgement identifier to the write request and send the advance acknowledgement identifier to the downstream port, where the advance acknowledgement identifier is configured to instruct the downstream port to return a system fence completion signal to the upstream port after the write request is sent to the local memory or the write request is forwarded to the downstream port.
In one possible implementation manner, the write request includes a group identifier, and the sending module is configured to:
based on the group identifier, a system fence instruction is sent to each downstream port receiving the write request, wherein the system fence instruction contains the same group identifier;
The collection module is used for:
and under the condition that a system fence instruction sent by an upstream port is received, collecting a confirmation signal of a write request corresponding to a group identifier based on the group identifier in the system fence instruction, and under the condition that the collection is completed, returning a system fence completion signal containing the group identifier to the upstream port.
In one possible implementation, the apparatus further includes:
a failure signal return module, configured to return a system fence failure signal to the upstream port after a downstream port in the computing cluster receives a data write failure signal returned by the local memory, and to return the system fence failure signal to the request initiating processor when the local port of the first data processing board receives the system fence failure signal.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described system. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides electronic equipment, which comprises a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to call the instructions stored by the memory so as to realize the system.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 7 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server or terminal device. Referring to FIG. 7, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as the Microsoft server operating system (Windows Server™), Apple's graphical-user-interface-based operating system (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, punch cards or intra-groove protrusion structures such as those having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as SMALLTALK, C ++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
If the technical solution of the present application involves personal information, a product applying this technical solution clearly informs the individual of the personal-information processing rules and obtains the individual's independent consent before processing the personal information. If the technical solution involves sensitive personal information, a product applying it obtains the individual's separate consent before processing the sensitive personal information and also satisfies the requirement of "explicit consent". For example, a clear and conspicuous sign is placed at a personal-information collection device such as a camera to inform the individual that they are entering a collection range and that personal information will be collected; if the individual voluntarily enters the collection range, this is regarded as consent to the collection of their personal information. Alternatively, where a personal-information processing device uses a clear sign or notice to state the personal-information processing rules, personal authorization is obtained through a pop-up notice or by asking the individual to upload their personal information themselves. The personal-information processing rules may include information such as the personal-information processor, the purpose of processing, the processing method, and the types of personal information processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1.一种数据处理系统,其特征在于,应用于包括多个多级互联的数据处理板卡的计算集群,包括:1. A data processing system, characterized in that it is applied to a computing cluster comprising multiple interconnected data processing boards, including: 第一数据处理板卡中的请求发起处理器,在向所述计算集群中的第二数据处理板卡的存储器发送写请求后,将系统栅栏指令传输至所述第一数据处理板卡中发送所述写请求的本地端口;The request initiator in the first data processing board sends a write request to the memory of the second data processing board in the computing cluster, and then transmits a system fence command to the local port in the first data processing board that sent the write request. 所述第一数据处理板卡的本地端口,将所述系统栅栏指令传输至接收所述写请求的各下游端口;The local port of the first data processing board transmits the system fence command to each downstream port that receives the write request; 所述计算集群中的下游端口,在接收到上游端口发送的系统栅栏指令的情况下,执行所述系统栅栏指令,以收集所述写请求的确认信号,并在收集完成的情况下,向上游端口返回系统栅栏完成信号,其中,所述下游端口向对本端口发送系统栅栏指令的上游端口返回系统栅栏完成信号;所述计算集群中的下游端口,在接收到上游端口发送的写请求的情况下,在将所述写请求发送给本地存储器或将所述写请求转发给下游端口后,确定收集完成,向上游端口返回系统栅栏完成信号;When a downstream port in the computing cluster receives a system fence instruction sent by an upstream port, it executes the system fence instruction to collect confirmation signals for the write request. Upon completion of the collection, it returns a system fence completion signal to the upstream port. Specifically, the downstream port returns the system fence completion signal to the upstream port that sent the system fence instruction to it. Furthermore, when a downstream port in the computing cluster receives a write request sent by an upstream port, it determines that the collection is complete after sending the write request to local memory or forwarding the write request to the downstream port, and then returns a system fence completion signal to the upstream port. 所述第一数据处理板卡的本地端口,在收集完写请求的确认信号的情况下,向所述请求发起处理器返回系统栅栏完成信号,以指示所有写请求执行完毕;The local port of the first data processing board, after collecting the confirmation signals of the write requests, sends a processor return system fence completion signal to the request to indicate that all write requests have been completed. 其中,所述第一数据处理板卡到所述第二数据处理板卡之间存在多个数据通路;There are multiple data paths between the first data processing board and the second data processing board; 在每个数据通路中,与所述第二数据处理板卡直接相连的数据处理板卡的端口在接收到系统栅栏完成信号后,向第二数据处理板卡的端口发送写入完成信号;In each data path, after receiving the system fence completion signal, the port of the data processing board directly connected to the second data processing board sends a write completion signal to the port of the second data processing board. 所述第二数据处理板卡的端口在接收到写入完成信号后,在接收到本地存储器返回的数据写入成功的信号的情况下,将所述写入完成信号发送至本地存储器的标记地址中;After receiving the write completion signal, the port of the second data processing board, upon receiving a data write success signal returned by the local memory, sends the write completion signal to the marker address of the local memory. 所述第二数据处理板卡的存储器基于所述写入完成信号,更新存储器中的标记,所述标记用于指示接收到的写入完成信号的数量。The memory of the second data processing board updates the marker in the memory based on the write completion signal, the marker being used to indicate the number of write completion signals received. 2.根据权利要求1所述的系统,其特征在于,所述写请求的确认信号包括:本地存储器对写请求执行写操作的确认信号,和/或,下游端口返回的系统栅栏完成信号,所述系统栅栏完成信号表征下游端口已收集完对应的写请求的确认信号。2. 
The system according to claim 1, wherein the confirmation signal for the write request includes: a confirmation signal from the local memory for performing a write operation on the write request, and/or a system barrier completion signal returned by the downstream port, wherein the system barrier completion signal indicates that the downstream port has collected the confirmation signal for the corresponding write request. 3.根据权利要求1所述的系统,其特征在于,所述收集所述写请求的确认信号,包括:3. The system according to claim 1, wherein collecting the confirmation signal for the write request includes: 所述计算集群中的下游端口,在接收到上游端口发送的系统栅栏指令的情况下,若本端口对应的本地存储器对所述写请求执行处理,则收集本地存储器对写请求执行写操作的确认信号;和/或,In the computing cluster, a downstream port, upon receiving a system barrier command from an upstream port, if its corresponding local memory processes the write request, collects an acknowledgment signal from the local memory confirming the write operation; and/or, 若本端口将所述写请求转发至下游端口,则向接收写请求的下游端口转发所述系统栅栏指令,并收集接收写请求的下游端口返回的系统栅栏完成信号。If this port forwards the write request to a downstream port, it forwards the system fence instruction to the downstream port that received the write request and collects the system fence completion signal returned by the downstream port that received the write request. 4.根据权利要求1所述的系统,其特征在于,所述请求发起处理器的核心出口中记录的信息包括:自上一个系统栅栏指令发出以来至当前系统栅栏指令发出这一时间区间内,发出写请求的本地端口的端口信息;4. The system according to claim 1, wherein the information recorded in the core exit of the request initiating processor includes: port information of the local port that issued the write request during the time interval from the issuance of the previous system fence instruction to the issuance of the current system fence instruction; 所述请求发起处理器基于记录的端口信息,将所述系统栅栏指令传输至对应的本地端口。The request initiating processor transmits the system fence command to the corresponding local port based on the recorded port information. 5.根据权利要求1所述的系统,其特征在于,所述计算集群的数据处理板卡中发送过所述写请求的端口中记录的信息包括:自上一个系统栅栏指令发出以来至当前系统栅栏指令发出这一时间区间内,发出的写请求所访问的下游端口的端口信息;5. The system according to claim 1, wherein the information recorded in the port that sent the write request in the data processing board of the computing cluster includes: port information of the downstream port accessed by the written request during the time interval from the issuance of the previous system fence instruction to the issuance of the current system fence instruction; 所述计算集群中发送过写请求的端口基于记录的端口信息,将所述系统栅栏指令传输至与所述端口信息对应的下游端口。The port in the computing cluster that sends the write request transmits the system fence command to the downstream port corresponding to the recorded port information based on the recorded port information. 6.根据权利要求5所述的系统,其特征在于,所述请求发起处理器,在所述写请求中添加提前确认标识,并发送给下游端口;所述提前确认标识用于指示下游端口在将所述写请求发送给本地存储器或将所述写请求转发给下游端口后,向上游端口返回系统栅栏完成信号。6. The system according to claim 5, wherein the request initiating processor adds an early acknowledgment flag to the write request and sends it to the downstream port; the early acknowledgment flag is used to instruct the downstream port to return a system barrier completion signal to the upstream port after sending the write request to the local memory or forwarding the write request to the downstream port. 7.根据权利要求5所述的系统,其特征在于,所述每个数据通路均会对应一次写入完成信号;7. The system according to claim 5, wherein each data path corresponds to a write completion signal; 数据消费处理器基于所述标记,确定数据是否写入完成,在确定数据写入完成的情况下,读取存储器中写入的数据。Based on the flag, the data consumption processor determines whether the data writing is complete. If the data writing is complete, it reads the data written to the memory. 
8.根据权利要求7所述的系统,其特征在于,每个所述数据通路中的上游端口在接收到下游端口返回的系统栅栏完成信号后,向下游端口传递写入完成信号;其中,在上游端口对应多个下游端口时,上游端口分别向各下游端口发送写入完成信号,在多个上游端口对应一个下游端口时,下游端口收集上游端口发送的各写入完成信号,并将收集的多个写入完成信号递归传递至第二数据处理板卡的端口,以使得每个数据通路对应一次写入完成信号;8. The system according to claim 7, wherein, after receiving the system fence completion signal returned by the downstream port, the upstream port in each data path transmits a write completion signal to the downstream port; wherein, when the upstream port corresponds to multiple downstream ports, the upstream port sends a write completion signal to each downstream port respectively; when multiple upstream ports correspond to one downstream port, the downstream port collects each write completion signal sent by the upstream port and recursively transmits the collected multiple write completion signals to the port of the second data processing board, so that each data path corresponds to one write completion signal; 所述本地存储器在接收到一个写入完成信号的情况下,对存储器中的标记执行一次原子操作,以更新存储器中的标记;Upon receiving a write completion signal, the local memory performs an atomic operation on the tag in the memory to update the tag in the memory; 所述数据消费处理器,在确定所述存储器中的标记指示接收到了K个写入完成信号的情况下,读取所述存储器中写入的数据,其中K为所述数据通路的数量,K为正整数。The data consumption processor reads the data written to the memory when it determines that the marker in the memory indicates that K write completion signals have been received, where K is the number of data paths and K is a positive integer. 9.根据权利要求7所述的系统,其特征在于,所述数据通路中存在未发送写请求的数据通路;9. The system according to claim 7, wherein there are data paths in the data path that have not sent write requests; 所述第一数据处理板卡中的本地端口,向未发送写请求的数据通路中的端口发送系统栅栏指令;The local port in the first data processing board sends a system fence command to the port in the data path that has not sent a write request; 所述未发送写请求的数据通路中的端口在接收到系统栅栏指令后,向下游端口发送写入完成信号,将写入完成信号递归传递至第二数据处理板卡的端口;After receiving the system fence command, the port in the data path that did not send a write request sends a write completion signal to the downstream port and recursively passes the write completion signal to the port of the second data processing board. 所述第二数据处理板卡的端口,在接收到未发送写请求的数据通路传递的写入完成信号后,将所述写入完成信号发送至本地存储器的标记地址中。After receiving a write completion signal from a data path that has not sent a write request, the port of the second data processing board sends the write completion signal to the tag address of the local memory. 10.根据权利要求1所述的系统,其特征在于,所述写请求中包含组标识,所述本地端口基于所述组标识,将系统栅栏指令发往接收所述写请求的各下游端口,所述系统栅栏指令中包含相同的组标识;10. The system according to claim 1, wherein the write request includes a group identifier, and the local port sends a system fence instruction to each downstream port receiving the write request based on the group identifier, wherein the system fence instruction includes the same group identifier; 所述计算集群中的下游端口,在接收到上游端口发送的系统栅栏指令的情况下,基于系统栅栏指令中的组标识,收集与所述组标识对应的写请求的确认信号,并在收集完成的情况下,向上游端口返回包含所述组标识的系统栅栏完成信号。When a downstream port in the computing cluster receives a system fence instruction sent by an upstream port, it collects an acknowledgment signal for a write request corresponding to the group identifier based on the group identifier in the system fence instruction, and returns a system fence completion signal containing the group identifier to the upstream port when the collection is complete. 11.根据权利要求1所述的系统,其特征在于,所述计算集群中的下游端口,在接收到本地存储器返回的数据写入失败的信号后,向上游端口返回系统栅栏失败信号;11. 
The system according to claim 1, wherein the downstream port in the computing cluster, after receiving a data write failure signal returned by the local memory, returns a system barrier failure signal to the upstream port; 所述第一数据处理板卡的本地端口,在接收到系统栅栏失败信号的情况下,向所述请求发起处理器返回系统栅栏失败信号。When the local port of the first data processing board receives a system fence failure signal, it sends a system fence failure signal back to the processor initiating the request. 12.一种数据处理方法,其特征在于,应用于包括多个多级互联的数据处理板卡的计算集群,包括:12. A data processing method, characterized in that it is applied to a computing cluster comprising multiple interconnected data processing boards, including: 在第一数据处理板卡中的请求发起处理器向所述计算集群中的第二数据处理板卡的存储器发送写请求后,将系统栅栏指令传输至所述第一数据处理板卡中发送所述写请求的本地端口,以通过第一数据处理板卡的所述本地端口将所述系统栅栏指令传输至接收所述写请求的各下游端口;After the request initiating processor in the first data processing board sends a write request to the memory of the second data processing board in the computing cluster, it transmits a system fence instruction to the local port in the first data processing board that sent the write request, so as to transmit the system fence instruction to each downstream port that receives the write request through the local port of the first data processing board. 在计算机群中的下游端口接收到上游端口发送的系统栅栏指令的情况下,获取系统栅栏指令,执行所述系统栅栏指令,以收集所述写请求的确认信号,并在收集完成的情况下,向上游端口返回系统栅栏完成信号,其中,所述下游端口向对本端口发送系统栅栏指令的上游端口返回系统栅栏完成信号;所述计算集群中的下游端口,在接收到上游端口发送的写请求的情况下,在将所述写请求发送给本地存储器或将所述写请求转发给下游端口后,确定收集完成,向上游端口返回系统栅栏完成信号;When a downstream port in a computer cluster receives a system fence instruction sent by an upstream port, it acquires the system fence instruction, executes the system fence instruction to collect confirmation signals for the write request, and returns a system fence completion signal to the upstream port upon completion of collection. Specifically, the downstream port returns the system fence completion signal to the upstream port that sent the system fence instruction to it. When a downstream port in the computing cluster receives a write request sent by an upstream port, it determines that collection is complete after sending the write request to local memory or forwarding the write request to the downstream port, and then returns a system fence completion signal to the upstream port. 在所述第一数据处理板卡的本地端口收集完写请求的确认信号的情况下,向所述请求发起处理器返回系统栅栏完成信号,以指示所有写请求执行完毕;Once the local port of the first data processing board has collected the confirmation signal for the write request, a processor return system fence completion signal is sent to the request to indicate that all write requests have been completed. 其中,所述第一数据处理板卡到所述第二数据处理板卡之间存在多个数据通路;There are multiple data paths between the first data processing board and the second data processing board; 在每个数据通路中,与所述第二数据处理板卡直接相连的数据处理板卡的端口在接收到系统栅栏完成信号后,向第二数据处理板卡的端口发送写入完成信号;In each data path, after receiving the system fence completion signal, the port of the data processing board directly connected to the second data processing board sends a write completion signal to the port of the second data processing board. 所述第二数据处理板卡的端口在接收到写入完成信号后,在接收到本地存储器返回的数据写入成功的信号的情况下,将所述写入完成信号发送至本地存储器的标记地址中;After receiving the write completion signal, the port of the second data processing board, upon receiving a data write success signal returned by the local memory, sends the write completion signal to the marker address of the local memory. 
12. A data processing method, characterized in that it is applied to a computing cluster comprising a plurality of multi-level interconnected data processing boards, and comprises:
after a request-initiating processor in a first data processing board sends a write request to the memory of a second data processing board in the computing cluster, transmitting a system fence instruction to the local port of the first data processing board that sent the write request, so that the system fence instruction is transmitted, through the local port of the first data processing board, to each downstream port that received the write request;
when a downstream port in the computing cluster receives a system fence instruction sent by an upstream port, acquiring the system fence instruction and executing it to collect the acknowledgment signals of the write request, and, once collection is complete, returning a system fence completion signal to the upstream port, wherein the downstream port returns the system fence completion signal to the upstream port that sent the system fence instruction to it; a downstream port in the computing cluster, upon receiving a write request sent by an upstream port, determines that collection is complete after sending the write request to the local memory or forwarding the write request to a downstream port, and returns a system fence completion signal to the upstream port;
once the local port of the first data processing board has collected the acknowledgment signals of the write request, returning a system fence completion signal to the request-initiating processor to indicate that all write requests have been executed;
wherein a plurality of data paths exist between the first data processing board and the second data processing board;
in each data path, the port of the data processing board directly connected to the second data processing board, after receiving the system fence completion signal, sends a write completion signal to the port of the second data processing board;
the port of the second data processing board, after receiving the write completion signal and upon receiving a data write success signal returned by the local memory, sends the write completion signal to the marker address of the local memory;
the memory of the second data processing board updates, based on the write completion signal, the marker in the memory, the marker being used to indicate the number of write completion signals received.

13. A data processing apparatus, characterized in that it is applied to a computing cluster comprising a plurality of multi-level interconnected data processing boards, and comprises:
a sending module, configured to, after a request-initiating processor in a first data processing board sends a write request to the memory of a second data processing board in the computing cluster, transmit a system fence instruction to the local port of the first data processing board that sent the write request, so that the system fence instruction is transmitted, through the local port of the first data processing board, to each downstream port that received the write request;
a collection module, configured to, when a downstream port in the computing cluster receives a system fence instruction sent by an upstream port, acquire the system fence instruction and execute it to collect the acknowledgment signals of the write request, and, once collection is complete, return a system fence completion signal to the upstream port, wherein the downstream port returns the system fence completion signal to the upstream port that sent the system fence instruction to it; a downstream port in the computing cluster, upon receiving a write request sent by an upstream port, determines that collection is complete after sending the write request to the local memory or forwarding the write request to a downstream port, and returns a system fence completion signal to the upstream port;
a return module, configured to, once the local port of the first data processing board has collected the acknowledgment signals of the write request, return a system fence completion signal to the request-initiating processor to indicate that all write requests have been executed;
wherein a plurality of data paths exist between the first data processing board and the second data processing board;
in each data path, the port of the data processing board directly connected to the second data processing board, after receiving the system fence completion signal, sends a write completion signal to the port of the second data processing board;
the port of the second data processing board, after receiving the write completion signal and upon receiving a data write success signal returned by the local memory, sends the write completion signal to the marker address of the local memory;
the memory of the second data processing board updates, based on the write completion signal, the marker in the memory, the marker being used to indicate the number of write completion signals received.
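The method of claim 12 (and the mirrored modules of claim 13) describes a recursive collect-and-report pattern: each fenced port reports completion upward only after every port it forwarded write requests to has reported, and the chain bottoms out at ports whose writes were handed to local memory. The sketch below is a compact model of that flow with an invented two-path topology and invented names, and it assumes a synchronous call can stand in for the asynchronous signaling.

```python
# Hypothetical model of the fence flow in the method claim: every port that
# received write requests is fenced, waits for its downstream ports (or its
# local memory) to confirm, and only then reports completion upward.

class Port:
    def __init__(self, name, downstream=None, has_local_memory=False):
        self.name = name
        self.downstream = downstream or []   # ports the write requests fanned out to
        self.has_local_memory = has_local_memory

    def system_fence(self) -> bool:
        """Execute the fence: collect confirmation from every downstream port,
        then return a system fence completion signal (True) to the caller."""
        if self.has_local_memory:
            return True                      # write handed to local memory: collection done
        return all(p.system_fence() for p in self.downstream)

# Two data paths from the first board toward the second board's memory.
leaf_a = Port("board2_port_a", has_local_memory=True)
leaf_b = Port("board2_port_b", has_local_memory=True)
switch = Port("intermediate_switch", downstream=[leaf_a, leaf_b])
local_port = Port("board1_local_port", downstream=[switch])

# The request-initiating processor learns that all write requests have been
# executed only when the local port's fence completes.
assert local_port.system_fence() is True
```

In the real system the completion travels as signals between boards rather than as return values, but the ordering guarantee is the same: the request-initiating processor sees the fence complete only after every path has confirmed.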
14. An electronic device, characterized in that it comprises:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to implement the system according to any one of claims 1 to 11.

15. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the system according to any one of claims 1 to 11.
CN202410866971.8A 2024-06-28 2024-06-28 Data processing method and system, storage medium and electronic equipment Active CN118642856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410866971.8A CN118642856B (en) 2024-06-28 2024-06-28 Data processing method and system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410866971.8A CN118642856B (en) 2024-06-28 2024-06-28 Data processing method and system, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN118642856A CN118642856A (en) 2024-09-13
CN118642856B true CN118642856B (en) 2025-11-28

Family

ID=92664514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410866971.8A Active CN118642856B (en) 2024-06-28 2024-06-28 Data processing method and system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN118642856B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118820170B (en) * 2024-09-19 2024-12-20 北京壁仞科技开发有限公司 Method for data transmission between boards, board, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201205830D0 (en) * 2010-03-19 2012-05-16 Imagination Tech Ltd Requests and data handling in a bus architecture
CN106575206A (en) * 2014-09-26 2017-04-19 英特尔公司 Memory write management in a computer system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5811245B1 (en) * 2014-07-24 2015-11-11 日本電気株式会社 Information processing apparatus, memory order guarantee method, and program
US11119927B2 (en) * 2018-04-03 2021-09-14 International Business Machines Corporation Coordination of cache memory operations
US11847048B2 (en) * 2020-09-24 2023-12-19 Advanced Micro Devices, Inc. Method and apparatus for providing persistence to remote non-volatile memory
CN117609122B (en) * 2023-11-03 2024-06-18 摩尔线程智能科技(上海)有限责任公司 Data transmission system and method, electronic equipment and storage medium
CN117407181B (en) * 2023-12-14 2024-03-22 沐曦集成电路(南京)有限公司 Heterogeneous computing process synchronization method and system based on barrier instruction

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201205830D0 (en) * 2010-03-19 2012-05-16 Imagination Tech Ltd Requests and data handling in a bus architecture
CN106575206A (en) * 2014-09-26 2017-04-19 英特尔公司 Memory write management in a computer system

Also Published As

Publication number Publication date
CN118642856A (en) 2024-09-13

Similar Documents

Publication Publication Date Title
CN113656227B (en) Chip verification method and device, electronic equipment and storage medium
EP4564218A2 (en) System and method for incremental topology synthesis of a network-on-chip
US5649164A (en) Sets and holds in virtual time logic simulation for parallel processors
CN111190842B (en) Direct memory access, processor, electronic device and data transfer method
US20060047849A1 (en) Apparatus and method for packet coalescing within interconnection network routers
JP7324282B2 (en) Methods, systems, and programs for handling input/output store instructions
US20110161966A1 (en) Controlling parallel execution of plural simulation programs
CN116909639B (en) Mounting system, method, cluster and storage medium
CN118642856B (en) Data processing method and system, storage medium and electronic equipment
CN108924008A (en) A kind of dual controller data communications method, device, equipment and readable storage medium storing program for executing
CN113778331B (en) Data processing method, master node and storage medium
CN108418859A (en) The method and apparatus for writing data
JP2011503731A (en) Changing system routing information in link-based systems
CN120675933B (en) A data transmission method, apparatus, and electronic device based on on-chip network.
CN120448331B (en) Coherent interconnection control device, method, electronic device, product and computing system
US10394729B2 (en) Speculative and iterative execution of delayed data flow graphs
US8935354B2 (en) Coordinating write sequences in a data storage system
CN116795605B (en) Automatic recovery system and method for abnormality of peripheral device interconnection extension equipment
US7231334B2 (en) Coupler interface for facilitating distributed simulation of a partitioned logic design
US10958597B2 (en) General purpose ring buffer handling in a network controller
CN108541365B (en) Apparatus and method for distribution of congestion information in switches
CN115686625A (en) Integrated chip and instruction processing method
CN116095024A (en) Verification method, verification device, electronic device and computer readable storage medium
CN115794230A (en) Update Metadata Prediction Tables Using the Reprediction Pipeline
WO2016095340A1 (en) Method and device for determining that data is sent successfully

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: B655, 4th Floor, Building 14, Cuiwei Zhongli, Haidian District, Beijing, 100036

Applicant after: Mole Thread Intelligent Technology (Beijing) Co.,Ltd.

Address before: 209, 2nd Floor, No. 31 Haidian Street, Haidian District, Beijing

Applicant before: Moore Threads Technology Co., Ltd.

Country or region before: China

GR01 Patent grant