Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of implementations consistent with aspects of the application as set forth in the following claims.
Before explaining in detail the data processing method provided by the embodiments of the application, an application scenario of the embodiments of the application is described.
In a complex and efficient distributed data processing system, the master control processor acts as the nerve center of the overall data processing system, playing a key role in coordinating and monitoring the exchange of data between the various TPUs. However, as the size of data processing systems increases and communication demands grow, the challenges faced by the master control processor also increase. In particular, when real-time data communication between a plurality of TPUs must be managed simultaneously, the load on the master control processor increases significantly, which not only tests its processing capacity and efficiency but also directly affects the response speed and stability of the whole system.
Currently, although the process by which the master control processor distributes data communication instructions to the various TPUs is highly automated and accurate, the subsequent interrupt signal management mechanism has become a potential bottleneck. Each TPU, after completing the communication task assigned to it, immediately sends an interrupt signal to the master control processor to report the task completion status and request the next instruction. However, when multiple TPUs complete their tasks almost simultaneously and send interrupts, the interrupt handling queue of the master control processor fills up quickly, resulting in a so-called "interrupt storm". This phenomenon not only delays the processing of interrupt requests by the master control processor, but may also incur additional context switching overhead due to resource contention, thereby affecting timely responses to subsequent interrupt requests. Such delay not only reduces the overall operating efficiency of the system, but may also introduce data consistency problems due to confused communication timing, reducing the reliability and stability of the data processing system. More seriously, long delays and accumulated errors may drive the data processing system into an unstable state, or even a service interruption, which seriously affects business that relies on the data processing system for data processing and analysis.
In order to solve the above problems, the embodiments of the present application provide a data processing method in which a flag bit is set at the receiver. When either tensor processor finishes executing its part of a communication task, it notifies the other party by modifying the value of the flag bit. In this way, dependency on the master control processor is reduced, communication delay is lowered, and the performance and stability of the system are improved.
In some embodiments, the data processing method described above may be applied to a data processing system. The data processing system includes a master control processor and a plurality of TPUs. In one example, FIG. 1 is a schematic diagram of an architecture of a data processing system provided by an embodiment of the present application. Referring to FIG. 1, the data processing system includes a master control processor 100 and TPU 200-1, TPU 200-2, ..., TPU 200-n. The master control processor 100 is configured to send a communication instruction to TPU 200-1, TPU 200-2, ..., TPU 200-n to instruct each TPU to perform a communication task, and TPU 200-1, TPU 200-2, ..., TPU 200-n are configured to autonomously execute the communication instruction to complete the transmission of communication data.
In some embodiments, where TPU 200-1 and TPU 200-2 are in data communication, with TPU 200-1 as the data sender and TPU 200-2 as the data receiver, TPU 200-1 is configured to: after receiving the communication instruction for performing the communication task sent by the master control processor, send the first data to TPU 200-2 based on the destination address of the first data indicated by a preset rule; and after the first data has been sent, set the value of the flag bit in TPU 200-2 to a first value, so that TPU 200-2 determines, based on the first value, that TPU 200-1 has finished sending the first data. TPU 200-2 is configured to: after receiving the communication instruction for performing the communication task sent by the master control processor, receive the first data sent by TPU 200-1; and after the first data has been received, set the value of the flag bit to a second value, so that TPU 200-1 determines, based on the second value, that TPU 200-2 has received the first data. The process of performing the corresponding operations by TPU 200-1 and TPU 200-2 is described in detail later and is not repeated here.
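The two-party handshake above can be sketched as a minimal Python simulation, with threads standing in for the two TPUs and a shared object standing in for the flag-bit memory in the receiver. The value assignments (1 = first value, 0 = second value) follow the convention used later in this description; all names are illustrative, not actual TPU APIs.

```python
import threading

FIRST_VALUE = 1   # sender finished writing the first data
SECOND_VALUE = 0  # receiver is ready / has consumed the data

class ReceiverFlag:
    """Stands in for the flag bit stored in the receiver's memory."""
    def __init__(self):
        self.value = SECOND_VALUE           # receiver starts out ready
        self._cond = threading.Condition()

    def set(self, value):
        with self._cond:
            self.value = value
            self._cond.notify_all()

    def wait_for(self, value):
        with self._cond:
            self._cond.wait_for(lambda: self.value == value)

def sender(flag, buffer, data):
    flag.wait_for(SECOND_VALUE)  # wait until the receiver can accept data
    buffer.extend(data)          # "send" the first data
    flag.set(FIRST_VALUE)        # signal: transmission complete

def receiver(flag, buffer, out):
    flag.wait_for(FIRST_VALUE)   # wait until the sender has finished
    out.extend(buffer)           # consume the first data
    flag.set(SECOND_VALUE)       # signal: reception complete

flag, buffer, out = ReceiverFlag(), [], []
t_recv = threading.Thread(target=receiver, args=(flag, buffer, out))
t_send = threading.Thread(target=sender, args=(flag, buffer, [10, 20, 30]))
t_recv.start(); t_send.start()
t_send.join(); t_recv.join()
print(out)  # [10, 20, 30]
```

Note that neither side reports to a master control processor: completion is observed purely through the flag value, which is the point of the scheme.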
In some embodiments, where TPU 200-1, TPU 200-2 and TPU 200-3 are in data communication, and, based on a preset rule, TPU 200-1 is determined to be the data sender, TPU 200-2 the data relay, and TPU 200-3 the data receiver, TPU 200-1 is configured to: after receiving the communication instruction sent by the master control processor to perform the communication task, send first data to TPU 200-2; and after the first data has been sent, set the value of the flag bit in TPU 200-2 to a first value, so that TPU 200-2 determines, based on the first value, that TPU 200-1 has finished sending the first data. TPU 200-2 is configured to: receive the first data sent by TPU 200-1; and after the first data has been received, set its flag bit to a second value, so that TPU 200-1 determines, based on the second value, that TPU 200-2 has received the first data. Moreover, after receiving the first data, TPU 200-2 transmits the first data to TPU 200-3, and after the transmission is completed, sets the value of the flag bit in TPU 200-3 to the first value, so that TPU 200-3 determines, based on the first value, that TPU 200-2 has finished sending the first data. TPU 200-3 is configured to: receive the first data sent by TPU 200-2; and after the first data has been received, set its flag bit to the second value, so that TPU 200-2 determines, based on the second value, that TPU 200-3 has received the first data. The process of performing the corresponding operations by TPU 200-1, TPU 200-2 and TPU 200-3 is described in detail later and is not repeated here.
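The three-party relay above can likewise be sketched by chaining the same flag-bit handshake across a relay node. This is a hedged, thread-based simulation with hypothetical names, not real hardware behavior; the main thread plays the role of the source, TPU 200-1.

```python
import threading

FIRST_VALUE, SECOND_VALUE = 1, 0

class Node:
    """Stands in for one tensor processor: a flag bit plus a data buffer."""
    def __init__(self, name):
        self.name = name
        self.flag = SECOND_VALUE            # initially ready to receive
        self.buffer = None
        self._cond = threading.Condition()

    def write_flag(self, value):  # remote write into this node's flag memory
        with self._cond:
            self.flag = value
            self._cond.notify_all()

    def wait_flag(self, value):
        with self._cond:
            self._cond.wait_for(lambda: self.flag == value)

def send_hop(dst, data):
    dst.wait_flag(SECOND_VALUE)   # downstream node is ready to receive
    dst.buffer = list(data)       # transfer the first data
    dst.write_flag(FIRST_VALUE)   # mark: send complete

def relay(node, dst):
    node.wait_flag(FIRST_VALUE)    # upstream send is complete
    data = node.buffer
    node.write_flag(SECOND_VALUE)  # acknowledge reception to the upstream node
    send_hop(dst, data)            # forward along the route

b = Node("TPU 200-2")              # relay
c = Node("TPU 200-3")              # receiver
t = threading.Thread(target=relay, args=(b, c))
t.start()
send_hop(b, [1, 2, 3])             # main thread acts as the source, TPU 200-1
t.join()
c.wait_flag(FIRST_VALUE)
print(c.buffer)  # [1, 2, 3]
```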
In some embodiments, the master control processor may be a central processing unit (CPU), a neural processing unit (NPU), or the like.
FIG. 2 is a schematic flowchart of an implementation of the data processing method according to the present application. In FIG. 2, the tensor processors are illustrated as including a first tensor processor (TPU 200-1) and a second tensor processor (TPU 200-2). As shown in FIG. 2, the data processing method includes the following steps.
Step S201, the master control processor sends communication instructions to the first tensor processor and the second tensor processor.
In some embodiments, the communication instructions are to instruct the first tensor processor and the second tensor processor to perform the communication task.
It will be appreciated that a plurality of tensor processors are included in the data processing system. In performing a data processing task, data transmission may be performed between two tensor processors or between a plurality of tensor processors. Thus, the first tensor processor may be the source sender of the data or a relay of the data, while the second tensor processor is the receiver of the data.
It can be appreciated that distributed communication between processors is a complex communication behavior implemented by combining multiple communication instructions: conventionally, the master control processor initiates a different communication instruction to each tensor processor, where the communication instructions may include information such as the sender and receiver of the communication data, and the storage address and length of the communication data. After each tensor processor receives its communication instruction, it executes the corresponding task based on that instruction. In such a scheme, however, the load on the master control processor is large, which may cause instability of the communication system. Therefore, the master control processor in the embodiments of the application sends only one identical communication instruction to each tensor processor, the communication instruction being used to instruct each tensor processor to execute the communication task.
Step S202, the first tensor processor and the second tensor processor receive communication instructions.
It will be appreciated that each tensor processor, upon receipt of the communication instruction, may autonomously begin executing the distributed communication task.
In some embodiments, a data transmission rule (the preset rule) may be preset, where the data transmission rule may include the execution order of the tensor processors when transmitting data, that is, the transmission addresses of the data. In operation, each tensor processor may generate a routing table based on the data transmission rule and then transmit data based on that routing table. A routing table is an information base used by a network device to determine data forwarding paths, and may include information such as the destination address, next hop address, metric value, and routing source of the data. In this way, a tensor processor, upon receiving data, may forward it to the next node according to the next hop address in its locally generated routing table.
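As an illustration only, the following Python sketch shows how each processor might derive its routing-table entry locally from a preset rule. The ring ordering used here is an assumed example of such a rule, not one mandated by this description, and the routing table is reduced to a destination-to-next-hop mapping.

```python
def build_routing_table(my_id, ring_order):
    """Derive a minimal routing table (destination -> next hop) from a
    preset rule, here assumed to be a fixed ring ordering of the TPUs."""
    pos = ring_order.index(my_id)
    next_hop = ring_order[(pos + 1) % len(ring_order)]
    # In a ring, every other destination is reached via the next hop.
    return {dst: next_hop for dst in ring_order if dst != my_id}

ring = ["TPU 200-1", "TPU 200-2", "TPU 200-3", "TPU 200-4"]
table = build_routing_table("TPU 200-2", ring)
print(table["TPU 200-4"])  # TPU 200-3
```

Each processor runs the same derivation on its own identifier, so no central distribution of routing tables is needed.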
Step S203, based on the preset rule, the first tensor processor sends the first data to the second tensor processor.
It is appreciated that tensor processors are processors designed for deep learning and other types of tensor computation, and are widely used to accelerate the training and inference of machine learning models. Thus, the first data communicated by the tensor processor is data required in the training and inference process of a machine learning model, which may include, for example, model parameters, gradient information, weights, activation values, and control information.
In some embodiments, in a case where the first tensor processor serves as a source sender of the first data, the first tensor processor may determine a receiver of the first data based on a next hop address in the local routing table generated by a preset rule after receiving the communication instruction, so as to send the first data to the second tensor processor.
In some embodiments, the first tensor processor may begin performing the communication task after receiving the communication instruction sent by the master processor. The first tensor processor may prepare the first data to be transmitted, and illustratively, the first tensor processor may retrieve the first data from an internal store or cache and transmit the first data to the second tensor processor.
In some embodiments, in a case where the first tensor processor is a relay of the first data, after receiving the communication instruction and upon receiving the first data sent by a third tensor processor, the first tensor processor may determine the receiver of the first data based on the next hop address in the local routing table generated from the preset rule, so as to send the first data to the second tensor processor. Here, the third tensor processor is the tensor processor of the previous node.
It will be appreciated that when the first tensor processor acts as an intermediate tensor processor, the source of the first data is the tensor processor at the node preceding the first tensor processor. In this case, therefore, after the first tensor processor receives the first data sent by the tensor processor of the previous node, the first data may be sent to the tensor processor of the next node, i.e., the second tensor processor.
In some embodiments, each tensor processor may perform a preparation operation for data transfer after receiving the data processing instruction sent by the master control processor. For example, the tensor processor acting as the data sender may prepare the data to be sent and configure the hardware units for data transmission, and the tensor processor acting as the data receiver may set a flag bit at its own end, the value of the flag bit indicating the state of that tensor processor. For example, a flag bit value of 0 (the second value) indicates that the tensor processor is able to receive data, and a flag bit value of 1 (the first value) indicates that the tensor processor is unable to receive data. After defining the meaning of the different flag-bit values, the second tensor processor may communicate these meanings to the first tensor processor, so that data synchronization with the first tensor processor can be determined based on the value of the flag bit.
In some embodiments, corresponding to step S2021, after receiving the communication instruction, the second tensor processor may set the value of the flag bit to the second value, where the second value is used to indicate that the second tensor processor is capable of receiving the first data sent by the first tensor processor. Thus, when the first tensor processor finds that the value of the flag bit is the second value, it can send the first data to the second tensor processor.
In some embodiments, the first tensor processor may further obtain a value of the flag bit in case of a triggering event, and send the first data to the second tensor processor in case of the value of the flag bit being a second value, the second value being used to indicate that the second tensor processor is capable of receiving the first data sent by the first tensor processor. When the first tensor processor is used as a source sender of the first data, the triggering event is an event that the first tensor processor receives a communication instruction. When the first tensor processor is used as a relay party of the first data, the triggering event is an event that the first tensor processor receives the first data sent by the tensor processor of the previous node.
It will be appreciated that the first tensor processor may determine whether the second tensor processor is able to receive the first data by obtaining the value of the flag bit. After determining, based on the value of the flag bit, that the second tensor processor can receive the first data, the first tensor processor can use the configured hardware unit to send the first data across chips to the second tensor processor.
In some embodiments, it may be preset that a certain address in the data sender corresponds to the address of the flag bit memory storing the flag bit in the data receiver; that is, the two addresses on the two tensor processors correspond to each other. For example, a first address in the first tensor processor may be preset to correspond to a second address of the flag bit memory storing the flag bit in the second tensor processor. In this way, the first tensor processor may obtain the value of the flag bit by accessing its own first address. If the value of the flag bit set by the second tensor processor is the first value, the second tensor processor cannot receive the first data sent by the first tensor processor; if the value is the second value, the second tensor processor can receive it. Thus, when the obtained value of the flag bit is the first value, the first tensor processor may determine that the second tensor processor may be in a busy state and cannot receive new data; the first tensor processor may then continuously poll the value of the flag bit. When the obtained value of the flag bit is the second value, the first tensor processor determines that the second tensor processor can now receive the first data and sends the first data to the second tensor processor.
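The sender-side polling just described can be sketched as follows. The address values, the `address_map` dictionary simulating the preset correspondence between the first and second addresses, and the timeout handling are all illustrative assumptions rather than features stated by this description.

```python
import time

FIRST_VALUE, SECOND_VALUE = 1, 0

# Simulated flag-bit memory on the receiver, and the preset correspondence
# between the sender's first address and the receiver's second address.
flag_memory = {0x2000: FIRST_VALUE}   # receiver is busy at first
address_map = {0x1000: 0x2000}        # first address -> second address

def read_flag(first_address):
    """Reading the sender's local first address yields the remote flag value."""
    return flag_memory[address_map[first_address]]

def wait_until_ready(first_address, poll_interval=0.001, timeout=1.0):
    """Poll the flag until the receiver can accept data, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_flag(first_address) == SECOND_VALUE:
            return True
        time.sleep(poll_interval)
    return False

flag_memory[0x2000] = SECOND_VALUE    # the receiver becomes ready
print(wait_until_ready(0x1000))       # True
```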
In some embodiments, the sending of the first data by the first tensor processor to the second tensor processor in step S203 may be implemented by reading the first data from a first memory of the first tensor processor, where the first memory stores data to be sent by the first tensor processor, and sending the first data to a destination address of the second tensor processor.
It will be appreciated that the first tensor processor may first retrieve the first data from the first memory storing the data to be transmitted, and then transmit the first data to the destination address of the second tensor processor. The first tensor processor may send a data read request to the first memory, the data read request carrying information indicating the first data, and the first memory may return the first data to the first tensor processor after receiving the data read request. The first tensor processor then sends the received first data to the memory address of the second tensor processor, or to a particular data receiving interface, via a data transmission mechanism. The data transmission mechanisms may include a system bus, a peripheral component interconnect express (PCIe) bus, a network interface, and so forth.
In some embodiments, the first tensor processor may further perform a preprocessing operation on the first data before sending it to the second tensor processor, so as to ensure the integrity and security of the first data. For example, the first tensor processor may format or encode the first data, and may further compress or encrypt it.
In some embodiments, the first tensor processor may read the first data from the first memory via the address access unit and send the first data to the destination address of the second tensor processor via the address forwarding unit.
In some embodiments, the first tensor processor, as the data sender, may configure the hardware units required for data transmission after receiving the data processing instruction and record the configuration in registers. The first tensor processor may be configured with an address access unit for accessing the first memory of the first tensor processor and reading the first data, and an address forwarding unit for forwarding the data. Illustratively, the address access unit may be a chip direct memory access (CDMA) unit, and the address forwarding unit may be an address translation unit (ATU) in the PCIe bus; the ATU is a critical component of the PCIe bus, responsible for address translation to ensure that data is properly routed to the target device.
In one example, the master control unit in the first tensor processor may configure the CDMA unit and the ATU unit in the PCIe bus after receiving the data processing instruction. Taking the ATU configuration as an example, the master control unit may send a configuration command to the PCIe switch via the PCIe bus, where the configuration command includes information such as a data source address, a destination address, and an address translation rule. The PCIe switch updates its internal table entries according to the configuration command to support subsequent data forwarding. After the master control unit has configured the CDMA unit and the ATU unit, the CDMA unit may obtain the first data from the first memory and then encapsulate the first data according to the PCIe protocol, including adding necessary header information such as the data destination address, the data source address, and the data length. The encapsulated first data is then sent to the PCIe switch over the PCIe bus, and the PCIe switch forwards it to the destination address of the second tensor processor according to the address translation rules configured in the ATU unit. Illustratively, the master control unit in the first tensor processor may be a scalar unit.
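As a rough illustration of the kind of address-translation window an ATU applies, the following Python sketch maps a local source window onto a remote destination window. The field names, addresses, and window size are hypothetical; real ATU programming is device-specific register work and is not specified by this description.

```python
from dataclasses import dataclass

@dataclass
class AtuWindow:
    src_base: int  # local address window on the sending TPU (assumed)
    dst_base: int  # destination address on the receiving TPU (assumed)
    size: int      # window length in bytes

    def translate(self, addr):
        """Apply the configured address-translation rule to one address."""
        if not (self.src_base <= addr < self.src_base + self.size):
            raise ValueError("address outside the configured window")
        return self.dst_base + (addr - self.src_base)

# One outbound window: local 0x8000_0000.. maps onto the second tensor
# processor's memory starting at 0x4000_0000.
win = AtuWindow(src_base=0x8000_0000, dst_base=0x4000_0000, size=0x1000)
print(hex(win.translate(0x8000_0010)))  # 0x40000010
```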
FIG. 3 is a schematic diagram of data transmission provided by an embodiment of the present application. As shown in FIG. 3, after the first tensor processor (TPU 200-1) completes the configuration of the CDMA unit and the ATU unit, the CDMA unit may read the first data from the first memory of TPU 200-1. The first data may then be sent to a temporary storage address (BAR) on the master control processor and forwarded through the ATU unit to the destination address (e.g., the second memory) of the second tensor processor (TPU 200-2). In some embodiments, the first data may instead be forwarded directly through the ATU unit to the destination address of the second tensor processor, without first sending the first data to the BAR on the master control processor; this reduces the participation of the master control processor and improves the stability of the system.
Step S204, after the first data transmission is completed, the first tensor processor sets the value of the flag bit in the second tensor processor to be a first value to indicate that the first data transmission is completed.
In some embodiments, step S204 may be implemented as follows: after the first data transmission is completed, the first tensor processor writes a first value into a first address of the first tensor processor, the first address corresponding to a second address of the flag bit memory storing the flag bit; the first value is written into the second address by synchronizing the first address with the second address, so that the value of the flag bit is set to the first value.
It can be understood that, in the embodiments of the present application, the first address in the first tensor processor is preset to correspond to the second address of the flag bit memory storing the flag bit in the second tensor processor, so the two addresses are logically or physically associated with each other. Thus, when the first tensor processor writes the first value into the first address, the system may synchronize the first value to the second address using this address correspondence, so that the value of the flag bit in the second tensor processor is updated to the first value and the second tensor processor determines, based on the first value, that the first data transmission is complete.
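The write-through behavior just described can be sketched as follows. The `MirroredAddress` class and the address values are illustrative stand-ins for the preset correspondence between the first and second addresses, not a real hardware interface.

```python
FIRST_VALUE = 1

# Simulated flag-bit memory on the second tensor processor.
receiver_flag_memory = {0x2000: 0}

class MirroredAddress:
    """Stands in for the preset correspondence: writes to the sender's
    first address are synchronized to the receiver's second address."""
    def __init__(self, local_addr, remote_addr, remote_memory):
        self.local_addr = local_addr
        self.local_value = 0
        self.remote_addr = remote_addr
        self.remote_memory = remote_memory

    def write(self, value):
        self.local_value = value
        # The interconnect propagates the write, updating the remote flag
        # bit in the same step.
        self.remote_memory[self.remote_addr] = value

first_address = MirroredAddress(0x1000, 0x2000, receiver_flag_memory)
first_address.write(FIRST_VALUE)     # signal: first data sent completely
print(receiver_flag_memory[0x2000])  # 1
```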
Step S205, the second tensor processor receives the first data.
In some embodiments, after setting the value of the flag bit to the second value, the second tensor processor may continuously obtain the value of the flag bit and determine that the first data transmission is completed when the obtained value is updated from the second value to the first value. The second tensor processor may then receive the first data, which can subsequently be used for training and inference of the model.
Step S206, after the first data receiving is completed, the second tensor processor sets the value of the flag bit to a second value to indicate that the first data receiving is completed.
It will be appreciated that the second tensor processor, upon receiving the first data, may modify the value of the flag bit so that the first tensor processor determines, based on the modified value, that the second tensor processor has received the first data. In this way, synchronization of the data between the two tensor processors is achieved.
In some embodiments, corresponding to step S207, the first tensor processor obtains the value of the flag bit after setting the value of the flag bit in the second tensor processor to the first value.
It is understood that the first tensor processor may continuously acquire the value of the flag bit after setting the value of the flag bit in the second tensor processor to the first value, so as to determine whether the second tensor processor receives the first data based on the value of the flag bit.
In some embodiments, corresponding to step S208, in the case that the value of the flag bit is the second value, the first tensor processor determines that the second tensor processor has received the first data.
It will be appreciated that if the first tensor processor obtains that the value of the flag bit is updated from the first value to the second value, the first tensor processor determines that the second tensor processor received the first data.
In the embodiments of the application, the master control processor only needs to send, to the first tensor processor and the second tensor processor, a communication instruction instructing each tensor processor to start executing the communication task. The first tensor processor can then send the first data to the second tensor processor and, after the first data has been sent, set the value of the flag bit in the second tensor processor to a first value, so that the second tensor processor determines, based on the first value, that the first data has been sent completely. The second tensor processor receives the first data sent by the first tensor processor and, after the first data has been received, updates the value of the flag bit from the first value to a second value, so that the first tensor processor determines that the first data has been received. Thus, when the first tensor processor and the second tensor processor perform data communication, each tensor processor only modifies the value of the flag bit in the receiver upon sending or receiving data, and the opposite tensor processor can confirm, based on the modified value, that the data has been sent or received. The master control processor therefore does not need to be notified when data is sent or received, and there is no delay caused by interrupt tasks arriving at the master control processor, so that communication delay can be reduced and the performance and stability of the system improved.
The above description illustrates the data processing method provided by the application by taking the interaction of the processors in the data processing system as an example. The data processing method provided by the application is described below with the first tensor processor as the execution body.
FIG. 4 is a flow chart illustrating an implementation of a data processing method according to an embodiment of the present application, where the data processing method is performed by a first tensor processor. As shown in fig. 4, the method includes steps S401 to S402.
Step S401, in the case of a triggering event, the first data is sent to the second tensor processor based on a preset rule.
In some embodiments, a preset rule is used to indicate a destination address of the first data.
In some embodiments, in the case that the first tensor processor is the source sender of the data, step S401 may be implemented as follows: a communication instruction sent by the master control processor is received, where the communication instruction is used to instruct the first tensor processor to perform the communication task, and the first data is sent to the second tensor processor based on the preset rule.
It can be appreciated that distributed communication between processors is a complex communication behavior implemented by combining multiple communication instructions: conventionally, the master control processor initiates a different communication instruction to each tensor processor, where the communication instructions may include information such as the sender and receiver of the communication data, and the storage address and length of the communication data. After each tensor processor receives its communication instruction, it executes the corresponding task based on that instruction. In such a scheme, however, the load on the master control processor is large, which may cause instability of the data processing system. Therefore, the master control processor in the embodiments of the application sends only one identical communication instruction to each tensor processor, the communication instruction being used to instruct each tensor processor to execute the communication task, and each tensor processor can autonomously start executing the distributed communication task after receiving the communication instruction.
In some embodiments, the first tensor processor determines to start performing the communication task after receiving the communication instruction sent by the master processor. The first tensor processor may prepare the first data to be transmitted, and illustratively, the first tensor processor may retrieve the first data from an internal storage or cache and determine a receiver of the first data based on a next hop address in a local routing table generated by a preset rule, so as to transmit the first data to the second tensor processor.
In some embodiments, the first tensor processor may further obtain a value of the flag bit in the event of a trigger event, and send the first data to the second tensor processor in the event that the value of the flag bit is a second value, the second value indicating that the second tensor processor is capable of receiving the first data sent by the first tensor processor.
In some embodiments, each tensor processor may perform a preparation operation for data transfer after receiving the data processing instruction sent by the master control processor. For example, the tensor processor acting as the data sender may prepare the data to be sent and configure the hardware units for data transmission, and the tensor processor acting as the data receiver may set a flag bit at its own end, the value of the flag bit indicating the state of that tensor processor. For example, a flag bit value of 0 (the second value) indicates that the tensor processor is able to receive data, and a flag bit value of 1 (the first value) indicates that the tensor processor is unable to receive data. After defining the meaning of the different flag-bit values, the second tensor processor may communicate these meanings to the first tensor processor, so that data synchronization with the first tensor processor can be determined based on the value of the flag bit. Thus, the first tensor processor may obtain the value of the flag bit to determine whether the second tensor processor is capable of receiving the first data sent by the first tensor processor. After determining, based on the value of the flag bit, that the second tensor processor can receive the first data, the first tensor processor can use the configured hardware units to send the first data across chips to the second tensor processor.
In some embodiments, it may be preset that a certain address in the data sender corresponds to the address of the flag bit memory storing the flag bit in the data receiver; that is, the two addresses on the two tensor processors correspond to each other. For example, a first address in the first tensor processor may be preset to correspond to a second address of the flag bit memory storing the flag bit in the second tensor processor. In this way, the first tensor processor may obtain the value of the flag bit by accessing its own first address. If the value of the flag bit set by the second tensor processor is the first value, the second tensor processor cannot receive the first data sent by the first tensor processor; if the value is the second value, the second tensor processor can receive it. Thus, when the obtained value of the flag bit is the first value, the first tensor processor determines that the second tensor processor may be in a busy state and cannot receive new data; the first tensor processor may then continuously poll the value of the flag bit. When the obtained value of the flag bit is the second value, the first tensor processor determines that the second tensor processor can now receive the first data and sends the first data to the second tensor processor.
In some embodiments, the implementation of the first tensor processor sending the first data to the second tensor processor in step S401 may be that the first data is read from a first memory of the first tensor processor, where the first memory stores the data to be sent by the first tensor processor, and the first data is then sent to a destination address of the second tensor processor. For the implementation process, reference may be made to the relevant content of step S203 in FIG. 1, which is not repeated herein.
The above is the procedure in which the first tensor processor performs data transmission as the source sender of the first data. When the first tensor processor performs data transmission as a relay of the first data, the implementation of step S401 may be that the first data sent by a third tensor processor is received, where the third tensor processor is the tensor processor of the previous node, and the first data is then sent to the second tensor processor based on a preset rule.
It will be appreciated that when the first tensor processor acts as an intermediate tensor processor, the first data originates from the tensor processor at the node preceding the first tensor processor. In this case, after the first tensor processor receives the first data sent by the tensor processor of the previous node, the receiver of the first data may be determined based on the next-hop address in the local routing table generated by the preset rule, so that the first data is sent to the tensor processor of the next node, i.e., the second tensor processor. For the implementation of the first tensor processor sending the first data to the second tensor processor, reference may be made to the relevant content describing the first tensor processor sending the first data as the source sender, which is not repeated herein.
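The relay behaviour based on a next-hop lookup may be sketched, under stated assumptions, as follows. The routing table contents and the identifiers `tpu1`, `tpu2`, `tpu3` are hypothetical examples of a table generated by a preset rule; only the lookup-and-forward logic is illustrated.

```python
# Hypothetical local routing table generated by the preset rule:
# maps a destination processor id to the next-hop processor id.
ROUTING_TABLE = {
    "tpu2": "tpu2",   # direct neighbour
    "tpu3": "tpu2",   # reach tpu3 through tpu2
}

def forward(first_data: bytes, destination: str, local_id: str):
    """Relay behaviour: a processor that is not the destination looks up
    the next hop and forwards the first data unchanged."""
    if destination == local_id:
        return ("deliver", first_data)        # this node is the receiver
    next_hop = ROUTING_TABLE[destination]     # next node per preset rule
    return ("send", next_hop, first_data)

# tpu1 relays data addressed to tpu3 via its next hop tpu2
print(forward(b"payload", "tpu3", "tpu1"))   # → ('send', 'tpu2', b'payload')
```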
Step S402, after the first data transmission is completed, the value of the flag bit in the second tensor processor is set to a first value to indicate that the first data transmission is completed.
In some embodiments, step S402 may be implemented by writing the first value into a first address of the first tensor processor after the first data transmission is completed, where the first address corresponds to a second address of the flag bit memory storing the flag bit. The first value is then written into the second address by synchronizing the first address with the second address, such that the value of the flag bit is set to the first value.
It can be understood that in the embodiment of the present application, the first address in the first tensor processor is preset to correspond to the second address of the flag bit memory storing the flag bit in the second tensor processor, so that the two addresses are logically or physically associated with each other. Thus, when the first tensor processor writes the first value to the first address, the system may synchronize the first value to the second address using this address correspondence, such that the value of the flag bit in the second tensor processor is updated to the first value. It may be appreciated that the second tensor processor may continuously acquire the value of the flag bit after setting it to the second value, and determine that the first data transmission is completed when the acquired value is updated from the second value to the first value. This achieves data synchronization between the two tensor processors.
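The address correspondence described above may be modelled, for illustration only, with the following sketch. The class `AddressMirror` and its attribute names are hypothetical; the point shown is that one write at the sender side is observed as a flag-bit update at the receiver side.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0

class AddressMirror:
    """Hypothetical model of the preset address correspondence: a write to
    the sender's first address is synchronized to the receiver's second
    address, where the flag bit is stored."""
    def __init__(self):
        self.first_address = SECOND_VALUE   # sender-side view
        self.second_address = SECOND_VALUE  # receiver-side flag bit memory

    def write_first(self, value):
        self.first_address = value
        self.second_address = value         # hardware keeps the pair in sync

mirror = AddressMirror()
mirror.write_first(FIRST_VALUE)  # sender marks transmission complete
# The receiver sees the flag bit flip from the second value to the first.
assert mirror.second_address == FIRST_VALUE
```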
The above describes the scheme in which the data sender timely notifies the data receiver that the data transmission is completed. In addition, in the embodiment of the application, the data receiver can also timely inform the data sender that the data reception is completed.
In some embodiments, the first tensor processor may further obtain the value of the flag bit after setting the value of the flag bit in the second tensor processor to the first value, and determine that the second tensor processor receives the first data if the value of the flag bit is the second value.
It is understood that the first tensor processor may continuously acquire the value of the flag bit after setting it to the first value, to determine whether the second tensor processor has received the first data. If the value of the flag bit is updated from the first value to the second value, the first tensor processor determines that the second tensor processor has received the first data. The operation of updating the value of the flag bit to the second value is performed by the second tensor processor.
In some embodiments, the second tensor processor may continuously acquire the value of the flag bit after setting it to the second value, and may determine that the first data transmission is completed when the acquired value is updated to the first value. The second tensor processor may then receive the first data, and modify the value of the flag bit back to the second value after the first data reception is completed, to indicate that the second tensor processor has received the first data.
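The two-sided handshake described in the preceding paragraphs (the sender sets the flag bit to the first value after sending; the receiver restores it to the second value after receiving) may be sketched as the following state sequence. The class `Handshake` and its method names are hypothetical illustrations, not part of the embodiment itself.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0

class Handshake:
    """Hypothetical flag-bit handshake between sender and receiver."""
    def __init__(self):
        self.flag = SECOND_VALUE  # receiver starts able to receive
        self.log = []

    def sender_finish_send(self):
        self.flag = FIRST_VALUE           # first data transmission completed
        self.log.append("sent")

    def receiver_finish_receive(self):
        assert self.flag == FIRST_VALUE   # receiver saw transmission complete
        self.flag = SECOND_VALUE          # first data reception completed
        self.log.append("received")

    def sender_confirm(self):
        return self.flag == SECOND_VALUE  # sender learns data was received

h = Handshake()
h.sender_finish_send()
h.receiver_finish_receive()
print(h.sender_confirm())  # → True
```

Note that neither side notifies the master control processor at any point; the flag bit alone carries both the "transmission completed" and the "reception completed" events.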
The above describes a method in which a first tensor processor sends first data to a second tensor processor and informs the second tensor processor that the first data transmission is completed. The above method will be further described with reference to fig. 5.
FIG. 5 is a flowchart of another implementation of the data processing method according to an embodiment of the present application, where the method is performed by the first tensor processor. As shown in FIG. 5, the method includes steps S501 to S507.
Step S501, obtaining the value of the flag bit under the condition that a trigger event occurs.
Step S502, reading the first data from the first memory under the condition that the value of the flag bit is the second value.
Step S503, sending the first data to the destination address of the second tensor processor based on a preset rule.
Step S504, after the first data transmission is completed, writing the first value into a first address, where the first address corresponds to a second address of the flag bit memory storing the flag bit.
Step S505, writing the first value into the second address by synchronizing the first address with the second address, such that the value of the flag bit is set to the first value.
Step S506, acquiring the value of the flag bit.
Step S507, determining that the second tensor processor receives the first data under the condition that the value of the flag bit is the second value.
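Steps S501 to S507 may be sketched, for illustration only, as the following two-thread simulation. The names `shared_flag`, `link`, and `first_memory` are hypothetical stand-ins for the mirrored flag-bit memory, the cross-chip link, and the first memory; a real implementation would poll a hardware register rather than a Python dictionary.

```python
import queue
import threading
import time

FIRST_VALUE, SECOND_VALUE = 1, 0

shared_flag = {"value": SECOND_VALUE}   # mirrored flag bit memory
link = queue.Queue()                    # stands in for the cross-chip link
first_memory = [b"first data"]          # first memory of the first tensor processor
received = []

def first_tensor_processor():
    # S501: obtain the value of the flag bit when the trigger event occurs.
    while shared_flag["value"] != SECOND_VALUE:
        time.sleep(0.001)
    # S502: read the first data from the first memory.
    data = first_memory[0]
    # S503: send the first data to the second tensor processor's
    # destination address.
    link.put(data)
    # S504/S505: write the first value; the address correspondence sets the
    # flag bit in the second tensor processor to the first value.
    shared_flag["value"] = FIRST_VALUE
    # S506/S507: poll the flag bit until the second tensor processor
    # restores it to the second value, confirming reception.
    while shared_flag["value"] != SECOND_VALUE:
        time.sleep(0.001)

def second_tensor_processor():
    # Wait until the flag bit shows the first value (transmission completed),
    # then receive the data and restore the flag bit to the second value.
    while shared_flag["value"] != FIRST_VALUE:
        time.sleep(0.001)
    received.append(link.get())
    shared_flag["value"] = SECOND_VALUE

t = threading.Thread(target=second_tensor_processor)
t.start()
first_tensor_processor()
t.join()
print(received)  # → [b'first data']
```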
In the embodiment of the application, the first tensor processor sends the first data to the second tensor processor under the condition that a trigger event occurs, and after the first data is sent, sets the value of the flag bit in the second tensor processor to the first value, so that the second tensor processor determines, based on the first value, that the first data transmission is completed. Therefore, when the first tensor processor and the second tensor processor perform data communication, each tensor processor only needs to modify the value of the flag bit in the receiver when its data transmission or data reception is completed, and the opposite tensor processor can confirm, based on the modified value, that the data transmission or reception is completed. The master control processor therefore does not need to be notified of the completion of data transmission or reception, and there is no delay caused by interrupt tasks received by the master control processor, thereby reducing communication delay and improving the performance and stability of the system.
The above describes the data processing method with the data sender as the execution body. The following describes the data processing method according to the embodiment of the present application with the data receiver as the execution body.
FIG. 6 is a flowchart of another implementation of a data processing method according to an embodiment of the present application, where the data processing method is performed by a second tensor processor. As shown in FIG. 6, the method includes steps S601 to S603.
Step S601, receiving a communication instruction sent by the master control processor, where the communication instruction is used to instruct the second tensor processor to execute a communication task.
It may be appreciated that, in the data processing system, when a communication task is to be executed, the master control processor sends an identical communication instruction to each tensor processor, where the communication instruction is used to instruct each tensor processor to execute the communication task, and each tensor processor may autonomously start executing the distributed communication task after receiving the communication instruction. The second tensor processor first receives the communication instruction sent by the master control processor, and then starts to execute the communication task.
In some embodiments, after receiving the communication instruction, the second tensor processor may set a flag bit at its own end, where the value of the flag bit is used to indicate the state of the second tensor processor. After defining the meanings of the different flag bit values, the second tensor processor may send those meanings to the first tensor processor, so that data synchronization with the first tensor processor can be determined based on the value of the flag bit.
In some embodiments, the second tensor processor may set the value of the flag bit to the second value, indicating that the second tensor processor is capable of receiving the first data sent by the first tensor processor. Thus, when the first tensor processor obtains that the value of the flag bit is the second value, it may send the first data to the second tensor processor.
Step S602, receiving first data sent by a first tensor processor.
In some embodiments, the second tensor processor may continuously acquire the value of the flag bit after setting it to the second value, and determine that the first data transmission is completed when the acquired value is updated from the second value to the first value. The second tensor processor may then receive the first data for subsequent training and inference of a model using the first data.
Step S603, after the first data reception is completed, the value of the flag bit is set to a second value to indicate that the first data reception is completed.
It will be appreciated that the second tensor processor, upon receiving the first data, may modify the value of the flag bit, so that the first tensor processor determines, based on the modified value, that the second tensor processor has received the first data, thereby achieving data synchronization between the two tensor processors.
The above method will be further described with reference to fig. 7.
FIG. 7 is a flowchart of another implementation of the data processing method according to an embodiment of the present application, where the method is performed by the second tensor processor. As shown in FIG. 7, the method includes steps S701 to S704.
Step S701, receiving a communication instruction sent by the master control processor.
Step S702, setting the value of the flag bit to the second value to instruct the first tensor processor to send the first data.
Step S703, receiving the first data sent by the first tensor processor.
Step S704, after the first data reception is completed, updating the value of the flag bit from the first value to the second value, so that the first tensor processor determines that the first data reception is completed.
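For illustration only, steps S701 to S704 may be sketched from the receiver side as the following state machine. The class `SecondTensorProcessor` and its method names are hypothetical; the sender's write to the flag bit is represented by a direct assignment, standing in for the address-mapped update.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0

class SecondTensorProcessor:
    """Hypothetical receiver-side state machine for steps S701-S704."""
    def __init__(self):
        self.flag = None
        self.buffer = None

    def on_communication_instruction(self):
        # S701/S702: after the communication instruction arrives, set the
        # flag bit to the second value so the sender may transmit.
        self.flag = SECOND_VALUE

    def on_first_data(self, data):
        # S703: receive the first data once the sender has marked the
        # flag bit with the first value (transmission completed).
        assert self.flag == FIRST_VALUE
        self.buffer = data
        # S704: restore the flag bit to the second value so the sender
        # can determine the first data has been received.
        self.flag = SECOND_VALUE

rx = SecondTensorProcessor()
rx.on_communication_instruction()
rx.flag = FIRST_VALUE        # the sender's write arrives via the address mapping
rx.on_first_data(b"first data")
print(rx.flag == SECOND_VALUE, rx.buffer)  # → True b'first data'
```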
In the embodiment of the application, the second tensor processor receives the communication instruction sent by the master control processor, sets the value of the flag bit to the second value to instruct the first tensor processor to send the first data, and then receives the first data sent by the first tensor processor. After the first data is received, the second tensor processor updates the value of the flag bit from the first value to the second value, so that the first tensor processor determines that the first data has been received. Data synchronization between the two tensor processors is thereby achieved while avoiding the participation of the master control processor in data synchronization, reducing communication delay and improving system performance and stability.
The embodiment of the application also provides a data processing apparatus, and FIG. 8 is a schematic diagram of a composition structure of the data processing apparatus provided by the embodiment of the application. As shown in FIG. 8, the data processing apparatus 800 includes a sending module 801 and a first processing module 802. The sending module 801 is configured to send the first data to the second tensor processor under the condition that a trigger event occurs, and the first processing module 802 is configured to set the value of the flag bit in the second tensor processor to the first value after the first data transmission is completed, so as to indicate that the first data transmission is completed.
In some possible embodiments, the sending module 801 is further configured to obtain a value of a flag bit in case of occurrence of a trigger event, and send the first data to the second tensor processor in case of the value of the flag bit being a second value, where the second value is used to indicate that the second tensor processor is capable of receiving the first data sent by the first tensor processor.
In some possible embodiments, the sending module 801 is further configured to read the first data from a first memory of the first tensor processor, where the first memory stores data to be sent by the first tensor processor, and send the first data to a destination address of the second tensor processor.
In some possible embodiments, the first processing module 802 is further configured to write the first value into a first address of the first tensor processor after the first data transmission is completed, where the first address corresponds to a second address of the flag bit memory storing the flag bit, and to write the first value into the second address by synchronizing the first address with the second address, such that the value of the flag bit is set to the first value.
In some possible embodiments, the apparatus further comprises an acquisition module configured to acquire a value of the flag bit, and a determination module configured to determine that the second tensor processor receives the first data if the value of the flag bit is the second value.
In some possible embodiments, the sending module 801 is further configured to receive a communication instruction sent by the master processor, where the communication instruction is used to instruct the first tensor processor to perform a communication task, and send the first data to the second tensor processor based on a preset rule.
In some possible embodiments, the sending module 801 is further configured to receive the first data sent by the third tensor processor, where the third tensor processor is a processor of a previous node, and send the first data to the second tensor processor based on a preset rule.
The embodiment of the application also provides another data processing apparatus, and FIG. 9 is a schematic diagram of another composition structure of the data processing apparatus provided by the embodiment of the application. As shown in FIG. 9, the data processing apparatus 900 includes a first receiving module 901, a second receiving module 902, and a second processing module 903. The first receiving module 901 is configured to receive a communication instruction sent by the master control processor, where the communication instruction is used to instruct the second tensor processor to execute a communication task; the second receiving module 902 is configured to receive the first data sent by the first tensor processor; and the second processing module 903 is configured to set the value of the flag bit to the second value after the first data reception is completed, so as to indicate that the first data reception is completed.
In some possible embodiments, the second processing module 903 is further configured to set, after the first data reception is completed, a first value of the flag bit to a second value, where the first value is set by the first tensor processor, and the first value is used to indicate that the first data transmission is completed.
In some possible embodiments, the apparatus further comprises a third processing module configured to set a value of the flag bit to a second value after receiving the communication instruction, the second value being used to indicate that the second tensor processor is capable of receiving the first data sent by the first tensor processor.
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects. In some embodiments, functions or modules included in the apparatus provided by the embodiments of the present application may be used to perform the methods described in the method embodiments; for technical details not disclosed in the apparatus embodiments of the present application, reference is made to the description of the method embodiments of the present application.
The application provides a computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, the processor implementing the steps of the data processing method described above when executing the program.
An embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements some or all of the steps of the data processing method described above. The computer readable storage medium may be transitory or non-transitory.
Embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the data processing method described above. The computer program product may be implemented by hardware, software, or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium, and in other embodiments, the computer program product is embodied as a software product, such as a software development kit (SDK).
It is noted that other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.