
CN117075903A - Tensor-based compilation method, device and computer-readable storage medium thereof - Google Patents


Info

Publication number
CN117075903A
CN117075903A
Authority
CN
China
Prior art keywords
tensor
statement
register
creation
data
Prior art date
Legal status
Pending
Application number
CN202210503503.5A
Other languages
Chinese (zh)
Inventor
Name not disclosed at the inventor's request
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202210503503.5A
Publication of CN117075903A
Legal status: Pending

Classifications

    • G — PHYSICS
        • G06 — COMPUTING OR CALCULATING; COUNTING
            • G06F — ELECTRIC DIGITAL DATA PROCESSING
                • G06F 8/00 — Arrangements for software engineering
                    • G06F 8/30 — Creation or generation of source code
                        • G06F 8/37 — Compiler construction; Parser generation
                    • G06F 8/40 — Transformation of program code
                        • G06F 8/41 — Compilation
                            • G06F 8/44 — Encoding
                                • G06F 8/441 — Register allocation; Assignment of physical memory space to logical memory space
                                • G06F 8/443 — Optimisation
                            • G06F 8/48 — Incremental compilation

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present disclosure relates to a tensor-based compilation method and related products, where the related products include a device and a computer-readable storage medium. The device may be included in a computing processing apparatus of a combined processing apparatus, and the computing processing apparatus may include one or more data processing apparatuses. The aforementioned combined processing apparatus may further include an interface apparatus and other processing apparatuses. The computing processing apparatus interacts with the other processing apparatuses to jointly complete computing operations specified by a user. The combined processing apparatus may further include a storage apparatus connected to the device and the other processing apparatuses, respectively, for storing data of the device and the other processing apparatuses. Through the solution of the present disclosure, compilation operations for tensors can be optimized and the amount of program code can be significantly reduced.

Description

Tensor-based compilation method, device, and computer-readable storage medium thereof
Technical Field
The present disclosure relates generally to the field of program compilation. More particularly, the present disclosure relates to a tensor-based compilation method, an apparatus for performing the foregoing method, and a computer-readable storage medium.
Background
Constant propagation is one of the most widely used optimization methods in modern compilers and is commonly applied to high-level intermediate representations (IR). Constant propagation solves the problem of statically determining whether an expression always evaluates to the same constant at runtime. If the compiler knows, at the point where a procedure is invoked, which variables will hold constant values and what those values will be, it can fold those constants during compilation. Typical constant propagation algorithms include simple constant propagation, sparse simple constant propagation, conditional constant propagation, and sparse conditional constant propagation.
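As an illustration of the constant propagation described above (a generic sketch, not code from this disclosure), the following Python fragment folds known constants in a toy three-address IR during a single forward pass; the IR format and the restriction to folding only `add` are simplifying assumptions:

```python
# Illustrative sketch of simple constant propagation over a tiny
# three-address IR. Statements are (dest, op, args) tuples; known
# constants are substituted and folded in one forward pass.

def propagate_constants(ir):
    consts = {}      # variable -> known constant value
    optimized = []
    for dest, op, args in ir:
        # Replace arguments whose values are already known constants.
        vals = [consts.get(a, a) for a in args]
        if op == "const":
            consts[dest] = vals[0]
            optimized.append((dest, "const", vals))
        elif op == "add" and all(isinstance(v, int) for v in vals):
            consts[dest] = sum(vals)                  # fold at compile time
            optimized.append((dest, "const", [consts[dest]]))
        else:
            # Unknown operand or unhandled op: keep the statement,
            # but with constants propagated into its arguments.
            optimized.append((dest, op, vals))
    return optimized

ir = [
    ("x", "const", [2]),
    ("y", "const", [3]),
    ("z", "add", ["x", "y"]),   # folds to z = 5
    ("w", "mul", ["z", "n"]),   # n unknown: z replaced by 5, mul kept
]
```

Running `propagate_constants(ir)` rewrites the `add` into a constant and propagates `5` into the surviving `mul`, which is exactly the kind of static simplification the paragraph above describes for scalars.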
While the constant propagation schemes described above do optimize compilation, these conventional techniques target conventional programming languages and programming models (e.g., the C or C++ language), which operate only on scalars. That is, existing compilation techniques cannot be applied to the processing of tensors. In view of this, how to realize the propagation of tensors in intermediate representations is a technical problem to be solved.
Disclosure of Invention
In view of the technical problems mentioned in the background section above, the present disclosure proposes a compilation method for tensors. With the scheme of the present disclosure, global propagation of tensors, i.e., propagation across all basic blocks ("BB") of the intermediate representation, can be achieved during the intermediate representation stage of compilation. This effectively reduces processing complexity during compilation and significantly optimizes the compilation operation. To this end, the present disclosure provides a solution for tensor propagation during compilation in the following aspects.
In a first aspect, the present disclosure provides a tensor-based compilation method, comprising: setting a data structure based on tensor data attributes for tensor global propagation in the compiled intermediate representation, wherein the tensor global propagation relates to all basic blocks of the intermediate representation stage; and executing one or more creation statements to create a target tensor register containing the data structure, so as to enable tensor global propagation during compilation.
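The first aspect does not spell out the layout of the attribute-based data structure; the following Python sketch is purely illustrative, with assumed field names (`shape`, `dtype`, `address`) and an assumed creation helper, to show how a target tensor register carrying tensor data attributes might be modeled:

```python
# Hypothetical model of a "target tensor register" holding a data
# structure based on tensor data attributes. All field names here are
# assumptions for illustration; the disclosure does not fix them.

from dataclasses import dataclass, field

@dataclass
class TensorRegister:
    name: str
    shape: tuple            # e.g. (N, C, H, W)
    dtype: str              # e.g. "float16"
    address: int = 0        # on-chip storage offset, if bound
    attrs: dict = field(default_factory=dict)  # further attributes

# Global table: registers visible to all basic blocks of the IR,
# modeling the "global propagation" scope.
registers = {}

def create_tensor_register(name, shape, dtype):
    """Models a 'creation statement' that materializes a target
    tensor register readable from any basic block."""
    reg = TensorRegister(name, shape, dtype)
    registers[name] = reg
    return reg
```

With such a structure, any later pass can read the tensor's attributes from `registers` directly, rather than re-deriving them per basic block.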
In a second aspect, the present disclosure provides a tensor-based compilation device, comprising: a processor; and a memory storing computer program instructions for compiling tensors, which, when executed by the processor, implement the above method and the various embodiments thereof discussed below.
In a third aspect, the present disclosure provides a computer-readable storage medium storing computer program instructions for tensor-based compilation, which, when executed by a processor, implement the above method and the various embodiments thereof discussed below.
With the compilation scheme provided in the above aspects of the present disclosure, target tensor registers can be created in the intermediate representation stage of compilation by executing one or more creation statements for a tensor. Propagation of the tensor data across all basic blocks of the intermediate representation can then be achieved by reading and operating on the data structures in the target tensor registers. Thus, the compiler can operate directly on the tensor associated with the target tensor register without deferring the operation to code execution, thereby optimizing the code and the compilation operation and adapting to programming models that involve tensor operations. Further, the present disclosure skips the steps of storing parameters into registers and reading them back when they are needed, so that, based on the multiple creation statements of the present disclosure, redundant instructions, particularly instructions related to tensor migration, can be eliminated during compilation. Based on this scheme, the generation of redundant instructions can be greatly reduced, improving the overall performance of user programs.
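The redundancy argument above can be illustrated with a toy lowering sketch (an assumption for exposition, not the disclosed compiler): without global tensor registers, each use of a tensor parameter pays a store/load pair, while with them only the real work remains.

```python
# Toy comparison of instruction counts, illustrating why skipping the
# parameter store/reload eliminates tensor-migration instructions.
# The "instructions" are plain strings; this is not a real ISA.

def lower_without_tensor_regs(uses):
    insts = []
    for u in uses:
        # Each use spills the parameter and reloads it before use.
        insts += [f"store {u}, [mem]", f"load {u}, [mem]", f"use {u}"]
    return insts

def lower_with_tensor_regs(uses):
    # Attributes are known from the global tensor register at compile
    # time, so only the actual operation is emitted.
    return [f"use {u}" for u in uses]

uses = ["t0", "t1", "t2"]
```

For three uses, the first lowering emits nine instructions and the second only three; the six eliminated instructions correspond to the redundant tensor-migration traffic discussed above.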
Drawings
The above and additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
FIG. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating the internal structure of a computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the internal architecture of a processor core according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a data write process between processor cores of different clusters according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating the software and hardware architecture for data flow programming according to an embodiment of the present disclosure;
FIG. 7 is a simplified flowchart illustrating a compilation method for tensors according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating a tensor register according to an embodiment of the present disclosure;
FIG. 9 is an exemplary flowchart illustrating the processing of one part of the creation statements according to an embodiment of the present disclosure;
FIG. 10 is an exemplary flowchart illustrating the processing of another part of the creation statements according to an embodiment of the present disclosure;
FIG. 11 is an exemplary flowchart illustrating the processing of use statements according to an embodiment of the present disclosure; and
FIG. 12 is a detailed exemplary flowchart illustrating a compilation method for tensors according to an embodiment of the present disclosure.
Detailed Description
The following description of the embodiments of the present disclosure is made clearly and fully with reference to the accompanying drawings; it is apparent that the described embodiments are some, but not all, embodiments of the present disclosure. All other embodiments obtained by those skilled in the art without inventive effort, based on the embodiments in this disclosure, fall within the scope of this disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification, and drawings of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of this disclosure are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in this disclosure and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to a determination", or "in response to detection", depending on the context. Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit supporting various deep learning and machine learning algorithms, so as to meet intelligent processing requirements in complex fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high requirements on the storage and computing capabilities of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a Wi-Fi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102, and the calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different interface forms, such as a PCIe interface, according to the application scenario.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing the combination processing apparatus 20 in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations and is primarily implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computation; it may interact with the processing device 203 through the interface device 202 to jointly accomplish the user-specified operations.
The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the chip of the computing device 201. Alternatively or additionally, the interface device 202 may read data from the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of general-purpose and/or special-purpose processors, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and the number of processors may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure may be considered to have a single-core structure or a homogeneous multi-core structure on its own. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The DRAM 204 is used to store data to be processed; it is a DDR memory, typically 16 GB or larger, for storing data of the computing device 201 and/or the processing device 203. In one or more implementation scenarios, the memory management scheme of the present application may be applied to the management and maintenance of the DDR, thereby enabling reuse or reclamation operations on events. In this case, the board of the present application may be regarded as the device side of an artificial intelligence computing system.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 processes input data in fields such as computer vision, speech, natural language, and data mining, and is configured as a multi-core hierarchical structure: the computing device 201 is a system-on-chip comprising multiple clusters, each of which in turn includes multiple processor cores. In other words, the computing device 201 is organized in a system-on-chip / cluster / processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 3, computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be multiple external memory controllers 301 (two are shown by way of example) for accessing external memory devices, such as the DRAM 204 in FIG. 2, to read data from or write data to off-chip memory in response to access requests issued by processor cores. The peripheral communication module 302 is configured to receive a control signal from the processing device 203 through the interface device 202 and to activate the computing device 201 to perform a task. The on-chip interconnect module 303 connects the external memory controller 301, the peripheral communication module 302, and the multiple clusters 305, and transfers data and control signals between the modules. The synchronization module 304 is a global barrier controller (GBC) that coordinates the working progress of each cluster to ensure synchronization of information. The clusters 305 are the computing cores of the computing device 201; four are illustratively shown, and as hardware evolves, the computing device 201 of the present disclosure may also include 8, 16, 64, or even more clusters 305.
At the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
Four processor cores 306 are illustratively shown in the figure; the present disclosure does not limit their number. Their internal architecture is shown in fig. 4. Each processor core 306 includes three major modules: a control module 41, an operation module 42, and a storage module 43.
The control module 41 coordinates and controls the operation of the operation module 42 and the storage module 43 to complete deep learning tasks, and comprises an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412. The instruction fetch unit 411 fetches instructions from the processing device 203, and the instruction decode unit 412 decodes the fetched instructions and sends the decoded results to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation, etc.; the matrix operation unit 422 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.
The storage module 43 is used to store or transfer related data, and includes a neuron storage unit (NRAM) 431, a weight storage unit (WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434. The NRAM 431 stores the input data, output data, and intermediate results of computations by the processor core 306; the WRAM 432 stores the weights of the deep learning network; the IODMA 433 controls access between the NRAM 431/WRAM 432 and the DRAM 204 via the broadcast bus 309; and the MVDMA 434 controls access between the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the memory core 307 is primarily used for storage and communication, i.e., storing shared data or intermediate results between the processor cores 306, and performing communication between a cluster 305 and the DRAM 204, between clusters 305, between processor cores 306, etc. In other embodiments, the memory core 307 has scalar operation capability for performing scalar operations.
The memory core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 serves as a high-performance data transfer station: data reused between different processor cores 306 in the same cluster 305 need not be fetched from the DRAM 204 by each processor core 306 individually, but is instead relayed between the processor cores 306 through the SRAM 308. The memory core 307 only needs to quickly distribute the reused data from the SRAM 308 to the multiple processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
Broadcast bus 309, CDMA 310, and GDMA 311 are used for communication among processor cores 306, communication among clusters 305, and data transfer between a cluster 305 and the DRAM 204, respectively. Each is described in turn below.
The broadcast bus 309 is used for high-speed communication between the processor cores 306 in a cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point data transfer (i.e., from a single processor core to a single processor core); multicast transfers a piece of data from the SRAM 308 to a specific set of processor cores 306; and broadcast transfers a piece of data from the SRAM 308 to all processor cores 306, a special case of multicast.
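The three communication modes can be sketched as follows (an illustrative model with assumed names, not the hardware behavior of the broadcast bus 309):

```python
# Toy model of the three inter-core communication modes: unicast to one
# core, multicast to a chosen subset, broadcast to every core in the
# cluster (a special case of multicast).

def transfer(data, cores, mode, targets=None):
    if mode == "unicast":
        recipients = [targets[0]]           # exactly one destination core
    elif mode == "multicast":
        recipients = list(targets)          # a specific set of cores
    elif mode == "broadcast":
        recipients = list(cores)            # every core in the cluster
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {c: data for c in recipients}

cores = ["core0", "core1", "core2", "core3"]
```

Broadcast being a special case of multicast is visible directly: `transfer(d, cores, "multicast", targets=cores)` yields the same result as `transfer(d, cores, "broadcast")`.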
CDMA 310 is used to control access to the SRAM 308 between different clusters 305 within the same computing device 201. Fig. 5 shows a schematic diagram of one processor core writing data to a processor core of another cluster, illustrating the operation of CDMA 310. In this application scenario, the same computing device includes multiple clusters; for ease of illustration, only cluster 0 and cluster 1 are shown, each containing multiple processor cores, of which only processor core 0 (in cluster 0) and processor core 1 (in cluster 1) are drawn. Processor core 0 wants to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master end and CDMA 1 as the slave end; the master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, and the data is transferred to SRAM 1 of cluster 1. The slave then sends a write response B as an acknowledgment. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
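The write sequence above can be modeled as a toy handshake (a simplification with assumed names; real CDMA hardware uses separate address, data, and response channels rather than a single function call):

```python
# Toy model of the inter-cluster write: the master side pushes a write
# address (AW) and write data (W) into the slave cluster's SRAM, and
# the slave answers with a write response (B).

class Cluster:
    def __init__(self):
        self.sram = {}   # address -> data

def cdma_write(src, dst, addr, data):
    # Master sends write address AW and write data W to the slave ...
    dst.sram[addr] = data
    # ... and the slave acknowledges with write response B.
    return "B"

cluster0, cluster1 = Cluster(), Cluster()
resp = cdma_write(cluster0, cluster1, 0x100, [1, 2, 3])
# Processor core 1 then issues a unicast read from its local SRAM 1.
value = cluster1.sram[0x100]
```

Note that the data lands only in the destination cluster's SRAM; the source cluster's SRAM is untouched by the push, matching the master-pushes-to-slave description above.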
Returning to FIG. 3, the GDMA 311 cooperates with the external memory controller 301 to control access from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. From the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be accomplished via two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel transfers data between the DRAM 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel seemingly requires more components and a longer data path, in practice, in some embodiments, its bandwidth is much greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present disclosure may select a data transfer channel based on the hardware conditions.
In other embodiments, the functionality of the GDMA 311 and that of the IODMA 433 may be integrated into the same component. The GDMA 311 and the IODMA 433 are treated as different components merely for convenience of description; implementations whose functions and technical effects are similar to those of the present disclosure fall within its scope of protection. Further, the functions of the GDMA 311, the IODMA 433, the CDMA 310, and the MVDMA 434 may be implemented by the same component, which likewise falls within the scope of the present disclosure as long as the functions and technical effects achieved are similar.
The hardware architecture of the present disclosure and its internal structure are described in detail above in connection with FIGS. 1-5. It is to be understood that the above description is intended to be illustrative and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also modify the board card (or artificial intelligence device) and its internal structure; such changes still fall within the protection scope of the present disclosure. In addition to the hardware architecture shown in FIGS. 1-5, aspects of the present disclosure relate to a software and hardware architecture, which is described below.
FIG. 6 shows a design of a software and hardware architecture for data flow programming in one embodiment of the present disclosure. As can be seen from the figure, the software and hardware architecture in this embodiment may include an AI processor 601, a driver and operating system 602, a compiler and programming language 603, libraries 604, a framework layer 605, and an application layer 606. It will be appreciated that the software and hardware architecture here may be applied to the artificial intelligence computing system of the present application to enable global propagation of tensors, via tensor registers, during compilation.
Specifically, the AI processor 601 (which may be included, for example, in a board as described above in connection with the figures) considers both operation optimization and data-handling optimization in its hardware design. To this end, it employs customized operation units to accelerate computation and uses on-chip storage to accelerate data handling, achieving an extremely high performance and energy-efficiency ratio. In addition, to support various algorithmic optimizations, the AI processor 601 may have customized operation units and instruction sets, where the instruction sets may provide operation instructions (scalar, vector, and/or matrix) of different granularities. Further, when factors such as the access characteristics of the algorithm, hardware cost, and verification difficulty are considered, on-chip storage may be adopted and data handling optimized. In actual operation, the AI processor of the present disclosure may achieve speeds tens of times greater than those of mainstream GPUs (graphics processing units).
The driver and operating system 602 is primarily responsible for implementing the scheduling of tasks on the AI processor 601. The scheduling operation may, for example, implement scheduling according to task priorities, communication and synchronization between multiple devices, and so on. For compiled programs, it may be possible to implement scheduled execution of tasks to be performed on a particular processor through an operating system and drivers, including, but not limited to, the following operations: distributing and releasing the memory of the equipment, realizing data transmission among the equipment, maintaining the task queue, and dispatching the tasks according to the priority, thereby realizing synchronization and cooperation among multiple equipment.
The compiler and programming language 603 may be a suite of assembly languages developed for the instruction set of the AI processor 601. In an application, it may translate deep learning operators developed for the AI processor 601 into combinations of processor instructions in order to invoke the AI processor 601 and use it efficiently. According to the scheme of the disclosure, the compiler can be used in the intermediate representation stage of compilation to optimize the compilation so as to support global creation and propagation of tensors, thereby significantly improving compilation efficiency and optimizing the code.
The libraries 604 may include a runtime library 614 and a machine learning library 624. In one implementation scenario, the aforementioned libraries 604 may use the instruction set of the AI processor 601 and perform partial optimization according to that instruction set to increase the operation speed of operators. The runtime library 614 may be a set of high-performance operator libraries developed specifically for the AI processor 601, and may be used to accomplish interactions between the general-purpose processor and the artificial intelligence processor. Further, the runtime library 614 may also provide a set of interfaces oriented toward the artificial intelligence processor. The machine learning library 624, in turn, may be used to accelerate various machine learning or deep learning algorithms on the artificial intelligence processor. In particular, the machine learning library 624 may provide a set of efficient, general-purpose, flexible, and extensible programming interfaces; upper-level machine learning applications may employ the programming interfaces of various programming frameworks (e.g., PyTorch, TensorFlow, Caffe, MXNet, etc.) directly, or may be programmed directly using the interfaces provided by the machine learning library 624. Additionally, the machine learning library 624 of the present disclosure may facilitate invocation of the hardware platform, while the runtime library 614 may implement some underlying common operators, such as various operations of convolution, pooling, and the like.
The framework layer 605 may add encapsulation for the operators developed for the AI processor, primarily encapsulating the operators of the runtime library 614. In addition, the framework layer 605 may modify relevant parts of task scheduling or memory management. In one application scenario, the framework layer 605 may employ the architecture of a framework such as TensorFlow.
Fig. 7 is a simplified flowchart illustrating a compilation method 700 for tensors according to an embodiment of the present disclosure. As previously described, the compilation method herein may be performed by a computing device. In an artificial intelligence system of a master-slave heterogeneous architecture, the compilation method herein may be performed by the host side.
As shown in fig. 7, at step S702, a data structure based on tensor data attributes is set for tensor global propagation of the compiled intermediate expression. In one embodiment, the tensor global propagation here relates to all basic blocks of the intermediate expression stage. As previously mentioned, in the context of the present disclosure, intermediate expressions (or intermediate code) may be partitioned into basic blocks. Specifically, each basic block is a maximal sequence of statements executed in order in the program, with only one entry and one exit, where the entry is its first statement and the exit is its last statement. Based on this, the tensor global propagation of the present disclosure refers to propagation of tensors within the scope of all basic blocks of the intermediate code.
Here, the tensor data attributes refer to the information attributes of tensor registers. Depending on the hardware architecture of the artificial intelligence system, the tensor interface unit (Tensor Interface Unit, TIU) may contain, for example, 32-46 general-purpose tensor registers (Tensor Register, TR). Each TR represents a three-dimensional tensor mapped into, for example, Random Access Memory (RAM). The three-dimensional tensor is actually stored in a memory space, e.g., a RAM space, specified by a physical address contained in the TR.
Fig. 8 is a schematic diagram illustrating a tensor register according to an embodiment of the present disclosure. As illustrated in fig. 8, the TR includes a valid flag (Active Flag) and the aforementioned tensor data attribute (TensorInfo). The tensor data attribute generally includes address information and dimension information (Dim Info) of the tensor. Specifically, the address information may be the starting physical address (Base Address) of the tensor, and the dimension information may include the size of each dimension (Dim Size), the starting position of each dimension (Dim Offset), and the physical address interval (Dim Stride) between dimensions of the tensor data. The specific meanings are as follows:
Base Address: the starting physical address of the TR, 49 bits (including an off-chip flag indicating whether the data resides on-chip or off-chip).
Dim Offset: the starting position of each dimension, which may include, for example, Dim0 Offset, Dim1 Offset, and Dim2 Offset; the lowest dimension may be 32 bits, while the other dimensions may be 16 bits.
Dim Size: the size of each dimension, which may include, for example, Dim0 Size, Dim1 Size, and Dim2 Size; the lowest dimension may be 32 bits, while the other dimensions may be 16 bits.
Dim strand: the physical address space of each dimension except the highest dimension may include, for example, dim1stride, dim2 stride, where the lowest dimension may be 32 bits and the other dimensions may be 16 bits.
Based on the tensor data attributes (TensorInfo) described above, the data structure of the tensor register may be set to contain three items: parameters associated with the tensor data attributes, parameter valid bits, and an identification. The parameters associated with the tensor data attributes here refer to parameters related to the address information and dimension information of the tensor, including the starting physical address (Base Address) of the tensor data, and the size (dim_size), offset (dim_offset), and stride (dim_stride) of each dimension. Further, a parameter valid bit (valid) may be used to indicate whether the corresponding parameter is valid. In addition, an identification ("tensor_info_id") is used to identify the tensor data. For example, the parameters and parameter valid bits (denoted by the suffix "_valid") may be represented as follows:
1. m_base_addr (IR), bool m_base_addr_valid (IR: the intermediate expression of an instruction)
2. m_dim0_size (OpndVal), bool m_dim0_size_valid (OpndVal: a parameter representation)
3. m_dim1_size (OpndVal), bool m_dim1_size_valid
4. m_dim2_size (OpndVal), bool m_dim2_size_valid
5. m_dim0_offset (OpndVal), bool m_dim0_offset_valid
6. m_dim1_offset (OpndVal), bool m_dim1_offset_valid
7. m_dim2_offset (OpndVal), bool m_dim2_offset_valid
8. m_dim0_stride (OpndVal), bool m_dim0_stride_valid
9. m_dim1_stride (OpndVal), bool m_dim1_stride_valid
A data structure Vector<TensorInfo> m_info is maintained to record each created TensorInfo; the index of a TensorInfo within m_info serves as its tensor_info_id. Based on such a data structure, direct reads and writes of a TensorInfo can be converted into operations on its tensor_info_id.
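As a concrete illustration, the data structure above might be sketched as follows in C++. All names, types, and widths here are assumptions for illustration rather than the disclosure's actual implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the TensorInfo record described above: each
// parameter is paired with a valid bit; types/widths are illustrative.
struct TensorInfo {
    uint64_t base_addr = 0;        bool base_addr_valid = false;
    int64_t  dim_size[3]   = {};   bool dim_size_valid[3]   = {};
    int64_t  dim_offset[3] = {};   bool dim_offset_valid[3] = {};
    int64_t  dim_stride[2] = {};   bool dim_stride_valid[2] = {};
};

// Global table of created TensorInfo records; the index of a record is its
// tensor_info_id. Index 0 is reserved to mean "unknown/invalid", matching
// the convention used later for the tld.tr.xram statement.
std::vector<TensorInfo> m_info(1);

// Record a newly created TensorInfo and return its tensor_info_id, so that
// later reads and writes can operate on the id instead of the record itself.
int create_tensor_info(const TensorInfo& info) {
    m_info.push_back(info);
    return static_cast<int>(m_info.size()) - 1;
}
```

Under this sketch, every pass that would read or write a TensorInfo instead carries the small integer id around, which is what makes propagation across basic blocks cheap.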
By constructing this data structure, the data attributes of the tensor data can be effectively expressed and the tensor data can be conveniently located, which in turn facilitates global propagation in the intermediate expression.
Next, at step S704, one or more creation statements are executed to create a target tensor register containing the data structure, so as to enable tensor global propagation during compilation. The created target tensor register contains the data structure based on tensor data attributes, based on which direct reads and writes of the TensorInfo can be converted into operations on the tensor_info_id. Operating on the tensor_info_id makes it convenient to locate the tensor data, which in turn facilitates global propagation in the intermediate expression.
By the tensor-based compilation method illustrated in fig. 7 above, the target tensor registers associated with tensors may be created during the intermediate expression stage of compilation, particularly by executing one or more creation statements for the tensors. Based on this, the compilation scheme of the present disclosure may enable propagation of tensor data within all basic blocks of the intermediate representation at the compilation stage through read operations on the data structure in the target tensor register. Thus, compared with prior art that fails to support data propagation of tensors, the compilation scheme of the present disclosure may operate directly on tensors associated with target tensor registers without performing corresponding operations at code runtime, thereby optimizing code and compilation operations and better fitting programming models that involve tensor operations. Further, based on the multiple creation statements of the present disclosure, redundant instructions, particularly instructions related to tensor migration, may be eliminated during compilation. Based on this scheme, the generation of redundant instructions can be greatly reduced, thereby improving the overall performance of user programs.
Depending on the application scenario, the creation statements of the present disclosure may include one or more creation statements, such as first through fourth creation statements with different creation manners. In particular, the first creation statement may be used to create a target tensor register from known dimension information, and may be denoted as a createtr statement in the context of the present disclosure. The second creation statement may be used to assign tensor data associated with a tensor register in a particular storage region to the target tensor register so as to create the target tensor register, and may be denoted as a tld.tr.xram statement. The third creation statement may be used to perform a cut on tensor data according to predetermined tensor dimension information and a source tensor register so as to create the target tensor register, and may be denoted as a slicetr statement. The fourth creation statement is used to assign the data structure of a source tensor register to the target tensor register so as to create the target tensor register, and may be denoted as a tmv.tr.tr statement.
An exemplary statement expression of the four creation statements will be given below.
1. The first creation statement, the createtr statement, creates a TR from given dimension information (e.g., the size of each dimension and the offset or stride of each dimension, as mentioned above). An exemplary statement is as follows:
createtr dst_tr,base_addr,
dim0_offset,dim0_size,dim0_stride,
dim1_offset,dim1_size,dim1_stride,
dim2_offset,dim2_size;
2. The second creation statement, the tld.tr.xram statement, may load a source TR (src_tr) from xram (memory, register, or stack space, etc.) and assign the data pointed to by the BaseAddr of the source TR's TensorInfo to the destination TR (dst_tr). An exemplary statement is as follows:
tld.tr.xram dst_tr,src_tr;
3. The third creation statement, the slicetr statement, slices tensor data. That is, slicing the source TR according to given dimension parameter information and the TensorInfo dimension information of the source TR may create a new TR. An exemplary statement is as follows:
slicetr dst_tr,src_tr+[<dim0_offset,dim0_size>,
<dim1_offset,dim1_size>,
<dim2_offset,dim2_size>];
4. The fourth creation statement, the tmv.tr.tr statement, assigns the TensorInfo of a source TR to the target TR. An exemplary statement is as follows:
tmv.tr.tr dst_tr,src_tr
Each of the four creation statements creates a new TR and is therefore also a fixed-value statement, i.e., the statement at which the TR is defined. In the context of the present disclosure, a fixed-value statement is a statement used to define the specific values of a tensor, including tensor data attributes such as its number of dimensions, size, offset, or stride.
In some implementations, an attribute identification item may also be included in each creation statement (including, but not limited to, the four creation statements described above). The attribute identification item may be a data structure or an array; the present disclosure does not limit its representation. It may be used to store certain attributes of the intermediate expression (e.g., indicating the execution mode ("running mode") of the computation, the request mode, whether there is a fusion operation, on-chip or off-chip operation, etc.).
In one implementation scenario, after creating the target tensor register from the creation statement, the method further includes associating an attribute identification item of the creation statement with the created target tensor register. Specifically, the attribute identification item of the creation statement may be represented as an AttachInfo, and the attribute identification item may further include a class member variable m_id for storing a tensor_info_id of the TensorInfo to associate the creation statement with a target register created by the creation statement.
By associating the attribute identification item of the creation statement with the identification of the target tensor register's data structure, the tensor data attributes of the target tensor register can be found from the creation statement, thereby realizing global propagation of tensor data.
In an embodiment, the basic blocks of the intermediate expression stage further comprise usage statements. The scheme of the present disclosure further proposes a tensor-based usage statement that can determine, based on the identification of the data structure of the tensor data attributes, the tensor registers required for performing an operation, so that the created tensor registers are used by means of the usage statement, thereby performing tensor operations between different tensor registers. As one embodiment, a statement using a TR may be represented as use TR.
Taking the first through fourth creation statements as examples, the processing of tensors in basic blocks during compilation, i.e., the processing of the first through fourth creation statements and of the usage statements, will be described in general below to facilitate further understanding of the present scheme.
For the first creation statement, the first creation statements in all basic blocks of the intermediate expression may be traversed in order to create the target TR. Next, the identification of the data structure in the target tensor register (tensor_info_id) may be associated with the attribute identification item (AttachInfo) in the first creation statement. By such an association relationship, the tensor_info_id can be operated on instead of directly reading and writing the TensorInfo.
For the second creation statement, the second creation statements in all basic blocks of the intermediate expression may be traversed in order to create the target TR. Then, since the information in the block of space pointed to by BaseAddr is unknown (it may be data, dimension information, or something else), after assigning it to the destination TR, the identification of the data structure in the destination TR can be set to unknown or invalid. For example, the tensor_info_id of the target TR may be set to 0, indicating that the TensorInfo of the current TR is unknown or invalid. In one possible implementation, the identification of the data structure in the target tensor register (tensor_info_id) may then also be associated with the attribute identification item (AttachInfo) in the second creation statement, i.e., the corresponding location of the attribute identification item in the second creation statement is invalid or set to 0.
For the third creation statement, the third creation statements in all basic blocks of the intermediate expression may be traversed in order to create the target TR. The identification of the data structure in the target TR may then be associated with the attribute identification item in the third creation statement. In one implementation scenario, traversing the third creation statements in all basic blocks of the intermediate expression to create the target TR may specifically include the following operations:
Determining all fixed-value point statements of the source TR via the definition-use chain (the "DU" chain) established at the specific optimization level, wherein the fixed-value point statements include creation statements; determining the identification of the data structure associated with each fixed-value point statement according to the attribute identification item in that statement; obtaining the tensor data attributes of the source TR according to the identification; and creating the target TR according to the tensor data attributes of the source TR and the parameters of the third creation statement. Regarding the aforementioned DU chain, it is a sparse representation of variable data-flow information. The DU chain of a variable connects each definition (fixed-value point) of the variable to the places in the code where it is used. In abstract terms, the DU chain is a function from variable pairings to sets of basic-block locations, one set per fixed-value point. Concretely, the DU chain is generally represented as a linked list, and can be constructed by solving the reaching-definitions data-flow problem (a classical data-flow analysis) for the procedure and then building the linked list from the obtained information. Once the linked list is established, the reaching-definitions bit vectors can be released, because the DU chain represents the same information.
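A minimal sketch of such a fixed-value point (define_list) lookup follows, assuming a toy statement representation; the real DU chain would be built from the reaching-definitions analysis described above, and all names here are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy statement: an opcode, the TR it defines, and its tensor_info_id.
struct Stmt { std::string op; int dst_tr; int tensor_info_id; };

// Build, for every TR, the list of indices of statements that define it
// (its fixed-value points). A use of a TR then consults this list to find
// the TensorInfo identifications associated with each definition.
std::map<int, std::vector<size_t>> build_du(const std::vector<Stmt>& stmts) {
    std::map<int, std::vector<size_t>> defs;
    for (size_t i = 0; i < stmts.size(); ++i)
        defs[stmts[i].dst_tr].push_back(i);   // one more fixed-value point
    return defs;
}
```

For example, two createtr statements defining the same tr1 would yield a define_list of size two for tr1, which is exactly the multiple-definition situation handled by the conflict processing described later.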
In some scenarios, for a plurality of consecutive third creation statements where the target TR of a previous third creation statement is the source TR of the next, an additional operation may be performed when executing these statements: the initial source TR is determined in a recursive manner based on the definition-use chain so as to obtain its tensor data attributes, and the tensor data attributes of the respective target TRs are then determined by backtracking from the tensor data attributes of the initial source TR. Details of this backtracking determination of tensor data attributes will be described later in connection with steps S1012 and S1014 of fig. 10.
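The recursive backtracking for chained slicetr statements could be sketched as follows. The chain tr0 → tr1 → tr2 and the id numbering are assumed example data, and the derivation of each sliced TensorInfo from the source's attributes is elided:

```cpp
#include <cassert>
#include <map>

// Assumed example chain: slicetr tr1, tr0+[...]; slicetr tr2, tr1+[...].
struct SliceDef { int src_tr; };
std::map<int, SliceDef> slice_def = {{1, {0}}, {2, {1}}};

// Known TensorInfo ids: only tr0 (created by createtr) is known initially.
std::map<int, int> info_id = {{0, 1}};
int next_id = 2;   // ids 0 (invalid) and 1 (tr0) are taken

// If tr's TensorInfo is unknown, recurse into the statement defining its
// source TR until a TR with known attributes is reached, then assign ids
// while backtracking. A real pass would derive each new TensorInfo from the
// source's attributes plus the slice parameters; here we only record the id.
int resolve_info(int tr) {
    auto it = info_id.find(tr);
    if (it != info_id.end()) return it->second;
    resolve_info(slice_def.at(tr).src_tr);   // recursive lookup of the source
    info_id[tr] = next_id++;                 // filled in on the way back
    return info_id[tr];
}
```

Resolving tr2 first recurses down to tr0, then assigns fresh ids to tr1 and tr2 in backtracking order, mirroring the S1012/S1014 loop of fig. 10.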
For the fourth creation statement, the fourth creation statements in all basic blocks of the intermediate expression may be traversed to create the target TR, and the identification of the data structure in the target TR is associated with the attribute identification item in the fourth creation statement. In one implementation scenario, to traverse the fourth creation statements in all basic blocks of the intermediate expression to create the target tensor register, the present disclosure proposes determining all fixed-value point statements of the source TR via the definition-use chain (i.e., the aforementioned DU chain) established at the specific optimization level, wherein the fixed-value point statements include creation statements. Then, the identification of the tensor data associated with each fixed-value point statement may be determined from the attribute identification item in that statement. Finally, this identification is taken as the identification of the data structure in the target TR, thereby creating the target TR.
Regarding the traversal order of the exemplary creation statements in the intermediate expression, the scheme of the present disclosure considers that the two creation statements slicetr and tmv.tr.tr are both fixed-value statements of a TR and usage statements of a TR, whereas the scheme of the present disclosure requires that statements using a TR be executed only after the fixed-value statements of that TR have been traversed (i.e., usage statements are executed after creation statements); otherwise, data interaction errors would occur. The present disclosure therefore proposes that the creation statements createtr and tld.tr.xram be executed before the slicetr and tmv.tr.tr statements.
By associating the attribute identification items of the four creation statements with the identification of the target tensor register's data structure, the tensor data attributes of the target tensor register can be found from the creation statement, realizing global propagation of tensor data.
With respect to the aforementioned usage statements, the solution of the present disclosure proposes traversing all usage statements in all basic blocks of the intermediate expression. Next, for each TR to which a usage statement relates, the corresponding fixed-value point statements are found via the definition-use chain, wherein the fixed-value point statements include creation statements. Thereafter, the identification of the tensor data associated with each fixed-value point statement may be determined from the attribute identification item in that statement. Finally, the tensor data attributes of the tensor register are obtained according to the identification, and execution of the usage statement is completed.
In one implementation scenario, when multiple fixed-value point statements exist while executing a usage statement, the present disclosure proposes that the tensor data attributes of the respective TRs of the multiple fixed-value point statements may be compared to determine the common and differing attribute parameters. Then, the common attribute parameters may be assigned to the TR at the point of use. Correspondingly, the differing attribute parameters may be set to invalid before being assigned to the TR at the point of use.
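This comparison of attribute parameters across multiple fixed-value points can be sketched as follows; only a couple of assumed fields are shown, whereas the real TensorInfo carries the full parameter set:

```cpp
#include <cassert>
#include <cstdint>

// Reduced TensorInfo with just two parameters and their valid bits.
struct Info {
    int64_t dim0_size = 0;   bool dim0_size_valid = false;
    int64_t dim0_offset = 0; bool dim0_offset_valid = false;
};

// Merge the TensorInfo of two fixed-value points for the TR at a point of
// use: a parameter stays valid only if both definitions agree on it;
// otherwise its valid bit is cleared (conservative handling).
Info merge(const Info& a, const Info& b) {
    Info r;
    r.dim0_size_valid = a.dim0_size_valid && b.dim0_size_valid &&
                        a.dim0_size == b.dim0_size;
    if (r.dim0_size_valid) r.dim0_size = a.dim0_size;
    r.dim0_offset_valid = a.dim0_offset_valid && b.dim0_offset_valid &&
                          a.dim0_offset == b.dim0_offset;
    if (r.dim0_offset_valid) r.dim0_offset = a.dim0_offset;
    return r;
}
```

For instance, two definitions agreeing on dim0_size = 8 but differing in dim0_offset would leave the size valid at the point of use while invalidating the offset.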
In one implementation scenario, the usage statements may also include the third creation statement and the fourth creation statement. The third and fourth creation statements each create a new TR and are thus creation statements; at the same time, both of them also use an existing TR and are therefore also usage statements.
In a statement using a TR, the tensor register TR may be subject to a cut ("slice") operation (the third creation statement), for example as shown in the following usage statement:
tadd tr3,tr1,tr0+[<...>,<...>,<...>];
Here the expression "tr0+[<...>,<...>,<...>]" performs a slicing operation on tr0, and "tadd" means that tr1 and the tensor "sliced" from tr0 are added, with the result assigned to tr3. This usage statement is equivalent to the following two statements:
slicetr tr2, tr0+[<...>,<...>,<...>]; (indicating that partial tensor data is "sliced" out of tr0 and assigned to tr2)
tadd tr3, tr1, tr2; (indicating that tr1 and tr2 are added and the result assigned to tr3)
In one implementation scenario, since the third and fourth creation statements are simultaneously usage statements, data interaction with the other creation statements during processing could lead to errors; therefore, the other creation statements are processed first during execution, and the third or fourth creation statement is processed afterwards.
The compiling scheme of the present disclosure has been described above in connection with figs. 7 and 8, and in particular the multiple creation statements and usage statements that create and use TRs have been described in detail. Assuming that the four creation statements createtr, tld.tr.xram, slicetr, and tmv.tr.tr, together with usage statements, are included in the basic blocks of the intermediate expression stage, the compilation process of the present disclosure based on these four statements will be described in detail with reference to figs. 9 to 12. It is to be understood that, for exemplary purposes only, the method of the present disclosure is implemented at the compiler's O1 optimization level and by means of DU relations. Based on the present disclosure, those skilled in the art will also appreciate that the aspects of the present disclosure can be utilized to achieve global propagation of tensors at other suitable optimization levels.
Fig. 9 is an exemplary flowchart illustrating a process 900 of partially processing creation statements according to an embodiment of the disclosure. It will be appreciated that the aforementioned creation statements createtr, tld.tr.xram, slicetr, and tmv.tr.tr may all be regarded as fixed-value statements of a TR. Considering that the two creation statements slicetr and tmv.tr.tr are both fixed-value statements and usage statements of a TR, the scheme of the present disclosure needs to execute the statements using a TR after traversing the fixed-value statements of that TR (i.e., usage statements are executed after creation statements); therefore, the algorithm flow shown in fig. 9 processes the createtr and tld.tr.xram statements first.
As shown in fig. 9, at step S902, the algorithm of the present disclosure may first traverse the createtr statements within all basic blocks, thereby creating TRs whose TensorInfo contains known dimension information. Next, the TensorInfo may be associated with the AttachInfo in the createtr statement. For this purpose, for example, a class member variable m_id may be added to the AttachInfo to store the tensor_info_id of the TensorInfo.
After executing the createtr statements as above, the algorithm of the present disclosure then traverses the tld.tr.xram statements within all basic blocks at step S904. From semantic analysis, the information in the space pointed to by the BaseAddr is unknown; it may be data, dimension information, or other information. In view of this, after assigning it to the target TR, the tensor_info_id of the target TR may be set to 0, where 0 indicates that the TensorInfo of the current TR is unknown or invalid. In this embodiment, the tld.tr.xram statements may equally be processed first and the createtr statements later; that is, the processing order of these two statements is not limited.
Thereafter, the method of the present disclosure can process the two statements slicetr and tmv.tr.tr in the flow shown in fig. 10. Fig. 10 is an exemplary flowchart illustrating a process 1000 of the slicetr and tmv.tr.tr creation statements, according to an embodiment of the disclosure.
As shown in fig. 10, at step S1002, the slicetr statement is processed first. Specifically, as previously described, the method of the present disclosure may traverse the slicetr statements within all basic blocks, whereby at step S1006, all fixed-value points of the source TR (src_tr) (also referred to as the define_list, i.e., the locations defining the TR), namely the DU relation, may be obtained. Thereafter, at step S1008, the statements in the define_list are traversed, and at step S1010, the TensorInfo of each statement in the define_list is acquired. Specifically, the TensorInfo associated with a fixed-value point statement may be found from the AttachInfo of that statement's IR, being derived via the tensor_info_id.
For the case where the TensorInfo is not acquired, a recursive call to the current function may be performed at step S1012 in order to find the initial source TR and acquire its TensorInfo. This recursive-call scenario occurs because of a characteristic of the slicetr statement: the target TR of one slicetr statement may be the source TR of the next, for example:
slicetr tr1, tr0+ [ < … >, < … >, < … > ]; (… is an omitted parameter)
slicetr tr2,tr1+[<…>,<…>,<…>];
…;
slicetr trn,trm+[<…>,<…>,<…>];
In view of the statement scenario above, the solution of the present disclosure proposes finding the initial source TR in a recursive manner, thereby obtaining the TensorInfo of that source TR, and then backtracking along the path and processing the statements one by one. That is, as shown at step S1014, the define_list is traversed and the TensorInfo of the statements therein is acquired, and then the define_list is traversed again.
Through the above-described recursive calls and traversals, at step S1016, the acquired TensorInfo, in particular any parameter conflicts, are processed. Such parameter conflicts occur because, in one scenario, the source TR of a statement has multiple fixed-value points (i.e., the source TR has multiple definitions), so that under static compilation it is not known which fixed-value point will ultimately take effect. For example, the following statements may exist in different basic blocks:
BB0: createtr tr1, …; (… denotes other omitted parameters; BB denotes Basic Block)
BB1:createtr tr1,…;
BB2:slicetr tr2,tr1+[…];
Since the fixed-value points (BB0 and BB1) and the point of use (BB2) of tr1 are in different BBs, the compiler cannot determine at static-compilation time which fixed-value point is effective, because the source code generating the different BBs may contain conditional statements such as if-else or switch, and the result of the condition is only known at runtime. Although the effective fixed-value point cannot be determined, as much information as possible can still be obtained from these fixed-value points.
To this end, the present disclosure proposes comparing the conflicting parameters. That is, the TensorInfo of all the fixed-value point TRs is compared to find the dimension information that is the same among them, and this same dimension information is retained when assigning the TensorInfo of the TR at the point of use. This indicates that the current parameter is valid; no matter how the program executes at runtime, the value of this parameter is the same, because its definition is identical across the different fixed-value points. Correspondingly, for dimension information found to differ, the valid bit of the corresponding parameter in the TensorInfo may be set to invalid when assigning the TensorInfo of the TR at the point of use. This indicates that the current parameter is invalid; because the parameters defined at the different fixed-value points are not identical, the compiler must process them conservatively.
After the above conflict processing, at step S1018, a new TR (the target TR, i.e., dst_tr) may be created from the TensorInfo and the parameter information of the statement itself. Finally, at step S1020, the tensor_info_id of the newly created target TR's TensorInfo is associated with the AttachInfo of the slicetr statement, considering that the slicetr statement itself is also a fixed-value statement of a TR.
After executing the slicetr statements as described above, the algorithm of the present disclosure may execute the tmv.tr.tr statement at step S1004. The steps involved in executing this statement are largely the same as those of the slicetr statement; for example, the algorithm of the present disclosure traverses the tmv.tr.tr statements in all basic blocks and finds all fixed-value points of the source TR according to the DU relation. The tensor_info_id associated with a fixed-value point statement is found from the AttachInfo of that statement, and then the tensor_info_id of the source TR is directly taken as the tensor_info_id of the target TR. This process differs from slicetr in that tmv.tr.tr corresponds to a move of the TR, i.e., from a source TR to a new (target) TR, which requires no slicing of or computation on the source TR. Finally, the tensor_info_id is associated with the AttachInfo of the tmv.tr.tr statement. It can be seen that execution of the tmv.tr.tr statement does not involve the operations of steps S1012, S1014, and S1016.
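In contrast with the slicetr handling, the tmv.tr.tr handling reduces to copying the id, which might be sketched as follows (the seed data and names are assumptions for illustration):

```cpp
#include <cassert>
#include <map>

// tensor_info_id recorded for each TR at its fixed-value point; tr0 is
// assumed to have been created earlier with id 5 for illustration.
std::map<int, int> tr_info_id = {{0, 5}};

// tmv.tr.tr dst, src: no slicing or computation is performed; the source
// TR's tensor_info_id is reused directly as the target TR's id.
int process_tmv_tr_tr(int dst_tr, int src_tr) {
    tr_info_id[dst_tr] = tr_info_id.at(src_tr);
    return tr_info_id[dst_tr];
}
```

Because only the id is copied, steps S1012 through S1016 (recursion and conflict processing of derived attributes) are not needed for this statement.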
It should be understood that in the illustrated embodiment, the tmv.tr.tr statement may equally be processed first and the slicetr statement later; that is, the present disclosure does not limit the processing order of these two statements.
After the execution of slicetr and tmv.tr.xram described above, the algorithm of the present disclosure may process the use TR statement, as shown at step S1102 in fig. 11.
Fig. 11 is an exemplary flowchart illustrating a process 1100 for a use statement in accordance with an embodiment of the present disclosure. As shown in fig. 11, at step S1104, the use TR statements within all basic blocks are traversed to obtain all fixed-value points of the source TR (src_tr), i.e., the DU relation. Thereafter, at step S1106, the TensorInfo of each statement in the define_list is acquired. Specifically, for a TR in the define_list, the tensor_info_id of the TR can be obtained through the AttachInfo of the corresponding statement, and the TensorInfo of the TR can then be obtained from it. Thereafter, identical and differing parameters are processed at step S1110 in a manner similar to step S1016 in fig. 10, thereby obtaining the parameters after the conflict has been resolved. Next, at step S1116, a new TR is created, and at step S1118, the tensor_info_id of the TR is associated with the AttachInfo of the current use statement for subsequent use.
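Process 1100 can be sketched end to end under the same illustrative structures used above (a table from tensor_info_id to TensorInfo, AttachInfo as a dictionary); `resolve_conflicts` here stands in for the comparison of step S1016/S1110 and is a hypothetical name.

```python
# Illustrative sketch of process 1100: for a use statement, walk the DU
# relation to the define_list, look each TR's TensorInfo up through its
# statement's AttachInfo, resolve conflicts, and attach the resulting id
# to the current use statement.
def resolve_conflicts(infos):
    # keep only the parameters on which every definition agrees
    common = dict(infos[0])
    for info in infos[1:]:
        common = {k: v for k, v in common.items() if info.get(k) == v}
    return common

def process_use_stmt(define_list, info_table, attach_of_use):
    infos = [info_table[stmt["tensor_info_id"]] for stmt in define_list]
    resolved = resolve_conflicts(infos)
    new_id = max(info_table) + 1 if info_table else 1
    info_table[new_id] = resolved
    attach_of_use["tensor_info_id"] = new_id
    return new_id
```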
Fig. 12 is a detailed exemplary flowchart illustrating a compilation method 1200 for tensors according to an embodiment of the present disclosure. It will be appreciated that the compilation method 1200 includes the operational steps previously described and illustrated in fig. 9-11. Therefore, the descriptions previously given with reference to fig. 9 to 11 also apply below and will not be repeated here.
As shown in fig. 12, the method flow starts at step S1202, and at step S1204, a creation statement is first executed. When the creation statement is createtr, the flow proceeds to step S1206 to create a TR, and at step S1208 the newly created TR is associated with its tensor_info_id. When the creation statement is tld.tr.xram, the tensor_info_id is set to 0 at step S1210.
Next, at step S1212, the slicetr statement is processed, and when the tensor_info_id of the current statement cannot be acquired, a recursive function is called at step S1214. Next, at step S1216, a define_list is obtained from the DU relation and traversed. At step S1218, a tensor_info_id is acquired from the define_list. When it still cannot be acquired, the flow returns to step S1214 to continue the recursive operation described above.
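The recursion at step S1214 might be sketched as follows, under the same assumed structures; statement and DU-chain representations here are hypothetical. When an id is still missing, the function descends into the define_list of the source TR until a statement with a known id (e.g. a createtr or tld.tr.xram root) is reached, then assigns ids on the way back, matching the backtracking described later for consecutive slicetr statements.

```python
# Illustrative sketch of the recursive acquisition at step S1214: resolve
# all definitions of a statement's source TR first, then assign an id to
# the statement itself while unwinding.
def acquire_id(stmt, du_chain, next_id):
    if stmt.get("tensor_info_id") is not None:
        return stmt["tensor_info_id"], next_id
    # recurse into every fixed-value point of the source TR
    for def_stmt in du_chain.get(stmt["src_tr"], []):
        _, next_id = acquire_id(def_stmt, du_chain, next_id)
    stmt["tensor_info_id"] = next_id      # assign on the way back
    return next_id, next_id + 1
```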
At step S1220, processing of the parameter conflicts among the TensorInfo of all fixed-value points TR is started. Specifically, a define_list is obtained from the DU relation and traversed at step S1234. At step S1236, the first tensor_info_id in the define_list is acquired. If this tensor_info_id equals 0, 0 is returned at step S1240. Otherwise, a separate TensorInfo is created at step S1238 as the TensorInfo to be finally returned. Next, at step S1242, the define_list is traversed starting from its second element, and at step S1244 pairwise comparisons are performed to determine the identical and differing parameters, e.g., identical or differing dimension information. Finally, a new tensor_info_id is returned at step S1246.
Thereafter, for the slicetr statement, at step S1222, the new tensor_info_id is associated with the AttachInfo of src_tr. At step S1224, a new target TR is created based on the TensorInfo of the source TR after the conflict has been handled, and the tensor_info_id of the target TR is returned. Next, at step S1226, the tensor_info_id of the target TR is associated with the AttachInfo in the slicetr statement. For the tmv.tr.xram statement, step S1226 is performed directly, associating the tensor_info_id with the AttachInfo of the tmv.tr.xram statement.
Next, at step S1228, the flow processes the statement that uses the TR, and at step S1230 performs conflict processing over all fixed-value points, which is similar to the above description and is not repeated here. Thereafter, at step S1232, the new tensor_info_id obtained after the processing is associated with the AttachInfo of the IR of the current use statement. Finally, the flow ends at step S1234.
Based on the above description, those skilled in the art will appreciate that the present application achieves compilation optimization by software means (i.e., program instructions). In view of this, the present application also discloses an apparatus comprising a processor and a memory. In particular, the memory may store program instructions for tensor compilation in an artificial intelligence computing system which, when executed by the processor, implement the method steps of the application described in connection with fig. 7-12. Additionally, since aspects of the present application may be implemented by computer program instructions, the present application also discloses a computer-readable storage medium or computer program product having stored thereon a computer program/instructions for tensor compilation, thereby implementing the method steps described in connection with fig. 7-12.
The aspects of the present disclosure are described in detail above with reference to the accompanying drawings. According to different application scenarios, the devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing apparatuses, robots, computers, printers, scanners, tablet computers, intelligent terminals, PC devices, internet of things terminals, mobile terminals, cell phones, automobile recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vision terminals, autopilot terminals, vehicles, household appliances, and/or medical devices. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The apparatus or device of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like.
Further, the device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a high power device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a low power device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the actions described. Thus, one of ordinary skill in the art will appreciate, in light of the disclosure or teachings herein, that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, i.e., the acts or modules involved are not necessarily required for implementing some aspect or aspects of this disclosure. In addition, depending on the scenario, the descriptions of different embodiments of the present disclosure have different emphases. In view of this, those skilled in the art will appreciate that for portions of one embodiment of the disclosure that are not described in detail, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings herein, one of ordinary skill in the art will appreciate that the several embodiments disclosed in this disclosure may also be implemented in other ways not disclosed herein. For example, with respect to the units in the foregoing apparatus or device embodiments, the division of units herein is based on logical function, and another division manner may exist in actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the disclosure. In addition, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. The integrated unit, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("ROM"), a random access memory ("RAM"), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs, etc. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), which may be, for example, resistive random access memory ("RRAM"), dynamic random access memory ("DRAM"), static random access memory ("SRAM"), enhanced dynamic random access memory ("EDRAM"), high bandwidth memory ("HBM"), hybrid memory cube ("HMC"), ROM, RAM, etc.
The foregoing may be better understood in light of the following clauses:
clause A1, a compiling method for tensors, comprising:
setting a data structure based on tensor data attributes for tensor global propagation of compiled intermediate expressions, wherein the tensor global propagation is related to all basic blocks of the intermediate expression stage; and
one or more create statements are executed to create a target tensor register containing the data structure to enable global propagation of tensors in compilation.
Clause A2, the compiling method according to clause A1, wherein the tensor data attribute comprises address information and dimension information of tensor data.
Clause A3, the compiling method according to clause A2, wherein the address information comprises a starting physical address of the tensor data, and the dimension information comprises a dimension size of the tensor data, a starting position of the dimension, and a physical address interval between the dimensions.
Clause A4, the compilation method of any of clauses A1-A3, wherein the data structure comprises parameters, parameter valid bits, and identifications associated with the tensor data attributes.
Clause A5, the compiling method according to clause A1, further comprising:
A use statement is executed based on the identification of the data structure to perform tensor operations between different tensor registers using the created tensor registers.
Clause A6, the compiling method according to clause A1, wherein the creation statement comprises one or more of the following creation statements:
a first creation statement for creating a target tensor register;
a second creation statement for assigning tensor data associated with the tensor register at the specific storage area to the target tensor register to create the target tensor register;
a third creation statement for performing segmentation on the tensor data according to the predetermined tensor dimension information and the source tensor register to create a target tensor register; and
a fourth creation statement for assigning a data structure of a source tensor register to a target tensor register to create the target tensor register.
Clause A7, the compiling method according to clause A6, further comprising:
traversing a first creation statement in all basic blocks of the intermediate expression so as to create a target tensor register; and
the identification of the data structure in the target tensor register is associated with an attribute identification term in the first creation statement.
Clause A8, the compiling method according to clause A6, further comprising:
traversing a second creation statement in all basic blocks of the intermediate expression to create a target tensor register; and
and setting the identification of the data structure in the target tensor register to be unknown or invalid.
Clause A9, the compiling method according to clause A7 or 8, further comprising:
traversing a third creation statement in all basic blocks of the intermediate expression so as to create a target tensor register; and
and associating the identification of the data structure in the target tensor register with the attribute identification item in the third creation statement.
Clause a10, the compilation method of clause A9, wherein traversing the third creation statement in all basic blocks of the intermediate expression to create the target tensor register comprises:
determining all fixed-value point statements of the source tensor register according to a definition-use chain established at a specific optimization level, wherein the fixed-value point statements include a creation statement;
determining the identification of a data structure associated with the fixed-value point statement according to the attribute identification item in the fixed-value point statement;
obtaining tensor data attributes of the source tensor register according to the identification; and
The target tensor register is created according to the tensor data attribute and parameters of a third creation statement.
Clause a11, the compiling method according to clause a10, wherein when a plurality of consecutive third creation statements exist and the target tensor register of a previous third creation statement is the source tensor register of a next third creation statement, the method further comprises, in executing the plurality of consecutive third creation statements:
determining an initial source tensor register in a recursive manner based on the defined usage chain to obtain tensor data attributes; and
and sequentially determining the tensor data attribute of each target tensor register in a backtracking mode based on the tensor data attribute of the initial source tensor register.
Clause a12, the compiling method according to clause A7 or A8, further comprising:
traversing a fourth creation statement in all basic blocks of the intermediate expression so as to create a target tensor register; and
and associating the identification of the data structure in the target tensor register with the attribute identification item in the fourth creation statement.
Clause a13, the compilation method of clause a12, wherein traversing the fourth creation statement in all basic blocks of the intermediate expression to create the target tensor register comprises:
Determining all fixed-value point statements of a source tensor register according to a definition-use chain established at a specific optimization level, wherein the fixed-value point statements comprise a creation statement;
determining the identification of tensor data associated with the fixed-value point statement according to the attribute identification item in the fixed-value point statement; and
the identification is used as the identification of the data structure in the target tensor register to create the target tensor register.
Clause a14, the compiling method according to clause A6, wherein when executing the use statement, the method comprises:
traversing all use statements in all basic blocks of the intermediate expression;
for each tensor register involved in the use statement, finding the corresponding fixed-value point statement according to the definition-use chain, wherein the fixed-value point statement comprises a creation statement;
determining the identification of tensor data associated with the fixed-value point statement according to the attribute identification item in the fixed-value point statement; and
and obtaining tensor data attributes of the tensor registers according to the identification.
Clause a15, the compiling method according to clause a14, wherein the use statement includes a third creation statement, and when a plurality of fixed-value point statements exist in executing the use statement, the method further comprises:
comparing the tensor data attributes of the respective tensor registers of the plurality of fixed-value point statements to determine the common and differing attribute parameters; and
assigning the common attribute parameters to the tensor register of the use point, and setting the differing attribute parameters to invalid before assigning them to the tensor register of the use point.
Clause a16, an apparatus for compilation of tensors, comprising:
a processor; and
a memory storing computer program instructions for compiling a tensor, which, when executed by a processor, cause a method according to any of clauses A1-a15 to be implemented.
A computer readable storage medium of clause a17, storing computer program instructions for compiling a tensor, which, when executed by a processor, cause the method according to any of clauses A1-a15 to be implemented.
While the embodiments of the present disclosure are described above, the descriptions are merely examples employed to facilitate understanding of the present disclosure, and are not intended to limit the scope and application of the present disclosure. Any person skilled in the art to which this disclosure pertains will appreciate that numerous modifications and variations in form and detail can be made without departing from the spirit and scope of the disclosure, but the scope of the disclosure is to be determined by the appended claims.

Claims (17)

1. A tensor-based compilation method, comprising:
setting a data structure based on tensor data attributes for tensor global propagation of compiled intermediate expressions, wherein the tensor global propagation is related to all basic blocks of the intermediate expression stage; and
one or more create statements are executed to create a target tensor register containing the data structure to enable global propagation of tensors in compilation.
2. The compiling method of claim 1, wherein the tensor data attribute includes address information and dimension information of tensor data.
3. The compiling method of claim 2, wherein the address information includes a start physical address of the tensor data, and the dimension information includes a size of each dimension of the tensor data, a start position of each dimension, and a physical address interval between each dimension.
4. A compiling method according to any of claims 1-3, wherein the data structure comprises a parameter, a parameter valid bit and an identity associated with the tensor data attribute.
5. The compiling method of claim 1, further comprising:
a use statement is executed based on the identification of the data structure to perform tensor operations between different tensor registers using the created tensor registers.
6. The compilation method of claim 1, wherein the creation statement comprises one or more of the following creation statements:
a first creation statement for creating a target tensor register;
a second creation statement for assigning tensor data associated with the tensor register at the specific storage area to the target tensor register to create the target tensor register;
a third creation statement for performing segmentation on the tensor data according to the predetermined tensor dimension information and the source tensor register to create a target tensor register; and
a fourth creation statement for assigning a data structure of a source tensor register to a target tensor register to create the target tensor register.
7. The compiling method of claim 6, further comprising:
traversing a first creation statement in all basic blocks of the intermediate expression so as to create a target tensor register; and
the identification of the data structure in the target tensor register is associated with an attribute identification term in the first creation statement.
8. The compiling method of claim 6, further comprising:
traversing a second creation statement in all basic blocks of the intermediate expression to create a target tensor register; and
And setting the identification of the data structure in the target tensor register to be unknown or invalid.
9. The compiling method of claim 7 or 8, further comprising:
traversing a third creation statement in all basic blocks of the intermediate expression so as to create a target tensor register; and
and associating the identification of the data structure in the target tensor register with the attribute identification item in the third creation statement.
10. The compilation method of claim 9, wherein traversing a third create statement in all basic blocks of the intermediate expression to create the target tensor register comprises:
determining all fixed-value point statements of the source tensor register according to a definition-use chain established at a specific optimization level, wherein the fixed-value point statements include a creation statement;
determining the identification of a data structure associated with the fixed-value point statement according to the attribute identification item in the fixed-value point statement;
obtaining tensor data attributes of the source tensor register according to the identification; and
the target tensor register is created according to the tensor data attribute and parameters of a third creation statement.
11. The compiling method of claim 10, wherein when a plurality of consecutive third creation statements exist and the target tensor register of a previous third creation statement is the source tensor register of a next third creation statement, the method further comprises, in executing the plurality of consecutive third creation statements:
Determining an initial source tensor register in a recursive manner based on the defined usage chain to obtain tensor data attributes; and
and sequentially determining the tensor data attribute of each target tensor register in a backtracking mode based on the tensor data attribute of the initial source tensor register.
12. The compiling method of claim 7 or 8, further comprising:
traversing a fourth creation statement in all basic blocks of the intermediate expression so as to create a target tensor register; and
and associating the identification of the data structure in the target tensor register with the attribute identification item in the fourth creation statement.
13. The compilation method of claim 12, wherein traversing a fourth create statement in all basic blocks of the intermediate expression to create the target tensor register comprises:
determining all fixed-value point statements of a source tensor register according to a definition-use chain established at a specific optimization level, wherein the fixed-value point statements comprise a creation statement;
determining the identification of tensor data associated with the fixed-value point statement according to the attribute identification item in the fixed-value point statement; and
the identification is used as the identification of the data structure in the target tensor register to create the target tensor register.
14. The compiling method according to claim 6, wherein in executing the use statement, the method includes:
traversing all use statements in all basic blocks of the intermediate expression;
for each tensor register involved in the use statement, finding the corresponding fixed-value point statement according to the definition-use chain, wherein the fixed-value point statement comprises a creation statement;
determining the identification of tensor data associated with the fixed-value point statement according to the attribute identification item in the fixed-value point statement; and
and obtaining tensor data attributes of the tensor registers according to the identification.
15. The compiling method of claim 14, wherein the use statement includes a third creation statement, and when a plurality of fixed-value point statements exist in executing the use statement, the method further comprises:
comparing the tensor data attributes of the respective tensor registers of the plurality of fixed-value point statements to determine the common and differing attribute parameters; and
assigning the common attribute parameters to the tensor register of the use point, and setting the differing attribute parameters to invalid before assigning them to the tensor register of the use point.
16. An apparatus for tensor-based compilation, comprising:
A processor; and
a memory storing computer program instructions for compiling tensors, which, when executed by a processor, cause a method according to any one of claims 1-15 to be implemented.
17. A computer readable storage medium storing computer program instructions for tensor-based compilation, which, when executed by a processor, cause the method according to any one of claims 1-15 to be implemented.
CN202210503503.5A 2022-05-09 2022-05-09 Tensor-based compilation method, device and computer-readable storage medium thereof Pending CN117075903A (en)

Publications (1)

Publication Number Publication Date
CN117075903A true CN117075903A (en) 2023-11-17



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination