CN110764744B - Intermediate representation generation method and device for neural network calculation - Google Patents
Intermediate representation generation method and device for neural network calculation
- Publication number
- CN110764744B CN201810829863.8A
- Authority
- CN
- China
- Prior art keywords
- intermediate representation
- computing
- graph
- operations
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/37—Compiler construction; Parser generation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
The disclosure provides an intermediate representation generation method and device for neural network computation. The method comprises the following steps: parsing an input model file to obtain topology information of the neural network; and using the feature map information and the computing operation information in the topology information as nodes and edges, respectively, to generate a first intermediate representation in the form of a graph. By introducing a first IR in which feature maps are the nodes and computing operations are the edges, subsequent graph and scheduling optimization becomes convenient. Preferably, the generation method of the invention can also produce subsequent IRs, so that the algorithm is converted and described by IRs of different granularities and forms; a compiler based on the invention can thus be conveniently adapted to various front-end frameworks and back-end hardware implementations, and can optimize instructions efficiently and accurately.
Description
Technical Field
The invention relates to the field of deep learning, in particular to an intermediate representation generation method and device for neural network calculation.
Background
Neural networks have recently become a research hotspot in the field of image recognition. Trained neural network models can be used in many fields such as image classification, object recognition and saliency detection. In recent years the computation scale and complexity of neural network models have kept growing, and traditional CPU platforms can no longer meet practical requirements. Designing neural network accelerators on heterogeneous computing platforms such as FPGAs and GPUs has therefore become a new research hotspot. Compared with a GPU platform, FPGAs and ASICs can achieve a higher computational energy-efficiency ratio, and their flexibility and customizability are better suited to the rapid evolution of neural network algorithms.
The workflow of a compiler is typically composed of a number of different task phases, so a compiler can generally be divided into three parts: front end, optimizer and back end. In order to pass information between the different task phases, the compiler needs to capture full knowledge of the target program. Thus, almost all compilers require some form of intermediate representation to model the target algorithm, thereby facilitating its analysis, transformation and optimization.
For neural network compilation, the neural network algorithms coming from different deep learning frameworks are converted into a general computation graph, the computation graph is optimized and reconstructed, and the optimized computation graph is then mapped into executable instructions and machine code of a hardware platform, completing the compilation of the algorithm for that hardware platform. The deep learning frameworks differ greatly in the underlying computation libraries, computation graph forms and code styles they use, which leads to large differences in both the accuracy and the speed of the computation results; in addition to general-purpose processors, more and more heterogeneous hardware platforms keep emerging. If M front-end deep learning frameworks each had to be optimized and mapped to N back-end hardware platforms separately, a workload of O(M×N) would be faced, with a risk of combinatorial explosion.
For this reason, an intermediate representation generation scheme that can flexibly accommodate various front ends and back ends is required.
Disclosure of Invention
In order to solve at least one of the problems described above, the present invention proposes a compiler architecture scheme that can cope with various deep learning frameworks and back-end hardware platforms with extremely high scalability and compatibility, and that provides efficient code optimization capability through the cooperation of its modules with several intermediate representations of different granularities and properties.
According to one aspect of the present invention, there is provided an intermediate representation generation method for neural network computation, including: parsing an input model file to obtain topology information of the neural network; and using the feature map information and the computing operation information in the topology information as nodes and edges, respectively, to generate a first intermediate representation in the form of a graph. By introducing an IR in which feature maps are nodes and computing operations are edges, subsequent graph and scheduling optimization becomes convenient.
The first intermediate representation further includes node attributes and edge attributes. The node attributes include at least one of: dimension information and length-width-channel information of the feature map. The computing operation represented by an edge includes at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm), scaling. The edge attributes include the parameters of the computing operation and include at least one of: convolution kernel size, padding (pad), stride, grouping, dilation.
The method further comprises: performing graph optimization on the first intermediate representation to generate a second intermediate representation in the form of a graph. Specifically, this may include merging the computing operations to obtain a second intermediate representation in the form of a hypergraph in which feature maps are nodes and the merged computing operations are edges.
Merging the computing operations may include at least one of: removing operations that are not needed or have no influence on the computation result; fusing a plurality of adjacent computing operations; and decomposing a computing operation so that the decomposed computing operations can be fused with preceding or following computing operations, or so that the decomposed computing operations become processable.
Merging the computing operations may include: setting sub-graph templates in which computing operations can be merged, obtaining at least one sub-graph matching scheme of the computation graph of the first intermediate representation, and reconstructing the computation graph into the second intermediate representation with merged computing operations based on the sub-graph matching scheme. The sub-graph templates may be determined based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed.
Merging the computing operations may further include: when multiple computing-operation merging modes exist, adding, between the input node and the output node corresponding to each merging mode of the first intermediate representation, an edge corresponding to the execution cost of that merging mode, and solving for the optimal merging scheme as a shortest-path problem between the nodes.
The second intermediate representation may be represented by a domain-specific language (DSL) designed on the basis of the Scheme language.
Thus, by introducing the second intermediate representation, graph optimization of the first IR is carried out.
The intermediate representation generation method of the present invention may further include: performing scheduling optimization on the second intermediate representation to obtain a fine-grained third intermediate representation. Specifically, the second intermediate representation is schedule-optimized based on attribute information of the hardware platform that will execute the instruction code compiled from the intermediate representation, to obtain a third intermediate representation indicating a block execution scheme for feature maps and/or weights; preferably, a third intermediate representation indicating the instruction dependencies between the execution instructions of the block execution scheme for the feature maps and/or weights is obtained based on the attribute information of the hardware platform.
The third intermediate representation is represented by a language that writes each computing operation as multiple loops.
The method may further comprise: the third intermediate representation is compiled into instruction code for execution on a hardware platform. Thus, code optimization based on hardware attributes is facilitated.
The hardware platform may include at least one of: a neural-network-specific computing platform implemented on an FPGA or ASIC; a neural-network-specific computing platform implemented on a GPU; and a general-purpose computing platform.
According to another aspect of the present invention, there is provided an intermediate representation generating apparatus for neural network computation, comprising: a parsing unit for parsing an input model file to obtain topology information of the neural network; and a first intermediate representation generating unit for generating a first intermediate representation in the form of a graph, using the feature map information and the computing operation information in the topology information as nodes and edges, respectively.
The first intermediate representation further includes node attributes and edge attributes. The node attributes include at least one of: dimension information and length-width-channel information of the feature map. The computing operation represented by an edge includes at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm), scaling. The edge attributes include the parameters of the computing operation and include at least one of: convolution kernel size, padding (pad), stride, grouping, dilation.
The apparatus may further include: a second intermediate representation generating unit for performing graph optimization on the first intermediate representation to generate a second intermediate representation in the form of a graph.
The second intermediate representation generating unit may further comprise: a computing operation merging unit for merging the computing operations to obtain a second intermediate representation in the form of a hypergraph in which feature maps are nodes and the merged computing operations are edges.
The computing operation merging unit is used to perform at least one of the following operations: removing operations that are not needed or have no influence on the computation result; fusing a plurality of adjacent computing operations; and decomposing a computing operation so that the decomposed computing operations can be fused with preceding or following computing operations, or so that the decomposed computing operations become processable.
The computing operation merging unit is further configured to: set sub-graph templates in which computing operations can be merged, obtain at least one sub-graph matching scheme of the computation graph of the first intermediate representation, and reconstruct the computation graph into the second intermediate representation with merged computing operations based on the sub-graph matching scheme. The sub-graph templates are determined based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed.
The computing operation merging unit is further configured to: when multiple computing-operation merging modes exist, add, between the input node and the output node corresponding to each merging mode of the first intermediate representation, an edge corresponding to the execution cost of that merging mode, and solve for the optimal merging scheme as a shortest-path problem between the nodes.
The second intermediate representation is represented by a domain-specific language (DSL) designed on the basis of the Scheme language.
The apparatus may further include: a third intermediate representation generating unit for performing scheduling optimization on the second intermediate representation to obtain a fine-grained third intermediate representation. The third intermediate representation generating unit may be used to: perform scheduling optimization on the second intermediate representation based on attribute information of the hardware platform that will execute the instruction code compiled from the intermediate representation, to obtain a third intermediate representation indicating a block execution scheme for feature maps and/or weights. The third intermediate representation generating unit may further be configured to: obtain, based on the attribute information of the hardware platform, a third intermediate representation indicating the instruction dependencies between the execution instructions of the block execution scheme for the feature maps and/or weights.
The third intermediate representation is represented by a language that writes each computing operation as multiple loops.
The apparatus may further include: a compiling unit for compiling the third intermediate representation into instruction code for execution on a hardware platform.
According to yet another aspect of the present invention, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon that, when executed by the processor, causes the processor to perform the method of any of the above claims.
According to one aspect of the present invention there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method of any of the above.
The first IR can be decoupled from the deep learning framework, and its particular structure, in which nodes represent feature maps and edges represent computing operations, facilitates subsequent memory optimization. The second IR takes the form of a hypergraph, and the efficiency, accuracy and hardware specificity of graph optimization can be greatly improved through the introduction of sub-graph templates and cost-function edges. The third IR is preferably expressed in a multiple-loop language, which greatly improves scheduling-optimization efficiency and fully takes the characteristics of the back-end hardware into account. By converting and describing the algorithm with IRs of different granularities and forms, a compiler based on the method and device can be conveniently adapted to various front-end frameworks and back-end hardware implementations, and can optimize instructions efficiently and accurately.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
Fig. 1 shows a series of sequentially running layers that make up a typical CNN.
Fig. 2 shows a compilation schematic of an existing neural network compiler.
Fig. 3A-3B illustrate typical network computational graph structures of existing CNN networks.
FIG. 4 shows a flow diagram of an intermediate representation generation method according to one embodiment of the invention.
Fig. 5 shows an example of a conventional computation graph converted into the first IR of the present invention.
Fig. 6 shows a flow diagram of an intermediate representation generating method according to another embodiment of the invention.
FIG. 7 illustrates the conventional computation graph of FIG. 5 and a representation of the computing-operation merging modes in the first IR of the invention.
Fig. 8 shows a schematic diagram of an intermediate representation generating apparatus according to an embodiment of the invention.
Fig. 9 shows a schematic diagram of an intermediate representation generating apparatus according to another embodiment of the invention.
FIG. 10 shows a schematic diagram of a compiler architecture according to one embodiment of the invention.
FIG. 11 illustrates a schematic diagram of a computing device that may be used to implement the above-described intermediate representation generation method according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Artificial intelligence has developed rapidly in recent years and has achieved good results in fields such as image classification, detection, and video and speech processing, and it still has great prospects. Neural networks are the core of artificial intelligence applications, and deep learning neural network algorithms are among the most common. The workload of neural networks is both compute-intensive and data-intensive. The multiply-add operations required for neural network computation are typically on the order of giga-operations; for example, the target-detection neural network SSD requires 120 G operations. The parameters required for computation are typically on the order of megabytes to hundreds of megabytes; for example, the classification neural network VGG uses about 480 MB.
Common artificial neural networks (ANNs) include deep neural networks (DNNs), recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The CNN, one kind of artificial neural network, has become a research hotspot in speech analysis and image recognition. Its weight-sharing network structure makes it more similar to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is more obvious when the input of the network is a multi-dimensional image, so that the image can be used directly as the input of the network, avoiding the complex feature extraction and data reconstruction processes of traditional recognition algorithms. A convolutional network is a multi-layer perceptron specifically designed to recognize two-dimensional shapes, and this network structure is highly invariant to translation, scaling, tilting and other forms of deformation. Some background on convolutional neural networks is described below with reference to the accompanying drawings.
CNN basic concept
As shown in fig. 1, a typical CNN consists of a series of layers that run in order.
The parameters of a CNN model are called "weights". The first layer of a CNN reads the input image and outputs a series of feature maps. A lower layer reads the feature maps generated by the previous layer and outputs new feature maps. Finally a classifier outputs the probability that the input image belongs to each class. The CONV layer (convolutional layer) and the FC layer (fully-connected layer) are the two basic layer types in a CNN. A CONV layer is usually followed by a pooling layer.
In the present application, for one CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output feature map.
For the CONV layer, n_in and n_out denote the number of input and output feature maps, respectively.
For the FC layer, n_in and n_out denote the length of the input and output feature vectors, respectively.
Definition of CONV layer (Convolutional layers, convolution layer): the CONV layer takes a series of feature maps as input and convolves with a convolution kernel to obtain an output feature map.
A nonlinear layer, i.e., a nonlinear excitation function, is typically connected to the CONV layer and applied to each element of the output feature map. The excitation function used is typically the ReLU function, so this layer is also commonly referred to as the ReLU layer.
The CONV layer can be represented by expression 1:

f_i^out = Σ_{j=1..n_in} f_j^in ⊗ g_{i,j} + b_i    (1)

where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map.

Definition of the FC layer (Fully-Connected layer): the FC layer applies a linear transformation to the input feature vector:

f^out = W f^in + b    (2)

W is an n_out × n_in transformation matrix and b is the bias term. Notably, for the FC layer the input is not a combination of several two-dimensional feature maps but a single feature vector. Therefore, in expression 2 the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layer: usually connected to the CONV layer, it outputs the maximum or average value of each sub-area of each feature map. Max pooling can be represented by expression 3:

f_i^out(x, y) = max_{0 ≤ Δx, Δy < p} f_i^in(x·p + Δx, y·p + Δy)    (3)

where p is the size of the pooling kernel. This nonlinear "downsampling" not only reduces the feature map size and the computation of the next layer, but also provides a degree of translation invariance. In forward inference, a CNN can be used for image classification.
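For concreteness, expressions (1)-(3) can be sketched in NumPy as follows; this is a purely illustrative sketch (unit stride, no padding, non-overlapping pooling windows), not the implementation of the invention:

```python
import numpy as np

def conv_layer(f_in, g, b):
    """Expression (1): f_out[i] = sum_j f_in[j] (x) g[i, j] + b[i] (valid conv, stride 1)."""
    n_in, H, W = f_in.shape
    n_out, _, k, _ = g.shape
    out = np.zeros((n_out, H - k + 1, W - k + 1))
    for i in range(n_out):
        for j in range(n_in):
            for y in range(out.shape[1]):
                for x in range(out.shape[2]):
                    out[i, y, x] += np.sum(f_in[j, y:y + k, x:x + k] * g[i, j])
        out[i] += b[i]
    return out

def fc_layer(f_in, W, b):
    """Expression (2): f_out = W @ f_in + b, with W of shape (n_out, n_in)."""
    return W @ f_in + b

def max_pool(f_in, p):
    """Expression (3): maximum over each non-overlapping p x p sub-area."""
    n, H, W = f_in.shape
    out = f_in[:, :H - H % p, :W - W % p].reshape(n, H // p, p, W // p, p)
    return out.max(axis=(2, 4))

fmap = np.random.rand(3, 8, 8)
print(conv_layer(fmap, np.random.rand(4, 3, 3, 3), np.zeros(4)).shape)  # (4, 6, 6)
print(max_pool(fmap, 2).shape)                                          # (3, 4, 4)
```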
Deep learning frameworks
The deep learning framework provides building blocks for the design, training and verification of neural networks through high-level programming interfaces. In other words, the deep learning framework provides a way to implement a specific neural network algorithm (e.g., the neural network architecture shown in Fig. 1).
With the development of deep learning and neural network algorithms, a number of mainstream deep learning frameworks oriented to researchers and developers have emerged, such as Caffe, TensorFlow, MXNet and PyTorch. Developers can use the DSLs and APIs of these frameworks to design different computation graph models to accomplish specific tasks such as face recognition, image detection and speech recognition.
There are large differences among the deep learning frameworks, including the underlying computation libraries, computation graph forms and code styles they use, which leads to large differences in both the accuracy and the speed of the computation results.
For example, TensorFlow and MXNet each represent a neural network as a computation graph inside the framework, while Caffe does not use a graph but instead builds dependency relationships between each computing operation and its blobs. Moreover, these deep learning frameworks use IRs of different granularities; for example, TensorFlow splits a convolution into Padding, Dot and BiasAdd operation nodes plus Pad, weights and bias constant nodes, whereas in Caffe it is a single convolution node.
No matter which deep learning framework a neural network is trained with, its instructions and parameters need to be compiled by a compiler into machine code that the back-end hardware processor can execute. If M front-end deep learning frameworks were each optimized separately and mapped to N back-end hardware platforms, a workload of O(M×N) would be faced, with a risk of combinatorial explosion. The compiler therefore needs to abstract an intermediate representation (Intermediate Representation, IR) that is independent of the framework, which can represent all the information of the algorithm and is convenient for the subsequent stages of the compiler to optimize. In other words, in order to decouple from the deep learning computing framework, a computation graph structure corresponding to the neural network processor needs to be constructed: neural network algorithms from different deep learning platforms are converted into a general computation graph, the computation graph is optimized and reconstructed, and the optimized computation graph is then mapped into instructions and machine code of the hardware platform, completing the compilation of the algorithm for the hardware platform.
Compilation of neural networks
In order to deploy a deep neural network after training, a compiler is required to compile the neural network algorithm into a binary instruction stream that the computing platform can execute. Unlike applications developed in high-level languages such as C++ or Java, neural network algorithms have their own unique syntax and structure. In view of this, high-performance computing platforms dedicated to neural network computation and corresponding neural network compilers have emerged. For example, the deep neural network compiler DNNC (Deep Neural Network Compiler) can compile neural network algorithms into an optimized instruction stream for the DPU (Deep Learning Processor Unit) platform. By analyzing the topology of the neural network, the compiler constructs an internal computation graph intermediate representation (IR) containing control-flow and data-flow information, and then applies various compilation optimization and transformation techniques based on this IR, improving the computing performance of the DPU while effectively reducing the memory bandwidth and power consumption requirements of the system. Fig. 2 shows a compilation schematic of an existing neural network compiler. As shown in Fig. 2, a specialized neural network algorithm (e.g., a pruned CNN) can be fed into a neural network compiler consisting of a compiling front end, an optimizer and an instruction generator, and binary instruction code for a neural network computing platform (e.g., a DPU) is generated.
Herein, "compilation" refers to the process of generating low-level, computing platform-oriented object code from a representation of a high-level formalized method description using a compiler. Since hardware computing platform processing involves only binary instruction code, a compiler is required to convert the familiar high-level language descriptions into computer-readable low-level binary code. Unlike source program code, which is described using a high-level programming language such as C/C++, neural networks need to be represented by a specialized model that describes the neural network algorithms. The neural network algorithm includes a topology of the neural network algorithm and parameters of the neural network algorithm. In contrast, the storage space required for formal description of neural network topologies is much smaller than the storage space required for bulk neural network algorithm parameters.
A neural network compiler generally includes a compiling front end, an optimizer and an instruction generator, as shown in Fig. 2. The compiling front end analyzes the input neural network algorithm; the analysis may include parsing and reconstruction of the network topology. The optimizer optimizes the neural network algorithm and/or the intermediate representation generated by the preceding analysis; the object of this optimization is a computation graph representation equivalent to the neural network algorithm, and the optimization operations apply various semantically equivalent transformations to the computation graph so that object code can subsequently be generated more efficiently. Finally, the instruction generator generates efficient object code based on the optimized computation graph. The object code can then be fed to the back end, in particular a neural network processor as described below, to perform the corresponding neural network inference computation.
Basic concept of neural network processor
Because of the huge parameter scale and computation amount of convolutional neural networks, and the requirements on hardware platform stability and computational energy-efficiency ratio, conventional CPUs cannot meet the computing requirements of neural networks, and designing accelerators on heterogeneous computing platforms such as FPGAs, GPUs and ASICs has become a new research hotspot. Compared with a GPU platform, an FPGA can obtain higher energy efficiency thanks to its low power consumption, and its fast iteration and hardware reconfigurability can keep up with the rapid evolution of algorithms. Furthermore, an AI chip implemented as a customized ASIC, as a processor chip specially designed for deep learning, is deeply customized and optimized for deep neural networks in terms of computation speed, power consumption, cost and so on, improving further upon FPGAs and GPUs.
While the compiler architecture of the present application may be used with a general-purpose computing platform (i.e., a host- or CPU-only computing platform), it is more applicable to neural-network-specific processors that are specially designed to perform neural network computation. Those skilled in the art will appreciate that the term "neural-network-specific processor" used in the present application may also be referred to simply as a "neural network processor" or "NN processor". Since deep learning is currently one of the most popular branches of neural network technology, the neural-network-specific processor may be implemented as a deep-learning-specific processor or deep learning processor. However, it will be appreciated by those skilled in the art that neural networks have various technical branches, such as DNN and CNN (DNN is named from the perspective of depth, CNN from the perspective of convolution, and the two are not mutually exclusive), and thus the neural-network-specific processor may also be implemented as a deep-neural-network-specific processor (a DNN processor or CNN processor). That is, neural network computing techniques involving a "deep learning processor" or "deep neural network processor" in heterogeneous computing platforms are also within the scope of the present application.
A DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence; it uses the high parallelism and low power consumption of an FPGA to implement inference based on convolutional neural networks (CNNs). Herein, a DPU may be regarded as one specific implementation of the "deep learning processor", "deep neural network processor" or "neural network processor" above. The binary instruction code compiled via the compiler architecture of the present invention may be executed by a DPU implemented on an FPGA, but it should be understood by those skilled in the art that the compiler architecture of the present invention is equally extensible to various other back-end implementations, such as neural network processors that use a GPU hardware architecture for neural network inference, and ASIC chips that are deeply customized and optimized for neural network computation, e.g., dedicated AI chips.
Basic concept of network computational graph
To decouple from the deep learning computation framework, a computation graph structure corresponding to the neural network processor needs to be constructed. Neural network algorithms from different deep learning platforms are converted into a general computation graph, the computation graph is optimized and reconstructed, and the optimized computation graph is then mapped into instructions and machine code of the hardware platform, completing the compilation of the algorithm for the hardware platform. Constrained by the storage resources, bandwidth, computing resources, hardware design and instruction-set bit width of different hardware platform chips, and also by the various computing operations, dimension transformations and changing parameters of different deep learning platforms, finding the optimal way to execute a computation graph when mapping the algorithm to instructions, in other words making the instructions compiled from the algorithm execute on the hardware platform both equivalently and efficiently, is a major problem the computing platform needs to solve.
Figs. 3A-3B illustrate typical network computation graph structures of existing CNN networks. Fig. 3A shows the basic structure of a Vgg network. As shown in Fig. 3A, even this unbranched network computation graph requires repeated data transfers between DDR and the on-chip cache (e.g., implemented as BRAM, i.e., block RAM) when executing the most basic CONV (convolution), ReLU (nonlinear operation, whose excitation function is typically the ReLU function) and POOL (pooling) operations, since the feature maps it needs to load are usually larger than the capacity of the on-chip cache. Fig. 3B shows the basic structure of a ResNet network. As shown in Fig. 3B, the branched network computation graph additionally introduces an eltwise layer for element-wise addition of multiple convolutional layers and a CONCAT (concatenation) layer for concatenating the data of each input layer into a new layer along the channel dimension. Likewise, this network computation graph still requires repeated data transfers between DDR and BRAM when implemented on the back-end hardware. It should be understood that "Vgg" and "ResNet" above are popular CNN architectures listed for purposes of illustration rather than limitation of the principles of the present invention.
Intermediate representation generation scheme of the present invention
Existing compilers map computation graphs, each oriented to one framework, into hardware instructions of a general-purpose processor, e.g., CUDA for GPUs and LLVM for CPUs. In the instruction mapping process, different front-end deep learning frameworks and different back-end hardware have optimization approaches at different levels, including coarse-grained optimization oriented to the computation graph, fine-grained optimization oriented to operators, memory management optimization, and so on.
There are large differences among the deep learning frameworks, including the underlying computation libraries, computation graph forms and code styles they use, which leads to large differences in both the accuracy and the speed of the computation results. If M front-end deep learning frameworks were each optimized separately and mapped to N back-end hardware platforms, a workload of O(M×N) would be faced, with a risk of combinatorial explosion. In order to be compatible with various front ends and back ends, a neural network compiler needs to abstract an intermediate representation (IR) independent of the framework, which can represent all the information of the algorithm and is convenient for the subsequent stages of the compiler to optimize.
Google's XLA is a back-end compiler framework for TensorFlow. The NNVM and TVM compilers of the DMLC team optimize at the graph level and the operator level respectively, adopt an AOT (Ahead Of Time) compilation approach, draw on the IR and scheduling of Halide, and support most of the front-end deep learning frameworks on the market. DLVM of the University of Illinois at Urbana-Champaign adopts a more traditional compiler optimization approach, using the Swift language as its DSL representation; it has complete control flow and is a side-effect-free representation. These compiler frameworks more or less support different front-end deep learning frameworks and create their own IRs; after optimization on these IRs they are converted to LLVM IR, or mapped to different hardware instructions using libraries such as OpenCL and CUDA. However, these compiler frameworks are all oriented to general-purpose processors. For FPGA or ASIC implementations, the hardware resources and designs differ greatly, and so do the optimization approaches. By drawing on the compiler design and optimization methods of general-purpose processors, the hardware design and optimization difficulty of the FPGA can conversely be transferred to the compiler level.
A neural network algorithm is characterized by being data-driven and contains only a small amount of control logic, so it is important to design an appropriate intermediate representation with sufficient expressive power for neural networks. In the design of the intermediate representation, factors such as the nature of the algorithm, the underlying hardware and the compiler need to be fully considered.
To this end, the present invention proposes a new intermediate representation structure that, by taking feature maps as nodes and computing operations as edges, provides a graph optimization and memory optimization compilation scheme that is more efficient and better suited to subsequent operations. Further, by using IRs of different granularities and attributes, it can accommodate a variety of front-end deep learning frameworks and produce instruction code for a dedicated neural network processor (e.g., a DPU) implemented on an FPGA or ASIC, while also conveniently accommodating other hardware platforms such as CPU and GPU general-purpose processors.
FIG. 4 shows a flow diagram of an intermediate representation generation method according to one embodiment of the invention. In step S410, the input model file is parsed to acquire topology information of the neural network. In step S420, the feature map information and the calculation operation information in the topology information are used as nodes and edges, respectively, to generate a first intermediate representation in the form of a map.
Specifically, step S410 may use parsing sub-modules, each corresponding to one type of model file and used to parse that type of model file. For example, a Caffe parser and a TensorFlow parser may be used to parse model files obtained from the Caffe and TensorFlow deep learning frameworks, respectively. For other or new deep learning frameworks, a parser for that type of model can be added accordingly.
Through parsing step S410, neural network models developed on different deep learning frameworks can be parsed into a framework-independent IR, namely the first IR of the invention, thereby decoupling the deep learning frameworks from the compiler's optimization methods and uniformly converting the computation graph forms of different granularities used by the various deep learning frameworks into the fixed-granularity computation graph form of the invention (i.e., the first IR). Using the characteristics of the Python scripting language, the parsing of the model and the conversion of the IR can be implemented conveniently.
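A minimal sketch of what such per-framework parsing sub-modules could look like is given below; the registry pattern, class names and return structures are illustrative assumptions, not the implementation of the invention:

```python
# Registry mapping a model-file type to its parsing sub-module.
PARSERS = {}

def register_parser(framework):
    def wrap(cls):
        PARSERS[framework] = cls
        return cls
    return wrap

@register_parser("caffe")
class CaffeParser:
    def parse(self, model_file):
        """Parse a Caffe model into topology information (stubbed here)."""
        return {"framework": "caffe", "layers": []}

@register_parser("tensorflow")
class TensorFlowParser:
    def parse(self, model_file):
        """Parse a TensorFlow graph into topology information (stubbed here)."""
        return {"framework": "tensorflow", "nodes": []}

def parse_model(model_file, framework):
    """Dispatch to the parser for the given framework; supporting a new
    framework only requires registering a new parser sub-module."""
    return PARSERS[framework]().parse(model_file)

print(parse_model("resnet50.prototxt", "caffe"))
```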
In the present invention, the first IR is generated as a computation graph in which the nodes represent feature maps and the edges represent computing operations, and both nodes and edges carry attributes. The attributes of a node may include dimension information and/or length-width-channel information of the feature map. The computing operation represented by an edge includes at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm), scaling (Scale). The attributes of an edge include the parameters of the computing operation and include at least one of: convolution kernel size, padding (pad), stride, grouping, dilation.
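Purely for illustration, such a first IR can be sketched roughly as follows; the class and field names are assumptions for exposition, not the data structures of the invention:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class FeatureMapNode:
    """Node of the first IR: a feature map (multi-dimensional tensor)."""
    name: str
    shape: Tuple[int, ...]          # dimension information, e.g. (N, C, H, W)
    layout: str = "NCHW"            # dimension order / length-width-channel info

@dataclass
class OpEdge:
    """Edge of the first IR: a computing operation between feature maps."""
    op_type: str                    # "conv", "pool", "eltwise", "deconv", ...
    inputs: List[str]               # names of input feature-map nodes
    outputs: List[str]              # names of output feature-map nodes
    attrs: Dict[str, object] = field(default_factory=dict)  # kernel, pad, stride, ...

@dataclass
class FirstIR:
    """Directed acyclic graph: nodes are feature maps, edges are operations."""
    nodes: Dict[str, FeatureMapNode] = field(default_factory=dict)
    edges: List[OpEdge] = field(default_factory=list)

# Example: a single 3x3 convolution from "data" to "conv1"
ir = FirstIR()
ir.nodes["data"] = FeatureMapNode("data", (1, 3, 224, 224))
ir.nodes["conv1"] = FeatureMapNode("conv1", (1, 64, 224, 224))
ir.edges.append(OpEdge("conv", ["data"], ["conv1"],
                       {"kernel": (3, 3), "pad": 1, "stride": 1,
                        "group": 1, "dilation": 1}))
```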
Fig. 5 shows an example of a conventional computation graph converted into the first IR of the present invention. Compared with the conventional computation graph on the left side of the figure, the first IR shown on the right side is an IR in the form of a graph: each node represents a feature map (FeatureMap), which can also be understood as a multi-dimensional tensor, and each edge represents an operator, including but not limited to convolution, pooling, dimension transformation, element-wise addition, deconvolution, normalization, nonlinearity, and so on. This is a coarse-grained IR representation, and each edge can be regarded as a computing operation, in the form of multiple loops, performed on a multi-dimensional tensor. For example, for the finer-grained computation graph under TensorFlow, the Pad or BiasAdd adjacent to a Conv2D (two-dimensional convolution) can be fused into the edge represented by the Conv2D, and all constant nodes are fused into the attributes of the edges of the corresponding operators, thereby constructing a directed acyclic graph. In a typical deep learning framework or compiler, computing operations are the nodes of the computation graph and the edges represent dependencies between operations. The first IR of the present invention instead sets the FeatureMap as the node and records its dimension information and dimension order, which is more favorable for the compiler to perform memory optimization in subsequent steps (e.g., based on the third IR), including memory multiplexing and All-Bank optimization, i.e., when the feature maps are small enough, the intermediate results between different computing operations can all reside in on-chip storage without repeated interactions with external storage.
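The folding of Pad, BiasAdd and constant nodes into a single convolution edge described above can be sketched as follows; the dictionary layout and field names are illustrative assumptions, not the converter of the invention:

```python
def fuse_fine_grained(ops):
    """Fold a Pad / Conv2D / BiasAdd chain (TensorFlow-style granularity) into a
    single coarse 'conv' edge whose attributes carry pad and bias, as in Fig. 5.
    `ops` is a list of dicts in topological order (structure is illustrative)."""
    edges, pending_pad, pad_source = [], 0, None
    for op in ops:
        kind = op["op"]
        if kind == "Pad":
            pending_pad, pad_source = op["amount"], op["in"]   # absorbed into next conv
        elif kind == "Conv2D":
            edges.append({"op_type": "conv",
                          "in": pad_source or op["in"], "out": op["out"],
                          "attrs": {"kernel": op["kernel"],
                                    "stride": op["stride"],
                                    "pad": pending_pad}})
            pending_pad, pad_source = 0, None
        elif kind == "BiasAdd":
            edges[-1]["attrs"]["bias"] = op["bias"]   # bias becomes an edge attribute
            edges[-1]["out"] = op["out"]
        else:                                          # other ops keep their own edges
            edges.append({"op_type": kind.lower(),
                          "in": op["in"], "out": op["out"], "attrs": {}})
    return edges

# Example: Pad -> Conv2D -> BiasAdd collapses into one convolution edge
print(fuse_fine_grained([
    {"op": "Pad", "amount": 1, "in": "data", "out": "pad0"},
    {"op": "Conv2D", "kernel": 3, "stride": 1, "in": "pad0", "out": "conv0"},
    {"op": "BiasAdd", "bias": [0.1], "in": "conv0", "out": "conv0_b"},
]))
```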
Fig. 6 shows a flow diagram of an intermediate representation generating method according to another embodiment of the invention. Similar to fig. 4, fig. 6 also includes a parsing step S610 and a first IR generating step S620, but fig. 6 further includes a multi-level optimization of the IR to generate second and third IR.
In step S630, graph optimization is performed on the first intermediate representation to generate a second intermediate representation in the form of a graph. In particular, the computing operations may be merged to obtain a second intermediate representation in the form of a hypergraph in which feature maps are nodes and the merged computing operations are edges. Merging the computing operations includes at least one of: removing operations that are not needed or have no influence on the computation result; fusing a plurality of adjacent computing operations; and decomposing a computing operation so that the decomposed computing operations can be fused with preceding or following computing operations, or so that the decomposed computing operations become processable.
Specifically, a pruning operation removes operations that are not needed or have no influence on the computation result.
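A minimal sketch of such a pruning pass is shown below; the set of removable operations and the graph encoding are illustrative assumptions:

```python
# Ops that can be dropped at inference time without changing the result
# (the exact set here is an illustrative assumption).
REMOVABLE_OPS = {"dropout", "identity"}

def prune_ops(edges):
    """Remove edges whose operation has no effect on the computation result and
    rewire consumers of the removed output to the bypassed input feature map."""
    alias, kept = {}, []
    for e in edges:
        if e["op"] in REMOVABLE_OPS and len(e["in"]) == 1 and len(e["out"]) == 1:
            alias[e["out"][0]] = e["in"][0]      # the output is just its input
        else:
            kept.append(e)

    def resolve(name):                            # follow chains of removed ops
        while name in alias:
            name = alias[name]
        return name

    for e in kept:
        e["in"] = [resolve(n) for n in e["in"]]
    return kept

# Example: conv -> dropout -> pool becomes conv -> pool
graph = [{"op": "conv",    "in": ["data"], "out": ["c1"]},
         {"op": "dropout", "in": ["c1"],   "out": ["d1"]},
         {"op": "pool",    "in": ["d1"],   "out": ["p1"]}]
print(prune_ops(graph))
```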
A fusion operation fuses a plurality of adjacent computing operations. For example, a fusion operation may fuse a computing operation into the memory-access operations and/or computing operations before and after it. One case is that a dimension transformation can be fused into the load or store process before or after the operation. Another case is that an operation is absorbed into the preceding computing operation through an operational transformation. For example, the FLATTEN layer "flattens" the feature map of a single convolution, while in branched networks (e.g., GoogLeNet) the outputs of multiple upper layers are connected to the input of a CONCAT; the CONCAT layer concatenates the data of each input layer into a new layer along the channel dimension and outputs it to the next layer. The operations of the FLATTEN and CONCAT layers are special data rearrangement and dimension transformation operations, and they can be omitted by specific arrangements of the way data is stored and/or read. In addition, BatchNorm (BN) and Scale can merge their operations and parameters directly into the preceding convolutional layer. The fusion operation can also reduce the number of data interactions between the hardware platform and external memory when the instruction code is executed. The fusion may be point-wise (i.e., fusion applied to a single computation result). For example, for activation layers such as ReLU and Leaky ReLU after a convolution: without fusion, the convolution results, computed from a large number of input parameters, would first all be stored to DDR, and then a ReLU kernel would be launched to compute all the ReLU results; point-wise fusion saves the kernel launch time. Besides point-wise fusion, non-point-wise operations such as Conv+Pool and Conv+Eltwise can also be fused as far as possible through the design of heuristic algorithms, so that the on-chip/off-chip data interaction can be omitted, the bandwidth pressure is reduced, the Pool or Eltwise computation can be hidden in the process of loading or storing data, and the computation efficiency is improved. In one embodiment, to complete CONV, ReLU and POOL operations on an input feature map, operation fusion allows the ReLU operation to be performed directly on the data stored in the on-chip cache (e.g., BRAM) after the CONV operation, and the POOL operation to be performed directly after the necessary on-chip caching; compared with storing the data off-chip and reading it back after each of the CONV, ReLU and POOL operations, the overall processing efficiency can be greatly improved.
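The merging of BatchNorm and Scale into the preceding convolution mentioned above amounts to rewriting the convolution weights and bias; a NumPy sketch of this standard inference-time folding (assuming per-output-channel parameters) is:

```python
import numpy as np

def fold_bn_scale_into_conv(W, b, mean, var, gamma, beta, eps=1e-5):
    """Merge y = gamma * (conv(x) + b - mean) / sqrt(var + eps) + beta into new
    convolution weights/bias, so BN and Scale disappear as separate operations."""
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    W_folded = W * scale[:, None, None, None]   # W has shape (n_out, n_in, k, k)
    b_folded = (b - mean) * scale + beta
    return W_folded, b_folded

# Example with 8 output channels
W, b = np.random.randn(8, 3, 3, 3), np.zeros(8)
Wf, bf = fold_bn_scale_into_conv(W, b, mean=np.zeros(8), var=np.ones(8),
                                 gamma=np.ones(8), beta=np.zeros(8))
```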
A decomposition operation decomposes a computing operation so that the decomposed computing operations can be merged with preceding or following computing operations, or so that the decomposed computing operations become processable. In other words, one purpose of decomposition is to enable more effective fusion. An operation may be decomposed into branches so that a single branch can be fused. For example, a pooling operation after concatenation (concat) is equivalent to concatenation after pooling, so the pooling operation can be decomposed and merged into the convolution operations before the concatenation. On the other hand, for some computing operations that could not otherwise be processed, decomposition makes them processable. For example, a grouped convolution (convolution with group) can be decomposed into multiple convolutions that can be processed.
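The decomposition of a grouped convolution into several ordinary convolutions can likewise be sketched as slicing the weights and channels per group; the layouts below are assumed for illustration:

```python
import numpy as np

def split_grouped_conv(W, b, groups):
    """Decompose a grouped convolution into `groups` ordinary convolutions.
    W has shape (n_out, n_in // groups, k, k); each sub-convolution reads only
    its own slice of input channels and writes its own slice of output channels."""
    per_group = W.shape[0] // groups
    subconvs = []
    for g in range(groups):
        subconvs.append({
            "weights": W[g * per_group:(g + 1) * per_group],
            "bias": b[g * per_group:(g + 1) * per_group],
            "input_channel_group": g,   # which slice of the input feature map to read
        })
    return subconvs

# A 2-group convolution with 8 output channels splits into two 4-channel convolutions
print(len(split_grouped_conv(np.random.randn(8, 2, 3, 3), np.zeros(8), groups=2)))
```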
When the above optimization of the first IR is subsequently performed, some computing-operation merging schemes will inevitably achieve better computational efficiency than others. Such merging is limited by the storage resources, bandwidth and computing resources of the hardware platform and is not always preferable to executing the operations separately; in other words, a general rule is needed to determine whether computing operations should be merged.
In view of this, the operation merging described above can be achieved more efficiently by introducing sub-graph templates. Merging the computing operations may then include: setting sub-graph templates in which computing operations can be merged, obtaining at least one sub-graph matching scheme of the computation graph of the first intermediate representation, and reconstructing the computation graph into the second intermediate representation with merged computing operations based on the sub-graph matching scheme. Preferably, the sub-graph templates are determined based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed. Specifically, fusible sub-graph templates can be defined for sub-graph isomorphism matching, all possible fusion schemes are then found in the computation graph, and the execution time of the various fusion modes is obtained through a fitted cost formula or a simulator, so as to determine which fusion mode is preferable.
When determining the overall computing-operation merging scheme, the determination of the optimal scheme can be facilitated by introducing cost edges. Merging the computing operations may then include: when multiple merging modes exist, adding, between the input node and the output node corresponding to each merging mode of the first intermediate representation, an edge corresponding to the execution cost of that merging mode, and solving for the optimal merging scheme as a shortest-path problem between the nodes. In other words, in one embodiment, the optimal fusion mode can be determined by introducing cost edges into the computation graph of the first IR. For example, an edge can be added between the input and output FeatureMap nodes of each potential merging mode, and the merged cost and the unmerged cost are each calculated from a fitted model; the cost mainly comprises information such as computation time and power consumption and can be one of the attributes of the edge. The coarse-grained compilation optimization problem is then converted into a graph optimization problem of finding the shortest path of the entire neural network from input to output.
Fig. 7 illustrates the conventional computation graph of Fig. 5 and a representation of the computing-operation merging modes in the first IR of the invention. As shown on the left side of Fig. 7, the dashed boxes 1-5 correspond to representations of five potential merging modes in the conventional computation graph. After conversion to the first-IR computation graph of the invention shown on the right side of Fig. 7, these merging modes become the corresponding five links 1-5. Thus, the optimal merging scheme can be selected by finding the shortest path from input to output.
It should be understood that the sub-graph templates and cost edges described above are preferably used in combination, for example by first using the sub-graph templates to determine the candidate merging modes and then introducing cost edges to simplify the computation of the optimal solution.
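The reduction of the merging decision to a shortest-path problem can be sketched as follows; this is a plain Dijkstra search, and the cost numbers are placeholders rather than values produced by the fitted cost model:

```python
import heapq

def shortest_path_cost(edges, source, target):
    """edges: list of (from_node, to_node, cost); each candidate merging mode and
    each unmerged alternative contributes one edge between its input and output
    feature-map nodes. Returns the minimum total cost and the chosen path."""
    graph = {}
    for u, v, c in edges:
        graph.setdefault(u, []).append((v, c))
    best = {source: 0.0}
    queue = [(0.0, source, [source])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        for nxt, c in graph.get(node, []):
            nc = cost + c
            if nxt not in best or nc < best[nxt]:
                best[nxt] = nc
                heapq.heappush(queue, (nc, nxt, path + [nxt]))
    return float("inf"), []

# Toy example: a fused Conv+Pool edge competes with executing the ops separately
edges = [("data", "conv_out", 3.0), ("conv_out", "pool_out", 2.0),  # unmerged path
         ("data", "pool_out", 4.0)]                                 # fused edge
print(shortest_path_cost(edges, "data", "pool_out"))  # picks the fused edge (4.0 < 5.0)
```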
After the above optimization, the first IR of the present invention becomes the second IR. The second IR is a hypergraph in which nodes represent feature maps and an edge represents a combination of multiple computing operations. These combined computing operations result from the preceding coarse-grained optimization of the first IR, so the second IR can also be understood as a coarser-grained IR. The merged computing operations come from optimizations including All-Bank, lateral and longitudinal merging of computing operations, hardware-related heuristic merging, merging of dimension transformations, merging of equivalent transformations such as Conv+BN+Scale, and so on.
In one embodiment, the second IR is represented by a domain-specific language (DSL) designed on the basis of the Scheme language. For example, the DSL used to represent this level of IR can be designed with the Lisp dialect Scheme, so that the extreme simplicity of Scheme can be leveraged and the IR representation and parsing can be extended with its various powerful language tools (syntactic sugar).
Using a DSL for the hypergraph IR representation greatly facilitates the subsequent scheduling optimization based on the third IR. First, it is not limited by a specific network structure, computing parameters or computing scale. Second, the DSL can be used to test the hardware platform. Third, with such a DSL, test tasks can be generated directly, without needing a corresponding real network structure, to perform the cost measurements used to fit the cost function in the graph optimization process.
The optimization at the computation graph level is largely hardware-independent. One hardware-related aspect is the setting of the sub-graph templates, which can be adjusted according to the design of the underlying hardware computing modules and changes in the instruction mapping strategy; the other is that, when measuring the cost incurred by a fusion mode, the fusion cost changes with actual measurements or with changes in the fitting method.
In step S640, the second intermediate representation is schedule-optimized to obtain a fine-grained third intermediate representation.
The second IR is still a coarse-grained hypergraph-form IR, whose computation process consists of multi-loop computing operations performed on multi-dimensional tensors.
Multiple-loop operations, whether for general-purpose processors or dedicated accelerators, leave considerable room for optimization in terms of computation speed, memory usage and so on. When splitting a multi-loop computing operation, various constraints must be met, imposed by the on-chip storage resources, computing resources and bandwidth of the different hardware platforms carrying the DPU, the bit widths of different DPU instruction sets, and the different neural network topologies, computing parameters and parameter scales; for example, when mapping to LOAD instructions, the amount of data loaded cannot exceed the size of the on-chip buffer allocated to it. After these constraints are satisfied, dependency issues between instructions need to be resolved. The dependencies take several forms, for example: all data required for a computation must be loaded on-chip before the computing operation is performed; and when writing data to an on-chip cache address, the data originally at that address must not still be depended on by instructions that have not yet been executed.
Thus, the optimization toward the third IR in step S640 is a scheduling optimization based on the hardware platform, which determines a block execution scheme for the feature maps and/or weights and further determines the instruction dependencies between the execution instructions. Preferably, step S640 includes performing scheduling optimization on the second intermediate representation based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed, so as to obtain a third intermediate representation indicating the block execution scheme of the feature maps and/or weights, and further obtaining, based on the attribute information of the hardware platform, a third intermediate representation indicating the instruction dependencies between the execution instructions of that block execution scheme.
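As a concrete illustration of one such hardware-dependent decision, the sketch below picks the largest feature-map tile whose LOAD still fits an assumed on-chip buffer. The buffer size, bytes-per-element, and the brute-force search are illustrative assumptions, not the parameters or algorithm of an actual DPU.

```python
# Simplified sketch of a tiling decision made during scheduling optimization:
# choose the largest feature-map tile whose LOAD fits the on-chip buffer,
# since larger tiles mean fewer LOAD/SAVE round trips to DDR.
def choose_tile(height, width, channels, bytes_per_elem=1,
                on_chip_buffer_bytes=512 * 1024):
    """Return the (tile_h, tile_w) that fits the buffer and covers the most pixels."""
    best = None
    for tile_h in range(1, height + 1):
        for tile_w in range(1, width + 1):
            load_bytes = tile_h * tile_w * channels * bytes_per_elem
            if load_bytes > on_chip_buffer_bytes:
                continue  # a LOAD of this tile would overflow the buffer
            if best is None or tile_h * tile_w > best[0] * best[1]:
                best = (tile_h, tile_w)
    return best

if __name__ == "__main__":
    # a 56x56x256 int8 feature map does not fit in 512 KiB all at once,
    # so a smaller tile must be chosen
    print(choose_tile(height=56, width=56, channels=256))
```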
Preferably, the third IR is represented by a language in which each computing operation is written as multiple loops. With an IR representation similar to Halide, each computing operation can be written as a multiple loop, so that during scheduling optimization the blocking that lets the multiple loop occupy the least memory and compute most efficiently can be determined. The blocking must fully account for the influence of on-chip computing resources, storage resources, and bandwidth, and is determined in combination with the parameters of the specific neural network structure. In a preferred embodiment, the third IR is used to generate automated scheduling policies that achieve the same execution efficiency as hand-written instructions.
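To make the multiple-loop view concrete, the sketch below writes a small 2-D convolution as explicit nested loops with the output rows blocked into tiles, which is the kind of re-blocking a Halide-like third IR exposes to the scheduler. The shapes, the tile size, and the pure-Python formulation are illustrative assumptions.

```python
# Sketch of the third-IR idea: a computing operation written as explicit
# multiple loops, with the output rows blocked so each tile can be scheduled
# (LOADed, computed, SAVEd) as one unit. Plain Python, small shapes only.
def conv2d_tiled(inp, weight, tile_h=4):
    """inp: HxW list of lists, weight: KxK list of lists, stride 1, no padding."""
    H, W = len(inp), len(inp[0])
    K = len(weight)
    OH, OW = H - K + 1, W - K + 1
    out = [[0.0] * OW for _ in range(OH)]
    # the outer loop over row tiles corresponds to one block of the schedule
    for oh0 in range(0, OH, tile_h):
        for oh in range(oh0, min(oh0 + tile_h, OH)):
            for ow in range(OW):
                acc = 0.0
                for kh in range(K):
                    for kw in range(K):
                        acc += inp[oh + kh][ow + kw] * weight[kh][kw]
                out[oh][ow] = acc
    return out

if __name__ == "__main__":
    image = [[float(r + c) for c in range(8)] for r in range(8)]
    kernel = [[1.0 / 9] * 3 for _ in range(3)]
    print(conv2d_tiled(image, kernel)[0][:3])  # first three outputs of row 0
```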
Subsequently, in step S650, the third intermediate representation is compiled into instruction code for execution on the hardware platform. For example, the above-described block execution scheme and/or instruction dependencies may be mapped, by encoding the third IR, to the instruction code of the hardware platform.
The instruction code (also referred to as the fourth IR in the present invention) consists of specific instructions for the hardware platform and represents the mapping of the IR to specific instructions under the optimal blocking found during scheduling optimization. The back end of the invention is designed mainly for FPGA and ASIC implementations of the DPU and for embedded CPUs such as ARM. In one embodiment, the hardware platforms to which the compiler architecture of the present invention is applicable include at least one of: a neural-network-specific computing platform implemented based on an FPGA or ASIC; a neural-network-specific computing platform implemented based on a GPU; and a general-purpose computing platform, such as an embedded CPU computing platform.
The instruction set used by the DPU may be a coarse-grained instruction set, yet it is the finest-grained IR compared with the previous three. The instructions can act directly on various storage locations, including DDR, on-chip RAM, and registers, and can be broadly classified into memory management, computation control, logic control, and so on. For example, the LOAD instruction is responsible for moving data from DDR to on-chip memory; the SAVE instruction stores data from on-chip memory back to DDR; the CONV instruction is responsible for sending on-chip data to the convolution computing units for computation and then performing some point-wise computations such as bias addition and ReLU, thereby greatly reducing the bandwidth pressure between the chip and DDR; the END instruction is responsible for sending an end signal to a certain register, indicating that the DPU has finished its work and that subsequent computing operations can continue, and so on. In this way, the strategy obtained from scheduling optimization can be quickly mapped to the instructions of the DPU instruction set and translated into machine code that can be executed directly on the hardware, completing the entire compilation flow.
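The sketch below emits a toy instruction stream using only the coarse-grained categories named above (LOAD, CONV, SAVE, END). The operand fields, the per-tile loop, and the post-op list are illustrative assumptions; a real DPU instruction stream additionally encodes addresses, bit widths, and explicit dependencies.

```python
# Sketch of mapping one blocked convolution onto a coarse-grained instruction
# stream: per tile, LOAD inputs on-chip, run CONV with fused point-wise ops,
# SAVE the result to DDR, and finish the program with END.
def emit_block_instructions(num_tiles):
    program = []
    for t in range(num_tiles):
        program.append(("LOAD", {"tile": t, "src": "DDR", "dst": "on_chip"}))
        program.append(("CONV", {"tile": t, "post_ops": ["bias", "relu"]}))
        program.append(("SAVE", {"tile": t, "src": "on_chip", "dst": "DDR"}))
    program.append(("END", {}))  # signal that the DPU has finished this block
    return program

if __name__ == "__main__":
    for opcode, operands in emit_block_instructions(num_tiles=2):
        print(opcode, operands)
```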
The intermediate representation generation method according to the invention has been described above in connection with figs. 4-7. The coarse-grained first and second IRs achieve graph optimization that is independent (or almost independent) of the back-end hardware, after which the fine-grained third IR is introduced to achieve hardware-oriented optimization and to facilitate compilation into the final instruction code. The first IR is used to decouple from the various types of deep learning platforms, the second IR is used to characterize the results of the coarse-grained graph optimization, the third IR is used for fine-grained optimization targeted at the hardware, and the instruction code (which may also be referred to as a fourth IR) is used for final execution on the hardware platform.
In other embodiments, the solution of the present invention may also be implemented as an intermediate representation generating apparatus for neural network computation. Fig. 8 shows a schematic diagram of an intermediate representation generating apparatus according to an embodiment of the invention.
As shown in fig. 8, the intermediate representation generating apparatus 800 includes a parsing unit 810 and a first intermediate representation generating unit 820. The parsing unit 810 is configured to parse the input model file to obtain topology information of the neural network. The first intermediate representation generating unit 820 is configured to use the feature map information and the computing operation information in the topology information as nodes and edges, respectively, to generate a first intermediate representation in the form of a graph.
Preferably, the first intermediate representation further comprises node attributes and edge attributes. The node attributes comprise at least one of: dimension information and height-width-channel information of the feature map. The computing operations represented by the edges include at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (batchnorm), and scaling. The edge attributes include the parameters of the computing operation and include at least one of: convolution kernel size, padding (pad), stride, grouping, and dilation.
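A minimal sketch of such a first IR as a plain data structure follows, with feature maps as attribute-carrying nodes and computing operations as parameterized edges. The class name, field names, and the example layer are illustrative assumptions, not the internal representation actually used by the compiler described here.

```python
# Minimal sketch of a first IR: feature maps as nodes (carrying dimension and
# height-width-channel attributes), computing operations as edges (carrying
# their parameters such as kernel size, padding, stride, grouping, dilation).
class FirstIR:
    def __init__(self):
        self.nodes = {}   # name -> feature-map attributes
        self.edges = []   # one entry per computing operation

    def add_feature_map(self, name, n, c, h, w):
        self.nodes[name] = {"dims": 4, "shape": (n, c, h, w)}

    def add_op(self, src, dst, op, **params):
        self.edges.append({"src": src, "dst": dst, "op": op, "params": params})

if __name__ == "__main__":
    ir = FirstIR()
    ir.add_feature_map("data", 1, 3, 224, 224)
    ir.add_feature_map("conv1", 1, 64, 112, 112)
    ir.add_op("data", "conv1", "convolution",
              kernel=7, pad=3, stride=2, group=1, dilation=1)
    print(ir.nodes["conv1"], ir.edges[0]["op"])
```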
In a preferred embodiment, the intermediate representation generating apparatus may further comprise generating units for the subsequent IRs. Fig. 9 shows a schematic diagram of an intermediate representation generating apparatus according to another embodiment of the invention. In addition to the parsing unit 910 and the first intermediate representation generating unit 920, the generating apparatus 900 of fig. 9 may further comprise second, third, and fourth IR generating units.
In particular, the second intermediate representation generating unit 930 may be configured to perform graph optimization on the first intermediate representation to generate a second intermediate representation in the form of a graph. The second intermediate representation generating unit 930 may further include a computing operation merging unit for merging computing operations to obtain a second intermediate representation in hypergraph form, in which feature maps serve as nodes and the merged computing operations serve as edges.
In one embodiment, the computing operation merging unit may be configured to perform at least one of: merging a computing operation into a preceding or following memory-access operation and/or computing operation; fusing a plurality of adjacent computing operations to reduce the number of data interactions between the hardware platform and the external memory when the instruction code is executed; and decomposing a computing operation so as to fuse the decomposed computing operations with preceding or following computing operations.
In one embodiment, the computing operation merging unit may be further configured to: set sub-graph templates within which computing operations can be merged, obtain at least one sub-graph matching scheme of the computational graph of the first intermediate representation, and reconstruct the computational graph into the second intermediate representation with merged computing operations based on the sub-graph matching scheme. The sub-graph template may be determined based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed.
In one embodiment, the computing operation merging unit may be further configured to: when a plurality of computing operation merging modes exist, add, between the input node and the output node corresponding to each computing operation merging mode of the first intermediate representation, an edge corresponding to the execution cost of that merging mode, and solve for the optimal computing operation merging scheme as a shortest-path problem between the nodes.
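A small sketch of this cost-edge formulation follows: each candidate merging mode contributes an edge, weighted by its execution cost, between the corresponding input and output nodes, and the cheapest overall plan is found as a shortest path. The cost values and labels below are made-up numbers for illustration, and the Dijkstra-style search is one plausible way to solve the resulting shortest-path problem.

```python
# Choosing among merging schemes as a shortest-path problem: candidate fusion
# edges carry (measured or fitted) execution costs; the cheapest plan from the
# graph input to the graph output is the optimal merging scheme.
import heapq

def shortest_path(edges, source, target):
    """edges: list of (u, v, cost, label). Returns (total cost, labels used)."""
    adjacency = {}
    for u, v, cost, label in edges:
        adjacency.setdefault(u, []).append((v, cost, label))
    frontier = [(0.0, source, [])]
    visited = set()
    while frontier:
        cost, node, plan = heapq.heappop(frontier)
        if node == target:
            return cost, plan
        if node in visited:
            continue
        visited.add(node)
        for nxt, step_cost, label in adjacency.get(node, []):
            heapq.heappush(frontier, (cost + step_cost, nxt, plan + [label]))
    return float("inf"), []

if __name__ == "__main__":
    candidate_edges = [
        ("in", "a", 5.0, "conv"),                  # run conv alone
        ("a", "out", 4.0, "bn+scale+relu"),        # then the pointwise chain alone
        ("in", "out", 6.0, "conv+bn+scale+relu"),  # fully fused alternative
    ]
    print(shortest_path(candidate_edges, "in", "out"))  # fused plan wins (6.0 < 9.0)
```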
The third intermediate representation generating unit 940 may be configured to perform scheduling optimization on the second intermediate representation to obtain a fine-grained third intermediate representation. Specifically, the third intermediate representation generating unit 940 may be configured to perform scheduling optimization on the second intermediate representation based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed, so as to obtain a third intermediate representation indicating the block execution scheme of the feature maps and/or weights.
In one embodiment, the third intermediate representation generating unit 940 may be further configured to: obtain, based on the attribute information of the hardware platform, a third intermediate representation indicating the instruction dependencies between the execution instructions of the block execution scheme of the feature maps and/or weights.
Preferably, the intermediate representation generating apparatus may further comprise a fourth intermediate representation generating unit 950 (which may also be referred to as a compiling unit) for compiling the third intermediate representation into instruction code for execution on a hardware platform.
It will be appreciated that the above-described intermediate representation generating apparatus of the present invention may be implemented as a compiler architecture; fig. 10 shows a schematic diagram of a compiler architecture according to an embodiment of the present invention. It should be appreciated that, although fig. 10 shows the various modules in detail, the example of fig. 10 may be viewed as a superposition of multiple preferred embodiments, which may be implemented separately or partially in combination in different embodiments.
Compiler architecture 1000 may include a computational graph construction module 1010, a computational graph optimization module 1020, and an instruction generation module 1030. Further, the computational graph construction module 1010 may include a model file parsing module 1011 and a computational graph generation module 1012. The model file parsing module 1011 and the computational graph generation module 1012 may correspond to the aforementioned parsing unit and first intermediate representation generating unit, respectively.
The model file parsing module 1011 may include parsing sub-modules, each corresponding to one type of model file and used to parse model files of that type. As shown, the model file parsing module 1011 may include a Caffe parser and a TensorFlow parser for parsing model files obtained via the Caffe and TensorFlow deep learning frameworks, respectively. The model file parsing module 1011 may also include parsers for other deep learning frameworks (shown as "other" in the figure). If support for a new deep learning framework is needed later, only a parser for that framework's models has to be added; since the subsequent optimization is mostly framework-independent, this improves the extensibility and compatibility of the compiler architecture.
Neural network models developed on different deep learning frameworks can be parsed into the framework-independent IR, i.e., the first IR of the present invention, via the model file parsing module 1011. By exploiting the characteristics of the Python scripting language, parsing of the model and conversion to the IR can be implemented conveniently. The computational graph generation module 1012 can thus conveniently generate the first IR based on the parsing result of the corresponding parsing sub-module.
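A minimal sketch of this per-framework parser idea follows: one parser is registered per deep learning framework, and all of them produce the same framework-independent structure. The registry decorator and the stub parser bodies are illustrative assumptions; they stand in for the real Caffe and TensorFlow parsers, which would read a prototxt/caffemodel pair or a frozen GraphDef.

```python
# Sketch of the parser sub-module idea: one parser per deep learning
# framework, registered under the framework name, all producing the same
# framework-independent first IR.
PARSERS = {}

def register_parser(framework):
    def wrap(fn):
        PARSERS[framework] = fn
        return fn
    return wrap

@register_parser("caffe")
def parse_caffe(model_file):
    # a real parser would read the prototxt/caffemodel pair here
    return {"framework": "caffe", "layers": ["conv1", "relu1"]}

@register_parser("tensorflow")
def parse_tensorflow(model_file):
    # a real parser would walk the frozen GraphDef here
    return {"framework": "tensorflow", "layers": ["Conv2D", "Relu"]}

def parse_model(framework, model_file):
    try:
        return PARSERS[framework](model_file)
    except KeyError:
        raise ValueError(f"no parser registered for framework '{framework}'")

if __name__ == "__main__":
    print(parse_model("caffe", "deploy.prototxt"))
```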
After the first IR is constructed, the first IR file (i.e., the "computational graph file" in fig. 10) may optionally be exported from the computational graph generation module 1012. The first IR (i.e., the "computational graph" in fig. 10) is then sent to the computational graph optimization module 1020, which can optimize it to obtain the second IR. There are many ways to optimize the computational graph. The computational graph optimization module 1020 may correspond to the aforementioned second intermediate representation generating unit. In one embodiment, the computational graph optimization module 1020 may include a pruning module 1021, a fusion module 1022, and a decomposition module 1023, as shown in fig. 10.
The pruning module 1021 may be used to remove operations that are not needed or have no impact on the calculation results. The fusion module 1022 may fuse computing operations, for example to reduce the number of data interactions between the hardware platform and the external memory when the instruction code is executed. The decomposition module 1023 may decompose computing operations, so as to merge the decomposed computing operations with preceding or following computing operations or to enable processing of the decomposed operations.
The computational graph optimization module 1020 may also achieve efficient optimization of computational graphs by introducing sub-graph templates and cost edges as previously described.
After the optimization in graph form is completed, a second IR file (i.e., the "optimized computation graph file" in fig. 10) can optionally be exported from the computational graph optimization module 1020. The second IR (i.e., the "optimized computational graph" in fig. 10) is then fed into the instruction generation module 1030 to complete the mapping of the optimized second IR to hardware platform (e.g., DPU) instructions. The instruction generation module 1030 may also include functionally separate modules, such as a scheduling optimization module for performing scheduling optimization on the second IR to generate the third IR, and an instruction code generation module for generating the hardware platform instruction code from the third IR, corresponding to the aforementioned third and fourth intermediate representation generating units, respectively.
Likewise, the instruction generation module 1030 may map the above-described block execution scheme and/or instruction dependencies to the instruction code of the hardware platform by encoding the third IR.
In one embodiment, the compiler architecture of the present invention may further include a neural network forwarding module configured to provide a standard answer against which the results of executing the instruction code on the hardware platform are compared. Preferably, the neural network forwarding module may use at least a part of the deep learning framework targeted by the model file to compute the standard answer; for example, the deep learning framework used for neural network training is used directly to perform the computation in software and obtain the standard answer for the comparison. However, since the neural network algorithm introduces differences due to fixed-point quantization and the like once deployed on the hardware platform, the neural network forwarding module may modify at least a part of the deep learning framework it uses, based on the hardware platform, to obtain the standard answer for the model file as executed on the hardware platform.
Because different deep learning frameworks adopt different underlying floating-point operation libraries, their floating-point results also differ due to factors such as operation order and different truncation methods. Therefore, even with the same network structure, the floating-point results of different deep learning frameworks can differ considerably; for fixed-point arithmetic the differences are smaller but still exist. In addition, the computing parameters of different deep learning frameworks also differ, such as the handling of pad, the arrangement order of data, the calculation method of average pooling, and the transformation method of Reorg. The forwarding module, as the module providing the standard answer, needs to eliminate the variability between these frameworks. More conservatively, the operator implementations of the mainstream deep learning frameworks can be incorporated into the forwarding module and modified to a certain extent, so that their computing operations are completely consistent with the behavior of the back-end hardware, including shift operations, boundary extension rules, and the like.
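The sketch below shows the shape of such a comparison: a software reference is first rounded onto the fixed-point grid the hardware is assumed to use, and the hardware output is then checked element-wise against it within a tolerance. The symmetric power-of-two quantization and the tolerance value are illustrative assumptions, not the actual fixed-point scheme of any particular DPU.

```python
# Sketch of the forwarding (reference) module's role: run a software reference
# that mimics the hardware's assumed rounding behavior, then compare the
# hardware output element-wise within a tolerance.
def quantize(values, fraction_bits=6):
    """Round values onto the fixed-point grid the hardware is assumed to use."""
    scale = 1 << fraction_bits
    return [round(v * scale) / scale for v in values]

def compare_with_reference(hardware_out, float_reference, fraction_bits=6,
                           tolerance=1e-6):
    reference = quantize(float_reference, fraction_bits)
    mismatches = [(i, hw, ref) for i, (hw, ref) in
                  enumerate(zip(hardware_out, reference))
                  if abs(hw - ref) > tolerance]
    return mismatches  # empty list means the hardware matches the reference

if __name__ == "__main__":
    reference = [0.1234, 0.5678, 0.9]
    hardware = [0.125, 0.5625, 0.90625]  # pretend DPU results
    print(compare_with_reference(hardware, reference)
          or "all outputs match the reference")
```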
The compiler architecture of the present invention can also be regarded as a generalized compiler that uses IRs of different granularities and different forms to convert and describe the algorithm at different stages of compilation. The main purpose of these IRs is to simplify, as far as possible, the optimization work of compiler developers and to generate instructions with the highest efficiency, while also reducing the difficulty of use for the user as far as possible, so that the user does not need to pay attention to the implementation details of the algorithm and an end-to-end development flow from algorithm design to hardware deployment is truly realized.
FIG. 11 illustrates a schematic diagram of a computing device that may be used to implement the above-described intermediate representation generation method according to one embodiment of the invention.
Referring to fig. 11, a computing device 1100 includes a memory 1110 and a processor 1120.
Processor 1120 may be a multi-core processor or may include multiple processors. In some embodiments, processor 1120 may comprise a general-purpose main processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), and so on. Memory 1110 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1120 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store the instructions and data required by some or all of the processors at runtime. Furthermore, memory 1110 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some embodiments, memory 1110 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, and the like. The computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 1110 has executable code stored thereon which, when executed by the processor 1120, causes the processor 1120 to perform the intermediate representation generation method described above.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (19)
1. A method for generating, during compilation of a neural network model, instruction code for execution on a hardware platform, the method comprising:
analyzing the input model file to obtain topological structure information of the neural network;
using the feature map information and the computing operation information in the topology structure information as nodes and edges, respectively, to generate a first intermediate representation in graph form; and
merging computing operations to obtain a second intermediate representation in hypergraph form, in which feature maps serve as nodes and the merged computing operations serve as edges;
performing scheduling optimization on the second intermediate representation based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed, so as to obtain a third intermediate representation indicating a block execution scheme of feature maps and/or weights; and
compiling the third intermediate representation into instruction code for execution on a hardware platform,
wherein merging the computing operations comprises:
when a plurality of computing operation merging modes exist, adding, between the input node and the output node corresponding to each computing operation merging mode of the first intermediate representation, an edge corresponding to the execution cost of that computing operation merging mode, and solving for an optimal computing operation merging scheme as a shortest-path problem between the nodes.
2. The method of claim 1, wherein the first intermediate representation further comprises node attributes and edge attributes, the node attributes comprising at least one of:
dimension information and height-width-channel information of the feature map;
the computing operations represented by the edges include at least one of:
convolution, pooling, dimension transformation, element-wise addition, deconvolution, rearrangement, nonlinearity, batch normalization, and scaling; and
the edge attributes include parameters of the computing operation and include at least one of:
convolution kernel size, padding, stride, grouping, and dilation.
3. The method of claim 1, wherein merging the computing operations comprises at least one of:
removing operations which are not needed or have no influence on the calculation result;
fusing a plurality of adjacent computing operations; and
decomposing a computing operation to merge the decomposed computing operations with preceding or following computing operations or to enable processing of the decomposed computing operations.
4. The method of claim 1, wherein merging the computing operations comprises:
setting a sub-graph template capable of computational operation merging, acquiring at least one sub-graph matching scheme of a computational graph for the first intermediate representation, and reconstructing the computational graph into the second intermediate representation merged by the computational operation based on the sub-graph matching scheme.
5. The method of claim 4, wherein the sub-graph template is determined based on attribute information of a hardware platform on which instruction code compiled from the intermediate representation is to be executed.
6. The method of claim 1, wherein the second intermediate representation is represented by a domain-specific language designed based on the Scheme language.
7. The method of claim 1, wherein performing scheduling optimization on the second intermediate representation, based on the hardware platform on which the instruction code compiled from the intermediate representation is to be executed, to obtain the third intermediate representation indicating the block execution scheme of the feature maps and/or weights comprises:
obtaining, based on the attribute information of the hardware platform, a third intermediate representation indicating the instruction dependencies between the execution instructions of the block execution scheme of the feature maps and/or weights.
8. The method of claim 1, wherein the third intermediate representation is represented by a language that writes each computing operation as multiple loops.
9. The method of claim 1, wherein the hardware platform comprises at least one of:
a neural network special-purpose computing platform realized based on FPGA or ASIC;
a neural network special-purpose computing platform realized based on the GPU; and
A general purpose computing platform.
10. An apparatus for generating, during compilation of a neural network model, instruction code for execution on a hardware platform, the apparatus comprising:
the analysis unit is used for analyzing the input model file to acquire topological structure information of the neural network;
a first intermediate representation generating unit configured to generate a first intermediate representation in a graph form using the feature map information and the calculation operation information in the topology information as nodes and edges, respectively; and
a second intermediate representation generating unit comprising a computing operation merging unit, the computing operation merging unit being configured to merge computing operations to obtain a second intermediate representation in hypergraph form, in which feature maps serve as nodes and the merged computing operations serve as edges;
a third intermediate representation generating unit configured to schedule-optimize the second intermediate representation based on attribute information of a hardware platform on which the instruction code compiled from the intermediate representation is to be executed, to obtain a third intermediate representation indicating a block execution scheme of a feature map and/or a weight;
a compiling unit for compiling the third intermediate representation into instruction code for execution on a hardware platform,
wherein the computing operation merging unit is further configured to:
when a plurality of computing operation merging modes exist, add, between the input node and the output node corresponding to each computing operation merging mode of the first intermediate representation, an edge corresponding to the execution cost of that computing operation merging mode, and solve for an optimal computing operation merging scheme as a shortest-path problem between the nodes.
11. The apparatus of claim 10, wherein the first intermediate representation further comprises a node attribute and an edge attribute, the node attribute comprising at least one of:
dimension information and height-width-channel information of the feature map;
the computing operations represented by the edges include at least one of:
convolution, pooling, dimension transformation, element-wise addition, deconvolution, rearrangement, nonlinearity, batch normalization, and scaling; and
the edge attributes include parameters of the computing operation and include at least one of:
convolution kernel size, padding, stride, grouping, and dilation.
12. The apparatus of claim 10, wherein the computing operation merging unit is configured to perform at least one of:
removing operations which are not needed or have no influence on the calculation result;
fusing a plurality of adjacent computing operations; and
decomposing a computing operation to merge the decomposed computing operations with preceding or following computing operations or to enable processing of the decomposed computing operations.
13. The apparatus of claim 10, wherein the computational operations merging unit is further to:
setting a sub-graph template capable of computational operation merging, acquiring at least one sub-graph matching scheme of a computational graph for the first intermediate representation, and reconstructing the computational graph into the second intermediate representation merged by the computational operation based on the sub-graph matching scheme.
14. The apparatus of claim 13, wherein the sub-graph template is determined based on attribute information of a hardware platform on which instruction code compiled from the intermediate representation is to be executed.
15. The apparatus of claim 10, wherein the second intermediate representation is represented by a domain-specific language designed based on the Scheme language.
16. The apparatus of claim 10, wherein the third intermediate representation generating unit is further configured to:
obtain, based on the attribute information of the hardware platform, a third intermediate representation indicating the instruction dependencies between the execution instructions of the block execution scheme of the feature maps and/or weights.
17. The apparatus of claim 10, wherein the third intermediate representation is represented by a language that writes each computing operation as a multiple loop.
18. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-9.
19. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810829863.8A CN110764744B (en) | 2018-07-25 | 2018-07-25 | Intermediate representation generation method and device for neural network calculation |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN110764744A CN110764744A (en) | 2020-02-07 |
| CN110764744B true CN110764744B (en) | 2023-12-08 |
Family
ID=69327347
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810829863.8A Active CN110764744B (en) | 2018-07-25 | 2018-07-25 | Intermediate representation generation method and device for neural network calculation |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN110764744B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12321733B2 (en) * | 2021-12-22 | 2025-06-03 | Samsung Electronics Co., Ltd. | Apparatus and method with neural network computation scheduling |
Families Citing this family (32)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111222636B (en) * | 2020-01-07 | 2023-06-06 | 深圳鲲云信息科技有限公司 | Deep learning model conversion method, device, server and storage medium |
| CN111338635B (en) * | 2020-02-20 | 2023-09-12 | 腾讯科技(深圳)有限公司 | Graph compiling method, device, equipment and storage medium for calculation graph |
| CN111552477B (en) * | 2020-04-30 | 2023-11-03 | 上海商汤智能科技有限公司 | Data processing method and device |
| CN113687816B (en) * | 2020-05-19 | 2023-09-01 | 杭州海康威视数字技术股份有限公司 | Method and device for generating executable code of operator |
| CN111831285B (en) * | 2020-06-23 | 2023-03-14 | 西安电子科技大学 | Code conversion method, system and application for memory computing platform |
| CN112070202B (en) * | 2020-07-09 | 2021-09-03 | 安徽寒武纪信息科技有限公司 | Fusion graph generation method and device and computer readable storage medium |
| CN111915016B (en) * | 2020-07-10 | 2022-03-25 | 深圳云天励飞技术股份有限公司 | Deployment method and device of heterogeneous platform based on TVM compiler |
| CN111858351A (en) * | 2020-07-23 | 2020-10-30 | 深圳慕智科技有限公司 | Deep learning inference engine test method based on differential evaluation |
| CN114064119B (en) * | 2020-08-04 | 2025-03-11 | 第四范式(北京)技术有限公司 | Optimization method and optimization system for non-multiplication-addition computing operations in FPGA hardware accelerator |
| WO2022087788A1 (en) * | 2020-10-26 | 2022-05-05 | 华为技术有限公司 | Neural network compiling optimization method and related apparatus |
| CN112529175B (en) * | 2020-11-05 | 2022-03-18 | 上海交通大学 | Compiling method, system, computer storage medium and compiling device of neural network |
| CN114594988A (en) * | 2020-12-07 | 2022-06-07 | 华为技术有限公司 | A data processing device and method |
| CN112598121B (en) * | 2020-12-21 | 2025-04-18 | 北京时代民芯科技有限公司 | An operator optimization method for deep learning compilers |
| CN112527272B (en) * | 2020-12-25 | 2023-11-17 | 深圳云天励飞技术股份有限公司 | Method for docking TVM (transient voltage management) and related equipment |
| CN112711422B (en) * | 2020-12-31 | 2024-01-19 | 北京清微智能科技有限公司 | Neural network compiling optimization method and system |
| CN112818663A (en) * | 2021-01-15 | 2021-05-18 | 北京有竹居网络技术有限公司 | Processing method for language model, text generation method, text generation device and medium |
| CN114861860A (en) * | 2021-02-04 | 2022-08-05 | 华为技术有限公司 | Deep learning model processing method and device and electronic equipment |
| CN113157917B (en) * | 2021-03-15 | 2023-03-24 | 西北大学 | OpenCL-based optimized classification model establishing and optimized classification method and system |
| CN113590193B (en) * | 2021-07-12 | 2024-03-22 | 苏州仰思坪半导体有限公司 | Computing device, computing method, computing medium and computing equipment |
| CN113703775B (en) * | 2021-08-31 | 2023-11-28 | 上海阵量智能科技有限公司 | Compiling method, compiling device, compiling equipment and storage medium |
| CN113885845B (en) * | 2021-09-30 | 2024-01-12 | 苏州浪潮智能科技有限公司 | Methods, systems, equipment and media for generating computational graphs for deep learning compilers |
| CN113918163B (en) * | 2021-10-08 | 2024-10-25 | 苏州浪潮智能科技有限公司 | A neural network model compilation method, system, device and storage medium |
| CN114492772B (en) * | 2021-11-16 | 2025-09-16 | 阿里云计算有限公司 | Neural network tensor shape tracking method and computing platform |
| CN114398080A (en) * | 2021-12-17 | 2022-04-26 | 飞腾信息技术有限公司 | Data processing method, device and equipment and computer storage medium |
| CN114385183B (en) * | 2021-12-17 | 2025-02-07 | 飞腾信息技术有限公司 | A data processing method, device, equipment and computer storage medium |
| CN114356340B (en) * | 2021-12-30 | 2025-07-22 | 上海阵量智能科技有限公司 | Compiling method and device of neural network, computer equipment and storage medium |
| CN115098107B (en) * | 2022-06-21 | 2024-04-19 | 清华大学 | Code generation method and device for neural network tasks |
| CN115796284B (en) * | 2023-02-08 | 2023-05-09 | 苏州浪潮智能科技有限公司 | Reasoning method, device, storage medium and equipment based on TVM compiler |
| CN116796793A (en) * | 2023-04-27 | 2023-09-22 | 阿里巴巴(中国)有限公司 | Calculation method and device for neural network model |
| CN116301904B (en) * | 2023-05-18 | 2023-08-22 | 之江实验室 | An operator optimization acceleration method and device for deep learning compiler |
| CN117008916B (en) * | 2023-07-06 | 2024-08-20 | 清华大学 | Tensor program optimization method and device |
| CN116974580B (en) * | 2023-09-25 | 2024-01-09 | 之江实验室 | Multimodal network compilation method, system and storage medium |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10157045B2 (en) * | 2016-11-17 | 2018-12-18 | The Mathworks, Inc. | Systems and methods for automatically generating code for deep learning systems |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107111505A (en) * | 2015-01-19 | 2017-08-29 | 华为技术有限公司 | System and method for performing algorithm on Heterogeneous Parallel Systems |
| CN108292241A (en) * | 2015-10-28 | 2018-07-17 | 谷歌有限责任公司 | Processing Computational Graphs |
| CN105867994A (en) * | 2016-04-20 | 2016-08-17 | 上海交通大学 | Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier |
| CN106650922A (en) * | 2016-09-29 | 2017-05-10 | 清华大学 | Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system |
| CN107992299A (en) * | 2017-11-27 | 2018-05-04 | 郑州云海信息技术有限公司 | Neutral net hyper parameter extraction conversion method, system, device and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| MaosongRan. Learning PyTorch by Examples. https://www.jianshu.com/p/52684285e335, 2018, pp. 1-9. * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN110764744A (en) | 2020-02-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110764744B (en) | Intermediate representation generation method and device for neural network calculation | |
| CN110766147B (en) | Neural network compiler architecture and compiling method | |
| US10489703B2 (en) | Memory efficiency for convolutional neural networks operating on graphics processing units | |
| US11514324B2 (en) | Methods of optimization of computational graphs of neural networks | |
| US12430110B1 (en) | Global modulo allocation in neural network compilation | |
| US10901715B1 (en) | Lazy compilation and kernel fusion in dynamic computation graphs | |
| CN113748399B (en) | Method, apparatus and readable medium for scheduling computational graphs on heterogeneous computing resources | |
| US9823911B2 (en) | Method and apparatus for compiling code based on a dependency tree | |
| US11144291B1 (en) | Loop-oriented neural network compilation | |
| KR102405578B1 (en) | Context-Aware Cross-Sentence Relation Extraction Apparatus with Knowledge Graph, and Method Thereof | |
| CN111104120A (en) | Neural network compiling method and system and corresponding heterogeneous computing platform | |
| US12321849B1 (en) | Performing hardware operator fusion | |
| CN114385180B (en) | Data processing method, device, equipment and computer storage medium | |
| CN114492772A (en) | Neural network tensor shape tracking method and computing platform | |
| CN115860061A (en) | Graph Neural Network Optimization Method and Graph Neural Network Reasoning System | |
| Dong et al. | Acorns: A framework for accelerating deep neural networks with input sparsity | |
| US12008469B1 (en) | Acceleration of neural networks with stacks of convolutional layers | |
| CN120958432A (en) | Combinable kernel | |
| US11782706B1 (en) | Reconfigurable neural network processing based on subgraph recognition | |
| Mezdour et al. | A deep learning model for loop interchange | |
| Liberis et al. | Pex: Memory-efficient microcontroller deep learning through partial execution | |
| Ganser et al. | Speeding up iterative polyhedral schedule optimization with surrogate performance models | |
| CN116627396A (en) | Polyhedral model nested cyclic transformation dynamic solving acceleration method | |
| Gómez-Hernández et al. | Using PHAST to port Caffe library: First experiences and lessons learned | |
| Shi et al. | ImageMap: enabling efficient mapping from image processing DSL to CGRA |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |