CN115203634A - Template-based GPU (graphics processing unit) high-performance tensor contraction method - Google Patents
- Publication number: CN115203634A
- Application number: CN202210343327.3A
- Authority: CN (China)
- Prior art keywords: tensor, dimension, scat, index, sequence
- Legal status: Granted (as listed by Google Patents; an assumption, not a legal conclusion)
Classifications
- G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F8/41 — Compilation
- G06F8/447 — Target code generation
Abstract
The invention discloses a template-based GPU (graphics processing unit) high-performance tensor contraction method. A user first defines a tensor contraction, whose indexes and dimensions are classified into four index sequences and four dimension sequences. These sequences are then dimensionally reduced to obtain memory-access functions and implicit dimensions; placeholders are defined, and computation templates are written following the implementation and optimization methods of BLAS libraries. At compile time a template is dispatched, the memory-access functions and implicit dimensions are substituted into the placeholders of the selected template, and CUDA C/C++ code is generated and compiled into a reusable executable program; finally, the data of each tensor and the concrete value of each dimension are input to complete the computation. The invention supports arbitrary tensor contraction with a finite set of templates; while remaining easy to use, its performance still approaches or even exceeds hand-optimized computation libraries such as cuBLAS and cuDNN.
Description
Technical Field
The invention belongs to the field of high-performance computing, and in particular relates to a template-based GPU high-performance tensor contraction method.
Background
The concept of a tensor is a generalization of vectors: scalars, vectors and matrices can be understood as tensors of order 0, 1 and 2, respectively. Tensors are widely used in machine learning, data analysis, scientific computing and other fields, providing a simple and clear abstraction for data and algorithms. However, tensor computation is very complex to implement in a system: while memory-access-intensive operators such as normalization and downsampling are easy to tune, computation-intensive operators such as matrix multiplication and convolution can hardly approach the peak performance of the hardware, and usually require detailed manual optimization by experts familiar with the target platform's architecture. Meanwhile, users often highly customize algorithms to their practical needs, and the computations involved often exceed the expressive range of common operators, causing the so-called "long tail" problem: the tensor's data layout and computation pattern are not supported by fixed-function computation libraries such as cuBLAS and cuDNN, and hand-writing a high-performance implementation imposes a huge burden of time and effort on the user.
Tensor computation generally follows some common patterns: tensors are often stored as multidimensional arrays, and the computation process is often a multiply nested loop, so a system implementer can design a Domain-Specific Language (DSL) to express and optimize such operations in a targeted manner.
For example, the machine learning software stack PlaidML (Intel. PlaidML. https://github.com/plaidml. Accessed: 2021-10-01.) contains an eDSL for users to customize long-tail operators, while Facebook designed Tensor Comprehensions (TensorComp; Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically. ACM Transactions on Architecture and Code Optimization, 2019.). The syntax and semantics of both are based on the Einstein summation convention and can support arbitrary tensor contraction, i.e. the generalization of matrix multiplication to tensors: some dimensions of the input tensors remain in the output tensor, while elements in the other dimensions are multiplied pairwise and accumulated. PlaidML eDSL and TensorComp are simple and easy to use and perform well on memory-access-intensive operators, but they do not yet take the actual architectural features of the back-end platform into account, and thus fall short in performance in computation-intensive scenarios.
Tensor computations in machine learning, data analysis, scientific computing and other fields can be roughly divided into two types according to the target platform's peak compute performance and maximum communication bandwidth. Operators such as downsampling, reduction and regularization generate little computation, and their performance bottleneck lies mainly in bandwidth, so their low-level implementation does not need particularly fine optimization: as long as it adapts to the target platform's storage structure, eliminates invalid memory accesses and maintains sufficient parallelism, it can maximize bandwidth utilization, and factors such as block size, parallel granularity and instruction scheduling have no obvious effect. In contrast, operators such as matrix multiplication and convolution are computation-intensive; a highly optimized implementation can approach the system's peak arithmetic performance, but considerable effort is required to reach this goal: the implementer must reasonably divide the data among multiple levels of parallel units, balance parallelism against data reuse, and carefully interleave different types of instructions to reduce pipeline stalls.
Tensor contraction is the generalization of matrix multiplication to arbitrary high-dimensional cases; it can be expressed by the Einstein summation convention, and with a slight extension it also covers convolution and its many variants. That is, all common computation-intensive operators fall within the expressive range of tensor contraction, which can therefore serve as the basic unit of tensor computation. In addition, tensor contraction can also support many memory-access-intensive operators: operations such as pooling and reduction are equivalent to special forms of tensor contraction, while more complex operators such as regularization and softmax can be expressed as combinations of multiple tensor contractions. Therefore, a high-performance implementation of arbitrary tensor contraction is taken as the target problem.
PlaidML eDSL and TensorComp introduced the polyhedral model in an attempt to solve this problem. They convert the user input at the front end into multiply nested loops, perform platform-independent generic loop transformations based on the polyhedral model at the middle end, and lower the intermediate representation to machine code for a specific back end. This approach works well for memory-access-intensive tasks, but it ignores the architectural characteristics of the target platform, so for computation-intensive tensor contractions its performance is mostly far from the theoretical peak. The problem is exacerbated on heterogeneous computing platforms; for example, on large-scale matrix multiplications TensorComp can perform as low as 20% of the native linear algebra library cuBLAS on NVIDIA GPUs. Yet, thanks to their excellent parallel processing capability, general-purpose GPUs are widely used in tensor computation and have become a de facto infrastructure in fields such as deep learning. Therefore, high-performance tensor contraction on the GPU is a very practical topic: solutions based on native computation libraries suffer from the long-tail problem and have a narrow application range, while the existing tensor-computation DSLs are flexible and easy to use but their performance remains to be improved.
Tensor contraction is well suited to acceleration on general-purpose GPUs because it can be recursively divided into many mutually independent sub-problems, and the control flow of the computation is static, independent of the input data. The difficulty of implementing tensor contraction on this many-core architecture is that the implementer must jointly consider the target platform's compute and memory characteristics, such as the number of parallel processing units and the latency, bandwidth and capacity of each cache level, and tile the tensor in each dimension with a proper size, so as to improve data reuse within a block as much as possible and avoid bandwidth bottlenecks; meanwhile, memory-access and compute instructions must be interleaved to hide communication latency, reduce stalls and approach peak compute performance. In addition, the SIMT (Single Instruction, Multiple Threads) model of general-purpose GPU computing (Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55, 2008.) introduces unique problems such as memory-transaction coalescing and bank conflicts, placing even finer requirements on the implementation of a tensor contraction algorithm.
In addition, a system implementer who wishes to support arbitrary tensor contraction faces a huge problem space, because the dimensions of a higher-order tensor may be arranged in an arbitrary order, yielding factorial complexity. In the BLAS numerical libraries of the various platforms, matrix multiplication generally handles only four possible cases, namely the 2 × 2 combinations resulting from each input matrix being column-major or row-major; tensor contraction faces far more than this. Take C_{a,b,d,e} = A_{a,b,c} × B_{c,d,e} as an example: this contraction of two third-order tensors over one dimension c admits 4! × 3! × 3! = 864 possible data layouts; for C_{a,d} = A_{a,b,c} × B_{b,c,d}, with the two reduction dimensions b and c, there are still 2! × 3! × 3! = 72 cases. Providing a separate implementation for each case is obviously infeasible; this calls for a system with good applicability and extensibility that can quickly handle a whole class of similar computations.
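The layout counts above are simply products of the factorials of the tensor orders; a small script (illustrative only, not part of the patent) reproduces both figures:

```python
from math import factorial

def layout_count(orders):
    """Number of distinct dimension arrangements across all tensors,
    i.e. the product of the factorials of each tensor's order."""
    p = 1
    for n in orders:
        p *= factorial(n)
    return p

# C_{a,b,d,e} = A_{a,b,c} x B_{c,d,e}: C has order 4, A and B order 3
print(layout_count([4, 3, 3]))  # 864
# C_{a,d} = A_{a,b,c} x B_{b,c,d}: C has order 2, A and B order 3
print(layout_count([2, 3, 3]))  # 72
```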
Disclosure of Invention
The object of the invention is to provide a template-based GPU high-performance tensor contraction technique that addresses the deficiencies of the prior art. The invention can compute arbitrary tensor contractions on a general-purpose GPU, performing particularly well.
The purpose of the invention is realized by the following technical solution: a template-based GPU high-performance tensor contraction method, comprising the following steps:
(1) A user inputs the definition of a tensor contraction, whose indexes and dimensions are classified to obtain four index sequences p, y, x, r and four dimension sequences P, Y, X, R;
(2) Reducing the dimensionality of the index sequences p, y, x, r and the dimension sequences P, Y, X, R to obtain the memory-access functions and the implicit dimensions;
(3) Defining placeholders to represent the content related to the memory-access functions and implicit dimensions in the implicit batch matrix multiplication;
(4) Writing computation templates following the implementation and optimization methods of BLAS libraries, in which the placeholders of step (3) are reserved;
(5) Performing compile-time template dispatch, substituting the memory-access functions and implicit dimensions into the placeholders of the selected computation template to generate CUDA C/C++ code;
(6) Compiling the code generated in step (5) into a reusable executable program;
(7) Inputting data of each tensor and specific values of each dimension, and completing calculation by using the executable program compiled in the step (6).
Further, the step (1) comprises the following substeps:
(1.1) Notation convention: a tensor whose dimensions from high to low are D = (D_n, …, D_1) is denoted T_D; T[i] = T[i_n, …, i_1] denotes the scalar selected by the index sequence i = (i_n, …, i_1); the function ext(i_k) = D_k, k = 1, …, n maps a single index to its corresponding dimension, and Ext(i) = (ext(i_n), …, ext(i_1)) = (D_n, …, D_1) = D maps an index sequence to its corresponding dimension sequence. Any tensor contraction C = A × B input by the user is represented by the Einstein summation convention:
C[ic]=A[ia]×B[ib] (1)
where ic, ia, ib are the index sequences of the tensors C, A, B, respectively.
(1.2) Defining the batch index sequence: define p = ic ∩ ia ∩ ib, the sequence of all batch indexes; each of its elements indexes all three tensors. The order in which the elements of p appear in ic, ia, ib may differ; any one order may be adopted here, and the same holds for the following sequences;
(1.3) Defining the free index sequences y, x of the input tensors A, B, respectively: define y = (ic ∩ ia) \ p, the sequence of all free indexes of A, which index only C and A; define x = (ic ∩ ib) \ p, the sequence of all free indexes of B, which index only C and B.
(1.4) Defining the reduction index sequence: define r = (ia ∩ ib) \ p as the reduction index sequence; over Ext(r), the elements of the A and B tensors are multiplied pairwise and reduced to scalars.
(1.5) Defining the permutation functions: transform equation (1) equivalently, substituting p, y, x, r for ic, ia, ib and introducing three permutation functions shfl_C(·), shfl_A(·) and shfl_B(·), so that the original formula is rewritten as:
C[shfl_C(p,y,x)] = A[shfl_A(p,y,r)] × B[shfl_B(p,r,x)] (2)
(1.6) Defining the dimension sequence corresponding to each index sequence: define the batch dimension sequence P = Ext(p), whose elements are the dimensions corresponding to the indexes of the batch index sequence; define the free dimension sequence of A as Y = Ext(y), the free dimension sequence of B as X = Ext(x), and the reduction dimension sequence as R = Ext(r).
Further, the step (2) comprises the following substeps:
(2.1) Defining the dimensionality-reduction auxiliary functions: first define the mapping from an index sequence to a scalar index, i = gath_d(i), where d is an arbitrary dimension sequence that must be as long as the index sequence i. The function maps the offsets i over each dimension of d to a linear address i:
i = gath_d(i) = Σ_{k=1..n} i_k · Π_{j<k} d_j (5)
Define the inverse function i = scat_d(i) = gath_d^{-1}(i), which decomposes the linear address i into offsets over each dimension of d:
scat_d(i) = (i_n, …, i_1), where i_k = ⌊i / Π_{j<k} d_j⌋ mod d_k (6)
By means of these two auxiliary functions, the higher-order tensors in equation (2) are implicitly converted into matrices using a specific storage format.
(2.2) Reducing the dimensionality of the indexes, so that elements of the tensors are accessed with scalar indexes
The index sequences p, y, x, r are each converted into the components of a scalar index over the corresponding dimension sequence. Define the scalar index p = gath_P(p), with value range [0, ΠP); the index sequence is recovered as p = gath_P^{-1}(p) = scat_P(p). Denote the other index sequences as y = scat_Y(y), x = scat_X(x), r = scat_R(r), where each newly defined scalar index ranges over y ∈ [0, ΠY), x ∈ [0, ΠX), r ∈ [0, ΠR). Equation (2) is rewritten as:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (7)
(2.3) Reducing the dimensionality of the dimensions, expressing a tensor of any order and any dimension arrangement as a special matrix
The actual computation performed by equation (7) is:
for p ∈ [0, ΠP), y ∈ [0, ΠY), x ∈ [0, ΠX):
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = Σ_{r=0}^{ΠR-1} A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (8)
The left-hand side of equation (8) is denoted C(p,y,x) = C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))], and the accesses to the two input tensors on the right-hand side are denoted A(p,y,r) = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))], B(p,r,x) = B[shfl_B(scat_P(p), scat_R(r), scat_X(x))]; equation (8) is rewritten as:
for p ∈ [0, P), y ∈ [0, Y), x ∈ [0, X):
C(p,y,x) = Σ_{r=0}^{R-1} A(p,y,r) × B(p,r,x) (9)
which involves only 4 scalar dimensions, namely P = ΠP, Y = ΠY, X = ΠX, R = ΠR.
(2.4) Implicitly converting the tensor contraction into a matrix multiplication: C(p,y,x), A(p,y,r), B(p,r,x) are called the memory-access functions of the output and input tensors; they abstract the tensors into batch matrices with a specific data layout. The four scalars P, Y, X, R are called the implicit dimensions, being the products of the batch dimensions, the free dimensions of A, the free dimensions of B, and the reduction dimensions, respectively. Equation (9) hides the reading and writing of data behind the memory-access functions and expresses the ranges of traversal and iteration with the implicit dimensions, thereby converting the tensor contraction into a batch matrix multiplication. Using the Einstein summation convention, equation (9) can be expressed as:
C(p,y,x)=A(p,y,r)×B(p,r,x) (10)
further, the step (3) comprises the following substeps:
(3.1) Defining the implicit-dimension placeholders: equation (9) contains four implicit dimensions P, Y, X, R, which represent the scale of the implicit batch matrices and are the products of the corresponding dimension sequences of the original tensors. Accordingly, four implicit-dimension placeholders ph_P, ph_Y, ph_X, ph_R are defined, corresponding to the above implicit dimensions.
(3.2) Defining the memory-access-function placeholders: in equation (10), the reading and writing of the implicit matrices are abstracted by the three memory-access functions A(p,y,r), B(p,r,x) and C(p,y,x); three memory-access-function placeholders ph_A(p,y,k), ph_B(p,k,x) and ph_C(p,y,x) are defined to represent the elements at the corresponding positions.
Further, step (4) specifically comprises: write the computation-template code of the linear algebra operation following the implementation and optimization methods of BLAS libraries; wherever a matrix dimension is referenced, write it as one of the four implicit-dimension placeholders ph_P, ph_Y, ph_X, ph_R; wherever any matrix element is accessed, write it as one of the three memory-access-function placeholders ph_A(p,y,k), ph_B(p,k,x), ph_C(p,y,x).
Further, step (5) specifically comprises: parse the symbolic expression (9) of the tensor contraction to obtain the contents of the four dimension sequences P, Y, X, R, and select a specific linear algebra computation template at compile time according to whether each sequence is empty or not.
After the template is selected, the placeholders' contents are substituted into the computation template to generate legal CUDA C/C++ code.
The beneficial effects of the invention are:
(1) The invention designs a universal tensor computation framework that converts an arbitrary tensor contraction into a basic linear algebra operation represented by GEMM, through concepts such as memory-access functions and implicit dimensions. Through dispatch in the two stages of compile time and run time, each input tensor contraction is matched to a specific computation template, so that a finite set of templates supports arbitrary tensor contraction;
(2) The invention designs an Einstein-summation-convention-based DSL as the front-end representation of the universal framework and implements it on NVIDIA GPUs. While remaining easy to use, its performance still approaches or even exceeds hand-optimized computation libraries such as cuBLAS and cuDNN.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention designs a tensor contraction method based on computation templates, which implicitly transforms tensors of any order and any dimension arrangement into matrices without occupying extra space, thereby reducing arbitrary tensor contractions to a finite set of linear algebra operations, implemented through the computation templates.
The computation templates use a syntax consistent with the target platform's native code, namely CUDA C/C++, in which a set of placeholders to be completed is reserved; their concrete contents are filled in according to the tensor contraction definition, executable code is generated by compilation, and finally the user's tensor data is received to obtain the tensor contraction result.
The performance provided by the computation templates rivals that of highly optimized numerical libraries on the target platform, such as cuBLAS and cuDNN on NVIDIA GPUs, while retaining a certain code-generation capability so that similar computations can be supported.
The method comprises the following steps:
1. The user inputs the definition of a tensor contraction, whose indexes and dimensions are classified to obtain the four index sequences p, y, x, r and the four dimension sequences P, Y, X, R.
1.1 Notation convention
First agree on the notation used in the invention. A tensor whose dimensions from high to low are D = (D_n, …, D_1) is denoted T_D. An offset in each of the n dimensions uniquely determines an element of T, and T[i] = T[i_n, …, i_1] denotes the scalar selected by the index sequence i = (i_n, …, i_1). The function ext(i_k) = D_k, k = 1, …, n maps a single index to its corresponding dimension, and Ext(i) = (ext(i_n), …, ext(i_1)) = (D_n, …, D_1) = D maps an index sequence to its corresponding dimension sequence.
Any tensor contraction C = A × B input by the user is represented by the Einstein summation convention:
C[ic]=A[ia]×B[ib] (1)
where ic, ia, ib are the index sequences of the tensors C, A, B, respectively.
1.2 defining batch index sequences
Define p = ic ∩ ia ∩ ib, the sequence of all batch indexes; each of its elements indexes all three tensors. The order in which the elements of p appear in ic, ia, ib may be non-uniform; any one order may be used here, and likewise for the following sequences.
1.3 defining the free index sequences y, x of the input tensors A, B, respectively
Define y = (ic ∩ ia) \ p, the sequence of all free indexes of A, which index only C and A.
Similarly, define x = (ic ∩ ib) \ p, the sequence of all free indexes of B, which index only C and B.
1.4 defining a reduced index sequence
Define r = (ia ∩ ib) \ p as the reduction index sequence; over Ext(r), the elements of the A and B tensors are multiplied pairwise and reduced to scalars.
1.5 defining a permutation function
Transform equation (1) equivalently: replace ic, ia, ib with p, y, x, r and introduce three permutation functions shfl_C(·), shfl_A(·) and shfl_B(·), rewriting the original formula as
C[shfl_C(p,y,x)] = A[shfl_A(p,y,r)] × B[shfl_B(p,r,x)] (2)
The permutation function shfl_C(·) is introduced, with ic = shfl_C(p,y,x), for the following reason: all the indexes in p, y, x may appear interleaved in any order as the offsets in the dimensions of the tensor C; shfl_C abstracts over all possible permutations of the indexes on C and accordingly represents the various possible dimension arrangements of C. Similarly, the two input tensors introduce the corresponding permutation functions shfl_A(·) and shfl_B(·), representing the various possible dimension arrangements of the tensors A and B.
1.6 defining the dimension sequences corresponding to the respective index sequences
Define the batch dimension sequence P = Ext(p), whose elements are the dimensions corresponding to the indexes of the batch index sequence. Similarly, define the free dimension sequence of A as Y = Ext(y), the free dimension sequence of B as X = Ext(x), and the reduction dimension sequence as R = Ext(r).
Example 1: batch matrix multiplication C_{T,M,N} = A_{T,M,K} × B_{T,K,N} can be expressed using the Einstein summation convention as
C[t,m,n]=A[t,m,k]×B[t,k,n]. (3)
t appears in all tensors and is the batch index; T = ext(t) is the corresponding batch dimension. m appears in the output tensor C and the input tensor A, and is a free index of A; M = ext(m) is called a free dimension of A. Similarly, n is a free index of the input tensor B, and N = ext(n) is a free dimension of B. k, which appears only in the two input tensors, is the reduction index, and K = ext(k) is the reduction dimension: over it, the matrix multiplication multiplies the elements of the input tensors pairwise and accumulates (reduces) them into a scalar.
Therefore, in the batch matrix multiplication of equation (3), the index sequences are p = (t), y = (m), x = (n), r = (k), and the dimension sequences are P = (T), Y = (M), X = (N), R = (K).
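The classification of steps 1.2-1.4 can be sketched in a few lines of Python (an illustration, not the patent's implementation); it reproduces the sequences of Example 1:

```python
def classify(ic, ia, ib):
    """Classify the indexes of C[ic] = A[ia] x B[ib] into the batch (p),
    free-of-A (y), free-of-B (x) and reduction (r) sequences."""
    p = tuple(i for i in ic if i in ia and i in ib)      # ic ∩ ia ∩ ib
    y = tuple(i for i in ia if i in ic and i not in p)   # (ic ∩ ia) \ p
    x = tuple(i for i in ib if i in ic and i not in p)   # (ic ∩ ib) \ p
    r = tuple(i for i in ia if i in ib and i not in p)   # (ia ∩ ib) \ p
    return p, y, x, r

# Example 1: batch matrix multiplication C[t,m,n] = A[t,m,k] x B[t,k,n]
print(classify(('t', 'm', 'n'), ('t', 'm', 'k'), ('t', 'k', 'n')))
# -> (('t',), ('m',), ('n',), ('k',))
```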
Example 2: equation (4) contracts the fourth-order tensors A and B over the dimensions of the reduction indexes a and b; c and d are free indexes of A, e and f are free indexes of B, and there is no batch index.
C[d,f,c,e]=A[a,b,c,d]×B[a,e,b,f] (4)
Therefore, in the tensor contraction of equation (4), the index sequences are p = (), y = (c,d), x = (e,f), r = (a,b), and the permutation functions satisfy shfl_C((), (c,d), (e,f)) = (d,f,c,e), shfl_A((), (c,d), (a,b)) = (a,b,c,d), shfl_B((), (a,b), (e,f)) = (a,e,b,f).
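A permutation function of step 1.5 can likewise be sketched: given a tensor's storage layout and the index names of each group, it rearranges the concrete offsets of those groups into storage order. The sketch below is illustrative (`make_shfl` is not a name from the patent); it reproduces shfl_C of Example 2:

```python
def make_shfl(layout, group_names):
    """Build the permutation function for a tensor whose dimension order
    is `layout`: the returned function takes one offset tuple per index
    group and permutes the offsets into storage order."""
    def shfl(*groups):
        env = {}
        for names, values in zip(group_names, groups):
            env.update(zip(names, values))
        return tuple(env[name] for name in layout)
    return shfl

# Example 2: output tensor C[d,f,c,e] with p = (), y = (c,d), x = (e,f)
shfl_C = make_shfl(('d', 'f', 'c', 'e'), [(), ('c', 'd'), ('e', 'f')])

# Symbolic application reproduces the text: shfl_C((), (c,d), (e,f)) = (d,f,c,e)
print(shfl_C((), ('c', 'd'), ('e', 'f')))   # ('d', 'f', 'c', 'e')
# Concrete offsets c=2, d=3, e=4, f=5 are permuted the same way
print(shfl_C((), (2, 3), (4, 5)))           # (3, 5, 2, 4)
```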
2. Reduce the dimensionality of the index sequences p, y, x, r and the dimension sequences P, Y, X, R to obtain the memory-access functions and the implicit dimensions
The tensors are expressed as matrices with a special storage format, so that tensor contraction reuses the same computation process as matrix multiplication (and its various degenerate cases).
2.1 defining a dimensionality reduction auxiliary function
The invention establishes a mapping between the dimensions of the tensor and those of the matrix, thereby converting the multiple linear indexes of a higher-order tensor into scalar indexes over the matrix batch, rows and columns.
An index sequence in a tensor contraction corresponds to a scalar index in matrix multiplication. Thus, first define the mapping from an index sequence to a scalar index, i = gath_d(i), where d can be any dimension sequence but must be as long as the index sequence i. The function maps the offsets i over each dimension of d to a linear address i:
i = gath_d(i) = Σ_{k=1..n} i_k · Π_{j<k} d_j (5)
Accordingly, define the inverse function i = scat_d(i) = gath_d^{-1}(i), which decomposes the linear address i into offsets over each dimension of d:
scat_d(i) = (i_n, …, i_1), where i_k = ⌊i / Π_{j<k} d_j⌋ mod d_k (6)
With the help of these two auxiliary functions, the higher-order tensors in equation (2) can be implicitly converted into matrices with a specific storage format.
2.2 dimensionality reduction of the index to access elements in the tensor with scalar indices
The previously defined index sequences p, y, x, r are each converted into the components of a scalar index over the corresponding dimension sequence. Define the scalar index p = gath_P(p); clearly p has value range [0, ΠP), so the index sequence can be expressed inversely as p = gath_P^{-1}(p) = scat_P(p). Similarly, denote the other index sequences as y = scat_Y(y), x = scat_X(x), r = scat_R(r), where each newly defined scalar index ranges over y ∈ [0, ΠY), x ∈ [0, ΠX), r ∈ [0, ΠR).
Replacing the original index sequences with the scalar indexes p, y, x, r, equation (2) is rewritten as:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (7)
at this time, each tensor can uniquely determine any element thereof by three scalar indexes.
2.3 Reduce the dimensionality of the dimensions, expressing a tensor of any order and any dimension arrangement as a special matrix
The actual computation performed by equation (7) is:
for p ∈ [0, ΠP), y ∈ [0, ΠY), x ∈ [0, ΠX):
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = Σ_{r=0}^{ΠR-1} A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (8)
The above equation uses index mapping functions such as scat_P(·) and permutation functions such as shfl_C(·); they contain the specific information of the tensor contraction, including the data layout of each tensor and the computation occurring between them. Once a particular tensor contraction is given, these functions are fixed, and the four scalar indexes p, y, x, r can be regarded as variables that traverse the output and input tensors to perform the computation. Therefore, the left-hand side of equation (8) is denoted C(p,y,x) = C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))], and the accesses to the two input tensors on the right-hand side are denoted A(p,y,r) = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))], B(p,r,x) = B[shfl_B(scat_P(p), scat_R(r), scat_X(x))]; equation (8) is rewritten as:
for p ∈ [0, P), y ∈ [0, Y), x ∈ [0, X):
C(p,y,x) = Σ_{r=0}^{R-1} A(p,y,r) × B(p,r,x) (9)
which involves only 4 scalar dimensions, namely P = ΠP, Y = ΠY, X = ΠX, R = ΠR.
2.4 Implicitly convert the tensor contraction into a matrix multiplication
Comparing the tensor contraction formula (9) with the batch matrix multiplication (3) shows that their forms are completely consistent. C(p,y,x), A(p,y,r), B(p,r,x) are called the memory-access functions of the output and input tensors; they abstract the tensors into batch matrices with a specific data layout. The four scalars P, Y, X, R are called the implicit dimensions, being the products of the batch dimensions, the free dimensions of A, the free dimensions of B, and the reduction dimensions, respectively. Equation (9) hides the reading and writing of data behind the memory-access functions and expresses the ranges of traversal and iteration with the implicit dimensions, thereby converting the tensor contraction into a batch matrix multiplication. Using the Einstein summation convention, equation (9) can be expressed as:
C(p,y,x)=A(p,y,r)×B(p,r,x) (10)
so far, arbitrary tensor compression has been implicitly converted to batch matrix multiplication.
3 defining placeholders to represent content related to memory access function and implicit dimension in implicit batch matrix multiplication
3.1 The invention completes the tensor contraction computation through the code of a batch matrix multiplication. To achieve this, placeholders are reserved in the batch matrix multiplication, and their concrete contents are filled in according to the definition of the tensor contraction.
3.2 defining implicit dimension placeholders
Equation (9) contains four implicit dimensions P, Y, X, R, which describe the scale of the implicit batch matrix and are the products of the corresponding dimension sequences of the original tensors. Therefore, four corresponding placeholders ph_P, ph_Y, ph_X, ph_R are introduced, corresponding respectively to the implicit dimensions above.
Example 1: Take the tensor contraction C_{M,N} = A_{K1,M,K2} × B_{K1,N,K2} as an example. It is very close to a matrix multiplication, but the reduction occurs over the two dimensions K1 and K2, so the implicit reduction dimension is ph_R = K1·K2.
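An implicit dimension is simply the product over its dimension sequence, as Example 1 shows for ph_R. A minimal sketch (an illustrative helper, not the patent's code):

```cpp
#include <cstdint>
#include <vector>

// Sketch: an implicit dimension is the product of its dimension sequence,
// e.g. ph_R = K1 * K2 for C_{M,N} = A_{K1,M,K2} x B_{K1,N,K2}.
// An empty sequence yields 1, which is the degenerate case used for dispatch.
int64_t implicit_dim(const std::vector<int64_t>& dims) {
    int64_t prod = 1;
    for (int64_t d : dims) prod *= d;
    return prod;
}
```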
3.3 Defining memory access function placeholders
In a batch matrix multiplication, three scalar indices uniquely identify any element of the batch matrix. In equation (10), the reads and writes of the implicit matrices are abstracted by the three access functions A(p, y, r), B(p, r, x), and C(p, y, x). Correspondingly, three placeholders ph_A(p, y, k), ph_B(p, k, x), and ph_C(p, y, x) are introduced to represent the elements at the corresponding positions.
Example 2: The dimensions of a tensor are symbolic names known at compile time, but their specific values are variables determined only at run time, so the logical higher-order tensor must be flattened into a one-dimensional array in the implementation. For example, the output tensor C_{T,M,N} of a single-precision batch matrix multiplication may be implemented as "float *C;", with its shape given by the dimension variables "int T, M, N;". The placeholder ph_C(p, y, x) of the output tensor is then instantiated to "C[p*M*N + y*N + x]".
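The instantiated access function of Example 2 is just a row-major linearization, which can be sketched as plain C++ (the function name is illustrative):

```cpp
#include <cstdint>

// Sketch of the instantiated placeholder ph_C(p, y, x) for a row-major
// batch matrix C_{T,M,N} flattened to a 1-D array: the linear offset is
// p*M*N + y*N + x (names follow Example 2; T itself does not appear in
// the offset, it only bounds p).
int64_t offset_c(int64_t p, int64_t y, int64_t x, int64_t M, int64_t N) {
    return p * M * N + y * N + x;
}
```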
Example 3: The implicit dimensions and the memory access functions together express the mutual conversion between a tensor and the implicit batch matrices. Table 1 further illustrates the meaning of the commonly used placeholders and gives application examples of the two kinds of placeholders in the two-dimensional depthwise convolution O_{N,K,C,Ho,Wo} = I_{N,C,Hi,Wi} × W_{K,C,R,S} with N feature maps, K groups of convolution kernels, and C channels. Here ⌊a/b⌋ denotes truncating integer division and a % b denotes the modulo operation a mod b.
Table 1: placeholder descriptions and examples
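The truncating division and modulo operations referenced in Table 1 implement the scat_d auxiliary function, which decomposes a linear address into per-dimension offsets. A minimal C++ sketch (our own helper, with the highest dimension first and the lowest dimension varying fastest):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of scat_d: decompose a linear address into offsets in each
// dimension of d, using truncating integer division (a / b) and the
// modulo operation (a % b), as in Table 1.
template <std::size_t N>
std::array<int64_t, N> scat(int64_t lin, const std::array<int64_t, N>& d) {
    std::array<int64_t, N> idx{};
    for (std::size_t k = N; k-- > 0;) {  // peel off the fastest dimension first
        idx[k] = lin % d[k];
        lin /= d[k];
    }
    return idx;
}
```

For d = (2, 3, 4), the maximal linear address 23 decomposes into the maximal offsets (1, 2, 3), and composing back with the row-major formula of Example 2 recovers 23.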
4 Writing a computation template according to BLAS library implementation and optimization methods, with the placeholders of step (3) reserved
4.1 In the solution provided by the invention, the computation template is still written in the native language of the target platform, i.e., CUDA C/C++. The computation template is legal CUDA code at the syntactic and semantic level, with the single exception that a set of special symbols, called placeholders (often abbreviated ph in pseudocode), is defined. The placeholders abstract the batch matrices obtained from the implicit tensor conversion and carry all the relevant information: memory access functions, implicit dimensions, data types, and tensor and dimension names. The computation template reads tensor data into caches through the information contained in the placeholders and then performs the subsequent high-performance, layout-independent batch matrix multiplication. Through mechanisms such as macros, inline functions, and template parameters, the data representation of the tensors is embedded at the placeholder positions, so that legal and readable CUDA code is generated.
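How placeholder substitution through macros can look is sketched below in plain C++, with a naive loop nest standing in for the optimized CUDA template. All ph_* definitions here are illustrative assumptions for a row-major batch matrix multiplication, not the patent's actual code; the point is that the template body names only placeholders, never concrete tensors or layouts:

```cpp
// Illustrative placeholder instantiation for C_{T,M,N} = A_{T,M,K} x B_{T,K,N}
// (row-major, flattened to 1-D arrays). A real instantiation would be
// generated from the contraction's symbolic representation.
#define ph_P T
#define ph_Y M
#define ph_X N
#define ph_R K
#define ph_A(p, y, k) A[((p) * ph_Y + (y)) * ph_R + (k)]
#define ph_B(p, k, x) B[((p) * ph_R + (k)) * ph_X + (x)]
#define ph_C(p, y, x) C[((p) * ph_Y + (y)) * ph_X + (x)]

// A naive "computation template": the loop nest references only the
// placeholders for dimensions and element accesses.
void bmm_template(const float* A, const float* B, float* C,
                  int T, int M, int N, int K) {
    for (int p = 0; p < ph_P; ++p)
        for (int y = 0; y < ph_Y; ++y)
            for (int x = 0; x < ph_X; ++x) {
                float acc = 0.f;
                for (int k = 0; k < ph_R; ++k)
                    acc += ph_A(p, y, k) * ph_B(p, k, x);
                ph_C(p, y, x) = acc;
            }
}

// Tiny check helper: multiply a fixed 1x2x2 batch by the 2x2 identity and
// return one element of the result (so the result equals the input A).
float demo_element(int y, int x) {
    const float A[4] = {1, 2, 3, 4};
    const float B[4] = {1, 0, 0, 1};
    float C[4];
    bmm_template(A, B, C, 1, 2, 2, 2);
    return C[y * 2 + x];
}
```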
4.2 Writing the matrix multiplication computation template using existing BLAS library implementation and optimization methods.
Matrix multiplication (GEMM) has three possible degenerate cases: inner product (DOT), outer product (GER), and matrix-vector multiplication (GEMV). Computation templates are written separately for these four cases.
Computation template code for the linear algebra operation is written according to existing BLAS library implementation and optimization methods. However, every time a matrix dimension is referenced, it is written as one of the four implicit dimension placeholders ph_P, ph_Y, ph_X, ph_R; every time a matrix element is accessed, it is written as one of the three memory access function placeholders ph_A(p, y, k), ph_B(p, k, x), ph_C(p, y, x).
5 Performing template dispatch at compile time according to the symbolic representation of the tensor contraction to generate CUDA C/C++ code
The previous step has implicitly represented the tensor contraction as a batch matrix multiplication. The four dimension sequences P, Y, X, R of the tensor contraction logically correspond to the four dimensions of the batch matrix multiplication. Therefore, when one of the sequences P, Y, X, R is empty, the corresponding dimension of the batch matrix multiplication equals 1, and the operation degenerates to a simpler linear algebra operation. Specific computation templates are designed for the different degenerate cases to maximize computational efficiency.
The analysis of the symbolic representation (9) of the tensor contraction has already produced the contents of the four dimension sequences P, Y, X, R; a specific linear algebra computation template is selected at compile time according to whether each of them is empty:
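One plausible compile-time dispatch rule is sketched below in plain C++. The mapping of empty sequences to BLAS kernels follows the degenerate cases named in step 4.2, but the concrete selection order here is our assumption, not the patent's exact rule:

```cpp
#include <string>

// Sketch of compile-period template dispatch: when a dimension sequence is
// empty, its implicit dimension is 1 and the batch matrix multiplication
// degenerates to a simpler BLAS kernel. Kernel names are illustrative.
std::string dispatch(bool P_empty, bool Y_empty, bool X_empty, bool R_empty) {
    if (R_empty)            return "GER";   // no reduction: outer product
    if (Y_empty && X_empty) return "DOT";   // both free dims gone: inner product
    if (Y_empty || X_empty) return "GEMV";  // one free dim gone: matrix-vector
    (void)P_empty;  // an empty P only collapses the batch count to 1
    return "GEMM";
}
```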
and after selecting a proper template, substituting the placeholder into the calculation template to generate a legal CUDA C/C + + code.
6 For the code generated by instantiating the computation template, a dynamic link library is generated through JIT (Just-In-Time) compilation and loaded into the address space of the user program. In this way, a reusable executable program is obtained.
7 The user inputs the data of each tensor and the specific value of each dimension, and the executable program compiled in the previous step completes the computation.
Claims (6)
1. A template-based GPU high-performance tensor contraction method, characterized by comprising the following steps:
(1) A user inputs the definition of a tensor contraction, and its indices and dimensions are classified to obtain four index sequences p, y, x, r and four dimension sequences P, Y, X, R;
(2) Dimensionality reduction is performed on the index sequences p, y, x, r and the dimension sequences P, Y, X, R to obtain the memory access functions and the implicit dimensions;
(3) Placeholders are defined to represent the content related to the memory access functions and the implicit dimensions in the implicit batch matrix multiplication;
(4) A computation template is written according to BLAS library implementation and optimization methods, with the placeholders of step (3) reserved;
(5) Template dispatch is performed at compile time, and the memory access functions and the implicit dimensions are substituted into the placeholders of the selected computation template to generate CUDA C/C++ code;
(6) The code generated in step (5) is compiled into a reusable executable program;
(7) The data of each tensor and the specific values of each dimension are input, and the computation is completed with the executable program compiled in step (6).
2. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (1) comprises the following substeps:
(1.1) Notation conventions: a tensor whose dimensions, from high to low, are D = (D_n, …, D_1) is denoted T_D; T[i] = T[i_n, …, i_1] denotes the scalar selected by the index sequence i = (i_n, …, i_1); the function ext(i_k) = D_k, k = 1, …, n maps a single index to its corresponding dimension, and Ext(i) = (ext(i_n), …, ext(i_1)) = (D_n, …, D_1) = D maps an index sequence to its corresponding dimension sequence; any user-input tensor contraction C = A × B is represented with the Einstein summation convention:
C[ic] = A[ia] × B[ib] (1)
wherein ic, ia, ib are the index sequences of the tensors C, A, B, respectively;
(1.2) Defining the batch index sequence: p = ic ∩ ia ∩ ib is defined as the sequence of all batch indices, each of whose elements indexes all three tensors; the order of the elements of p as they appear in ic, ia, ib may differ, and any fixed order may be adopted here, as for the following sequences;
(1.3) Defining the free index sequences y and x of the input tensors A and B: y = (ic ∩ ia) \ p is the sequence of all free indices of A, which index only C and A; x = (ic ∩ ib) \ p is the sequence of all free indices of B, which index only C and B;
(1.4) Defining the reduction index sequence: r = (ia ∩ ib) \ p is the reduction index sequence; over Ext(r) the elements of the tensors A and B are multiplied pairwise and reduced to scalars;
(1.5) Defining the permutation functions: through an equivalent transformation of formula (1), p, y, x, r replace ic, ia, ib, and three permutation functions shfl_C(·), shfl_A(·), shfl_B(·) are introduced, so that the original formula is rewritten as:
C[shfl_C(p, y, x)] = A[shfl_A(p, y, r)] × B[shfl_B(p, r, x)] (2)
(1.6) Defining the dimension sequence corresponding to each index sequence: the batch dimension sequence is P = Ext(p), each of whose elements is the dimension corresponding to an index of the batch index sequence; the free dimension sequence of A is Y = Ext(y), the free dimension sequence of B is X = Ext(x), and the reduction dimension sequence is R = Ext(r).
3. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (2) comprises the following substeps:
(2.1) Defining the dimensionality reduction auxiliary functions: first, the mapping i = gath_d(i) from an index sequence i = (i_n, …, i_1) to a scalar index i is defined, where d is an arbitrary dimension sequence of the same length as the index sequence i; the function maps the offsets of i in each dimension of d to a linear address i:
i = gath_d(i) = Σ_{k=1}^{n} i_k · Π_{j=1}^{k-1} d_j
the inverse function i = scat_d(i) = gath_d^{-1}(i) of the above function is defined; it decomposes the linear address i into the offsets in each dimension of d:
i_k = ⌊i / Π_{j=1}^{k-1} d_j⌋ % d_k, k = 1, …, n
by means of these two auxiliary functions, the higher-order tensors in formula (2) are implicitly converted into matrices with specific storage formats;
(2.2) Performing dimensionality reduction on the indices so that elements of the tensors are accessed with scalar indices:
the index sequences p, y, x, r are converted into the components of scalar indices in the corresponding dimensions; p = gath_P(p) is defined, where the scalar index p has range [0, ΠP), and conversely the index sequence is expressed as p = gath_P^{-1}(p) = scat_P(p); the other index sequences are denoted y = scat_Y(y), x = scat_X(x), r = scat_R(r), where the newly defined scalar indices have ranges y ∈ [0, ΠY), x ∈ [0, ΠX), r ∈ [0, ΠR); formula (2) is rewritten as:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))]
= A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (7)
(2.3) Performing dimensionality reduction on the dimensions, expressing a tensor of arbitrary order and arbitrary dimension arrangement as a special matrix:
the actual computation performed in formula (7) is:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = Σ_{r=0}^{ΠR-1} A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] · B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (8)
the left-hand side of formula (8) is denoted C(p, y, x) = C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))], and the accesses to the two input tensors on the right-hand side are denoted A(p, y, r) = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] and B(p, r, x) = B[shfl_B(scat_P(p), scat_R(r), scat_X(x))]; formula (8) is rewritten as:
C(p, y, x) = Σ_{r=0}^{R-1} A(p, y, r)·B(p, r, x) (9)
wherein only four scalar dimensions are included, i.e., P = ΠP, Y = ΠY, X = ΠX, R = ΠR;
(2.4) Implicitly converting the tensor contraction into a matrix multiplication: C(p, y, x), A(p, y, r), B(p, r, x) are called the memory access functions of the output and input tensors, which abstract the tensors into batch matrices with specific data layouts; the four scalars P, Y, X, R are called implicit dimensions and are the products of the batch dimension sequence, the free dimension sequence of A, the free dimension sequence of B, and the reduction dimension sequence, respectively; formula (9) hides the data read-write process behind the memory access functions and expresses the ranges of traversal and iteration with the implicit dimensions, thereby converting the tensor contraction into a batch matrix multiplication; with the Einstein summation convention, formula (9) can be expressed as:
C(p,y,x)=A(p,y,r)×B(p,r,x) (10)。
4. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (3) comprises the following substeps:
(3.1) Defining implicit dimension placeholders: formula (9) contains four implicit dimensions P, Y, X, R, which describe the scale of the implicit batch matrix and are the products of the corresponding dimension sequences of the original tensors; therefore, four corresponding implicit dimension placeholders ph_P, ph_Y, ph_X, ph_R are defined, corresponding respectively to the implicit dimensions above;
(3.2) Defining memory access function placeholders: in formula (10), the reads and writes of the implicit matrices are abstracted by the three access functions A(p, y, r), B(p, r, x), C(p, y, x); three memory access function placeholders ph_A(p, y, k), ph_B(p, k, x), ph_C(p, y, x) are defined to represent the elements at the corresponding positions.
5. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (4) is specifically: computation template code for the linear algebra operations is written according to BLAS library implementation and optimization methods; every time a matrix dimension is referenced, it is written as one of the four implicit dimension placeholders ph_P, ph_Y, ph_X, ph_R; every time a matrix element is accessed, it is written as one of the three memory access function placeholders ph_A(p, y, k), ph_B(p, k, x), ph_C(p, y, x).
6. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (5) is specifically: the analysis of the symbolic representation (9) of the tensor contraction yields the contents of the four dimension sequences P, Y, X, R, and a specific linear algebra computation template is selected at compile time according to whether each of them is empty; after the template is selected, the memory access functions and implicit dimensions are substituted into its placeholders to generate legal CUDA C/C++ code.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210343327.3A CN115203634B (en) | 2022-03-31 | 2022-03-31 | Template-based GPU high-performance tensor merging method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115203634A true CN115203634A (en) | 2022-10-18 |
| CN115203634B CN115203634B (en) | 2025-06-10 |
Family
ID=83574354
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210343327.3A Active CN115203634B (en) | 2022-03-31 | 2022-03-31 | Template-based GPU high-performance tensor merging method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115203634B (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070239643A1 (en) * | 2006-03-17 | 2007-10-11 | Microsoft Corporation | Document characterization using a tensor space model |
| US20190228344A1 (en) * | 2018-01-19 | 2019-07-25 | Electronics And Telecommunications Research Institute | Gpu-based adaptive blas operation acceleration apparatus and method thereof |
| US20200042867A1 (en) * | 2018-08-01 | 2020-02-06 | Nanjing Iluvatar CoreX Technology Co., Ltd. (DBA “Iluvatar CoreX Inc. Nanjing”) | Hardware architecture for accelerating artificial intelligent processor |
| US20210232969A1 (en) * | 2018-12-24 | 2021-07-29 | Intel Corporation | Methods and apparatus to process a machine learning model in a multi-process web browser environment |
| CN113468469A (en) * | 2021-06-02 | 2021-10-01 | 北京迈格威科技有限公司 | Convolution processing method and device of feature graph executed by computer and electronic equipment |
Non-Patent Citations (1)
| Title |
|---|
| KUN ZHOU: "Memory-Scalable GPU Spatial Hierarchy Construction", IEEE, vol. 17, no. 4, 3 December 2010 (2010-12-03), pages 1 - 9 * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024093293A1 (en) * | 2022-11-02 | 2024-05-10 | 华为技术有限公司 | Method for carrying out stencil computations, and apparatus |
| CN119598085A (en) * | 2024-11-21 | 2025-03-11 | 广州壁仞集成电路有限公司 | Multi-dimensional tensor protocol calculation method, apparatus, device, storage medium, and product |
| CN119598085B (en) * | 2024-11-21 | 2025-09-26 | 广州壁仞集成电路有限公司 | Multi-dimensional tensor protocol calculation method, apparatus, device, storage medium, and product |
| CN120560604A (en) * | 2025-07-31 | 2025-08-29 | 上海人工智能创新中心 | Computing device for performing complex Einstein summation operation, method for performing complex Einstein summation operation on a computing device, computer-readable storage medium, and computer program product |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115203634B (en) | 2025-06-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Fishman et al. | The ITensor software library for tensor network calculations | |
| Gale et al. | Sparse gpu kernels for deep learning | |
| CN115203634B (en) | Template-based GPU high-performance tensor merging method | |
| Paszke et al. | Getting to the point: index sets and parallelism-preserving autodiff for pointful array programming | |
| Bondhugula | High performance code generation in mlir: An early case study with gemm | |
| US20230024035A1 (en) | Zero-copy sparse matrix factorization synthesis for heterogeneous compute systems | |
| CN111723935A (en) | Neural network computation graph processing method, computer storage medium and electronic device | |
| Alt et al. | Connections Between Numerical Algorithms for PDEs and Neural Networks: T. Alt et al. | |
| Li et al. | MPFFT: An auto-tuning FFT library for OpenCL GPUs | |
| Zayer et al. | A gpu‐adapted structure for unstructured grids | |
| Mudalige et al. | Large-scale performance of a DSL-based multi-block structured-mesh application for Direct Numerical Simulation | |
| JP5235666B2 (en) | Associative matrix method, system and computer program product using bit-plane representation of selected segments | |
| Eswar et al. | PLANC: Parallel low-rank approximation with nonnegativity constraints | |
| Tsai et al. | Performance-portable autotuning of opencl kernels for convolutional layers of deep neural networks | |
| Tukanov et al. | Modeling matrix engines for portability and performance | |
| Cowan et al. | Automating generation of low precision deep learning operators | |
| US5781779A (en) | Tools for efficient sparse matrix computation | |
| Zhou et al. | Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F_2 | |
| Lu et al. | Im2win: Memory efficient convolution on SIMD architectures | |
| Bećirović et al. | Performance comparison of medical image classification systems using TensorFlow Keras, PyTorch, and JAX | |
| Bassoy et al. | Fast higher-order functions for tensor calculus with tensors and subtensors | |
| Regnault et al. | SPC5: an efficient SpMV framework vectorized using ARM SVE and x86 AVX-512 | |
| De et al. | Hybrid Parallel Tucker Decomposition of Streaming Data | |
| Li et al. | Compiler-assisted operator template library for DNN accelerators | |
| Fernandes | Exploring lazy evaluation and compile-time simplifications for efficient geometric algebra computations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |