CN115203634A - Template-based GPU (graphics processing unit) high-performance tensor contraction method - Google Patents
- Publication number: CN115203634A
- Application number: CN202210343327.3A
- Authority: CN (China)
- Prior art keywords: tensor, dimension, scat, index, sequence
- Legal status: Granted (as listed by Google Patents; an assumption, not a legal conclusion)
Classifications
- G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F8/41 — Compilation
- G06F8/447 — Target code generation
Abstract
The invention discloses a template-based GPU (graphics processing unit) high-performance tensor contraction method. A user first defines a tensor contraction, whose indexes and dimensions are classified into four index sequences and four dimension sequences. These sequences are then dimensionally reduced to obtain memory-access functions and implicit dimensions; placeholders are defined, and computation templates are written following the implementation and optimization methods of BLAS libraries. At compile time a template is dispatched, the memory-access functions and implicit dimensions are substituted into the placeholders of the selected template, and CUDA C/C++ code is generated and compiled into a reusable executable program; finally, the data of each tensor and the concrete value of each dimension are input to complete the computation. The invention supports arbitrary tensor contraction with a finite set of templates; while remaining easy to use, its performance still approaches or even exceeds hand-optimized computation libraries such as cuBLAS and cuDNN.
Description
Technical Field
The invention belongs to the field of high-performance computing, and in particular relates to a template-based GPU high-performance tensor contraction method.
Background
The concept of a tensor is a generalization of vectors: scalars, vectors and matrices can be understood as tensors of order 0, 1 and 2, respectively. Tensors are widely used in machine learning, data analysis, scientific computing and other fields, providing a simple and clear abstraction for data and algorithms. However, tensor computation is very complex to implement in a system: while memory-access-intensive operators such as normalization and downsampling are easy to tune, computation-intensive operators such as matrix multiplication and convolution can hardly approach the peak performance of the hardware, and usually require detailed manual optimization by experts familiar with the target platform's architecture. Meanwhile, users often highly customize algorithms to their practical needs, and the computations involved often exceed the expressive range of common operators, causing the so-called "long tail" problem: the tensor's data layout and computation pattern are not supported by fixed-function computation libraries such as cuBLAS and cuDNN, and hand-writing a high-performance implementation imposes a huge burden of time and effort on the user.
Tensor computation generally follows some common patterns: tensors are often stored as multidimensional arrays, and the computation process is often a multiply nested loop, so a system implementer can design a Domain-Specific Language (DSL) to express and optimize such operations in a targeted manner.
For example, the machine learning software stack PlaidML (Intel. PlaidML. https://github.com/plaidml. Accessed: 2021-10-01.) contains an eDSL for users to customize long-tail operators, while Facebook designed Tensor Comprehensions (TensorComp; Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically. ACM Transactions on Architecture and Code Optimization, 2019.). The syntax and semantics of both are based on the Einstein summation convention and can support arbitrary tensor contraction, i.e. the generalization of matrix multiplication to tensors: some dimensions of the input tensors remain in the output tensor, while elements in the other dimensions are multiplied pairwise and accumulated. PlaidML eDSL and TensorComp are simple and easy to use and perform well on memory-access-intensive operators, but they do not yet take the actual architectural features of the back-end platform into account, and thus fall short in performance in computation-intensive scenarios.
Tensor computations in machine learning, data analysis, scientific computing and other fields can be roughly divided into two types according to the target platform's peak compute performance and maximum communication bandwidth. Operators such as downsampling, reduction and regularization generate little computation, and their performance bottleneck lies mainly in bandwidth, so their low-level implementation does not need particularly fine optimization: as long as it adapts to the target platform's storage structure, eliminates invalid memory accesses and maintains sufficient parallelism, it can maximize bandwidth utilization, and factors such as block size, parallel granularity and instruction scheduling have no obvious effect. In contrast, operators such as matrix multiplication and convolution are computation-intensive; a highly optimized implementation can approach the system's peak arithmetic performance, but considerable effort is required to reach this goal: the implementer must reasonably divide the data among multiple levels of parallel units, balance parallelism against data reuse, and carefully interleave different types of instructions to reduce pipeline stalls.
Tensor contraction is the generalization of matrix multiplication to arbitrary high-dimensional cases; it can be expressed by the Einstein summation convention, and with a slight extension it also covers convolution and its many variants. That is, all common computation-intensive operators fall within the expressive range of tensor contraction, which can therefore serve as the basic unit of tensor computation. In addition, tensor contraction can also support many memory-access-intensive operators: operations such as pooling and reduction are equivalent to special forms of tensor contraction, while more complex operators such as regularization and softmax can be expressed as combinations of multiple tensor contractions. Therefore, a high-performance implementation of arbitrary tensor contraction is taken as the target problem.
PlaidML eDSL and TensorComp introduced the polyhedral model in an attempt to solve this problem. They convert the user input at the front end into multiply nested loops, perform platform-independent generic loop transformations based on the polyhedral model at the middle end, and lower the intermediate representation to machine code for a specific back end. This approach works well for memory-access-intensive tasks, but it ignores the architectural characteristics of the target platform, so for computation-intensive tensor contractions its performance is mostly far from the theoretical peak. The problem is exacerbated on heterogeneous computing platforms; for example, on large-scale matrix multiplications TensorComp can perform as low as 20% of the native linear algebra library cuBLAS on NVIDIA GPUs. Yet, thanks to their excellent parallel processing capability, general-purpose GPUs are widely used in tensor computation and have become a de facto infrastructure in fields such as deep learning. Therefore, high-performance tensor contraction on the GPU is a very practical topic: solutions based on native computation libraries suffer from the long-tail problem and have a narrow application range, while the existing tensor-computation DSLs are flexible and easy to use but their performance remains to be improved.
Tensor contraction is well suited to acceleration on general-purpose GPUs because it can be recursively divided into many mutually independent sub-problems, and the control flow of the computation is static, independent of the input data. The difficulty of implementing tensor contraction on this many-core architecture is that the implementer must jointly consider the target platform's compute and memory characteristics, such as the number of parallel processing units and the latency, bandwidth and capacity of each cache level, and tile the tensor in each dimension with a proper size, so as to improve data reuse within a block as much as possible and avoid bandwidth bottlenecks; meanwhile, memory-access and compute instructions must be interleaved to hide communication latency, reduce stalls and approach peak compute performance. In addition, the SIMT (Single Instruction, Multiple Threads) model of general-purpose GPU computing (Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55, 2008.) introduces unique problems such as memory-transaction coalescing and bank conflicts, placing even finer requirements on the implementation of a tensor contraction algorithm.
In addition, a system implementer who wishes to support arbitrary tensor contraction faces a huge problem space, because the dimensions of a higher-order tensor may be arranged in an arbitrary order, yielding factorial complexity. In the BLAS numerical libraries of the various platforms, matrix multiplication generally handles only four possible cases, namely the 2 × 2 combinations resulting from each input matrix being column-major or row-major; tensor contraction faces far more than this. Take C_{a,b,d,e} = A_{a,b,c} × B_{c,d,e} as an example: this contraction of two third-order tensors over one dimension c admits 4! × 3! × 3! = 864 possible data layouts; for C_{a,d} = A_{a,b,c} × B_{b,c,d}, with the two reduction dimensions b and c, there are still 2! × 3! × 3! = 72 cases. Providing a separate implementation for each case is obviously infeasible; this calls for a system with good applicability and extensibility that can quickly handle a whole class of similar computations.
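The layout counts above are simply products of the factorials of the tensor orders; a small script (illustrative only, not part of the patent) reproduces both figures:

```python
from math import factorial

def layout_count(orders):
    """Number of distinct dimension arrangements across all tensors,
    i.e. the product of the factorials of each tensor's order."""
    p = 1
    for n in orders:
        p *= factorial(n)
    return p

# C_{a,b,d,e} = A_{a,b,c} x B_{c,d,e}: C has order 4, A and B order 3
print(layout_count([4, 3, 3]))  # 864
# C_{a,d} = A_{a,b,c} x B_{b,c,d}: C has order 2, A and B order 3
print(layout_count([2, 3, 3]))  # 72
```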
Disclosure of Invention
The object of the invention is to provide a template-based GPU high-performance tensor contraction technique that addresses the deficiencies of the prior art. The invention can compute arbitrary tensor contractions on a general-purpose GPU, performing particularly well.
The purpose of the invention is realized by the following technical solution: a template-based GPU high-performance tensor contraction method, comprising the following steps:
(1) A user inputs the definition of a tensor contraction, whose indexes and dimensions are classified to obtain four index sequences p, y, x, r and four dimension sequences P, Y, X, R;
(2) Reducing the dimensionality of the index sequences p, y, x, r and the dimension sequences P, Y, X, R to obtain the memory-access functions and the implicit dimensions;
(3) Defining placeholders to represent the content related to the memory-access functions and implicit dimensions in the implicit batch matrix multiplication;
(4) Writing computation templates following the implementation and optimization methods of BLAS libraries, in which the placeholders of step (3) are reserved;
(5) Performing compile-time template dispatch, substituting the memory-access functions and implicit dimensions into the placeholders of the selected computation template to generate CUDA C/C++ code;
(6) Compiling the code generated in step (5) into a reusable executable program;
(7) Inputting data of each tensor and specific values of each dimension, and completing calculation by using the executable program compiled in the step (6).
Further, the step (1) comprises the following substeps:
(1.1) Notation convention: a tensor whose dimensions from high to low are D = (D_n, …, D_1) is denoted T_D; T[i] = T[i_n, …, i_1] denotes the scalar selected by the index sequence i = (i_n, …, i_1); the function ext(i_k) = D_k, k = 1, …, n maps a single index to its corresponding dimension, and Ext(i) = (ext(i_n), …, ext(i_1)) = (D_n, …, D_1) = D maps an index sequence to its corresponding dimension sequence. Any tensor contraction C = A × B input by the user is represented by the Einstein summation convention:
C[ic]=A[ia]×B[ib] (1)
where ic, ia, ib are the index sequences of the tensors C, A, B, respectively.
(1.2) Defining the batch index sequence: define p = ic ∩ ia ∩ ib, the sequence of all batch indexes; each of its elements indexes all three tensors. The order in which the elements of p appear in ic, ia, ib may differ; any one order may be adopted here, and the same holds for the following sequences;
(1.3) Defining the free index sequences y, x of the input tensors A, B, respectively: define y = (ic ∩ ia) \ p, the sequence of all free indexes of A, which index only C and A; define x = (ic ∩ ib) \ p, the sequence of all free indexes of B, which index only C and B.
(1.4) Defining the reduction index sequence: define r = (ia ∩ ib) \ p as the reduction index sequence; over Ext(r), the elements of the A and B tensors are multiplied pairwise and reduced to scalars.
(1.5) Defining the permutation functions: transform equation (1) equivalently, substituting p, y, x, r for ic, ia, ib and introducing three permutation functions shfl_C(·), shfl_A(·) and shfl_B(·), so that the original formula is rewritten as:
C[shfl_C(p,y,x)] = A[shfl_A(p,y,r)] × B[shfl_B(p,r,x)] (2)
(1.6) Defining the dimension sequence corresponding to each index sequence: define the batch dimension sequence P = Ext(p), whose elements are the dimensions corresponding to the indexes of the batch index sequence; define the free dimension sequence of A as Y = Ext(y), the free dimension sequence of B as X = Ext(x), and the reduction dimension sequence as R = Ext(r).
Further, the step (2) comprises the following substeps:
(2.1) Defining the dimensionality-reduction auxiliary functions: first define the mapping from an index sequence to a scalar index, i = gath_d(i), where d is an arbitrary dimension sequence that must be as long as the index sequence i. The function maps the offsets i over each dimension of d to a linear address i:
i = gath_d(i) = Σ_{k=1..n} i_k · Π_{j<k} d_j (5)
Define the inverse function i = scat_d(i) = gath_d^{-1}(i), which decomposes the linear address i into offsets over each dimension of d:
scat_d(i) = (i_n, …, i_1), where i_k = ⌊i / Π_{j<k} d_j⌋ mod d_k (6)
By means of these two auxiliary functions, the higher-order tensors in equation (2) are implicitly converted into matrices using a specific storage format.
(2.2) Reducing the dimensionality of the indexes, so that elements of the tensors are accessed with scalar indexes
The index sequences p, y, x, r are each converted into the components of a scalar index over the corresponding dimension sequence. Define the scalar index p = gath_P(p), with value range [0, ΠP); the index sequence is recovered as p = gath_P^{-1}(p) = scat_P(p). Denote the other index sequences as y = scat_Y(y), x = scat_X(x), r = scat_R(r), where each newly defined scalar index ranges over y ∈ [0, ΠY), x ∈ [0, ΠX), r ∈ [0, ΠR). Equation (2) is rewritten as:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (7)
(2.3) Reducing the dimensionality of the dimensions, expressing a tensor of any order and any dimension arrangement as a special matrix
The actual computation performed by equation (7) is:
for p ∈ [0, ΠP), y ∈ [0, ΠY), x ∈ [0, ΠX):
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = Σ_{r=0}^{ΠR-1} A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (8)
The left-hand side of equation (8) is denoted C(p,y,x) = C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))], and the accesses to the two input tensors on the right-hand side are denoted A(p,y,r) = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))], B(p,r,x) = B[shfl_B(scat_P(p), scat_R(r), scat_X(x))]; equation (8) is rewritten as:
for p ∈ [0, P), y ∈ [0, Y), x ∈ [0, X):
C(p,y,x) = Σ_{r=0}^{R-1} A(p,y,r) × B(p,r,x) (9)
which involves only 4 scalar dimensions, namely P = ΠP, Y = ΠY, X = ΠX, R = ΠR.
(2.4) Implicitly converting the tensor contraction into a matrix multiplication: C(p,y,x), A(p,y,r), B(p,r,x) are called the memory-access functions of the output and input tensors; they abstract the tensors into batch matrices with a specific data layout. The four scalars P, Y, X, R are called the implicit dimensions, being the products of the batch dimensions, the free dimensions of A, the free dimensions of B, and the reduction dimensions, respectively. Equation (9) hides the reading and writing of data behind the memory-access functions and expresses the ranges of traversal and iteration with the implicit dimensions, thereby converting the tensor contraction into a batch matrix multiplication. Using the Einstein summation convention, equation (9) can be expressed as:
C(p,y,x)=A(p,y,r)×B(p,r,x) (10)
further, the step (3) comprises the following substeps:
(3.1) Defining the implicit-dimension placeholders: equation (9) contains four implicit dimensions P, Y, X, R, which represent the scale of the implicit batch matrices and are the products of the corresponding dimension sequences of the original tensors. Accordingly, four implicit-dimension placeholders ph_P, ph_Y, ph_X, ph_R are defined, corresponding to the above implicit dimensions.
(3.2) Defining the memory-access-function placeholders: in equation (10), the reading and writing of the implicit matrices are abstracted by the three memory-access functions A(p,y,r), B(p,r,x) and C(p,y,x); three memory-access-function placeholders ph_A(p,y,k), ph_B(p,k,x) and ph_C(p,y,x) are defined to represent the elements at the corresponding positions.
Further, step (4) specifically comprises: write the computation-template code of the linear algebra operation following the implementation and optimization methods of BLAS libraries; wherever a matrix dimension is referenced, write it as one of the four implicit-dimension placeholders ph_P, ph_Y, ph_X, ph_R; wherever any matrix element is accessed, write it as one of the three memory-access-function placeholders ph_A(p,y,k), ph_B(p,k,x), ph_C(p,y,x).
Further, step (5) specifically comprises: parse the symbolic expression (9) of the tensor contraction to obtain the contents of the four dimension sequences P, Y, X, R, and select a specific linear algebra computation template at compile time according to whether each sequence is empty or not.
After the template is selected, the placeholders' contents are substituted into the computation template to generate legal CUDA C/C++ code.
The beneficial effects of the invention are:
(1) The invention designs a universal tensor computation framework that converts an arbitrary tensor contraction into a basic linear algebra operation represented by GEMM, through concepts such as memory-access functions and implicit dimensions. Through dispatch in the two stages of compile time and run time, each input tensor contraction is matched to a specific computation template, so that a finite set of templates supports arbitrary tensor contraction;
(2) The invention designs an Einstein-summation-convention-based DSL as the front-end representation of the universal framework and implements it on NVIDIA GPUs. While remaining easy to use, its performance still approaches or even exceeds hand-optimized computation libraries such as cuBLAS and cuDNN.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention designs a tensor contraction method based on computation templates, which implicitly transforms tensors of any order and any dimension arrangement into matrices without occupying extra space, thereby reducing arbitrary tensor contractions to a finite set of linear algebra operations, implemented through the computation templates.
The computation templates use a syntax consistent with the target platform's native code, namely CUDA C/C++, in which a set of placeholders to be completed is reserved; their concrete contents are filled in according to the tensor contraction definition, executable code is generated by compilation, and finally the user's tensor data is received to obtain the tensor contraction result.
The performance provided by the computation templates rivals that of highly optimized numerical libraries on the target platform, such as cuBLAS and cuDNN on NVIDIA GPUs, while retaining a certain code-generation capability so that similar computations can be supported.
The method comprises the following steps:
1. The user inputs the definition of a tensor contraction, whose indexes and dimensions are classified to obtain the four index sequences p, y, x, r and the four dimension sequences P, Y, X, R.
1.1 Notation convention
First agree on the notation used in the invention. A tensor whose dimensions from high to low are D = (D_n, …, D_1) is denoted T_D. An offset in each of the n dimensions uniquely determines an element of T, and T[i] = T[i_n, …, i_1] denotes the scalar selected by the index sequence i = (i_n, …, i_1). The function ext(i_k) = D_k, k = 1, …, n maps a single index to its corresponding dimension, and Ext(i) = (ext(i_n), …, ext(i_1)) = (D_n, …, D_1) = D maps an index sequence to its corresponding dimension sequence.
Any tensor contraction C = A × B input by the user is represented by the Einstein summation convention:
C[ic]=A[ia]×B[ib] (1)
where ic, ia, ib are the index sequences of the tensors C, A, B, respectively.
1.2 defining batch index sequences
Define p = ic ∩ ia ∩ ib, the sequence of all batch indexes; each of its elements indexes all three tensors. The order in which the elements of p appear in ic, ia, ib may be non-uniform; any one order may be used here, and likewise for the following sequences.
1.3 defining the free index sequences y, x of the input tensors A, B, respectively
Define y = (ic ∩ ia) \ p, the sequence of all free indexes of A, which index only C and A.
Similarly, define x = (ic ∩ ib) \ p, the sequence of all free indexes of B, which index only C and B.
1.4 defining a reduced index sequence
Define r = (ia ∩ ib) \ p as the reduction index sequence; over Ext(r), the elements of the A and B tensors are multiplied pairwise and reduced to scalars.
1.5 defining a permutation function
Transform equation (1) equivalently: replace ic, ia, ib with p, y, x, r and introduce three permutation functions shfl_C(·), shfl_A(·) and shfl_B(·), rewriting the original formula as
C[shfl_C(p,y,x)] = A[shfl_A(p,y,r)] × B[shfl_B(p,r,x)] (2)
The permutation function shfl_C(·) is introduced, with ic = shfl_C(p,y,x), for the following reason: all the indexes in p, y, x may appear interleaved in any order as the offsets in the dimensions of the tensor C; shfl_C abstracts over all possible permutations of the indexes on C and accordingly represents the various possible dimension arrangements of C. Similarly, the two input tensors introduce the corresponding permutation functions shfl_A(·) and shfl_B(·), representing the various possible dimension arrangements of the tensors A and B.
1.6 defining the dimension sequences corresponding to the respective index sequences
Define the batch dimension sequence P = Ext(p), whose elements are the dimensions corresponding to the indexes of the batch index sequence. Similarly, define the free dimension sequence of A as Y = Ext(y), the free dimension sequence of B as X = Ext(x), and the reduction dimension sequence as R = Ext(r).
Example 1: batch matrix multiplication C_{T,M,N} = A_{T,M,K} × B_{T,K,N} can be expressed using the Einstein summation convention as
C[t,m,n]=A[t,m,k]×B[t,k,n]. (3)
t appears in all tensors and is the batch index; T = ext(t) is the corresponding batch dimension. m appears in the output tensor C and the input tensor A, and is a free index of A; M = ext(m) is called a free dimension of A. Similarly, n is a free index of the input tensor B, and N = ext(n) is a free dimension of B. k, which appears only in the two input tensors, is the reduction index, and K = ext(k) is the reduction dimension: over it, the matrix multiplication multiplies the elements of the input tensors pairwise and accumulates (reduces) them into a scalar.
Therefore, in the batch matrix multiplication of equation (3), the index sequences are p = (t), y = (m), x = (n), r = (k), and the dimension sequences are P = (T), Y = (M), X = (N), R = (K).
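The classification of steps 1.2-1.4 can be sketched in a few lines of Python (an illustration, not the patent's implementation); it reproduces the sequences of Example 1:

```python
def classify(ic, ia, ib):
    """Classify the indexes of C[ic] = A[ia] x B[ib] into the batch (p),
    free-of-A (y), free-of-B (x) and reduction (r) sequences."""
    p = tuple(i for i in ic if i in ia and i in ib)      # ic ∩ ia ∩ ib
    y = tuple(i for i in ia if i in ic and i not in p)   # (ic ∩ ia) \ p
    x = tuple(i for i in ib if i in ic and i not in p)   # (ic ∩ ib) \ p
    r = tuple(i for i in ia if i in ib and i not in p)   # (ia ∩ ib) \ p
    return p, y, x, r

# Example 1: batch matrix multiplication C[t,m,n] = A[t,m,k] x B[t,k,n]
print(classify(('t', 'm', 'n'), ('t', 'm', 'k'), ('t', 'k', 'n')))
# -> (('t',), ('m',), ('n',), ('k',))
```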
Example 2: equation (4) contracts the fourth-order tensors A and B over the dimensions of the reduction indexes a and b; c and d are free indexes of A, e and f are free indexes of B, and there is no batch index.
C[d,f,c,e]=A[a,b,c,d]×B[a,e,b,f] (4)
Therefore, in the tensor contraction of equation (4), the index sequences are p = (), y = (c,d), x = (e,f), r = (a,b), and the permutation functions satisfy shfl_C((), (c,d), (e,f)) = (d,f,c,e), shfl_A((), (c,d), (a,b)) = (a,b,c,d), shfl_B((), (a,b), (e,f)) = (a,e,b,f).
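A permutation function of step 1.5 can likewise be sketched: given a tensor's storage layout and the index names of each group, it rearranges the concrete offsets of those groups into storage order. The sketch below is illustrative (`make_shfl` is not a name from the patent); it reproduces shfl_C of Example 2:

```python
def make_shfl(layout, group_names):
    """Build the permutation function for a tensor whose dimension order
    is `layout`: the returned function takes one offset tuple per index
    group and permutes the offsets into storage order."""
    def shfl(*groups):
        env = {}
        for names, values in zip(group_names, groups):
            env.update(zip(names, values))
        return tuple(env[name] for name in layout)
    return shfl

# Example 2: output tensor C[d,f,c,e] with p = (), y = (c,d), x = (e,f)
shfl_C = make_shfl(('d', 'f', 'c', 'e'), [(), ('c', 'd'), ('e', 'f')])

# Symbolic application reproduces the text: shfl_C((), (c,d), (e,f)) = (d,f,c,e)
print(shfl_C((), ('c', 'd'), ('e', 'f')))   # ('d', 'f', 'c', 'e')
# Concrete offsets c=2, d=3, e=4, f=5 are permuted the same way
print(shfl_C((), (2, 3), (4, 5)))           # (3, 5, 2, 4)
```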
2. Reduce the dimensionality of the index sequences p, y, x, r and the dimension sequences P, Y, X, R to obtain the memory-access functions and the implicit dimensions
The tensors are expressed as matrices with a special storage format, so that tensor contraction reuses the same computation process as matrix multiplication (and its various degenerate cases).
2.1 defining a dimensionality reduction auxiliary function
The invention establishes a mapping between the dimensions of the tensor and those of the matrix, thereby converting the multiple linear indexes of a higher-order tensor into scalar indexes over the matrix batch, rows and columns.
An index sequence in a tensor contraction corresponds to a scalar index in matrix multiplication. Thus, first define the mapping from an index sequence to a scalar index, i = gath_d(i), where d can be any dimension sequence but must be as long as the index sequence i. The function maps the offsets i over each dimension of d to a linear address i:
i = gath_d(i) = Σ_{k=1..n} i_k · Π_{j<k} d_j (5)
Accordingly, define the inverse function i = scat_d(i) = gath_d^{-1}(i), which decomposes the linear address i into offsets over each dimension of d:
scat_d(i) = (i_n, …, i_1), where i_k = ⌊i / Π_{j<k} d_j⌋ mod d_k (6)
With the help of these two auxiliary functions, the higher-order tensors in equation (2) can be implicitly converted into matrices with a specific storage format.
2.2 dimensionality reduction of the index to access elements in the tensor with scalar indices
The previously defined index sequences p, y, x, r are each converted into the components of a scalar index over the corresponding dimension sequence. Define the scalar index p = gath_P(p); clearly p has value range [0, ΠP), so the index sequence can be expressed inversely as p = gath_P^{-1}(p) = scat_P(p). Similarly, denote the other index sequences as y = scat_Y(y), x = scat_X(x), r = scat_R(r), where each newly defined scalar index ranges over y ∈ [0, ΠY), x ∈ [0, ΠX), r ∈ [0, ΠR).
Replacing the original index sequences with the scalar indexes p, y, x, r, equation (2) is rewritten as:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (7)
at this time, each tensor can uniquely determine any element thereof by three scalar indexes.
2.3 Reduce the dimensionality of the dimensions, expressing a tensor of any order and any dimension arrangement as a special matrix
The actual computation performed by equation (7) is:
for p ∈ [0, ΠP), y ∈ [0, ΠY), x ∈ [0, ΠX):
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = Σ_{r=0}^{ΠR-1} A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (8)
The above equation uses index mapping functions such as scat_P(·) and permutation functions such as shfl_C(·); they contain the specific information of the tensor contraction, including the data layout of each tensor and the computation occurring between them. Once a particular tensor contraction is given, these functions are fixed, and the four scalar indexes p, y, x, r can be regarded as variables that traverse the output and input tensors to perform the computation. Therefore, the left-hand side of equation (8) is denoted C(p,y,x) = C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))], and the accesses to the two input tensors on the right-hand side are denoted A(p,y,r) = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))], B(p,r,x) = B[shfl_B(scat_P(p), scat_R(r), scat_X(x))]; equation (8) is rewritten as:
for p ∈ [0, P), y ∈ [0, Y), x ∈ [0, X):
C(p,y,x) = Σ_{r=0}^{R-1} A(p,y,r) × B(p,r,x) (9)
which involves only 4 scalar dimensions, namely P = ΠP, Y = ΠY, X = ΠX, R = ΠR.
2.4 Implicitly convert the tensor contraction into a matrix multiplication
Comparing the tensor contraction formula (9) with the batch matrix multiplication (3) shows that their forms are completely consistent. C(p,y,x), A(p,y,r), B(p,r,x) are called the memory-access functions of the output and input tensors; they abstract the tensors into batch matrices with a specific data layout. The four scalars P, Y, X, R are called the implicit dimensions, being the products of the batch dimensions, the free dimensions of A, the free dimensions of B, and the reduction dimensions, respectively. Equation (9) hides the reading and writing of data behind the memory-access functions and expresses the ranges of traversal and iteration with the implicit dimensions, thereby converting the tensor contraction into a batch matrix multiplication. Using the Einstein summation convention, equation (9) can be expressed as:
C(p,y,x)=A(p,y,r)×B(p,r,x) (10)
so far, arbitrary tensor compression has been implicitly converted to batch matrix multiplication.
3 defining placeholders to represent content related to memory access function and implicit dimension in implicit batch matrix multiplication
3.1 The invention completes the tensor contraction computation through the code of a batch matrix multiplication. To achieve this, placeholders are reserved in the batch matrix multiplication, and their concrete contents are filled in according to the definition of the tensor contraction.
3.2 defining implicit dimension placeholders
Equation (9) contains four implicit dimensions P, Y, X, R, which describe the scale of the implicit batch matrix and are the products of the corresponding dimension sequences of the original tensors. Therefore, four corresponding placeholders ph_P, ph_Y, ph_X, ph_R are introduced, corresponding respectively to the implicit dimensions above.
Example 1: Take the tensor contraction C_{M,N} = A_{K1,M,K2} × B_{K1,N,K2} as an example. It is very close to a matrix multiplication, but the reduction occurs over the two dimensions K1 and K2, so the implicit reduction dimension is ph_R = K1·K2.
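An implicit dimension is simply the product over its dimension sequence, as Example 1 shows for ph_R. A minimal sketch (an illustrative helper, not the patent's code):

```cpp
#include <cstdint>
#include <vector>

// Sketch: an implicit dimension is the product of its dimension sequence,
// e.g. ph_R = K1 * K2 for C_{M,N} = A_{K1,M,K2} x B_{K1,N,K2}.
// An empty sequence yields 1, which is the degenerate case used for dispatch.
int64_t implicit_dim(const std::vector<int64_t>& dims) {
    int64_t prod = 1;
    for (int64_t d : dims) prod *= d;
    return prod;
}
```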
3.3 Defining memory access function placeholders
In a batch matrix multiplication, three scalar indices uniquely identify any element of the batch matrix. In equation (10), the reads and writes of the implicit matrices are abstracted by the three access functions A(p, y, r), B(p, r, x), and C(p, y, x). Correspondingly, three placeholders ph_A(p, y, k), ph_B(p, k, x), and ph_C(p, y, x) are introduced to represent the elements at the corresponding positions.
Example 2: The dimensions of a tensor are symbolic names known at compile time, but their specific values are variables determined only at run time, so the logical higher-order tensor must be flattened into a one-dimensional array in the implementation. For example, the output tensor C_{T,M,N} of a single-precision batch matrix multiplication may be implemented as "float *C;", with its shape given by the dimension variables "int T, M, N;". The placeholder ph_C(p, y, x) of the output tensor is then instantiated to "C[p*M*N + y*N + x]".
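The instantiated access function of Example 2 is just a row-major linearization, which can be sketched as plain C++ (the function name is illustrative):

```cpp
#include <cstdint>

// Sketch of the instantiated placeholder ph_C(p, y, x) for a row-major
// batch matrix C_{T,M,N} flattened to a 1-D array: the linear offset is
// p*M*N + y*N + x (names follow Example 2; T itself does not appear in
// the offset, it only bounds p).
int64_t offset_c(int64_t p, int64_t y, int64_t x, int64_t M, int64_t N) {
    return p * M * N + y * N + x;
}
```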
Example 3: The implicit dimensions and the memory access functions together express the mutual conversion between a tensor and the implicit batch matrices. Table 1 further illustrates the meaning of the commonly used placeholders and gives application examples of the two kinds of placeholders in the two-dimensional depthwise convolution O_{N,K,C,Ho,Wo} = I_{N,C,Hi,Wi} × W_{K,C,R,S} with N feature maps, K groups of convolution kernels, and C channels. Here ⌊a/b⌋ denotes truncating integer division and a % b denotes the modulo operation a mod b.
Table 1: placeholder descriptions and examples
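The truncating division and modulo operations referenced in Table 1 implement the scat_d auxiliary function, which decomposes a linear address into per-dimension offsets. A minimal C++ sketch (our own helper, with the highest dimension first and the lowest dimension varying fastest):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of scat_d: decompose a linear address into offsets in each
// dimension of d, using truncating integer division (a / b) and the
// modulo operation (a % b), as in Table 1.
template <std::size_t N>
std::array<int64_t, N> scat(int64_t lin, const std::array<int64_t, N>& d) {
    std::array<int64_t, N> idx{};
    for (std::size_t k = N; k-- > 0;) {  // peel off the fastest dimension first
        idx[k] = lin % d[k];
        lin /= d[k];
    }
    return idx;
}
```

For d = (2, 3, 4), the maximal linear address 23 decomposes into the maximal offsets (1, 2, 3), and composing back with the row-major formula of Example 2 recovers 23.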
4 Writing a computation template according to BLAS library implementation and optimization methods, with the placeholders of step (3) reserved
4.1 In the solution provided by the invention, the computation template is still written in the native language of the target platform, i.e., CUDA C/C++. The computation template is legal CUDA code at the syntactic and semantic level, with the single exception that a set of special symbols, called placeholders (often abbreviated ph in pseudocode), is defined. The placeholders abstract the batch matrices obtained from the implicit tensor conversion and carry all the relevant information: memory access functions, implicit dimensions, data types, and tensor and dimension names. The computation template reads tensor data into caches through the information contained in the placeholders and then performs the subsequent high-performance, layout-independent batch matrix multiplication. Through mechanisms such as macros, inline functions, and template parameters, the data representation of the tensors is embedded at the placeholder positions, so that legal and readable CUDA code is generated.
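How placeholder substitution through macros can look is sketched below in plain C++, with a naive loop nest standing in for the optimized CUDA template. All ph_* definitions here are illustrative assumptions for a row-major batch matrix multiplication, not the patent's actual code; the point is that the template body names only placeholders, never concrete tensors or layouts:

```cpp
// Illustrative placeholder instantiation for C_{T,M,N} = A_{T,M,K} x B_{T,K,N}
// (row-major, flattened to 1-D arrays). A real instantiation would be
// generated from the contraction's symbolic representation.
#define ph_P T
#define ph_Y M
#define ph_X N
#define ph_R K
#define ph_A(p, y, k) A[((p) * ph_Y + (y)) * ph_R + (k)]
#define ph_B(p, k, x) B[((p) * ph_R + (k)) * ph_X + (x)]
#define ph_C(p, y, x) C[((p) * ph_Y + (y)) * ph_X + (x)]

// A naive "computation template": the loop nest references only the
// placeholders for dimensions and element accesses.
void bmm_template(const float* A, const float* B, float* C,
                  int T, int M, int N, int K) {
    for (int p = 0; p < ph_P; ++p)
        for (int y = 0; y < ph_Y; ++y)
            for (int x = 0; x < ph_X; ++x) {
                float acc = 0.f;
                for (int k = 0; k < ph_R; ++k)
                    acc += ph_A(p, y, k) * ph_B(p, k, x);
                ph_C(p, y, x) = acc;
            }
}

// Tiny check helper: multiply a fixed 1x2x2 batch by the 2x2 identity and
// return one element of the result (so the result equals the input A).
float demo_element(int y, int x) {
    const float A[4] = {1, 2, 3, 4};
    const float B[4] = {1, 0, 0, 1};
    float C[4];
    bmm_template(A, B, C, 1, 2, 2, 2);
    return C[y * 2 + x];
}
```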
4.2 Writing the matrix multiplication computation template using existing BLAS library implementation and optimization methods.
Matrix multiplication (GEMM) has three possible degenerate cases: inner product (DOT), outer product (GER), and matrix-vector multiplication (GEMV). Computation templates are written separately for these four cases.
Computation template code for the linear algebra operation is written according to existing BLAS library implementation and optimization methods. However, every time a matrix dimension is referenced, it is written as one of the four implicit dimension placeholders ph_P, ph_Y, ph_X, ph_R; every time a matrix element is accessed, it is written as one of the three memory access function placeholders ph_A(p, y, k), ph_B(p, k, x), ph_C(p, y, x).
5 Performing template dispatch at compile time according to the symbolic representation of the tensor contraction to generate CUDA C/C++ code
The previous step has implicitly represented the tensor contraction as a batch matrix multiplication. The four dimension sequences P, Y, X, R of the tensor contraction logically correspond to the four dimensions of the batch matrix multiplication. Therefore, when one of the sequences P, Y, X, R is empty, the corresponding dimension of the batch matrix multiplication equals 1, and the operation degenerates to a simpler linear algebra operation. Specific computation templates are designed for the different degenerate cases to maximize computational efficiency.
The analysis of the symbolic representation (9) of the tensor contraction has already produced the contents of the four dimension sequences P, Y, X, R; a specific linear algebra computation template is selected at compile time according to whether each of them is empty:
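One plausible compile-time dispatch rule is sketched below in plain C++. The mapping of empty sequences to BLAS kernels follows the degenerate cases named in step 4.2, but the concrete selection order here is our assumption, not the patent's exact rule:

```cpp
#include <string>

// Sketch of compile-period template dispatch: when a dimension sequence is
// empty, its implicit dimension is 1 and the batch matrix multiplication
// degenerates to a simpler BLAS kernel. Kernel names are illustrative.
std::string dispatch(bool P_empty, bool Y_empty, bool X_empty, bool R_empty) {
    if (R_empty)            return "GER";   // no reduction: outer product
    if (Y_empty && X_empty) return "DOT";   // both free dims gone: inner product
    if (Y_empty || X_empty) return "GEMV";  // one free dim gone: matrix-vector
    (void)P_empty;  // an empty P only collapses the batch count to 1
    return "GEMM";
}
```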
and after selecting a proper template, substituting the placeholder into the calculation template to generate a legal CUDA C/C + + code.
6 For the code generated by instantiating the computation template, a dynamic link library is generated through JIT (Just-In-Time) compilation and loaded into the address space of the user program. In this way, a reusable executable program is obtained.
7 The user inputs the data of each tensor and the specific value of each dimension, and the executable program compiled in the previous step completes the computation.
Claims (6)
1. A template-based GPU high-performance tensor contraction method, characterized by comprising the following steps:
(1) A user inputs the definition of a tensor contraction, and its indices and dimensions are classified to obtain four index sequences p, y, x, r and four dimension sequences P, Y, X, R;
(2) Dimensionality reduction is performed on the index sequences p, y, x, r and the dimension sequences P, Y, X, R to obtain the memory access functions and the implicit dimensions;
(3) Placeholders are defined to represent the content related to the memory access functions and the implicit dimensions in the implicit batch matrix multiplication;
(4) A computation template is written according to BLAS library implementation and optimization methods, with the placeholders of step (3) reserved;
(5) Template dispatch is performed at compile time, and the memory access functions and the implicit dimensions are substituted into the placeholders of the selected computation template to generate CUDA C/C++ code;
(6) The code generated in step (5) is compiled into a reusable executable program;
(7) The data of each tensor and the specific values of each dimension are input, and the computation is completed with the executable program compiled in step (6).
2. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (1) comprises the following substeps:
(1.1) Notation conventions: a tensor whose dimensions, from high to low, are D = (D_n, …, D_1) is denoted T_D; T[i] = T[i_n, …, i_1] denotes the scalar selected by the index sequence i = (i_n, …, i_1); the function ext(i_k) = D_k, k = 1, …, n maps a single index to its corresponding dimension, and Ext(i) = (ext(i_n), …, ext(i_1)) = (D_n, …, D_1) = D maps an index sequence to its corresponding dimension sequence; any user-input tensor contraction C = A × B is represented with the Einstein summation convention:
C[ic] = A[ia] × B[ib] (1)
wherein ic, ia, ib are the index sequences of the tensors C, A, B, respectively;
(1.2) Defining the batch index sequence: p = ic ∩ ia ∩ ib is defined as the sequence of all batch indices, each of whose elements indexes all three tensors; the order of the elements of p as they appear in ic, ia, ib may differ, and any fixed order may be adopted here, as for the following sequences;
(1.3) Defining the free index sequences y and x of the input tensors A and B: y = (ic ∩ ia) \ p is the sequence of all free indices of A, which index only C and A; x = (ic ∩ ib) \ p is the sequence of all free indices of B, which index only C and B;
(1.4) Defining the reduction index sequence: r = (ia ∩ ib) \ p is the reduction index sequence; over Ext(r) the elements of the tensors A and B are multiplied pairwise and reduced to scalars;
(1.5) Defining the permutation functions: through an equivalent transformation of formula (1), p, y, x, r replace ic, ia, ib, and three permutation functions shfl_C(·), shfl_A(·), shfl_B(·) are introduced, so that the original formula is rewritten as:
C[shfl_C(p, y, x)] = A[shfl_A(p, y, r)] × B[shfl_B(p, r, x)] (2)
(1.6) Defining the dimension sequence corresponding to each index sequence: the batch dimension sequence is P = Ext(p), each of whose elements is the dimension corresponding to an index of the batch index sequence; the free dimension sequence of A is Y = Ext(y), the free dimension sequence of B is X = Ext(x), and the reduction dimension sequence is R = Ext(r).
3. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (2) comprises the following substeps:
(2.1) Defining the dimensionality reduction auxiliary functions: first, the mapping i = gath_d(i) from an index sequence i = (i_n, …, i_1) to a scalar index i is defined, where d is an arbitrary dimension sequence of the same length as the index sequence i; the function maps the offsets of i in each dimension of d to a linear address i:
i = gath_d(i) = Σ_{k=1}^{n} i_k · Π_{j=1}^{k-1} d_j
the inverse function i = scat_d(i) = gath_d^{-1}(i) of the above function is defined; it decomposes the linear address i into the offsets in each dimension of d:
i_k = ⌊i / Π_{j=1}^{k-1} d_j⌋ % d_k, k = 1, …, n
by means of these two auxiliary functions, the higher-order tensors in formula (2) are implicitly converted into matrices with specific storage formats;
(2.2) Performing dimensionality reduction on the indices so that elements of the tensors are accessed with scalar indices:
the index sequences p, y, x, r are converted into the components of scalar indices in the corresponding dimensions; p = gath_P(p) is defined, where the scalar index p has range [0, ΠP), and conversely the index sequence is expressed as p = gath_P^{-1}(p) = scat_P(p); the other index sequences are denoted y = scat_Y(y), x = scat_X(x), r = scat_R(r), where the newly defined scalar indices have ranges y ∈ [0, ΠY), x ∈ [0, ΠX), r ∈ [0, ΠR); formula (2) is rewritten as:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))]
= A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] × B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (7)
(2.3) Performing dimensionality reduction on the dimensions, expressing a tensor of arbitrary order and arbitrary dimension arrangement as a special matrix:
the actual computation performed in formula (7) is:
C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))] = Σ_{r=0}^{ΠR-1} A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] · B[shfl_B(scat_P(p), scat_R(r), scat_X(x))] (8)
the left-hand side of formula (8) is denoted C(p, y, x) = C[shfl_C(scat_P(p), scat_Y(y), scat_X(x))], and the accesses to the two input tensors on the right-hand side are denoted A(p, y, r) = A[shfl_A(scat_P(p), scat_Y(y), scat_R(r))] and B(p, r, x) = B[shfl_B(scat_P(p), scat_R(r), scat_X(x))]; formula (8) is rewritten as:
C(p, y, x) = Σ_{r=0}^{R-1} A(p, y, r)·B(p, r, x) (9)
wherein only four scalar dimensions are included, i.e., P = ΠP, Y = ΠY, X = ΠX, R = ΠR;
(2.4) Implicitly converting the tensor contraction into a matrix multiplication: C(p, y, x), A(p, y, r), B(p, r, x) are called the memory access functions of the output and input tensors, which abstract the tensors into batch matrices with specific data layouts; the four scalars P, Y, X, R are called implicit dimensions and are the products of the batch dimension sequence, the free dimension sequence of A, the free dimension sequence of B, and the reduction dimension sequence, respectively; formula (9) hides the data read-write process behind the memory access functions and expresses the ranges of traversal and iteration with the implicit dimensions, thereby converting the tensor contraction into a batch matrix multiplication; with the Einstein summation convention, formula (9) can be expressed as:
C(p,y,x)=A(p,y,r)×B(p,r,x) (10)。
4. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (3) comprises the following substeps:
(3.1) Defining implicit dimension placeholders: formula (9) contains four implicit dimensions P, Y, X, R, which describe the scale of the implicit batch matrix and are the products of the corresponding dimension sequences of the original tensors; therefore, four corresponding implicit dimension placeholders ph_P, ph_Y, ph_X, ph_R are defined, corresponding respectively to the implicit dimensions above;
(3.2) Defining memory access function placeholders: in formula (10), the reads and writes of the implicit matrices are abstracted by the three access functions A(p, y, r), B(p, r, x), C(p, y, x); three memory access function placeholders ph_A(p, y, k), ph_B(p, k, x), ph_C(p, y, x) are defined to represent the elements at the corresponding positions.
5. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (4) is specifically: computation template code for the linear algebra operations is written according to BLAS library implementation and optimization methods; every time a matrix dimension is referenced, it is written as one of the four implicit dimension placeholders ph_P, ph_Y, ph_X, ph_R; every time a matrix element is accessed, it is written as one of the three memory access function placeholders ph_A(p, y, k), ph_B(p, k, x), ph_C(p, y, x).
6. The template-based GPU high-performance tensor contraction method according to claim 1, characterized in that step (5) is specifically: the analysis of the symbolic representation (9) of the tensor contraction yields the contents of the four dimension sequences P, Y, X, R, and a specific linear algebra computation template is selected at compile time according to whether each of them is empty; after the template is selected, the memory access functions and implicit dimensions are substituted into its placeholders to generate legal CUDA C/C++ code.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210343327.3A CN115203634B (en) | 2022-03-31 | 2022-03-31 | Template-based GPU high-performance tensor merging method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115203634A true CN115203634A (en) | 2022-10-18 |
| CN115203634B CN115203634B (en) | 2025-06-10 |
Family
ID=83574354
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210343327.3A Active CN115203634B (en) | 2022-03-31 | 2022-03-31 | Template-based GPU high-performance tensor merging method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115203634B (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070239643A1 (en) * | 2006-03-17 | 2007-10-11 | Microsoft Corporation | Document characterization using a tensor space model |
| US20190228344A1 (en) * | 2018-01-19 | 2019-07-25 | Electronics And Telecommunications Research Institute | Gpu-based adaptive blas operation acceleration apparatus and method thereof |
| US20200042867A1 (en) * | 2018-08-01 | 2020-02-06 | Nanjing Iluvatar CoreX Technology Co., Ltd. (DBA “Iluvatar CoreX Inc. Nanjing”) | Hardware architecture for accelerating artificial intelligent processor |
| US20210232969A1 (en) * | 2018-12-24 | 2021-07-29 | Intel Corporation | Methods and apparatus to process a machine learning model in a multi-process web browser environment |
| CN113468469A (en) * | 2021-06-02 | 2021-10-01 | 北京迈格威科技有限公司 | Convolution processing method and device of feature graph executed by computer and electronic equipment |
Non-Patent Citations (1)
| Title |
|---|
| KUN ZHOU: "Memory-Scalable GPU Spatial Hierarchy Construction", IEEE, vol. 17, no. 4, 3 December 2010 (2010-12-03), pages 1 - 9 * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024093293A1 (en) * | 2022-11-02 | 2024-05-10 | 华为技术有限公司 | Method for carrying out stencil computations, and apparatus |
| CN119598085A (en) * | 2024-11-21 | 2025-03-11 | 广州壁仞集成电路有限公司 | Multi-dimensional tensor protocol calculation method, apparatus, device, storage medium, and product |
| CN119598085B (en) * | 2024-11-21 | 2025-09-26 | 广州壁仞集成电路有限公司 | Multi-dimensional tensor protocol calculation method, apparatus, device, storage medium, and product |
| CN120560604A (en) * | 2025-07-31 | 2025-08-29 | 上海人工智能创新中心 | Computing device for performing complex Einstein summation operation, method for performing complex Einstein summation operation on a computing device, computer-readable storage medium, and computer program product |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115203634B (en) | 2025-06-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Fishman et al. | The ITensor software library for tensor network calculations | |
| Gale et al. | Sparse gpu kernels for deep learning | |
| CN115203634B (en) | Template-based GPU high-performance tensor merging method | |
| Paszke et al. | Getting to the point: index sets and parallelism-preserving autodiff for pointful array programming | |
| Bondhugula | High performance code generation in mlir: An early case study with gemm | |
| US20230024035A1 (en) | Zero-copy sparse matrix factorization synthesis for heterogeneous compute systems | |
| CN111723935A (en) | Neural network computation graph processing method, computer storage medium and electronic device | |
| Alt et al. | Connections Between Numerical Algorithms for PDEs and Neural Networks: T. Alt et al. | |
| Li et al. | MPFFT: An auto-tuning FFT library for OpenCL GPUs | |
| Zayer et al. | A gpu‐adapted structure for unstructured grids | |
| Mudalige et al. | Large-scale performance of a DSL-based multi-block structured-mesh application for Direct Numerical Simulation | |
| JP5235666B2 (en) | Associative matrix method, system and computer program product using bit-plane representation of selected segments | |
| Eswar et al. | PLANC: Parallel low-rank approximation with nonnegativity constraints | |
| Tsai et al. | Performance-portable autotuning of opencl kernels for convolutional layers of deep neural networks | |
| Tukanov et al. | Modeling matrix engines for portability and performance | |
| Cowan et al. | Automating generation of low precision deep learning operators | |
| US5781779A (en) | Tools for efficient sparse matrix computation | |
| Zhou et al. | Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F_2 | |
| Lu et al. | Im2win: Memory efficient convolution on SIMD architectures | |
| Bećirović et al. | Performance comparison of medical image classification systems using TensorFlow Keras, PyTorch, and JAX | |
| Bassoy et al. | Fast higher-order functions for tensor calculus with tensors and subtensors | |
| Regnault et al. | SPC5: an efficient SpMV framework vectorized using ARM SVE and x86 AVX-512 | |
| De et al. | Hybrid Parallel Tucker Decomposition of Streaming Data | |
| Li et al. | Compiler-assisted operator template library for DNN accelerators | |
| Fernandes | Exploring lazy evaluation and compile-time simplifications for efficient geometric algebra computations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |