
CN116888601A - Methods and devices for processing computing tasks - Google Patents


Info

Publication number
CN116888601A
Authority
CN
China
Prior art keywords
axis
input
operator
tensors
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280012811.6A
Other languages
Chinese (zh)
Inventor
柯继伟
俞郑中
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN116888601A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • User Interface Of Digital Computer (AREA)
  • Multi Processors (AREA)

Abstract


Embodiments of the present application provide a method and device for processing computing tasks. The method includes: determining a first operator used to perform a computing task, the first operator including N splittable axes, where N is a positive integer greater than or equal to 1; obtaining splitting information for the first operator from an operator splitting information base; splitting the input tensors of the first operator according to the splitting information to obtain K groups of input tensors; and sending the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task. In this way, the graph optimizer can automatically split operator input and output tensors without relying on the principles of specific operators, thereby achieving complete decoupling of the graph optimizer and the operator optimization module and allowing the operators corresponding to the computing task to be computed in parallel on multiple computing resources.

Description

Method and device for processing computing tasks
Technical Field
Embodiments of the present application relate to the field of artificial intelligence, and more particularly, to a method and apparatus for processing computing tasks.
Background
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence enables machines to perceive, reason, and make decisions. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-machine interaction, recommendation and search, basic AI theory, and the like.
Open-source deep learning frameworks such as TensorFlow, PyTorch, and MXNet provide a friendly programming environment for users to design deep learning models, so that a designed model can be conveniently deployed on a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose hardware platform. If a designed deep learning model is to be deployed on a specific device, the forward inference framework of that device's manufacturer is typically used; for example, TensorRT is used on NVIDIA GPUs. If a designed deep learning model needs to run on multiple different types of devices, a deep learning compiler can generate, from the model described in the deep learning framework, code that is valid on each of the different device types.
Deep learning compilers typically improve the performance of models on different hardware through graph optimization and operator optimization. The two optimizations are usually relatively decoupled and independent, but implementing graph optimization often requires knowledge of the operators' own principles in order to obtain a suitable operator-parallelization strategy. Therefore, how a graph optimizer can split operators automatically without relying on the principles of specific operators is a problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a method and a device for processing a computing task that enable a graph optimizer to automatically split operator input and output tensors without relying on the principles of specific operators, thereby achieving complete decoupling of the graph optimizer and the operator optimization module and enabling the operators corresponding to the computing task to be computed in parallel on multiple computing resources.
In a first aspect, a method of processing a computing task is provided, the method being performed by a graph optimizer and comprising: determining a first operator for performing a computing task, the first operator comprising N splittable axes, N being a positive integer greater than or equal to 1; obtaining splitting information of the first operator from an operator splitting information base, where the splitting information of the first operator comprises the axis type of the n-th of the N splittable axes in the first operator and first position information, the first position information indicating the position of the n-th of the N splittable axes in an input tensor of the first operator, with n = 1, …, N; splitting the input tensors of the first operator according to the splitting information of the first operator to obtain K groups of input tensors, K being a positive integer greater than or equal to 2; and sending the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
It should be appreciated that the first operator comprising N splittable axes means that the N splittable axes appear in the input tensors of the first operator.
It should also be appreciated that the computing task may be a computing task in the field of artificial intelligence, such as an image processing task, a video processing task, a speech processing task, or a natural language processing task; it may also be a computing task in the field of big data processing, or in the field of high-performance computing (HPC), which is not limited in this application. Accordingly, the input tensor of the first operator corresponding to the computing task may be the input tensor corresponding to the computing task in any one of the above fields; for example, when the computing task is an image processing task, the input tensor of the first operator represents data related to the image.
At present, splitting an operator's input tensor requires an algorithm engineer to determine, at the application layer and via a scripting language, a splitting scheme based on the splittable axes of a particular operator type, so automatic splitting of operator input tensors cannot be achieved. In the embodiments of the present application, the graph optimizer obtains the splitting information of each operator directly from the operator splitting information base, so it can automatically split the input tensor of each operator without being aware of the operator's mathematical semantics or underlying implementation. This achieves complete decoupling of graph optimization and operator optimization, and allows the operators corresponding to the computing task to be computed in parallel on multiple computing resources.
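The lookup-split-dispatch flow described above can be sketched as follows. This is an illustrative approximation only: the names (`SPLIT_INFO_DB`, `split_inputs`) and the dictionary layout are our assumptions, not identifiers from the patent.

```python
import numpy as np

# Hypothetical operator splitting information base: for each operator,
# each splittable axis records its type and its positions as
# (input-tensor index, dimension index) pairs.
SPLIT_INFO_DB = {
    "add": {"axis0": {"type": "element", "positions": [(0, 0), (1, 0)]}},
}

def split_inputs(op_name, inputs, axis_name, k):
    """Split every input tensor containing the target axis into k parts,
    producing k groups of input tensors (one group per target resource)."""
    info = SPLIT_INFO_DB[op_name][axis_name]
    groups = [list(inputs) for _ in range(k)]
    for tensor_idx, dim in info["positions"]:
        parts = np.array_split(inputs[tensor_idx], k, axis=dim)
        for g, part in zip(groups, parts):
            g[tensor_idx] = part
    return groups

a = np.arange(8.0)
b = np.ones(8)
groups = split_inputs("add", [a, b], "axis0", 2)

# Each group would be sent to one computing resource; computing the
# elementwise add per group and concatenating equals the unsplit result.
out = np.concatenate([x + y for x, y in groups])
assert np.allclose(out, a + b)
```

Note that the graph optimizer only consults the information base; it never inspects the add operator's implementation.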
In one possible implementation, the axis type of a splittable axis is one of the following: an element axis, a reduction axis, or a sliding-window axis. An axis whose elements have a point-to-point mapping between the operator's input tensor and output tensor is an element axis; if a first axis exists in the operator's input tensor but not in its output tensor, the first axis is a reduction axis; an axis along which the operator performs a sliding-window scan over the elements of its input tensor is a sliding-window axis.
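The three axis types can be illustrated on common operators (the classification below is our example, not taken from the patent):

```python
import numpy as np

# - Elementwise ReLU on shape (3, 4): both axes are element axes —
#   output[i, j] depends only on input[i, j] (point-to-point mapping).
# - Sum over axis 1: axis 1 is a reduction axis — it exists in the
#   input tensor but disappears from the output tensor.
# - 1-D "valid" convolution: the scanned axis is a sliding-window axis —
#   the operator slides a window over it, changing its length.

x = np.arange(12.0).reshape(3, 4)

# Element axis: splitting along axis 0 and recombining reproduces ReLU.
relu = lambda t: np.maximum(t, 0)
parts = np.split(x, 3, axis=0)
assert np.allclose(np.concatenate([relu(p) for p in parts]), relu(x))

# Reduction axis: axis 1 is absent from the output of a sum over it.
assert np.sum(x, axis=1).shape == (3,)

# Sliding-window axis: a window of size 3, stride 1 shrinks length 4 to 2.
w = np.array([1.0, 1.0, 1.0])
conv = np.convolve(x[0], w, mode="valid")
assert conv.shape == (2,)
```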
In one possible implementation, a target split axis is determined, the target split axis being one of the N splittable axes; a splitting scheme corresponding to the axis type of the target split axis in the first operator is determined according to the splitting information of the first operator; and the input tensors of the first operator are split according to that splitting scheme to obtain the K groups of input tensors.
In the embodiments of the present application, the graph optimizer performs single-operator splitting according to the different types of axes in the operator's input tensors and the splitting schemes corresponding to those axis types, so the graph optimizer can automatically derive different single-operator splitting strategies without relying on the principles of specific operators, thereby achieving complete decoupling of the graph optimizer and the operator optimization module.
In one possible implementation, splitting the input tensors of the first operator according to the splitting scheme corresponding to the axis type of the target split axis in the first operator to obtain the K groups of input tensors includes: determining, according to the splitting scheme, the Q first input tensors of the first operator that include the target split axis, and the position of the target split axis in each of the Q first input tensors, Q being a positive integer greater than or equal to 1; splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources to obtain Q groups of second input tensors; and obtaining the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
Each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors is the result of splitting the q-th of the Q first input tensors into K parts, where q = 1, …, Q.
The k-th group of input tensors among the K groups comprises the k-th second input tensor from each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
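The composition of each group — split pieces of the tensors that contain the target axis, plus copies of the unsplit inputs — can be sketched with a matmul, where only one input contains the split axis (the example and names are ours, not the patent's):

```python
import numpy as np

# For C = A @ B with A of shape (M, N) and B of shape (N, P), the M axis
# appears only in A, so Q = 1 tensor is split; B is an "unsplit input
# tensor" that is copied into every one of the K groups.

K = 2
A = np.arange(12.0).reshape(4, 3)   # contains the target split axis (dim 0)
B = np.arange(6.0).reshape(3, 2)    # does not contain the target axis

a_parts = np.array_split(A, K, axis=0)   # the Q = 1 group of K second tensors
groups = [(a_k, B) for a_k in a_parts]   # k-th group: (A_k, unsplit B)

# Each resource computes its part; concatenating along M restores C.
C = np.concatenate([a_k @ b for a_k, b in groups], axis=0)
assert np.allclose(C, A @ B)
```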
In one possible implementation, where the operators for performing the computing task further include a second operator comprising P splittable axes, the P splittable axes being a subset of the N splittable axes, splitting the input tensors of the first operator according to the splitting information of the first operator to obtain the K groups of input tensors includes: obtaining splitting information of the second operator from the operator splitting information base, where the splitting information of the second operator comprises the axis type of the p-th splittable axis in the second operator and second position information, the second position information indicating the position of the p-th splittable axis in an input tensor of the second operator, the input tensor of the second operator being an output tensor of the first operator, P being a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P; determining P pieces of splitting reference information according to the splitting information of the first operator and of the second operator, where the p-th piece comprises: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator; determining P groups of candidate splitting schemes according to the P pieces of splitting reference information, where the p-th group comprises at least one splitting scheme; determining a target splitting scheme according to the time each splitting scheme in the P groups of candidate schemes would take to complete the computing task; and splitting the input tensors of the first operator according to the target splitting scheme to obtain the K groups of input tensors.
The splitting schemes included in the p-th group of candidate splitting schemes are determined according to the p-th piece of splitting reference information and the number M of computing resources.
In the embodiments of the present application, the graph optimizer automatically splits operator input and output tensors according to the different types of axes. The graph optimizer does not need to split the input and output tensors based on the principles of specific operators; it only needs to split them according to the splitting schemes corresponding to the different axis types. For an operator, the computational formula is unchanged before and after its input and output tensors are split; only some of the operator's parameters change. This achieves thorough decoupling of graph optimization from the principles of specific operators, and the axis-type-based splitting of operator input tensors therefore generalizes more strongly. In addition, a suitable operator splitting scheme can be flexibly selected according to the axis types of the split axes and the position information of the split axes in the operator's input and output tensors.
In one possible implementation, splitting the input tensors of the first operator according to the target splitting scheme to obtain the K groups of input tensors includes: determining, according to the target splitting scheme, the target split axis, the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, the Q first input tensors of the first operator that include the target split axis, and the position of the target split axis in each of the Q first input tensors, Q being a positive integer greater than or equal to 1; splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the number K of target computing resources to obtain Q groups of second input tensors, where each group includes K second input tensors and the q-th group is the result of splitting the q-th of the Q first input tensors into K parts; and obtaining the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
The k-th group of input tensors among the K groups comprises the k-th second input tensor from each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
In one possible implementation, splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the number K of target computing resources to obtain the Q groups of second input tensors includes: if the axis type of the target split axis is an element axis or a sliding-window axis in the first operator, and is an element axis or a sliding-window axis in the second operator, determining, according to the first position information and the second position information of the target split axis, the L first output tensors of the first operator that include the target split axis, and the position of the target split axis in each of the L first output tensors, L being a positive integer greater than or equal to 1; taking a first input length as the input of the forward shape-derivation function corresponding to the axis type of the target split axis in the first operator to obtain a third input length, where the first input length is the length of the target split axis in each first input tensor and these lengths are equal across the first input tensors; taking the third input length as the input of the forward shape-derivation function corresponding to the axis type of the target split axis in the second operator to obtain a first output length; splitting the L first output tensors along the target split axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where each group includes K second output tensors and the l-th group is the result of splitting the l-th of the L first output tensors into K parts; taking the K second output lengths corresponding to the target split axis in each group of second output tensors as inputs of the reverse derivation function corresponding to the axis type of the target split axis in the second operator to obtain K third input lengths corresponding to the target split axis in each of Q groups of fifth input tensors, where the lengths corresponding to the target split axis in the k-th second output tensor of each group of second output tensors are equal, and the lengths corresponding to the target split axis in the k-th tensor of each group of fifth input tensors are equal; taking the K third input lengths corresponding to the target split axis in each group of fifth input tensors as inputs of the reverse derivation function corresponding to the axis type of the target split axis in the first operator to obtain K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, where the lengths corresponding to the target split axis in the k-th second input tensor of each group of second input tensors are equal; and splitting each of the Q first input tensors along the target split axis according to those K second input lengths to obtain the Q groups of second input tensors.
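The forward/reverse shape derivation through two chained operators can be sketched numerically. This is our own minimal model under the assumption that both operators are stride-1 "valid" sliding-window operators with window sizes w1 and w2; the function names are illustrative, not the patent's:

```python
# For a stride-1 "valid" sliding-window axis with window size w:
# forward derivation: out_len = in_len - w + 1
# reverse derivation: in_len = out_len + w - 1

def forward_shape(in_len, window):
    """Forward shape derivation: output length along the axis."""
    return in_len - window + 1

def backward_shape(out_len, window):
    """Reverse derivation: input length needed for a given output length."""
    return out_len + window - 1

w1, w2 = 3, 2
first_input_len = 10
third_input_len = forward_shape(first_input_len, w1)   # between operators
first_output_len = forward_shape(third_input_len, w2)  # final output length

# Split the final output into K = 2 contiguous pieces, then derive each
# resource's required input length back through operator 2, then operator 1.
K = 2
out_pieces = [first_output_len // K + (i < first_output_len % K)
              for i in range(K)]
in_pieces = [backward_shape(backward_shape(p, w2), w1) for p in out_pieces]

assert sum(out_pieces) == first_output_len
# The derived input pieces overlap: their total exceeds the unsplit length.
assert sum(in_pieces) > first_input_len
```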
In the embodiments of the present application, the split input tensors undergo the successive operator computations on the same target computing resource, so that parallel computation across multiple target computing resources can be achieved.
In one possible implementation, when the axis type of the target split axis in the first operator is an element axis or a sliding-window axis, the first position information of the target split axis further indicates the position of the target split axis in an output tensor of the first operator, and splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources to obtain the Q groups of second input tensors includes: determining, according to the first position information of the target split axis, the L first output tensors of the first operator that include the target split axis, and the position of the target split axis in each of the L first output tensors, L being a positive integer greater than or equal to 1; taking the first input length as the input of the forward shape-derivation function of the target split axis to obtain the first output length, where the first input length is the length of the target split axis in each first input tensor and these lengths are equal across the first input tensors; splitting the L first output tensors along the target split axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, each group including K second output tensors; taking the K second output lengths corresponding to the target split axis in each group of second output tensors as inputs of the reverse derivation function of the target split axis to obtain K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors; and splitting each of the Q first input tensors along the target split axis according to those K second input lengths to obtain the Q groups of second input tensors.
The l-th group of second output tensors among the L groups is the result of splitting the l-th of the L first output tensors into K parts.
The lengths corresponding to the target split axis in the k-th second output tensor of each group of second output tensors are equal, and the lengths corresponding to the target split axis in the k-th second input tensor of each group of second input tensors are equal.
In one possible implementation, when the axis type of the target split axis in the first operator is an element axis, splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking a first slicing function to split each of the Q first input tensors along the target split axis according to those K second input lengths, thereby obtaining the Q groups of second input tensors.
The elements corresponding to the target split axis in each second input tensor of the q-th group of second input tensors are a subset of the elements corresponding to the target split axis in the q-th first input tensor; the elements corresponding to the target split axis in the second input tensors of the q-th group have no intersection with one another; and their union equals the elements corresponding to the target split axis in the q-th first input tensor.
In one possible implementation, when the axis type of the target split axis in the first operator is a sliding-window axis, splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking the first slicing function to split each of the Q first input tensors along the target split axis with overlap, according to those K second input lengths, thereby obtaining the Q groups of second input tensors.
The elements corresponding to the target split axis in each second input tensor of the q-th group of second input tensors are a subset of the elements corresponding to the target split axis in the q-th first input tensor; the elements corresponding to the target split axis in the second input tensors of the q-th group intersect one another; and their union equals the elements corresponding to the target split axis in the q-th first input tensor.
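Overlapped splitting of a sliding-window axis can be sketched as follows: each of the K input slices is extended by (window − 1) elements so that the per-slice results concatenate to the unsplit result. The helper name and the stride-1 "valid" convolution setting are our assumptions:

```python
import numpy as np

def overlapped_slices(in_len, window, k):
    """Slice bounds for k overlapping pieces of a sliding-window axis
    (stride 1, 'valid' padding); adjacent pieces share window - 1 elements."""
    out_len = in_len - window + 1
    bounds, start = [], 0
    for i in range(k):
        piece = out_len // k + (1 if i < out_len % k else 0)
        bounds.append((start, start + piece + window - 1))
        start += piece
    return bounds

x = np.arange(10.0)
w = np.array([1.0, 2.0, 1.0])            # window = 3, stride 1
full = np.convolve(x, w, mode="valid")   # length 10 - 3 + 1 = 8

# Each resource convolves its overlapped slice; concatenation restores
# the unsplit output exactly.
parts = [np.convolve(x[a:b], w, mode="valid")
         for a, b in overlapped_slices(len(x), 3, 2)]
assert np.allclose(np.concatenate(parts), full)
```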
In one possible implementation, when the axis type of the target split axis in the first operator is a sliding-window axis, splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking a second splitting function to split each of the Q first input tensors along the target split axis to obtain Q groups of third input tensors, each group including K third input tensors; invoking a second slicing function to slice the K third input tensors in each of the Q groups of third input tensors along the target split axis, according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to obtain Q groups of fourth input tensors; and invoking a splicing function to concatenate the k-th fourth input tensor of the q-th group of fourth input tensors with the k-th third input tensor of the q-th group of third input tensors along the target split axis, to obtain the q-th group of second input tensors.
The elements corresponding to the target split axis in each third input tensor of the q-th group of third input tensors are a subset of the elements corresponding to the target split axis in the q-th first input tensor; the elements corresponding to the target split axis in the third input tensors of the q-th group have no intersection with one another; and their union equals the elements corresponding to the target split axis in the q-th first input tensor.
The elements corresponding to the target split axis in the k-th second input tensor of the q-th group of second input tensors are contiguous.
In the embodiments of the present application, the non-overlapping splitting scheme for a sliding-window axis is suited to scenarios with frequent data synchronization between different computing resources, for example multi-die parallelism. The splicing function serves as the data synchronization node between different dies, so overlapping data is neither computed repeatedly nor accumulated continuously, which can effectively reduce the computational and storage pressure on the computing resources.
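The no-overlap variant above — split without overlap, then fetch the needed halo from the neighboring piece with a slicing function and attach it with a concatenation ("splicing") function — can be sketched like this. The tensor-group names follow the text; the code itself is our illustrative assumption for a stride-1 "valid" convolution:

```python
import numpy as np

window, K = 3, 2
x = np.arange(10.0)
w = np.ones(window)

# Third input tensors: non-overlapping pieces of the first input tensor.
third = np.array_split(x, K)

# Fourth input tensors: the window - 1 leading elements ("halo") sliced
# from each following piece — the data exchanged at the sync point.
halos = [third[k + 1][: window - 1] for k in range(K - 1)]

# Second input tensors: each piece spliced with its halo (the last piece
# needs none), so no overlapping data is stored twice before the sync.
second = [np.concatenate([third[k], halos[k]]) if k < K - 1 else third[k]
          for k in range(K)]

parts = [np.convolve(t, w, mode="valid") for t in second]
assert np.allclose(np.concatenate(parts),
                   np.convolve(x, w, mode="valid"))
```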
In one possible implementation, when the axis type of the target split axis in the first operator is a reduction axis, splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources to obtain the Q groups of second input tensors includes: invoking a third splitting function to split each of the Q first input tensors according to the number K of target computing resources, thereby obtaining the Q groups of second input tensors.
The elements corresponding to the target split axis in each second input tensor of the q-th group of second input tensors are a subset of the elements corresponding to the target split axis in the q-th first input tensor; the elements corresponding to the target split axis in the second input tensors of the q-th group have no intersection with one another; and their union equals the elements corresponding to the target split axis in the q-th first input tensor.
In the embodiments of the present application, since the specific splitting scheme is determined according to the type of the reduction axis, during graph optimization the graph optimizer can reasonably split the input tensors of an operator that includes a reduction axis without relying on the principle of that specific operator.
In one possible implementation, reduction axes include a first type and a second type: a first-type reduction axis is one along which the operator performs a reduction operation on the elements of its input tensor, and a second-type reduction axis is one along which the operator performs no reduction operation on the elements of its input tensor.
As one possible implementation, the first type of reduction axis includes any one of the following: a reduction-sum axis, a reduction-maximum axis, a reduction-minimum axis, and a reduction-average axis. The reduction-sum axis is a reduction axis along which the operator performs a summation reduction operation on elements in its input tensor; the reduction-maximum axis is a reduction axis along which the operator performs a maximum-value reduction operation; the reduction-minimum axis is a reduction axis along which the operator performs a minimum-value reduction operation; and the reduction-average axis is a reduction axis along which the operator performs an averaging reduction operation.
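A point worth making concrete is how the K partial results of these four reduction kinds are recombined: sum, maximum, and minimum combine directly, while a correct average must be weighted by each slice's element count. The sketch below is illustrative; the function name is not from the patent.

```python
# Hedged sketch: recombining K per-slice partial results for each of the four
# first-type reduction axes named above.

def combine_partials(parts, kind):
    if kind == "sum":
        return sum(sum(p) for p in parts)
    if kind == "max":
        return max(max(p) for p in parts)
    if kind == "min":
        return min(min(p) for p in parts)
    if kind == "mean":  # must weight by slice length, not average the averages
        total = sum(sum(p) for p in parts)
        count = sum(len(p) for p in parts)
        return total / count
    raise ValueError(kind)

parts = [[3, 1, 4], [1, 5], [9, 2, 6]]   # uneven slices of one reduction axis
flat = [x for p in parts for x in p]
assert combine_partials(parts, "sum") == sum(flat)
assert combine_partials(parts, "max") == max(flat)
assert combine_partials(parts, "min") == min(flat)
assert combine_partials(parts, "mean") == sum(flat) / len(flat)
```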
In one possible implementation, the second type of reduction axis includes a reduction-gather axis, which is an axis along which the operator gathers elements from its data input tensor according to the addresses indicated by elements of its index input tensor.
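A gather of this kind can be sketched in a few lines; the sketch below is an assumed minimal form of the operation, showing why the index tensor can be segmented freely while the gathered data axis itself is not reduced.

```python
def gather(data, indices):
    """out[i] = data[indices[i]]: indices address positions on the gather axis."""
    return [data[i] for i in indices]

data = [10, 20, 30, 40, 50]
indices = [4, 0, 2, 2]
assert gather(data, indices) == [50, 10, 30, 30]

# Segmenting the *index* tensor is safe, since each part still addresses the
# full data axis; the gather axis vanishes from the output without any
# arithmetic reduction of its elements, hence "second type" of reduction axis.
part_a, part_b = indices[:2], indices[2:]
assert gather(data, part_a) + gather(data, part_b) == gather(data, indices)
```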
As one possible implementation, the computing resources include one of the following categories: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
In a second aspect, an apparatus for processing computing tasks is provided, where the apparatus is applied to a graph optimizer and includes a processor and a transmission interface. The processor is configured to determine a first operator for executing a computing task, where the first operator includes N separable axes and N is a positive integer greater than or equal to 1. The processor is configured to acquire segmentation information of the first operator from an operator segmentation information base, where the segmentation information of the first operator includes an axis type of the n-th separable axis of the N separable axes in the first operator and first position information, the first position information being used to indicate the position of the n-th separable axis in an input tensor of the first operator, where n=1, …, N. The processor is configured to segment the input tensor of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors, where K is a positive integer greater than or equal to 2. The transmission interface is configured to respectively transmit the K groups of input tensors to K target computing resources, so that the K target computing resources complete the computing task.
It should be appreciated that the statement that the first operator includes N separable axes means that the N separable axes are included in the input tensor of the first operator.
It should also be appreciated that the computing tasks may be computing tasks in the field of artificial intelligence, such as image processing tasks, video processing tasks, speech processing tasks, natural language processing tasks, and the like; the computing task may also be a computing task in the big data processing field, or may be a computing task in the high-performance computing (HPC) field, which is not limited in this regard by the present application. Accordingly, the input tensor of the first operator corresponding to the computing task may be the input tensor corresponding to the computing task in any one of the above fields, for example, when the computing task is an image processing task, the input tensor of the first operator represents data related to the image.
At present, segmentation of an operator's input tensor requires an algorithm engineer to determine a segmentation mode at the application layer, through a scripting language, according to the segmentation axes included in a given operator type, so automatic segmentation of the operator's input tensor cannot be realized. In the embodiment of the application, the graph optimizer obtains the segmentation information of each operator directly from the operator segmentation information base, so the graph optimizer can automatically segment the input tensor of each operator without perceiving the mathematical semantics or underlying implementation of each operator. Graph optimization and operator optimization are thereby completely decoupled, and the operators corresponding to the computing task can be computed in parallel on multiple computing resources.
As one possible implementation, the axis type of a separable axis is one of the following: an element axis, a reduction axis, or a sliding-window axis. An axis whose elements have a point-to-point mapping between the input tensor and the output tensor of the operator is an element axis; if a first axis exists in the input tensor of the operator but not in the output tensor of the operator, the first axis is a reduction axis; and an axis along which the operator performs a sliding-window scanning operation on elements in its input tensor is a sliding-window axis.
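The three axis types can be told apart from axis extents alone, which the toy classifier below illustrates. It is a simplification for intuition, not the encoding the operator segmentation information base actually uses.

```python
def classify_axis(in_len, out_len):
    """Toy classification of one axis by comparing its input/output extents."""
    if out_len is None:       # axis vanishes from the output: reduction axis
        return "reduction"
    if out_len == in_len:     # point-to-point mapping: element axis
        return "element"
    return "sliding-window"   # extent changes per a window formula

# Axis extents for a hypothetical elementwise / reduce / conv-like pipeline:
assert classify_axis(8, 8) == "element"
assert classify_axis(8, None) == "reduction"
assert classify_axis(8, 6) == "sliding-window"   # e.g. window 3, stride 1: 8-3+1=6
```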
In one possible implementation, the processor is specifically configured to: determine a target segmentation axis, which is one of the N separable axes; determine, according to the segmentation information of the first operator, the segmentation mode corresponding to the axis type of the target segmentation axis in the first operator; and segment the input tensor of the first operator according to that segmentation mode to obtain the K groups of input tensors.
In one possible implementation, the processor is specifically configured to: according to a segmentation mode corresponding to the shaft type of the target segmentation shaft in the first operator, determining Q first input tensors comprising the target segmentation shaft in the first operator and the positions of the target segmentation shaft in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1; dividing each first input tensor in the Q first input tensors according to the shaft type of the target dividing shaft in the first operator and the number K of target computing resources to obtain Q groups of second input tensors; and obtaining K groups of input tensors according to the Q groups of second input tensors and the input tensors of the first operator which is not segmented.
Each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors of the Q groups is the result of segmenting the q-th first input tensor of the Q first input tensors into K parts, where q=1, …, Q.
The k-th input tensor group of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors together with the unsegmented input tensors of the first operator.
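The regrouping step can be sketched directly: given Q groups of K slices plus the tensors left whole, assemble one input group per target computing resource. The helper name is illustrative.

```python
def build_input_groups(split_groups, unsplit):
    """split_groups: Q lists of K slices; unsplit: input tensors left whole.
    Returns K input groups, one per target computing resource."""
    k = len(split_groups[0])
    return [[g[i] for g in split_groups] + list(unsplit) for i in range(k)]

# Q=2 segmented inputs, K=2 resources, one unsegmented parameter tensor "W":
groups = build_input_groups([["A0", "A1"], ["B0", "B1"]], ["W"])
assert groups == [["A0", "B0", "W"], ["A1", "B1", "W"]]
```

Note that the unsegmented tensors (for example, weights) are replicated into every group, which is what lets each resource run the same operator formula on its slice.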
In the embodiment of the application, the graph optimizer performs single-operator segmentation according to the different types of axes in the operator's input tensor and the segmentation modes corresponding to those axis types, so the graph optimizer can automatically obtain different single-operator segmentation strategies without relying on the principle of any specific operator, thereby realizing complete decoupling of the graph optimizer and the operator optimization module.
In a possible implementation manner, in the case that the operators for performing the computing task further include a second operator, where the second operator includes P separable axes and the P separable axes are a subset of the N separable axes, the processor is specifically configured to: obtain segmentation information of the second operator from the operator segmentation information base, where the segmentation information of the second operator includes an axis type of the p-th separable axis in the second operator and second position information, the second position information being used to indicate the position of the p-th separable axis in an input tensor of the second operator, the input tensor of the second operator is an output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p=1, …, P; determine P pieces of segmentation reference information according to the segmentation information of the first operator and the segmentation information of the second operator, where the p-th piece of segmentation reference information includes: the axis type of the p-th separable axis in the first operator, the axis type of the p-th separable axis in the second operator, and the position of the p-th separable axis in the input tensor of the first operator; determine P groups of candidate segmentation modes according to the P pieces of segmentation reference information, where the p-th group of candidate segmentation modes includes at least one segmentation mode; determine a target segmentation mode according to the time each segmentation mode in the P groups of candidate segmentation modes requires to finish the computing task; and segment the input tensor of the first operator according to the target segmentation mode to obtain the K groups of input tensors.
As a possible implementation manner, the segmentation modes included in the p-th group of candidate segmentation modes are determined according to the p-th piece of segmentation reference information among the P pieces of segmentation reference information and the number M of computing resources.
In an embodiment of the application, the graph optimizer automatically segments the operator input and output tensors according to the different types of axes. The graph optimizer does not need to segment the input and output tensors based on the principle of a specific operator; it only needs to segment them based on the operator segmentation modes corresponding to the different axis types. For an operator, the calculation formula does not change before and after its input and output tensors are segmented; only some of its parameters change. Graph optimization can therefore be thoroughly decoupled from the principles of specific operators, and the segmentation of an operator's input tensors based on the different axis types generalizes better. In addition, according to the axis types of the segmentation axes and the position information of the segmentation axes in the input and output tensors of the operators, a suitable operator segmentation mode can be flexibly selected.
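The selection of the target segmentation mode by required completion time can be illustrated as a simple argmin over a cost estimate. The cost model below (compute per resource plus a synchronization term) is a stand-in of our own, not the patent's estimator; all names are hypothetical.

```python
# Illustrative cost-based selection: keep the candidate with the smallest
# estimated completion time.

def pick_target_mode(candidates, estimate_time):
    return min(candidates, key=estimate_time)

def toy_cost(mode):
    # mode = (name, compute_per_resource, sync_bytes); weights are arbitrary
    _, compute, sync = mode
    return compute + 0.01 * sync

modes = [("split-H", 50.0, 4000),   # fast compute, heavy halo synchronization
         ("split-C", 80.0, 0),      # slower compute, no synchronization
         ("no-split", 100.0, 0)]    # baseline on a single resource
best = pick_target_mode(modes, toy_cost)
assert best[0] == "split-C"   # 80 < 50 + 0.01*4000 = 90 < 100
```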
In one possible implementation, the processor is specifically configured to: determine, according to the target segmentation mode, a target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, Q first input tensors that include the target segmentation axis in the first operator, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segment each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number K of target computing resources to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors and the q-th group of second input tensors is the result of segmenting the q-th first input tensor into K parts; and obtain the K groups of input tensors according to the Q groups of second input tensors and the unsegmented input tensors of the first operator.
The k-th input tensor group of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors together with the unsegmented input tensors of the first operator.
As one possible implementation, if the axis type of the target segmentation axis in the first operator is an element axis or a sliding-window axis, and the axis type of the target segmentation axis in the second operator is an element axis or a sliding-window axis, the processor is specifically configured to: determine, according to the first position information and the second position information of the target segmentation axis, L first output tensors that include the target segmentation axis in the first operator and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use a first input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the first operator to obtain a third input length, where the first input length is the length of the target segmentation axis in each first input tensor and the lengths of the target segmentation axis in the first input tensors are equal; use the third input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the second operator to obtain a first output length; segment the L first output tensors along the target segmentation axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where each of the L groups of second output tensors includes K second output tensors and the l-th group of second output tensors is the result of segmenting the l-th first output tensor into K parts; use the K second output lengths corresponding to the target segmentation axis in each of the L groups of second output tensors as the input of the reverse derivation function corresponding to the axis type of the target segmentation axis in the second operator, to obtain K third input lengths corresponding to the target segmentation axis in each of Q groups of fifth input tensors, where the lengths corresponding to the target segmentation axis in the k-th second output tensor of each group of second output tensors are equal, and the lengths corresponding to the target segmentation axis in the k-th fifth input tensor of each group of fifth input tensors are equal; use the K third input lengths corresponding to the target segmentation axis in each of the Q groups of fifth input tensors as the input of the reverse derivation function corresponding to the axis type of the target segmentation axis in the first operator, to obtain K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, where the lengths corresponding to the target segmentation axis in the k-th second input tensor of each group of second input tensors are equal; and segment each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
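The forward/reverse length derivation chain above can be made concrete with a small sketch. It assumes, for illustration only, that both operators treat the target segmentation axis as a sliding-window axis with hypothetical (window, stride) parameters and the usual valid-padding shape formulas; the patent's actual derivation functions are internals of the operator segmentation information base.

```python
def fwd_len(n, window, stride):
    """Forward shape derivation: output length of a sliding-window axis."""
    return (n - window) // stride + 1

def bwd_len(out_n, window, stride):
    """Reverse derivation: input length needed to produce out_n windows."""
    return (out_n - 1) * stride + window

op1 = (3, 1)   # hypothetical (window, stride) of the first operator
op2 = (2, 2)   # hypothetical (window, stride) of the second operator

first_input_len = 11
third_input_len = fwd_len(first_input_len, *op1)    # second operator's input length
first_output_len = fwd_len(third_input_len, *op2)   # final output length
assert (third_input_len, first_output_len) == (9, 4)

# Segment the output into K=2 pieces, then walk the lengths backwards through
# both operators to get each resource's required slice of the first input:
second_output_lens = [2, 2]
third_input_lens = [bwd_len(p, *op2) for p in second_output_lens]
second_input_lens = [bwd_len(p, *op1) for p in third_input_lens]
assert second_input_lens == [6, 6]   # 6 + 6 > 11: the input slices must overlap
```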
In the embodiment of the application, the segmented input tensors undergo consecutive operator operations on the same target computing resource, so parallel computation across multiple target computing resources can be realized.
In one possible implementation manner, when the axis type of the target segmentation axis in the first operator is an element axis or a sliding-window axis, the first position information of the target segmentation axis is further used to indicate the position of the target segmentation axis in the output tensor of the first operator, and the processor is specifically configured to: determine, according to the first position information of the target segmentation axis, L first output tensors that include the target segmentation axis in the first operator and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use a first input length as the input of the forward shape derivation function of the target segmentation axis to obtain a first output length, where the first input length is the length of the target segmentation axis in each first input tensor and the lengths of the target segmentation axis in the first input tensors are equal; segment the L first output tensors along the target segmentation axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where each of the L groups of second output tensors includes K second output tensors and the l-th group of second output tensors is the result of segmenting the l-th first output tensor into K parts; use the K second output lengths corresponding to the target segmentation axis in each of the L groups of second output tensors as the input of the reverse derivation function of the target segmentation axis, to obtain K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, where the lengths corresponding to the target segmentation axis in the k-th second output tensor of each group of second output tensors are equal, and the lengths corresponding to the target segmentation axis in the k-th second input tensor of each group of second input tensors are equal; and segment each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
In one possible implementation, when the axis type of the target segmentation axis in the first operator is an element axis, the processor is specifically configured to: segment each of the Q first input tensors along the target segmentation axis by calling a first segmentation function, according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
The elements corresponding to the target segmentation axis in each second input tensor of the q-th group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor, no intersection exists among the elements corresponding to the target segmentation axis in the respective second input tensors of the q-th group, and the union of the elements corresponding to the target segmentation axis in the respective second input tensors of the q-th group is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
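Cutting an element axis into consecutive pieces of the derived second input lengths can be sketched as follows. The function name is illustrative, standing in for the first segmentation function.

```python
def slice_by_lengths(tensor, lengths):
    """Cut a 1-D tensor into consecutive, non-overlapping pieces whose sizes
    are the derived second input lengths."""
    out, start = [], 0
    for n in lengths:
        out.append(tensor[start:start + n])
        start += n
    assert start == len(tensor)  # no gap, no overlap: union == original axis
    return out

t = list(range(10))
pieces = slice_by_lengths(t, [4, 3, 3])
assert pieces == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```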
In one possible implementation, when the axis type of the target segmentation axis in the first operator is a sliding-window axis, the processor is specifically configured to: perform overlapped segmentation of each of the Q first input tensors along the target segmentation axis by calling a first slicing function, according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
The elements corresponding to the target segmentation axis in each second input tensor of the q-th group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor, intersections exist among the elements corresponding to the target segmentation axis in the respective second input tensors of the q-th group, and the union of the elements corresponding to the target segmentation axis in the respective second input tensors of the q-th group is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
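The overlapped split for a sliding-window axis can be sketched as below: each piece is extended so that all of its windows can be computed locally, so adjacent pieces share a halo of elements. The formulas assume a hypothetical (window, stride) sliding window with valid padding.

```python
def overlapped_split(tensor, k, window, stride):
    """Split a sliding-window axis into k pieces; each piece carries the extra
    halo elements its own windows need, so pieces overlap."""
    total_out = (len(tensor) - window) // stride + 1
    base, rem = divmod(total_out, k)
    pieces, out_start = [], 0
    for i in range(k):
        n_out = base + (1 if i < rem else 0)            # windows owned by piece i
        lo = out_start * stride                          # first element needed
        hi = (out_start + n_out - 1) * stride + window   # one past the last element
        pieces.append(tensor[lo:hi])
        out_start += n_out
    return pieces

t = list(range(10))
pieces = overlapped_split(t, 2, window=3, stride=1)   # 8 windows -> 4 + 4
assert pieces[0] == [0, 1, 2, 3, 4, 5] and pieces[1] == [4, 5, 6, 7, 8, 9]
assert set(pieces[0]) & set(pieces[1]) == {4, 5}      # shared halo: intersection exists
```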
In one possible implementation, when the axis type of the target segmentation axis in the first operator is a sliding-window axis, the processor is specifically configured to: segment each of the Q first input tensors along the target segmentation axis by calling a second segmentation function to obtain Q groups of third input tensors, where each group of third input tensors includes K third input tensors; slice the K third input tensors in each of the Q groups of third input tensors along the target segmentation axis by calling a second slicing function, according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain Q groups of fourth input tensors; and splice the k-th fourth input tensor of the q-th group of fourth input tensors with the k-th third input tensor of the q-th group of third input tensors along the target segmentation axis by calling a splicing function, to obtain the q-th group of second input tensors.
The elements corresponding to the target segmentation axis in each third input tensor of the q-th group of third input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor, no intersection exists among the elements corresponding to the target segmentation axis in the respective third input tensors of the q-th group, and the union of the elements corresponding to the target segmentation axis in the respective third input tensors of the q-th group is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
The elements corresponding to the target segmentation axis in the k-th second input tensor of the q-th group of second input tensors are contiguous.
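The three-step scheme above, split without overlap, slice the halo from the neighbor, then splice, can be sketched as follows. The function name and the fixed halo size are illustrative assumptions; in the patent the halo length comes from the derived second input lengths.

```python
def split_then_splice(tensor, k, halo):
    """Non-overlapping split of a sliding-window axis, then splice each piece
    with the halo sliced from its right-hand neighbour."""
    base, rem = divmod(len(tensor), k)
    thirds, start = [], 0
    for i in range(k):                       # step 1: disjoint third input tensors
        size = base + (1 if i < rem else 0)
        thirds.append(tensor[start:start + size])
        start += size
    seconds = []
    for i, piece in enumerate(thirds):       # steps 2+3: slice halo, then splice
        fourth = thirds[i + 1][:halo] if i + 1 < k else []
        seconds.append(piece + fourth)       # result is contiguous along the axis
    return seconds

t = list(range(8))
out = split_then_splice(t, 2, halo=2)
assert out == [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7]]
```

Splicing only at the synchronization point, instead of carrying the overlap from the start, is what keeps the halo from accumulating across consecutive sliding-window operators.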
In the embodiment of the application, this non-overlapping segmentation mode for the sliding-window axis is suitable for scenarios with frequent data synchronization between different computing resources, for example multi-die parallelism. The splicing function serves as the data synchronization node between different dies, so overlapping data is neither computed repeatedly nor grows continually, and the computing pressure and storage pressure of the computing resources can be effectively relieved.
In one possible implementation, when the axis type of the target segmentation axis in the first operator is a reduction axis, the processor is specifically configured to: segment each first input tensor of the Q first input tensors by calling a third segmentation function according to the number K of target computing resources, to obtain the Q groups of second input tensors.
The elements corresponding to the target segmentation axis in each second input tensor of the q-th group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor, no intersection exists among the elements corresponding to the target segmentation axis in the respective second input tensors of the q-th group, and the union of the elements corresponding to the target segmentation axis in the respective second input tensors of the q-th group is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
In the embodiment of the application, since the specific segmentation mode is determined according to the type of the reduction axis, during graph optimization the graph optimizer can reasonably segment the input tensor of a specific operator that includes a reduction axis, without relying on the internal principle of that operator.
In one possible implementation manner, the reduction axes include a first type of reduction axis and a second type of reduction axis, where the first type of reduction axis is a reduction axis along which the operator performs a reduction operation on elements in its input tensor, and the second type of reduction axis is a reduction axis along which the operator performs no reduction operation on elements in its input tensor.
As one possible implementation, the first type of reduction axis includes any one of the following: a reduction-sum axis, a reduction-maximum axis, a reduction-minimum axis, and a reduction-average axis. The reduction-sum axis is a reduction axis along which the operator performs a summation reduction operation on elements in its input tensor; the reduction-maximum axis is a reduction axis along which the operator performs a maximum-value reduction operation; the reduction-minimum axis is a reduction axis along which the operator performs a minimum-value reduction operation; and the reduction-average axis is a reduction axis along which the operator performs an averaging reduction operation.
In one possible implementation, the second type of reduction axis includes a reduction-gather axis, which is an axis along which the operator gathers elements from its data input tensor according to the addresses indicated by elements of its index input tensor.
As one possible implementation, the target computing resource includes one of the following categories: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
In one possible implementation manner, the apparatus may further include a memory storing instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor performs the method in any implementation manner of the first aspect.
In a third aspect, a computer readable medium is provided, the computer readable medium storing program code comprising instructions for performing the method in any one of the implementations of the first aspect.
Drawings
FIG. 1 is a schematic diagram of a deep learning compiler architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of operator segmentation according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a method for processing a computing task according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an operator segmentation mode corresponding to a calculation task completed by a single operator according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for processing computing tasks according to an embodiment of the present application;
FIG. 6 is a flow chart of an operator segmentation method corresponding to completion of a computing task by multiple operators according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a splitting manner of an element axis according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a segmentation mode of a reduction-sum axis according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a segmentation mode of a reduction-maximum axis according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a segmentation mode of a reduction-average axis according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a segmentation mode of a reduction-gather axis according to an embodiment of the present application;
FIG. 12 is a schematic view of a sliding window axis splitting method according to an embodiment of the present application;
FIG. 13 is a schematic view of another sliding window axis splitting approach provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of the position information of an operator separable axis in an input tensor and an output tensor of an operator according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an operator segmentation specific application provided by an embodiment of the present application;
FIG. 16 is a schematic diagram of another operator segmentation specific application provided by an embodiment of the present application;
FIG. 17 is a schematic diagram of yet another operator-splitting specific application provided by an embodiment of the present application;
FIG. 18 is a schematic diagram of an operator tensor structure provided by an embodiment of the present application;
FIG. 19 is a schematic diagram of an apparatus for processing computing tasks according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
The terminology used in the following examples is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the application and the appended claims, the singular forms "a," "an," and "the" are intended to include expressions such as "one or more," unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of the present application, "at least one" and "one or more" mean one, two, or more than two. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A alone, A and B together, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In order to facilitate understanding of the technical solution of the present application, a brief description will be first made of the concept related to the present application.
(1) Deep learning model
A deep learning model is a machine learning model that comprises a deep neural network structure. An algorithm engineer builds a model using a deep learning framework, performs parameter tuning and training optimization on the model, and saves the finally generated network parameters together with the model structure; the resulting file is a model file that can be used for forward inference.
Model files obtained by training with different deep learning frameworks have different formats, but a complete model file generally contains information such as tensor data, operation units, and computational graphs.
(2) Tensor
Tensors (tensors) are the data containers of deep learning systems and can be understood as an extension of matrices to an arbitrary number of dimensions. A tensor that contains only one number is called a scalar (Scalar), scalar tensor, zero-dimensional tensor, or 0D tensor; an array of numbers is called a vector (Vector), one-dimensional tensor, or 1D tensor; an array of vectors is called a matrix (Matrix), two-dimensional tensor, or 2D tensor; combining a plurality of matrices into a new array yields a three-dimensional tensor, which can be intuitively understood as a cube of numbers; combining a plurality of three-dimensional tensors into an array yields a four-dimensional tensor, and so on. Deep learning typically processes 0D to 4D tensors, but 5D tensors may be encountered when processing video data. For example, [ [ [1,2,3] ], [ [7,8,9] ] ] is a three-dimensional tensor whose 0-axis size is 2, 1-axis size is 1, and 2-axis size is 3.
The shape (shape) of a tensor represents the number of elements in each dimension of the tensor. For example, [ [ [1,2,3] ], [ [7,8,9] ] ] is a three-dimensional tensor whose shape is (2, 1, 3). As another example, fig. 18 is a schematic diagram of a tensor provided in an embodiment of the present application; the shape of the tensor shown in fig. 18 is (4, 20, 20, 3). Assuming that the tensor shown in fig. 18 represents a feature map, the physical meaning of the tensor shape in fig. 18, from left to right, is: the batch size N of the feature map is 4, that is, 4 pictures; the height H of the feature map is 20 and the width W of the feature map is 20, that is, each picture has 20 × 20 = 400 pixels; and the feature map has 3 channels, that is, the RGB channels.
The axes (axis) of a tensor are defined relative to the shape of the tensor. For example, [ [ [1,2], [3,4] ], [ [5,6], [7,8] ] ] is a three-dimensional tensor whose shape is (2, 2, 2); its 0-axis indexes the data in the first dimension: the two matrices [ [1,2], [3,4] ] and [ [5,6], [7,8] ]; its 1-axis indexes the data in the second dimension: [1,2], [3,4], [5,6], and [7,8]; and its 2-axis indexes the data in the third dimension: 1, 2, 3, 4, 5, 6, 7, and 8. As another example, for the tensor of shape (4, 20, 20, 3) shown in fig. 18, the 0-axis carries the batch-size data of the feature map, the 1-axis the feature-map height data, the 2-axis the feature-map width data, and the 3-axis the feature-map channel data.
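The tensor, shape, and axis concepts above can be illustrated with a small NumPy sketch (NumPy is used here only for illustration; it is not part of the application):

```python
# Illustrative NumPy sketch of 0D-3D tensors, shapes, and axes.
import numpy as np

scalar = np.array(5)                       # 0D tensor (scalar)
vector = np.array([1, 2, 3])               # 1D tensor (vector)
matrix = np.array([[1, 2], [3, 4]])        # 2D tensor (matrix)
t3 = np.array([[[1, 2, 3]], [[7, 8, 9]]])  # 3D tensor with shape (2, 1, 3)

# Axes index the dimensions of the shape: for the (4, 20, 20, 3) feature map
# of FIG. 18, axis 0 is the batch, axis 1 the height, axis 2 the width, and
# axis 3 the channels.
feature_map = np.zeros((4, 20, 20, 3))
```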
(3) Operator
An operator (operator), which may also be referred to as an operation unit, a calculation unit, or an arithmetic unit, represents a symbolized operation process and is the basic unit of mainstream deep learning frameworks, i.e., a node in the computational graph. The inputs and outputs of an operator are tensors. All transformations learned by a deep network can be reduced to tensor operations on tensors of numerical data.
Common operators include addition (add) units, batch normalization units, convolution units, gated recurrent units (Gated Recurrent Unit, GRU), local response normalization (local response normalization, LRN) units, long short-term memory (long short-term memory, LSTM) units, max pooling (max pool) units, rectified linear units (rectified linear units, ReLU), recurrent neural network (recurrent neural networks, RNN) units, and the Softmax function, among others.
(4) Calculation map
A computational graph (graph), also known as a dataflow graph, is defined as a directed acyclic graph (directed acyclic graph, DAG). Both tensors and operators are objects in the graph: operators are the nodes of the graph, and tensors are the data flowing on the edges of the graph. Acyclic means that the graph cannot contain cycles; for example, a tensor x cannot be the input of a layer that generates x. The only processing loops allowed (i.e., recurrent connections) are the internal loops of recurrent layers.
Most deep learning frameworks can be described using a directed acyclic graph in which each node represents an operation, and if the output of one node is the input of another node, the two nodes share an edge. That is, the nodes in the computational graph represent operators, and the edges between nodes represent data dependencies between the two nodes they connect.
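A rough sketch of such a graph follows (the three-operator chain, its operations, and the use of Kahn's algorithm for ordering are invented for illustration): operators are nodes, tensors flow along edges, and execution follows a topological order of the DAG.

```python
# Sketch: a computational graph as a DAG, executed in topological order.
import numpy as np

graph = {                       # node -> downstream nodes (edges carry tensors)
    "matmul": ["add"],
    "add": ["relu"],
    "relu": [],
}

ops = {                         # the tensor operation performed at each node
    "matmul": lambda x: x @ np.ones((2, 2)),
    "add": lambda x: x + 1.0,
    "relu": lambda x: np.maximum(x, 0.0),
}

def topo_order(g):
    # Kahn's algorithm: emit a node once all of its predecessors are emitted.
    indegree = {n: 0 for n in g}
    for succs in g.values():
        for s in succs:
            indegree[s] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for s in g[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

def run(x):
    for node in topo_order(graph):  # acyclic: a tensor never feeds its producer
        x = ops[node](x)
    return x

out = run(np.array([[1.0, -3.0]]))  # matmul -> add -> relu
```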
(5) Operator splitting
Operator splitting refers to splitting the input tensors and output tensors of an operator.
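A minimal sketch of this definition, assuming a NumPy element-wise operator (ReLU): the input tensor is split into slices, the operator runs on each slice, and the concatenated results match the unsplit result.

```python
# Sketch: splitting an operator = slicing its input/output tensors.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.arange(-6.0, 6.0).reshape(4, 3)    # input tensor of the operator
slices = np.array_split(x, 2, axis=0)     # split along axis 0 into 2 slices
partial = [relu(s) for s in slices]       # run the operator on each slice
merged = np.concatenate(partial, axis=0)  # splice the per-slice outputs
```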
Fig. 1 is a schematic diagram of a deep learning compiler architecture according to an embodiment of the present application, and the deep learning compiler will be briefly described with reference to fig. 1.
A deep learning compiler can be divided into a compiler front end, a compiler middle end, and a compiler back end. The compiler front end interfaces with the application layer, that is, with the deep learning model, and includes a parser and the like. The parser of the compiler front end mainly converts models trained under different frameworks into an internal format that the hardware can recognize; for example, it converts the computational graph of a framework such as TensorFlow or Caffe2 into a computational graph in an internally recognizable format. The compiler middle end includes a graph optimizer (which may also be called a graph optimization module), operator information, and the like, and allocates different computing tasks to different computing resources (such as a CPU or a GPU) for subsequent model execution. The compiler back end mainly generates code instructions matched to different hardware automatically and includes an operator compiler, an operator library, and the like.
Deep learning compilers typically work at both the graph-optimization level and the operator-optimization level to improve the performance of models running on different devices. Graph optimization and operator optimization are relatively decoupled and independent: graph optimization is a generic optimization strategy, i.e., one independent of specific operator types, whereas operator optimization is an operator-specific optimization strategy, i.e., one that depends on specific operator types.
Typical operator-optimization strategies are computation optimization and scheduling (schedule) optimization, which allow a single specific operator to be optimized to the extreme on a specific hardware platform through manual or automatic tuning. For example, for general matrix-matrix multiplication (GEMM) operators, the schedule of the GEMM operator is typically optimized through manual scheduling techniques such as blocking (blocking), vectorization (vectorization), loop transformation, data reordering (packing), and multi-core multi-threaded parallelism (parallel), so that the GEMM operator obtains performance gains of tens of times on a CPU.
A typical graph-optimization strategy is constant folding: if all the input tensors that an operator depends on are constants, the operator node, being independent of runtime inputs, can be computed in advance at compile time, saving cost when the model runs.
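The constant-folding strategy can be sketched as follows; the graph representation and node names are assumptions for illustration, not the application's internal format.

```python
# Sketch: fold any node whose inputs are all compile-time constants.
import numpy as np

nodes = {
    "c1": {"op": None, "value": np.array([1.0, 2.0])},   # constant
    "c2": {"op": None, "value": np.array([3.0, 4.0])},   # constant
    "add": {"op": np.add, "inputs": ["c1", "c2"]},       # foldable at compile time
    "mul": {"op": np.multiply, "inputs": ["add", "x"]},  # "x" is a runtime input
}

def fold_constants(nodes):
    for name, node in nodes.items():
        if node["op"] is not None and all(
            inp in nodes and nodes[inp]["op"] is None for inp in node["inputs"]
        ):
            # Evaluate the node now and turn it into a precomputed constant.
            node["value"] = node["op"](*(nodes[inp]["value"] for inp in node["inputs"]))
            node["op"] = None
    return nodes

folded = fold_constants(nodes)
```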
Currently, there are many other graph-optimization strategies, such as graph partitioning and execution-order optimization, multi-die parallelism, multi-threaded parallelism, and chip-to-chip parallelism. All of these graph-optimization strategies need to be based on knowledge of how operators can be split; without such knowledge of the operators, these parallel optimization strategies cannot be expressed in the computational graph.
For example, graph partitioning and execution-order optimization is a graph-optimization strategy for reducing the memory bound of operator execution. Specifically, the outer loop iteration variables of operators are split uniformly and, based on this uniform splitting, the subsequent execution order of the operators is adjusted so that an operator can perform a large number of iteration operations locally. This reduces the operator's demand on memory, allows more of the intermediate results generated by local operations to be kept in the L2 cache, reduces the memory bound of subsequent operator execution, and ultimately optimizes the running performance of the overall network model. The manner in which operators are split is therefore particularly important.
As another example, consider multi-die parallelism. A die is an unpackaged chip, and advanced packaging technology is used to accumulate computing power. To fully exploit the performance of such chips, one operator may be split across multiple dies for computation, while data interaction between different dies is reduced as much as possible. The manner in which operators are split is therefore particularly important.
As another example, multi-threaded parallelism may use a subgraph as the basic operation unit, that is, a subgraph containing different operators serves as the basic operation unit. When a subgraph is allocated to multiple threads and executed in parallel, the operators need to be split; that is, when a subgraph is split across different computing resources and executed in parallel, for example on different CPUs, the operators in the subgraph need to be split. Because one round of data synchronization among multiple threads means that the execution of one subgraph has finished, interaction between different threads should be reduced as much as possible when multiple threads run in parallel. The splitting of operators in the subgraph is therefore also particularly important.
Fig. 2 is a schematic diagram of operator splitting provided in an embodiment of the present application. The tensors of the same operator may be split into different slices, and the different slices may run on different threads, different dies, or different chips. For example, as shown in fig. 2, the input tensor of operator 1 is split into slice 1 of operator 1 and slice 2 of operator 1, and the input tensor of operator 2 is split into slice 1 of operator 2 and slice 2 of operator 2. Slice 1 of operator 1 runs on computing resource 1, and its result is transferred to computing resource 2, which corresponds to slice 1 of operator 2, to run; meanwhile, slice 2 of operator 1 runs on computing resource 3, and its result is transferred to computing resource 4, which corresponds to slice 2 of operator 2, to run. Finally, the results on computing resource 2 and computing resource 4 are spliced together.
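The pipeline of fig. 2 can be sketched with threads standing in for computing resources; operators 1 and 2 here are invented stand-ins, and the two slices flow through both operators independently before being spliced.

```python
# Sketch of FIG. 2: slice, run the operator chain per resource, then splice.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def op1(x):  # stand-in for "operator 1"
    return x * 2.0

def op2(x):  # stand-in for "operator 2"
    return x + 1.0

x = np.arange(8.0).reshape(4, 2)
slices = np.array_split(x, 2, axis=0)            # slice 1 and slice 2

with ThreadPoolExecutor(max_workers=2) as pool:  # two "computing resources"
    results = list(pool.map(lambda s: op2(op1(s)), slices))

merged = np.concatenate(results, axis=0)         # splice the two results
```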
Most current graph-optimization schemes that involve operator splitting are based on knowledge of the operators themselves; for example, graph optimization needs to split operators based on the properties of operator loop iteration variables, while, architecturally, operator optimization and graph optimization must remain relatively decoupled. One approach is therefore for an algorithm engineer to manually classify the loop variables in the necessary operators and summarize the changes produced after each class of loop variable is split along its axis, thereby enabling efficient generation of strategies that assist graph optimization. However, such efficient generation of graph-optimization strategies depends on this manual classification of loop variables, so arbitrary operators cannot be split and executed automatically.
Currently, there is a method that implements parallel execution on multiple GPUs through operator splitting: the input tensors of an operator are split at the application layer based on the splittable axes (e.g., the sample axis, parameter axis, and attribute axis) of the operator output.
The sample axis, the parameter axis, and the attribute axis are three types of splittable axes on the operator output. Splitting the samples of the operator input tensors along the sample axis, i.e., splitting the operator input tensors in the sample dimension, and distributing the tensors split along the sample axis to different computing resources yields data parallelism. Splitting the parameters of the operator input tensors along the parameter axis, i.e., splitting the operator input tensors in the parameter dimension, and distributing the tensors split along the parameter axis to different computing resources yields model parallelism. The attribute axis is any axis in the operator output other than the sample axis and the parameter axis; splitting the operator input tensors along the attribute axis means splitting them in the attribute dimension.
According to these three axes, the operator input tensors can be split onto different computing resources for execution; they can be split along each of the three axes independently, or split along combinations of the three axes, achieving parallel execution on multiple computing resources. Although this splitting method can realize automatic operator splitting at the application layer to a certain extent, it still has limitations. First, operator splitting is currently performed only for matrix multiplication operators, according to the three axis types defined on the output tensor, and cannot cover all the splittable axes and splitting manners of an operator. Second, the definitions of these three axes are determined by the axis types of the operator output tensor; that is, if an axis is absent from the operator output tensor, splitting is not performed according to the actual splittable axes of the operator input tensors. The splitting of the operator input tensors is therefore coarse, and the input tensors cannot be precisely split and then distributed to different computing resources for execution. Finally, this method still defines the splitting axes and splitting manners at the application layer; that is, an algorithm engineer determines, via a scripting language, the splitting manner at the application layer according to the splitting axes included in a given operator type. Automatic splitting of the inputs and outputs of arbitrary operators therefore still cannot be realized, and complete decoupling of graph optimization and operator optimization cannot be achieved.
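For a matrix multiplication Y = X @ W, the sample-axis and parameter-axis splittings described above can be sketched as follows (a NumPy illustration under assumed shapes; the axis names follow the description above, not any specific framework):

```python
# Sketch: sample-axis split (data parallelism) vs. parameter-axis split
# (model parallelism) for Y = X @ W.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 8))    # samples lie along axis 0 of X
W = rng.random((8, 10))   # parameters lie along axis 1 of the output

# Data parallelism: split X along the sample axis; W is shared.
x_parts = np.array_split(X, 2, axis=0)
y_data = np.concatenate([xp @ W for xp in x_parts], axis=0)

# Model parallelism: split W along the parameter axis; X is shared.
w_parts = np.array_split(W, 2, axis=1)
y_model = np.concatenate([X @ wp for wp in w_parts], axis=1)
```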
In order to solve the above-mentioned problems, an embodiment of the present application proposes a method and an apparatus for processing a computing task, which will be described in detail below with reference to fig. 3 to 19.
Fig. 3 is a flowchart of a method for processing a computing task according to an embodiment of the present application.
S301, determining a first operator for executing a computing task, wherein the first operator comprises N separable axes, and N is a positive integer greater than or equal to 1.
It should be appreciated that saying the first operator includes N separable axes means that the N separable axes appear in the input tensors of the first operator.
S302, acquiring segmentation information of a first operator from an operator segmentation information base, wherein the segmentation information of the first operator comprises an axis type of an nth one of N separable axes in the first operator and first position information, and the first position information is used for indicating the position of the nth separable axis in an input tensor of the first operator, wherein n=1, … and N.
That is, the information included in the splitting information of the first operator may indicate that each of the N split axes has an axis type corresponding to itself in the first operator, and a detailed description will be given about different types of axes and splitting manners corresponding to the different types of axes with reference to fig. 7 to 17. The information included in the segmentation information of the first operators may also indicate where each of the separable axes will appear on the input tensors of the first operators and in what axes of those input tensors. For example, from the position information of the splittable axis 1 in the first operator, it can be known that the splittable axis 1 appears in the input tensors 1 and 2 of the first operator, and that the splittable axis 1 appears on the 0 axis in the input tensor 1, and that the splittable axis 1 appears on the 0 axis in the input tensor 2.
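One hypothetical shape such an entry in the operator splitting information base could take is shown below; the field names and dictionary layout are assumptions for illustration, not the application's format. The point is that a graph optimizer can read axis types and positions without knowing the operator's mathematical semantics.

```python
# Hypothetical splitting-information entry for a matmul-like operator:
# Y[i, j] = sum over k of X[i, k] * W[k, j].
split_info_matmul = {
    "operator": "matmul",
    "axes": {
        "i": {"type": "element",   "positions": {"X": 0}},          # axis 0 of X
        "j": {"type": "element",   "positions": {"W": 1}},          # axis 1 of W
        "k": {"type": "reduction", "positions": {"X": 1, "W": 0}},  # in both inputs
    },
}

# E.g., splittable axis "k" appears on axis 1 of input X and axis 0 of input W.
k_positions = split_info_matmul["axes"]["k"]["positions"]
```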
S303, according to the segmentation information of the first operator, the input tensor of the first operator is segmented to obtain K groups of input tensors, wherein K is a positive integer greater than or equal to 2.
It will be appreciated that the number of input tensors included in each of the K sets of input tensors is the same as the number of input tensors included by the first operator.
As a possible implementation manner, according to the segmentation information of the first operator and the number of computing resources M, the input tensor of the first operator is segmented to obtain K groups of input tensors, where M is a positive integer greater than or equal to 2.
Although the number of available computing resources is M, the graph optimizer does not need to use all of them; for example, the required number of target computing resources K may be estimated according to the size of the computing task, or may be determined randomly, which is not limited in the embodiments of the present application.
It should also be appreciated that each of the K sets of input tensors is the input tensor required for each target computing resource, e.g., if a single computing resource for completing a computing task requires a input tensors before the input tensor of the first operator is split, then a input tensors are also required for each target computing resource for completing a computing task after the input tensor of the first operator is split.
As a possible implementation manner, a target segmentation axis is determined according to segmentation information of the first operator, and input tensors of the first operator are segmented according to the target segmentation axis, so as to obtain K groups of input tensors. The corresponding operator splitting scheme flow for completing a computing task using a single operator will be described in detail below in conjunction with fig. 4.
It should be noted that splitting the input tensors of the first operator does not mean splitting all of them: only the input tensors that include the target splitting axis are split, while the input tensors that do not include the target splitting axis are sent to each target computing resource as shared input data.
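This rule can be sketched as follows (an assumed NumPy example in which the target splitting axis appears only in input X, on its axis 0):

```python
# Sketch: split only inputs containing the target axis; share the rest.
import numpy as np

inputs = {"X": np.ones((6, 8)), "W": np.ones((8, 10))}
target_axis_positions = {"X": 0}   # the target axis appears only in X, on axis 0
K = 3                              # number of target computing resources

groups = []
for k in range(K):
    group = {}
    for name, tensor in inputs.items():
        if name in target_axis_positions:
            axis = target_axis_positions[name]
            group[name] = np.array_split(tensor, K, axis=axis)[k]  # k-th piece
        else:
            group[name] = tensor   # shared, un-split input data
    groups.append(group)
```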
As a possible implementation manner, if the second operator is further needed for executing the computing task, determining a candidate segmentation mode space according to the segmentation information of the first operator, the segmentation information of the second operator and the computing resource quantity M, determining a target segmentation mode according to the candidate segmentation mode space, and segmenting the input tensor of the first operator according to the target segmentation mode to obtain K groups of input tensors. The corresponding split-mode flow for completing a computing task using multiple operators will be described in detail below in conjunction with fig. 5.
S304, respectively sending the K groups of input tensors to the K target computing resources so that the K target computing resources complete the computing tasks.
It should be appreciated that the number of target computing resources K is determined based on the number of computing resources M.
In the embodiment of the application, the graph optimizer obtains the operator segmentation information from the operator segmentation information base, and as the segmentation information of each operator can be directly obtained from the operator segmentation information base, the graph optimizer can realize automatic segmentation of the input tensor of each operator without sensing the mathematical semantics and bottom realization of each operator, thereby realizing complete decoupling of graph optimization and operator optimization.
Fig. 4 is a flow chart of a corresponding operator segmentation method for completing a computing task by using a single operator according to an embodiment of the present application. Fig. 4 is a specific illustration of one possible implementation of S303.
S401, determining a target dividing axis, wherein the target dividing axis is one of N separable axes.
As a possible implementation, the graph optimizer randomly picks a segmentable axis as the target segmentation axis, e.g., takes the first axis of the input tensor of the first operator, which may be the batch (batch) axis, as the target segmentation axis.
As one possible implementation, the graph optimizer selects, as the target segmentation axis, the splittable axis that appears in the largest number of input tensors of the first operator. For example, if the first operator has 3 input tensors, splittable axis 1 appears in all 3 input tensors, and splittable axis 2 appears in only 2 of them, splittable axis 1 may be taken as the target segmentation axis.
As one possible implementation, the splittable axis whose corresponding segmentation manner completes the computing task in the shortest operation time is selected as the target segmentation axis.
As one possible implementation, the target segmentation axis is determined according to both the operation time required to complete the computing task under the segmentation manner corresponding to each splittable axis and the number of target computing resources K. For example, if completing the computing task with the segmentation manner corresponding to splittable axis 1 and b target computing resources takes the same operation time as completing it with the segmentation manner corresponding to splittable axis 2 and c target computing resources, but the number b of target computing resources corresponding to splittable axis 1 is smaller than the number c corresponding to splittable axis 2, then splittable axis 1 is selected as the target segmentation axis and the number of target computing resources is set to b.
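The most-common-axis heuristic among the implementations above can be sketched as follows (the axis names and the position-table representation are assumptions):

```python
# Sketch: pick as target the splittable axis appearing in the most inputs.
axis_positions = {
    "axis1": {"in1": 0, "in2": 0, "in3": 1},  # appears in 3 input tensors
    "axis2": {"in1": 1, "in3": 0},            # appears in 2 input tensors
}

target_axis = max(axis_positions, key=lambda a: len(axis_positions[a]))
```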
S402, determining a segmentation mode corresponding to the axis type of the target segmentation axis in the first operator according to segmentation information of the first operator.
S403, according to the splitting mode of the target splitting axis corresponding to the axis type in the first operator, splitting the input tensor of the first operator to obtain K groups of input tensors.
As a possible implementation manner, according to a segmentation manner corresponding to an axis type of the target segmentation axis in the first operator, determining Q first input tensors including the target segmentation axis in the first operator and positions of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1.
It should be appreciated that the Q first input tensors are input tensors of the first operator including the target split axis.
It will be appreciated that the position of the target split axis in each of the Q first input tensors represents the axis of the target split axis in each first input tensor, e.g., the target split axis is on the 0 axis of the 1 st first input tensor and on the 0 axis of the 2 nd first input tensor.
As a possible implementation manner, each of the Q first input tensors is split according to the axis type of the target split axis in the first operator and the number K of target computing resources, so as to obtain Q groups of second input tensors, where each group of second input tensors in the Q groups of second input tensors includes K second input tensors.
It should be understood that the Q-th group of second input tensors in the Q-th group of second input tensors is a segmentation result of the Q-th first input tensor in the Q-th first input tensor into K, where q=1, …, Q.
It should be appreciated that when the number of target computing resources is K, each of the first input tensors including the target split axis is split according to the target split axis into K second input tensors, which are respectively the input tensors of the K target computing resources, and thus Q groups of second input tensors are formed when the first input tensors including the target split axis are Q.
It should be understood that the axis types of the target segmentation axis may be an element (element) axis, a sliding window (sliding window) axis, or a reduction (reduction) axis; the segmentation manners of these axis types will be described in detail with reference to fig. 7 to 13.
As a possible implementation, K sets of input tensors are obtained from the Q sets of second input tensors and the input tensors of the first operator that are not split.
The k-th group of input tensors among the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors, together with the input tensors of the first operator that were not segmented.
It should be appreciated that each of the K sets of input tensors includes an input tensor that is an un-segmented first operator of the shared data and a second input tensor of the segmented first operator corresponding to each target computing resource.
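The grouping described in S403 can be sketched as follows (an assumed NumPy example with Q = 2 split first input tensors and K = 2 target computing resources): each of the Q first input tensors is split into K second input tensors, and the k-th input group takes the k-th second tensor from every one of the Q groups.

```python
# Sketch: Q first input tensors -> Q groups of K second input tensors
# -> K input groups (one per target computing resource).
import numpy as np

K = 2
first_inputs = [np.arange(8.0).reshape(4, 2),   # q = 1, split along axis 0
                np.arange(6.0).reshape(2, 3)]   # q = 2, split along axis 1
split_axes = [0, 1]                             # target-axis position per input

# Q groups of second input tensors; group q holds the K pieces of input q.
second_groups = [np.array_split(t, K, axis=ax)
                 for t, ax in zip(first_inputs, split_axes)]

# K input groups; group k holds the k-th piece from each of the Q groups.
input_groups = [[group[k] for group in second_groups] for k in range(K)]
```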
Fig. 5 is a flowchart of another method for processing a computing task according to an embodiment of the present application. Fig. 5 is a specific illustration of another possible implementation of S303.
When the operator for performing the computing task further includes a second operator, the second operator includes P separable axes, the P separable axes being a subset of the N separable axes.
S501, acquiring segmentation information of a second operator from an operator segmentation information base, wherein the segmentation information of the second operator comprises an axis type of a P-th cuttable axis in the second operator and second position information, the second position information is used for indicating the position of the P-th cuttable axis in an input tensor of the second operator, the input tensor of the second operator is an output tensor of the first operator, P is a positive integer which is greater than or equal to 1 and less than or equal to N, and p=1, … and P.
It should be appreciated that the P separable axes being a subset of the N separable axes means that the P separable axes appear in the output tensor of the first operator, and the output tensor of the first operator serves as the input tensor of the second operator. That is, each of the P separable axes of the second operator also appears among the N separable axes of the first operator.
S502, determining P segmentation reference information according to segmentation information of a first operator and segmentation information of a second operator, wherein P segmentation reference information in the P segmentation reference information comprises: the type of axis of the p-th splittable axis in the first operator, the type of axis of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator.
S503, determining P groups of candidate segmentation modes according to the P segmentation reference information and the calculated resource quantity M, wherein the P-th group of candidate segmentation modes in the P groups of candidate segmentation modes comprises at least one segmentation mode.
The segmentation mode included in the P-th group candidate segmentation mode is determined according to the P-th segmentation reference information in the P segmentation reference information and the computing resource quantity M.
It should be understood that each group of candidate segmentation modes corresponds to one piece of segmentation reference information, that is, to the segmentation reference information of one of the P separable axes. Each group of candidate segmentation modes includes at least one segmentation mode; for example, each group may include M-1 segmentation modes: when the number of computing resources is 4, the number of target computing resources may be 2, 3, or 4, i.e., there are 3 possible target computing resource counts, so each group of candidate segmentation modes includes 3 segmentation modes.
S504, determining a target segmentation mode according to the time required by each segmentation mode in the P group of candidate segmentation modes to complete the calculation task.
As one possible implementation manner, the segmentation mode with the shortest time required for completing the calculation task in the P groups of candidate segmentation modes is determined as the target segmentation mode.
Specifically, when the total number of segmentation modes in the P groups of candidate segmentation modes is small, the P groups are traversed to obtain the time required to complete the computing task under every candidate segmentation mode, and the segmentation mode with the shortest time is selected as the target segmentation mode. The traversal may be performed through simulation, theoretical calculation, or running on actual hardware, which is not limited in the embodiments of the present application.
Specifically, when the total number of segmentation modes in the P groups of candidate segmentation modes is large, the target segmentation mode is searched for within the P groups of candidate segmentation modes. Various search methods may be used, such as a Markov chain Monte Carlo algorithm or a genetic algorithm; the search method is not limited in the embodiments of the present application.
As a possible implementation, the target segmentation mode is determined according to both the time required by each segmentation mode in the P groups of candidate segmentation modes to complete the computing task and the number of target computing resources K. For example, if completing the computing task with segmentation mode 1 and d target computing resources takes the same operation time as completing it with segmentation mode 2 and e target computing resources, but the number d of target computing resources corresponding to segmentation mode 1 is smaller than the number e corresponding to segmentation mode 2, then segmentation mode 1 is selected as the target segmentation mode and the number of target computing resources is set to d.
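The selection rules of S504 can be sketched as follows (the candidate list and its timing numbers are made up for illustration): choose the candidate with the shortest estimated task time, and on a tie prefer the one using fewer target computing resources.

```python
# Sketch: pick the target segmentation mode by (time, resource count).
candidates = [
    {"axis": "axis_a", "resources": 2, "time": 5.0},
    {"axis": "axis_a", "resources": 4, "time": 3.0},
    {"axis": "axis_b", "resources": 2, "time": 3.0},  # ties on time, fewer resources
]

best = min(candidates, key=lambda c: (c["time"], c["resources"]))
```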
S505, according to the target segmentation mode, the input tensor of the first operator is segmented, and K groups of input tensors are obtained.
Fig. 6 is a flow chart of an operator splitting manner corresponding to a computing task completed by multiple operators according to an embodiment of the present application, and S505 will be specifically described with reference to fig. 6.
S601, determining a target segmentation axis, an axis type of the target segmentation axis in a first operator, an axis type of the target segmentation axis in a second operator, Q first input tensors including the target segmentation axis in the first operator and positions of the target segmentation axis in each of the Q first input tensors according to a target segmentation mode, wherein Q is a positive integer greater than or equal to 1.
It should be understood that, in S601, the explanation of Q first input tensors is similar to that in S402, and for brevity, reference may be made specifically to the description in S402, which is not repeated here.
S602, respectively segmenting each first input tensor in the Q first input tensors according to the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number K of target computing resources, to obtain Q groups of second input tensors, wherein each group of second input tensors in the Q groups of second input tensors comprises K second input tensors, and the q-th group of second input tensors in the Q groups of second input tensors is a segmentation result obtained by segmenting the q-th first input tensor in the Q first input tensors into K parts.
It should be understood that the explanation of the Q-group second input tensor in S602 is similar to that in S403, and for brevity, reference may be made specifically to the description in S403, which is not repeated here.
It should be noted that, in S602, the obtaining of the Q-group second input tensor needs to be based on the axis type of the target split axis in the first operator and the axis type in the second operator, and a specific split manner will be illustrated in connection with fig. 15.
S603, obtaining K groups of input tensors according to the Q groups of second input tensors and the input tensors of the first operator which is not segmented.
The k-th group of input tensors in the K groups of input tensors comprises the k-th second input tensor in each group of second input tensors in the Q groups of second input tensors and the input tensors of the first operator that are not segmented.
It should be understood that, in S603, the explanation of the K groups of input tensors is similar to that in S404, and for brevity, reference may be made specifically to the description in S404, which is not repeated here.
It should be noted that completing the computing task may involve more operators. In this embodiment of the present application, the computing task completed by the first operator and the second operator is taken as an example for detailed description; when an operator other than the first operator and the second operator is further required for completing the computing task, the graph optimizer needs to obtain segmentation information of the other operators, so as to obtain candidate segmentation modes and thereby determine the target segmentation mode.
The axis type of the operator input tensor, the position information of the splittable axis in the input tensor and the output tensor of the operator, and the operator splitting modes corresponding to different axis types in the embodiment of the present application will be described in detail below with reference to fig. 7 to 17.
The axis type describes the data dependency relationship between the operator input tensor and the output tensor, that is, the graph optimizer can determine the segmentation mode corresponding to the axis type according to the axis type of the input tensor. Thus, when the inputs of different operators include the same axis type, the same operator splitting mode may apply.
As one possible implementation, the axis types of the operator input tensor may include an element axis, a reduction axis, a sliding window axis, and the like, as well as other types of splittable axes, which are not limited in the embodiments of the present application.
The element axis, the reduction axis, and the sliding window axis will be specifically described with reference to fig. 7 to fig. 13. It should be noted that fig. 7 to fig. 13 are schematic diagrams of operator splitting modes corresponding to the completion of a calculation task by a single operator, and in fig. 7 to fig. 13, each of operator A, operator B, and operator C may represent the first operator; the embodiment of the present application is not limited by the name of the first operator.
Element (elementwise) axis: if a certain iteration variable in the input tensor of operator A is an element axis, the element axis is an axis on which there is an element-wise mapping relationship between the input tensor and the output tensor of operator A; that is, a point in the output tensor and the point of the input tensor on which it depends have the same position on that axis. For example, the shape of the input tensor is a four-dimensional tensor (5,7,9,3), wherein the 3 axis of the input tensor has a length of 3 and includes data a0, a1, and a2; the shape of the output tensor is (4,6,8,3), wherein the 3 axis of the output tensor has a length of 3 and includes data b0, b1, and b2. The positions of a0 and b0 correspond, the positions of a1 and b1 correspond, and the positions of a2 and b2 correspond, so the axis type of the 3 axis of the input tensor and the output tensor is the element axis.
Fig. 7 is a schematic diagram of a splitting manner of an element axis according to an embodiment of the present application. The step of splitting the operator a input tensor according to the element axis is shown in fig. 7. In fig. 7, operator a is illustrated as an activation function operator. The embodiment of the application does not limit the type of the operator A. It should be noted that, in fig. 7, the input tensor and the output tensor of the activation function operator are described by taking a single input tensor and a single output tensor as an example, and the number of the input tensor and the number of the output tensor of the operator are not limited in the embodiment of the present application.
Specifically, the type of the target split axis in the activation function operator is the element axis, and according to the position information of the target split axis in the activation function operator, it can be determined that the element axis appears on the 0 axis of the first input tensor of the activation function operator, that is, the 0 axis with the length of 8 is the element axis. Based on the length of the element axis of the first input tensor, the length of the element axis of the first output tensor is obtained through the forward shape derivation function y=f1(x) of the element axes of the first input tensor and the first output tensor, where x represents the length of the element axis of the first input tensor and y represents the length of the element axis of the first output tensor. The forward derivation logic of the element axis is that the lengths of the element axes of the first input tensor and the first output tensor are equal. As shown in fig. 7 (a), the first input tensor of the activation function operator is (8,56,56,64), the 0 axis of the first input tensor is the element axis, and according to the logic that the length of the output tensor element axis is equal to the length of the input tensor element axis, the length of the 0 axis of the first output tensor is also 8, that is, the first output tensor is (8,56,56,64).
According to the number of target computing resources, the first output tensor is segmented according to the element axis to obtain the second output tensor of the activation function operator on each target computing resource. Then, based on the element axis length corresponding to the second output tensor of the operator on each computing resource, the element axis length of each second input tensor is inversely derived through the inverse shape derivation function x=f1⁻¹(y) of the element axis. When the first input tensor is segmented according to the element axis, a segmentation (split) function is needed, and after the operation on the different computing resources is finished, the second output tensors on the different target computing resources need to pass through a splicing (concat) function to obtain the first output tensor.
As shown in fig. 7 (b), there are two target computing resources for the activation function operator operation, and the 0-axis length of the first output tensor is 8, so the length of the element axis of the second output tensor on each computing resource is 4, i.e., the 1st second output tensor is (4,56,56,64) and the 2nd second output tensor is (4,56,56,64), and a splicing function is used between the second output tensors and the first output tensor to synchronize data. The elements on the 0 axis of the 1st second output tensor and the elements on the 0 axis of the 2nd second output tensor do not intersect. Then, the length of the element axis of each second input tensor is inversely deduced from the inverse shape derivation function of the element axis, that is, the 1st second input tensor is (4,56,56,64) and the 2nd second input tensor is (4,56,56,64). Therefore, to obtain second input tensors of these shapes, the graph optimizer calls the first slicing function to slice the first input tensor according to the element axis, that is, according to the 0 axis of the first input tensor, to obtain the two second input tensors.
Note that, the lengths of the 0 axes in the two second output tensors shown in fig. 7 (b) are equal, the lengths of the 0 axes in the two second input tensors are also equal, and the second input tensors corresponding to different target computing resources only need to satisfy the following conditions: the elements corresponding to the 0 axes of the two second input tensors have no overlapping parts, i.e. the elements corresponding to the 0 axes of the two second input tensors are a subset of the elements corresponding to the 0 axes of the first input tensor and have no intersection, and in addition, the union of the elements corresponding to the 0 axes of the two second input tensors is the element corresponding to the 0 axes of the first input tensor. The embodiment of the application does not limit whether the lengths corresponding to the 0 axes of the second input tensors obtained on different computing resources are equal or not.
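The element-axis splitting of fig. 7 can be sketched in one dimension as follows. This is a minimal pure-Python illustration with hypothetical helper names (`relu`, `split`, `concat` standing in for the activation function operator, the first slicing function, and the splicing function); the element axis length of 8 and the two target computing resources follow the figure.

```python
def relu(t):
    # Stands in for the activation function operator (element-wise mapping).
    return [max(x, 0.0) for x in t]

def split(t, k):
    # The first slicing function along the element (0) axis, into k equal parts.
    n = len(t) // k
    return [t[i * n:(i + 1) * n] for i in range(k)]

def concat(parts):
    # The splicing (concat) function that synchronizes the second output tensors.
    return [x for p in parts for x in p]

first_input = [-3.0, 1.0, -2.0, 4.0, 5.0, -6.0, 7.0, -8.0]  # element axis length 8
second_inputs = split(first_input, 2)                        # two target computing resources
second_outputs = [relu(t) for t in second_inputs]            # independent operator runs
first_output = concat(second_outputs)
assert first_output == relu(first_input)                     # equivalent to the unsplit computation
```

Because each output element depends only on the input element at the same position on the element axis, splitting, operating, and splicing reproduce the unsplit result exactly.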
Reduction axis: if a certain iteration variable in the input tensor of operator B is a reduction axis, the reduction axis is an axis that exists in the input tensor of the operator but either does not exist in the output tensor of the operator or has a length of 1 therein.
In particular, reduction axes can be further divided into two types. The first type of reduction axis is one on which operator B performs a reduction operation on the elements in the input tensor. For example, if the shape of the input tensor of operator B is (2,3,4,5), where the 0 axis of the input tensor is the reduction axis with a length of 2, then the shape of the output tensor obtained after the input tensor has undergone the operation of operator B is (3,4,5) or (1,3,4,5).
The second type of reduction axis is one on which operator B does not perform a reduction operation on the elements in the input tensor; although operator B does not perform a reduction operation on the elements on the second type of reduction axis, the axis still does not appear in the output tensor but only in the input tensor. An example is the reduction-gather axis, which will be described in detail in connection with fig. 11.
The first type of reduction axis may include a reduction sum axis, a reduction maximum (reduction max) axis, a reduction minimum (reduction min) axis, a reduction mean axis, and the like. It should be noted that these different types of reduction axes share the general features of the reduction axis; they differ only in the types of functions that need to be called so that, on different target computing resources, the split first input tensors passing through operator B yield a first output tensor equivalent to that before splitting. The specific splitting manners of the different first-type reduction axes will be specifically described below with reference to fig. 8 to 10.
FIG. 8 is a schematic diagram of a reduction sum axis segmentation method according to an embodiment of the present application. The step of splitting the reduction sum axis in the first input tensor of operator B is shown in fig. 8. In fig. 8, operator B is illustrated as the integrated sum operator. The embodiment of the application does not limit the type of operator B. It should be noted that the input tensor and the output tensor of the integrated sum operator in fig. 8 are described by taking a single input tensor and a single output tensor as an example, and the number of input tensors and the number of output tensors of the operator are not limited in the embodiment of the present application.
Specifically, the type of the target segmentation axis in the integrated sum operator is the reduction sum axis, and according to the position information of the target segmentation axis in the integrated sum operator, it can be determined that the target segmentation axis is the reduction sum axis appearing on the 0 axis of the first input tensor of the integrated sum operator. As shown in fig. 8 (a), the first input tensor of the integrated sum operator is (8,56,56,64), and the axis type of the 0 axis of the first input tensor is the reduction sum axis, that is, the 0 axis of length 8 is the reduction sum axis. Thus, according to the characteristics of the reduction sum axis, the length of the reduction sum axis of the first output tensor is 1, i.e., the first output tensor is (1,56,56,64).
According to the number of target computing resources and the length of the reduction sum axis in the first output tensor, the first input tensor is divided into two second input tensors according to the reduction sum axis by calling the third segmentation function, the two second input tensors are sent to the two target computing resources for operation to obtain two second output tensors, where the 1st second output tensor is (1,56,56,64) and the 2nd second output tensor is (1,56,56,64), and the data on the two target computing resources are synchronized by calling an addition (AddN) function to obtain the first output tensor. As shown in fig. 8 (b), there are two available computing resources for the integrated sum operator operation, and the length of the reduction sum axis of the first output tensor is 1; the second output tensor of the operator on each computing resource is obtained through the integrated sum operator, and the shape of each second output tensor is the same as that of the first output tensor, (1,56,56,64). Because there are two computing resources, the reduction sum axis of the first input tensor is split by the splitting operator to obtain the second input tensors, wherein the length of the reduction sum axis of each second input tensor is 4, i.e., the shape of each second input tensor is (4,56,56,64).
Note that, the lengths of the 0 axes in the two second output tensors shown in fig. 8 (b) are equal, the lengths of the 0 axes in the two second input tensors are also equal, and the second input tensors corresponding to different target computing resources only need to satisfy the following conditions: the elements corresponding to the 0 axes of the two second input tensors have no overlapping parts, i.e. the elements corresponding to the 0 axes of the two second input tensors are a subset of the elements corresponding to the 0 axes of the first input tensor and have no intersection, and in addition, the union of the elements corresponding to the 0 axes of the two second input tensors is the element corresponding to the 0 axes of the first input tensor. The embodiment of the application does not limit whether the lengths corresponding to the 0 axes of the second input tensors obtained on different computing resources are equal or not.
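The reduction sum splitting of fig. 8 can be sketched in one dimension as follows. This is a minimal pure-Python illustration with hypothetical names; `split` stands in for the third segmentation function along the reduction sum axis, and the final `sum` of partial results stands in for the addition (AddN) synchronization function.

```python
def split(t, k):
    # The third segmentation function along the reduction sum axis, into k equal parts.
    n = len(t) // k
    return [t[i * n:(i + 1) * n] for i in range(k)]

first_input = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # reduction sum axis length 8
second_inputs = split(first_input, 2)                     # two target computing resources
second_outputs = [sum(t) for t in second_inputs]          # partial reduction sum per resource
first_output = sum(second_outputs)                        # AddN synchronization of partial sums
assert first_output == sum(first_input)                   # equivalent to the unsplit computation
```

Summing the partial sums is exactly what makes the split computation equivalent to the unsplit reduction, which is why the reduction sum axis requires the addition function as its synchronization node.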
Fig. 9 is a schematic diagram of a splitting manner of a reduction maximum axis according to an embodiment of the present application. The step of slicing the reduction maximum axis of the first input tensor of operator B is shown in fig. 9. In fig. 9, an operator B is exemplified as an integrated maximum operator. The embodiment of the application does not limit the type of the operator B. It should be noted that, in fig. 9, the input tensor and the output tensor of the integrated maximum operator are described by taking a single input tensor and a single output tensor as an example, and the number of the input tensor and the number of the output tensor of the operator are not limited in the embodiment of the present application.
For a first input tensor including the reduction maximum axis, when the first input tensor of the integrated maximum operator is split, the main steps are the same as the overall steps of splitting the first input tensor of the integrated sum operator according to the reduction sum axis in fig. 8; for brevity, reference may be made to the description of those steps in fig. 8, which is not repeated here.
It should be noted that, as shown in fig. 9, the difference lies in the type of function that needs to be called after the target segmentation axis of the first input tensor is split: for the integrated sum operator, the second output tensors obtained by operating on the second input tensors on different target computing resources are synchronized by calling the addition function, whereas for the integrated maximum operator, the maximum operator operation is performed on the second input tensors on the different computing resources, and the first output tensor is obtained by calling the maximum function for data synchronization.
For the first input tensor including the reduction minimum axis, when the first input tensor of the integrated minimum operator is split according to the reduction minimum axis, the main steps are generally the same as the overall process of splitting the first input tensor of the integrated sum operator according to the reduction sum axis in fig. 8, and the description of the process of splitting the first input tensor according to the reduction sum axis in fig. 8 will be omitted here.
It should be noted that the difference likewise lies in the type of function that needs to be called after the target segmentation axis of the first input tensor is split: for the integrated sum operator, the second output tensors obtained by operating on the second input tensors on different target computing resources are synchronized by calling the addition function, whereas for the integrated minimum operator, the minimum operator operation is performed on the second input tensors on the different computing resources, and the first output tensor is obtained by calling the minimum function for data synchronization.
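The reduction maximum and reduction minimum cases of fig. 9 can be sketched in one dimension as follows. This is a minimal pure-Python illustration with hypothetical names: the split step is identical to the reduction sum case, and only the synchronization function changes, a maximum (or minimum) function instead of the addition function.

```python
def split(t, k):
    # Same segmentation along the reduction axis as in the reduction sum case.
    n = len(t) // k
    return [t[i * n:(i + 1) * n] for i in range(k)]

first_input = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
parts = split(first_input, 2)                      # two target computing resources
max_output = max(max(p) for p in parts)            # maximum function synchronizes partial maxima
min_output = min(min(p) for p in parts)            # minimum function synchronizes partial minima
assert max_output == max(first_input)
assert min_output == min(first_input)
```

Max and min are associative over partitions of the axis, so synchronizing the partial results with the same reduction recovers the unsplit output; a partial sum synchronized by max would not, which is why each reduction axis type prescribes its own synchronization function.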
Fig. 10 is a schematic diagram of a segmentation mode of the reduction mean axis according to an embodiment of the present application. The step of slicing the input tensor of operator B according to the reduction mean axis is shown in fig. 10. In fig. 10, operator B is illustrated as the integrated average operator. The embodiment of the application does not limit the type of operator B. It should be noted that, in fig. 10, the input tensor and the output tensor of the integrated average operator are described by taking a single input tensor and a single output tensor as an example, and the number of input tensors and the number of output tensors of the operator are not limited in the embodiment of the present application.
For the first input tensor including the reduction average axis, when the first input tensor of the integration average operator is split, the main steps are the same as the overall steps of fig. 8 for splitting the first input tensor of the integration sum operator according to the reduction sum axis, and the description of the steps of splitting the first input tensor according to the reduction sum in fig. 8 will be omitted here.
It should be noted that, as shown in fig. 10, after the reduction mean axis of the first input tensor is split, the functions that need to be called differ from those in the case where the reduction sum axis is split: the second output tensors of the integrated average operator operation on the second input tensors on different target computing resources are synchronized by calling the addition function to obtain an intermediate output tensor, and the multiplication function then needs to be called to obtain the first output tensor. The addition function is a synchronization node that sums the second output tensors of the different computing resources, and the multiplication function multiplies the summed intermediate output tensor by 1/group to obtain the first output tensor, where group is the number of target computing resources; for example, in fig. 10, there are 2 target computing resources, so group is 2.
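The reduction mean splitting of fig. 10 can be sketched in one dimension as follows. This is a minimal pure-Python illustration with hypothetical names: each resource computes the mean of its slice, the addition function sums the partial means into an intermediate result, and the multiplication by 1/group yields the first output tensor. Note this equivalence holds because the slices here have equal length.

```python
def split(t, k):
    # Segmentation along the reduction mean axis into k equal-length parts.
    n = len(t) // k
    return [t[i * n:(i + 1) * n] for i in range(k)]

group = 2                                              # number of target computing resources
first_input = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
partial_means = [sum(p) / len(p) for p in split(first_input, group)]
intermediate = sum(partial_means)                      # addition-function synchronization node
first_output = intermediate * (1.0 / group)            # multiplication function: scale by 1/group
assert first_output == sum(first_input) / len(first_input)
```

The mean of equal-length halves averaged together equals the global mean; this is why the reduction mean axis needs both the addition function and the extra 1/group multiplication, unlike the reduction sum axis.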
The second type of reduction axis includes the reduction-gather axis: operator B indexes the data elements of its data input tensor according to the addresses indicated by the elements of its index input tensor. That is, when the first input tensor includes a reduction-gather axis, the corresponding data needs to be found along the reduction-gather axis of the first input tensor according to the addresses in the first index input tensor (indices), as the data of the 0 axis of the first output tensor. Fig. 11 is a schematic diagram of a segmentation mode of the reduction-gather axis according to an embodiment of the present application, taking operator B as a gather operator as an example; the type of operator B is not limited in the embodiment of the application. It should be noted that, in fig. 11, the input tensors of the gather operator are taken as one index input tensor and one first input tensor, and the output tensor of the gather operator is taken as one first output tensor, as an example; the number of first input tensors and the number of first output tensors of the gather operator are not limited in the embodiment of the present application.
Specifically, as shown in fig. 11 (a), the gather operator has two input tensors, namely a first input tensor and a first index input tensor, wherein the first input tensor is the data input tensor, with shape (80,64), and the first index input tensor is the input tensor including the index addresses, with shape (20). According to the segmentation information of the gather operator, it is determined that the target segmentation axis is the reduction-gather axis, which appears on the 0 axis of the first input tensor. According to the characteristics of the reduction-gather axis, the data of the 0 axis of the first output tensor is found as the corresponding data elements in the 0 axis of the first input tensor according to the index addresses in the first index input tensor, so the shape of the first output tensor is (20,64).
According to the number of target computing resources and the length of the reduction-gather axis of the first input tensor, the first input tensor is segmented according to the reduction-gather axis by calling the third segmentation function, so that two second input tensors are obtained. Each target computing resource is provided with a corresponding second input tensor, and the first index input tensor is biased by calling a biasing function. Each target computing resource then passes through the gather operator to obtain its respective second output tensor, and the second output tensors on the different target computing resources are added and synchronized by calling the addition function to obtain the first output tensor. As shown in fig. 11 (b), there are two target computing resources for the gather operator operation; the first output tensor does not retain the reduction-gather axis, and the length of the 0 axis of the first output tensor is equal to the length of the 0 axis of the first index input tensor. Because there are two computing resources, the second input tensors are obtained by slicing the reduction-gather axis of the first input tensor by calling the third segmentation function, wherein the length of the reduction-gather axis of each second input tensor is 40, that is, the shape of the 1st second input tensor is (40,64) and the shape of the 2nd second input tensor is (40,64). When each computing resource performs the gather operator operation, it obtains the same first index input tensor; since the gather operator on each computing resource only obtains half of the data on the reduction-gather axis of the first input tensor, namely the 1st second input tensor or the 2nd second input tensor, the first index input tensor must undergo the bias operator operation to ensure the accuracy of the second output tensor after the gather operator operation on each computing resource.
Since the first input tensor is divided into two parts for the gather operator operation, when the gather operator on each computing resource searches along the reduction-gather axis of its second input tensor according to the addresses in the first index input tensor to obtain the data on the 0 axis of its second output tensor, a situation may occur in which the data does not exist in that slice; in this case, 0 is used as the search result. Finally, the second output tensors on the two computing resources that have undergone the gather operator operation are subjected to the addition operator operation to obtain the first output tensor.
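The reduction-gather splitting of fig. 11 can be sketched in one dimension as follows. This is a minimal pure-Python illustration with hypothetical names and small invented sizes (data axis length 8 instead of 80): each resource gathers from its own slice with biased indices, uses 0 when an address falls outside its slice, and the addition operator merges the two partial results.

```python
def gather_or_zero(data_slice, indices, offset):
    # Gather within one resource's slice; the bias/offset step maps global
    # addresses to this slice, and out-of-range addresses yield 0.
    out = []
    for addr in indices:
        local = addr - offset
        out.append(data_slice[local] if 0 <= local < len(data_slice) else 0)
    return out

data = [10, 11, 12, 13, 14, 15, 16, 17]              # reduction-gather axis length 8
indices = [5, 0, 7, 2]                               # first index input tensor
half = len(data) // 2
part0 = gather_or_zero(data[:half], indices, 0)      # resource 1 holds elements 0..3
part1 = gather_or_zero(data[half:], indices, half)   # resource 2 holds elements 4..7
first_output = [a + b for a, b in zip(part0, part1)] # addition-operator synchronization
assert first_output == [data[i] for i in indices]    # equivalent to the unsplit gather
```

Each address hits exactly one slice, so one partial result holds the true value and the other holds 0, and their element-wise sum reproduces the unsplit gather.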
In the embodiment of the application, since the specific segmentation mode is determined according to the type of the reduction axis, during graph optimization the graph optimizer can reasonably segment the input tensor of a specific operator that includes a reduction axis without relying on the internal principle of that specific operator.
Sliding window (sliding window) axis: if a certain iteration variable in the input tensor of operator C is a sliding window axis, the sliding window axis is an axis on which operator C performs a sliding window scanning operation on the elements in its input tensor; if the sliding window size is larger than the step length, every two adjacent scanned windows overlap.
If the first output tensor is segmented according to the sliding window axis when there are two target computing resources, the elements corresponding to the sliding window axis in the first output tensor are equally divided; part of the data on the sliding window axis of each half of the divided first output tensor may then depend on the same data on the sliding window axis of the first input tensor. Thus, there are two ways of slicing a first input tensor that includes a sliding window axis, which will be specifically described in connection with fig. 12 and fig. 13.
The forward shape derivation function of the sliding window axes of the first input tensor and the first output tensor is y=f2(x); that is, the length of the sliding window axis of the first output tensor is forward derived from the length of the sliding window axis of the first input tensor, where x represents the length of the sliding window axis of the first input tensor and y represents the length of the sliding window axis of the first output tensor. f2() is related to the convolution fill value, the convolution kernel size, the convolution step size, and the convolution kernel expansion coefficient.
The inverse shape derivation function of the sliding window axes of the first input tensor and the first output tensor is x=f2⁻¹(y); that is, a reverse derivation is performed according to the length of the sliding window axis in the first output tensor, and an appropriate segmentation mode is determined, so as to obtain the second output tensor and the second input tensor of each computing resource. f2⁻¹() is also related to the convolution fill value, the convolution kernel size, the convolution step size, and the convolution kernel expansion coefficient.
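One common concrete form of f2 and f2⁻¹ can be sketched as follows, assuming a convolution fill value of 0 and a kernel expansion coefficient of 1; the exact functions in this embodiment also depend on those parameters, so this is an illustrative special case rather than the embodiment's definition.

```python
def f2(x, kernel, stride):
    # Forward shape derivation: sliding-window-axis length of the output
    # from the input length (no padding, dilation 1).
    return (x - kernel) // stride + 1

def f2_inv(y, kernel, stride):
    # Inverse shape derivation: input length needed to produce y output
    # elements (no padding, dilation 1).
    return (y - 1) * stride + kernel

# Numbers matching Fig. 12(b): kernel size 3, step size 2.
assert f2_inv(14, 3, 2) == 29   # each resource's second input tensor needs length 29
assert f2(29, 3, 2) == 14       # and a slice of length 29 indeed yields 14 outputs
```

For these parameters f2(f2_inv(y)) == y, which is exactly what lets the graph optimizer derive each second input tensor's sliding-window-axis length from its assigned share of the first output tensor.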
Fig. 12 is a schematic diagram of a sliding window axis splitting manner according to an embodiment of the present application. The manner of splitting the operator C input tensor with overlap according to the sliding window axis is shown in fig. 12. In fig. 12, an operator C is exemplified as a convolution operator. The embodiment of the application does not limit the type of the operator C. In fig. 12, the input tensor and the output tensor of the convolution operator are described by taking a single input tensor and a single output tensor as an example, and the number of the input tensor and the number of the output tensor of the operator are not limited in the embodiment of the present application.
Specifically, the type of the target split axis in the convolution operator is the sliding window axis, and according to the position information of the target split axis in the convolution operator, it can be determined that the target split axis appears as the sliding window axis on the 1 axis of the first input tensor of the convolution operator, that is, the 1 axis with the length of 56 is the sliding window axis. Thus, the length of the sliding window axis in the first output tensor is forward derived from the forward shape derivation function of the sliding window axis and the length of the sliding window axis in the first input tensor of the convolution operator. As shown in fig. 12 (a), the first input tensor of the operator is (1,56,56,64); according to the forward shape derivation function of the sliding window axes of the first input tensor and the first output tensor, where the convolution step size is 2 and the convolution kernel size is 3, the first output tensor is (1,28,56,64).
And according to the number K of the target computing resources and the length of the sliding window axis of the first output tensor, segmenting the first output tensor according to the sliding window axis to obtain K second output tensors. The length of the sliding window axis of the second input tensor of the convolution operator on each target computing resource is then inversely derived by a sliding window axis inverse shape derivation function. According to the length of the sliding window axis in each second output tensor, the first input tensor can be segmented according to the sliding window axis by calling the first slicing function, so that the second input tensor is obtained. After the second output tensor is obtained after the calculation of different target computing resources, the equivalent first output tensor of the calculation before segmentation can be obtained by calling a splicing function.
As shown in fig. 12 (b), there are two target computing resources for the convolution operator operation; the 1 axis of the first output tensor is the sliding window axis, with length 28, so the length of the 1 axis of the second output tensor on each computing resource, before the splicing function is called, is 14. Then, according to the inverse shape derivation logic of the sliding window axis, with a convolution step size of 2 and a convolution kernel size of 3, the length of the sliding window axis of the second input tensor for each computing resource is 29. Subsequently, the first input tensor is sliced along the 1 axis by calling the first slicing function 1 and the first slicing function 2 to obtain two second input tensors with a 1-axis length of 29. These two second input tensors have overlapping data: the 1-axis data range of one second input tensor is from 0 to 28 in the 1 axis of the first input tensor, the 1-axis data range of the other second input tensor is from 28 to 56 in the 1 axis of the first input tensor, and the 29th piece of data in the 1 axis of the first input tensor is the overlapping part of the two second input tensors.
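The overlapping split of fig. 12 can be sketched end to end in one dimension as follows. This is a pure-Python illustration under the same assumptions as above (kernel size 3, step size 2, no padding) with hypothetical names and an invented 1-D input of length 57 chosen so that the output length is 28 and each inversely derived slice has length 29, matching the figure's numbers.

```python
def conv1d(t, kernel, stride):
    # A plain 1-D convolution standing in for the convolution operator.
    k = len(kernel)
    return [sum(t[i * stride + j] * kernel[j] for j in range(k))
            for i in range((len(t) - k) // stride + 1)]

kernel, stride = [1.0, 2.0, 1.0], 2
first_input = [float(i) for i in range(57)]          # sliding window axis length 57
first_output = conv1d(first_input, kernel, stride)   # length (57-3)//2 + 1 = 28

# Split the 28 outputs as 14 + 14 and inversely derive each input slice.
out_per_res = 14
slice_len = (out_per_res - 1) * stride + len(kernel)    # = 29, as in Fig. 12(b)
slice1 = first_input[0:slice_len]                       # input indices 0..28
slice2 = first_input[out_per_res * stride:
                     out_per_res * stride + slice_len]  # input indices 28..56 (28 overlaps)
second_outputs = conv1d(slice1, kernel, stride) + conv1d(slice2, kernel, stride)
assert second_outputs == first_output                   # splicing equals the unsplit output
```

The single shared input element (index 28) is the overlap the document describes: each output half needs it, so both slices must contain it, and splicing the per-resource outputs reproduces the unsplit convolution exactly.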
The splitting with overlap shown in fig. 12 is suitable for scenarios in which, after the split input tensors undergo the operator operation on different computing resources, frequent data synchronization is not required, for example multithreaded parallelism in which the different threads are completely independent, so that pipelining and parallelism can be achieved. In some scenarios, however, the split input tensors do need frequent data synchronization after being operated on different computing resources, which causes the overlapping portions of the output tensors obtained on different computing resources to be spliced frequently; the overlapping portions then keep growing and cause unnecessary repeated computation. Therefore, the embodiment of the present application also provides another splitting mode in which the sliding window axis is split without overlap, as shown in fig. 13.
Fig. 13 is a schematic diagram of another sliding window axis splitting method according to an embodiment of the present application. Fig. 13 shows the steps of splitting the sliding window axis of the input tensor of operator C without overlap. In fig. 13, operator C is again exemplified as a convolution operator. As in fig. 12, the convolution operator is described with a single input tensor and a single output tensor as an example; the embodiment of the present application does not limit the number of input tensors or output tensors of the operator.
Specifically, the step in fig. 13 of deriving the length of the sliding window axis in the second input tensor of each target computing resource is the same as in the splitting with overlap of fig. 12, and reference may be made to the description of fig. 12. Fig. 13 differs from fig. 12 in the process of splitting the first input tensor to obtain the K second input tensors.
Specifically, in fig. 13 (b), the second slicing function is used to divide the first input tensor equally along the sliding window axis. The second slicing function 1 and the second slicing function 2 are used to obtain the overlapping parts of the sliding window axis of the second input tensors, on which the sliding window axis data in the second output tensors computed by the different target computing resources commonly depend. The first splicing function 1 and the first splicing function 2 are used to splice, along the sliding window axis, the third input tensors and the fourth input tensors produced by the second slicing function, the second slicing function 1 and the second slicing function 2, so as to obtain the second input tensor of each target computing resource. The second splicing function is used to splice the second output tensors computed by the convolution operator on the different target computing resources, so as to obtain the first output tensor.
Specifically, as shown in fig. 13 (b), there are two target computing resources, computing resource 1 and computing resource 2, and the shape of the first output tensor is (1,28,56,64), where the 1-axis of the first input tensor is the sliding window axis.
The first input tensor is split along the 1-axis by calling the second slicing function, obtaining two equally divided third input tensors, the 1st third input tensor and the 2nd third input tensor, each of shape (1,28,56,64). By calling the second slicing function 1, the 1st third input tensor is sliced along the 1-axis to obtain the 2nd fourth input tensor, whose shape is (1,1,56,64); the data of the 2nd fourth input tensor on the 1-axis is the last data of the 1st third input tensor on the sliding window axis, that is, the 28th data in the 1-axis of the first input tensor. By calling the second slicing function 2, the 2nd third input tensor is sliced along the 1-axis to obtain the 1st fourth input tensor, whose shape is (1,1,56,64); the data of the 1st fourth input tensor on the 1-axis is the first data of the 2nd third input tensor on the sliding window axis, that is, the 29th data in the 1-axis of the first input tensor.
By calling the first splicing function 1, the 1st third input tensor and the 1st fourth input tensor are spliced along the 1-axis to obtain the 1st second input tensor, whose shape is (1,29,56,64); the 1-axis data range of this second input tensor is from 0 to 28 in the 1-axis of the first input tensor. Similarly, by calling the first splicing function 2, the 2nd third input tensor and the 2nd fourth input tensor are spliced along the 1-axis to obtain the 2nd second input tensor, whose shape is (1,29,56,64); the 1-axis data range of the 2nd second input tensor is from 28 to 56 in the 1-axis of the first input tensor.
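The equal-split-then-splice steps above can be sketched in NumPy as follows, following the slicing functions as described (each slice borrows one boundary element from its neighbour). The function name and the `halo` parameter are illustrative assumptions, not the patent's actual interface.

```python
import numpy as np

def split_without_overlap(x, axis, k, halo=1):
    """Sketch of the fig. 13 scheme: divide x equally along the sliding
    window axis, then splice `halo` boundary elements borrowed from each
    neighbour, so the equal split itself has no overlap."""
    thirds = np.split(x, k, axis=axis)        # the "third input tensors"
    def take(t, sl):
        idx = [slice(None)] * t.ndim
        idx[axis] = sl
        return t[tuple(idx)]
    parts = []
    for i, t in enumerate(thirds):
        pieces = []
        if i > 0:                             # last data of the left neighbour
            pieces.append(take(thirds[i - 1], slice(-halo, None)))
        pieces.append(t)
        if i + 1 < k:                         # first data of the right neighbour
            pieces.append(take(thirds[i + 1], slice(0, halo)))
        parts.append(np.concatenate(pieces, axis=axis))
    return parts

# 1-axis of length 56, split for two target computing resources
x = np.broadcast_to(np.arange(56).reshape(1, 56, 1, 1),
                    (1, 56, 56, 64)).copy()
parts = split_without_overlap(x, axis=1, k=2)
```

Each resulting second input tensor has 1-axis length 29, matching the shapes (1,29,56,64) in the text.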
In the embodiment of the application, the splitting of the sliding window axis without overlap is suitable for scenarios with frequent data synchronization between different computing resources, for example multi-die parallelism, where the splicing function serves as a data synchronization node between different dies. Since no repeated computation of overlapping data is introduced and the overlapping data does not keep growing, the computing pressure and the storage pressure on the computing resources can be effectively relieved.
In the embodiment of the application, the graph optimizer performs single-operator splitting based on the different types of axes in the operator input tensor and the splitting mode corresponding to each axis type. The graph optimizer can thus automatically obtain different single-operator splitting strategies without relying on the principle of any specific operator, thereby achieving complete decoupling of the graph optimizer from the operator optimization module.
The foregoing describes in detail the different types of axes and their corresponding splitting modes; each of the different types of axes may also be represented by a corresponding data structure.
It should be noted that the types of tensor axes are not limited to those listed in the embodiments of the present application; other tensor axes and their corresponding operator splitting modes may also be used, which is not limited in the embodiments of the present application.
It should be noted that a computing resource may be a GPU, a CPU, a die, a chip, and the like; the embodiment of the present application does not limit the type of computing resource, nor does it limit the number of computing resources, and two computing resources are used above only as an example.
In an embodiment of the application, the graph optimizer automatically splits the operator input and output tensors according to the different types of axes. The graph optimizer does not need to split the input and output tensors based on the principle of a specific operator; it only needs to split them based on the operator splitting modes corresponding to the different axis types. For the operator itself, the calculation formula does not change before and after its input and output tensors are split; only some of the operator's parameters change. Graph optimization can therefore be thoroughly decoupled from the principles of specific operators, and the splitting of the first input tensor of an operator based on the different axis types has stronger generalization capability.
The above is a specific description of the splitting method for the input tensor and the output tensor of a single operator. The graph optimizer of the embodiment of the present application may determine the splitting mode according to the axis type of the input tensor of the single operator and the position information of the target split axis in the input tensor and the output tensor of the single operator. When a plurality of operators are required to complete the computing task, the position information of the splittable axes in the input tensors and output tensors of the plurality of operators enables the graph optimizer to link the split forms of the different operators into a subgraph. The position information of an operator's splittable axes in its input tensor and output tensor is specifically described below in connection with fig. 14.
Fig. 14 is a schematic diagram of the position information of an operator's splittable axes in the input tensor and the output tensor of the operator according to an embodiment of the present application.
The position information of an operator's splittable axes in the input tensor and the output tensor indicates on which input tensors and which output tensors the same splittable axis appears, and the specific position of that splittable axis in each of those tensors. Each splittable axis is of one of the different axis types described above.
It should be understood that the embodiment of the present application does not limit the number of input tensors or output tensors of the operator; a plurality of input tensors may be computed by the operator to obtain a plurality of output tensors. The embodiment of the present application likewise does not limit the number of tensor axes in the input tensors and the output tensors.
As shown in fig. 14, taking a convolution operator as an example, the convolution operator has two input tensors, a first feature map input tensor and a first weight input tensor, with shapes (8,56,56,64) and (4,3,3,64), respectively. The 0-axis, 1-axis and 2-axis of the first feature map input tensor are sliding window axes, and its 3-axis is a reduction axis; the 0-axis of the first weight input tensor is an element axis, and its 1-axis, 2-axis and 3-axis are reduction axes. Since reduction axes do not appear on the output tensor, the tensor axes appearing in the first output tensor are the 0-axis, 1-axis and 2-axis of the first feature map input tensor and the 0-axis of the first weight input tensor, and the shape of the first output tensor is (8,56,56,4).
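The axis bookkeeping of the fig. 14 example can be written out as follows. This sketch does only shape accounting, and it assumes "same"-style padding so that the sliding window axes keep their lengths; the variable names are illustrative.

```python
# Hedged shape bookkeeping for the fig. 14 convolution example:
# reduction axes vanish from the output, and the weight's element axis
# (its 0-axis, the output channel count) survives.
feature_shape = (8, 56, 56, 64)   # sliding, sliding, sliding, reduction
weight_shape  = (4, 3, 3, 64)     # element, reduction, reduction, reduction

output_shape = (feature_shape[0], feature_shape[1],
                feature_shape[2], weight_shape[0])
```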
The position information of the splittable axes in the operator input tensor and output tensor has been specifically described above in connection with the figure. Two specific data structures may represent this position information: a data structure centered on the splittable axes, and a data structure centered on the input and output tensors.
As one possible implementation, the data structure centered on the splittable axis includes the type of the splittable axis, the input tensors and output tensors in which the splittable axis appears, and the position of the splittable axis in each of those tensors.
specifically, taking an addition operator as an example, one input tensor shape is (3,1,5), the other input tensor shape is (3, 4, 1), and the output tensor obtained through the addition operator is (3, 4, 5), and the specific data structure taking the separable axis as the middle can be expressed as:
as one possible implementation, the data structure centered on the input tensor and the output tensor includes the number of each axis in each input tensor, the number of each axis in each output tensor, and the type of axis to which each number corresponds:
input_dim_name_defs: vector<vector<int>>   // the number of each axis in each input tensor
output_dim_name_defs: vector<vector<int>>  // the number of each axis in each output tensor
dim_slice_types: map<int, AXIS_TYPE>       // the axis type corresponding to each number
Specifically, taking an addition operator as an example, one input tensor has shape (3,1,5), the other input tensor has shape (3,4,1), and the output tensor obtained through the addition operator has shape (3,4,5). The specific data structure centered on the input and output tensors can be expressed as follows:
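One possible rendering of this tensor-centered structure for the same addition operator is sketched below. This is an assumption for exposition: in particular, the axis numbering here gives a length-1 broadcast dim the same number as the axis it broadcasts against, which is one plausible convention, not necessarily the patent's.

```python
# Illustrative only: layout and numbering convention are assumed.
add_tensor_centered = {
    "input_dim_name_defs":  [[0, 1, 2],    # input x, shape (3, 1, 5)
                             [0, 1, 2]],   # input y, shape (3, 4, 1)
    "output_dim_name_defs": [[0, 1, 2]],   # output,  shape (3, 4, 5)
    "dim_slice_types": {0: "ELEMENT", 1: "ELEMENT", 2: "ELEMENT"},
}
```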
The graph optimizer splits the different operators and links them into a subgraph according to the axis types in the operator input tensors and the position information of each axis in the input tensors and output tensors. Different axis position information has different applications; specific applications of the operator splitting method for processing a computing task according to the embodiment of the present application are described in detail below with reference to fig. 15 to 17.
Fig. 15 is a schematic diagram of a specific application of operator splitting according to an embodiment of the present application. In the first scenario, an output tensor of the first operator that contains a splittable axis serves as an input tensor of the second operator, and split optimization is performed on the input tensors of a plurality of consecutive operators.
As one possible implementation, the splitting mode of the first input tensor is determined according to the axis type of the target split axis in the different operators and its position information in the first input tensor and the first output tensor of the different operators.
Specifically, fig. 15 involves two activation function operators, a ReLU operator and a TanH operator. The graph optimizer acquires the splitting information of the ReLU operator and the splitting information of the TanH operator. The axis type of the target split axis in the target splitting mode determined by the graph optimizer is the element axis. For the ReLU operator, the element axis appears on the 0-axis of its first input tensor and the 0-axis of its first output tensor; the shape of the first input tensor is (8,56,56,64), and according to the splitting mode corresponding to the element axis, the shape of the first output tensor is (8,56,56,64). For the TanH operator, the element axis likewise appears on the 0-axis of its first input tensor and the 0-axis of its first output tensor; the shape of the first input tensor is (8,56,56,64), so according to the above splitting mode corresponding to the element axis, the shape of the first output tensor is also (8,56,56,64).
If the position information of the element axis in the input tensors and output tensors of the ReLU operator and the TanH operator were unknown, the second output tensors of the target computing resources after the ReLU operator would have to be spliced and synchronized to obtain the first input tensor of the TanH operator, and this spliced first input tensor of the TanH operator would then have to be split again to obtain the second input tensors for the different target computing resources. As shown in fig. 15 (a), completing the operations of the ReLU operator and the TanH operator in this way requires calling two slicing functions and two splicing functions.
Since the graph optimizer knows the position information of the target split axis in the first input tensor and the first output tensor of the different operators, the intermediate splicing operator and intermediate splitting operator generated by splitting the tensors of the ReLU operator and the TanH operator can be omitted when the two operators run consecutively. As shown in fig. 15 (b), the element axis appears on the 0-axis in the input and output tensors of both the ReLU operator and the TanH operator, so only one splitting operator node and one splicing operator node are needed to run the ReLU operator and the TanH operator consecutively.
Specifically, according to the element-axis splitting mode described above, the first input tensor is split along the element axis by calling the splitting function once, obtaining two equally divided second input tensors of the ReLU operator. The second output tensor of the ReLU operator is obtained through the ReLU operation on each computing resource and serves directly as the second input tensor of the TanH operator; the second output tensor of the TanH operator is obtained through the TanH operation on each target computing resource, and finally the first output tensor is obtained through the splicing operator.
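The fused split of fig. 15 (b) can be sketched in NumPy as follows: one split node, per-resource ReLU then TanH, and one splice node, with no intermediate splice/split pair. The code is an illustration, not the patent's implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.random.randn(8, 56, 56, 64)

parts = np.split(x, 2, axis=0)              # single split operator node (element axis)
outs = [np.tanh(relu(p)) for p in parts]    # ReLU then TanH on each target resource
y = np.concatenate(outs, axis=0)            # single splice operator node

# Equivalent to running the unsplit ReLU -> TanH chain:
assert np.allclose(y, np.tanh(relu(x)))
```

Because both operators are elementwise, the split graph is exactly equivalent to the unsplit one, which is what allows the intermediate nodes to be elided.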
The embodiment of the present application does not limit the axis types of the target split axes in consecutive operators. Here the axis types of the target split axes in the first operator and the second operator happen to be the same, both illustrated as element axes, but the axis types of the target split axes in consecutive operators may be the same or different.
In the embodiment of the application, each split input tensor undergoes the consecutive operator operations on the same target computing resource, so that parallel computation across multiple target computing resources can be achieved.
Fig. 16 is a schematic diagram of another specific application of operator splitting provided in an embodiment of the present application. In the second scenario, the splittable axes appear on multiple input tensors and a single output tensor of a single operator.
As a possible implementation, the splitting mode of the input tensors of the first operator is determined according to the splitting information of the first operator and the number K of target computing resources.
Specifically, taking an addition operator as an example, the addition operator has two first input tensors: the shape of the 1st first input tensor x is (m,n), and the shape of the 2nd first input tensor y is (m,1).
The splitting information of the addition operator includes: the type of splittable axis 1 is the element axis, splittable axis 1 appears on the 0-axis of the first input tensor x and the 0-axis of the first input tensor y, and its length is m; the type of splittable axis 2 is the element axis, splittable axis 2 appears on the 1-axis of the first input tensor x, and its length is n.
Based on the splitting information of the first operator, two operator splitting modes can be determined: the first splits the input tensors containing splittable axis 1 of length m, and the second splits the input tensor containing splittable axis 2 of length n.
As shown in fig. 16 (a), the input tensors containing splittable axis 1 of length m are split. Splittable axis 1 is determined as the target split axis, and according to the position information of splittable axis 1 in the input tensors of the addition operator, it is determined that the first input tensor x is split along the 0-axis and the first input tensor y is split along the 0-axis. According to the splitting mode in which splittable axis 1 is an element axis, the first input tensor x can be equally split along the 0-axis into two second input tensors x0 and x1, and the first input tensor y can be equally split along the 0-axis into two second input tensors y0 and y1. These are sent to the two target computing resources to perform the addition operation and obtain the second output tensors, after which a splicing function is called to obtain the first output tensor.
As shown in fig. 16 (b), the input tensor containing splittable axis 2 of length n is split. Splittable axis 2 is determined as the target split axis, and according to the position information of splittable axis 2 in the input tensors of the addition operator, it is determined that the first input tensor x is split along the 1-axis. According to the splitting mode in which splittable axis 2 is an element axis, the first input tensor x can be equally split along the 1-axis into two second input tensors x0' and x1'. Since there is no splittable axis 2 in the first input tensor y, the first input tensor y is sent as shared data to the different target computing resources. The addition operation is then performed on each target computing resource to obtain a second output tensor, and the first output tensor is obtained by calling a splicing function.
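The fig. 16 (b) mode can be sketched in NumPy as follows; the concrete shapes are illustrative assumptions (y is taken as (m,1) so it broadcasts against x).

```python
import numpy as np

m, n = 4, 6
x = np.random.randn(m, n)
y = np.random.randn(m, 1)     # contains splittable axis 1 only

# Split x along its 1-axis (splittable axis 2); y has no splittable
# axis 2, so it is shared by both target computing resources unchanged.
x0, x1 = np.split(x, 2, axis=1)
out = np.concatenate([x0 + y, x1 + y], axis=1)  # per-resource add, then splice

assert np.allclose(out, x + y)
```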
As a possible implementation, each target computing resource may acquire the first input tensor y through addressing, or the first input tensor y may be copied to each target computing resource; the embodiment of the present application does not limit the sharing manner of the first input tensor y.
In the embodiment of the application, according to the axis type of the splittable axis and the position information of the splittable axis on the input tensors and output tensor of the operator, both included in the splitting information of the operator, a suitable operator splitting mode can be flexibly selected.
Fig. 17 is a schematic diagram of another specific application of operator splitting provided in an embodiment of the present application. In the third scenario, the positions of splittable axis 1 in the first input tensor and the first output tensor of the first operator are different.
Taking a transpose (Transpose) operator as an example, as shown in fig. 17 (a), the graph optimizer acquires the splitting information of the transpose operator and determines that splittable axis 1 is an element axis, that splittable axis 1 is the 0-axis of the first input tensor, and that splittable axis 1 is the 1-axis of the first output tensor. Based on the forward shape derivation function of the element axis, the shape (56,8,56,64) of the first output tensor may be derived from the shape (8,56,56,64) of the first input tensor.
Specifically, as shown in fig. 17 (b), there are two target computing resources available for the transpose operator operation. The first output tensor is split along the 1-axis to obtain two second output tensors whose 1-axis length is 4. According to the inverse shape derivation function of the element axis, the 0-axis length of the two second input tensors is determined to be 4, and the first input tensor, whose 0-axis length is 8, is split along the 0-axis by calling the splitting function to obtain the two second input tensors.
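This correspondence between the output 1-axis and the input 0-axis can be checked with a NumPy sketch. The permutation below is an assumption consistent with the shapes in the text ((8,56,56,64) to (56,8,56,64)).

```python
import numpy as np

x = np.random.randn(8, 56, 56, 64)
perm = (1, 0, 2, 3)   # assumed permutation: maps the input 0-axis to the output 1-axis

# Slicing the output along its 1-axis corresponds to slicing the input
# along its 0-axis, so each resource transposes one half of the input.
x0, x1 = np.split(x, 2, axis=0)                   # second input tensors, 0-axis length 4
y = np.concatenate([x0.transpose(perm),
                    x1.transpose(perm)], axis=1)  # splice along the output 1-axis

assert y.shape == (56, 8, 56, 64)
assert np.array_equal(y, x.transpose(perm))
```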
In the embodiment of the application, the graph optimizer only needs to know the axis types of the splittable axes of the operator's input tensors and the position information of the splittable axes in the input tensors and output tensors; it can split the input tensors and output tensors of the operator appropriately without relying on the specific operator type, so that complete decoupling of operator optimization and graph optimization can be achieved.
The foregoing is a description of a method for processing a computing task according to an embodiment of the present application, and a device for processing a computing task according to an embodiment of the present application is described below with reference to fig. 19. It should be understood that the apparatus described below is capable of performing the method of the foregoing embodiments of the present application, and in order to avoid unnecessary repetition, the repeated description is appropriately omitted when describing the apparatus of the embodiments of the present application.
FIG. 19 is a schematic diagram of an apparatus for processing computing tasks according to an embodiment of the present application. The apparatus 1900 is applied to a graph optimizer and comprises: a processor 1901 and a transmission interface 1902. Optionally, the apparatus may further include a memory 1903 and a bus 1904.
The memory 1903, the processor 1901, and the transmission interface 1902 are communicatively connected to each other via a bus 1904.
The memory 1903 may be a ROM, a static storage device, or a RAM. The memory 1903 may store programs; when the programs stored in the memory 1903 are executed by the processor 1901, the processor 1901 and the transmission interface 1902 are configured to perform the steps of the method for processing computing tasks of the embodiments of the application.
Illustratively, the processor 1901 is configured to determine a first operator for performing a computational task, the first operator comprising N separable axes, N being a positive integer greater than or equal to 1;
the processor 1901 is configured to obtain slicing information of a first operator from an operator slicing information base, where the slicing information of the first operator includes an axis type of an nth one of the N separable axes in the first operator and position information of the nth one of the separable axes in the first operator, where the position information of the nth one of the separable axes in the first operator is used to indicate a position of the nth one of the separable axes in an input tensor of the first operator, where n=1, …, N.
The processor 1901 is configured to segment the input tensor of the first operator according to the segmentation information of the first operator, and determine K groups of input tensors, where K is a positive integer greater than or equal to 2.
The transmission interface 1902 is configured to send K sets of input tensors to K target computing resources, respectively, so that the K target computing resources complete a computing task.
It should be understood that the above is only an exemplary description of the processing computing task means being for performing the methods or steps mentioned in the method embodiments described above, and that the processing computing task means thus correspond to the method embodiments described above. For details, reference may be made to the description of the foregoing method embodiments, which are not repeated here.
The processor 1901 may employ a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, so as to perform the functions required to be performed by the units of the apparatus for processing computing tasks of an embodiment of the application, or to perform the method of processing computing tasks of an embodiment of the application.
The processor 1901 may also be an integrated circuit chip with signal processing capabilities. In implementation, various steps of a method for processing computing tasks according to an embodiment of the application may be performed by integrated logic circuitry in hardware or by instructions in software in processor 1901.
The processor 1901 may also be a general purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1903, and the processor 1901 reads information in the memory 1903, and combines the hardware thereof to perform functions required to be performed by units included in the processing computing task device of the embodiment of the present application, or to perform the processing computing task method of the embodiment of the method of the present application.
The transmission interface 1902 enables communication between the apparatus 1900 and other devices or communication networks using a transceiver apparatus such as, but not limited to, a transceiver. For example, the image to be processed may be acquired through the transmission interface 1902.
Bus 1904 may include a path for transferring information between components of device 1900 (e.g., memory 1903, processor 1901, transmission interface 1902).
It should be noted that although the apparatus 1900 described above shows only a memory, a processor, and a transmission interface, in a specific implementation, those skilled in the art will appreciate that the apparatus 1900 may also include other devices necessary for proper operation. Likewise, those skilled in the art will appreciate that the apparatus 1900 may also include hardware devices implementing other additional functions as needed. Furthermore, those skilled in the art will appreciate that the apparatus 1900 may include only the components necessary to implement the embodiments of the present application, and not all of the components shown in FIG. 19.
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means (e.g., infrared, radio, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
Embodiments of the present application provide a computer readable storage medium for storing a computer program which, when run on a computer, causes the computer to perform a method of processing computing tasks as in the method embodiments described above.
An embodiment of the present application provides a computer program product comprising: computer program code which, when executed, implements a method of processing computing tasks as in the foregoing method embodiments.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist. For example, A and/or B may represent three cases: A alone, both A and B, and B alone, where A and B may be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship; the specific meaning may be understood with reference to the context.
In the present application, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following items" and similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may each be singular or plural.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (31)

  1. A method of processing a computing task, the method performed by a graph optimizer, the method comprising:
    determining a first operator for performing a computing task, the first operator comprising N splittable axes, N being a positive integer greater than or equal to 1;
    obtaining segmentation information of the first operator from an operator segmentation information base, wherein the segmentation information of the first operator comprises an axis type, in the first operator, of an n-th splittable axis of the N splittable axes and first position information, the first position information being used to indicate a position of the n-th splittable axis in an input tensor of the first operator, where n = 1, …, N;
    dividing the input tensor of the first operator according to the dividing information of the first operator to obtain K groups of input tensors, wherein K is a positive integer greater than or equal to 2;
    and respectively sending the K groups of input tensors to K target computing resources, so that the K target computing resources complete the computing task.
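The flow of claim 1 can be sketched as follows. This is an illustrative sketch only, assuming NumPy tensors, a hypothetical `segmentation_db` standing in for the operator segmentation information base, and a local function application standing in for dispatch to the K target computing resources; none of these names come from the claims.

```python
import numpy as np

# Hypothetical operator segmentation information base:
# operator name -> (axis type, position of the splittable axis in the input).
segmentation_db = {
    "relu": ("element", 0),
}

def split_and_dispatch(op_name, x, K):
    """Split the operator's input tensor into K groups and 'dispatch' them."""
    axis_type, axis = segmentation_db[op_name]
    # For an element axis the input can simply be cut into K contiguous blocks.
    groups = np.array_split(x, K, axis=axis)
    # Each group would be sent to one target compute resource; here we just
    # apply the operator (relu, hard-coded for this sketch) to each group
    # locally and reassemble the output.
    partials = [np.maximum(g, 0.0) for g in groups]
    return np.concatenate(partials, axis=axis)

x = np.arange(-4.0, 4.0).reshape(8, 1)
y = split_and_dispatch("relu", x, K=2)
assert np.array_equal(y, np.maximum(x, 0.0))
```

Because relu maps each input element to one output element, splitting along any axis and concatenating the partial results reproduces the unsplit computation exactly.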
  2. The method of claim 1, wherein the axis type of the splittable axis is one of the following types: an element axis, a reduction axis, or a sliding window axis;
    wherein an axis on which elements in the input tensor and the output tensor of the operator have a point-to-point mapping relationship is the element axis;
    if a first axis exists in the input tensor of the operator but does not exist in the output tensor of the operator, the first axis is the reduction axis;
    and an axis on which the operator performs a sliding window scanning operation on elements in the input tensor of the operator is the sliding window axis.
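The three axis types of claim 2 can be illustrated with small NumPy examples (a sketch under the assumption of a window size k = 2 and stride 1 for the sliding window case):

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)

# Element axis: the operator maps input elements to output elements
# point-to-point, so every axis of x survives unchanged in the output shape.
assert np.exp(x).shape == x.shape

# Reduction axis: axis 1 exists in the input tensor but not in the output.
assert np.sum(x, axis=1).shape == (3,)

# Sliding window axis: the operator scans the axis with a window; with
# window k = 2 and stride 1 the axis shrinks from 4 to 4 - 2 + 1 = 3.
k = 2
windows = np.stack(
    [x[:, i:i + k].sum(axis=1) for i in range(x.shape[1] - k + 1)],
    axis=1,
)
assert windows.shape == (3, 3)
```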
  3. The method of claim 2, wherein the splitting the input tensor of the first operator according to the splitting information of the first operator, to obtain K groups of input tensors includes:
    determining a target split axis, wherein the target split axis is one of the N split axes;
    determining, according to the segmentation information of the first operator, a segmentation mode corresponding to the axis type of the target split axis in the first operator;
    and splitting the input tensor of the first operator according to the segmentation mode corresponding to the axis type of the target split axis in the first operator, to obtain the K groups of input tensors.
  4. The method of claim 3, wherein slicing the input tensor of the first operator according to the slicing mode corresponding to the axis type of the target slicing axis in the first operator, to obtain the K groups of input tensors comprises:
    according to the segmentation mode, determining Q first input tensors comprising the target segmentation axis and the positions of the target segmentation axis in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
    splitting each first input tensor in the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of the target computing resources, to obtain Q groups of second input tensors, wherein each group of second input tensors in the Q groups of second input tensors comprises K second input tensors;
    and obtaining the K groups of input tensors according to the Q groups of second input tensors and the input tensors of the first operator which are not segmented.
  5. The method of claim 2, wherein, where the operator for performing the computing task further comprises a second operator, the second operator comprises P separable axes, the P separable axes being a subset of the N separable axes,
    the segmenting the input tensor of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors comprises:
    obtaining segmentation information of the second operator from the operator segmentation information base, wherein the segmentation information of the second operator comprises an axis type, in the second operator, of a p-th splittable axis and second position information, wherein the second position information is used to indicate a position of the p-th splittable axis in an input tensor of the second operator, the input tensor of the second operator is an output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P;
    determining P segmentation reference information according to the segmentation information of the first operator and the segmentation information of the second operator, wherein the P segmentation reference information in the P segmentation reference information comprises: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator;
    Determining P groups of candidate segmentation modes according to the P segmentation reference information, wherein the P-th group of candidate segmentation modes in the P groups of candidate segmentation modes comprise at least one segmentation mode;
    determining a target segmentation mode according to the time required by each segmentation mode in the P group of candidate segmentation modes to finish the calculation task;
    and according to the target segmentation mode, segmenting the input tensor of the first operator to obtain K groups of input tensors.
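The candidate-selection step of claim 5 amounts to enumerating the candidate segmentation modes and keeping the one with the lowest estimated completion time. The sketch below assumes a placeholder cost function `estimate_time`; the claims specify only that the choice is made by the time each candidate needs to finish the computing task, not how that time is estimated.

```python
def choose_target_mode(candidate_modes, estimate_time):
    """Pick the segmentation mode with the smallest estimated task time.

    candidate_modes: P groups, each holding at least one candidate mode.
    estimate_time:   placeholder cost model mapping a mode to a time.
    """
    best_mode, best_time = None, float("inf")
    for group in candidate_modes:
        for mode in group:
            t = estimate_time(mode)
            if t < best_time:
                best_mode, best_time = mode, t
    return best_mode

# Toy cost model: splitting a longer axis is assumed to be cheaper.
modes = [[("axis0", 128)], [("axis1", 512), ("axis2", 64)]]
assert choose_target_mode(modes, lambda m: 1.0 / m[1]) == ("axis1", 512)
```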
  6. The method of claim 5, wherein the splitting the input tensor of the first operator according to the target splitting manner to obtain K groups of input tensors comprises:
    according to the target segmentation mode, determining a target segmentation axis, an axis type of the target segmentation axis in the first operator, an axis type of the target segmentation axis in the second operator, Q first input tensors including the target segmentation axis in the first operator and positions of the target segmentation axis in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
    splitting each first input tensor in the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the number K of the target computing resources, to obtain Q groups of second input tensors,
    wherein each of the Q groups of second input tensors comprises K second input tensors, and the q-th group of second input tensors in the Q groups of second input tensors is a result of splitting the q-th first input tensor in the Q first input tensors into K parts, where q = 1, …, Q;
    and determining the K groups of input tensors according to the Q groups of second input tensors and the input tensors of the first operator which are not segmented.
  7. The method of claim 4, wherein when the axis type of the target split axis in the first operator is the element axis or the sliding window axis, the first position information of the target split axis is further used to indicate a position of the target split axis in an output tensor of the first operator, and the dividing each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources, respectively, to obtain Q sets of second input tensors includes:
    determining, according to the first position information of the target segmentation axis, L first output tensors including the target segmentation axis in the first operator, and a position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1;
    Obtaining a first output length by taking a first input length as an input of a forward shape derivation function of the target segmentation axis, wherein the first input length is the length of the target segmentation axis in each first input tensor, and the lengths of the target segmentation axes in each first input tensor are equal;
    dividing the L first output tensors according to the target dividing axis according to the first output length and the number K of the target computing resources to obtain L groups of second output tensors, wherein each group of second output tensors in the L groups of second output tensors comprises K second output tensors;
    respectively taking K second output lengths corresponding to the target segmentation axis in each group of second output tensors in the L groups of second output tensors as inputs of a reverse derivation function of the target segmentation axis to obtain K second input lengths corresponding to the target segmentation axis in each group of second input tensors in the Q groups of second input tensors;
    and splitting each first input tensor in the Q first input tensors according to the target segmentation axis, according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors in the Q groups of second input tensors, to obtain the Q groups of second input tensors.
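The forward/reverse derivation of claim 7 can be sketched for a 1-D sliding window axis: derive the output length from the input length, split the output evenly among the K resources, then run the reverse derivation to find each resource's (overlapping) input length. The window size k and stride s, and the even output split, are assumptions of this sketch.

```python
def forward_len(n_in, k, s):
    """Forward shape derivation: output length of a sliding window axis."""
    return (n_in - k) // s + 1

def reverse_len(n_out, k, s):
    """Reverse derivation: input length needed to produce n_out windows."""
    return (n_out - 1) * s + k

n_in, k, s, K = 10, 3, 1, 2
n_out = forward_len(n_in, k, s)                      # 8 output positions
out_lens = [n_out // K] * K                          # split the output: [4, 4]
in_lens = [reverse_len(m, k, s) for m in out_lens]   # per-resource inputs: [6, 6]

assert sum(out_lens) == n_out
# The input slices overlap by k - s elements, which is why the sliding
# window axis needs overlapping segmentation (claims 9 and 10).
assert in_lens == [6, 6]
```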
  8. The method of claim 7, wherein when the axis type of the target split axis in the first operator is the element axis, the dividing each of the Q first input tensors according to the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q second input tensors, respectively, to obtain the Q second input tensors includes:
    and according to K second input lengths corresponding to the target segmentation axes in each group of the Q groups of second input tensors, respectively segmenting each first input tensor in the Q groups of first input tensors according to the target segmentation axes by scheduling a first segmentation function to obtain the Q groups of second input tensors.
  9. The method of claim 7, wherein when the axis type of the target split axis in the first operator is the sliding window axis, the dividing, according to the K second input lengths corresponding to the target split axis in each of the Q sets of second input tensors, each of the Q first input tensors according to the target split axis, to obtain the Q sets of second input tensors includes:
    And according to K second input lengths corresponding to the target segmentation axes in each group of the Q groups of second input tensors, carrying out segmentation with overlapping on each first input tensor in the Q first input tensors according to the target segmentation axes by scheduling a first slicing function, so as to obtain the Q groups of second input tensors.
  10. The method of claim 7, wherein when the axis type of the target split axis in the first operator is the sliding window axis, the dividing, according to the K second input lengths corresponding to the target split axis in each of the Q sets of second input tensors, each of the Q first input tensors according to the target split axis, to obtain the Q sets of second input tensors includes:
    splitting each first input tensor in the Q first input tensors according to the target split axis by scheduling a second splitting function, to obtain Q groups of third input tensors, wherein each of the Q groups of third input tensors comprises K third input tensors;
    according to K second input lengths corresponding to the target segmentation axes in each group of the second input tensors in the Q groups of the second input tensors, respectively segmenting the K third input tensors in each group of the third input tensors in the Q groups of the third input tensors according to the target segmentation axes by scheduling a second slicing function to obtain Q groups of fourth input tensors;
    and splicing, for q = 1, …, Q, the k-th fourth input tensor in the q-th group of fourth input tensors and the k-th third input tensor in the q-th group of third input tensors according to the target segmentation axis by scheduling a splicing function, to obtain the q-th group of second input tensors.
  11. The method of claim 4, wherein when the axis type of the target split axis in the first operator is the reduction axis, respectively splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of the target computing resources, to obtain Q sets of second input tensors comprises:
    and according to the number K of the target computing resources, respectively segmenting each first input tensor in the Q first input tensors by calling a third segmentation function to obtain Q groups of second input tensors.
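Splitting along a reduction axis, as in claim 11, means each resource reduces its own slice and the partial results are merged with the same reduction. A minimal NumPy sketch (the combine step is an assumption; the claims leave the merging of partial results to the surrounding system):

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)
K = 2

# "Third splitting function" of the claim, sketched as an even split
# along the reduction axis (axis 1 here).
slices = np.array_split(x, K, axis=1)

# Each of the K target computing resources reduces its own slice.
partials = [s.sum(axis=1) for s in slices]

# Combining the K partial sums recovers the full reduction.
combined = np.sum(partials, axis=0)
assert np.allclose(combined, x.sum(axis=1))
```

Note the difference from an element axis: the per-resource outputs are not concatenated but reduced again, because the split axis does not exist in the output tensor.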
  12. The method of claim 11, wherein the reduction axes comprise a first type of reduction axis and a second type of reduction axis, wherein the first type of reduction axis is a reduction axis on which the operator performs a reduction operation on the values of elements in the input tensor of the operator, and the second type of reduction axis is a reduction axis on which the operator gathers elements from the input tensor of the operator according to indices.
  13. The method of claim 12, wherein the first type of reduction axis comprises any one of the following: a reduction sum axis, a reduction maximum axis, a reduction minimum axis, and a reduction average axis;
    the reduction sum axis is a reduction axis on which the operator performs a summation reduction operation on elements in the input tensor of the operator;
    the reduction maximum axis is a reduction axis on which the operator performs a maximum-value reduction operation on elements in the input tensor of the operator;
    the reduction minimum axis is a reduction axis on which the operator performs a minimum-value reduction operation on elements in the input tensor of the operator;
    the reduction average axis is a reduction axis on which the operator performs an averaging reduction operation on elements in the input tensor of the operator.
  14. The method of claim 12, wherein the second type of reduction axis comprises a reduction gather axis, the reduction gather axis being an axis on which the operator gathers elements from the input tensor of the operator according to addresses indicated by elements of an index input tensor of the operator.
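The gather behavior described in claim 14 corresponds to an index-driven read: the operator fetches elements of a data tensor at the addresses given by an index tensor. A NumPy sketch (the specific tensors are illustrative):

```python
import numpy as np

# Data tensor of the operator; axis 0 is the gathered (reduction gather) axis.
data = np.array([10.0, 20.0, 30.0, 40.0])

# Index input tensor: each element is an address into the data tensor.
index = np.array([3, 0, 0, 2])

# The output takes its shape from the index tensor, not from the gathered
# axis of the data tensor, which is why the axis counts as a reduction axis.
gathered = np.take(data, index)
assert np.array_equal(gathered, np.array([40.0, 10.0, 10.0, 30.0]))
```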
  15. The method of any of claims 1 to 14, wherein the target computing resource comprises one of the following categories:
    a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
  16. An apparatus for processing computing tasks, the apparatus being for use in a graph optimizer, the apparatus comprising a processor and a transmission interface:
    the processor is used for determining a first operator for executing a computing task, wherein the first operator comprises N separable axes, and N is a positive integer greater than or equal to 1;
    the processor is configured to obtain, from an operator segmentation information base, segmentation information of the first operator, where the segmentation information of the first operator includes an axis type of an nth one of the N separable axes in the first operator and first position information, where the first position information is used to indicate a position of the nth separable axis in an input tensor of the first operator, where n=1, …, N;
    the processor is used for cutting the input tensor of the first operator according to the cutting information of the first operator to obtain K groups of input tensors, wherein K is a positive integer greater than or equal to 2;
    the transmission interface is used for respectively sending the K groups of input tensors to K target computing resources so that the K target computing resources complete the computing task.
  17. The apparatus of claim 16, wherein the axis type of the splittable axis is one of the following types: an element axis, a reduction axis, or a sliding window axis;
    wherein an axis on which elements in the input tensor and the output tensor of the operator have a point-to-point mapping relationship is the element axis;
    if a first axis exists in the input tensor of the operator but does not exist in the output tensor of the operator, the first axis is the reduction axis;
    and an axis on which the operator performs a sliding window scanning operation on elements in the input tensor of the operator is the sliding window axis.
  18. The apparatus of claim 17, wherein the processor is configured to segment the input tensor of the first operator based on the segmentation information of the first operator, to obtain K sets of input tensors comprising:
    the processor is configured to:
    determining a target split axis, wherein the target split axis is one of the N split axes;
    determining, according to the segmentation information of the first operator, a segmentation mode corresponding to the axis type of the target split axis in the first operator;
    and splitting the input tensor of the first operator according to the splitting mode of the target splitting axis corresponding to the axis type in the first operator to obtain the K groups of input tensors.
  19. The apparatus of claim 18, wherein the processor is configured to,
    determining, according to the segmentation mode corresponding to the axis type of the target split axis in the first operator, Q first input tensors including the target split axis in the first operator and a position of the target split axis in each first input tensor in the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
    splitting each first input tensor in the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of the target computing resources, to obtain Q groups of second input tensors, wherein each group of second input tensors in the Q groups of second input tensors comprises K second input tensors;
    and obtaining the K groups of input tensors according to the Q groups of second input tensors and the input tensors of the first operator which are not segmented.
  20. The apparatus of claim 17, wherein, where the operator for performing the computing task further comprises a second operator, the second operator comprises P separable axes, the P separable axes being a subset of the N separable axes,
    The processor is specifically configured to:
    obtaining segmentation information of the second operator from the operator segmentation information base, wherein the segmentation information of the second operator comprises an axis type, in the second operator, of a p-th splittable axis and second position information, wherein the second position information is used to indicate a position of the p-th splittable axis in an input tensor of the second operator, the input tensor of the second operator is an output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P;
    determining P segmentation reference information according to the segmentation information of the first operator and the segmentation information of the second operator, wherein the P segmentation reference information in the P segmentation reference information comprises: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator;
    determining P groups of candidate segmentation modes according to the P segmentation reference information, wherein the P-th group of candidate segmentation modes in the P groups of candidate segmentation modes comprise at least one segmentation mode;
    Determining a target segmentation mode according to the time required by each segmentation mode in the P group of candidate segmentation modes to finish the calculation task;
    and according to the target segmentation mode, segmenting the input tensor of the first operator to obtain K groups of input tensors.
  21. The apparatus of claim 20, wherein the processor is specifically configured to:
    according to the target segmentation mode, determining a target segmentation axis, an axis type of the target segmentation axis in the first operator, an axis type of the target segmentation axis in the second operator, Q first input tensors including the target segmentation axis in the first operator and positions of the target segmentation axis in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
    splitting each first input tensor in the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the number K of the target computing resources, to obtain Q groups of second input tensors,
    wherein each of the Q groups of second input tensors comprises K second input tensors, and the q-th group of second input tensors in the Q groups of second input tensors is a result of splitting the q-th first input tensor in the Q first input tensors into K parts, where q = 1, …, Q;
    And obtaining the K groups of input tensors according to the Q groups of second input tensors and the input tensors of the first operator which are not segmented.
  22. The apparatus of claim 19, wherein when the axis type of the target split axis in the first operator is the element axis or the sliding window axis, the first position information of the target split axis is further used to indicate a position of the target split axis in an output tensor of the first operator, the processor is specifically configured to:
    determining, according to the first position information of the target segmentation axis, L first output tensors including the target segmentation axis in the first operator, and a position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1;
    obtaining a first output length by taking a first input length as an input of a forward shape derivation function of the target segmentation axis, wherein the first input length is the length of the target segmentation axis in each first input tensor, and the lengths of the target segmentation axes in each first input tensor are equal;
    splitting the L first output tensors according to the target segmentation axis, according to the first output length and the number K of the target computing resources, to obtain L groups of second output tensors, wherein each group of second output tensors in the L groups of second output tensors comprises K second output tensors, and the l-th group of second output tensors in the L groups of second output tensors is a result of splitting the l-th first output tensor in the L first output tensors into K parts;
    respectively taking the K second output lengths corresponding to the target segmentation axis in each group of second output tensors in the L groups of second output tensors as inputs of a reverse derivation function of the target segmentation axis, to obtain K second input lengths corresponding to the target segmentation axis in each group of second input tensors in the Q groups of second input tensors, wherein the lengths corresponding to the target segmentation axis in the k-th second output tensor of each group of second output tensors in the L groups of second output tensors are equal, and the lengths corresponding to the target segmentation axis in the k-th second input tensor of each group of second input tensors in the Q groups of second input tensors are equal;
    and splitting each first input tensor in the Q first input tensors according to the target segmentation axis, according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors in the Q groups of second input tensors, to obtain the Q groups of second input tensors.
  23. The apparatus of claim 22, wherein when the axis type of the target segmentation axis in the first operator is the element axis, the processor is specifically configured to:
    And according to K second input lengths corresponding to the target segmentation axes in each group of the Q groups of second input tensors, respectively segmenting each first input tensor in the Q groups of first input tensors according to the target segmentation axes by scheduling a first segmentation function to obtain the Q groups of second input tensors.
  24. The apparatus of claim 22, wherein when the axis type of the target segmentation axis in the first operator is the sliding window axis, the processor is specifically configured to:
    and according to K second input lengths corresponding to the target segmentation axes in each group of the Q groups of second input tensors, carrying out segmentation with overlapping on each first input tensor in the Q first input tensors according to the target segmentation axes by scheduling a first slicing function, so as to obtain the Q groups of second input tensors.
  25. The apparatus of claim 22, wherein when the axis type of the target segmentation axis in the first operator is the sliding window axis, the processor is specifically configured to:
    splitting each first input tensor in the Q first input tensors according to the target split axis by scheduling a second splitting function, to obtain Q groups of third input tensors, wherein each of the Q groups of third input tensors comprises K third input tensors;
    According to K second input lengths corresponding to the target segmentation axes in each group of the second input tensors in the Q groups of the second input tensors, respectively segmenting the K third input tensors in each group of the third input tensors in the Q groups of the third input tensors according to the target segmentation axes by scheduling a second slicing function to obtain Q groups of fourth input tensors;
    and splicing, for q = 1, …, Q, the k-th fourth input tensor in the q-th group of fourth input tensors and the k-th third input tensor in the q-th group of third input tensors according to the target segmentation axis by scheduling a splicing function, to obtain the q-th group of second input tensors.
  26. The apparatus of claim 19, wherein when the axis type of the target-slicing axis in the first operator is the reduction axis, the processor is specifically configured to:
    and according to the number K of the target computing resources, respectively segmenting each first input tensor in the Q first input tensors by calling a third segmentation function to obtain Q groups of second input tensors.
  27. The apparatus of claim 26, wherein the reduction axes comprise a first type of reduction axis and a second type of reduction axis, wherein the first type of reduction axis is a reduction axis along which the operator performs a reduction operation directly on the elements in the operator's input tensor, and the second type of reduction axis is a reduction axis along which the operator performs a reduction operation on elements in the operator's input tensor that are addressed by an index input tensor of the operator.
  28. The apparatus of claim 27, wherein the first type of reduction axis comprises any one of: a reduction-sum axis, a reduction-maximum axis, a reduction-minimum axis, and a reduction-mean axis;
    the reduction-sum axis is a reduction axis along which the operator performs a summation reduction operation on the elements in the operator's input tensor;
    the reduction-maximum axis is a reduction axis along which the operator performs a maximum-value reduction operation on the elements in the operator's input tensor;
    the reduction-minimum axis is a reduction axis along which the operator performs a minimum-value reduction operation on the elements in the operator's input tensor;
    the reduction-mean axis is a reduction axis along which the operator performs an averaging reduction operation on the elements in the operator's input tensor.
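Slicing along a first-type reduction axis works because sum, max, min, and mean can all be computed as partial reductions per piece and then combined. A hedged numpy sketch (the function name and dispatch are illustrative assumptions, not the claimed mechanism):

```python
import numpy as np

def parallel_reduce(x, axis, K, op="sum"):
    """Slice x into K pieces along the reduction axis, reduce each piece
    (as each target computing resource would), then combine the partials."""
    parts = np.array_split(x, K, axis=axis)
    if op == "sum":
        return np.add.reduce([p.sum(axis=axis) for p in parts])
    if op == "max":
        return np.maximum.reduce([p.max(axis=axis) for p in parts])
    if op == "min":
        return np.minimum.reduce([p.min(axis=axis) for p in parts])
    if op == "mean":
        # mean is not directly associative: combine partial *sums*, then
        # divide once by the full axis length.
        total = np.add.reduce([p.sum(axis=axis) for p in parts])
        return total / x.shape[axis]
    raise ValueError(op)
```

Note the mean case: averaging the per-piece means would be wrong when the pieces have unequal lengths, so the combine step uses length-weighted partial sums.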
  29. The apparatus of claim 27, wherein the second type of reduction axis comprises a reduction-gather axis, the reduction-gather axis being a reduction axis along which the operator gathers element data from the operator's input tensor according to the addresses indicated by the elements of the operator's index input tensor.
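A minimal numpy illustration of a reduction-gather axis as described: the operator reads elements of its data input at addresses given by an index input, so the index tensor can be sliced across resources while the data tensor is replicated. The concrete values and the two-way split are illustrative assumptions.

```python
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0])   # replicated on every resource
index = np.array([3, 0, 0, 2])              # addresses into `data`

# Full gather on one resource.
gathered = np.take(data, index)

# Slice the index tensor across 2 resources; each gathers its share.
idx_parts = np.array_split(index, 2)
parts = [np.take(data, p) for p in idx_parts]

# Splicing the partial results reproduces the full gather.
assert np.concatenate(parts).tolist() == gathered.tolist()
```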
  30. The apparatus of any one of claims 16 to 29, wherein the target computing resource comprises one of the following categories:
    a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
  31. A computer-readable storage medium, wherein the computer-readable storage medium stores program code comprising instructions for performing the method of any one of claims 1 to 15.
CN202280012811.6A 2022-01-28 2022-01-28 Methods and devices for processing computing tasks Pending CN116888601A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/074576 WO2023141939A1 (en) 2022-01-28 2022-01-28 Method and device for processing computing task

Publications (1)

Publication Number Publication Date
CN116888601A true CN116888601A (en) 2023-10-13

Family

ID=87469960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280012811.6A Pending CN116888601A (en) 2022-01-28 2022-01-28 Methods and devices for processing computing tasks

Country Status (2)

Country Link
CN (1) CN116888601A (en)
WO (1) WO2023141939A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118312327B * 2024-06-06 2024-10-22 Beijing Biren Technology Development Co., Ltd. Hardware resource allocation method, electronic device and storage medium
CN119718687B * 2025-02-28 2025-06-17 Qingdao Guoshi Technology Group Co., Ltd. Parallel computing method, system and computer device based on tensor reduction
CN120723486B * 2025-08-29 2025-11-14 Suzhou MetaBrain Intelligent Technology Co., Ltd. Method for acquiring reduction results and computing equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN113449859A * 2020-03-27 2021-09-28 Huawei Technologies Co., Ltd. Data processing method and device
CN113449857A * 2020-03-27 2021-09-28 Huawei Technologies Co., Ltd. A data processing method and data processing device
WO2021190761A1 * 2020-03-27 2021-09-30 Huawei Technologies Co., Ltd. Parallel computing scheme generation for neural networks
CN113485837A * 2021-07-21 2021-10-08 Vastai Technologies (Shanghai) Co., Ltd. Tensor processing method and processing system based on parallel branch and tensor segmentation

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10970619B1 (en) * 2020-08-21 2021-04-06 Moffett Technologies Co., Limited Method and system for hierarchical weight-sparse convolution processing
CN112465108B * 2020-11-11 2022-07-22 Shanghai Jiao Tong University A neural network compilation method for a storage-computing integrated platform


Also Published As

Publication number Publication date
WO2023141939A9 (en) 2024-07-25
WO2023141939A1 (en) 2023-08-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination