
WO2021114757A1 - Method and apparatus for optimizing a computation graph, computer device, and storage medium - Google Patents

Method and apparatus for optimizing a computation graph, computer device, and storage medium

Info

Publication number
WO2021114757A1
WO2021114757A1 (application PCT/CN2020/113290, CN2020113290W)
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
computing
node
current
check node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/113290
Other languages
English (en)
Chinese (zh)
Inventor
周舒畅
王田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Publication of WO2021114757A1 publication Critical patent/WO2021114757A1/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of computer technology, and in particular to a method, device, computer equipment, and storage medium for optimizing a calculation graph.
  • in existing approaches, the optimization process of a computing network model uses a single, unified optimization method: an optimization model is designed for one specific combination of computing network model and application environment, according to the hardware index requirements proposed by the user.
  • the corresponding optimization model ensures that the computer equipment resources consumed when the optimization model is later compiled and run meet the performance index requirements proposed by the user.
  • however, such an optimization method only applies to the one computing network model and application environment it was designed for.
  • once the computing network model or application environment changes, the corresponding optimization method must be redesigned. The adaptability of this optimization method is therefore extremely low, which makes the operating efficiency of the computing network model extremely low.
  • a method for optimizing a calculation graph includes:
  • the calculation graph includes multiple calculation nodes
  • the computing nodes after the current check node are optimized.
  • the current performance margin includes the current delay performance margin.
  • determining the optimization strategy includes: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation; if the current delay performance margin is not sufficient, determining the delay optimization strategy as the optimization strategy; the delay optimization strategy is used to reduce the time consumed by the computing node during calculation.
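As an illustrative sketch only (not code from the patent, and all names hypothetical), the strategy selection described in the bullet above can be expressed as a simple branch on the delay margin:

```python
def choose_strategy(delay_margin: float, threshold: float = 0.0) -> str:
    """Pick the optimization strategy for computing nodes after a check node.

    A delay margin above `threshold` means computing time is plentiful,
    so memory usage is optimized instead; otherwise computation time
    itself must be reduced.
    """
    if delay_margin > threshold:
        return "storage_optimization"  # reduce memory occupied during calculation
    return "delay_optimization"        # reduce time consumed during calculation
```

The symmetric case for a storage performance margin (next bullets) would swap the two return values.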
  • the current performance margin includes the current storage performance margin
  • the optimization strategy is determined according to the current storage performance margin, including: if the current storage performance margin is sufficient, determining the delay optimization strategy as the optimization strategy; the delay optimization strategy is used to reduce the time consumed by the computing node during calculation; if the current storage performance margin is not sufficient, determining the storage optimization strategy as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
  • the storage optimization strategy includes: storing the data generated by the computing node after the check node during calculation in a storage space with high access latency; the storage space with high access latency includes at least global memory and off-chip memory; And/or, the delay optimization strategy includes: storing the data generated by the computing node after the check node during calculation in a storage space with low access latency; the storage space with low access latency includes at least cache space and on-chip storage.
  • the delay optimization strategy further includes: obtaining the size of the data generated during calculation by the computing nodes after the check node; comparing the size of that data with the size of the preset storage memory; if the size of the data exceeds the preset storage memory size, splitting the computing nodes after the check node and storing the data generated by the split computing nodes during calculation in the storage space with low access latency; if the size of the data does not exceed the preset storage memory size, storing the data generated during calculation by the computing nodes after the check node in a storage space with low access latency.
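A minimal sketch of this fit-or-split decision, assuming hypothetical names (`Node`, `data_size`, `place_in_low_latency`) that do not come from the patent:

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    data_size: int  # bytes of intermediate results and temporary variables


def split(node: Node, parts: int) -> list[Node]:
    """Split one node into `parts` nodes, dividing its data evenly."""
    share = -(-node.data_size // parts)  # ceiling division
    return [Node(f"{node.name}_{i}", share) for i in range(parts)]


def place_in_low_latency(node: Node, budget: int) -> list[Node]:
    """A node whose data fits the low-latency budget is kept whole;
    an oversized node is split until every piece fits."""
    if node.data_size <= budget:
        return [node]
    parts = -(-node.data_size // budget)
    return split(node, parts)
```

For example, a node producing 100 units of data against a 40-unit budget would be split into three nodes whose data each fits the budget.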
  • obtaining the current performance margin through the current check node includes: obtaining the first total target calculation consumption time and the total actual calculation consumption time of all computing nodes before the current check node; and determining the current delay performance margin according to the first total target calculation consumption time and the total actual calculation consumption time.
  • obtaining the first total target calculation consumption time of all computing nodes before the current check node includes: obtaining the second total target calculation consumption time of all the computing nodes on the path where the current check node is located; and determining the first total target calculation consumption time from the second total target calculation consumption time and a preset ratio; the preset ratio is the proportion of the total calculation time consumed by all the computing nodes before the current check node to the total calculation time consumed by all the computing nodes on the path where the check node is located.
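The two bullets above reduce to simple arithmetic; the following sketch (not from the patent, names hypothetical) shows the per-check-node target derived from the path target and the preset ratio, and the margin as target minus actual:

```python
def first_total_target(path_target_total: float, preset_ratio: float) -> float:
    """Target time for all nodes before the check node, where `preset_ratio`
    is the fraction of the whole path's time those nodes should consume."""
    return path_target_total * preset_ratio


def delay_margin(target: float, actual: float) -> float:
    """Positive when the run is ahead of its time budget, negative when behind."""
    return target - actual
```

E.g. with a 100 ms path target and a preset ratio of 0.4, nodes before the check node are budgeted 40 ms; an actual 35 ms leaves a 5 ms margin.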
  • inserting at least one check node in the calculation graph includes: obtaining the calculation consumption time ratio of each calculation node on the longest path in the calculation graph; determining the insertion position of at least one check node on the longest path according to the calculation consumption time ratio; and inserting at least one check node at that insertion position.
  • obtaining the proportion of computing time consumed by each computing node on the longest path in the calculation graph includes: obtaining the calculation amount of each computing node on the longest path; obtaining, according to the calculation amount, the computing time consumed by each computing node; and determining, according to the computing time consumed by each computing node on the longest path, the ratio of computing time consumed by each computing node on the longest path.
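A hedged sketch of this estimate (not from the patent): per-node time is derived from a calculation amount such as FLOPs divided by a hypothetical device speed, and the ratios are the normalized times — note the device speed cancels out of the ratios:

```python
def time_ratios(flops_per_node: list[float], flops_per_second: float) -> list[float]:
    """Estimate each node's share of the longest path's total computing time
    from its computation amount (FLOPs) and an assumed device throughput."""
    times = [f / flops_per_second for f in flops_per_node]
    total = sum(times)
    return [t / total for t in times]
```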
  • obtaining the calculation consumption time ratio of each computing node on the longest path in the calculation graph includes: constructing a consumption time estimation model; using the consumption time estimation model to obtain the calculation time consumption of each computing node on the longest path; and determining, according to the calculation time consumption of each computing node on the longest path, the ratio of the calculation time consumption of each computing node on the longest path.
  • determining the insertion position of at least one check node on the longest path according to the calculation consumption time ratio includes: dividing the longest path into a preset number of sub-paths according to the calculation consumption time ratio; and selecting at least one of the sub-paths as the insertion position for inserting a check node.
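One way this division could work, sketched under the assumption (not stated in the patent) that sub-paths should carry roughly equal shares of cumulative time:

```python
def subpath_boundaries(ratios: list[float], num_subpaths: int) -> list[int]:
    """Return node indices after which a check node could be inserted,
    one per sub-path boundary, balancing cumulative time share."""
    boundaries: list[int] = []
    cum, k = 0.0, 1
    for i, r in enumerate(ratios):
        cum += r
        # Close sub-path k once its share of total time has accumulated.
        if k < num_subpaths and cum >= k / num_subpaths:
            boundaries.append(i)
            k += 1
    return boundaries
```

With four equal-time nodes and two sub-paths, the single boundary falls after the second node (index 1).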
  • inserting at least one check node in the calculation graph includes: obtaining a start calculation node and an end calculation node that are separated by at least one calculation node in the calculation graph; and inserting at least one check node at the middle position between the start calculation node and the end calculation node.
  • obtaining the calculation graph of the calculation network model includes: loading the topology structure and parameters of the calculation network model; compiling the topology structure and parameters of the calculation network model to obtain the calculation graph of the calculation network model.
  • an optimization device for a calculation graph includes:
  • the first obtaining module is used to obtain a calculation graph of a calculation network model; the calculation graph includes a plurality of calculation nodes;
  • the insertion module is used to insert at least one check node in the calculation graph;
  • the second obtaining module is used to obtain the current performance margin through the current check node when running to each check node;
  • the determining module is used to determine the optimization strategy according to the current performance margin
  • the optimization module is used to optimize the computing nodes after the current check node according to the optimization strategy.
  • a computer device, in a third aspect, includes a memory and a processor; the memory stores a computer program, and when the processor executes the computer program, it implements the calculation graph optimization method described in any embodiment of the first aspect.
  • a computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the method for optimizing the calculation graph described in any one of the embodiments of the first aspect.
  • in the method, device, computer equipment, and storage medium for optimizing calculation graphs provided in this application, a calculation graph of a calculation network model including multiple calculation nodes is obtained, at least one check node is inserted in the calculation graph, and when the run reaches each check node, the current performance margin is obtained through the current check node; an optimization strategy is then determined according to the current performance margin, and the resources consumed by the computing nodes after the current check node are optimized according to the optimization strategy.
  • by inserting check nodes, the above optimization method obtains the current performance margin of the computer equipment whenever the run reaches a check node, and then selects the optimization strategy that matches the actual operation of the computer equipment according to the current performance margin to optimize the resources consumed by the computing nodes after the check node; in this way, while the computer equipment runs these computing nodes, the resource usage of each computing node in the calculation graph can be dynamically adjusted to meet the user's performance index requirements for the calculation graph and to improve the resource utilization of the computer equipment.
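The overall run-time loop summarized above can be sketched end to end; this is purely illustrative (the node representation, targets, and strategy labels are all hypothetical, not the patent's implementation):

```python
import time


def run_graph(nodes):
    """`nodes` is a list of ("compute", fn) or ("check", target_seconds)
    items in execution order; returns the strategy chosen at each check node."""
    strategies = []
    start = time.perf_counter()
    for kind, payload in nodes:
        if kind == "compute":
            payload()  # execute this computing node
        else:
            # Check node: compare elapsed time against this point's target.
            elapsed = time.perf_counter() - start
            margin = payload - elapsed  # target minus actual
            strategies.append("storage_optimization" if margin > 0
                              else "delay_optimization")
    return strategies
```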
  • FIG. 1 is a schematic diagram of the internal structure of a computer device provided by an embodiment
  • FIG. 2 is a flowchart of a method for optimizing a calculation graph according to an embodiment
  • 2A is a flowchart of a method for optimizing a calculation graph provided by an embodiment
  • FIG. 3 is a flowchart of an implementation manner of S103 in the embodiment of FIG. 2;
  • FIG. 4 is a flowchart of an implementation manner of S201 in the embodiment of FIG. 3;
  • FIG. 5 is a flowchart of an implementation manner of S102 in the embodiment of FIG. 2;
  • FIG. 6 is a flowchart of an implementation manner of S401 in the embodiment of FIG. 5;
  • FIG. 7 is a flowchart of another implementation manner of S401 in the embodiment of FIG. 5;
  • FIG. 8 is a flowchart of another implementation manner of S402 in the embodiment of FIG. 5;
  • FIG. 8A is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 8B is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 9 is a flowchart of another implementation manner of S102 in the embodiment of FIG. 2;
  • FIG. 9A is a schematic structural diagram of a calculation graph provided by an embodiment
  • Fig. 10 is a flowchart of an implementation manner of S101 in the embodiment of Fig. 2;
  • FIG. 11 is a flowchart of a method for optimizing a calculation graph provided by an embodiment
  • FIG. 12 is a schematic structural diagram of an optimization device for a calculation graph provided by an embodiment
  • Fig. 13 schematically shows a block diagram of a computing processing device for executing the method according to the present invention
  • Fig. 14 schematically shows a storage unit for holding or carrying program codes for implementing the method according to the present invention.
  • the method for optimizing the calculation graph provided in this application can be applied to the computer device as shown in FIG. 1.
  • the computer device may be a server or a terminal, and its internal structure diagram may be as shown in FIG. 1.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize an optimization method of the calculation graph.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the computer equipment can be a touch layer covering the display screen, a button, a trackball, or a touchpad set on the housing of the computer equipment, or an external keyboard, touchpad, or mouse.
  • FIG. 1 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may include more or fewer parts than shown in the figure, or combine some parts, or have a different arrangement of parts.
  • Fig. 2 is a flowchart of a method for optimizing a calculation graph provided by an embodiment.
  • the execution body of the method is the computer device in Fig. 1, and the method involves the computer device running the calculation graph of the calculation network model.
  • the specific process of optimizing the calculation graph is shown in Figure 2; the method specifically includes the following steps:
  • the computing network model may be constructed by computer equipment in advance according to actual application requirements, and it may specifically be a computing network model with various functional applications, such as a neural network model, a machine learning network model, and an intelligent algorithm network model.
  • computational graphs are a kind of "language" for describing calculation methods; a graph is composed of multiple computing nodes, and computing nodes with dependencies are connected to each other.
  • the computing node may include code for executing a certain computing function, so that when the computer device runs to the computing node, it can execute the corresponding computing task in the computing network model.
  • the computer device can compile a preset calculation network model through a compiler to generate a compiled calculation graph.
  • the computer device may also directly obtain the compiled calculation graph of the calculation network model through other methods, which is not limited in this embodiment.
  • the computer device may also construct the computational network model according to actual application requirements in advance, and then compile the computational network model based on the constructed computational network model for later use.
  • the computer device can also directly obtain a pre-compiled computing network model, and then compile it based on the obtained computing network model for later use during runtime, which is not limited in this embodiment.
  • the check node may include code for executing a certain calculation or test function, so that when the computer device runs to the check node, the corresponding calculation or test task can be executed, and the check node may be pre-configured by the computing device.
  • when the computer device obtains the calculation graph of the calculation network model and needs to optimize it while later running the calculation graph, at least one check node may be inserted into the calculation graph, so that the computer device can detect its resource consumption at the current moment and dynamically adjust the resource utilization of subsequent computing nodes accordingly; in this way, the resources consumed while the calculation graph is executed can always meet the performance indexes proposed by the user or reach the optimum, making full use of the resources on the computer equipment.
  • the current performance margin represents the margin between the computer equipment resources actually consumed when the computer device runs to the current check node and the computer equipment resources indicated by the user's expected performance index.
  • the current performance margin may be the performance margin of a delay performance index, the performance margin of a storage performance index, or the performance margin of another type of performance index consumed by the computer equipment during calculation.
  • the computer device can obtain its current performance margin by executing the code on the check node, so that it can then optimize the computing nodes after the check node in different ways according to the current performance margin, allowing the optimized computing nodes to make full use of the resources of the computer device when they are executed.
  • S104 Determine an optimization strategy according to the current performance margin.
  • the optimization strategy is used to optimize the resources consumed by the computing nodes after the check node, so that the resources consumed by the subsequent computing nodes when executed can meet user needs or match the performance indicators of the computing device.
  • when the computer device obtains the current performance margin through the current check node, it can determine whether the current performance margin is sufficient, and then select different optimization strategies according to the judgment result to dynamically optimize the computing nodes after the check node in the calculation graph.
  • the optimization strategy of reducing memory can be used for compiling and running, so as to reduce the storage performance consumption of the computer equipment and allow all performance indexes of the computer equipment to meet user demands.
  • the optimization strategy of reducing memory access operations with high access delay can be used for compiling and running, so as to reduce the delay performance consumption of the computer equipment's calculation and allow all performance indexes of the computer equipment to meet user demands.
  • if the current performance margin of the delay performance index is not sufficient, the optimization strategy of reducing memory access operations with high access latency can be used for compiling and running, so as to reduce the delay performance consumption of the computer equipment's calculation; if the current performance margin of the storage performance index is not sufficient, an optimization strategy that reduces the performance consumption of the computer equipment during storage can be used.
  • the parameters or variables on the computing node after the current check node can be optimized according to the optimization strategy.
  • specifically, the computer device can change the storage method of the parameters or variables on the computing node, thereby changing the time the computing node takes to read or write data; this in turn changes the computation time when the computer device runs to the computing node, improving the delay performance of the computer device and completing the optimization of the computing node.
  • a computer device can also split a computing node, so that the resources consumed by one computing node are divided among multiple computing nodes, reducing the resource-consumption burden when the computer device runs to each computing node and completing the optimization of the computing node.
  • in the method for optimizing the calculation graph provided in this embodiment, a calculation graph of a calculation network model including multiple calculation nodes is obtained, at least one check node is inserted in the calculation graph, and when the run reaches each check node, the current performance margin is obtained through the current check node; an optimization strategy is then determined according to the current performance margin, and the resources consumed by the computing nodes after the current check node are optimized according to the optimization strategy.
  • by inserting check nodes, the above optimization method obtains the current performance margin of the computer equipment whenever the run reaches a check node, and then selects the optimization strategy that matches the actual operation of the computer equipment according to the current performance margin to optimize the resources consumed by the computing nodes after the check node; in this way, while the computer equipment runs these computing nodes, the resource usage of each computing node in the calculation graph can be dynamically adjusted to meet the user's performance index requirements for the calculation graph and to improve the resource utilization of the computer equipment.
  • the foregoing current performance margin includes a performance margin representing a delay performance index, that is, the current delay performance margin.
  • for this, the present application provides an implementation manner of the foregoing S104, including: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
  • this embodiment relates to an application scenario in which the computer device finds the current delay performance margin sufficient, indicating that the computing time available to the computer device is ample at this moment and can also meet the calculation requirements of later computing nodes.
  • the computer device does not need to pay attention to the computing time consumed by the computing node during calculation, and can instead focus on the memory occupied by the computing node during calculation, so as to optimize the memory resources on the computer device and prevent too much memory from being occupied, which would affect the calculation performance of the computer device and in turn the speed at which the calculation graph is executed.
  • the aforementioned storage optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a storage space with high access latency; the storage space with high access latency includes at least global memory and off-chip memory.
  • the data generated by the calculation node during calculation may include intermediate results and temporary variables required in the calculation.
  • when the computer device optimizes the computing nodes after the check node according to the storage optimization strategy, it can store the data generated by those computing nodes in a storage space with high access latency, for example the global memory of a GPU or the off-chip storage of a TPU, so as to reduce the occupancy of the computer equipment's memory and thereby increase its computing speed.
  • the delay optimization strategy is determined as the optimization strategy; the delay optimization strategy is used to reduce the computational time spent by the computing node during calculation.
  • this embodiment relates to an application scenario in which the computer device finds the current delay performance margin insufficient, indicating that the computing time available to the computer equipment is relatively tight and may not meet the computing requirements of later computing nodes.
  • the computer equipment therefore needs to focus on the computing time consumed by computing nodes during calculation, and can ignore the memory they occupy, so as to optimize the delay performance of the computer device and prevent computing nodes from consuming too much time during calculation, which would affect the speed at which the calculation graph is executed.
  • the above-mentioned delay optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a low-access-latency storage space; the low-access-latency storage space includes at least cache space and on-chip storage.
  • the computer device when the computer device optimizes the computing node after the check node according to the delay optimization strategy, it can specifically store the data generated by the computing node after the check node in the storage space with low access latency, for example, The memory or cache of the computer equipment, etc., to reduce the time for the computing node to access the storage space during calculation, thereby increasing the computing speed of the computing node, and then increasing the computing speed of the computer equipment.
  • the foregoing current performance margin includes a performance margin representing a storage performance index, that is, the current storage performance margin.
  • this application provides an implementation manner of the foregoing S104, and the method includes: If the current storage performance margin is sufficient, the delay optimization strategy is determined as the optimization strategy; the delay optimization strategy is used to reduce the time consumed by the computing node during calculation.
  • This embodiment relates to an application scenario in which a computer device obtains a sufficient current storage performance margin, indicating that the memory resource required by the computer device at this time is relatively sufficient, and it can also meet the computing needs of the later computing node.
  • the computer device needs to focus on the computation time consumed by the computing node during calculation, and can ignore the memory the computing node occupies, so as to optimize the delay performance of the computer device and prevent the computing node from spending too much time during calculation, which would affect the speed at which the calculation graph is executed.
  • the storage optimization strategy is determined as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
  • this embodiment relates to an application scenario in which the computer device finds the current storage performance margin insufficient, indicating that the memory resources available to the computer device are relatively tight and may not meet the computing needs of later computing nodes.
  • the computer device therefore needs to focus on the memory occupied by the computing node during calculation, and can ignore the time the computing node consumes, so as to optimize the storage performance of the computer device and prevent the computing node from occupying too much memory during calculation, which would affect the speed at which the calculation graph is executed.
  • Determining the delay optimization strategy may further include the following steps:
  • This embodiment applies to the scenario in which the memory or cache on the computer device cannot satisfy the memory or cache that a computing node requires for calculation.
  • When the optimization strategy determined by the computer device is the delay optimization strategy, the size of the data generated during calculation by the computing nodes after the current check node can be obtained first, so as to estimate, based on that size, whether the memory or cache on the computer device meets the computing requirements.
  • S1042: Compare the size of the data generated by the computing node during calculation with the size of the preset storage space. If it exceeds the size of the preset storage space, execute step S1043; otherwise, execute step S1044.
  • After the computer device obtains the size of the data generated by the computing node during calculation, it can compare that size with the size of the preset storage space to obtain a comparison result, and then select a different delay optimization strategy according to the comparison result to optimize the computing nodes after the current check node.
  • The aforementioned preset storage space may be a storage space with low access latency, such as the memory and/or cache of the computer device.
  • The comparison result falls into two cases: the data generated during calculation exceeds the size of the preset storage space, indicating that the existing storage space on the computer device cannot meet the computing node's requirements; or it does not exceed the size of the preset storage space, indicating that the existing storage space is still relatively ample and can meet the computing node's requirements.
  • S1043: Split the computing node after the current check node, and store the data generated during calculation by the split computing nodes in a storage space with low access latency.
  • This embodiment relates to the application scenario in which the comparison result is that the size of the data generated by the computing node during calculation exceeds the size of the preset storage space.
  • In this case, the computer device can split the computing node after the check node and store the data generated during calculation by the split computing nodes in a storage space with low access latency, that is, the memory and/or cache on the computer device. Because the computing node has been split, the existing storage space on the computer device can accommodate the storage each split computing node requires during calculation.
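As a sketch of this splitting step, assuming (hypothetically) that a node's intermediate data can be partitioned evenly, the number of pieces needed so that each piece fits in the low-access-latency space is:

```python
import math

def split_count(data_size: int, fast_storage_size: int) -> int:
    """Number of pieces a computing node must be split into so that each
    piece's intermediate data fits in low-access-latency storage
    (memory and/or cache). Assumes an even partition is possible."""
    if data_size <= fast_storage_size:
        return 1  # fits as-is: store directly, no split needed
    return math.ceil(data_size / fast_storage_size)
```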
  • This embodiment relates to the application scenario in which the comparison result is that the size of the data generated by the computing node during calculation does not exceed the size of the preset storage space.
  • In this case, the computer device can directly store the data generated during calculation by the computing node after the check node in a storage space with low access latency. This step is the same as described above for the delay optimization strategy; for details, refer to the foregoing description, which is not repeated here.
  • Fig. 3 is a flowchart of an implementation manner of S103 in the embodiment of Fig. 2. As shown in Fig. 3, the “obtain current performance margin through the current check node” in S103 includes:
  • The first total target calculation consumption duration indicates the cumulative calculation time that the user expects all computing nodes before the current check node to consume during calculation.
  • The total actual calculation consumption duration refers to the calculation time actually accumulated by all computing nodes before the current check node by the time the computer device runs to the current check node.
  • When the computer device needs to obtain the current performance margin, it can first obtain the first total target calculation consumption duration and the total actual calculation consumption duration of all computing nodes before the current check node, and then determine the current performance margin from these two durations.
  • S202: Determine the current performance margin according to the first total target calculation consumption duration and the total actual calculation consumption duration.
  • When the computer device obtains the first total target calculation consumption duration and the total actual calculation consumption duration, it can directly take their difference to obtain the current performance margin; optionally, it can also weight each of the two durations before taking the difference.
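S202 can be sketched as a plain difference; the optional weights below are an assumption based on the weighting variant just described:

```python
def current_performance_margin(target_total: float, actual_total: float,
                               w_target: float = 1.0, w_actual: float = 1.0) -> float:
    """Current performance margin at a check node: the (optionally weighted)
    first total target calculation duration minus the total actual
    calculation duration accumulated so far."""
    return w_target * target_total - w_actual * actual_total
```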
  • a method of "obtaining the first total target calculation consumption time length of all computing nodes before the current check node" in S201 may specifically include:
  • The second total target calculation consumption duration indicates the cumulative calculation time that the user expects all computing nodes on the path where the current check node is located to consume during calculation.
  • When the computer device needs to obtain the first total target calculation consumption duration of all computing nodes before the current check node, it can first obtain the second total target calculation consumption duration of all computing nodes on the path where the current check node is located, and then determine the first duration from the second.
  • S302: Determine the first total target calculation consumption duration according to the second total target calculation consumption duration and a preset ratio; the preset ratio is the proportion of the total calculation time consumed by all computing nodes before the current check node to the total calculation time consumed by all computing nodes on the path where the check node is located.
  • The preset ratio can be obtained in advance by the computer device in a variety of ways.
  • For example, the computer device can pre-compute the computation amount of each computing node in the calculation graph, estimate each node's calculation time from its computation amount, and finally determine the preset ratio from the total calculation time of the computing nodes before the current check node and the total calculation time of all the computing nodes.
  • Alternatively, the computer device can use an existing consumption-time estimation model to estimate the calculation time of each computing node, and then likewise determine the preset ratio from the total calculation time of the computing nodes before the current check node and the total calculation time of all the computing nodes.
  • the preset ratio may also be determined by other methods, which is not limited in this embodiment.
  • Specifically, the second total target calculation consumption duration may be multiplied by the preset ratio to obtain the first total target calculation consumption duration. For example, if the second total target calculation consumption duration is 10 hours and the preset ratio is 1/2, the corresponding first total target calculation consumption duration is 5 hours.
  • Optionally, the computer device may also weight the second total target calculation consumption duration and the preset ratio before multiplying them to obtain the first total target calculation consumption duration.
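S302 then reduces to a multiplication, reproducing the 10-hour example above:

```python
def first_total_target(second_total_target: float, preset_ratio: float) -> float:
    """First total target duration: the whole-path target duration scaled by
    the fraction of work lying before the current check node."""
    return second_total_target * preset_ratio
```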
  • Fig. 5 is a flowchart of an implementation manner of S102 in the embodiment of Fig. 2. As shown in Fig. 5, “insert at least one check node in the calculation graph” in S102 includes:
  • The computer device may first determine the longest path according to the layout of the computing nodes in the calculation graph, obtain the calculation time consumed by each computing node on the longest path, and then compute, from those times, the proportion of calculation time consumed by each computing node on the longest path.
  • S402 Determine the insertion position of at least one check node on the longest path according to the proportion of the calculated consumption time.
  • To balance the calculation time consumed by the computing nodes on the longest path, the computer device, when inserting a check node, can determine the insertion position of at least one check node on the longest path according to the proportions of calculation time consumed.
  • For example, the insertion position can be chosen so that the total calculation time of all computing nodes before the inserted check node equals that of all computing nodes after it.
  • After the computer device determines the insertion position of at least one check node on the longest path, it can insert at least one check node at that position, so as to optimize the computing nodes after the check node.
  • Optionally, this application provides a specific way for the computer device to obtain the proportion of calculation time consumed. The above S401, "obtain the proportion of computing time consumed by each computing node on the longest path in the calculation graph", may specifically include:
  • When the computer device compiles the computational network model to obtain the corresponding calculation graph, it can obtain the computation amount of each computing node in the calculation graph according to the calculation steps each node contains. In this embodiment, the computer device can therefore first determine the longest path in the calculation graph, determine the computing nodes it contains, and then, from information such as the calculation steps in each of those nodes, obtain the computation amount of each computing node on the longest path.
  • When the computer device obtains the computation amount of each computing node on the longest path, it can further estimate each node's calculation time from its computation amount, thereby obtaining the calculation time consumed by each computing node on the longest path.
  • In general, the greater a computing node's computation amount, the longer its estimated calculation time; the smaller the computation amount, the shorter the estimated calculation time.
  • S503 Determine the proportion of the computing time consumed by each computing node on the longest path according to the computing time consumed by each computing node on the longest path.
  • When the computer device obtains the calculation time consumed by each computing node on the longest path, it can compute the ratio of each node's calculation time to the total, obtaining the proportion of calculation time consumed by each computing node on the longest path.
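S503 is a straightforward normalization; a minimal sketch assuming the per-node calculation times have already been estimated:

```python
def time_proportions(node_times: list[float]) -> list[float]:
    """Proportion of the total calculation time consumed by each computing
    node on the longest path."""
    total = sum(node_times)
    return [t / total for t in node_times]
```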
  • Optionally, the present application provides another specific way for the computer device to obtain the proportion of calculation time consumed. The above S401, "obtain the proportion of computing time consumed by each computing node on the longest path in the calculation graph", may specifically include:
  • When the computer device needs to obtain the proportion of calculation time consumed by each computing node on the longest path, it can first construct a consumption-time estimation model that analyzes the calculation steps contained in each computing node and estimates each node's computation amount.
  • Such a consumption-time estimation model may be a pre-trained estimation model; it belongs to the prior art and is not described in detail here.
  • S602. Use the consumption time estimation model to obtain the calculation consumption time of each computing node on the longest path.
  • Specifically, the consumption-time estimation model can estimate the calculation time consumed by each computing node on the longest path by analyzing each node's calculation steps.
  • S603 Determine the proportion of the computing time consumed by each computing node on the longest path according to the computing time consumed by each computing node on the longest path.
  • The content of step S603 in this embodiment is the same as that of step S503 and is not repeated here.
  • Optionally, step S402, determining the insertion position of at least one check node on the longest path according to the proportions of calculation time consumed, may include:
  • The preset number represents the number of check nodes to be inserted on the longest path; it can be determined in advance by the computer device according to the length of the longest path or actual application requirements.
  • After the computer device determines the preset number of check nodes to be inserted, it can divide the longest path into the preset number of sub-paths by analyzing the proportions of calculation time consumed by the computing nodes on the longest path, so as to balance the calculation time consumed on each sub-path.
  • At least one of the resulting sub-path boundaries can then be selected as an insertion position for a check node, such that the total calculation time of all computing nodes before the insertion position and the total calculation time of all computing nodes after it are as equal as possible, balancing the calculation time of the computing nodes on the longest path.
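One way to realize this division (a greedy sketch; the text does not fix a concrete algorithm) is to walk the longest path and record an insertion position each time the running time total crosses the next equal share of the total calculation time:

```python
def check_node_positions(node_times: list[float], num_checks: int) -> list[int]:
    """Indices i such that a check node is inserted after node i, chosen so
    the cumulative calculation time is split into num_checks + 1 roughly
    equal sub-paths. Greedy sketch, not the patent's definitive procedure."""
    total = sum(node_times)
    share = total / (num_checks + 1)
    positions, running, next_cut = [], 0.0, share
    for i, t in enumerate(node_times[:-1]):  # never insert after the last node
        running += t
        if running >= next_cut and len(positions) < num_checks:
            positions.append(i)  # insert a check node after node i
            next_cut += share
    return positions
```

With the time ratio 1:5:6 used in the worked example later in this text, one check node lands after the second node, matching check node a.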
  • Fig. 9 is a flowchart of another implementation manner of S102 in the embodiment of Fig. 2. As shown in Fig. 9, "insert at least one check node in the calculation graph" in S102 includes:
  • The start and end computing nodes on the path can be obtained first, so that check nodes can later be inserted between them.
  • The number of cross-computing nodes may be one or more, which is not limited in this embodiment.
  • At least one check node can then be inserted between the start computing node and the end computing node. The method of inserting at least one check node in the calculation graph described in the embodiment of Fig. 9 is illustrated by the calculation graph shown in Fig.
  • FIG. 10 is a flowchart of an implementation manner of S101 in the embodiment of FIG. 2. As shown in FIG. 10, the "acquisition of the calculation diagram of the calculation network model" in the above S101 includes:
  • The compiler of the computer device can obtain the calculation graph of the computational network model by loading the topology and parameters of the computational network model and compiling them.
  • After the compiler of the computer device loads the topology and parameters of the computational network model, it can compile them to obtain the calculation graph of the computational network model, so that the computing resources consumed by the calculation graph can be optimized later while the calculation graph runs.
  • this application also provides a calculation graph optimization method, as shown in FIG. 11, the method includes:
  • step S1009 Determine whether the current delay performance margin is sufficient. If the current delay performance margin is sufficient, perform step S1010; if the current delay performance margin is insufficient, perform step S1011.
  • the storage optimization strategy includes: storing data generated during the calculation of the computing nodes after the current check node in a storage space with high access latency.
  • Step S1011: Obtain the size of the data generated during calculation by the computing node after the current check node, and compare it with the size of the preset storage space. If the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, execute step S1012; if it does not, execute step S1013.
  • the steps of the optimization method are illustrated by way of example.
  • Take the calculation graph shown in FIG. 8B as an example. Suppose computing node 1, computing node 2, and computing node 3 are on the longest path in FIG. 8B, their total preset calculation consumption duration is T, and the memory usage is M.
  • Suppose the calculation-time ratio of computing node 1, computing node 2, and computing node 3, as estimated by the computer device from the computation amounts or from the consumption-time estimation model, is 1:5:6, and the preset time threshold is th. When the computer device runs to check node a, let the actual calculation time of computing node 1 and computing node 2 before check node a be Tr; the current delay performance margin is then T*6/12-Tr.
  • If T*6/12-Tr ≥ th, the storage optimization strategy is used to optimize the computing nodes after the check node; specifically, the intermediate results or temporary variables required for calculation in computing node 3 are stored in the high-access-latency storage space (off-chip GPU storage space or TPU storage space). If T*6/12-Tr < th, the corresponding delay optimization strategy is used to optimize the computing nodes after the check node; specifically, the intermediate results or temporary variables required for calculation in computing node 3 are stored in the low-access-latency storage space (memory or cache).
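The worked example can be executed directly; T, Tr, and th below are illustrative values chosen for the sketch, not numbers fixed by the text:

```python
T, Tr, th = 12.0, 5.0, 0.5   # illustrative: total target time, actual time so far, threshold
margin = T * 6 / 12 - Tr     # nodes 1 and 2 account for (1+5)/12 of the target time
if margin >= th:
    # ahead of schedule: trade speed for memory, spill to high-latency storage
    strategy = "storage_optimization"
else:
    # behind schedule: keep intermediates in memory/cache to save time
    strategy = "delay_optimization"
print(strategy)
```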
  • An optimization device for a calculation graph, including: a first acquisition module 11, an insertion module 12, a second acquisition module 13, a determination module 14, and an optimization module 15, wherein:
  • the first obtaining module 11 is used to obtain the calculation graph of the calculation network model;
  • the calculation graph includes a plurality of calculation nodes;
  • the insertion module 12 is used to insert at least one check node in the calculation graph;
  • the second obtaining module 13 is used to obtain, when running to each check node, the current performance margin through the current check node;
  • the determining module 14 is used to determine the optimization strategy according to the current performance margin;
  • the optimization module 15 is used to optimize the computing nodes after the current check node according to the optimization strategy.
  • Each module in the optimization device of the above calculation graph can be implemented in whole or in part by software, hardware and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device including a memory and a processor, and a computer program is stored in the memory.
  • When the processor executes the computer program, the following steps are implemented: obtain a calculation graph of a computational network model, the calculation graph including multiple computing nodes; insert at least one check node in the calculation graph; when running to each check node, obtain the current delay performance margin through the current check node; determine the optimization strategy according to the current delay performance margin; and optimize the computing nodes after the current check node according to the optimization strategy.
  • the implementation principle and technical effect of a computer device provided by the foregoing embodiment are similar to those of the foregoing method embodiment, and will not be repeated here.
  • a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the following steps are also implemented: obtain a calculation graph of a computational network model, the calculation graph including multiple computing nodes; insert at least one check node in the calculation graph; when running to each check node, obtain the current performance margin through the current check node; determine the optimization strategy according to the current performance margin; and optimize the computing nodes after the current check node according to the optimization strategy.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination of them.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the computing processing device according to the embodiments of the present invention.
  • the present invention can also be implemented as a device or device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein.
  • Such a program for realizing the present invention may be stored on a computer-readable medium, or may have the form of one or more signals.
  • FIG. 13 shows a computing processing device that can implement the method of the present invention.
  • the computing processing device may be a computer device, which traditionally includes a processor 1010 and a computer program product in the form of a memory 1020 or a computer readable medium.
  • the memory 1020 has a storage space 1030 for executing program codes 1031 of any method steps in the above methods.
  • the storage space 1030 for program codes may include various program codes 1031 respectively used to implement various steps in the above method. These program codes can be read from or written into one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks.
  • Such a computer program product is usually a portable or fixed storage unit as described with reference to FIG. 14.
  • the storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 1020 in the computing processing device of FIG. 13.
  • For example, the program code can be compressed in an appropriate form.
  • The storage unit includes computer-readable code 1031', that is, code that can be read by a processor such as the processor 1010; when run by a computing processing device, this code causes the computing processing device to execute the steps of the method described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Method and apparatus for optimizing a computation graph, computer device, and storage medium. The method comprises: obtaining a computation graph of a computational network model; inserting at least one check node into the computation graph; when running to each check node, obtaining the current performance margin by means of the current check node; then determining an optimization policy according to the current performance margin; and, in accordance with the optimization policy, optimizing a resource to be consumed by a computing node after the current check node. According to this optimization method, by inserting check nodes, obtaining the current performance margin of a computer device when it runs to each check node, and selecting, according to the current performance margin, the optimization policy that matches the actual operating state of the computer device to optimize the resource to be consumed by the computing node, the resource usage of each computing node during calculation can be adjusted dynamically while the computer device runs the computing nodes, thereby improving the utilization rate of resources on the computer device.
PCT/CN2020/113290 2019-12-09 2020-09-03 Procédé et appareil d'optimisation d'un graphique de calcul, dispositif informatique, et support de stockage Ceased WO2021114757A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911249112.XA CN111158901B (zh) 2019-12-09 2019-12-09 计算图的优化方法、装置、计算机设备和存储介质
CN201911249112.X 2019-12-09

Publications (1)

Publication Number Publication Date
WO2021114757A1 true WO2021114757A1 (fr) 2021-06-17

Family

ID=70555798

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113290 Ceased WO2021114757A1 (fr) 2019-12-09 2020-09-03 Procédé et appareil d'optimisation d'un graphique de calcul, dispositif informatique, et support de stockage

Country Status (2)

Country Link
CN (1) CN111158901B (fr)
WO (1) WO2021114757A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987987A (zh) * 2021-09-30 2022-01-28 深圳市紫光同创电子有限公司 时序优化方法、系统、设备及存储介质
CN114489604A (zh) * 2022-02-17 2022-05-13 山东产业技术研究院智能计算研究院 一种基于计算图的算子监测方法及系统

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158901B (zh) * 2019-12-09 2023-09-08 爱芯元智半导体(宁波)有限公司 计算图的优化方法、装置、计算机设备和存储介质
CN114003306B (zh) * 2021-10-27 2024-03-15 上海商汤科技开发有限公司 一种显存优化方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339252A (zh) * 2015-07-08 2017-01-18 阿里巴巴集团控股有限公司 分布式dag系统的自适应优化方法和装置
CN107045455A (zh) * 2017-06-19 2017-08-15 华中科技大学 一种基于负载预测的Docker Swarm集群资源调度优化方法
CN109189572A (zh) * 2018-08-02 2019-01-11 中兴飞流信息科技有限公司 一种资源预估方法及系统、电子设备和存储介质
CN110362611A (zh) * 2019-07-12 2019-10-22 拉卡拉支付股份有限公司 一种数据库查询方法、装置、电子设备及存储介质
CN111158901A (zh) * 2019-12-09 2020-05-15 北京迈格威科技有限公司 计算图的优化方法、装置、计算机设备和存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8504970B1 (en) * 2011-07-05 2013-08-06 Altera Corporation Method and apparatus for performing automated timing closure analysis for systems implemented on target devices
WO2017083399A2 (fr) * 2015-11-09 2017-05-18 Google Inc. Formation de réseaux neuronaux représentés sous forme de graphes de calcul
US10346206B2 (en) * 2016-08-27 2019-07-09 International Business Machines Corporation System, method and computer program product for resource management in a distributed computation system
US10592280B2 (en) * 2016-11-23 2020-03-17 Amazon Technologies, Inc. Resource allocation and scheduling for batch jobs
CN107547746B (zh) * 2017-08-31 2020-09-04 Oppo广东移动通信有限公司 资源配置方法及相关产品
US10965536B2 (en) * 2019-03-30 2021-03-30 Intel Corporation Methods and apparatus to insert buffers in a dataflow graph
CN110515739B (zh) * 2019-10-23 2020-01-31 上海燧原智能科技有限公司 深度学习神经网络模型负载计算方法、装置、设备及介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339252A (zh) * 2015-07-08 2017-01-18 阿里巴巴集团控股有限公司 分布式dag系统的自适应优化方法和装置
CN107045455A (zh) * 2017-06-19 2017-08-15 华中科技大学 一种基于负载预测的Docker Swarm集群资源调度优化方法
CN109189572A (zh) * 2018-08-02 2019-01-11 中兴飞流信息科技有限公司 一种资源预估方法及系统、电子设备和存储介质
CN110362611A (zh) * 2019-07-12 2019-10-22 拉卡拉支付股份有限公司 一种数据库查询方法、装置、电子设备及存储介质
CN111158901A (zh) * 2019-12-09 2020-05-15 北京迈格威科技有限公司 计算图的优化方法、装置、计算机设备和存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987987A (zh) * 2021-09-30 2022-01-28 深圳市紫光同创电子有限公司 时序优化方法、系统、设备及存储介质
CN113987987B (zh) * 2021-09-30 2025-08-26 深圳市紫光同创电子股份有限公司 时序优化方法、系统、设备及存储介质
CN114489604A (zh) * 2022-02-17 2022-05-13 山东产业技术研究院智能计算研究院 一种基于计算图的算子监测方法及系统
CN114489604B (zh) * 2022-02-17 2025-06-17 济南中科泛在智能计算研究院 一种基于计算图的算子监测方法及系统

Also Published As

Publication number Publication date
CN111158901A (zh) 2020-05-15
CN111158901B (zh) 2023-09-08

Similar Documents

Publication Publication Date Title
CN113095474B (zh) 深度学习模型的资源使用情况预测
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
WO2021114757A1 (fr) Procédé et appareil d'optimisation d'un graphique de calcul, dispositif informatique, et support de stockage
Peters Parallel Python with Dask: Perform distributed computing, concurrent programming and manage large dataset
Li et al. Sculptor: Flexible approximation with selective dynamic loop perforation
CN117008916B (zh) 张量程序优化方法和装置
US9031826B2 (en) Method and apparatus for simulating operation in a data processing system
CN104718529B (zh) 表示没有外部引用的引用属性注释
Huang et al. Novel heuristic speculative execution strategies in heterogeneous distributed environments
CN108205469A (zh) 一种基于MapReduce的资源分配方法及服务器
Wernsing et al. Elastic computing: A portable optimization framework for hybrid computers
CN114021733B (zh) 模型训练优化方法、装置、计算机设备及存储介质
Kersten et al. A hoare logic for energy consumption analysis
US8661424B2 (en) Auto-generation of concurrent code for multi-core applications
JP5687603B2 (ja) プログラム変換装置、プログラム変換方法、および変換プログラム
CN108021563B (zh) 一种指令间数据依赖的检测方法和装置
CN116301874A (zh) 代码编译方法、电子设备及存储介质
JP2009075965A (ja) ソフトウェア開発方法及びソフトウェア開発装置
US12190086B1 (en) Method and apparatus for ML graphs by a compiler
US20240370239A1 (en) Compilation system and method
US11068250B2 (en) Crowdsourced API resource consumption information for integrated development environments
CN105718223B (zh) 管理工作负载存储器分配的方法和设备
CN113704687A (zh) 一种张量计算运行方法、装置及运算系统
CN114817119B (zh) 面向粗粒度可重构处理器的任务划分方法和装置
McKean et al. Use of model‐based architecture attributes to construct a component‐level trade space

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20898036

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20898036

Country of ref document: EP

Kind code of ref document: A1