
CN108228351B - GPU performance balance scheduling method, storage medium and electronic terminal - Google Patents

GPU performance balance scheduling method, storage medium and electronic terminal

Info

Publication number
CN108228351B
Authority
CN
China
Prior art keywords
performance
performance degradation
degree
gpu
pressure
Prior art date
Legal status
Active
Application number
CN201711460215.1A
Other languages
Chinese (zh)
Other versions
CN108228351A (en)
Inventor
过敏意
赵文益
陈全
徐莉婷
Current Assignee
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiao Tong University
Priority to CN201711460215.1A
Publication of CN108228351A
Application granted
Publication of CN108228351B
Legal status: Active
Anticipated expiration

Classifications

    • G06F9/5038: Allocation of resources to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering the execution order of a plurality of tasks, e.g. priority or time-dependency constraints
    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5016: Allocation of resources to service a request, the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a GPU performance balance scheduling method, a storage medium, and an electronic terminal. The method includes: collecting the per-level cache runtime statistics of each shared application and the current stream processor cluster allocation scheme; extracting, with a trained runtime pressure extractor, the pressure each shared application bears on the second-level cache and on the memory bandwidth; taking the collected runtime statistics and the pressures borne by the shared application as input, predicting and outputting the shared application's conflict performance degradation degree with a trained conflict performance degradation predictor, and predicting and outputting its expansion performance degradation degree with a trained expansion performance degradation predictor; and obtaining the imbalance degree of GPU performance from the conflict performance degradation degree and expansion performance degradation degree of the shared applications, and determining, according to the imbalance degree, a new stream processor cluster allocation scheme for reallocating the stream processor clusters. The invention ensures that the degree of performance degradation is balanced among the shared applications.

Description

GPU performance balance scheduling method, storage medium and electronic terminal
Technical Field
The invention relates to the technical field of processors, in particular to GPUs (graphics processing units), and specifically to a GPU performance balance scheduling method, a storage medium, and an electronic terminal.
Background
With the large-scale deployment of compute-intensive applications such as speech recognition, machine translation, and personal assistants, mainstream private data centers and public cloud platforms have begun to rely heavily on coprocessors such as GPUs to compensate for the insufficient computing power of traditional CPUs. GPUs were originally special-purpose processors designed for graphics computation, but because they offer parallelism that conventional CPUs cannot match, more and more non-graphics applications are being migrated to GPU platforms to meet rapidly growing computational demands. However, studies have shown that non-graphics applications often lack sufficient parallelism to fully utilize the GPU's hardware resources, so those resources are wasted. Moreover, as GPU architectures and fabrication processes advance, more and more streaming multiprocessors (SMs) are integrated into a single GPU, which makes the resource-waste problem even more pronounced.
To address this, spatial parallelism, in which multiple applications run simultaneously on a shared GPU, has been proposed. Research shows that spatial parallelism can greatly improve GPU resource utilization and overall system performance. When multiple applications share a GPU, they may compete with one another for 1) streaming multiprocessors, 2) the shared second-level cache, and 3) global memory bandwidth. Consequently, each application's performance on a shared GPU is lower than when it monopolizes the entire GPU. At the same time, different applications usually differ in their performance scalability and in their sensitivity to contention on the shared second-level cache and global memory bandwidth, so the streaming-multiprocessor scheduling schemes used by traditional methods cannot guarantee performance fairness among applications sharing a GPU; that is, the applications suffer different degrees of performance degradation.
For a multi-tenant cloud platform, guaranteeing performance fairness among the shared applications is of great significance. If fairness cannot be guaranteed, then, by arguments from game theory, a platform user will tend to refuse to share a GPU with other users, which greatly reduces the opportunity to improve GPU resource utilization and overall system performance through spatial parallelism and weakens the platform's competitiveness. Therefore, on top of using spatial parallelism to improve resource utilization and overall system performance, it is important to guarantee performance fairness among all shared applications.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a GPU performance balance scheduling method, a storage medium, and an electronic terminal that, through accurate performance-degradation prediction and dynamic SM allocation scheduling, guarantee performance fairness among shared applications while using spatial parallelism to improve resource utilization and overall system performance.
To achieve the above and other related objects, the present invention provides a GPU performance balance scheduling method, including: collecting each shared application's per-level cache runtime statistics and the current stream processor cluster allocation scheme; extracting, with a trained runtime pressure extractor, the pressure borne by each shared application on the second-level cache and on the memory bandwidth; taking the collected runtime statistics of the shared application and the pressures it bears on the various shared resources as input, predicting and outputting the conflict performance degradation degree of the shared application with a trained conflict performance degradation predictor, and predicting and outputting the expansion performance degradation degree of the shared application with a trained expansion performance degradation predictor; and obtaining the imbalance degree of GPU performance from the predicted conflict performance degradation degree and expansion performance degradation degree of the shared application, and determining, according to the imbalance degree, a new stream processor cluster allocation scheme for reallocating the stream processor clusters.
In an embodiment of the invention, the training process of the runtime pressure extractor includes: designing a plurality of pressure measurement programs for the second-level cache and the memory bandwidth, respectively; designing a plurality of pressure generators for the second-level cache and the memory bandwidth, respectively; running the pressure measurement programs and the pressure generators together on a shared GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the second-level cache and the memory bandwidth; and training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, to form the runtime pressure extractor.
In an embodiment of the present invention, the training process of the conflict performance degradation predictor and the expansion performance degradation predictor includes: selecting a plurality of application programs; running the application programs and the pressure generators together on a shared GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the second-level cache and the memory bandwidth, the conflict performance degradation degree of each application, and the expansion performance degradation degree of each application; training a preset neural network with the collected runtime statistics and measured pressure values as input and the application's conflict performance degradation degree as output, to form the conflict performance degradation predictor; and training a preset neural network with the collected runtime statistics and measured pressure values as input and the application's expansion performance degradation degree as output, to form the expansion performance degradation predictor.
In an embodiment of the present invention, training the preset neural network with the collected runtime statistics as input and the measured pressure values as output to form the runtime pressure extractor specifically includes: training a preset neural network with the collected runtime statistics as input and the measured second-level cache pressure values as output, to form a second-level cache pressure extractor; and training a preset neural network with the collected runtime statistics as input and the measured memory bandwidth pressure values as output, to form a memory bandwidth pressure extractor.
In an embodiment of the present invention, the neural network used by the second-level cache pressure extractor and the memory bandwidth pressure extractor includes an input layer, two hidden layers, and an output layer; the number of neurons in each hidden layer equals the number of inputs; and the activation function of the neural network is the LeakyReLU function.
In an embodiment of the present invention, the conflict performance degradation degree is the degree to which the application's performance degrades, for a fixed number of stream processor clusters, when there is contention on the second-level cache and the memory bandwidth, relative to when there is no contention; the expansion performance degradation degree is the degree to which an application's performance when using a particular number of stream processor clusters degrades relative to its performance when it monopolizes the entire GPU, with no contention at all on the shared cache and memory bandwidth.
In an embodiment of the present invention, obtaining the imbalance degree of GPU performance specifically includes: obtaining each shared application's actual performance degradation degree from its predicted conflict performance degradation degree and expansion performance degradation degree, and obtaining the imbalance degree of GPU performance from the actual performance degradation degrees; wherein the actual performance degradation degree equals the product of the corresponding conflict performance degradation degree and expansion performance degradation degree.
In an embodiment of the present invention, determining, according to the imbalance degree, the new stream processor cluster allocation scheme for reallocating the stream processor clusters specifically includes: if the imbalance degree exceeds a set threshold, performing reallocation; during reallocation, using a preset algorithm to gradually reduce the imbalance degree by moving, each time, one stream processor cluster from the application with the smallest performance degradation degree to the application with the largest performance degradation degree; and when the distance between the new allocation scheme and the initial allocation scheme exceeds a specific threshold, taking the current new allocation scheme as the new stream processor cluster allocation scheme.
Embodiments of the present invention also provide a storage medium storing program instructions which, when executed by a processor, implement the method described above.
An embodiment of the present invention further provides an electronic terminal, which includes a GPU processor and a memory, where the memory stores program instructions, and the GPU processor executes the program instructions to implement the method described above.
As described above, the performance balancing scheduling method of the GPU, the storage medium and the electronic terminal of the present invention have the following beneficial effects:
The invention provides a performance balance scheduling mechanism for a preemptive shared multitask GPU. Without requiring additional hardware support, the mechanism further ensures that the degree of performance degradation is balanced among shared applications, while preserving the improvements in GPU resource utilization and overall system performance obtained through spatial parallelism.
Drawings
Fig. 1 is a flowchart illustrating a performance balancing scheduling method of a GPU according to the present invention.
Fig. 2 is a diagram illustrating the hardware system architecture to which the GPU performance balance scheduling method of the present invention is applied.
Fig. 3 is a software architecture diagram of the GPU performance balance scheduling method of the present invention.
Fig. 4 is a schematic flow chart of the offline training phase of the GPU performance balancing scheduling method according to the present invention.
Fig. 5 is a schematic flow chart of the on-line scheduling stage of the performance balancing scheduling method for the GPU of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
The present embodiment aims to provide a performance equalization scheduling method for a GPU, a storage medium, and an electronic terminal, which ensure performance fairness among shared applications on the basis of improving resource utilization and overall system performance by using a space parallel technique through accurate performance degradation prediction and dynamic SM allocation scheduling.
The principle and the implementation of the performance equalization scheduling method, the storage medium and the electronic terminal of the GPU of the present invention will be described in detail below, so that those skilled in the art can understand the performance equalization scheduling method, the storage medium and the electronic terminal of the GPU of the present invention without creative work.
Specifically, this embodiment aims to implement a performance balancing scheduling mechanism based on a preemptive shared multi-task GPU with low overhead, and ensure the fairness of performance among various shared applications on the basis of improving the resource utilization rate and the overall system performance.
The performance equalization scheduling method of the GPU, the storage medium, and the electronic terminal of the present embodiment are described in detail below.
As shown in fig. 1, this embodiment provides a performance equalization scheduling method for a GPU, where the performance equalization scheduling method for the GPU includes the following steps:
step S110, collecting all levels of cache run-time statistical information of each shared application and a current stream processor cluster distribution scheme;
step S120, extracting the pressure born by each shared application on the secondary cache and the memory bandwidth by the trained runtime pressure extractor;
step S130, using the collected running time statistical information of the shared application and the pressure of the shared application on various shared resources as input, predicting and outputting the conflict performance degradation degree of the shared application by a trained conflict performance degradation predictor, and predicting and outputting the expansion performance degradation degree of the shared application by the trained expansion performance degradation predictor;
step S140, according to the predicted conflict performance degradation degree and the expansion performance degradation degree of the shared application, acquiring the imbalance degree of the performance of the GPU and determining a new distribution scheme of the stream processor cluster for redistributing the stream processor cluster according to the imbalance degree.
The performance equalization scheduling method of the GPU of this embodiment is described in detail below.
The underlying hardware used by the GPU performance balance scheduling method of this embodiment is a multitask GPU that supports SM-level preemption. As shown in fig. 2, a GPU generally consists of several SMs (streaming multiprocessors, also called stream processor clusters or GPU big cores); each SM has its own private first-level cache, and all SMs share a second-level cache. When two applications App-a and App-b run on a preemption-capable multitask GPU, they compete with each other for SMs, the shared second-level cache, and global memory bandwidth.
Fig. 3 is a diagram illustrating the overall software architecture of the performance balance scheduling mechanism based on the preemptive shared multitask GPU according to the present invention. The software architecture of the runtime system is divided into four layers: a runtime information extraction layer, a pressure extraction layer, a performance prediction layer, and a distribution scheduling layer.
Step S110, collecting the cache runtime statistical information of each shared application at each level and the current stream processor cluster allocation scheme.
The per-level cache runtime statistics of each shared application and the current stream processor cluster allocation scheme are collected by the runtime information extraction layer, which extracts the per-level cache information provided by the statistics module on the GPU chip together with the current SM allocation scheme. This information reflects the characteristics of each shared application, and all subsequent processing relies on it.
Step S120, the trained runtime pressure extractor extracts the pressure borne by each shared application on the secondary cache and the memory bandwidth.
Pressure is used to reflect the severity of competition on shared resources.
In this embodiment, as shown in fig. 4, the training process of the runtime pressure extractor includes: designing a plurality of pressure measurement programs for the second-level cache and the memory bandwidth, respectively; designing a plurality of pressure generators for the second-level cache and the memory bandwidth, respectively; running the pressure measurement programs and the pressure generators together on a shared GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the second-level cache and the memory bandwidth; and training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, to form the runtime pressure extractor.
To quantify pressure, a dedicated pressure measurement program is provided for each shared resource (the second-level shared cache or the memory bandwidth). When an application shares the GPU with a pressure measurement program, the pressure the application generates on the corresponding shared resource is defined as the degree of performance degradation of that pressure measurement program relative to its performance when it monopolizes the entire GPU. Conversely, the pressure an application bears on a shared resource is defined as the pressure generated on that resource by the other co-running applications.
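As an illustration only (the patent states this definition in words and gives no formula), the pressure an application a generates on a shared resource r can be written as the relative slowdown of that resource's pressure measurement program m_r:

```latex
P_r(a) \;=\; 1 \;-\; \frac{\mathrm{Perf}_{\mathrm{shared}}(m_r \mid a)}{\mathrm{Perf}_{\mathrm{alone}}(m_r)},
\qquad r \in \{\text{L2 cache},\ \text{memory bandwidth}\}
```

The pressure borne by a on r is then the total pressure generated on r by its co-runners.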
In this embodiment, training the preset neural network with the collected runtime statistics as input and the measured pressure values as output to form the runtime pressure extractor specifically includes: training a preset neural network with the collected runtime statistics as input and the measured second-level cache pressure values as output, to form a second-level cache pressure extractor; and training a preset neural network with the collected runtime statistics as input and the measured memory bandwidth pressure values as output, to form a memory bandwidth pressure extractor.
That is, the pressure extraction layer includes second-level shared cache pressure extraction and memory bandwidth pressure extraction.
The second-level shared cache pressure extraction is responsible for extracting, at runtime, the pressure each application bears on the second-level shared cache. At runtime, the pressure on the second-level cache is extracted in real time by a neural network trained offline, whose input is the information collected by the runtime extraction layer. During training, runtime information is collected while a series of second-level cache pressure generators and the second-level shared cache pressure measurement program share the GPU; this information is used as input, and the pressure values measured by the measurement program are used as labels, to train the neural network. A pressure generator is an application program designed to reliably generate a specified amount of pressure on the corresponding shared resource.
The memory bandwidth pressure extraction is responsible for extracting, at runtime, the pressure each application bears on the memory bandwidth. At runtime, the pressure on the memory bandwidth is extracted in real time by a neural network trained offline, whose input is the information collected by the runtime extraction layer. During training, runtime information is collected while a series of memory bandwidth pressure generators and the memory bandwidth pressure measurement program share the GPU; this information is used as input, and the pressure values measured by the measurement program are used as labels, to train the neural network.
In this embodiment, the neural network used by the second-level cache pressure extractor and the memory bandwidth pressure extractor includes an input layer, two hidden layers, and an output layer; the number of neurons in each hidden layer equals the number of inputs; and the activation function of the neural network is the LeakyReLU function.
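A minimal sketch of such an extractor, assuming PyTorch; the patent publishes no code, so everything beyond the stated topology (two hidden layers as wide as the input, LeakyReLU activations) is an assumption, including the feature count and training hyperparameters:

```python
import torch
import torch.nn as nn

def build_pressure_extractor(num_features: int) -> nn.Sequential:
    # Two hidden layers whose width equals the number of inputs, LeakyReLU
    # activations, and one output neuron for the predicted pressure value.
    return nn.Sequential(
        nn.Linear(num_features, num_features), nn.LeakyReLU(),
        nn.Linear(num_features, num_features), nn.LeakyReLU(),
        nn.Linear(num_features, 1),
    )

def fit(model: nn.Module, inputs: torch.Tensor, labels: torch.Tensor,
        epochs: int = 200, lr: float = 1e-3) -> nn.Module:
    # Simple regression fit: runtime statistics in, measured pressure values out.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
    return model

# Hypothetical usage: one extractor per shared resource (feature count assumed).
# l2_extractor = fit(build_pressure_extractor(16), l2_stats, l2_pressures)
# bw_extractor = fit(build_pressure_extractor(16), bw_stats, bw_pressures)
```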
Step S130, the collected running time statistical information of the shared application and the pressure on various shared resources borne by the shared application are taken as input, the trained conflict performance degradation predictor is used for predicting and outputting the conflict performance degradation degree of the shared application, and the trained expansion performance degradation predictor is used for predicting and outputting the expansion performance degradation degree of the shared application.
In this embodiment, the conflict performance degradation degree is the degree to which the application's performance degrades, for a fixed number of stream processor clusters, when there is contention on the second-level cache and the memory bandwidth, relative to when there is no contention; the expansion performance degradation degree is the degree to which an application's performance when using a particular number of stream processor clusters degrades relative to its performance when it monopolizes the entire GPU, with no contention at all on the shared cache and memory bandwidth.
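Written as slowdown ratios (an illustrative formalization; the notation is not taken from the patent, but it is consistent with the later statement that the actual degradation is the product of the two), the two quantities for an application a allocated n stream processor clusters are:

```latex
D_{\mathrm{conflict}}(a,n) \;=\; \frac{\mathrm{Perf}_{\mathrm{no\ contention}}(a,n)}{\mathrm{Perf}_{\mathrm{contention}}(a,n)},
\qquad
D_{\mathrm{expand}}(a,n) \;=\; \frac{\mathrm{Perf}_{\mathrm{whole\ GPU}}(a)}{\mathrm{Perf}_{\mathrm{no\ contention}}(a,n)}
```

With these definitions, D_conflict(a,n) multiplied by D_expand(a,n) gives Perf_whole GPU(a) / Perf_contention(a,n), i.e. the overall slowdown of the shared application relative to monopolizing the GPU.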
As shown in fig. 4, in this embodiment, the training process for training the conflict performance degradation predictor and training the extended performance degradation predictor includes:
selecting a plurality of application programs; running the application programs and the pressure generators together on a shared GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the second-level cache and the memory bandwidth, the conflict performance degradation degree of each application, and the expansion performance degradation degree of each application; training a preset neural network with the collected runtime statistics and measured pressure values as input and the application's conflict performance degradation degree as output, to form the conflict performance degradation predictor; and training a preset neural network with the collected runtime statistics and measured pressure values as input and the application's expansion performance degradation degree as output, to form the expansion performance degradation predictor.
That is, in the embodiment, the performance prediction layer includes a conflict performance prediction and an expansion performance prediction.
1) Conflict performance prediction is responsible for predicting an application's conflict performance degradation degree. The conflict performance degradation degree is the degree to which performance degrades, for a fixed number of SMs, when there is contention on the second-level cache and the memory bandwidth, relative to when there is no contention. At runtime, the prediction is performed by a neural network trained offline, whose inputs are the information collected by the runtime extraction layer and the pressures on the second-level cache and on the memory bandwidth output by the pressure extraction layer. During training, a set of applications is first gathered and a widely representative training set is constructed. Runtime information is collected while a series of pressure generators and the applications in the training set share the GPU; this information and the corresponding pressure values borne by the training applications are used as input, and the actual conflict performance degradation degrees of the training applications are used as labels, to train the neural network.
2) Expansion performance prediction is responsible for predicting an application's expansion performance degradation degree and expansion performance change degree. The expansion performance degradation degree is the degree to which an application's performance when using a particular number of SMs degrades relative to its performance when it monopolizes the entire GPU, with no contention at all on the shared cache and memory bandwidth. The expansion performance change degree is the change in that degradation degree caused by adding or removing one SM, again with no contention on the second-level cache and memory bandwidth. At runtime, the prediction of the expansion performance degradation degree and the expansion performance change degree is performed by a neural network trained offline, whose inputs are the information collected by the runtime extraction layer and the pressures on the second-level cache and on the memory bandwidth output by the pressure extraction layer. During training, a set of applications is first gathered and a widely representative training set is constructed. For each training application, under different SM allocations, runtime information is collected while the application shares the GPU with the pressure generators; this information and the corresponding pressure values borne by the training application are used as input, and the application's actual expansion performance degradation degree and performance change degree are used as labels, to train the neural network.
The neural network used for the conflict performance prediction and the expansion performance prediction is similar to the neural network used by the pressure extraction layer.
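Since the predictors share the topology of the pressure extractors, only the feature vector changes: runtime statistics concatenated with the two borne pressures, labeled with the measured degradation. A sketch of assembling such training samples (names and feature layout are assumptions, not taken from the patent):

```python
import numpy as np
import torch

def build_predictor_samples(runtime_stats: np.ndarray,
                            l2_pressure: np.ndarray,
                            bw_pressure: np.ndarray,
                            degradation: np.ndarray):
    # Inputs: per-sample runtime statistics plus the L2-cache and memory-bandwidth
    # pressures borne by the training application; label: its measured conflict
    # (or expansion) performance degradation degree.
    x = np.concatenate([runtime_stats,
                        l2_pressure[:, None],
                        bw_pressure[:, None]], axis=1).astype(np.float32)
    y = degradation.astype(np.float32)[:, None]
    return torch.from_numpy(x), torch.from_numpy(y)

# Hypothetical usage with the fit() helper sketched above:
# x, y = build_predictor_samples(stats, l2_p, bw_p, conflict_degradation)
# conflict_predictor = fit(build_pressure_extractor(x.shape[1]), x, y)
```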
Step S140, according to the predicted conflict performance degradation degree and the expansion performance degradation degree of the shared application, acquiring the imbalance degree of the performance of the GPU and determining a new distribution scheme of the stream processor cluster for redistributing the stream processor cluster according to the imbalance degree.
In this embodiment, obtaining the imbalance degree of GPU performance specifically includes: obtaining each shared application's actual performance degradation degree from its predicted conflict performance degradation degree and expansion performance degradation degree, and obtaining the imbalance degree of GPU performance from the actual performance degradation degrees; wherein the actual performance degradation degree equals the product of the corresponding conflict performance degradation degree and expansion performance degradation degree.
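In code form, these two quantities look as follows (a trivial sketch; the function and variable names are illustrative only):

```python
def real_degradation(conflict_deg: float, expand_deg: float) -> float:
    # Actual degradation of one application: product of its predicted conflict
    # and expansion performance degradation degrees.
    return conflict_deg * expand_deg

def imbalance(degradations: dict) -> float:
    # Imbalance of the system: gap between the most- and least-degraded
    # shared applications.
    values = list(degradations.values())
    return max(values) - min(values)
```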
Specifically, in this embodiment, determining, according to the imbalance degree, the new stream processor cluster allocation scheme for reallocating the stream processor clusters specifically includes: if the imbalance degree exceeds a set threshold, performing reallocation; during reallocation, using a preset algorithm to gradually reduce the imbalance degree by moving, each time, one stream processor cluster from the application with the smallest performance degradation degree to the application with the largest performance degradation degree; and when the distance between the new allocation scheme and the initial allocation scheme exceeds a specific threshold, taking the current new allocation scheme as the new stream processor cluster allocation scheme.
The distribution scheduling layer is responsible for predicting each application's true performance degradation from the outputs of the conflict performance prediction and the expansion performance prediction. The true performance degradation degree is the degree to which performance degrades, using the currently allocated number of SMs and with contention on the second-level cache and the memory bandwidth, relative to performance when the application monopolizes the entire GPU. By definition, the true performance degradation degree equals the product of the corresponding conflict performance degradation degree and expansion performance degradation degree. On the basis of the predicted true performance degradation, a heuristic greedy algorithm gradually adjusts the number of SMs allocated to each application to reduce the imbalance of performance degradation among applications, where the imbalance is defined as the difference between the maximum and minimum performance degradation degrees of the shared applications. The specific scheduling algorithm is shown in Table 1 below.
TABLE 1
(The pseudocode of the scheduling algorithm in Table 1 is provided as an image in the original publication.)
The algorithm is invoked periodically. It first checks whether the imbalance under the current allocation is already below a specified threshold; if so, the algorithm terminates immediately. If the current imbalance exceeds the threshold, the algorithm greedily reduces it by repeatedly moving one SM from the application with the smallest current performance degradation to the application with the largest. The distance between two allocation schemes is defined as the maximum change, over all applications, in the number of SMs assigned. The algorithm terminates as soon as the new allocation scheme is farther than a specific threshold from the original allocation scheme, because the accuracy of the predictions decreases as the allocation moves away from the scheme under which the measurements were taken.
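Because the pseudocode of Table 1 is only available as an image, the following Python sketch reconstructs the described loop from the text; the threshold values, the guard against taking an application's last SM, and the predict_degradation callback are assumptions:

```python
def rebalance_sm_allocation(alloc: dict, predict_degradation,
                            imbalance_threshold: float = 0.10,
                            distance_threshold: int = 2) -> dict:
    # alloc: app -> number of SM clusters currently assigned.
    # predict_degradation: callable(alloc) -> {app: predicted real degradation}
    # (conflict degradation x expansion degradation) under that allocation.
    initial = dict(alloc)
    new_alloc = dict(alloc)
    while True:
        degradation = predict_degradation(new_alloc)
        if max(degradation.values()) - min(degradation.values()) < imbalance_threshold:
            break  # degradation is already balanced enough
        donor = min(degradation, key=degradation.get)     # least-degraded app gives up one SM
        receiver = max(degradation, key=degradation.get)  # most-degraded app receives it
        if new_alloc[donor] <= 1:
            break  # safety guard (assumption): never take an application's last SM
        new_alloc[donor] -= 1
        new_alloc[receiver] += 1
        # Distance between schemes: largest per-application change in SM count.
        distance = max(abs(new_alloc[a] - initial[a]) for a in initial)
        if distance > distance_threshold:
            break  # predictions degrade far from the allocation that was measured
    return new_alloc
```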
In summary, when the present invention is used, offline training is performed first. As shown in fig. 4, the training process is as follows:
1) designing a corresponding pressure measurement program according to the architecture of the target GPU: pressure measurement programs for the second-level shared cache and the global memory bandwidth are respectively designed to quantify the competition severity on various shared resources.
2) Designing corresponding pressure generators according to the architecture of the target GPU: pressure generators for the second-level cache and the global memory bandwidth are designed separately, so that a specific amount of pressure can be generated on a specific shared resource.
3) The pressure measurement program runs in common with the pressure generator: the various pressure measurement programs are run with the various pressure generators sharing a GPU, and corresponding run-time information and generated pressure values are collected.
4) Training the runtime pressure extractor: and taking the operation information collected in the previous stage as input, and taking the measured pressure value as output to train a neural network for online pressure extraction.
5) Collecting a well-represented set of applications: a group of application programs is collected that broadly covers the mainstream cases, and representative applications are gathered according to the target application scenarios to form a training set.
6) Running the applications together with the pressure generators: each application program and the pressure generators are run on a shared GPU, and the runtime statistics, the application's conflict performance degradation, the application's expansion performance degradation, and the pressure generated by the pressure generators are collected.
7) Training the conflict performance degradation predictor: the runtime information collected in the previous stage and the pressure generated by the pressure generators are used as input, and the application's conflict performance degradation is used as a label, to train a neural network for online prediction of conflict performance degradation.
8) Training the expansion performance degradation predictor: the runtime information collected in the previous stage and the pressure generated by the pressure generators are used as input, and the application's expansion performance degradation is used as a label, to train a neural network for online prediction of expansion performance degradation.
After the offline training is completed, the online scheduling can be performed, and the online scheduling process is as shown in fig. 5:
Online scheduling process:
1) Collecting runtime information: the per-level cache runtime statistics of each shared application and the current SM allocation scheme are collected.
2) Pressure extraction: the information collected in the previous step is used as input, and the pressure each shared application bears on the second-level cache and on the memory bandwidth is extracted.
3) Conflict performance degradation prediction: the collected runtime information of each shared application and the pressure it bears on the various shared resources are used as input, and the conflict performance degradation predictor outputs the application's conflict performance degradation degree.
4) Expansion performance degradation prediction: the collected runtime information of each shared application and the pressure it bears on the various shared resources are used as input, and the expansion performance degradation predictor outputs the application's expansion performance degradation degree.
5) SM reallocation: each shared application's actual performance degradation degree is computed from the predicted conflict performance degradation and expansion performance degradation, and the system's performance imbalance degree is then computed from the actual degradation degrees. If the imbalance exceeds a specific threshold, reallocation is performed: a greedy method gradually reduces the current imbalance, yielding an allocation scheme better than the current one, which is then applied.
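Putting the five steps together, the online stage can be sketched as a periodic loop; every callable below is a placeholder for the corresponding layer of the runtime system, and the names, signatures, and default values are illustrative only, not taken from the patent:

```python
import time

def online_scheduling_loop(apps, collect_stats, extract_pressures,
                           predict_conflict, predict_expand,
                           read_alloc, apply_alloc, rebalance,
                           imbalance_threshold: float = 0.10,
                           period_s: float = 1.0) -> None:
    alloc = read_alloc()                                 # current SM allocation scheme
    while True:
        stats = collect_stats(apps)                      # 1) per-level cache runtime statistics
        l2_p, bw_p = extract_pressures(stats, alloc)     # 2) pressure on L2 cache / memory bandwidth
        conflict = predict_conflict(stats, l2_p, bw_p)   # 3) conflict performance degradation
        expand = predict_expand(stats, l2_p, bw_p)       # 4) expansion performance degradation
        real = {a: conflict[a] * expand[a] for a in apps}
        if max(real.values()) - min(real.values()) > imbalance_threshold:
            alloc = rebalance(alloc, real)               # 5) greedy SM reallocation
            apply_alloc(alloc)
        time.sleep(period_s)
```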
The embodiment of the present invention further provides a storage medium storing program instructions which, when executed by a processor, implement the GPU performance balance scheduling method described above. The performance balance scheduling method of the GPU has already been described in detail above and is not repeated here.
The embodiment of the present invention further provides an electronic terminal, for example, a server, where the electronic terminal includes a GPU processor and a memory, where the memory stores program instructions, and the GPU processor executes the program instructions to implement the performance balancing scheduling method for the GPU. The performance equalization scheduling method of the GPU has already been described in detail above, and is not described herein again.
In summary, the present invention provides a performance balance scheduling mechanism for a preemptive shared multitask GPU. Without requiring additional hardware support, the mechanism further ensures that the degree of performance degradation is balanced among shared applications, while preserving the improvements in GPU resource utilization and overall system performance obtained through spatial parallelism. The invention therefore effectively overcomes various defects of the prior art and has high industrial value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (9)

1. A GPU performance balance scheduling method, characterized in that the method comprises: collecting the per-level cache runtime statistics of each shared application and the current stream processor cluster allocation scheme; extracting, with a trained runtime pressure extractor, the pressure each shared application bears on the second-level cache and on the memory bandwidth; taking the collected runtime statistics of the shared application and the pressures it bears on the various shared resources as input, predicting and outputting the conflict performance degradation degree of the shared application with a trained conflict performance degradation predictor, and predicting and outputting the expansion performance degradation degree of the shared application with a trained expansion performance degradation predictor; and obtaining the imbalance degree of GPU performance from the predicted conflict performance degradation degree and expansion performance degradation degree of the shared application, and determining, according to the imbalance degree, a new stream processor cluster allocation scheme for reallocating the stream processor clusters; wherein the conflict performance degradation degree is the degree to which the application's performance degrades, for a fixed number of stream processor clusters, when there is contention on the second-level cache and the memory bandwidth, relative to when there is no contention; and the expansion performance degradation degree is the degree to which an application's performance when using a particular number of stream processor clusters degrades relative to its performance when it monopolizes the entire GPU, with no contention at all on the shared cache and memory bandwidth.

2. The GPU performance balance scheduling method according to claim 1, characterized in that the training process of the runtime pressure extractor comprises: designing a plurality of pressure measurement programs for the second-level cache and the memory bandwidth, respectively; designing a plurality of pressure generators for the second-level cache and the memory bandwidth, respectively; running the plurality of pressure measurement programs and the plurality of pressure generators together on a shared GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the second-level cache and the memory bandwidth; and training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, to form the runtime pressure extractor.

3. The GPU performance balance scheduling method according to claim 2, characterized in that the training process of the conflict performance degradation predictor and the expansion performance degradation predictor comprises: selecting a plurality of application programs; running the plurality of application programs and the plurality of pressure generators together on a shared GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the second-level cache and the memory bandwidth, the conflict performance degradation degree of each application program, and the expansion performance degradation degree of each application program; training a preset neural network with the collected runtime statistics and measured pressure values as input and the application program's conflict performance degradation degree as output, to form the conflict performance degradation predictor; and training a preset neural network with the collected runtime statistics and measured pressure values as input and the application program's expansion performance degradation degree as output, to form the expansion performance degradation predictor.

4. The GPU performance balance scheduling method according to claim 2, characterized in that training a preset neural network with the collected runtime statistics as input and the measured pressure values as output to form the runtime pressure extractor specifically comprises: training a preset neural network with the collected runtime statistics as input and the measured second-level cache pressure values as output, to form a second-level cache pressure extractor; and training a preset neural network with the collected runtime statistics as input and the measured memory bandwidth pressure values as output, to form a memory bandwidth pressure extractor.

5. The GPU performance balance scheduling method according to claim 4, characterized in that the neural network used by the second-level cache pressure extractor and the memory bandwidth pressure extractor comprises one input layer, two hidden layers and one output layer; the number of neurons in each hidden layer equals the number of inputs; and the activation function of the neural network is the LeakyReLU function.

6. The GPU performance balance scheduling method according to claim 1, characterized in that obtaining the imbalance degree of GPU performance specifically comprises: obtaining the actual performance degradation degree of each shared application from its predicted conflict performance degradation degree and expansion performance degradation degree, and obtaining the imbalance degree of GPU performance from the actual performance degradation degrees; wherein the actual performance degradation degree equals the product of the corresponding conflict performance degradation degree and expansion performance degradation degree.

7. The GPU performance balance scheduling method according to claim 1 or 6, characterized in that determining, according to the imbalance degree, the new stream processor cluster allocation scheme for reallocating the stream processor clusters specifically comprises: if the imbalance degree exceeds a set threshold, performing reallocation; during reallocation, using a preset algorithm to gradually reduce the imbalance degree by reallocating, each time, one stream processor cluster between the application with the smallest current performance degradation degree and the application with the largest performance degradation degree; and when the distance between the new allocation scheme and the initial allocation scheme exceeds a specific threshold, determining the current new allocation scheme as the new stream processor cluster allocation scheme.

8. A storage medium storing program instructions, characterized in that the program instructions, when executed by a processor, implement the method according to any one of claims 1 to 7.

9. An electronic terminal, comprising a GPU processor and a memory, characterized in that the memory stores program instructions, and the GPU processor executes the program instructions to implement the method according to any one of claims 1 to 7.
CN201711460215.1A 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal Active CN108228351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711460215.1A CN108228351B (en) 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711460215.1A CN108228351B (en) 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal

Publications (2)

Publication Number Publication Date
CN108228351A CN108228351A (en) 2018-06-29
CN108228351B true CN108228351B (en) 2021-07-27

Family

ID=62646577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711460215.1A Active CN108228351B (en) 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal

Country Status (1)

Country Link
CN (1) CN108228351B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11900157B2 (en) 2018-09-19 2024-02-13 Intel Corporation Hybrid virtual GPU co-scheduling
CN110929627B (en) * 2019-11-18 2021-12-28 北京大学 Image recognition method of efficient GPU training model based on wide-model sparse data set
CN117762654B (en) * 2023-12-22 2024-09-10 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment and storage medium for collecting GPU information by application program


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721315B2 (en) * 2007-07-13 2017-08-01 Cerner Innovation, Inc. Claim processing validation system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461928A (en) * 2013-09-16 2015-03-25 华为技术有限公司 Method and device for dividing caches
CN105487927A (en) * 2014-09-15 2016-04-13 华为技术有限公司 Resource management method and device
CN106383792A (en) * 2016-09-20 2017-02-08 北京工业大学 Missing perception-based heterogeneous multi-core cache replacement method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CGPredict: Embedded GPU performance estimation from single-threaded application; Siqi Wang; ACM; 2017-09-30; full text *
Performance prediction of parallel applications based on small-scale executions; Rodrigo Escobar; IEEE; 2017-02-02; full text *

Also Published As

Publication number Publication date
CN108228351A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN110399222B (en) GPU cluster deep learning task parallelization method and device and electronic equipment
US11367160B2 (en) Simultaneous compute and graphics scheduling
KR101839544B1 (en) Automatic load balancing for heterogeneous cores
US8707314B2 (en) Scheduling compute kernel workgroups to heterogeneous processors based on historical processor execution times and utilizations
Basireddy et al. AdaMD: Adaptive mapping and DVFS for energy-efficient heterogeneous multicores
US20200142466A1 (en) Optimal operating point estimator for hardware operating under a shared power/thermal constraint
Berezovskyi et al. WCET measurement-based and extreme value theory characterisation of CUDA kernels
US20230418997A1 (en) Comprehensive contention-based thread allocation and placement
CN104850461B (en) A kind of virtual cpu method for optimizing scheduling towards NUMA architecture
US11165848B1 (en) Evaluating qualitative streaming experience using session performance metadata
CN108228351B (en) GPU performance balance scheduling method, storage medium and electronic terminal
US10778605B1 (en) System and methods for sharing memory subsystem resources among datacenter applications
JP7554795B2 (en) Data-driven scheduler on multiple computing cores
CN111966453A (en) Load balancing method, system, equipment and storage medium
US9442696B1 (en) Interactive partitioning and mapping of an application across multiple heterogeneous computational devices from a co-simulation design environment
CN112068957A (en) Resource allocation method, device, computer equipment and storage medium
US20220193558A1 (en) Measuring and detecting idle processing periods and identifying root causes thereof in cloud-based, streaming applications
CN107657599A (en) Remote sensing image fusion system in parallel implementation method based on combination grain division and dynamic load balance
CN104123119B (en) Dynamic vision measurement feature point center quick positioning method based on GPU
US11966765B2 (en) Memory bandwidth throttling for virtual machines
CN108897625A (en) Method of Scheduling Parallel based on DAG model
Kim et al. Interference-aware execution framework with Co-scheML on GPU clusters
CN114237913B (en) Interference-aware GPU heterogeneous cluster scheduling method, system and medium
Shukla et al. Investigating policies for performance of multi-core processors
CN109522106B (en) A Dynamic Task Scheduling Method for Value at Risk Simulation Based on Collaborative Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant