Disclosure of Invention
In view of the foregoing drawbacks of the prior art, an object of the present invention is to provide a performance balancing scheduling method for a GPU, a storage medium, and an electronic terminal, which, through accurate performance degradation prediction and dynamic SM allocation scheduling, ensure performance fairness among shared applications while improving resource utilization and overall system performance by means of spatial multitasking.
In order to achieve the above and other related objects, the present invention provides a performance equalization scheduling method for a GPU, including: collecting the run-time statistical information of each level of cache for each shared application and the current stream processor cluster allocation scheme; extracting, by a trained runtime pressure extractor, the pressure borne by each shared application on the second-level cache and the memory bandwidth; taking the collected run-time statistical information of the shared applications and the pressure they bear on the various shared resources as input, predicting and outputting the conflict performance degradation degree of each shared application by a trained conflict performance degradation predictor, and predicting and outputting the expansion performance degradation degree of each shared application by a trained expansion performance degradation predictor; and obtaining the performance imbalance degree of the GPU from the predicted conflict performance degradation degree and expansion performance degradation degree of the shared applications, and determining, according to the imbalance degree, a new stream processor cluster allocation scheme by which the stream processor clusters are reallocated.
In an embodiment of the invention, the training process for the runtime pressure extractor includes: designing a plurality of pressure measurement programs for the second-level cache and the memory bandwidth, respectively; designing a plurality of pressure generators for the second-level cache and the memory bandwidth, respectively; running the pressure measurement programs and the pressure generators together on a shared GPU, collecting the corresponding run-time statistical information, and measuring the pressure values generated on the second-level cache and the memory bandwidth; and training a preset neural network with the collected run-time statistical information as input and the measured pressure values as output, thereby forming the runtime pressure extractor.
In an embodiment of the present invention, the training process for the conflict performance degradation predictor and the expansion performance degradation predictor includes: selecting a plurality of application programs; running the application programs together with the pressure generators on a shared GPU, collecting the corresponding run-time statistical information, and measuring the pressure values generated on the second-level cache and the memory bandwidth as well as the conflict performance degradation degree and the expansion performance degradation degree of each application program; training a preset neural network with the collected run-time statistical information and measured pressure values as input and the conflict performance degradation degree of the application programs as output, thereby forming the conflict performance degradation predictor; and training a preset neural network with the collected run-time statistical information and measured pressure values as input and the expansion performance degradation degree of the application programs as output, thereby forming the expansion performance degradation predictor.
In an embodiment of the present invention, training the preset neural network with the collected run-time statistical information as input and the measured pressure values as output to form the runtime pressure extractor specifically includes: training a preset neural network with the collected run-time statistical information as input and the measured second-level cache pressure values as output, thereby forming a second-level cache pressure extractor; and training a preset neural network with the collected run-time statistical information as input and the measured memory bandwidth pressure values as output, thereby forming a memory bandwidth pressure extractor.
In an embodiment of the present invention, the neural network used by the second-level cache pressure extractor and the memory bandwidth pressure extractor includes an input layer, two hidden layers, and an output layer, where the number of neurons in each hidden layer equals the number of inputs; the activation function of the neural network is the LeakyReLU function.
In an embodiment of the present invention, the conflict performance degradation degree is the degree to which the performance of an application program degrades, with the number of stream processor clusters fixed, when there is contention on the second-level cache and the memory bandwidth, relative to its performance without such contention; the expansion performance degradation degree is the degree to which the performance of an application program using a particular number of stream processor clusters degrades relative to its performance when it monopolizes the entire GPU, in the absence of any contention on the shared cache and memory bandwidth.
In an embodiment of the present invention, obtaining the performance imbalance degree of the GPU specifically includes: obtaining the real performance degradation degree of each shared application from the predicted conflict performance degradation degree and expansion performance degradation degree, and obtaining the performance imbalance degree of the GPU from the real performance degradation degrees; where the real performance degradation degree equals the product of the corresponding conflict performance degradation degree and expansion performance degradation degree.
In an embodiment of the present invention, determining, according to the imbalance degree, the new stream processor cluster allocation scheme by which the stream processor clusters are reallocated specifically includes: if the imbalance degree exceeds a set threshold, performing reallocation; gradually reducing the imbalance degree with a preset algorithm that, in each step, reallocates 1 stream processor cluster from the application with the smallest performance degradation degree to the application with the largest performance degradation degree; and adopting the current scheme as the new stream processor cluster allocation scheme once the distance between the new allocation scheme and the initial allocation scheme exceeds a specific threshold.
Embodiments of the present invention also provide a storage medium storing program instructions which, when executed by a GPU processor, implement the method described above.
An embodiment of the present invention further provides an electronic terminal, which includes a GPU processor and a memory, where the memory stores program instructions, and the GPU processor executes the program instructions to implement the method described above.
As described above, the performance balancing scheduling method of the GPU, the storage medium and the electronic terminal of the present invention have the following beneficial effects:
the invention provides a performance balancing scheduling mechanism for a preemptive shared multitasking GPU. Without requiring additional hardware support, the mechanism further guarantees balanced performance degradation among shared applications on top of the improved GPU resource utilization and overall system performance brought by spatial multitasking.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
The present embodiment aims to provide a performance equalization scheduling method for a GPU, a storage medium, and an electronic terminal, which, through accurate performance degradation prediction and dynamic SM allocation scheduling, ensure performance fairness among shared applications while improving resource utilization and overall system performance by means of spatial multitasking.
The principle and the implementation of the performance equalization scheduling method, the storage medium and the electronic terminal of the GPU of the present invention will be described in detail below, so that those skilled in the art can understand the performance equalization scheduling method, the storage medium and the electronic terminal of the GPU of the present invention without creative work.
Specifically, this embodiment aims to implement a performance balancing scheduling mechanism based on a preemptive shared multi-task GPU with low overhead, and ensure the fairness of performance among various shared applications on the basis of improving the resource utilization rate and the overall system performance.
The performance equalization scheduling method of the GPU, the storage medium, and the electronic terminal of the present embodiment are described in detail below.
As shown in fig. 1, this embodiment provides a performance equalization scheduling method for a GPU, where the performance equalization scheduling method for the GPU includes the following steps:
step S110, collecting all levels of cache run-time statistical information of each shared application and a current stream processor cluster distribution scheme;
step S120, extracting, by the trained runtime pressure extractor, the pressure borne by each shared application on the second-level cache and the memory bandwidth;
step S130, taking the collected run-time statistical information of the shared applications and the pressure they bear on various shared resources as input, predicting and outputting the conflict performance degradation degree of each shared application by the trained conflict performance degradation predictor, and predicting and outputting the expansion performance degradation degree of each shared application by the trained expansion performance degradation predictor;
step S140, obtaining the performance imbalance degree of the GPU from the predicted conflict performance degradation degree and expansion performance degradation degree of the shared applications, and determining, according to the imbalance degree, a new stream processor cluster allocation scheme by which the stream processor clusters are reallocated.
The performance equalization scheduling method of the GPU of this embodiment is described in detail below.
The underlying hardware targeted by the performance balancing scheduling method of this embodiment is a multitasking GPU supporting SM-level preemption. As shown in fig. 2, a GPU generally consists of several SMs (streaming multiprocessors, also called stream processor clusters or GPU big cores); each SM has its own private first-level cache, and all SMs share a second-level cache. When two applications App-a and App-b run on a multitasking GPU supporting preemption, they compete with each other for SMs, the second-level shared cache, and the global memory bandwidth.
Fig. 3 illustrates the overall software architecture of the performance balancing scheduling mechanism for the preemptive shared multitasking GPU according to the present invention. The software architecture of the runtime system is divided into four layers: a runtime information extraction layer, a pressure extraction layer, a performance prediction layer, and an allocation scheduling layer.
Step S110, collecting the cache runtime statistical information of each shared application at each level and the current stream processor cluster allocation scheme.
The run-time statistical information of each level of cache for each shared application and the current stream processor cluster allocation scheme are collected by the runtime information extraction layer. This layer extracts the per-level cache information provided by the statistics module on the GPU chip together with the current SM allocation scheme. This information reflects the characteristics of each shared application and is the basis on which all subsequent processing relies.
Step S120, the trained runtime pressure extractor extracts the pressure borne by each shared application on the secondary cache and the memory bandwidth.
Pressure is used to reflect the severity of competition on shared resources.
In this embodiment, as shown in fig. 4, the training process for the runtime pressure extractor includes: designing a plurality of pressure measurement programs for the second-level cache and the memory bandwidth, respectively; designing a plurality of pressure generators for the second-level cache and the memory bandwidth, respectively; running the pressure measurement programs and the pressure generators together on a shared GPU, collecting the corresponding run-time statistical information, and measuring the pressure values generated on the second-level cache and the memory bandwidth; and training a preset neural network with the collected run-time statistical information as input and the measured pressure values as output, thereby forming the runtime pressure extractor.
To quantify pressure, a dedicated pressure measurement program exists for each shared resource (the second-level shared cache or the memory bandwidth). When an application shares the GPU with such a measurement program, the pressure the application generates on the corresponding shared resource is defined as the degree to which the measurement program's performance degrades relative to its performance when it monopolizes the entire GPU. Conversely, the pressure an application bears on a shared resource is defined as the pressure generated on that resource by the other applications.
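As a concrete illustration of this definition, the pressure value could be computed as the normalized slowdown of the measurement program when co-running. The sketch below is illustrative only; the function name and the normalization by stand-alone performance are assumptions, not fixed by the text.

```python
def pressure_on_resource(perf_alone, perf_shared):
    """Pressure an application generates on a shared resource: the relative
    performance degradation of the corresponding pressure measurement
    program when co-running, compared with monopolizing the entire GPU.
    (Illustrative sketch; normalization choice is an assumption.)"""
    return (perf_alone - perf_shared) / perf_alone

# Example: the L2-cache measurement program's throughput drops from 100
# to 70 units when co-running with an application, giving a pressure of 0.3.
p = pressure_on_resource(100.0, 70.0)
```

Under this convention a pressure of 0 means the co-runner is harmless, and values approaching 1 mean the measurement program is almost starved on that resource.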
In this embodiment, training the preset neural network with the collected run-time statistical information as input and the measured pressure values as output to form the runtime pressure extractor specifically includes: training a preset neural network with the collected run-time statistical information as input and the measured second-level cache pressure values as output, thereby forming a second-level cache pressure extractor; and training a preset neural network with the collected run-time statistical information as input and the measured memory bandwidth pressure values as output, thereby forming a memory bandwidth pressure extractor.
That is, the pressure extraction layer comprises second-level shared cache pressure extraction and memory bandwidth pressure extraction.
Second-level shared cache pressure extraction is responsible for extracting, at run time, the pressure each application program bears on the second-level shared cache. At run time, this pressure is extracted in real time by an offline-trained neural network whose input is the information collected by the runtime information extraction layer. In the training stage, the runtime information gathered while a series of second-level cache pressure generators share the GPU with the second-level shared cache pressure measurement programs is used as input, and the pressure values measured by the measurement programs are used as labels. A pressure generator is an application program that can stably generate a specified amount of pressure on the corresponding shared resource.
Memory bandwidth pressure extraction is responsible for extracting, at run time, the pressure each application program bears on the memory bandwidth. At run time, this pressure is extracted in real time by an offline-trained neural network whose input is the information collected by the runtime information extraction layer. In the training stage, the runtime information gathered while a series of memory bandwidth pressure generators share the GPU with the memory bandwidth pressure measurement programs is used as input, and the pressure values measured by the measurement programs are used as labels.
In this embodiment, the neural network used by the second-level cache pressure extractor and the memory bandwidth pressure extractor includes an input layer, two hidden layers, and an output layer, where the number of neurons in each hidden layer equals the number of inputs; the activation function of the neural network is the LeakyReLU function.
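A minimal NumPy sketch of the described network shape (an input layer, two hidden layers whose width equals the number of inputs, LeakyReLU activation, scalar pressure output) might look as follows. The class name, weight initialization, and all numeric values are assumptions for illustration; training is omitted.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # LeakyReLU activation, as specified for the pressure-extractor networks.
    return np.where(x > 0, x, alpha * x)

class PressureExtractorMLP:
    """Sketch of the stated topology: two hidden layers of width n_inputs
    and a single output (the predicted pressure value)."""
    def __init__(self, n_inputs, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (n_inputs, n_inputs))
        self.w2 = rng.normal(0, 0.1, (n_inputs, n_inputs))
        self.w3 = rng.normal(0, 0.1, (n_inputs, 1))

    def forward(self, x):
        h1 = leaky_relu(x @ self.w1)   # first hidden layer, width = n_inputs
        h2 = leaky_relu(h1 @ self.w2)  # second hidden layer, width = n_inputs
        return h2 @ self.w3            # one pressure estimate per sample

net = PressureExtractorMLP(n_inputs=8)
stats = np.zeros((4, 8))  # 4 samples of 8 runtime counters (placeholder data)
out = net.forward(stats)
```

The same topology would serve both the second-level cache and the memory bandwidth extractors, differing only in the label used during training.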
Step S130, taking the collected run-time statistical information of the shared applications and the pressure they bear on various shared resources as input, the trained conflict performance degradation predictor predicts and outputs the conflict performance degradation degree of each shared application, and the trained expansion performance degradation predictor predicts and outputs the expansion performance degradation degree of each shared application.
In this embodiment, the conflict performance degradation degree is the degree to which the performance of an application program degrades, with the number of stream processor clusters fixed, when there is contention on the second-level cache and the memory bandwidth, relative to its performance without such contention; the expansion performance degradation degree is the degree to which the performance of an application program using a particular number of stream processor clusters degrades relative to its performance when it monopolizes the entire GPU, in the absence of any contention on the shared cache and memory bandwidth.
As shown in fig. 4, in this embodiment, the training process for training the conflict performance degradation predictor and training the extended performance degradation predictor includes:
selecting a plurality of application programs; running the application programs together with the pressure generators on a shared GPU, collecting the corresponding run-time statistical information, and measuring the pressure values generated on the second-level cache and the memory bandwidth as well as the conflict performance degradation degree and the expansion performance degradation degree of each application program; training a preset neural network with the collected run-time statistical information and measured pressure values as input and the conflict performance degradation degree of the application programs as output, thereby forming the conflict performance degradation predictor; and training a preset neural network with the collected run-time statistical information and measured pressure values as input and the expansion performance degradation degree of the application programs as output, thereby forming the expansion performance degradation predictor.
That is, in the embodiment, the performance prediction layer includes a conflict performance prediction and an expansion performance prediction.
1) Conflict performance prediction is responsible for predicting the conflict performance degradation degree of an application. The conflict performance degradation degree is the degree to which performance degrades, with the number of SMs fixed, when there is contention on the second-level cache and the memory bandwidth, relative to performance without contention. At run time, the conflict performance degradation is predicted by an offline-trained neural network. The inputs to the neural network are the information gathered by the runtime information extraction layer and the pressures on the second-level cache and the memory bandwidth output by the pressure extraction layer. In the training stage, a number of applications are first gathered and a broadly representative training set is constructed. The runtime information collected while the applications in the training set share the GPU with a series of pressure generators, together with the corresponding pressure values borne by the training applications, is used as input, and the actual conflict performance degradation degree of the training applications is used as the label, to complete the training of the neural network.
2) Expansion performance prediction is responsible for predicting the expansion performance degradation degree and the expansion performance change degree of an application. The expansion performance degradation degree is the degree to which the performance of an application using a particular number of SMs degrades relative to its performance when it monopolizes the entire GPU, in the absence of any contention on the shared cache and memory bandwidth. The expansion performance change degree is the change in that degradation degree caused by adding or removing one SM, again without contention on the second-level cache and memory bandwidth. At run time, the expansion performance degradation degree and change degree are predicted by an offline-trained neural network. The inputs to the neural network are the information gathered by the runtime information extraction layer and the pressures on the second-level cache and the memory bandwidth output by the pressure extraction layer. In the training stage, a number of applications are first gathered and a broadly representative training set is constructed. The runtime information collected while each training application, under different SM allocations, shares the GPU with the pressure generators, together with the corresponding pressure values borne by the training application, is used as input, and the actual expansion performance degradation degree and change degree of the training application are used as labels, to complete the training of the neural network.
The neural networks used for conflict performance prediction and expansion performance prediction are similar in structure to those used by the pressure extraction layer.
Step S140, according to the predicted conflict performance degradation degree and the expansion performance degradation degree of the shared application, acquiring the imbalance degree of the performance of the GPU and determining a new distribution scheme of the stream processor cluster for redistributing the stream processor cluster according to the imbalance degree.
In this embodiment, obtaining the performance imbalance degree of the GPU specifically includes: obtaining the real performance degradation degree of each shared application from the predicted conflict performance degradation degree and expansion performance degradation degree, and obtaining the performance imbalance degree of the GPU from the real performance degradation degrees; where the real performance degradation degree equals the product of the corresponding conflict performance degradation degree and expansion performance degradation degree.
Specifically, in this embodiment, determining, according to the imbalance degree, the new stream processor cluster allocation scheme by which the stream processor clusters are reallocated includes: if the imbalance degree exceeds a set threshold, performing reallocation; gradually reducing the imbalance degree with a preset algorithm that, in each step, reallocates 1 stream processor cluster from the application with the smallest performance degradation degree to the application with the largest performance degradation degree; and adopting the current scheme as the new stream processor cluster allocation scheme once the distance between the new allocation scheme and the initial allocation scheme exceeds a specific threshold.
The allocation scheduling layer is responsible for predicting the real performance degradation of each application from the outputs of the conflict performance prediction and the expansion performance prediction. The real performance degradation degree is the degree to which performance degrades, using the currently allocated number of SMs and with contention on the second-level cache and the memory bandwidth, relative to performance when the application monopolizes the entire GPU. By definition, the real performance degradation degree equals the product of the corresponding conflict performance degradation degree and expansion performance degradation degree. Based on the predicted real performance degradation, a heuristic greedy algorithm gradually adjusts the number of SMs allocated to each application so as to reduce the imbalance of performance degradation among applications, where the imbalance is defined as the difference between the maximum and minimum real performance degradation degrees of the shared applications. The specific scheduling algorithm is shown in table 1 below.
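The two quantities this layer computes can be written down directly. The sketch below assumes degradation is represented as a slowdown factor (a value of 1 means no degradation); that representation is an illustrative choice, not mandated by the text.

```python
def real_degradation(conflict_deg, expansion_deg):
    # Real performance degradation = conflict degradation x expansion
    # degradation, per the definition above. Degradations are modeled
    # here as slowdown factors >= 1 (an assumed representation).
    return conflict_deg * expansion_deg

def imbalance(degradations):
    # Imbalance = difference between the largest and smallest real
    # performance degradation among the shared applications.
    return max(degradations) - min(degradations)

# Two co-running applications with predicted (conflict, expansion) pairs.
apps = [real_degradation(1.2, 1.5), real_degradation(1.1, 1.3)]
imb = imbalance(apps)
```

Whether this imbalance triggers reallocation depends on the set threshold described below.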
TABLE 1
The algorithm is invoked periodically. It first checks whether the imbalance under the current allocation is already below a specified threshold; if so, the algorithm terminates directly. If the current imbalance exceeds the threshold, the algorithm gradually reduces it with a greedy approach, each step reallocating 1 SM from the application with the smallest current performance degradation to the application with the largest. The distance between two allocation schemes is defined as the maximum, over all applications, of the change in the number of SMs allocated to that application. The algorithm terminates immediately once the new allocation scheme is farther than a certain threshold from the original scheme, because the accuracy of the predicted values decreases as the allocation deviates from the original scheme.
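The greedy procedure just described might be sketched as follows. The `degradation` callback stands in for the predictors of the earlier layers, and the simple 1/n degradation model used in the demonstration is purely hypothetical; thresholds and names are likewise illustrative.

```python
def rebalance(alloc, degradation, imb_threshold, dist_threshold):
    """Greedy SM reallocation sketch: while the imbalance exceeds its
    threshold, move one SM from the least-degraded application to the
    most-degraded one; stop once the scheme drifts farther than
    dist_threshold from the original allocation."""
    original = dict(alloc)
    while True:
        deg = degradation(alloc)
        if max(deg.values()) - min(deg.values()) < imb_threshold:
            break  # already balanced enough
        lo = min(deg, key=deg.get)  # least degraded: gives up one SM
        hi = max(deg, key=deg.get)  # most degraded: receives one SM
        if alloc[lo] <= 1:
            break  # cannot take an application's last SM
        alloc[lo] -= 1
        alloc[hi] += 1
        # distance = max per-application change in SM count
        if max(abs(alloc[a] - original[a]) for a in alloc) > dist_threshold:
            break  # predictions degrade far from the original scheme
    return alloc

# Hypothetical model: degradation shrinks as an application gets more SMs.
model = lambda a: {app: 10 / n for app, n in a.items()}
result = rebalance({"A": 8, "B": 2}, model, imb_threshold=0.5, dist_threshold=3)
```

In this toy run the algorithm converges on an even split, since the assumed model makes degradation depend only on SM count.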
In summary, when the present invention is used, the off-line training is performed first, as shown in fig. 4, the training process is as follows:
1) designing corresponding pressure measurement programs according to the architecture of the target GPU: pressure measurement programs for the second-level shared cache and the global memory bandwidth are designed separately to quantify the severity of contention on each shared resource.
2) Designing corresponding pressure generators according to the architecture of the target GPU: pressure generators for the second-level cache and the global memory bandwidth are designed separately to generate a specific amount of pressure on a specific shared resource.
3) Running the pressure measurement programs together with the pressure generators: the various pressure measurement programs run while sharing a GPU with the various pressure generators, and the corresponding runtime information and generated pressure values are collected.
4) Training the runtime pressure extractor: and taking the operation information collected in the previous stage as input, and taking the measured pressure value as output to train a neural network for online pressure extraction.
5) Collecting a representative set of applications: a group of application programs is collected that fully covers the mainstream cases, and representative applications are gathered according to the target application scenarios to form a training set.
6) Running the applications together with the pressure generators: each application program runs while sharing the GPU with the pressure generators; the run-time statistical information is collected, and the conflict performance degradation and expansion performance degradation of the application program, as well as the pressure generated by the pressure generators, are measured.
7) Training the conflict performance degradation predictor: the runtime information collected in the previous stage and the pressure generated by the pressure generators are used as input, and the conflict performance degradation of the application program is used as the label, to train a neural network for online conflict performance degradation prediction.
8) Training the expansion performance degradation predictor: and taking the runtime information collected in the previous stage and the pressure generated by the pressure generator as input, and taking the expansion performance reduction of the application program as a label to train a neural network for online expansion performance reduction prediction.
After the offline training is completed, online scheduling can be performed; the online scheduling process, shown in fig. 5, is as follows:
1) Collecting runtime information: the run-time statistical information of each level of cache for each shared application and the current SM allocation scheme are collected.
2) Pressure extraction: taking the information acquired in the previous stage as input, the pressure borne by each shared application on the second-level cache and the memory bandwidth is extracted.
3) Conflict performance degradation prediction: the collected runtime information of each shared application and the pressure it bears on the various shared resources are used as input, and the conflict performance degradation predictor outputs the conflict performance degradation degree of the application.
4) Expansion performance degradation prediction: the collected runtime information of each shared application and the pressure it bears on the various shared resources are used as input, and the expansion performance degradation predictor outputs the expansion performance degradation degree of the application.
5) SM reallocation: the real performance degradation degree of each shared application is calculated from the predicted conflict performance degradation and expansion performance degradation, and the performance imbalance degree of the system is then calculated from the real performance degradation degrees. If the imbalance exceeds a certain threshold, reallocation is performed: a greedy method gradually reduces the current imbalance of the system, thereby obtaining an allocation scheme better than the current one, which is then applied.
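The five online steps above could be wired together as in the following sketch, where every stage is an injected callback. All names here are illustrative stubs, not an API defined by the invention; the constant-valued lambdas merely stand in for the trained extractors and predictors.

```python
def online_schedule(collect_stats, extract_pressure, predict_conflict,
                    predict_expansion, reallocate):
    """One scheduling period of the online flow described above."""
    stats, alloc = collect_stats()                 # step 1: runtime info + SM scheme
    pressure = extract_pressure(stats)             # step 2: pressure extraction
    conflict = predict_conflict(stats, pressure)   # step 3: conflict degradation
    expansion = predict_expansion(stats, pressure) # step 4: expansion degradation
    # step 5: real degradation = conflict x expansion, then reallocate
    real = {a: conflict[a] * expansion[a] for a in conflict}
    return reallocate(alloc, real)

# Tiny demonstration with constant stubs for two applications.
result = online_schedule(
    collect_stats=lambda: ({}, {"A": 6, "B": 4}),
    extract_pressure=lambda s: {},
    predict_conflict=lambda s, p: {"A": 1.2, "B": 1.1},
    predict_expansion=lambda s, p: {"A": 1.5, "B": 1.3},
    reallocate=lambda alloc, real: (alloc, real),
)
```

In a real deployment the `reallocate` callback would run the periodic greedy algorithm of table 1 rather than returning its inputs unchanged.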
The embodiment of the present invention further provides a storage medium storing program instructions which, when executed by a GPU processor, implement the performance balancing scheduling method of the GPU as described above. The performance equalization scheduling method of the GPU has already been described in detail above and is not repeated here.
The embodiment of the present invention further provides an electronic terminal, for example, a server, where the electronic terminal includes a GPU processor and a memory, where the memory stores program instructions, and the GPU processor executes the program instructions to implement the performance balancing scheduling method for the GPU. The performance equalization scheduling method of the GPU has already been described in detail above, and is not described herein again.
In summary, the present invention provides a performance balancing scheduling mechanism for a preemptive shared multitasking GPU. Without requiring additional hardware support, the mechanism further guarantees balanced performance degradation among shared applications on top of the improved GPU resource utilization and overall system performance brought by spatial multitasking. The invention therefore effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.