Disclosure of Invention
The invention provides a cloud platform computing power resource performance monitoring and real-time scheduling optimization method.
The cloud platform computing power resource performance monitoring and real-time scheduling optimization method comprises the following steps:
S1, acquiring real-time performance data of heterogeneous computing nodes, wherein the real-time performance data at least comprises CPU instruction cycle occupancy rate, GPU video memory bandwidth utilization rate and NPU matrix operation delay;
S2, carrying out multidimensional index fusion on the real-time performance data to generate a fused performance index set containing time correlation characteristics;
S3, constructing a dynamic computing-power topology map based on the fused performance index set, wherein the map nodes represent physical computing units and the edge weights represent normalized values of the communication bandwidth and delay among the nodes;
S4, generating an elastic scaling strategy set according to the dynamic computing-power topology map, wherein the elastic scaling strategy set comprises the topological constraint conditions of a virtual resource pool and a load mutation probability threshold;
S5, performing real-time scheduling by adopting a priority dynamic adjustment algorithm, and mapping each task to be allocated to a physical node meeting the topological constraint conditions;
S6, detecting topology conflicts in the scheduling result through a feedback verification module, and dynamically correcting the parameter weights of the elastic scaling strategy set.
Optionally, the S1 specifically includes:
S11, acquiring the CPU instruction cycle occupancy rate in real time by installing a performance monitoring agent on each heterogeneous computing node, wherein the occupancy rate is obtained by a monitoring tool as the ratio of busy cycles to total cycles for each CPU core;
S12, acquiring the GPU video memory bandwidth utilization rate in real time through a GPU driver interface, wherein the bandwidth utilization rate records the bandwidth usage of GPU memory accesses through a GPU hardware counter;
S13, acquiring the NPU matrix operation delay in real time through a management interface of the NPU hardware platform, wherein the matrix operation delay records, through a hardware timer, the delay of the NPU executing matrix operation tasks.
Optionally, the S2 specifically includes:
S21, carrying out normalization processing on the real-time performance data of each heterogeneous computing node to obtain a standardized performance data set x';
S22, aligning the normalized performance data of the heterogeneous computing nodes by time sequence to obtain a performance data set within a time window, wherein t_1, t_2, ..., t_k represent the timestamps of different moments;
S23, fusing the normalized performance data of the different heterogeneous computing nodes by a weighted average method to obtain a comprehensive performance index set {P_t};
S24, smoothing the comprehensive performance index set {P_t} by a moving average method based on time-series analysis to generate a fused performance index set {F_t} containing time correlation characteristics, so as to capture long-term trends and short-term fluctuations in performance changes.
Optionally, the fused performance index set {F_t} containing time correlation characteristics is expressed as:
F_t = (1/W) · Σ_{k=t−W+1}^{t} P_k, wherein P_k represents the comprehensive performance index at time k, W is the size of the sliding window, i.e. the length of the time window used for smoothing, k represents each time point within the sliding window, t − W + 1 ≤ k ≤ t, and F_t is the fused performance index after moving-average smoothing, reflecting the average performance of the time points within the window.
Optionally, the step S3 specifically includes:
S31, node definition: according to the fused performance index set {F_t}, each heterogeneous computing node is defined as a node v_i in the graph G = (V, E), wherein each node v_i represents a physical computing unit (CPU, GPU or NPU); the attributes of each node comprise F_i(t) as well as the load, available resources and current task state of the computing node, F_i(t) representing the fused performance index set of node v_i at time t;
S32, edge definition: an edge e_ij represents the communication connection between node v_i and node v_j, and the edge weight w_ij represents the normalized values of the communication bandwidth and delay between the nodes;
S33, dynamic updating: based on the real-time performance data and the fused performance index set {F_t}, the node attributes and edge weights are dynamically updated; when the performance data are updated, the load and task allocation state of each node are recalculated, the attributes of the corresponding node are updated, and the edge weights w_ij are updated according to the real-time communication bandwidth and delay among the nodes.
Optionally, the edge weight w_ij representing the normalized values of the communication bandwidth and delay between nodes is expressed as:
w_ij = (1 − B_ij / B_max) + L_ij / L_max, wherein B_ij represents the communication bandwidth between node v_i and node v_j, L_ij represents the communication delay between them, B_max and L_max represent the maxima of bandwidth and delay, respectively, among all nodes and are used for normalization, and w_ij is the edge weight: the higher its value, the worse the bandwidth and delay performance of the communication path.
Optionally, the step S4 specifically includes:
S41, defining topological constraint conditions: based on the edge weights w_ij in the dynamic computing-power topology map, the topological connection relations among the physical computing units in the virtual resource pool are determined and the topological constraint conditions are defined: for each node v_i in the virtual resource pool, the other nodes v_j with which it can communicate are determined based on the edge weight w_ij, and a constraint threshold θ is defined; if the edge weight w_ij between node v_i and node v_j is greater than the constraint threshold θ, task scheduling onto the communication path between node v_i and node v_j is not allowed;
S42, calculating the load mutation probability and setting its threshold: the load mutation probability threshold P_threshold is set to 0.3; for each node v_i, the load mutation probability P_mutation is calculated; if P_mutation exceeds the threshold P_threshold, the risk of a load mutation at the corresponding node is high;
S43, generating an elastic scaling strategy set: the elastic scaling strategy set S of the virtual resource pool is generated according to the topological constraint conditions and the load mutation probability threshold. The elastic scaling strategy set S comprises:
defining the load, communication bandwidth and communication delay constraints of each node, ensuring that the topological constraint conditions are followed during task scheduling;
a load mutation probability threshold P_threshold deployed for each node, indicating when to make dynamic adjustments to the resources (e.g., expanding or contracting the number of nodes in the virtual resource pool, increasing or decreasing resource allocation, etc.).
Optionally, the step S5 specifically includes:
S51, evaluating task priority: according to the computational requirement, timeliness requirement and resource consumption of each task, the priority P_i of each task T_i to be allocated is calculated, expressed as:
P_i = α · (1 / D_i) + β · U_i + γ · R_i, wherein D_i is the deadline of the task, U_i is the urgency degree of the task, R_i is the resource requirement of the task, and α, β, γ are weight factors taken as 0.5, 0.3 and 0.2 respectively, adjustable according to the specific scene;
S52, topological constraint screening: according to the topological constraint conditions, a set of physical nodes that satisfy the resource requirements and meet the topological constraint conditions is screened out, wherein the edge weight w_ij of each node in the set satisfies the condition w_ij ≤ θ;
S53, task mapping: according to the task priority P_i and the load condition of the physical nodes, the priority dynamic adjustment algorithm is adopted to map the task T_i to a physical node meeting the topological constraints.
Optionally, the step S6 specifically includes:
S61, after the real-time scheduling of S5 is completed, collecting the scheduling result of each task, including the mapping node of the task, the task completion time and the node load condition, and monitoring the obtained performance data, including node load, communication delay and bandwidth utilization rate, in real time;
S62, performing topology conflict detection on the scheduling result by using a feedback verification module, and checking whether any topological constraint condition is violated, wherein the detection of topology conflicts comprises:
checking whether the task mapping causes too high a communication delay between nodes;
checking whether the edge weight w_ij between nodes exceeds the constraint threshold θ; if it does, a topology conflict is considered to have occurred;
judging whether any task allocation violates the topological constraints by comparing the actual node connection conditions with the predefined topology structure;
S63, if a topology conflict is detected, triggering a conflict correction mechanism by the feedback verification module;
S64, dynamically correcting the elastic scaling strategy set: the parameter weights of the elastic scaling strategy set are adjusted during conflict correction.
Optionally, the conflict correction mechanism comprises:
readjusting the mapping of the tasks and selecting new nodes that conform to the topological constraint conditions;
adjusting the load distribution of the nodes to ensure that the communication bandwidth and delay meet the preset requirements.
The invention has the beneficial effects that:
According to the invention, by adopting a dynamic computing-power topology map and a priority dynamic adjustment algorithm, tasks are accurately mapped to the most suitable physical nodes based on real-time performance data and topological constraint conditions. By comprehensively evaluating the priority of each task, the load state of each node and the topological constraint conditions, the task allocation process is optimized, resource waste and excessive concentration of load are avoided, and the scheduling efficiency and precision of computing resources are significantly improved. In addition, the dynamic adjustment algorithm ensures that task allocation continuously adapts to changes in the system state according to real-time feedback and the analysis of historical data, further improving scheduling precision;
The invention introduces a topology conflict detection and dynamic correction mechanism that can promptly discover and correct topological constraint conflicts during real-time scheduling. Through the real-time feedback verification module, potential topology conflicts in the task allocation process, such as insufficient communication bandwidth or excessively high delay between nodes, can be rapidly detected; when a conflict is found, the topological constraint conditions are automatically adjusted and the edge-weight threshold is dynamically corrected.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific embodiments. While the invention is described in detail here so that the embodiments are fully explained, the following embodiments are preferred ones; the invention can also be embodied in other forms well known to those skilled in the art, and the accompanying drawings serve only to describe the embodiments more specifically, not to limit the invention to the specific forms disclosed herein.
It should be noted that references in the specification to "one embodiment," "an example embodiment," "some embodiments," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the relevant art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Generally, the terminology may be understood, at least in part, from the use of context. For example, the term "one or more" as used herein may be used to describe any feature, structure, or characteristic in a singular sense, or may be used to describe a combination of features, structures, or characteristics in a plural sense, depending at least in part on the context. In addition, the term "based on" may be understood as not necessarily intended to convey an exclusive set of factors, but may instead, depending at least in part on the context, allow for other factors that are not necessarily explicitly described.
As shown in fig. 1-2, the cloud platform computing power resource performance monitoring and real-time scheduling optimization method comprises the following steps:
S1, acquiring real-time performance data of heterogeneous computing nodes, wherein the real-time performance data at least comprises CPU instruction cycle occupancy rate, GPU video memory bandwidth utilization rate and NPU matrix operation delay;
S2, carrying out multidimensional index fusion on the real-time performance data to generate a fused performance index set containing time correlation characteristics;
S3, constructing a dynamic computing-power topology map based on the fused performance index set, wherein the map nodes represent physical computing units and the edge weights represent normalized values of the communication bandwidth and delay among the nodes;
S4, generating an elastic scaling strategy set according to the dynamic computing-power topology map, wherein the elastic scaling strategy set comprises the topological constraint conditions of a virtual resource pool and a load mutation probability threshold;
S5, performing real-time scheduling by adopting a priority dynamic adjustment algorithm, and mapping each task to be allocated to a physical node meeting the topological constraint conditions;
S6, detecting topology conflicts in the scheduling result through a feedback verification module, and dynamically correcting the parameter weights of the elastic scaling strategy set.
S1 specifically comprises:
S11, acquiring the CPU instruction cycle occupancy rate in real time by installing a performance monitoring agent on each heterogeneous computing node, the occupancy rate being obtained by a monitoring tool as the ratio of busy cycles to total cycles for each CPU core;
S12, acquiring the GPU video memory bandwidth utilization rate in real time through a GPU driver interface, wherein the bandwidth utilization rate records the bandwidth usage of GPU memory accesses through a GPU hardware counter;
S13, acquiring the NPU matrix operation delay in real time through a management interface of the NPU hardware platform, the matrix operation delay recording, through a hardware timer, the delay of the NPU executing matrix operation tasks.
The CPU instruction cycle occupancy rate U_cpu is calculated by the following formula:
U_cpu = C_busy / C_total, wherein C_busy represents the number of busy cycles of the CPU core within a sampling period and C_total represents the total number of cycles (including idle and busy cycles) of the CPU core within the same period;
the GPU video memory bandwidth utilization rate U_gpu is calculated by the following formula:
U_gpu = B_used / B_peak, wherein B_used represents the actual usage of the GPU video memory bandwidth and B_peak represents the maximum bandwidth of the video memory;
the NPU matrix operation delay D_npu is calculated by the following formula:
D_npu = T_end − T_start, wherein T_start is the timestamp at which the NPU begins performing the matrix operation and T_end is the timestamp at which the NPU completes it.
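Under the formulas above, the three metrics reduce to ratios and differences over raw counter readings. The following is a minimal illustrative sketch; the function names and the way counter values arrive are assumptions, not the monitoring agent's actual interface:

```python
def cpu_cycle_occupancy(busy_cycles: int, total_cycles: int) -> float:
    """U_cpu = C_busy / C_total for one sampling period."""
    return busy_cycles / total_cycles

def gpu_mem_bandwidth_util(used_bandwidth: float, peak_bandwidth: float) -> float:
    """U_gpu = B_used / B_peak, from GPU hardware-counter readings."""
    return used_bandwidth / peak_bandwidth

def npu_matrix_op_delay(t_start: float, t_end: float) -> float:
    """D_npu = T_end - T_start, from hardware-timer timestamps."""
    return t_end - t_start
```

In practice the inputs would come from the per-node agent (e.g., a CPU cycle counter, the GPU driver interface, the NPU hardware timer); the arithmetic itself is exactly the three formulas.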
S2 specifically comprises:
S21, carrying out normalization processing on the real-time performance data of each heterogeneous computing node to obtain a standardized performance data set, expressed as: x' = (x − μ) / σ, wherein x represents the actual value of a performance index (CPU instruction cycle occupancy rate, GPU video memory bandwidth utilization rate or NPU matrix operation delay) at a specific moment, μ represents the mean of the corresponding performance index in the historical data, and σ represents the standard deviation of the corresponding performance index in the historical data;
S22, aligning the normalized performance data of the heterogeneous computing nodes by time sequence to obtain a performance data set within a time window, wherein t_1, t_2, ..., t_k represent the timestamps of different moments;
S23, fusing the normalized performance data of the different heterogeneous computing nodes by a weighted average method to obtain a comprehensive performance index set {P_t}, expressed as: P_t = Σ_{i=1}^{n} w_i · x'_{i,t}, wherein n represents the number of computing nodes, w_i represents the weight of the i-th computing node, the weight value being determined by factors such as the load and reliability of the node, and x'_{i,t} represents the normalized performance data of the i-th computing node at time t;
S24, smoothing the comprehensive performance index set {P_t} by a moving average method based on time-series analysis to generate a fused performance index set {F_t} containing time correlation characteristics, so as to capture long-term trends and short-term fluctuations in performance changes.
The fused performance index set {F_t} containing time correlation characteristics is expressed as:
F_t = (1/W) · Σ_{k=t−W+1}^{t} P_k, wherein P_k represents the comprehensive performance index at time k, W is the size of the sliding window, i.e. the length of the time window used for smoothing, k represents each time point within the sliding window, t − W + 1 ≤ k ≤ t, and F_t is the fused performance index after moving-average smoothing, reflecting the average performance of the time points within the window;
through the above moving-average processing, short-term fluctuations can be effectively smoothed and long-term trends and time correlation characteristics captured, yielding a more stable, time-correlated performance index set {F_t} that provides smooth and stable performance data for subsequent dynamic scheduling optimization.
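The S21–S24 pipeline (z-score normalization, weighted fusion across nodes, moving-average smoothing) can be sketched as follows. This is a minimal illustration assuming equal-length, time-aligned series per node; the helper names are assumptions:

```python
from statistics import mean, pstdev

def zscore(series):
    """S21: normalize one metric series against its mean and standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma if sigma else 0.0 for x in series]

def fuse(node_series, weights):
    """S23: weighted average across nodes at each time point
    (series are assumed time-aligned per S22)."""
    length = len(node_series[0])
    return [sum(w * s[t] for w, s in zip(weights, node_series))
            for t in range(length)]

def moving_average(p, window):
    """S24: sliding-window mean F_t over the fused index P_t
    (the window is truncated at the start of the series)."""
    return [mean(p[max(0, t - window + 1): t + 1]) for t in range(len(p))]
```

For example, `moving_average([1, 2, 3, 4], 2)` yields `[1, 1.5, 2.5, 3.5]`: each output point averages the current and previous fused index, smoothing single-step fluctuations while following the trend.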
S3 specifically comprises:
S31, node definition: according to the fused performance index set {F_t}, each heterogeneous computing node is defined as a node v_i in the graph G = (V, E), wherein each node v_i represents a physical computing unit (CPU, GPU or NPU); the attributes of each node comprise F_i(t) as well as the load, available resources and current task state of the computing node, F_i(t) representing the fused performance index set of node v_i at time t;
S32, edge definition: an edge e_ij represents the communication connection between node v_i and node v_j, and the edge weight w_ij represents the normalized values of the communication bandwidth and delay between the nodes;
S33, dynamic updating: based on the real-time performance data and the fused performance index set {F_t}, the node attributes and edge weights are dynamically updated; when the performance data are updated, the load and task allocation state of each node are recalculated, the attributes of the corresponding node are updated, and the edge weights w_ij are updated according to the real-time communication bandwidth and delay among the nodes, ensuring that the map reflects the real-time computing-power topology. The key idea is to dynamically adjust the attributes of the nodes and edges in the computing-power topology map through updates of the real-time performance data, so that the map accurately reflects the current state of the cloud platform's computing resources and thereby provides real-time support for scheduling optimization;
1. Each computing node (e.g., CPU, GPU or NPU) has a corresponding node in the topology map. The node is not only an abstract representation of the computing resource; its attributes, including load and task allocation status, are also adjusted based on changes in the real-time performance data, where:
load: as tasks execute and the workload changes, the load of a node changes continuously; for example, a node executing a large number of computing tasks may have a very high load;
task allocation status: the task execution status of each node also affects its attributes; for example, some nodes may have completed their tasks while others are waiting for tasks or busy performing certain operations. Dynamically updating this information makes it possible to know the current status of each node in real time.
2. Edges in the graph represent communication paths between two nodes, and the edge weights are closely related to communication bandwidth and delay;
bandwidth and latency: the communication bandwidth and latency between nodes are important factors affecting task allocation efficiency; for example, tasks can transfer data more quickly and efficiently when the bandwidth between two nodes is high and the latency is low, whereas connections with low bandwidth or high latency cause performance bottlenecks;
real-time updates: bandwidth and delay change as network conditions change (e.g., network traffic increases or decreases). When these data change, the corresponding edge weights need to be recalculated to reflect the most current communication state; for example, if the bandwidth increases, the edge weight may decrease because the path's communication performance is better, while an increase in delay leads to an increase in the weight, indicating poorer path performance.
3. Through the above updates, the computing-power topology map always remains a real-time mapping of the cloud platform's computing resources. Whenever performance data (such as load, bandwidth or delay) change, the nodes and edges in the map are updated correspondingly, ensuring that the map accurately reflects the current state of the cloud platform. The scheduling system can then make decisions according to the latest topology map, for example, allocating tasks to nodes with lower load and better communication performance, thereby optimizing resource utilization and system performance.
In summary, the core goal of this section is to enable the computing-power topology map to reflect changes in computing resources in real time, ensuring that scheduling decisions are made based on the most accurate current resource state rather than outdated data, thereby improving the efficiency and reliability of the overall system.
The edge weight w_ij representing the normalized values of the communication bandwidth and delay between nodes is expressed as:
w_ij = (1 − B_ij / B_max) + L_ij / L_max, wherein B_ij represents the communication bandwidth between node v_i and node v_j, L_ij represents the communication delay between them, B_max and L_max represent the maxima of bandwidth and delay, respectively, among all nodes and are used for normalization, and w_ij is the edge weight: the higher its value, the worse the bandwidth and delay performance of the communication path.
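Assuming the additive combination reconstructed above (one plausible reading of "normalized values of bandwidth and delay"; the disclosure's exact formula may differ), the S33 edge-weight refresh can be sketched as:

```python
def edge_weight(bandwidth, delay, max_bandwidth, max_delay):
    """w_ij grows as bandwidth falls short of the best observed link
    and as delay rises, so higher weight means a worse path."""
    return (1 - bandwidth / max_bandwidth) + delay / max_delay

def update_edges(links, max_bandwidth, max_delay):
    """S33: recompute all edge weights from fresh bandwidth/delay samples.
    links maps a node pair (i, j) to its measured (bandwidth, delay)."""
    return {pair: edge_weight(b, l, max_bandwidth, max_delay)
            for pair, (b, l) in links.items()}
```

A link at maximum bandwidth with half the maximum delay gets weight 0.5; halving its bandwidth and maximizing its delay raises the weight to 1.5, matching the rule that higher w_ij signals a poorer communication path.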
S4 specifically comprises the following steps:
S41, defining topological constraint conditions: based on the edge weights w_ij in the dynamic computing-power topology map, the topological connection relations among the physical computing units in the virtual resource pool are determined and the topological constraint conditions are defined: for each node v_i in the virtual resource pool, the other nodes v_j with which it can communicate are determined based on the edge weight w_ij, and a constraint threshold θ is defined; if the edge weight w_ij between node v_i and node v_j is greater than the constraint threshold θ, task scheduling onto the communication path between node v_i and node v_j is not allowed, thereby avoiding performance degradation;
in a high-performance computing scenario, the constraint threshold θ is taken as 0.3, ensuring that only high-bandwidth, low-latency communication paths are selected;
in general load or resource scheduling scenarios, the constraint threshold θ is taken as 0.6, allowing a certain degree of communication delay and bandwidth bottleneck;
S42, calculating the load mutation probability and setting its threshold: the load mutation probability threshold P_threshold is set to 0.3; for each node v_i, the load mutation probability P_mutation is calculated; if P_mutation exceeds the threshold P_threshold, the risk of a load mutation at the corresponding node is high and the resource quota of the node needs to be adjusted; the load mutation probability threshold P_threshold is used to judge whether the virtual resource pool needs elastic scaling, so as to avoid overload or resource waste;
The calculation of the load mutation probability P_mutation comprises:
collecting load data: acquiring the load data L_1, L_2, ..., L_N of node v_i at N successive time points;
calculating the load variation amplitude between adjacent time points: ΔL_t = |L_t − L_{t−1}|;
setting a mutation threshold: a mutation threshold δ for the load variation amplitude is set according to the load fluctuation range tolerated by the system;
calculating the load mutation frequency: the number of times M at which the load variation amplitude ΔL_t exceeds the threshold δ over the successive time points is counted;
the load mutation probability P_mutation is then calculated as:
P_mutation = M / (N − 1), wherein N is the total number of load data points, representing the time span of the data;
the load mutation probability threshold P_threshold is set according to system requirements and reflects the tolerance for load fluctuation:
1. for systems tolerating strong load mutations, the load mutation probability threshold P_threshold is set to a higher value (0.3 or 0.4);
2. if the tolerance for load variations is low, a low value (0.1 or 0.2) is set.
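The statistics above amount to counting adjacent-step load jumps larger than δ and dividing by the number of steps. A minimal sketch (function names assumed):

```python
def load_mutation_probability(loads, delta):
    """P_mutation = M / (N - 1): fraction of adjacent load deltas
    whose magnitude exceeds the mutation threshold delta."""
    if len(loads) < 2:
        return 0.0
    jumps = sum(1 for prev, cur in zip(loads, loads[1:])
                if abs(cur - prev) > delta)
    return jumps / (len(loads) - 1)

def needs_scaling(loads, delta, p_threshold=0.3):
    """S42: flag a node whose mutation probability exceeds P_threshold."""
    return load_mutation_probability(loads, delta) > p_threshold
```

With loads [0.2, 0.8, 0.7, 0.1] and δ = 0.3, two of the three adjacent deltas (0.6, 0.1, 0.6) exceed δ, so P_mutation = 2/3 and the node is flagged under the default P_threshold of 0.3.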
S43, generating an elastic scaling strategy set: the elastic scaling strategy set S of the virtual resource pool is generated according to the topological constraint conditions and the load mutation probability threshold. The elastic scaling strategy set S comprises:
defining the load, communication bandwidth and communication delay constraints of each node, ensuring that the topological constraint conditions are followed during task scheduling;
a load mutation probability threshold P_threshold deployed for each node, indicating when to make dynamic adjustments to the resources (e.g., expanding or contracting the number of nodes in the virtual resource pool, increasing or decreasing resource allocation, etc.).
S5 specifically comprises the following steps:
S51, evaluating task priority: according to the computational requirement, timeliness requirement and resource consumption of each task, the priority P_i of each task T_i to be allocated is calculated, expressed as:
P_i = α · (1 / D_i) + β · U_i + γ · R_i, wherein D_i is the deadline of the task, U_i is the urgency degree of the task, R_i is the resource requirement of the task, and α, β, γ are weight factors taken as 0.5, 0.3 and 0.2 respectively, adjustable according to the specific scene;
S52, topological constraint screening: according to the topological constraint conditions, a set of physical nodes that satisfy the resource requirements and meet the topological constraint conditions is screened out, wherein the edge weight w_ij of each node in the set satisfies the condition w_ij ≤ θ;
S53, task mapping: according to the task priority P_i and the load condition of the physical nodes, the priority dynamic adjustment algorithm is adopted to map the task T_i to a physical node meeting the topological constraints, with the mapping rules:
select the task with the highest priority P_i;
from the set of feasible physical nodes, select the node with the lightest current load that meets the topological constraints, so that the resource requirement of task T_i is satisfied and the load of the physical node does not exceed its own capacity threshold.
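S51–S53 can be sketched as a priority score followed by a greedy least-loaded selection under the w_ij ≤ θ constraint. The inverse-deadline term and the field names are assumptions made for illustration:

```python
def task_priority(deadline, urgency, resource_req,
                  alpha=0.5, beta=0.3, gamma=0.2):
    """P_i = alpha * (1 / D_i) + beta * U_i + gamma * R_i
    (closer deadlines raise priority; the inverse form is assumed)."""
    return alpha / deadline + beta * urgency + gamma * resource_req

def map_task(task, nodes, theta):
    """S52/S53: keep nodes within the edge-weight threshold theta that
    still have spare capacity, then pick the least-loaded one."""
    feasible = [n for n in nodes
                if n["edge_weight"] <= theta
                and n["load"] + task["resource_req"] <= n["capacity"]]
    if not feasible:
        return None  # no node satisfies the constraints; reschedule later
    return min(feasible, key=lambda n: n["load"])["id"]
```

Tightening θ shrinks the feasible set: a node on a poor communication path is excluded even if it is idle, which is exactly the trade-off the constraint threshold encodes.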
S6 specifically comprises the following steps:
S61, after the real-time scheduling of S5 is completed, collecting the scheduling result of each task, including the mapping node of the task, the task completion time and the node load condition, and monitoring the obtained performance data, including node load, communication delay and bandwidth utilization rate, in real time;
S62, performing topology conflict detection on the scheduling result by using the feedback verification module, and checking whether any topological constraint condition is violated, wherein the detection of topology conflicts comprises:
checking whether the task mapping causes too high a communication delay between nodes;
checking whether the edge weight w_ij between nodes exceeds the constraint threshold θ; if it does, a topology conflict is considered to have occurred;
judging whether any task allocation violates the topological constraints by comparing the actual node connection conditions with the predefined topology structure;
S63, if a topology conflict is detected, triggering the conflict correction mechanism by the feedback verification module;
S64, dynamically correcting the elastic scaling strategy set: the parameter weights of the elastic scaling strategy set are adjusted during conflict correction, wherein the adjustment steps comprise:
adjusting the constraint threshold θ in the topological constraint conditions according to the conflict detection result, so as to adapt to the new task allocation pattern: if the current system resource constraint is too strict, such that the communication between most nodes exceeds the constraint threshold, θ can be appropriately raised to increase the flexibility of the system and allow more task allocations; if conflict detection indicates that the system depends too heavily on certain high-bandwidth or low-latency communication links, causing excessively high edge weights between some nodes and possibly excessive resource contention or load imbalance, θ can be appropriately lowered to strengthen control over the quality of communication between nodes;
recalculating the load mutation probability P_mutation.
The conflict correction mechanism comprises:
readjusting the mapping of the tasks and selecting new nodes that conform to the topological constraint conditions;
adjusting the load distribution of the nodes to ensure that the communication bandwidth and delay meet the preset requirements.
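The S62–S64 feedback loop can be sketched as a conflict scan over the scheduled communication paths followed by a nudge of the constraint threshold θ. The 0.5/0.1 conflict ratios, the step size and the bounds are illustrative assumptions, not values from the disclosure:

```python
def detect_conflicts(assignments, edge_weights, theta):
    """S62: return the tasks whose communication path violates w_ij <= theta.
    assignments maps task_id -> (node_a, node_b); edge_weights maps that
    node pair -> its current edge weight."""
    return [t for t, pair in assignments.items() if edge_weights[pair] > theta]

def adjust_theta(theta, conflict_ratio, step=0.05, low=0.1, high=0.9):
    """S64: loosen theta when most paths violate it (constraints too strict),
    tighten it when almost none do (room to demand better links)."""
    if conflict_ratio > 0.5:
        return min(high, round(theta + step, 2))
    if conflict_ratio < 0.1:
        return max(low, round(theta - step, 2))
    return theta
```

After each correction the conflicting tasks would be remapped (e.g., via the S53 selection) and the load mutation probabilities recomputed, closing the loop back into the elastic scaling strategy set.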
The invention is intended to cover any alternatives, modifications, equivalents and variations that fall within its spirit and scope. In the description of the preferred embodiments of the invention, specific details are set forth in order to provide a thorough understanding of the invention; however, the invention can be fully understood by those skilled in the art without some of these details. In other instances, well-known methods, procedures, flows, components, circuits and the like have not been described in detail so as not to unnecessarily obscure aspects of the present invention.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.