CN119883651A

CN119883651A - Cloud platform computing power resource performance monitoring and real-time scheduling optimization method

Info

Publication number: CN119883651A
Application number: CN202510361535.XA
Authority: CN
Inventors: 乐均斌; 陈启; 黄秀月; 林东
Original assignee: Zhejiang Wanwang Information Technology Co ltd
Current assignee: Zhejiang Wanwang Information Technology Co ltd
Priority date: 2025-03-26
Filing date: 2025-03-26
Publication date: 2025-04-25

Abstract

The present invention relates to the technical field of cloud platform monitoring and analysis, and specifically to a cloud platform computing power resource performance monitoring and real-time scheduling optimization method, comprising the following steps: collecting real-time performance data of heterogeneous computing nodes; fusing the real-time performance data with multi-dimensional indicators to generate a fusion performance indicator set containing time-related features; constructing a dynamic computing power topology map based on the fusion performance indicator set; generating an elastic scaling strategy set according to the dynamic computing power topology map; mapping tasks to be assigned to physical nodes that meet the topological constraints; performing topological conflict detection on the scheduling results through a feedback verification module, and dynamically correcting the parameter weights of the elastic scaling strategy set. The present invention reduces resource conflicts, improves the utilization efficiency of computing power resources, avoids system bottleneck problems caused by topological conflicts, and thus enhances the stability and load balancing capabilities of the system.

Description

Cloud platform computing power resource performance monitoring and real-time scheduling optimization method

Technical Field

The invention relates to the technical field of cloud platform monitoring and analysis, in particular to a cloud platform computing power resource performance monitoring and real-time scheduling optimization method.

Background

With the rapid development of cloud computing and big data technology, a cloud platform has become an important infrastructure for supporting various application services and task processing, and provides powerful computing power resources, but with the increasing demands of users and increasing complexity of computing tasks, how to efficiently and stably schedule and manage computing resources in the cloud platform has become a key problem to be solved in the current cloud computing field.

In the existing cloud platform computing power resource scheduling system, the resource allocation is usually based on a preset strategy or a static load balancing algorithm, and the method is easy to cause resource waste or unbalanced task scheduling under the condition that the demand of computing tasks is changed greatly or the resource load is suddenly, for example, when the system load is too high, some nodes may face the problem that computing resources are excessively concentrated, while other nodes are in an idle state and cannot be fully utilized, so that the resource utilization rate is reduced, and the system performance is also possibly severely reduced.

Disclosure of Invention

The invention provides a cloud platform computing power resource performance monitoring and real-time scheduling optimization method.

The cloud platform computing power resource performance monitoring and real-time scheduling optimization method comprises the following steps:

s1, acquiring real-time performance data of heterogeneous computing nodes, wherein the real-time performance data at least comprises CPU instruction period occupancy rate, GPU video memory bandwidth utilization rate and NPU matrix operation delay;

s2, carrying out multidimensional index fusion on the real-time performance data to generate a fusion performance index set containing time correlation characteristics;

S3, constructing a dynamic algorithm topology map based on the fusion performance index set, wherein the map nodes represent physical calculation units, and the edge weights represent normalized values of communication bandwidth and delay among the nodes;

S4, generating an elastic expansion strategy set according to the dynamic algorithm topology map, wherein the elastic expansion strategy set comprises a topological constraint condition of a virtual resource pool and a load mutation probability threshold;

s5, performing real-time scheduling by adopting a priority dynamic adjustment algorithm, and mapping the task to be allocated to a physical node meeting the topological constraint condition;

And S6, detecting topology conflict of the scheduling result through a feedback verification module, and dynamically correcting the parameter weight of the elastic expansion strategy set.

Optionally, the S1 specifically includes:

S11, acquiring the CPU instruction period occupancy rate in real time by installing a performance monitoring agent on each heterogeneous computing node, wherein the instruction period occupancy rate acquires the ratio of the idle period to the busy period of each CPU core through a monitoring tool;

S12, acquiring the bandwidth utilization rate of the GPU video memory in real time through a GPU driver interface, wherein the bandwidth utilization rate records the bandwidth use condition of the GPU memory access through a GPU hardware counter;

And S13, acquiring the matrix operation delay of the NPU in real time through a management interface of the NPU hardware platform, wherein the matrix operation delay records the delay condition of the NPU for executing the matrix operation task through a hardware timer.

Optionally, the S2 specifically includes:

s21, carrying out normalization processing on the real-time performance data of each heterogeneous computing node to obtain a standardized performance data set ;

S22, aligning the normalized performance data of the heterogeneous computing nodes according to time sequence to obtain a performance data set in a time window, wherein,Time stamps representing different moments in time;

S23, fusing normalized performance data of different heterogeneous computing nodes by adopting a weighted average method to obtain a comprehensive performance index set ;

S24, combining the integrated performance index sets by a time sequence analysis-based moving average methodSmoothing to generate a fusion performance index set containing time correlation featuresTo capture long-term trends and short-term fluctuations in performance changes.

Optionally, the generating includes generating a fusion performance index set including time-related featuresExpressed as:

, wherein, Expressed in timeThe comprehensive performance index of the moment of time,The size of the sliding window, the length of the time window for smoothing,Representing each point in time within the sliding window,,The fusion performance index set after the sliding average smoothing treatment reflects the average performance of time points in a window.

Optionally, the step S3 specifically includes:

s31, node definition according to the fusion performance index set Each heterogeneous computing node is defined as a node in the graphWherein each nodeRepresenting a physical computing unit (CPU, GPU or NPU), the attributes of each node comprisingAnd the load of the computing node, available resources, current task state,Representing nodesAt the time ofA time fusion performance index set;

S32. edge definition-edges between nodes Representing nodesSum nodeCommunication connection and edge weight betweenNormalized values representing communication bandwidth and delay between nodes;

S33, dynamically updating based on real-time performance data and fusion performance index set Dynamically updating node attributes and edge weights, recalculating the load of the nodes, distributing task states and updating the attributes of the corresponding nodes when performance data is updated, and updating the edge weights according to the real-time communication bandwidth and delay among the nodes。

Optionally, the edge weightNormalized values representing communication bandwidth and delay between nodes are expressed as:

, wherein, Representing nodesAnd nodeThe bandwidth of the communication between them,Representing nodesAnd nodeThe communication delay between the two is such that,AndRepresenting the maximum of bandwidth and delay, respectively, between all nodes, for normalization,For edge weights, the higher the value is, the worse the bandwidth and delay performance of the communication path is.

Optionally, the step S4 specifically includes:

s41, defining topological constraint conditions, namely rubbing edge weights in the map with dynamic computing power For the basis, determining the topological connection relation among the physical computing units in the virtual resource pools, and defining topological constraint conditions, namely, for the nodes in each virtual resource poolBased on edge weightsDetermining other nodes with which it can communicateDefining constraint thresholdsIf nodeSum nodeEdge weights in betweenGreater than constraint thresholdThen task scheduling to the node is not allowedSum nodeOn a communication path between them;

S42, calculating the load mutation probability and setting the threshold value, namely setting the threshold value of the load mutation probability Set to 0.3 for each nodeCalculating the load mutation probabilityIf the load mutation probability exceeds a threshold valueThe risk of sudden change of the load of the corresponding node is high;

S43, generating an elastic expansion strategy set, namely generating the elastic expansion strategy set of the virtual resource pool according to the topological constraint condition and the load mutation probability threshold value Elastic telescoping strategy setComprising the following steps:

Defining the load, communication bandwidth and communication delay constraint of each node, and ensuring that the topology constraint condition is followed when task scheduling;

load mutation probability threshold value set for each node deployment Instructs when to make dynamic adjustments to the resources (e.g., expanding or contracting the number of nodes in the virtual resource pool, increasing or decreasing resource allocation, etc.).

Optionally, the step S5 specifically includes:

s51, evaluating task priority, namely, according to the calculation requirement, timeliness requirement and resource consumption factor of the task, each task to be allocated is evaluated Calculating task priority thereofExpressed as:

, wherein, For the deadline of the task,In order to be able to determine the degree of urgency of the task,In order to be a resource requirement of a task,Taking 0.5/0.3/0.2 as a weight factor, and adjusting according to specific scenes;

S52, screening topological constraint conditions, namely screening out a physical node set meeting the resource requirements and meeting the topological constraint conditions according to the topological constraint conditions Wherein each nodeEdge weights of (2)The following conditions are satisfied:;

s53, task mapping according to task priority And the load condition of the physical node, adopting a priority dynamic adjustment algorithm to carry out tasksMapping to physical nodes meeting topology constraints。

Optionally, the step S6 specifically includes:

S61, after the real-time scheduling of S5 is completed, collecting scheduling results of each task, including mapping nodes of the task, task completion time and node load conditions, and monitoring the obtained performance data, including node load, communication delay and bandwidth utilization rate in real time;

s62, performing topology conflict detection on the scheduling result by using a feedback verification module, and checking whether the condition of violating the topology constraint condition exists or not, wherein the detection of the topology conflict comprises the following steps:

checking whether the task mapping causes too high a communication delay between nodes;

checking edge weights between nodes Whether or not the constraint threshold is exceededIf the topology conflict exceeds the topology conflict is considered to occur;

Judging whether task allocation conditions which do not accord with topology constraints exist or not by comparing actual node connection conditions with predefined topology structures;

S63, if topology conflict is detected, triggering a conflict correction mechanism by a feedback verification module;

And S64, dynamically correcting the elastic expansion strategy set, namely adjusting the parameter weight of the elastic expansion strategy set in the collision correction process.

Optionally, the collision correction mechanism includes:

Readjusting the mapping of the tasks and selecting new nodes conforming to the topological constraint conditions;

and (3) adjusting the load distribution of the nodes to ensure that the communication bandwidth and the delay meet preset requirements.

The invention has the beneficial effects that:

According to the invention, the task is accurately mapped to the most suitable physical node based on real-time performance data and topology constraint conditions by adopting a dynamic algorithm topology map and a priority dynamic adjustment algorithm, the task allocation process can be optimized by comprehensively evaluating the priority of the task, the load state of the node and the topology constraint conditions, the resource waste and the excessive concentration of the load are avoided, the scheduling efficiency and the scheduling precision of the computing resources are obviously improved, in addition, the dynamic adjustment algorithm ensures that the task allocation can be continuously adapted to the system state change according to the real-time feedback and the analysis of historical data, and the scheduling precision is further improved;

The invention introduces a topology conflict detection and dynamic correction mechanism, can timely find and correct topology constraint conflicts in a real-time scheduling process, can rapidly detect potential topology conflicts in a task allocation process through a real-time feedback verification module, such as insufficient communication bandwidth, excessively high delay among nodes and the like, automatically adjusts topology constraint conditions when the conflicts are found, and dynamically corrects the threshold value of the edge weight.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only of the invention and that other drawings can be obtained from them without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;

Fig. 2 is a schematic diagram of an elastic stretching policy set according to an embodiment of the present invention.

Detailed Description

The invention will now be described in detail with reference to the drawings and to specific embodiments. While the invention has been described herein in detail in order to make the embodiments more detailed, the following embodiments are preferred and can be embodied in other forms as well known to those skilled in the art, and the accompanying drawings are only for the purpose of describing the embodiments more specifically and are not intended to limit the invention to the specific forms disclosed herein.

It should be noted that references in the specification to "one embodiment," "an example embodiment," "some embodiments," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the relevant art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Generally, the terminology may be understood, at least in part, from the use of context. For example, the term "one or more" as used herein may be used to describe any feature, structure, or characteristic in a singular sense, or may be used to describe a combination of features, structures, or characteristics in a plural sense, depending at least in part on the context. In addition, the term "based on" may be understood as not necessarily intended to convey an exclusive set of factors, but may instead, depending at least in part on the context, allow for other factors that are not necessarily explicitly described.

As shown in fig. 1-2, the cloud platform computing power resource performance monitoring and real-time scheduling optimization method comprises the following steps:

s3, constructing a dynamic algorithm topology map based on the fusion performance index set, wherein map nodes represent physical calculation units, and edge weights represent normalized values of communication bandwidth and delay between the nodes;

S4, generating an elastic expansion strategy set according to the dynamic algorithm topology map, wherein the elastic expansion strategy set comprises topology constraint conditions of a virtual resource pool and a load mutation probability threshold;

And S6, detecting topology conflict of the scheduling result through a feedback verification module, and dynamically correcting the parameter weight of the elastic telescopic strategy set.

S1 specifically comprises:

S11, acquiring the CPU instruction period occupancy rate in real time by installing a performance monitoring agent on each heterogeneous computing node, and acquiring the ratio of the idle period to the busy period of each CPU core by the instruction period occupancy rate through a monitoring tool;

S12, acquiring the bandwidth utilization rate of the GPU video memory in real time through a GPU driver interface, wherein the bandwidth utilization rate records the bandwidth utilization condition of the GPU memory access through a GPU hardware counter;

and S13, acquiring NPU matrix operation delay in real time through a management interface of the NPU hardware platform, and recording the delay condition of the NPU executing matrix operation task by the matrix operation delay through a hardware timer.

Calculation of CPU instruction period occupancy rateCalculated by the following formula:

, wherein, Representing the number of busy cycles of the CPU core in a cycle,Representing the total number of cycles (including idle cycles and busy cycles) of the CPU core within the same cycle;

calculating the bandwidth utilization rate of the GPU video memory, namely the bandwidth utilization rate of the GPU video memory Calculated by the following formula:

, wherein, Represents the actual amount of use of the GPU memory bandwidth,Representation ofMaximum bandwidth of the video memory;

Calculation of NPU matrix operation delay Calculated by the following formula:

, wherein, A time stamp indicating when the NPU began performing the matrix operation,A timestamp indicating the completion of the matrix operation by the NPU.

S2 specifically comprises:

s21, carrying out normalization processing on the real-time performance data of each heterogeneous computing node to obtain a standardized performance data set Expressed as: , wherein, Representing the actual value of a certain performance index (CPU instruction period occupancy rate, GPU video memory bandwidth utilization rate or NPU matrix operation delay) at a specific moment,Represents the average of the corresponding performance indicators in the historical data,Representing standard deviation of corresponding performance indexes in the historical data;

s22, aligning the normalized performance data of the heterogeneous computing nodes according to time sequence to obtain a performance data set in a time window , wherein,Time stamps representing different moments in time;

S23, fusing normalized performance data of different heterogeneous computing nodes by adopting a weighted average method to obtain a comprehensive performance index set Expressed as: , wherein, Representing the number of computing nodes that are to be counted,Represent the firstThe weight of the node is calculated, the weight value is determined by factors such as the load and the reliability of the node,Represent the firstNormalized performance data for the individual compute nodes;

s24, combining the integrated performance index sets by a time sequence analysis-based moving average method Smoothing to generate a fusion performance index set containing time correlation featuresTo capture long-term trends and short-term fluctuations in performance changes.

Generating a fused performance index set including time-related featuresExpressed as:

, wherein, Expressed in timeThe comprehensive performance index of the moment of time,The size of the sliding window, the length of the time window for smoothing,Representing each point in time within the sliding window,,The fusion performance index set after the sliding average smoothing treatment reflects the average performance of time points in a window;

through the above-mentioned moving average processing, short-term fluctuation can be effectively smoothed, long-term trend and time correlation characteristics can be captured, and a more stable performance index set with time correlation can be obtained Smooth and stable performance data is provided for subsequent dynamic scheduling optimization.

S3 specifically comprises:

S33, dynamically updating based on real-time performance data and fusion performance index set Dynamically updating node attributes and edge weights, recalculating the load of the nodes, distributing task states and updating the attributes of the corresponding nodes when performance data is updated, and updating the edge weights according to the real-time communication bandwidth and delay among the nodesThe map is guaranteed to reflect the real-time algorithm topology structure. The key idea is to dynamically adjust the attributes of nodes and edges in the algorithm topology map through updating the real-time performance data so as to ensure that the map can accurately reflect the state of the computing resources of the current cloud platform, thereby providing real-time support for scheduling optimization;

1. Each computing node (e.g., CPU, GPU, or NPU) has a corresponding node in the topology map that is not only an abstract representation of the computing resources, but whose attributes are also adjusted based on real-time performance data changes, including load and task allocation status, where:

the load is that along with the execution of the task and the change of the load, the load of the node is changed continuously, for example, if a certain node is executing a large number of calculation tasks, the load can be very high;

Task allocation status-the task execution status of each node may also affect the attributes of the nodes, for example, some nodes may have completed tasks while others are waiting for tasks or busy performing certain operations, dynamically updating such information, facilitating real-time knowledge of the current status of each node.

2. Edges in the graph represent communication paths between two nodes, and the edge weights are closely related to communication bandwidth and delay;

Bandwidth and latency-communication bandwidth and latency between nodes are important factors affecting task allocation efficiency, e.g., tasks can transfer data more quickly and efficiently when bandwidth between two nodes is higher and latency is lower, while connections with low bandwidth or high latency can cause performance bottlenecks;

Real-time updates-bandwidth and delay change as network conditions change (e.g., network traffic increases or decreases), and when these data change, the corresponding edge weights need to be recalculated to reflect the most current communication state, e.g., if bandwidth increases, edge weights may decrease because the path's communication performance is better, while delay increases, resulting in an increase in weights that indicate poor path performance.

3. Through the updating, the algorithm topology map always keeps the real-time mapping of the cloud platform computing resources, and each time performance data (such as load, bandwidth, delay and the like) change, nodes and edges in the map are updated correspondingly, so that the map can be ensured to accurately reflect the state of the current cloud platform, and a scheduling system can make decisions according to the latest topology map, for example, tasks are distributed to nodes with lower load and better communication performance, and therefore resource utilization rate and system performance are optimized.

In summary, the core goal of this section is to enable the power topology map to reflect the change in computing resources "in real time" ensuring that scheduling decisions can be made based on the current most accurate resource state rather than relying on outdated data, thereby improving the efficiency and reliability of the overall system.

Edge weightsNormalized values representing communication bandwidth and delay between nodes are expressed as:

S4 specifically comprises the following steps:

s41, defining topological constraint conditions, namely rubbing edge weights in the map with dynamic computing power For the basis, determining the topological connection relation among the physical computing units in the virtual resource pools, and defining topological constraint conditions, namely, for the nodes in each virtual resource poolBased on edge weightsDetermining other nodes with which it can communicateDefining constraint thresholdsIf nodeSum nodeEdge weights in betweenGreater than constraint thresholdThen task scheduling to the node is not allowedSum nodeOn the communication path between them, thereby avoiding performance degradation;

constraint threshold in a high performance computing scenario Taking 0.3, ensuring that only high bandwidth, low latency communication paths are selected;

Constraint threshold in general load or resource scheduling scenarios Taking 0.6, allowing a certain degree of communication delay and bandwidth bottleneck;

S42, calculating the load mutation probability and setting the threshold value, namely setting the threshold value of the load mutation probability Set to 0.3 for each nodeCalculating the load mutation probabilityIf the load mutation probability exceeds a threshold valueThe risk of the load mutation of the corresponding node is high, the resource quota of the node needs to be adjusted, and the probability threshold of the load mutation is calculatedThe method is used for judging whether the virtual resource pool needs to be elastically stretched or not, so that overload or resource waste is avoided;

probability of load abrupt change The calculation includes:

Collecting load data, namely acquiring nodes At successive points in timeLoad data;

Calculating the load variation amplitude, namely, calculating the load variation amplitude between adjacent time points, namely:;

Setting a mutation threshold Setting a sudden change threshold value of load change amplitude according to the load fluctuation range tolerated by the system,Is arranged as;

Calculating the load mutation frequency, calculating the load mutation frequency at continuous time pointsIn which the load variation amplitude exceeds a threshold valueIs the number of times:;

probability of load abrupt change The calculation is as follows:

, wherein, Is the total number of load data points, representing the time span of the data;

load mutation probability threshold The tolerance to load fluctuation is reflected by the system demand setting:

1. for tolerating high load abrupt changes, the probability of load abrupt changes is thresholded Set to a higher value (0.3 or 0.4);

2. If the tolerance to load variations is low, a low value (0.1 or 0.2) is set.

S43, generating an elastic expansion strategy set, namely generating the elastic expansion strategy set of the virtual resource pool according to the topological constraint condition and the load mutation probability threshold valueElastic telescoping strategy setComprising the following steps:

S5 specifically comprises the following steps:

s53, task mapping according to task priority And the load condition of the physical node, adopting a priority dynamic adjustment algorithm to carry out tasksMapping to physical nodes meeting topology constraintsThe mapping rule is:

Selecting tasks with highest priority ;

At a physical node setSelecting the node with the lightest current load and meeting the topology constraintSo that the taskIs satisfied and the load of the physical node does not exceed its own capacity threshold.

S6 specifically comprises the following steps:

and S64, dynamically correcting the elastic expansion strategy set, namely adjusting the parameter weight of the elastic expansion strategy set in the collision correction process, wherein the adjusting step comprises the following steps:

adjusting constraint threshold in topology constraint condition according to conflict detection result To adapt to the new task allocation mode, if the current system resource constraint is too strict, the communication between most nodes exceeds the constraint threshold, and the improvement can be properly carried outTo increase the flexibility of the system to allow more task allocation, if collision detection indicates that the system is too dependent on certain high bandwidth or low latency communication links, resulting in too high edge weights between partial nodes, which may cause excessive resource contention or load imbalance, may be reduced appropriatelyTo enhance control of the quality of communication between nodes.

Recalculating load mutation probability。

The collision correction mechanism includes:

The invention is intended to cover any alternatives, modifications, equivalents, and variations that fall within the spirit and scope of the invention. In the following description of preferred embodiments of the invention, specific details are set forth in order to provide a thorough understanding of the invention, and the invention will be fully understood to those skilled in the art without such details. In other instances, well-known methods, procedures, flows, components, circuits, and the like have not been described in detail so as not to unnecessarily obscure aspects of the present invention.

The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims

1. A method for monitoring the performance of computing resources on a cloud platform and optimizing real-time scheduling, characterized in that it comprises the following steps:

S1: Collecting real-time performance data of heterogeneous computing nodes, where the real-time performance data includes at least CPU instruction cycle occupancy, GPU memory bandwidth utilization, and NPU matrix operation delay;

S2: performing multi-dimensional index fusion on the real-time performance data to generate a fusion performance index set including time-related features;

S3: construct a dynamic computing power topology map based on the fusion performance indicator set, wherein the map nodes represent physical computing units, and the edge weights represent normalized values of communication bandwidth and delay between nodes;

S4: Generate an elastic scaling strategy set according to the dynamic computing power topology map, wherein the elastic scaling strategy set includes topological constraints of the virtual resource pool and a load mutation probability threshold;

S5: Using a priority dynamic adjustment algorithm to perform real-time scheduling, mapping the tasks to be assigned to physical nodes that meet the topology constraints;

S6: Perform topology conflict detection on the scheduling result through the feedback verification module, and dynamically correct the parameter weights of the elastic scaling strategy set.

2. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 1, characterized in that S1 specifically includes:

S11: by installing a performance monitoring agent on each heterogeneous computing node, the CPU instruction cycle occupancy rate is obtained in real time. The instruction cycle occupancy rate is obtained by using a monitoring tool to obtain the ratio of the idle cycle to the busy cycle of each CPU core;

S12: Obtain GPU memory bandwidth utilization in real time through a GPU driver interface, wherein the bandwidth utilization records bandwidth usage of GPU memory access through a GPU hardware counter;

S13: Obtain the NPU matrix operation delay in real time through the management interface of the NPU hardware platform. The matrix operation delay records the delay of the NPU executing the matrix operation task through a hardware timer.

3. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 1, characterized in that S2 specifically includes:

S21: Normalize the real-time performance data of each heterogeneous computing node to obtain a standardized performance data set ;

S22: Align the normalized performance data of multiple heterogeneous computing nodes in time series to obtain a performance data set within the time window ,in, Timestamps representing different moments;

S23: The normalized performance data of different heterogeneous computing nodes are integrated using the weighted average method to obtain a comprehensive performance index set. ;

S24: Sliding average method based on time series analysis for the fused comprehensive performance index set Perform smoothing to generate a fusion performance indicator set containing time-related features , to capture both long-term trends and short-term fluctuations in performance changes.

4. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 3 is characterized in that the generation of a fusion performance indicator set containing time-related features is expressed as:

,in, Indicates at time Comprehensive performance indicators at all times, Indicates the size of the sliding window, the length of the time window used for smoothing, Represents each time point in the sliding window, , It is a set of fusion performance indicators after sliding average smoothing, reflecting the average performance of the time points within the window.

5. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 1, wherein S3 specifically includes:

S31: Node definition: Based on the fusion performance indicator set , defining each heterogeneous computing node as a node in the graph , where each node Represents a physical computing unit. The properties of each node include As well as the load of the computing node, available resources, and current task status, Representation Node In time The set of fusion performance indicators at each moment;

S32: Edge definition: edges between nodes Representation Node and nodes The communication connection between them, the edge weight Represents the normalized value of communication bandwidth and delay between nodes;

S33: Dynamic update: based on real-time performance data and fused performance indicator set ,Dynamically update node attributes and edge weights. Whenever performance data is updated, recalculate the node load, task allocation status, and update the corresponding node attributes; at the same time, update the edge weights according to the real-time communication bandwidth and delay between nodes. .

6. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 5 is characterized in that the edge weight Represents the normalized value of the communication bandwidth and delay between nodes, expressed as:

,in, Representation Node With Node The communication bandwidth between Representation Node With Node The communication delay between and Respectively represent the maximum value of bandwidth and delay between all nodes, used for normalization, is the edge weight, which indicates the performance of the communication path. The higher the value, the worse the bandwidth and delay performance of the path.

7. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 6, characterized in that S4 specifically includes:

S41, topological constraint definition: based on the edge weights in the dynamic computing power topology graph Based on this, the topological connection relationship between the physical computing units in the virtual resource pool is determined, and the topological constraints are defined, that is, for each node in the virtual resource pool , based on edge weights Identify other nodes with which it can communicate , define the constraint threshold , if the node and nodes The edge weight between Greater than the constraint threshold , then the task is not allowed to be scheduled to the node and nodes On the communication path between

S42, load mutation probability calculation and threshold setting: setting load mutation probability threshold , for each node , calculate the probability of load mutation , if the load mutation probability exceeds the threshold , it means that the risk of load mutation of the corresponding node is high;

S43, generating an elastic scaling strategy set: generating an elastic scaling strategy set for the virtual resource pool according to the topology constraints and the load mutation probability threshold , elastic scaling policy set include:

Topological constraints of the virtual resource pool: define the load, communication bandwidth, and communication delay constraints of each node to ensure that the topological constraints are followed during task scheduling;

Load mutation probability threshold: The load mutation probability threshold set for each node deployment , guiding when to make dynamic adjustments to resources.

8. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 7, characterized in that S5 specifically includes:

S51, Task Priority Assessment: Based on the task's computing requirements, timeliness requirements, and resource consumption factors, each task to be assigned Calculate its task priority , expressed as:

,in, is the deadline for the task, The urgency of the task, is the resource requirement of the task, is the weight factor;

S52, topology constraint screening: According to the topology constraint, screen out the set of physical nodes that meet the resource requirements and satisfy the topology constraint. , where each node The edge weight The following conditions are met: ;

S53, Task Mapping: Based on Task Priority According to the load of physical nodes, the priority dynamic adjustment algorithm is used to adjust the tasks Mapping to physical nodes that satisfy topological constraints .

9. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 7, wherein S6 specifically comprises:

S61: After the real-time scheduling of S5 is completed, the scheduling results of each task are collected, including the mapping node of the task, the task completion time, and the node load. The performance data obtained through real-time monitoring includes node load, communication delay, and bandwidth utilization;

S62: Use the feedback verification module to perform topology conflict detection on the scheduling result to check whether there is any violation of the topology constraint condition. The topology conflict detection includes:

Check whether the task mapping causes excessive communication latency between nodes;

Check edge weights between nodes Whether the constraint threshold is exceeded ;

By comparing the actual node connection situation with the predefined topology structure, it is determined whether there is a task allocation situation that does not meet the topology constraints;

S63: If a topology conflict is detected, the feedback verification module triggers a conflict correction mechanism;

S64: Dynamically correct the elastic scaling policy set: During the conflict correction process, adjust the parameter weights of the elastic scaling policy set.

10. The cloud platform computing resource performance monitoring and real-time scheduling optimization method according to claim 9, characterized in that the conflict correction mechanism includes:

Re-adjust the task mapping and select new nodes that meet the topological constraints;

Adjust the load distribution of nodes to ensure that communication bandwidth and latency meet preset requirements.