CN115756854B - Automatic operation and maintenance method and device for cluster project, storage medium and computer equipment - Google Patents

Automatic operation and maintenance method and device for cluster project, storage medium and computer equipment

Info

Publication number
CN115756854B
Authority
CN
China
Prior art keywords
workload
maintenance
cluster
data
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211476924.XA
Other languages
Chinese (zh)
Other versions
CN115756854A (en
Inventor
侯记强
王刚
马幸晖
梁苑文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Digital Life Technology Co Ltd
Original Assignee
Tianyi Digital Life Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Digital Life Technology Co Ltd filed Critical Tianyi Digital Life Technology Co Ltd
Priority to CN202211476924.XA priority Critical patent/CN115756854B/en
Publication of CN115756854A publication Critical patent/CN115756854A/en
Application granted granted Critical
Publication of CN115756854B publication Critical patent/CN115756854B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

With the cluster project automatic operation and maintenance method, device, storage medium, and computer equipment provided in the present application, before cluster projects are automatically operated and maintained in the current operation and maintenance cycle, the aggregated analysis data of each workload in each cluster project is first obtained and used to determine the operation and maintenance data of each workload. The operation and maintenance data is then analyzed to determine the target workloads that need to be adjusted, as well as the intermediate cluster project in which each target workload is located. Next, based on the operation and maintenance project group obtained when the user adjusted the scope of cluster projects for automatic operation and maintenance in the previous cycle, the target cluster projects that need automatic operation and maintenance are screened out from all intermediate cluster projects to form a set of projects to be operated and maintained. The workloads corresponding to each target cluster project in that set are then automatically adjusted in turn, which improves the stability of the overall cluster service and realizes automatic operation and maintenance of each cluster project.

Description

Automatic operation and maintenance method and device for cluster project, storage medium and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for automatically operating and maintaining a cluster project, a storage medium, and a computer device.
Background
With the popularization of internet technology, container technology has become the mainstream way of implementing micro services on PaaS (platform as a service), and the Kubernetes cluster (K8S cluster for short) is an outstanding choice among the currently popular container orchestration and management technologies. Each node in the cluster has limited CPU and memory resources, and as more and more micro services run in the K8S cluster, it becomes particularly important to request resources and quotas reasonably and to periodically optimize the resources of each service in the K8S cluster.
In the current cloud native field, a K8S cluster may host thousands of micro services. When a service is deployed in its K8S cluster, the expected usage of resources such as CPU and memory is estimated for that service, and the requests and limits values of its CPU and memory resources are initialized accordingly. Requests and limits values that are too large or too small can affect the stability of the overall cluster service. Moreover, the actual resource usage of each micro service changes as factors such as business, demand, or business traffic change, so that the requests and limits values become unreasonable relative to the resources the service currently occupies, which further affects the stability of the overall cluster service.
Disclosure of Invention
The application aims to solve at least one of the above technical defects, in particular the defect in the prior art that the initialized requests and limits values of the CPU and memory resources of each micro service in a cluster become unreasonable relative to the resources the service currently occupies, thereby affecting the stability of the overall cluster service.
The application provides an automatic operation and maintenance method for cluster projects, which comprises the following steps:
in the current operation and maintenance period, acquiring the aggregated analysis data of each workload in each cluster item, and an operation and maintenance item group obtained by a user adjusting the cluster item range of automatic operation and maintenance according to the aggregated analysis data of the previous operation and maintenance period;
determining operation and maintenance data of each workload based on the aggregated analysis data of each workload;
analyzing the operation and maintenance data of each workload, and determining a plurality of target workloads that need to be adjusted among the workloads and the intermediate cluster item where each target workload is located;
according to the operation and maintenance item group, selecting target cluster items that need automatic operation and maintenance from all the intermediate cluster items to form a set of items to be operated and maintained;
and according to the operation and maintenance data of each workload, automatically adjusting the target workload corresponding to each target cluster item in the set of items to be operated and maintained in turn, so as to realize the automatic operation and maintenance of each cluster item.
Optionally, in the current operation and maintenance period, acquiring the aggregate analysis data of each workload in each cluster project includes:
in the current operation and maintenance period, obtaining a CPU resource average value, a CPU resource maximum value, a memory resource average value and a memory resource maximum value of each workload in each cluster project;
And according to a preset adjustment coefficient, adjusting the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload to obtain a CPU resource request value corresponding to the CPU resource average value, a CPU resource limit value corresponding to the CPU resource maximum value, a memory resource request value corresponding to the memory resource average value and a memory resource limit value corresponding to the memory resource maximum value, and forming the aggregate analysis data of each workload.
Optionally, the obtaining, in the current operation and maintenance period, a CPU resource average value, a CPU resource maximum value, a memory resource average value, and a memory resource maximum value of each workload in each cluster item includes:
in the current operation and maintenance period, acquiring performance indexes of each workload in each cluster project, wherein the performance indexes comprise CPU (Central processing Unit) resource indexes and memory resource indexes corresponding to each workload;
Performing data format conversion on the CPU resource index of each workload to obtain a CPU resource average value and a CPU resource maximum value of each workload;
and performing data format conversion on the memory resource index of each workload to obtain a memory resource average value and a memory resource maximum value of each workload.
Optionally, the determining operation and maintenance data of each workload based on the aggregate analysis data includes:
obtaining persisted data of each workload from the previous operation and maintenance period, wherein the persisted data is obtained by performing secondary aggregation on the aggregated analysis data of each workload in the previous operation and maintenance period;
and integrating the persisted data of each workload with the aggregated analysis data of each workload to obtain the operation and maintenance data of each workload.
Optionally, the selecting, according to the operation and maintenance item group, a target cluster item that needs to be automatically operated and maintained from all the intermediate cluster items to form a set of items to be operated and maintained includes:
And determining the middle cluster items which are the same as the cluster items in the operation and maintenance item group in all the middle cluster items, and forming a set of items to be operated and maintained after taking the determined middle cluster items as target cluster items needing to be operated and maintained automatically.
Optionally, the automatically adjusting, according to the operation and maintenance data of each workload, the target workload corresponding to each target cluster item in the to-be-operated and maintained item set in turn includes:
and aiming at a target workload corresponding to each target cluster item in the to-be-operated and maintained item set, acquiring operation and maintenance data of the target workload, analyzing the operation and maintenance data, determining a performance index to be adjusted of the target workload, and automatically adjusting the performance index.
Optionally, the method further comprises:
and after the automatic operation and maintenance of each cluster item in the current operation and maintenance period is finished, updating the performance index of the target workload corresponding to each target cluster item in the to-be-operated and maintained item set.
The application also provides an automatic operation and maintenance device for cluster projects, which comprises the following modules:
The aggregation analysis data acquisition module is used for acquiring the aggregation analysis data of each workload in each cluster project in the current operation and maintenance period, and an operation and maintenance project group obtained by adjusting the cluster project range of automatic operation and maintenance according to the aggregation analysis data of the previous operation and maintenance period by a user;
The operation and maintenance data acquisition module is used for determining operation and maintenance data of each workload based on the convergence analysis data;
the data analysis module is used for analyzing the operation and maintenance data of each workload and determining a plurality of target workloads which need to be adjusted in each workload and an intermediate cluster item where each target workload is located;
The data screening module is used for screening target cluster items that need automatic operation and maintenance from all the intermediate cluster items according to the operation and maintenance item group, to form a set of items to be operated and maintained;
And the automatic operation and maintenance module is used for sequentially and automatically adjusting the target workload corresponding to each target cluster item in the to-be-operated item set according to the operation and maintenance data of each workload so as to realize the automatic operation and maintenance of each cluster item.
The present application also provides a storage medium having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the automatic operation and maintenance method for cluster projects as in any of the above embodiments.
The application also provides a computer device comprising one or more processors and a memory;
The memory has stored therein computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the automatic operation and maintenance method for cluster projects of any of the above embodiments.
From the above technical solutions, the embodiment of the present application has the following advantages:
According to the cluster project automatic operation and maintenance method, device, storage medium and computer equipment provided by the application, before cluster projects are automatically operated and maintained in the current operation and maintenance period, the aggregated analysis data of each workload in each cluster project can be acquired first and used to determine the operation and maintenance data of each workload. The operation and maintenance data can then be analyzed to determine the actual resource usage in the cluster projects and whether the performance indexes of each workload are reasonable relative to the resources the service currently occupies; if they are not, adjustment is needed. After the target workloads that need to be adjusted and the intermediate cluster item where each target workload is located are determined, the application can screen out, from all intermediate cluster items, the target cluster items that need automatic operation and maintenance according to the operation and maintenance item group obtained when the user adjusted the cluster item range of automatic operation and maintenance in the previous period, forming a set of items to be operated and maintained. The target workload corresponding to each target cluster item in the set is then automatically adjusted in turn according to the operation and maintenance data of each workload, which improves the stability of the overall cluster service and realizes the automatic operation and maintenance of each cluster item.
Drawings
In order to illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of an automatic operation and maintenance method for cluster projects according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an automatic operation and maintenance device for cluster projects according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
With the popularization of internet technology, container technology has become the mainstream way of implementing micro services on PaaS, and the Kubernetes cluster stands out among the currently popular container orchestration and management technologies. Each node in the cluster has limited CPU and memory resources, and as more and more micro services run in the K8S cluster, it becomes particularly important to request resources and quotas reasonably and to periodically optimize the resources of each service in the K8S cluster.
In the current cloud native field, a K8S cluster may host thousands of micro services. When a service is deployed in its K8S cluster, the expected usage of resources such as CPU and memory is estimated for that service, and the requests and limits values of its CPU and memory resources are initialized accordingly. Requests and limits values that are too large or too small can affect the stability of the overall cluster service. Moreover, the actual resource usage of each micro service changes as factors such as business, demand, or business traffic change, so that the requests and limits values become unreasonable relative to the resources the service currently occupies, which further affects the stability of the overall cluster service.
Therefore, the application aims to solve the technical problem in the prior art that the initialized requests and limits values of the CPU and memory resources of each micro service in a cluster become unreasonable relative to the resources the service currently occupies, thereby affecting the stability of the overall cluster service, and provides the following technical scheme:
In one embodiment, as shown in FIG. 1, which is a flow chart of an automatic operation and maintenance method for cluster items provided by an embodiment of the application, the method specifically includes the following steps:
S110, in the current operation and maintenance period, acquiring the aggregated analysis data of each workload in each cluster project, and an operation and maintenance project group obtained by a user adjusting the cluster project range of automatic operation and maintenance according to the aggregated analysis data of the previous operation and maintenance period.
In this step, before the cluster items are automatically operated and maintained in the current operation and maintenance period, the performance index data of each workload in each cluster item for the current period can be obtained first and then aggregated into the aggregated analysis data of the current period, from which the operation and maintenance data of each workload is subsequently derived.
It should be understood that the cluster items herein refer to the items in each cluster. A cluster is a group of mutually independent servers interconnected through a high-speed network, which form a group and are managed as a single system; when a client interacts with a cluster, the cluster behaves like a single server. The Kubernetes cluster is currently the most popular container orchestration and management technology. In the application, a server in a cluster project can be represented as a container, and the automatic operation and maintenance of a cluster project consists of adjusting the workloads whose resource occupation in the container is unreasonable.
For example, in a Kubernetes cluster, the monitoring data indexes of the K8S cluster, which include the container resource index data of the K8S cluster, may be collected and stored at a fixed time every day. When a new operation and maintenance period begins and the system initiates an automatic operation and maintenance request, the system may obtain the daily data of the previous operation and maintenance period and aggregate and adjust it to obtain the aggregated analysis data of each workload in each cluster project. The operation and maintenance period may be set and adjusted according to the current cluster type and the configuration of the automatic operation and maintenance platform, for example 7 days or 10 days, which is not limited herein.
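The aggregation of stored daily samples into per-workload averages and maxima might be sketched as follows. This is a minimal illustration only: the `UsageSummary` type, the `summarize_period` function, the sample layout and the units (CPU cores, MiB) are assumptions made for the example and are not specified by the application.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class UsageSummary:
    """Hypothetical per-workload summary for one operation and maintenance period."""
    cpu_avg: float  # CPU cores (assumed unit)
    cpu_max: float
    mem_avg: float  # MiB (assumed unit)
    mem_max: float


def summarize_period(samples):
    """Aggregate stored samples, given as (cpu_cores, mem_mib) tuples."""
    if not samples:
        raise ValueError("no samples collected for this period")
    cpu = [s[0] for s in samples]
    mem = [s[1] for s in samples]
    return UsageSummary(mean(cpu), max(cpu), mean(mem), max(mem))


# Three stored samples for one workload (illustrative values).
samples = [(0.2, 300.0), (0.4, 350.0), (0.3, 400.0)]
summary = summarize_period(samples)
```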
Further, the user can adjust the cluster project range of the automatic operation and maintenance according to the convergence analysis data of the previous operation and maintenance period to obtain an operation and maintenance project group, namely after the convergence analysis data of each workload in each cluster project is obtained in the previous operation and maintenance period, the convergence analysis data can be returned to the user, so that the user can adjust the cluster project range of the automatic operation and maintenance according to the convergence analysis data.
Furthermore, the aggregated analysis data may be returned to the user through a scheduled mail sending task that sends the aggregated analysis data obtained in the current operation and maintenance period to the user. After receiving and checking the aggregated analysis data, the user can, according to the running condition of each item in the cluster, adjust on the automatic operation and maintenance platform whether the item may be automatically operated and maintained in the next operation and maintenance period, thereby determining the operation and maintenance item group for the next period.
And S120, determining operation and maintenance data of each workload based on the aggregation analysis data of each workload.
In this step, the aggregated analysis data of the current operation and maintenance period obtained in step S110 may be integrated with the aggregated analysis data of the previous operation and maintenance period to determine the operation and maintenance data of each workload in each cluster project, so as to screen out the workloads whose resource data needs to be adjusted.
For example, when the automatic operation and maintenance period of the K8S cluster is 7 days and a new period starts, the automatic operation and maintenance platform, after obtaining the aggregated analysis data, performs secondary aggregation on the aggregated analysis data of the previous operation and maintenance period to obtain the operation and maintenance data of each workload in the K8S cluster, and then determines from the information contained in the operation and maintenance data whether the container resources occupied by the workload are reasonable.
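The application does not spell out how the persisted data and the current aggregated analysis data are combined. One possible reading, sketched below purely as an assumption, averages the two period averages and keeps the larger of the two maxima; the `integrate` function and the dictionary keys are hypothetical.

```python
def integrate(previous, current):
    """Combine the persisted record of the previous period with the current one.

    Assumed rule (not taken from the application): average the averages,
    keep the larger maximum.
    """
    if previous is None:  # first period: nothing has been persisted yet
        return dict(current)
    return {
        "cpu_avg": (previous["cpu_avg"] + current["cpu_avg"]) / 2,
        "cpu_max": max(previous["cpu_max"], current["cpu_max"]),
        "mem_avg": (previous["mem_avg"] + current["mem_avg"]) / 2,
        "mem_max": max(previous["mem_max"], current["mem_max"]),
    }


previous = {"cpu_avg": 0.25, "cpu_max": 1.0, "mem_avg": 256.0, "mem_max": 700.0}
current = {"cpu_avg": 0.75, "cpu_max": 0.5, "mem_avg": 512.0, "mem_max": 900.0}
ops_data = integrate(previous, current)
```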
And S130, analyzing the operation and maintenance data, and determining each target workload which needs to be adjusted in each workload and an intermediate cluster item where each target workload is located.
In this step, after the operation and maintenance data of each workload is obtained in step S120, the operation and maintenance data of each workload may be further analyzed, and each target workload that needs to be adjusted in each workload and the intermediate cluster item where each target workload is located may be determined according to the analysis result.
Specifically, after analysis of the operation and maintenance data, the use condition of the CPU and the memory resources of each container in the K8S cluster can be further determined according to the analysis result, and the set values of the CPU and the memory resource initialization requests and limits are checked to determine whether the occupation of the container resources is reasonable, if not, the workload of the container is confirmed to be adjusted, so that all the target workloads needing to be adjusted and the middle cluster item where each target workload is located can be obtained according to the analysis result.
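The check described above can be illustrated with a short sketch. The 20% tolerance, the field names and the `needs_adjustment` and `find_targets` helpers are assumptions chosen for the example; the application only states that the configured requests and limits are checked against the analyzed usage.

```python
def needs_adjustment(configured, recommended, tolerance=0.2):
    """Return True if a configured value is missing or deviates from its
    recommended value by more than the (assumed) relative tolerance."""
    for key, target in recommended.items():
        current = configured.get(key)
        if current is None:
            return True
        if target > 0 and abs(current - target) / target > tolerance:
            return True
    return False


def find_targets(workloads):
    """Return the target workloads and the intermediate projects containing them."""
    targets = [
        w for w in workloads
        if needs_adjustment(w["configured"], w["recommended"])
    ]
    intermediate_projects = sorted({w["project"] for w in targets})
    return targets, intermediate_projects


workloads = [
    {"project": "shop", "name": "api",
     "configured": {"cpu_request": 1.0}, "recommended": {"cpu_request": 0.25}},
    {"project": "billing", "name": "worker",
     "configured": {"cpu_request": 0.5}, "recommended": {"cpu_request": 0.5}},
]
targets, intermediate_projects = find_targets(workloads)
```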
It should be noted that, in order to schedule resources efficiently and use them fully, the K8S cluster uses two constraint types, requests and limits, to allocate resources at container granularity, and each container can set its own requests and limits. These two parameters are set through the resources field of each container's containerSpec: requests defines the minimum amount of resources required by the container, and limits defines the upper bound on the resources the container can consume, which prevents excessive consumption from causing resource shortages or even downtime. In general, requests matters at scheduling time and limits matters at run time.
For example, for a Spring Boot service container, the request is the minimum amount of resources that the JVM in the container image needs to occupy. If the memory request of the Pod is set to 10Mi, the memory (Xms) actually occupied by the JVM exceeds the memory allocated to the Pod by the K8S cluster, which causes the Pod memory to overflow, so that the K8S cluster keeps restarting the Pod. Setting limits to 0 indicates that the resources used are not limited, and when limits is set but requests is not, the K8S cluster sets requests equal to limits by default. A Pod is the smallest atomic scheduling unit in a K8S cluster.
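To make the Spring Boot example concrete, the sketch below models a container's resources field as a Python dictionary and compares the memory request with an assumed JVM Xms setting. The container name, the CPU values, the 512Mi limit and the 256Mi heap size are hypothetical, and `memory_bytes` deliberately handles only a small subset of Kubernetes memory quantities (plain bytes and the Ki, Mi, Gi suffixes).

```python
BINARY_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def memory_bytes(quantity):
    """Parse a memory quantity such as '10Mi' into bytes (subset of suffixes only)."""
    for suffix, factor in BINARY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


# Hypothetical container entry; only the 10Mi memory request comes from the text.
container = {
    "name": "spring-boot-app",
    "resources": {
        "requests": {"cpu": "250m", "memory": "10Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    },
}

jvm_xms = memory_bytes("256Mi")  # assumed -Xms value
request = memory_bytes(container["resources"]["requests"]["memory"])
request_too_small = jvm_xms > request
```

Here `request_too_small` flags the situation described in the text, where the heap the JVM actually occupies is larger than the memory requested for the Pod.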
And S140, selecting target cluster items that need automatic operation and maintenance from all the intermediate cluster items according to the operation and maintenance item group, to form a set of items to be operated and maintained.
In this step, after the intermediate cluster items are determined in step S130, the target cluster items that need automatic operation and maintenance may be screened out from all the intermediate cluster items according to the operation and maintenance item group set by the user in advance, forming a set of items to be operated and maintained, so that each target cluster item in the set can be automatically operated and maintained.
Specifically, after obtaining the intermediate cluster items corresponding to the containers whose unreasonable resource occupation needs to be adjusted, the automatic operation and maintenance system can screen them again according to the operation and maintenance item group set by the user in advance, determine the target cluster items that need automatic operation and maintenance, and form the set of items to be operated and maintained. This avoids unnecessary problems caused by automatic operation and maintenance overlapping with an item's critical running time.
Further, the operation and maintenance item group can be changed after the aggregated analysis data is obtained in each operation and maintenance period, and it serves as the basis for screening the intermediate cluster items in the next operation and maintenance period. The user can further adjust it according to the running conditions of the cluster items in the current operation and maintenance period and determine which items may be automatically operated and maintained in the next period, thereby obtaining the operation and maintenance item group.
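The screening in this step amounts to keeping only the intermediate cluster items that also appear in the user-defined operation and maintenance item group. A minimal sketch, with hypothetical project names and a hypothetical `select_targets` helper:

```python
def select_targets(intermediate_projects, maintenance_group):
    """Keep the intermediate projects that the user allowed for automatic
    operation and maintenance, preserving their original order."""
    allowed = set(maintenance_group)
    return [p for p in intermediate_projects if p in allowed]


intermediate_projects = ["shop", "billing", "search"]
maintenance_group = {"shop", "search", "reports"}
to_maintain = select_targets(intermediate_projects, maintenance_group)
```

In this example "billing" needs adjustment but was excluded by the user, so it is skipped, while "reports" is in the group but has no workload to adjust.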
And S150, according to the operation and maintenance data of each workload, automatically adjusting the target workload corresponding to each target cluster item in the to-be-operated item set in turn so as to realize the automatic operation and maintenance of each cluster item.
In this step, after determining the to-be-operated item set in step S140, the target workload corresponding to each target cluster item in the to-be-operated item set may be automatically adjusted in sequence according to the operation data, so as to implement automatic operation and maintenance of each cluster item.
Specifically, when the automatic operation and maintenance of the cluster items is performed, the CPU and memory requests and limits values of the target workload corresponding to each target cluster item in the set of items to be operated and maintained can be adjusted automatically, one item after another in a loop, according to the reference values obtained by analyzing the operation and maintenance data, so that the stability of the overall cluster service is improved and the automatic operation and maintenance of each cluster item is completed.
It should be noted that, when a single thread executes the loop and controls the modification or restart of multiple services in the same cluster, only a single automatic operation and maintenance item is allowed to restart within a preset time in a single cluster. Generally, the restart time of a single automatic operation and maintenance item may be set to 30 s; when cluster resources are sufficient, 30 s is enough to complete the adjustment and the allocation of startup resources for a single item. If the automatic operation and maintenance platform allows, the preset time may be adjusted, which is not limited herein.
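The single-threaded loop with a preset interval between restarts could look roughly like the sketch below. The `adjust_serially` function and its `apply_change` callback are hypothetical; the callback stands in for whatever call actually updates a workload, and the `sleep` parameter is injectable only so that the loop can be exercised without waiting.

```python
import time


def adjust_serially(workloads, apply_change, interval_s=30.0, sleep=time.sleep):
    """Apply changes one workload at a time, waiting interval_s between
    consecutive workloads so that only one restarts within the preset time."""
    done = []
    for index, workload in enumerate(workloads):
        if index > 0:
            sleep(interval_s)  # leave time for the previous restart to settle
        apply_change(workload)
        done.append(workload)
    return done


# Dry run with recording stand-ins instead of real updates and real waiting.
applied, waits = [], []
result = adjust_serially(["api", "worker", "gateway"], applied.append, sleep=waits.append)
```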
In the above embodiment, before cluster projects are automatically operated and maintained in the current operation and maintenance period, the aggregated analysis data of each workload in each cluster project can be acquired first and used to determine the operation and maintenance data of each workload. The operation and maintenance data can then be analyzed to determine the actual resource usage in the cluster projects and whether the performance indexes of each workload are reasonable relative to the resources the service currently occupies; if they are not, adjustment is needed. After the target workloads that need to be adjusted and the intermediate cluster item where each target workload is located are determined, the application can screen out, from all intermediate cluster items, the target cluster items that need automatic operation and maintenance according to the operation and maintenance item group obtained when the user adjusted the cluster item range of automatic operation and maintenance in the previous period, forming a set of items to be operated and maintained. The target workload corresponding to each target cluster item in the set is then automatically adjusted in turn according to the operation and maintenance data of each workload, which improves the stability of the overall cluster service and realizes the automatic operation and maintenance of each cluster item.
In one embodiment, in the current operation and maintenance period in step S110, acquiring the aggregate analysis data of each workload in each cluster item may include:
and S111, acquiring a CPU resource average value, a CPU resource maximum value, a memory resource average value and a memory resource maximum value of each workload in each cluster project in the current operation and maintenance period.
And S112, according to a preset adjustment coefficient, adjusting the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload to obtain a CPU resource request value corresponding to the CPU resource average value, a CPU resource limit value corresponding to the CPU resource maximum value, a memory resource request value corresponding to the memory resource average value and a memory resource limit value corresponding to the memory resource maximum value, and forming the aggregated analysis data of each workload.
In this embodiment, before the cluster items are automatically operated and maintained in the current operation and maintenance period, the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload in each cluster item may first be acquired. These four values are then adjusted according to a preset adjustment coefficient to obtain a CPU resource request value corresponding to the CPU resource average value, a CPU resource limit value corresponding to the CPU resource maximum value, a memory resource request value corresponding to the memory resource average value and a memory resource limit value corresponding to the memory resource maximum value, which together form the aggregate analysis data of each workload.
Specifically, when determining the aggregate analysis data, the adjustment coefficient for the average values of the CPU and memory resources and the adjustment coefficient for the maximum values of the CPU and memory resources may be determined first. The request values and limit values are then calculated from the average values, the maximum values and the corresponding adjustment coefficients, and together form the aggregate analysis data.
Furthermore, the adjustment coefficients of the present application may be obtained by first surveying each cluster item to obtain initial values, and then tuning those values according to the cluster stability and the historical CPU and memory resource occupancy in the cluster item. For example, the default adjustment coefficients adopted in the present application are 0.8 for the average values and 1.2 for the maximum values.
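The calculation in step S112 can be illustrated with a minimal sketch. The text only states that the average and maximum values are "adjusted" by the coefficients; applying the coefficients by multiplication is an assumption made here for illustration, and the names `UsageStats`, `AggregateAnalysis` and `aggregate` are hypothetical.

```python
from dataclasses import dataclass

# Default coefficients quoted in the description.
# Applying them by multiplication is an assumption of this sketch.
AVG_COEFFICIENT = 0.8
MAX_COEFFICIENT = 1.2


@dataclass(frozen=True)
class UsageStats:
    cpu_avg: float  # CPU cores, averaged over the period
    cpu_max: float  # CPU cores, peak over the period
    mem_avg: float  # bytes, averaged over the period
    mem_max: float  # bytes, peak over the period


@dataclass(frozen=True)
class AggregateAnalysis:
    cpu_request: float
    cpu_limit: float
    mem_request: float
    mem_limit: float


def aggregate(stats, avg_coeff=AVG_COEFFICIENT, max_coeff=MAX_COEFFICIENT):
    """Derive request values from averages and limit values from maxima."""
    return AggregateAnalysis(
        cpu_request=stats.cpu_avg * avg_coeff,
        cpu_limit=stats.cpu_max * max_coeff,
        mem_request=stats.mem_avg * avg_coeff,
        mem_limit=stats.mem_max * max_coeff,
    )
```

Keeping the coefficients as parameters reflects the statement that they may be tuned per cluster item according to stability and historical occupancy.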
In one embodiment, acquiring the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload in each cluster item in the current operation and maintenance period in step S111 may include:
S1111, in the current operation and maintenance period, acquiring the performance index of each workload in each cluster item, wherein the performance index comprises a CPU resource index and a memory resource index corresponding to each workload.
S1112, performing data format conversion on the CPU resource index of each workload to obtain the CPU resource average value and the CPU resource maximum value of each workload.
S1113, performing data format conversion on the memory resource index of each workload to obtain the memory resource average value and the memory resource maximum value of each workload.
In this embodiment, before the cluster items are automatically operated and maintained in the current operation and maintenance period, the performance index of each workload in each cluster item may first be acquired. The CPU resource index and the memory resource index in each performance index are then subjected to data format conversion to obtain the index data corresponding to each performance index, where the index data includes the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload.
Specifically, the performance index of each workload in the cluster item can be collected at regular intervals and converted into index data in a uniform format, so that the resource occupation in the cluster item becomes visible and can serve as the basis for automatic adjustment of the cluster item. The automatic operation and maintenance platform can then obtain the index data from the cluster for analysis and calculation, thereby obtaining the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload.
Further, the performance index of each workload in the cluster item can be obtained by using Prometheus, an open-source system monitoring and alerting system. In the K8S cluster automatic operation and maintenance platform, the performance index of each container can be monitored and collected with Prometheus, and the automatic operation and maintenance platform can obtain the index data from Prometheus by using PromQL. PromQL is the query language built into the Prometheus monitoring system; it can flexibly transform and compute time series data, for example by selection and aggregation, and it is used only for reading data.
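As an illustration of how average and maximum values might be read with PromQL, the following sketch builds query strings and parses an instant-query response. The description does not name any metrics; `container_cpu_usage_seconds_total` and `container_memory_working_set_bytes` are the container metrics commonly exposed by cAdvisor and are assumed here. The server address, the one-day window, the five-minute step and the per-pod grouping are likewise assumptions, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with the real endpoint.
PROMETHEUS_URL = "http://prometheus.example:9090"


def build_queries(namespace, window="1d", step="5m"):
    """Build PromQL for average and peak CPU (cores) and memory (bytes) per pod."""
    selector = f'{{namespace="{namespace}", container!=""}}'
    cpu = f"sum by (pod) (rate(container_cpu_usage_seconds_total{selector}[{step}]))"
    mem = f"sum by (pod) (container_memory_working_set_bytes{selector})"
    return {
        # Subqueries evaluate the inner expression every `step` over `window`.
        "cpu_avg": f"avg_over_time(({cpu})[{window}:{step}])",
        "cpu_max": f"max_over_time(({cpu})[{window}:{step}])",
        "mem_avg": f"avg_over_time(({mem})[{window}:{step}])",
        "mem_max": f"max_over_time(({mem})[{window}:{step}])",
    }


def query_url(promql):
    """Return the instant-query URL of the Prometheus HTTP API for a PromQL string."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Map pod name to value from an instant-query JSON response (already decoded)."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return {
        series["metric"].get("pod", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }
```

The sketch stops short of issuing the HTTP request, so it can be combined with whichever HTTP client the platform already uses.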
It should be noted that the automatic operation and maintenance platform can obtain index data and store it in a database in each operation and maintenance period. To relieve storage pressure on the database, the automatic operation and maintenance platform can clear historical data periodically.
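Periodic clearing of historical data could look like the sketch below. The description does not specify a database, schema or retention period; SQLite, the table `index_data`, its columns and the retention window are all assumptions chosen only to keep the example self-contained.

```python
import sqlite3


def create_store():
    """Create an in-memory store with a hypothetical index_data table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE index_data (workload TEXT, collected_at INTEGER, cpu_avg REAL)"
    )
    return conn


def prune_history(conn, now_ts, retention_seconds):
    """Delete rows older than the retention window and return how many were removed."""
    cutoff = now_ts - retention_seconds
    cur = conn.execute("DELETE FROM index_data WHERE collected_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

A scheduler on the platform side would call `prune_history` once per operation and maintenance period, or at any other fixed interval.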
In one embodiment, determining the operation and maintenance data of each workload based on the aggregate analysis data of each workload in step S120 may include:
S121, acquiring the persistent data of each workload in the previous operation and maintenance period.
S122, integrating the persistent data of each workload with the aggregate analysis data of each workload to obtain the operation and maintenance data of each workload.
In this embodiment, after the aggregate analysis data of each workload in the current operation and maintenance period is obtained, the persistent data of each workload in the previous operation and maintenance period can be acquired and integrated with the aggregate analysis data of the current period to obtain the operation and maintenance data of each workload, from which the workloads whose resource data need to be adjusted are screened out.
It can be understood that the persistent data here is obtained by performing secondary aggregation on the aggregate analysis data of each workload in the previous operation and maintenance period; that is, the four index data of each workload that needs automatic operation and maintenance are aggregated, preferably into a single piece of data.
For example, after the index data of each workload are calculated and adjusted to obtain the aggregate analysis data, the four index data in the aggregate analysis data, namely the CPU resource request value, the CPU resource limit value, the memory resource request value and the memory resource limit value of each workload, are secondarily aggregated, preferably into a single piece of data. In the next operation and maintenance period, the data obtained by the secondary aggregation are integrated and analyzed together with the newly acquired aggregate analysis data to obtain the data of the workloads that have not themselves undergone automatic operation and maintenance.
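The secondary aggregation and the integration step can be sketched as follows. The description does not define the record layout or the exact integration rule; representing each index as a `(workload, index name, value)` row, collapsing those rows into one record per workload, and pairing the previous record with the current one are assumptions of this sketch, and the function names are hypothetical.

```python
def secondary_aggregate(rows):
    """Collapse per-index rows (workload, index name, value) into one record per workload."""
    records = {}
    for workload, index_name, value in rows:
        records.setdefault(workload, {})[index_name] = value
    return records


def integrate(persisted, current):
    """Pair each workload's persisted record with its newly aggregated record.

    A workload without persisted data gets None as its previous record.
    """
    return {
        workload: {"previous": persisted.get(workload), "current": record}
        for workload, record in current.items()
    }
```

For instance, four rows for one workload (CPU request, CPU limit, memory request, memory limit) become a single dictionary, which is what would be persisted for use in the next operation and maintenance period.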
In one embodiment, screening out, according to the operation and maintenance item group, the target cluster items that need automatic operation and maintenance from all the intermediate cluster items to form a set of items to be operated and maintained in step S140 may include:
S141, determining, among all the intermediate cluster items, the intermediate cluster items that are the same as the cluster items in the operation and maintenance item group, and taking the determined intermediate cluster items as the target cluster items that need automatic operation and maintenance to form the set of items to be operated and maintained.
In this embodiment, according to an operation and maintenance item group set by a user on an automatic operation and maintenance platform, an intermediate cluster item identical to a cluster item in the operation and maintenance item group is determined in all intermediate cluster items, and after the determined intermediate cluster item is used as a target cluster item to be automatically operated and maintained, a set of items to be operated and maintained is formed.
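The screening in step S141 amounts to keeping only those intermediate cluster items that also appear in the operation and maintenance item group. A minimal sketch, assuming cluster items are identified by name and using the hypothetical function name `select_targets`:

```python
def select_targets(intermediate_items, maintenance_group):
    """Keep the intermediate cluster items that the user placed in the maintenance group.

    The order of intermediate_items is preserved so that the later
    adjustment can proceed item by item in a stable sequence.
    """
    group = set(maintenance_group)
    return [item for item in intermediate_items if item in group]
```

For example, with intermediate items `["shop", "billing", "search"]` and a group containing `"billing"` and `"search"`, only those two items enter the set of items to be operated and maintained.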
It should be noted that the operation and maintenance item group needs to be set before automatic operation and maintenance of the cluster items is started for the first time. Thereafter, in each operation and maintenance period, the user can be notified by email after the aggregate analysis data are acquired. This gives the user a data notification and a time window in which to adjust the automatic operation and maintenance range for the next operation and maintenance period, which avoids unnecessary problems caused by the critical operating time of a cluster item overlapping with the automatic operation and maintenance time in the next period.
In one embodiment, automatically adjusting, in turn and according to the operation and maintenance data of each workload, the target workload corresponding to each target cluster item in the set of items to be operated and maintained in step S150 may include:
S151, for the target workload corresponding to each target cluster item in the set of items to be operated and maintained, acquiring the operation and maintenance data of the target workload, analyzing the operation and maintenance data, determining a performance index of the target workload to be adjusted, and automatically adjusting the performance index.
In this embodiment, for the target workload corresponding to each target cluster item in the set of items to be operated and maintained, the operation and maintenance data corresponding to the target workload may be acquired and analyzed to obtain the actual CPU and memory resource usage of the target workload. The performance index of the target workload to be adjusted is then determined according to the initial CPU and memory requests values and limits values set for the target workload, and the performance index is automatically adjusted.
It should be noted that, in each cluster item, the requests value or limits value of a workload may be unreasonable relative to the resources currently occupied by its service. If the requests value is too large, resources are wasted and the resource utilization rate of the cluster suffers; if the requests value is too small, the stability of the current service and of the other services in the cluster item is affected. If the limits value is too large, excessive overcommitment results and the stability of the overall cluster service is affected; if the limits value is too small, OOM (Out Of Memory) errors or repeated restarts may occur. Therefore, the requests value or limits value of the target workload needs to be automatically adjusted.
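One way to decide whether a requests or limits value needs adjusting, and to express the adjustment, is sketched below. The description gives no decision rule; the relative tolerance of 20% is an assumption, as are the function names. The patch body follows the standard Kubernetes layout for a Deployment's container `resources` field, with CPU in millicores and memory in mebibytes.

```python
def needs_adjustment(current, recommended, tolerance=0.2):
    """True when the configured value deviates from the recommendation by more than tolerance."""
    if recommended == 0:
        return current != 0
    return abs(current - recommended) / recommended > tolerance


def build_patch(container, cpu_request, cpu_limit, mem_request, mem_limit):
    """Build a patch body that sets the resources of one container in a Deployment.

    CPU values are given in cores and memory values in bytes.
    """
    def cpu(cores):
        return f"{round(cores * 1000)}m"

    def mem(num_bytes):
        return f"{round(num_bytes / 2**20)}Mi"

    resources = {
        "requests": {"cpu": cpu(cpu_request), "memory": mem(mem_request)},
        "limits": {"cpu": cpu(cpu_limit), "memory": mem(mem_limit)},
    }
    return {
        "spec": {"template": {"spec": {"containers": [
            {"name": container, "resources": resources}
        ]}}}
    }
```

The tolerance keeps small fluctuations from triggering a rollout in every period; how the patch is submitted to the cluster is left to the platform.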
In one embodiment, the method may further comprise:
S160, after the automatic operation and maintenance of each cluster item in the current operation and maintenance period is finished, updating the performance index of the target workload corresponding to each target cluster item in the set of items to be operated and maintained.
In this embodiment, after the automatic operation and maintenance of each cluster item in the current operation and maintenance period is finished, the performance index of the target workload corresponding to each target cluster item in the set of items to be operated and maintained is updated, so that Prometheus can obtain the latest performance index of each workload in the cluster in a timely manner.
The automatic operation and maintenance device for cluster projects provided by the embodiment of the application is described below. The device described below and the automatic operation and maintenance method for cluster projects described above may be referred to in correspondence with each other.
In one embodiment, as shown in fig. 2, which is a schematic structural diagram of an automatic operation and maintenance device for cluster projects according to an embodiment of the present application, the application further provides an automatic operation and maintenance device for cluster projects, including an aggregate analysis data acquisition module 210, an operation and maintenance data acquisition module 220, a data analysis module 230, a data screening module 240, and an automatic operation and maintenance module 250, specifically as follows:
The aggregate analysis data acquisition module 210 is configured to acquire, in the current operation and maintenance period, the aggregate analysis data of each workload in each cluster item, and the operation and maintenance item group obtained by a user adjusting the range of automatically maintained cluster items according to the aggregate analysis data of the previous operation and maintenance period.
The operation and maintenance data acquisition module 220 is configured to determine the operation and maintenance data of each workload based on the aggregate analysis data of each workload.
The data analysis module 230 is configured to analyze the operation and maintenance data of each workload and determine a plurality of target workloads that need to be adjusted among the workloads, and the intermediate cluster item where each target workload is located.
The data screening module 240 is configured to screen out, according to the operation and maintenance item group, the target cluster items that need automatic operation and maintenance from all the intermediate cluster items to form a set of items to be operated and maintained.
The automatic operation and maintenance module 250 is configured to automatically adjust, in turn and according to the operation and maintenance data of each workload, the target workload corresponding to each target cluster item in the set of items to be operated and maintained, so as to realize automatic operation and maintenance of each cluster item.
In the above embodiment, before the cluster items are automatically operated and maintained in the current operation and maintenance period, the aggregate analysis data of each workload in each cluster item may first be acquired to determine the operation and maintenance data of each workload. The operation and maintenance data may then be analyzed to determine the actual resource usage in each cluster item and whether the performance index of each workload is reasonable relative to the resources currently occupied by its service; if not, the workload needs to be adjusted. After the target workloads that need to be adjusted and the intermediate cluster item where each target workload is located have been determined, the application can screen out, from all the intermediate cluster items, the target cluster items that need automatic operation and maintenance, according to the operation and maintenance item group that the user obtained by adjusting the range of automatically maintained cluster items in the previous operation and maintenance period. These target cluster items form the set of items to be operated and maintained. The target workload corresponding to each target cluster item in that set is then automatically adjusted in turn according to the operation and maintenance data of each workload, which improves the stability of the overall cluster service and realizes automatic operation and maintenance of each cluster item.
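The cooperation of modules 210 to 250 within one operation and maintenance period can be sketched as a single function that receives each module's behavior as a callable. This is only an illustrative wiring under assumed interfaces: the parameter names, the shape of the data passed between modules and the function name `run_cycle` are all hypothetical.

```python
def run_cycle(collect, to_ops_data, find_targets, maintenance_group, adjust):
    """Run one operation and maintenance period and return the adjusted workloads.

    collect()            -> aggregate analysis data per workload        (module 210)
    to_ops_data(data)    -> operation and maintenance data per workload (module 220)
    find_targets(ops)    -> {target workload: intermediate cluster item} (module 230)
    maintenance_group    -> cluster items the user allows to be adjusted
    adjust(workload, d)  -> applies the adjustment for one workload     (module 250)
    """
    aggregate_data = collect()
    ops_data = to_ops_data(aggregate_data)
    targets = find_targets(ops_data)

    # Module 240: keep only intermediate cluster items in the user's group.
    pending_items = sorted(
        {item for item in targets.values() if item in maintenance_group}
    )

    adjusted = []
    for item in pending_items:  # adjust the items one after another
        for workload, owner in targets.items():
            if owner == item:
                adjust(workload, ops_data[workload])
                adjusted.append(workload)
    return adjusted
```

Passing the module behaviors in as callables keeps the sequencing separate from data collection and from the cluster API, so each part can be replaced or tested on its own.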
In one embodiment, the aggregate analysis data acquisition module 210 may include:
A resource data acquisition sub-module, configured to acquire the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload in each cluster item in the current operation and maintenance period.
A resource data adjustment sub-module, configured to adjust the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload according to a preset adjustment coefficient to obtain a CPU resource request value corresponding to the CPU resource average value, a CPU resource limit value corresponding to the CPU resource maximum value, a memory resource request value corresponding to the memory resource average value and a memory resource limit value corresponding to the memory resource maximum value, which together form the aggregate analysis data of each workload.
In one embodiment, the resource data acquisition sub-module may include:
The performance index obtaining unit is used for obtaining the performance index corresponding to each workload in each cluster project in the current operation and maintenance period, wherein the performance index comprises a CPU resource index and a memory resource index corresponding to each workload.
The CPU resource data acquisition unit is used for carrying out data format conversion on the CPU resource index of each workload to obtain the CPU resource average value and the CPU resource maximum value of each workload.
The memory resource data acquisition unit is used for carrying out data format conversion on the memory resource index of each workload to obtain the average value and the maximum value of the memory resource of each workload.
In one embodiment, the operation and maintenance data acquisition module 220 may include:
A persistent data acquisition sub-module, configured to acquire the persistent data of each workload in the previous operation and maintenance period, wherein the persistent data is data obtained by performing secondary aggregation on the aggregate analysis data of each workload in the previous operation and maintenance period.
A data integration module, configured to integrate the persistent data of each workload with the aggregate analysis data of each workload to obtain the operation and maintenance data of each workload.
In one embodiment, the data screening module 240 may include:
A data screening sub-module, configured to determine, among all the intermediate cluster items, the intermediate cluster items that are the same as the cluster items in the operation and maintenance item group, and to take the determined intermediate cluster items as the target cluster items that need automatic operation and maintenance to form the set of items to be operated and maintained.
In one embodiment, the automated operation and maintenance module 250 may include:
An automatic operation and maintenance sub-module, configured to, for the target workload corresponding to each target cluster item in the set of items to be operated and maintained, acquire the operation and maintenance data of the target workload, analyze the operation and maintenance data, determine a performance index of the target workload to be adjusted, and automatically adjust the performance index.
In one embodiment, the apparatus may further include:
The performance index updating module is used for updating the performance index of the target workload corresponding to each target cluster item in the to-be-operated item set after the automatic operation and maintenance of each cluster item in the current operation and maintenance period is finished.
In one embodiment, the present application also provides a storage medium having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the automatic operation and maintenance method for cluster projects according to any one of the above embodiments.
In one embodiment, the present application also provides a computer device having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the automatic operation and maintenance method for cluster projects according to any one of the above embodiments.
Schematically, as shown in fig. 3, which is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application, the computer device 300 may be provided as a server. Referring to fig. 3, the computer device 300 includes a processing component 302, which further includes one or more processors, and memory resources, represented by a memory 301, for storing instructions executable by the processing component 302, such as application programs. The application program stored in the memory 301 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 302 is configured to execute the instructions to perform the automatic operation and maintenance method for cluster projects of any of the embodiments described above.
The computer device 300 may also include a power supply component 303 configured to perform power management of the computer device 300, a wired or wireless network interface 304 configured to connect the computer device 300 to a network, and an input/output (I/O) interface 305. The computer device 300 may operate based on an operating system stored in the memory 301, such as Windows Server™, Mac OS X™, Unix, Linux, FreeBSD™, or the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 3 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, the embodiments may be combined as needed, and identical or similar parts may be cross-referenced.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An automatic operation and maintenance method for cluster projects, characterized in that the method comprises:
in the current operation and maintenance period, acquiring the aggregate analysis data of each workload in each cluster item, and an operation and maintenance item group obtained by a user adjusting the range of automatically maintained cluster items according to the aggregate analysis data of the previous operation and maintenance period;
determining operation and maintenance data of each workload based on the aggregate analysis data of each workload;
analyzing the operation and maintenance data of each workload, and determining a plurality of target workloads that need to be adjusted among the workloads and an intermediate cluster item where each target workload is located;
screening out, according to the operation and maintenance item group, target cluster items that need automatic operation and maintenance from all the intermediate cluster items to form a set of items to be operated and maintained;
automatically adjusting, in turn and according to the operation and maintenance data of each workload, the target workload corresponding to each target cluster item in the set of items to be operated and maintained, so as to realize automatic operation and maintenance of each cluster item;
wherein determining the operation and maintenance data of each workload based on the aggregate analysis data of each workload comprises:
acquiring persistent data of each workload in the previous operation and maintenance period, wherein the persistent data is data obtained by performing secondary aggregation on the aggregate analysis data of each workload in the previous operation and maintenance period;
integrating the persistent data of each workload with the aggregate analysis data of each workload to obtain the operation and maintenance data of each workload;
and wherein automatically adjusting, in turn and according to the operation and maintenance data of each workload, the target workload corresponding to each target cluster item in the set of items to be operated and maintained comprises:
for the target workload corresponding to each target cluster item in the set of items to be operated and maintained, acquiring the operation and maintenance data of the target workload, analyzing the operation and maintenance data, determining a performance index of the target workload to be adjusted, and automatically adjusting the performance index.
2. The automatic operation and maintenance method for cluster projects according to claim 1, wherein acquiring the aggregate analysis data of each workload in each cluster item in the current operation and maintenance period comprises:
in the current operation and maintenance period, acquiring a CPU resource average value, a CPU resource maximum value, a memory resource average value and a memory resource maximum value of each workload in each cluster item;
adjusting the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload according to a preset adjustment coefficient to obtain a CPU resource request value corresponding to the CPU resource average value, a CPU resource limit value corresponding to the CPU resource maximum value, a memory resource request value corresponding to the memory resource average value and a memory resource limit value corresponding to the memory resource maximum value, which together form the aggregate analysis data of each workload.
3. The automatic operation and maintenance method for cluster projects according to claim 2, wherein acquiring, in the current operation and maintenance period, the CPU resource average value, the CPU resource maximum value, the memory resource average value and the memory resource maximum value of each workload in each cluster item comprises:
in the current operation and maintenance period, acquiring the performance index of each workload in each cluster item, wherein the performance index comprises a CPU resource index and a memory resource index corresponding to each workload;
performing data format conversion on the CPU resource index of each workload to obtain the CPU resource average value and the CPU resource maximum value of each workload;
performing data format conversion on the memory resource index of each workload to obtain the memory resource average value and the memory resource maximum value of each workload.
4. The automatic operation and maintenance method for cluster projects according to claim 1, wherein screening out, according to the operation and maintenance item group, the target cluster items that need automatic operation and maintenance from all the intermediate cluster items to form the set of items to be operated and maintained comprises:
determining, among all the intermediate cluster items, the intermediate cluster items that are the same as the cluster items in the operation and maintenance item group, and taking the determined intermediate cluster items as the target cluster items that need automatic operation and maintenance to form the set of items to be operated and maintained.
5. The automatic operation and maintenance method for cluster projects according to claim 1, wherein the method further comprises:
after the automatic operation and maintenance of each cluster item in the current operation and maintenance period is finished, updating the performance index of the target workload corresponding to each target cluster item in the set of items to be operated and maintained.
6. An automatic operation and maintenance device for cluster projects, characterized in that the device comprises:
an aggregate analysis data acquisition module, configured to acquire, in the current operation and maintenance period, the aggregate analysis data of each workload in each cluster item, and an operation and maintenance item group obtained by a user adjusting the range of automatically maintained cluster items according to the aggregate analysis data of the previous operation and maintenance period;
an operation and maintenance data acquisition module, configured to determine operation and maintenance data of each workload based on the aggregate analysis data;
a data analysis module, configured to analyze the operation and maintenance data of each workload and determine a plurality of target workloads that need to be adjusted among the workloads and an intermediate cluster item where each target workload is located;
a data screening module, configured to screen out, according to the operation and maintenance item group, target cluster items that need automatic operation and maintenance from all the intermediate cluster items to form a set of items to be operated and maintained;
an automatic operation and maintenance module, configured to automatically adjust, in turn and according to the operation and maintenance data of each workload, the target workload corresponding to each target cluster item in the set of items to be operated and maintained, so as to realize automatic operation and maintenance of each cluster item;
wherein the operation and maintenance data acquisition module comprises:
a persistent data acquisition sub-module, configured to acquire persistent data of each workload in the previous operation and maintenance period, wherein the persistent data is data obtained by performing secondary aggregation on the aggregate analysis data of each workload in the previous operation and maintenance period;
a data integration module, configured to integrate the persistent data of each workload with the aggregate analysis data of each workload to obtain the operation and maintenance data of each workload;
and wherein the automatic operation and maintenance module comprises:
an automatic operation and maintenance sub-module, configured to, for the target workload corresponding to each target cluster item in the set of items to be operated and maintained, acquire the operation and maintenance data of the target workload, analyze the operation and maintenance data, determine a performance index of the target workload to be adjusted, and automatically adjust the performance index.
7. A storage medium having computer-readable instructions stored therein which, when executed by one or more processors, cause the one or more processors to perform the steps of the cluster item automatic operation and maintenance method of any one of claims 1 to 5.
8. A computer device, comprising one or more processors and a memory;
the memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the cluster item automatic operation and maintenance method of any one of claims 1 to 5.
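For illustration only, the module chain recited in claim 6 (data integration, analysis, screening, adjustment) might be sketched as follows. This is a minimal sketch and not the patented implementation: the Metrics fields, the sample-weighted merge, the CPU thresholds and the action names are all assumptions introduced here, since the claims do not specify how integration or adjustment is computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metrics:
    """Per-workload figures; both fields are illustrative assumptions."""
    cpu_usage: float  # fraction of the CPU limit in use, 0.0 to 1.0
    samples: int      # number of observations behind cpu_usage


def integrate(persistent: Optional[Metrics], current: Metrics) -> Metrics:
    """Merge the previous period's persistent data with the current data.

    Assumption: a sample-weighted mean; the claims only say "integrate".
    """
    if persistent is None:
        return current
    total = persistent.samples + current.samples
    cpu = (persistent.cpu_usage * persistent.samples
           + current.cpu_usage * current.samples) / total
    return Metrics(cpu_usage=cpu, samples=total)


def plan_adjustments(persistent, current, ops_item_group, high=0.8, low=0.2):
    """Return (cluster_item, workload, action) tuples in a stable order.

    persistent and current map (cluster_item, workload) keys to Metrics.
    ops_item_group is the set of cluster items the user opted in.
    """
    # Data acquisition module: operation and maintenance data per workload.
    ops_data = {key: integrate(persistent.get(key), metrics)
                for key, metrics in current.items()}

    # Data analysis module: target workloads; their cluster items are the
    # intermediate cluster items. The thresholds are hypothetical.
    targets = {key: m for key, m in ops_data.items()
               if m.cpu_usage > high or m.cpu_usage < low}

    # Data screening module: keep only intermediate cluster items that are
    # in the user's operation and maintenance item group.
    to_operate = {item for item, _ in targets} & set(ops_item_group)

    # Automatic operation and maintenance module: handle targets in sequence.
    plan = []
    for item, workload in sorted(targets):
        if item not in to_operate:
            continue
        usage = targets[(item, workload)].cpu_usage
        action = "raise_cpu_limit" if usage > high else "lower_cpu_limit"
        plan.append((item, workload, action))
    return plan
```

In this sketch a workload in a cluster item outside ops_item_group is detected as a target but never adjusted, which mirrors the screening step of the claim.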
CN202211476924.XA 2022-11-23 2022-11-23 Automatic operation and maintenance method and device for cluster project, storage medium and computer equipment Active CN115756854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211476924.XA CN115756854B (en) 2022-11-23 2022-11-23 Automatic operation and maintenance method and device for cluster project, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN115756854A (en) 2023-03-07
CN115756854B (en) 2025-09-09

Family

ID=85336253

Family Applications (1)

Application Number Priority Date Filing Date Title
CN202211476924.XA Active CN115756854B (en) 2022-11-23 2022-11-23 Automatic operation and maintenance method and device for cluster project, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN115756854B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113296840A (en) * 2020-02-20 2021-08-24 银联数据服务有限公司 Cluster operation and maintenance method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171668B2 (en) * 2001-12-17 2007-01-30 International Business Machines Corporation Automatic data interpretation and implementation using performance capacity management framework over many servers
US9838269B2 (en) * 2011-12-27 2017-12-05 Netapp, Inc. Proportional quality of service based on client usage and system metrics
US8978037B1 (en) * 2013-09-05 2015-03-10 zIT Consulting GmbH System and method for managing workload performance on billed computer systems
CN108667666A (en) * 2018-05-20 2018-10-16 北京工业大学 An intelligent operation and maintenance method and system based on visualization technology
CN110933178B (en) * 2019-12-09 2022-02-01 聚好看科技股份有限公司 Method for adjusting node configuration in cluster system and server
CN111431748B (en) * 2020-03-20 2022-09-30 支付宝(杭州)信息技术有限公司 Method, system and device for automatically operating and maintaining cluster
CN114116185A (en) * 2020-08-26 2022-03-01 上海佳投互联网技术集团有限公司 A method for automatically optimizing load balancing weights
CN114118223A (en) * 2021-11-02 2022-03-01 浪潮云信息技术股份公司 IT operation and maintenance optimization method

Also Published As

Publication number Publication date
CN115756854A (en) 2023-03-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant