CN105183573B - Online identification method and system for high-frequency continuous failure tasks in a cloud computing system

Info

Publication number: CN105183573B
Application number: CN201510649451.2A
Authority: CN
Inventors: 李影; 唐红艳; 贾统; 吴中海; 张齐勋
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2015-10-09
Filing date: 2015-10-09
Publication date: 2017-12-01
Anticipated expiration: 2035-10-09
Also published as: CN105183573A

Abstract

The invention discloses an online identification method and system for high-frequency continuous failure tasks in a cloud computing system. Time-series-based offline analysis and learning are performed on offline monitoring data to obtain a failure frequency threshold that, at a given confidence level, represents the failure frequency characteristics of all non-high-frequency continuous failure tasks; this threshold is then used to identify high-frequency continuous failure tasks in online data. The invention analyzes tasks in the cloud computing system from the perspectives of events and resources, that is, the frequency of failure events within a time period and the system resources consumed by the task. By analyzing the failure frequency characteristics of tasks and the time-series pattern of their resource usage, it identifies in real time the high-frequency continuous failure tasks that fail repeatedly and are difficult to repair, so that the cloud computing system can be notified in advance to take proactive failure recovery measures, saving system resources and improving the reliability and availability of the cloud computing system.

Description

Online identification method and system for high-frequency continuous failure tasks in cloud computing system

Technical field

The invention belongs to the technical field of cloud computing, and in particular relates to an online identification method and system for high-frequency continuous failure tasks in a cloud computing system.

Background

With its on-demand consumption model, cloud computing has been widely adopted in fields such as finance and commerce, and high availability in cloud environments has increasingly become key to the maturity of cloud computing technology. However, as cloud computing systems grow in scale and heterogeneity, failures of various types occur frequently, which has become one of the key threats to system availability and reliability. In a cloud computing system, a task is the smallest scheduling unit running on a single node; it is the basic guarantee that user applications execute normally, and it is also the level at which failures commonly occur. To improve availability, fault tolerance, and automatic recovery, the usual way for current cloud computing systems to handle a task failure and recover quickly is to resubmit the task for scheduling.

However, cloud computing systems often contain tasks with the following characteristics: 1) high-frequency failure: because failure events occur repeatedly during their life cycle, their total failure count is far greater than that of other tasks; 2) failure continuity: they cannot be quickly recovered by restarting after a failure, so there is almost no normal running state between any two failures, that is, the failed state is continuous. We therefore refer to such tasks as "high-frequency continuous failure tasks".

Analysis shows that high-frequency continuous failure tasks exhibit the following specific patterns: 1) resource usage pattern: their resource usage differs significantly between failure time and running time, and because their failures are continuous, their resource consumption shows a clear segmented trend over the time series; 2) failure frequency characteristic: their number of failures per unit time (failure frequency) is far greater than that of other tasks. These patterns are important features that distinguish high-frequency continuous failure tasks from other tasks. However, the prior art offers no effective method for identifying them. Although such a task is rescheduled by the system immediately after each failure, it cannot be quickly recovered by restarting and instead fails again and again after repeated scheduling. Repeated failure not only wastes a large amount of system resources but also increases the load on the cluster scheduler, poses a potential hazard to the cloud computing system, and makes it difficult to meet the system's high-availability requirements.

Summary of the invention

To overcome the above shortcomings of the prior art, the invention provides an online identification system and method for high-frequency continuous failure tasks in a cloud computing system. It identifies, accurately and in real time, tasks that fail frequently and cannot be quickly repaired, and notifies the cloud computing system in advance so that it can take proactive failure recovery measures, thereby saving system resources, reducing the load on the system scheduler, and improving the reliability and availability of the cloud computing system.

The principle of the invention is as follows. Tasks in the cloud computing system are analyzed from two perspectives, events and resources: the event perspective is the frequency of failure events within a time period, and the resource perspective is the system resources (such as CPU, memory, and disk) consumed by a task within a time period. By analyzing the failure frequency characteristics and resource usage time-series patterns of high-frequency continuous failure tasks and non-high-frequency continuous failure tasks, the invention identifies in real time the high-frequency continuous failure tasks that fail repeatedly and are difficult to repair, which avoids unnecessary resource waste and scheduling load and improves the reliability and availability of the cloud computing system.

The technical solution provided by the invention is:

An online identification method for high-frequency continuous failure tasks in a cloud computing system, in which time-series-based offline analysis and learning are performed on offline monitoring data to obtain a failure frequency threshold that, at a given confidence level, represents the failure frequency characteristics of all non-high-frequency continuous failure tasks, and the high-frequency continuous failure tasks in online data are then identified; the method comprises the following steps:

1) Extract event and resource time-series data from the offline monitoring data and convert them into a specific format to obtain the offline data format conversion result, which includes task failure counts and resource usage;

2) Configure parameter values, the parameters comprising a failure count threshold, a failure continuity index threshold, a resource change threshold, and a confidence threshold; the failure count threshold and the failure continuity index threshold are used to define high-frequency continuous failure tasks;

3) According to the offline data format conversion result of step 1) and the parameters of step 2) that define high-frequency continuous failure tasks (the failure count threshold and the failure continuity index threshold), label each task in the offline data as a high-frequency continuous failure task or a non-high-frequency continuous failure task;

4) Using the resource change threshold and the confidence threshold of step 2), analyze the resource usage patterns and failure frequency characteristics of the non-high-frequency continuous failure tasks, and learn a failure frequency threshold that, at a given confidence level, represents the failure frequency characteristics of all non-high-frequency continuous failure tasks;

5) Input online monitoring data in real time, extract the event and resource time-series data of the tasks, and convert them into the specific format to obtain the online data format conversion result, which includes task failure counts and resource usage;

6) According to the online data format conversion result obtained in step 5) and the failure frequency threshold obtained in step 4), identify the high-frequency continuous failure tasks in the online data in real time.

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, the conversion of the event and resource time-series data into the specific format in step 1) is specifically:

Suppose the event and resource time-series data contain n tasks, where task i (1 ≤ i ≤ n) is submitted to the system at time t_{i0} and resides in the system for T_i time periods; T_i triples {t, F_{i,t}, R_{i,t}} are extracted for each task i;

For the event and resource time-series data, the resulting specific format is a list of quadruples {i, t, F_{i,t}, R_{i,t}}, where t ∈ {t_{i0}, t_{i0}+1, …, t_{i0}+T_i−1} denotes each time period during which task i (1 ≤ i ≤ n) resides in the system; F_{i,t} is the failure count of task i in the t-th time period, obtained by counting the failure events in the offline data, and F_{i,t} is 0 if no failure event occurs for task i in the t-th time period; R_{i,t} is the resource usage of task i in the t-th time period, collected by the monitoring system and passed in through the offline data source.
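The conversion into quadruples {i, t, F_{i,t}, R_{i,t}} can be sketched in Python as follows. This is only an illustrative sketch: the function name `to_records`, the shapes of its three inputs, and the choice to record a usage of 0.0 when the monitoring data has no sample for a period are assumptions, not part of the described method.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One quadruple: (task id i, period t, failure count F_{i,t}, usage R_{i,t})
Record = Tuple[int, int, int, float]


def to_records(
    residency: Dict[int, Tuple[int, int]],
    failure_events: List[Tuple[int, int]],
    usage: Dict[Tuple[int, int], float],
) -> List[Record]:
    """Build the quadruple list for all tasks.

    residency:      task id -> (t_i0, T_i), submission period and residence length
    failure_events: one (task id, period) pair per observed failure event
    usage:          (task id, period) -> resource usage from the monitoring system
    """
    counts = Counter(failure_events)  # failures per (task, period)
    records: List[Record] = []
    for task_id, (t0, periods) in sorted(residency.items()):
        for t in range(t0, t0 + periods):
            # F_{i,t} is 0 when no failure event occurred in period t.
            f_it = counts.get((task_id, t), 0)
            # Assumption: a missing usage sample is recorded as 0.0.
            r_it = usage.get((task_id, t), 0.0)
            records.append((task_id, t, f_it, r_it))
    return records
```

Each task contributes exactly T_i records, one per period of residence, so later steps can group the list by task id and read the periods in order.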

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, the labeling of tasks in the offline data as high-frequency continuous failure tasks or non-high-frequency continuous failure tasks in step 3) is specifically:

From the failure counts of each task in the offline data of step 1), compute the task's total failure count; compute the failure continuity index as the percentage of consecutive failures in the task's total failure count; label a task as a high-frequency continuous failure task if its total failure count is greater than the failure count threshold of step 2) and its failure continuity index is greater than the failure continuity index threshold of step 2); otherwise label it as a non-high-frequency continuous failure task.
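A minimal Python sketch of this labeling rule follows. The text does not spell out exactly how consecutive failures are counted, so the sketch assumes that the failures of a period count as consecutive when both that period and the preceding one contain at least one failure; the function names are hypothetical.

```python
from typing import Sequence


def continuity_index(failures: Sequence[int]) -> float:
    """Share of a task's failures that fall in 'consecutive' periods.

    failures[k] is the failure count F_{i,t} of the k-th period of residence.
    Assumption: a period's failures are consecutive when the previous period
    also had at least one failure.
    """
    total = sum(failures)
    if total == 0:
        return 0.0
    consecutive = sum(
        cur for prev, cur in zip(failures, failures[1:]) if prev > 0 and cur > 0
    )
    return consecutive / total


def is_high_frequency_continuous(
    failures: Sequence[int],
    count_threshold: float,
    continuity_threshold: float,
) -> bool:
    """Apply both conditions: total count and continuity index above thresholds."""
    return (
        sum(failures) > count_threshold
        and continuity_index(failures) > continuity_threshold
    )
```

A task that fails in every other period has a continuity index of 0 under this counting rule, so it is labeled non-high-frequency continuous however many failures it accumulates.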

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, learning in step 4) the failure frequency threshold that, at a given confidence level, represents the failure frequency characteristics of all non-high-frequency continuous failure tasks specifically comprises the following steps:

41) Divide resource anomaly windows;

Derive the degree of resource anomaly from the resource usage and, using the resource change threshold set in step 2), divide the residence time of each non-high-frequency continuous failure task in the system into multiple resource anomaly windows according to the degree of resource anomaly, obtaining the resource anomaly window division result for each task;

42) Compute the failure frequency;

From the resource anomaly window division result of each task obtained in step 41), compute the failure frequency of each task within each resource anomaly window; then take the maximum of a task's failure frequencies over all its windows as the failure frequency of that task;

43) Learn the failure frequency threshold;

From the failure frequencies of all non-high-frequency continuous failure tasks obtained in step 42), fit the cumulative distribution function of task failure frequency and, using the confidence threshold set in step 2), compute the failure frequency threshold that represents the failure frequencies of all non-high-frequency continuous failure tasks at a given confidence level.

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, the resource anomaly windows in step 41) are divided as follows:

If the resource usage of a task in the t-th time period differs significantly from that in period t−1, the t-th time period is assigned to a new resource anomaly window; otherwise t and t−1 belong to the same window;

Whether the resource usage differs significantly is measured by the resource usage variation V_{i,t}, which is computed from the resource usage R_{i,t} of task i at time t by Formula 1:

(Formula 1)

In Formula 1, V_{i,t} is the resource usage variation; R_{i,t−1} is the resource usage of task i at time t−1; R_{i,t} is the resource usage of task i at time t; p(R_i) denotes the order of magnitude of the resource usage of task i and is used to normalize the resource usage variation, so that differences in the order of magnitude of resource usage between tasks do not distort it. If V_{i,t} is greater than the resource change threshold, a new window is opened; otherwise the period remains in the previous window.
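The window division can be illustrated with the Python sketch below. The body of Formula 1 is not reproduced in this text, so the sketch assumes the form V_{i,t} = |R_{i,t} − R_{i,t−1}| / p(R_i), which is consistent with the symbol definitions above but is not confirmed by them. It also uses the task's largest observed usage as a stand-in for p(R_i); both choices, and the function name `split_windows`, are assumptions.

```python
from typing import List, Sequence


def split_windows(usage: Sequence[float], change_threshold: float) -> List[List[int]]:
    """Partition a task's periods into resource anomaly windows.

    usage[k] is R_{i,t} for the k-th period of residence. The result is a list
    of windows, each a list of 0-based period offsets.
    """
    if not usage:
        return []
    # Assumed stand-in for p(R_i): the largest observed usage (1.0 if all zero).
    scale = max(abs(r) for r in usage) or 1.0
    windows: List[List[int]] = [[0]]
    for k in range(1, len(usage)):
        # Assumed form of Formula 1: normalized absolute change between periods.
        variation = abs(usage[k] - usage[k - 1]) / scale
        if variation > change_threshold:
            windows.append([k])    # significant change: open a new window
        else:
            windows[-1].append(k)  # otherwise stay in the previous window
    return windows
```

For example, a usage series that sits near 10, drops to about 1, and then returns to 10 is split into three windows when the threshold is 0.5, because only the two large jumps exceed it.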

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, the failure frequency of each task in step 42) is computed as follows:

First compute the failure frequency of each task within each resource anomaly window. Let the j-th resource anomaly window of task i start at time S_{ij} and end at time E_{ij}, and let F_{i,t} be the failure count of task i in the t-th time period; the failure frequency Fre_{ij} of task i in the j-th window is then given by Formula 2:

(Formula 2)

By Formula 3, the overall failure frequency Fre_i of task i is set to the maximum of its failure frequencies over all windows, and serves as the decision value for distinguishing high-frequency continuous failure tasks from non-high-frequency continuous failure tasks:

Fre_i = max(Fre_{ij}) (Formula 3)

In Formula 3, Fre_{ij} is the failure frequency of task i in the j-th window.
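The following Python sketch illustrates Formulas 2 and 3 together. The body of Formula 2 is not reproduced in this text; the sketch assumes that the window failure frequency is the sum of F_{i,t} over the window divided by the window length in periods, E_{ij} − S_{ij} + 1. That form, and the function name, are assumptions.

```python
from typing import List, Sequence, Tuple


def task_failure_frequency(
    failures: Sequence[int],
    windows: List[Tuple[int, int]],
) -> float:
    """Return Fre_i, the maximum per-window failure frequency of one task.

    failures[k] is F_{i,t} for the k-th period; each window is an inclusive
    (S_ij, E_ij) pair of 0-based period offsets.
    """
    per_window = [
        # Assumed form of Formula 2: failures in the window per period.
        sum(failures[start:end + 1]) / (end - start + 1)
        for start, end in windows
    ]
    # Formula 3: the task's failure frequency is the maximum over its windows.
    return max(per_window, default=0.0)
```

Taking the maximum rather than the average means that a single window of dense failures determines the task's failure frequency, even if the task ran normally in its other windows.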

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, the failure frequency threshold in step 43) is computed as follows:

First, from the failure frequencies Fre_i of all non-high-frequency continuous failure tasks, fit the cumulative distribution function F_1(fre) of task failure frequency; then, by Formula 4, take the value fre for which F_1(fre) equals the confidence threshold V_conf as the failure frequency threshold f_thre:

f_thre = F_1^{-1}(V_conf) (Formula 4)

In Formula 4, F_1^{-1} is the inverse function of F_1(fre), and V_conf is the confidence threshold.
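Formula 4 can be illustrated with a short Python sketch. The text does not say how the cumulative distribution function is fitted, so the sketch substitutes the empirical distribution of the training frequencies and returns the smallest observed frequency whose empirical CDF value reaches V_conf. The use of the empirical CDF and the function name are assumptions.

```python
import math
from typing import Sequence


def learn_frequency_threshold(frequencies: Sequence[float], confidence: float) -> float:
    """Return f_thre as the empirical quantile of Fre_i at level `confidence`.

    frequencies: Fre_i of every non-high-frequency continuous failure task
    confidence:  V_conf as a fraction in (0, 1], e.g. 0.9 for 90%
    """
    if not frequencies:
        raise ValueError("need at least one training task")
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must lie in (0, 1]")
    ordered = sorted(frequencies)
    # Smallest rank k with k / n >= confidence; rounding guards against
    # floating-point noise in the product before taking the ceiling.
    k = math.ceil(round(confidence * len(ordered), 9))
    return ordered[k - 1]
```

With a confidence of 1.0 the threshold equals the largest training frequency, so no training task would be flagged; lower values trade that guarantee for earlier detection.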

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, the online identification of high-frequency continuous failure tasks in step 6) specifically comprises the following steps:

51) Divide resource anomaly windows based on the converted online event and resource usage data;

The resource anomaly windows are divided as follows:

Whether the resource usage differs significantly is measured by the resource usage variation V_{i,t}, which is computed from the resource usage R_{i,t} of task i at time t by Formula 1:

(Formula 1)

In Formula 1, V_{i,t} is the resource usage variation; R_{i,t−1} is the resource usage of task i at time t−1; R_{i,t} is the resource usage of task i at time t; p(R_i) denotes the order of magnitude of the resource usage of task i and is used to normalize the resource usage variation, so that differences in the order of magnitude of resource usage between tasks do not distort it. If V_{i,t} is greater than the resource change threshold, a new window is opened; otherwise the period remains in the previous window;

52) Compute the failure frequency of each task within its current resource anomaly window;

Let N_i be the failure count currently accumulated by task i, F_i the failure frequency of task i in the current window, and S_i the start time of the current window of task i. If the current time t and time t−1 belong to the same resource anomaly window, time t is added to the window containing t−1 and the failure frequency in that window is computed; otherwise, a new resource anomaly window is opened and the failure frequency in that window is computed;

53) Identify high-frequency continuous failure tasks:

Compare the failure frequency of each task in its current window, obtained in step 52), with the failure frequency threshold obtained in step 4) to identify the high-frequency continuous failure tasks in the cloud computing system: if the failure frequency of task i in the current window is greater than the failure frequency threshold, task i is a high-frequency continuous failure task.
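The online steps above can be sketched as a small per-task state machine in Python. Several points are assumptions: the resource usage variation is taken as |R_{i,t} − R_{i,t−1}| / p(R_i), since the body of Formula 1 is not reproduced in this text; the normalizer p(R_i) is supplied by the caller as `scale`; N_i is interpreted as the failure count accumulated in the current window; and the window failure frequency is N_i divided by the number of periods elapsed in the window. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OnlineTaskState:
    scale: float                       # assumed stand-in for p(R_i)
    window_start: int = 0              # S_i, start period of the current window
    failures_in_window: int = 0        # N_i, failures accumulated in the window
    last_usage: Optional[float] = None

    def update(
        self,
        t: int,
        failures: int,
        usage: float,
        change_threshold: float,
        frequency_threshold: float,
    ) -> bool:
        """Consume period t and report whether the task is flagged."""
        if self.last_usage is None:
            new_window = True  # first observation opens the first window
        else:
            variation = abs(usage - self.last_usage) / self.scale
            new_window = variation > change_threshold
        if new_window:
            self.window_start = t
            self.failures_in_window = 0
        self.failures_in_window += failures
        self.last_usage = usage
        # F_i: failures per period within the current window.
        frequency = self.failures_in_window / (t - self.window_start + 1)
        return frequency > frequency_threshold
```

Because only the current window's running totals are kept, each update costs constant time and memory per task, which suits a real-time monitoring stream.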

In the above online identification method for high-frequency continuous failure tasks in a cloud computing system, further, the high-frequency continuous failure tasks identified in the online data are visualized, and an effective proactive failure recovery mechanism is applied, so as to avoid unnecessary scheduling load and waste of system resources.

The invention also provides an online identification system for high-frequency continuous failure tasks in a cloud computing system, comprising an ETL module, a threshold configuration module, a high-frequency continuous failure task labeler, a time-series-based offline analysis and learning module, a high-frequency continuous failure task online identification module, and a high-frequency continuous failure task early-warning module; characterized in that:

The ETL module processes the input offline and online monitoring data: it extracts the event and resource time-series data, converts them, and passes the conversion results, which include task failure counts and resource usage, to the high-frequency continuous failure task labeler and to the high-frequency continuous failure task online identification module, respectively;

The threshold configuration module is used to configure the failure count threshold, the failure continuity index threshold, the resource change threshold, and the confidence threshold, and passes the configured parameter values to the high-frequency continuous failure task labeler; the failure count threshold and the failure continuity index threshold are used to define high-frequency continuous failure tasks; the resource change threshold is used to judge whether resource usage is anomalous; the confidence threshold specifies the confidence level that the learning result must satisfy;

The high-frequency continuous failure task labeler labels each task in the offline data as a high-frequency continuous failure task or a non-high-frequency continuous failure task according to the failure count threshold and the failure continuity index threshold, and passes the event data and resource time-series data of the labeled non-high-frequency continuous failure tasks, together with the resource change threshold and the confidence threshold, to the time-series-based offline analysis and learning module;

The time-series-based offline analysis and learning module learns the failure frequency threshold and passes it to the high-frequency continuous failure task online identification module; the failure frequency threshold characterizes the failure frequency of non-high-frequency continuous failure tasks;

The high-frequency continuous failure task online identification module identifies the high-frequency continuous failure tasks in the online data according to the failure frequency threshold and passes the identification result to the high-frequency continuous failure task early-warning module;

The high-frequency continuous failure task early-warning module is used to display the identified high-frequency continuous failure tasks visually.

Compared with the prior art, the beneficial effects of the invention are:

Based on an analysis of task failure frequency characteristics and resource usage time-series patterns, the invention uses statistics and pattern matching to identify high-frequency continuous failure tasks in a cloud computing system. Its main features are as follows:

1. The invention analyzes tasks that fail frequently and continuously in a cloud computing system from the perspective of their time-series patterns.

2. The invention describes the time-series pattern of a task in terms of both events and resources: events describe the task's failure frequency within a time period, and resource usage describes its resource consumption pattern within a time period.

3. By analyzing and comparing the resource usage and failure frequency characteristics of high-frequency continuous failure tasks and non-high-frequency continuous failure tasks, the invention accurately identifies high-frequency continuous failure tasks among a large number of tasks at an early stage of failure.

The invention can therefore identify, accurately and in real time, tasks in a cloud computing system that fail frequently and cannot be quickly repaired. Identifying high-frequency continuous failure tasks early makes it possible to notify the cloud computing system in advance so that it takes proactive failure recovery measures and avoids rescheduling these tasks repeatedly, which saves system resources, reduces the load on the system scheduler, and improves the reliability and availability of the cloud computing system.

Description of the drawings

Fig. 1 is a flowchart of the online identification method for high-frequency continuous failure tasks in a cloud computing system provided by an embodiment of the invention.

Fig. 2 is a block diagram of the data processing flow in the online identification system for high-frequency continuous failure tasks in a cloud computing system provided by an embodiment of the invention.

Detailed description

The invention is further described below through embodiments with reference to the drawings, without limiting its scope in any way.

Fig. 1 is a flowchart of the online identification method for high-frequency continuous failure tasks in a cloud computing system provided by an embodiment of the invention, and Fig. 2 is a block diagram of the structure and data processing flow of the online identification system for high-frequency continuous failure tasks in a cloud computing system provided by an embodiment of the invention. The following concrete example illustrates how the method provided by the invention operates:

1) First, the ETL module reads the offline monitoring data from the offline data source and converts the data into a specific data structure;

Data can be read through APIs provided by the system. For file systems, common operating systems such as Windows and Linux provide APIs for reading and writing files; for database systems, taking MySQL as an example, the data can be read by connecting to the database through JDBC.

Suppose the offline data contain n tasks, where task i (1 ≤ i ≤ n) is submitted to the system at time t_{i0} and resides in the system for T_i time periods. The function of the ETL module is to extract T_i triples {t, F_{i,t}, R_{i,t}} for each task i. Here t ∈ {t_{i0}, t_{i0}+1, …, t_{i0}+T_i−1} denotes each time period during which task i resides in the system; F_{i,t} is the failure count of task i in the t-th time period, obtained by counting the failure events in the offline data, and F_{i,t} is 0 if no failure event occurs for task i in the t-th time period; R_{i,t} is the resource usage of task i in the t-th time period, which may cover several metrics such as CPU, memory, and disk, and is collected by the monitoring system and passed in through the offline data source. Because event data and resource usage data may be monitored at different granularities, the time period is set equal to the coarser of the two monitoring periods in order to unify the granularity. Combining the extraction results of all tasks, the offline monitoring data can be converted into the structure shown in Table 1 and passed to the high-frequency continuous failure task labeler.

Table 1. Task, time period, failure count, and resource usage data

2)用户通过阈值配置模块设置失效频次阈值、失效连续指数阈值、资源变动阈值和置信度阈值；2) The user sets the failure frequency threshold, failure continuity index threshold, resource change threshold and confidence threshold through the threshold configuration module;

The failure count threshold sets the minimum total failure count of a high-frequency continuous failure task: only a task whose total failure count exceeds this threshold can be a high-frequency continuous failure task. The user chooses it from the distribution of task failure counts in the system. If F(x) is the cumulative distribution function of the failure count, the recommended range for the threshold is [F^{-1}(0.5%), F^{-1}(10%)], where F^{-1}(x) is the inverse function of F(x).

The failure continuity index is the proportion of a task's total failure count that consists of continuous failures, where a failure is continuous if the task is in the failed state in both the current period and the previous period. The failure continuity index of a high-frequency continuous failure task must exceed the failure continuity index threshold. The threshold takes values in the interval (0,1); a larger value demands stronger failure continuity from high-frequency continuous failure tasks.

The resource change threshold is the smallest resource usage variation that separates two resource exception windows: if the resource usage variation between two adjacent periods (computed as described for resource exception window division in step 41) exceeds the threshold, the two periods fall in different resource exception windows. The user chooses it from the resource variation of the tasks in the system.

The confidence threshold specifies the confidence level to be met when learning the failure frequency characteristics of tasks, that is, the confidence level at which the learned result represents all training tasks. To keep the learned result sufficiently credible, the recommended range is [80%, 100%];
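The four configurable thresholds and their stated value ranges can be captured in a small configuration object. The sketch below is illustrative only; the class name `Thresholds`, its field names, and the choice to reject out-of-range values with `ValueError` are assumptions, not part of the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    failure_count: float       # minimum total failure count of a candidate task
    failure_continuity: float  # failure continuity index threshold, in (0, 1)
    resource_change: float     # variation above which a new window is opened
    confidence: float          # recommended range is [0.8, 1.0]

    def __post_init__(self):
        # Range checks follow the intervals given in the text.
        if not 0.0 < self.failure_continuity < 1.0:
            raise ValueError("failure_continuity must lie in (0, 1)")
        if not 0.0 < self.confidence <= 1.0:
            raise ValueError("confidence must lie in (0, 1]")
        if self.failure_count < 0 or self.resource_change < 0:
            raise ValueError("thresholds must be non-negative")


config = Thresholds(failure_count=10, failure_continuity=0.5,
                    resource_change=0.3, confidence=0.9)
```

The recommended confidence range [80%, 100%] is left as a recommendation here rather than enforced, since the text presents it as a suggestion.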

3) Based on these thresholds, the high-frequency continuous failure task marker labels each task in the offline data as a high-frequency continuous failure task or a non-high-frequency continuous failure task (only a task whose total failure count exceeds the failure count threshold can be a high-frequency continuous failure task, and its failure continuity index must also exceed the failure continuity index threshold). It then passes the labels, the received offline data, and the configuration parameters to the time-series-based offline analysis and learning module;

In the present invention, a high-frequency continuous failure task must exhibit both high-frequency failure and continuous failure. The tasks in the offline data are therefore divided as follows. If the total failure count of a task exceeds the failure count threshold and its failure continuity index exceeds the failure continuity index threshold, the task is a high-frequency continuous failure task. If its total failure count exceeds the failure count threshold but its failure continuity index is below the failure continuity index threshold, the task is a non-high-frequency continuous failure task. If the task does not meet the high-frequency failure condition (its total failure count is below the failure count threshold), it is clearly not a high-frequency continuous failure task; however, because its failure count is so low, its characteristics differ greatly from those of high-frequency continuous failure tasks, and the present invention does not analyze this case.
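The labeling rule can be sketched as below. The text defines the failure continuity index only in words, so the counting rule used here (a period's failures count as continuous when both that period and the previous one contain failures) is an assumption, as are the function names and the label strings. Values exactly equal to a threshold are not covered by the text; this sketch treats them as not exceeding it.

```python
def continuity_index(counts):
    """Share of failures that fall in a period whose previous period also failed.

    counts: per-period failure counts F_t of one task, in time order.
    The counting rule is an assumption; the text defines the index only in words.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    continuous = sum(
        cur for prev, cur in zip(counts, counts[1:]) if prev > 0 and cur > 0
    )
    return continuous / total


def label_task(counts, count_threshold, continuity_threshold):
    """Return 'hfcf', 'non_hfcf', or None for tasks left out of the analysis."""
    if sum(counts) <= count_threshold:
        return None  # failure count too low; such tasks are not analyzed
    if continuity_index(counts) > continuity_threshold:
        return "hfcf"
    return "non_hfcf"


bursty = [0, 3, 4, 5, 0]      # failures cluster in adjacent periods
scattered = [6, 0, 6, 0]      # same total, but never in adjacent periods
```

With a count threshold of 10 and a continuity threshold of 0.5, `bursty` is labeled as a high-frequency continuous failure task and `scattered` as a non-high-frequency continuous failure task, although both have 12 failures in total.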

4) Using the received data and parameters, the time-series-based offline analysis and learning module applies the resource change threshold and the confidence threshold to analyze the resource usage patterns and failure frequency characteristics of the non-high-frequency continuous failure tasks. From this it learns a failure frequency threshold that represents, at the given confidence level, the failure frequency characteristics of all non-high-frequency continuous failure tasks, and passes the result to the high-frequency continuous failure task online identification module;

Learning this failure frequency threshold comprises the following steps:

41) Divide the resource exception windows;

A high-frequency continuous failure task consumes different amounts of resources when it runs normally and when it fails, and it fails continuously, so its resource usage shows a marked piecewise trend. The resource exception window division module therefore divides the residence time of each non-high-frequency continuous failure task in the system into several resource exception windows according to the degree of resource abnormality, and passes the division result to the failure frequency calculation module. The windows are divided as follows:

If the resource usage of a task in time period t differs significantly from that in period t-1, period t is considered to start a new resource exception window; otherwise t and t-1 belong to the same window. Whether the difference is significant is measured by the resource usage variation: for the resource usage R_{i,t} of task i at time t (obtained from the high-frequency continuous failure task marker), the resource usage variation V_{i,t} is calculated by Formula 1:

(Formula 1)

In Formula 1, V_{i,t} is the resource usage variation; R_{i,t-1} is the resource usage of task i at time t-1; R_{i,t} is the resource usage of task i at time t; p(R_i) is the order of magnitude of the resource usage of task i, used to normalize the variation so that differences in scale between tasks do not distort it. If V_{i,t} is greater than the resource change threshold (set in the threshold configuration module and passed on by the high-frequency continuous failure task marker), a new window is opened; otherwise the period stays in the previous window.
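A minimal sketch of the window division follows. Formula 1 is not reproduced in this text, so the form used here, |R_{i,t} - R_{i,t-1}| / p(R_i), is an assumption inferred from the symbol definitions. The stand-in for p(R_i) (the task's largest usage value, overridable through the hypothetical `scale` parameter) and the function name `split_windows` are likewise assumptions.

```python
def split_windows(usage, change_threshold, scale=None):
    """Split one task's periods into resource exception windows.

    usage: per-period resource usage R_t of one task, in time order.
    scale: stand-in for p(R_i); defaults to the largest usage value (assumption).
    Returns a list of (start, end) index pairs, with end inclusive.
    """
    if not usage:
        return []
    if scale is None:
        scale = max(abs(r) for r in usage) or 1.0  # avoid division by zero
    windows = []
    start = 0
    for t in range(1, len(usage)):
        # Assumed form of Formula 1: normalized absolute change between periods.
        variation = abs(usage[t] - usage[t - 1]) / scale
        if variation > change_threshold:
            windows.append((start, t - 1))  # close the previous window
            start = t                       # period t opens a new window
    windows.append((start, len(usage) - 1))
    return windows


usage = [0.50, 0.52, 0.10, 0.12, 0.11, 0.55]
windows = split_windows(usage, change_threshold=0.3)
```

In this example the usage drops sharply at index 2 and rises sharply at index 5, so the six periods fall into three windows.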

42) Calculate the failure frequency;

After receiving the resource exception window division result of each task, the failure frequency calculation module calculates the failure frequency of each task and passes it to the failure frequency threshold learning module. The failure frequency is calculated as follows:

First, the failure frequency of each task within each resource exception window is calculated. Let S_{ij} and E_{ij} be the start time and end time of the j-th resource exception window of task i, and let F_{i,t} be the failure count of task i in time period t. The failure frequency Fre_{ij} of task i in window j is then:

(Formula 2)

The failure frequency of one task may differ between resource exception windows, and the aim is to learn the maximum failure frequency of non-high-frequency continuous failure tasks as the value that separates them from high-frequency continuous failure tasks. The overall failure frequency Fre_i of task i is therefore set to the maximum of its failure frequencies over all windows:

Fre_i = max_j(Fre_{ij}) (Formula 3)
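The two-stage calculation can be sketched as below. Formula 2 is not reproduced in this text; the sketch assumes it is the number of failures in a window divided by the window length in periods, E_{ij} - S_{ij} + 1. The function names are hypothetical, and windows are given as inclusive index pairs.

```python
def window_failure_frequency(counts, start, end):
    """Failures per period inside one window [start, end], end inclusive.

    Assumed form of Formula 2: sum of F_t over the window / window length.
    """
    return sum(counts[start:end + 1]) / (end - start + 1)


def task_failure_frequency(counts, windows):
    """Formula 3: the maximum window failure frequency of the task."""
    return max(window_failure_frequency(counts, s, e) for s, e in windows)


counts = [0, 0, 3, 4, 5, 0]             # per-period failure counts F_t
windows = [(0, 1), (2, 4), (5, 5)]      # resource exception windows of the task
overall = task_failure_frequency(counts, windows)
```

Here the three windows have failure frequencies 0, 4, and 0 failures per period, so the overall failure frequency of the task is 4.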

The failure frequency threshold learning module receives the failure frequencies of all non-high-frequency continuous failure tasks from the failure frequency calculation module. Using the confidence threshold, it calculates the failure frequency threshold that represents the failure frequency of all non-high-frequency continuous failure tasks at the given confidence level, and passes it to the high-frequency continuous failure task online identification module. The failure frequency threshold is calculated as follows:

First, the cumulative distribution function F_1(fre) of task failure frequency is fitted from the failure frequencies Fre_i of all non-high-frequency continuous failure tasks. The value fre at which F_1(fre) equals the confidence threshold V_conf is then taken as the failure frequency threshold f_thre and passed to the high-frequency continuous failure task online identification module:

f_thre = F_1^{-1}(V_conf) (Formula 4)

where F_1^{-1} is the inverse function of F_1(fre).
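One simple way to evaluate Formula 4 is to use the empirical distribution of the observed failure frequencies instead of a fitted parametric one. The sketch below takes that shortcut, which is an assumption: the text only says that a cumulative distribution function is fitted and does not say how. The function name and the small tolerance that guards against floating-point rounding are also choices made for this example.

```python
import math


def failure_frequency_threshold(frequencies, confidence):
    """Smallest observed frequency whose empirical CDF value reaches `confidence`.

    frequencies: Fre_i of every non-high-frequency continuous failure task.
    confidence: the confidence threshold V_conf, in (0, 1].
    """
    if not frequencies:
        raise ValueError("need at least one failure frequency")
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must lie in (0, 1]")
    ordered = sorted(frequencies)
    # The empirical CDF at ordered[k] is (k + 1) / n, so the threshold is the
    # value at the smallest rank with rank / n >= confidence.
    rank = max(1, math.ceil(confidence * len(ordered) - 1e-9))
    return ordered[rank - 1]


observed = [0.5 * k for k in range(1, 11)]   # 0.5, 1.0, ..., 5.0
threshold = failure_frequency_threshold(observed, confidence=0.85)
```

With ten observed frequencies and a confidence threshold of 85%, the threshold is the ninth smallest value, so at least 85% of the non-high-frequency continuous failure tasks fall at or below it.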

5) Input the online data and identify the high-frequency continuous failure tasks in it in real time;

Online monitoring delivers, in real time, the event data and resource usage of the tasks in the system at a given moment, so inputting and converting online data is a repeating process, and the number of tasks in each input may differ. When online data arrives, the ETL module converts it, in much the same way as offline data, into the structure shown in Table 1, that is, a list of quadruples {j, t, F_{j,t}, R_{j,t}}. Here t ∈ {t_{j0}, t_{j0}+1, ..., t_{j0}+T_j-1}, where t_{j0} and T_j are the first submission time and the residence time of task j; F_{j,t} is the failure count of task j in time period t, obtained by counting the failure events in the input data; R_{j,t} is the resource usage of task j in time period t, collected by the monitoring system and passed in through the online data source. The resulting list of quadruples is then passed to the high-frequency continuous failure task online identification module;

The online identification process is triggered by online data input events. When online monitoring data arrives, the ETL module reads it and converts its structure in real time. Using the received data and parameters such as the failure frequency threshold, the high-frequency continuous failure task online identification module identifies the high-frequency continuous failure tasks in the online data in real time and passes the result to the high-frequency continuous failure task early warning module. The early warning module displays the identification result, prompting the administrators of the cloud computing system to take appropriate measures. The identification process repeats with each online data input until the input ends.

The high-frequency continuous failure task online identification module is the core of the system. Using the failure frequency threshold and the resource change threshold it has received, together with the incoming online monitoring data, it identifies the high-frequency continuous failure tasks in the cloud computing system in real time and passes them to the high-frequency continuous failure task early warning module. It comprises three submodules:

51) Resource exception judgment

The resource exception judgment module receives the converted online event and resource usage data, determines for each task i in each period t whether its resource usage lies in the same resource exception window as the previous period, and passes the result to the failure frequency calculation module. The method is similar to the one used by the resource exception window division submodule of the time-series-based offline analysis and learning module: if the resource usage variation V_{i,t} of the task in period t relative to period t-1 is greater than the resource change threshold, the resource usage in period t is considered abnormal relative to period t-1, that is, the two periods lie in different resource exception windows.

52) Failure frequency calculation

After receiving the resource exception judgment for the current moment, the failure frequency calculation module calculates the failure frequency of each task within its current resource exception window and passes it to the high-frequency continuous failure task identification module. The calculation is expressed in pseudocode, in which N_i is the failure count task i has accumulated so far, F_i is the failure frequency of task i in the current window, and S_i is the start time of the current window of task i. If the current time t and time t-1 belong to the same resource exception window, time t is added to the window containing t-1 and the failure frequency of that window is calculated; otherwise a new resource exception window is opened and its failure frequency is calculated.
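The pseudocode itself does not survive in this text. As a stand-in, the update described above can be sketched in Python. This is an illustration under stated assumptions, not the patent's pseudocode: N_i is taken to accumulate within the current window and to restart when a new window opens, the window frequency is taken to be failures divided by the number of periods in the window, and the names `WindowState` and `update` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowState:
    start: int               # S_i: first period of the current window
    failures: int = 0        # N_i: failures accumulated in the current window
    frequency: float = 0.0   # F_i: failure frequency of the current window


def update(state, t, failures_in_t, same_window):
    """Fold period t into a task's window state and return the state.

    same_window: result of the resource exception judgment for period t.
    """
    if state is None or not same_window:
        # Resource usage changed significantly: open a new window at t.
        state = WindowState(start=t)
    state.failures += failures_in_t
    # Assumed frequency: failures per period over the window so far.
    state.frequency = state.failures / (t - state.start + 1)
    return state


s = update(None, 10, 2, same_window=False)   # first period of the task
first = s.frequency
s = update(s, 11, 4, same_window=True)       # same window grows to 2 periods
second = s.frequency
s = update(s, 12, 1, same_window=False)      # usage changed, new window
third = s.frequency
```

Keeping only the start time and the running count per task means each new period is processed in constant time, which suits the real-time setting described in the text.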

53) High-frequency continuous failure task identification

The high-frequency continuous failure task identification module uses the failure frequency in the current window and the failure frequency threshold learned by the time-series-based offline analysis and learning module to identify the high-frequency continuous failure tasks in the cloud computing system, and passes the result to the high-frequency continuous failure task early warning module. The core of the method is a comparison: if the failure frequency of task i in its current window is greater than the failure frequency threshold, task i is a high-frequency continuous failure task.
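The three submodules can be tied together in a single pass over the stream of quadruples. The sketch below is a hedged illustration: it assumes a normalized absolute change as the resource usage variation and failures per period as the window failure frequency, since neither formula is reproduced in this text, and the function name `identify_online` and the `scale` parameter (a stand-in for p(R_i)) are hypothetical.

```python
def identify_online(rows, change_threshold, frequency_threshold, scale=1.0):
    """Yield (task, period) whenever a task's current-window failure
    frequency exceeds the learned failure frequency threshold.

    rows: iterable of (task, period, failures, usage), in time order per task.
    """
    state = {}  # task -> [window_start, failures_in_window, previous_usage]
    for task, t, failures, usage in rows:
        entry = state.get(task)
        # Step 51: resource exception judgment against the previous period.
        if entry is None or abs(usage - entry[2]) / scale > change_threshold:
            entry = [t, 0, usage]  # open a new resource exception window
            state[task] = entry
        # Step 52: failure frequency of the current window.
        entry[1] += failures
        entry[2] = usage
        frequency = entry[1] / (t - entry[0] + 1)
        # Step 53: compare with the learned threshold.
        if frequency > frequency_threshold:
            yield task, t


rows = [
    ("a", 0, 0, 0.50), ("a", 1, 3, 0.10), ("a", 2, 5, 0.12),
    ("b", 0, 1, 0.30), ("b", 1, 0, 0.31),
]
flagged = list(identify_online(rows, change_threshold=0.3, frequency_threshold=2.5))
```

Task "a" is flagged as soon as its usage drops and failures pile up in the new window, while task "b" never exceeds the threshold.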

6) Early warning of high-frequency continuous failure tasks

The high-frequency continuous failure task early warning module receives the high-frequency continuous failure tasks identified by the online identification module and visualizes them. Through this module, users can view the high-frequency continuous failure tasks in the cloud computing system and take effective proactive failure recovery measures (for example, notifying the scheduler to stop scheduling the task and requesting fault diagnosis or a bug fix for it), avoiding unnecessary scheduling load and waste of system resources.

The data processing flow of the whole system is shown in Figure 2. First, the ETL module reads the offline monitoring data from the offline data source and converts it into a specific data structure, while the user sets the failure count threshold, the failure continuity index threshold, the resource change threshold, and the confidence threshold through the threshold configuration module. Based on these thresholds, the high-frequency continuous failure task marker labels the high-frequency continuous failure tasks and non-high-frequency continuous failure tasks in the offline data, and then passes the labels, the received offline data, and the configuration parameters to the time-series-based offline analysis and learning module. That module applies the resource change threshold and the confidence threshold to analyze the resource usage patterns and failure frequency characteristics of the non-high-frequency continuous failure tasks, learns a failure frequency threshold that represents their failure frequency characteristics at the given confidence level, and passes the result to the high-frequency continuous failure task online identification module.

The online identification process is triggered by online data input events. When online monitoring data arrives, the ETL module reads it and converts its structure in real time. Using the received data and parameters such as the failure frequency threshold, the high-frequency continuous failure task online identification module identifies the high-frequency continuous failure tasks in the online data in real time and passes the result to the high-frequency continuous failure task early warning module. The early warning module displays the identification result, prompting the administrators of the cloud computing system to take appropriate measures. The identification process repeats with each online data input until the input ends.

It should be noted that the embodiments are disclosed to aid further understanding of the present invention, and those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should therefore not be limited to what the embodiments disclose; the scope of protection is that defined by the claims.

Claims

1. An online identification method for high-frequency continuous failure tasks in a cloud computing system, in which time-series-based offline analysis and learning is performed on offline monitoring data to obtain a failure frequency threshold that represents, at a given confidence level, the failure frequency characteristics of all non-high-frequency continuous failure tasks, and the high-frequency continuous failure tasks in online data are then identified; the method comprises the following steps:

1) Extract event and resource time series data from the offline monitoring data and convert them into a specific format to obtain an offline data format conversion result that includes the failure count and resource usage of each task;

2) Configure parameter values; the parameters comprise a failure count threshold (a limit on a task's total number of failures), a failure continuity index threshold, a resource change threshold, and a confidence threshold; the failure count threshold and the failure continuity index threshold are used to define high-frequency continuous failure tasks;

3) According to the offline data format conversion result described in step 1) and the configuration parameter value described in step 2), the tasks in the offline data are marked as high-frequency continuous failure tasks or non-high-frequency continuous failure tasks;

4) Using the resource change threshold and the confidence threshold from step 2), analyze the resource usage patterns and failure frequency characteristics of the non-high-frequency continuous failure tasks, and learn a failure frequency threshold that represents, at a given confidence level, the failure frequency characteristics of all non-high-frequency continuous failure tasks;

5) Input the online monitoring data in real time, extract the event and resource time series data of the tasks, and convert them into the specific format to obtain an online data format conversion result that includes the failure count and resource usage of each task;

6) According to the online data format conversion result obtained in step 5) and the failure frequency threshold obtained in step 4), the high-frequency continuous failure tasks in the online data are identified in real time.

2. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 1, characterized in that the conversion of the event and resource time series data into a specific format in step 1) is specifically:

Let there be n tasks in the event and resource time series data, where task i (1 ≤ i ≤ n) is submitted to the system at time t_{i0} and resides in the system for T_i time periods; for each task i, T_i triples {t, F_{i,t}, R_{i,t}} are extracted;

The specific format obtained for the event and resource time series data is a list of quadruples {i, t, F_{i,t}, R_{i,t}}, where t ∈ {t_{i0}, t_{i0}+1, ..., t_{i0}+T_i-1} indexes each time period in which task i resides in the system; F_{i,t} is the failure count of task i in time period t, obtained by counting the failure events in the offline data, and is 0 if task i has no failure event in period t; R_{i,t} is the resource usage of task i in time period t, collected by the monitoring system and passed in through the offline data source.

3. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 1, characterized in that marking the tasks in the offline data as high-frequency continuous failure tasks or non-high-frequency continuous failure tasks in step 3) is specifically:

From the failure counts of each task in the offline data described in step 1), calculate the total failure count of the task; obtain the failure continuity index as the proportion of the task's total failure count that consists of continuous failures; a task whose total failure count is greater than the failure count threshold of step 2) and whose failure continuity index is greater than the failure continuity index threshold of step 2) is marked as a high-frequency continuous failure task; otherwise it is marked as a non-high-frequency continuous failure task.

4. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 1, characterized in that learning, in step 4), the failure frequency threshold that represents the failure frequency characteristics of all non-high-frequency continuous failure tasks at a given confidence level specifically comprises the following steps:

41) dividing the resource exception window;

The degree of resource abnormality is obtained from the resource usage, and, using the resource change threshold set in step 2), the residence time of each non-high-frequency continuous failure task in the system is divided into multiple resource exception windows according to the degree of resource abnormality, yielding the resource exception window division result of each task;

42) Calculate the failure frequency;

According to the division result of the resource exception window of each task obtained in step 41), the failure frequency of each task in each resource exception window is calculated; then the maximum value of each task's failure frequency in all windows is set as the failure frequency of the task;

43) Obtaining the failure frequency threshold through learning;

From the failure frequencies of all non-high-frequency continuous failure tasks obtained in step 42), the cumulative distribution function of task failure frequency is fitted, and, using the confidence threshold set in step 2), the failure frequency threshold that represents the failure frequency of all non-high-frequency continuous failure tasks at that confidence level is calculated.

5. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 4, characterized in that the resource exception windows in step 41) are divided as follows:

If the resource usage of the task in time period t differs significantly from its resource usage in time period t-1, period t is set to belong to a new resource exception window; otherwise t belongs to the same window as t-1;

Whether there is a significant difference in resource usage is measured by the resource usage variation: for the resource usage R_{i,t} of task i at time t, the resource usage variation V_{i,t} is calculated by Formula 1:

In Formula 1, V_{i,t} is the resource usage variation; R_{i,t-1} is the resource usage of task i at time t-1; R_{i,t} is the resource usage of task i at time t; p(R_i) is the order of magnitude of the resource usage of task i, used to normalize the variation so that differences in scale between tasks do not distort it; if V_{i,t} is greater than the resource change threshold, a new window is opened; otherwise the period stays in the previous window.

6. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 4, characterized in that the failure frequency of each task in step 42) is calculated as follows:

First, calculate the failure frequency of each task in each resource exception window; let the start time of the j-th resource exception window of task i be S_{ij} and its end time be E_{ij}, and let F_{i,t} be the failure count of task i in time period t; the failure frequency Fre_{ij} of task i in window j is then obtained by Formula 2:

By Formula 3, the overall failure frequency Fre_i of task i is set to the maximum failure frequency of task i over all windows, and serves as the value that distinguishes high-frequency continuous failure tasks from non-high-frequency continuous failure tasks:

Fre_i = max_j(Fre_{ij}) (Formula 3)

In Formula 3, Fre_{ij} is the failure frequency of task i in window j.

7. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 4, characterized in that the failure frequency threshold in step 43) is calculated as follows:

First, the cumulative distribution function F_1(fre) of task failure frequency is fitted from the failure frequencies Fre_i of all non-high-frequency continuous failure tasks; then, by Formula 4, the value fre at which F_1(fre) equals the confidence threshold V_conf is taken as the failure frequency threshold f_thre:

f_thre = F_1^{-1}(V_conf) (Formula 4)

In Formula 4, F_1^{-1} is the inverse function of F_1(fre), and V_conf is the confidence threshold.

8. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 1, characterized in that the online identification of high-frequency continuous failure tasks in step 6) specifically comprises the following steps:

51) From the converted online event and resource usage data, divide the resource exception windows;

The resource exception windows are divided as follows:

In Formula 1, V_{i,t} is the resource usage variation; R_{i,t-1} is the resource usage of task i at time t-1; R_{i,t} is the resource usage of task i at time t; p(R_i) is the order of magnitude of the resource usage of task i, used to normalize the variation so that differences in scale between tasks do not distort it; if V_{i,t} is greater than the resource change threshold, a new window is opened; otherwise the period stays in the previous window;

52) Calculate the failure frequency of each task within its current resource exception window;

Let N_i be the failure count task i has accumulated so far, F_i the failure frequency of task i in the current window, and S_i the start time of the current window of task i; if the current time t and time t-1 belong to the same resource exception window, time t is added to the window containing t-1 and the failure frequency of that window is calculated; otherwise a new resource exception window is opened and its failure frequency is calculated;

53) Identify high-frequency continuous failure tasks;

The failure frequency of each task in its current window, obtained in step 52), is compared with the failure frequency threshold obtained in step 4) to identify the high-frequency continuous failure tasks in the cloud computing system; if the failure frequency of task i in its current window is greater than the failure frequency threshold, task i is a high-frequency continuous failure task.

9. The online identification method for high-frequency continuous failure tasks in a cloud computing system as claimed in claim 1, wherein the high-frequency continuous failure tasks identified in the online data are visualized, and an effective proactive failure recovery mechanism is adopted to avoid unnecessary scheduling load and waste of system resources.

10. An online identification system for high-frequency continuous failure tasks in a cloud computing system, including an ETL module, a threshold configuration module, a high-frequency continuous failure task marker, an offline analysis and learning module based on time series, and a high-frequency continuous failure task Task online identification module and high-frequency continuous failure task early warning module; it is characterized in that,

The ETL module is used to process the input offline monitoring data and online monitoring data, extract event and resource time series data from the data, convert them, and pass the conversion results, including task failure frequency and resource usage, to the high-frequency continuous failure task marker and the high-frequency continuous failure task online identification module;

The threshold configuration module is used to configure the failure frequency threshold, the failure continuity index threshold, the resource change threshold and the confidence threshold, and to pass the configured parameter values to the high-frequency continuous failure task marker; the configured failure frequency threshold and failure continuity index threshold are used to define high-frequency continuous failure tasks; the resource change threshold is used to judge whether the resource usage is abnormal; the confidence threshold is used to set the confidence level that the learning results need to meet;

The high-frequency continuous failure task marker marks the tasks in the offline data as high-frequency continuous failure tasks or non-high-frequency continuous failure tasks according to the failure frequency threshold and the failure continuity index threshold, and passes the event data and resource time series data of the non-high-frequency continuous failure tasks, together with the resource change threshold and the confidence threshold, to the offline analysis and learning module based on time series;

The offline analysis and learning module based on time series uses the resource change threshold and the confidence threshold to analyze the resource usage patterns and failure frequency characteristics of the non-high-frequency continuous failure tasks, learns a failure frequency threshold that represents, at a certain confidence level, the failure frequency characteristics of all non-high-frequency continuous failure tasks, and passes it to the high-frequency continuous failure task online identification module; the failure frequency threshold is used to characterize the failure frequency characteristics of non-high-frequency continuous failure tasks;

The high-frequency continuous failure task online identification module identifies the high-frequency continuous failure tasks in the online data according to the failure frequency threshold, and passes the identification results to the high-frequency continuous failure task early warning module;

The high-frequency continuous failure task early warning module is used to visualize the identified high-frequency continuous failure tasks.
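The data flow between the marker, the offline learning module and the online identification module can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation. The claim does not fix the marking rule or the learning procedure, so the sketch assumes that a task is marked as high-frequency continuous only when both its failure frequency and its failure continuity index exceed their thresholds, and that the learned threshold is the smallest observed window failure frequency that covers at least the required confidence fraction of the non-high-frequency samples. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def mark_non_high_frequency(tasks: Dict[str, Tuple[float, float]],
                            frequency_threshold: float,
                            continuity_threshold: float) -> List[str]:
    # Marker (assumed rule): a task is high-frequency continuous only if BOTH its
    # failure frequency and its continuity index exceed the configured thresholds.
    # Returns the IDs of the remaining, non-high-frequency tasks.
    return [task_id for task_id, (freq, cont) in tasks.items()
            if not (freq > frequency_threshold and cont > continuity_threshold)]


def learn_failure_frequency_threshold(window_frequencies: List[int],
                                      confidence: float) -> int:
    # Offline learning (assumed as an empirical quantile): smallest observed value
    # that is >= at least `confidence` of the non-high-frequency window frequencies.
    ordered = sorted(window_frequencies)
    k = max(1, math.ceil(confidence * len(ordered)))
    return ordered[k - 1]


def recognize_online(online_frequencies: Dict[str, int],
                     learned_threshold: int) -> List[str]:
    # Online identification: tasks whose current-window failure frequency
    # exceeds the learned threshold are reported to the early warning module.
    return sorted(task_id for task_id, freq in online_frequencies.items()
                  if freq > learned_threshold)


# Offline data: task ID -> (failure frequency, failure continuity index).
offline = {"a": (2, 0.1), "b": (50, 0.9), "c": (3, 0.2)}
normal_tasks = mark_non_high_frequency(offline, frequency_threshold=10,
                                       continuity_threshold=0.5)

# Window failure frequencies collected from the non-high-frequency tasks.
learned = learn_failure_frequency_threshold([1, 1, 2, 2, 3, 3, 4, 10], confidence=0.75)

alerts = recognize_online({"x": 2, "y": 7}, learned)
```

The manually configured thresholds are used only to label the offline data, while the online module relies solely on the learned threshold, which is why the two threshold values are kept as separate parameters in the sketch.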