CN117272207A - Data center anomaly analysis method and system

Info

Publication number: CN117272207A
Application number: CN202311310311.3A
Authority: CN
Inventors: 张巍; 邬青; 张军
Original assignee: Jiangsu Hengxin Digital Intelligence Technology Co ltd
Current assignee: Jiangsu Hengxin Digital Intelligence Technology Co ltd
Priority date: 2023-10-10
Filing date: 2023-10-10
Publication date: 2023-12-22

Abstract

Embodiments of the invention provide a data center anomaly analysis method and system for analyzing and predicting abnormal data center operation. First, operation data of a target data center is acquired. The operation data is then loaded into an anomaly analysis and prediction network generated through a priori learning; the network is generated by knowledge learning on associated data center operation data and static scheduling data, and it outputs a corresponding operation anomaly category. Finally, anomaly diagnosis data is determined from the determined operation anomaly category. The method is intended to analyze and predict anomalies that may arise while the data center is running, supporting data center management and maintenance.

Description

Data center anomaly analysis method and system

Technical field

The present invention relates to the technical field of data centers, and in particular to a data center anomaly analysis method and system.

Background

With the development of information technology, data centers serve as core facilities for storing and processing large amounts of data, and their operating status is crucial to the stability of the entire IT system. However, because data centers are complex and their operating environments are variable, many types of anomalies may occur during operation, such as hardware failures, software errors, and network interruptions.

Traditional anomaly detection methods usually rely on manual monitoring or preset threshold alarms, and both have limitations. Manual monitoring is inefficient and error-prone, while preset thresholds cannot adapt to dynamic changes in the data center operating environment and may produce large numbers of false positives or false negatives.

In addition, when an anomaly occurs in a data center, the fault must be diagnosed quickly and accurately to minimize the resulting service interruption. However, current fault diagnosis methods often depend on the experience of professional maintenance personnel and may fail to produce an effective diagnosis when facing complex or previously unseen types of anomalies.

Therefore, a new method is urgently needed that can effectively analyze and predict anomalies during data center operation and perform fast, accurate fault diagnosis based on the anomaly category.

Summary of the invention

In view of this, embodiments of the present invention aim to provide a data center anomaly analysis method and system for analyzing and predicting abnormal data center operation. First, operation data of a target data center is acquired. The operation data is then loaded into an anomaly analysis and prediction network generated through a priori learning; the network is generated by knowledge learning on associated data center operation data and static scheduling data, and it outputs a corresponding operation anomaly category. Finally, anomaly diagnosis data is determined from the determined operation anomaly category. The method is intended to analyze and predict anomalies that may arise during data center operation, supporting data center management and maintenance.

According to one aspect of the embodiments of the present invention, a data center anomaly analysis method is provided. The method includes:

acquiring target data center operation data;

loading the target data center operation data into an a priori learned anomaly analysis and prediction network to generate an operation anomaly category determined by the network, where the network is generated by knowledge learning on associated data center operation data and static scheduling data;

determining anomaly diagnosis data according to the operation anomaly category.

In an alternative implementation, the anomaly analysis and prediction network includes a self-attention unit and a fully connected output unit, and loading the target data center operation data into the a priori learned anomaly analysis and prediction network to generate the operation anomaly category determined by the network includes:

loading the target data center operation data into the self-attention unit to generate a target self-attention feature determined by the self-attention unit;

loading the target self-attention feature into the fully connected output unit to generate the operation anomaly category determined by the fully connected output unit.
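The two-stage inference above can be sketched as follows. This is a minimal illustration only: the patent does not specify the attention variant, pooling, layer sizes, or input layout, so the single-head scaled dot-product attention, the mean pooling, the random placeholder weights, and all names (`self_attention_unit`, `fully_connected_output_unit`, `D_MODEL`, and so on) are assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_unit(records, wq, wk, wv):
    """Single-head scaled dot-product self-attention, mean-pooled to one vector."""
    q, k, v = matmul(records, wq), matmul(records, wk), matmul(records, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # scores[i][j] = dot(q_i, k_j)
    weights = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(weights, v)
    return [sum(col) / len(context) for col in zip(*context)]


def fully_connected_output_unit(feature, w, b):
    """Linear layer plus softmax; returns one probability per anomaly category."""
    logits = [sum(f * wi for f, wi in zip(feature, col)) + bi
              for col, bi in zip(zip(*w), b)]
    return softmax(logits)


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real network would load learned ones."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_IN, D_MODEL, N_CATEGORIES = 4, 3, 3  # illustrative sizes, not from the patent
wq, wk, wv = (random_matrix(D_IN, D_MODEL, rng) for _ in range(3))
w_out, b_out = random_matrix(D_MODEL, N_CATEGORIES, rng), [0.0] * N_CATEGORIES

# Three time steps of made-up metrics (e.g. CPU, memory, disk, network load).
operation_data = [
    [0.70, 0.20, 0.10, 0.90],
    [0.80, 0.30, 0.10, 0.95],
    [0.95, 0.90, 0.40, 0.20],
]
feature = self_attention_unit(operation_data, wq, wk, wv)
probabilities = fully_connected_output_unit(feature, w_out, b_out)
category = max(range(N_CATEGORIES), key=probabilities.__getitem__)
```

Because the weights here are random, the predicted `category` is meaningless; the sketch only shows how data flows from the self-attention unit to the fully connected output unit.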

In an alternative implementation, the training of the anomaly analysis and prediction network includes:

generating positive learning features and negative learning features from linked operation scheduling event data in data center log data, and generating template learning data from the positive learning features and the negative learning features;

updating the parameters of an initialized anomaly analysis and prediction network according to the template learning data, to generate the anomaly analysis and prediction network with updated parameters.

In an alternative implementation, generating the positive learning features and negative learning features from the linked operation scheduling event data in the data center log data includes:

obtaining the linked operation scheduling event data in the data center log data as basic training template data;

performing a regularized conversion on the basic training template data to generate the positive learning features;

randomly scrambling the data center operation data and noise features in the data center log data to generate the negative learning features.
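A rough sketch of this sample construction is given below. The patent does not define the "regularized conversion" or the exact scrambling procedure, so min-max scaling and shuffling both within and across records are assumptions made purely for illustration, as are all function names and the sample values.

```python
import random


def regularize(record):
    """Min-max scale one numeric record to [0, 1] (an assumed 'regularized conversion')."""
    low, high = min(record), max(record)
    span = high - low
    return [0.0 if span == 0 else (x - low) / span for x in record]


def make_positive_features(linked_events):
    """Positive learning features: regularized linked operation scheduling events."""
    return [regularize(event) for event in linked_events]


def make_negative_features(operation_data, noise_features, rng):
    """Negative learning features: operation data and noise, randomly scrambled."""
    pool = [list(r) for r in operation_data] + [list(r) for r in noise_features]
    for record in pool:
        rng.shuffle(record)  # scramble values inside each record
    rng.shuffle(pool)        # scramble the order of the records
    return pool


def make_template_learning_data(positives, negatives):
    """Label positives 1 and negatives 0 to form the template learning data."""
    return [(p, 1) for p in positives] + [(n, 0) for n in negatives]


rng = random.Random(42)
linked_events = [[10.0, 20.0, 30.0], [5.0, 5.0, 15.0]]  # made-up log excerpts
operation_data = [[0.3, 0.6, 0.9]]
noise_features = [[rng.random() for _ in range(3)]]

positives = make_positive_features(linked_events)
negatives = make_negative_features(operation_data, noise_features, rng)
template = make_template_learning_data(positives, negatives)
```

The binary labels are one plausible way to combine the two feature sets into template learning data; the source does not state how they are combined.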

In an alternative implementation, the self-attention unit includes a first self-attention unit and a second self-attention unit, and updating the parameters of the initialized anomaly analysis and prediction network according to the template learning data to generate the network with updated parameters includes:

for each operation scheduling event data pair in the template learning data, loading the static scheduling event of the pair into the first self-attention unit to generate a target static self-attention feature determined by the first self-attention unit, and loading the dynamic scheduling event of the pair into the second self-attention unit to generate a target dynamic self-attention feature determined by the second self-attention unit;

determining a target training error parameter from the target static self-attention feature and the target dynamic self-attention feature, and updating the parameters of the self-attention unit with the goal of minimizing the target training error parameter, to generate the self-attention unit with updated parameters;

updating the parameters of the fully connected output unit according to the self-attention unit with updated parameters, to generate the fully connected output unit with updated parameters.
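The source does not say how the target training error parameter is computed from the static and dynamic features. One plausible choice, shown here purely as an assumption, is a cosine-based contrastive error: matched (positive) pairs are penalized for being dissimilar, and mismatched (negative) pairs are penalized for exceeding a similarity margin. The function names and the `margin` value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_error(static_feature, dynamic_feature, is_positive, margin=0.2):
    """Error for one pair: pull positive pairs together, push negative pairs apart."""
    similarity = cosine_similarity(static_feature, dynamic_feature)
    if is_positive:
        return 1.0 - similarity
    return max(0.0, similarity - margin)


def training_error(pairs):
    """Mean error over (static_feature, dynamic_feature, is_positive) triples."""
    return sum(pair_error(s, d, p) for s, d, p in pairs) / len(pairs)


pairs = [
    ([1.0, 0.0], [1.0, 0.0], True),   # aligned positive pair: error 0
    ([1.0, 0.0], [0.0, 1.0], True),   # orthogonal positive pair: error 1
    ([1.0, 0.0], [0.0, 1.0], False),  # orthogonal negative pair: error 0
]
error = training_error(pairs)
```

In a real training loop this error would be minimized by gradient descent over the parameters of both self-attention units; the sketch stops at computing the error itself.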

In an alternative implementation, the operation scheduling event data pair contains multiple dynamic scheduling events, and loading the dynamic scheduling events of the pair into the second self-attention unit to generate the target dynamic self-attention feature determined by the second self-attention unit includes:

integrating the dynamic scheduling events to generate integrated data center operation data;

loading the integrated data center operation data into the second self-attention unit to generate the target dynamic self-attention feature determined by the second self-attention unit.

In an alternative implementation, the operation scheduling event data pair contains multiple dynamic scheduling events, and loading the dynamic scheduling events of the pair into the second self-attention unit to generate the target dynamic self-attention feature determined by the second self-attention unit includes:

loading each dynamic scheduling event into the second self-attention unit separately to generate a dynamic self-attention feature of each dynamic scheduling event determined by the second self-attention unit;

summing the dynamic self-attention features of the dynamic scheduling events to generate the target dynamic self-attention feature.
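The two alternatives above for handling multiple dynamic scheduling events (integrate first and encode once, or encode each event and sum the results) can be contrasted in a small sketch. The `encode` function is a hypothetical stand-in for the second self-attention unit that simply mean-pools rows, and treating "integration" as row concatenation is likewise an assumption.

```python
def encode(event_rows):
    """Stand-in for the second self-attention unit: mean-pool rows into one vector."""
    count = len(event_rows)
    return [sum(col) / count for col in zip(*event_rows)]


def integrate_then_encode(dynamic_events):
    """First alternative: concatenate all events, then encode the result once."""
    integrated = [row for event in dynamic_events for row in event]
    return encode(integrated)


def encode_then_sum(dynamic_events):
    """Second alternative: encode each event separately, then sum the features."""
    features = [encode(event) for event in dynamic_events]
    return [sum(col) for col in zip(*features)]


# Two made-up dynamic scheduling events with two and one rows respectively.
dynamic_events = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]
integrated_feature = integrate_then_encode(dynamic_events)
summed_feature = encode_then_sum(dynamic_events)
```

The two strategies generally give different features: integration lets the encoder attend across events, while per-event encoding followed by summation keeps events independent and scales with the number of events.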

In an alternative implementation, the first self-attention unit and the second self-attention unit are each connected to the fully connected output unit, and updating the parameters of the fully connected output unit according to the self-attention unit with updated parameters to generate the fully connected output unit with updated parameters includes:

for static scheduling data in the template learning data, determining a static self-attention feature of the static scheduling data using the first self-attention unit;

constructing target anomaly learning template data from the static self-attention feature and the anomaly category of the static scheduling data;

updating the parameters of an initialized fully connected output unit according to the target anomaly learning template data, to generate the fully connected output unit with updated parameters.
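This second training stage resembles fitting a classifier head on features from an already trained encoder. The sketch below assumes the static self-attention features have been computed and are held fixed, and trains a softmax output layer on (feature, anomaly category) pairs with plain stochastic gradient descent. The optimizer, learning rate, epoch count, and sample features are all illustrative assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(w, b, x):
    """One logit per category: w[c] . x + b[c]."""
    return [sum(wi * xi for wi, xi in zip(w[c], x)) + b[c] for c in range(len(b))]


def train_output_unit(features, labels, n_categories, lr=0.5, epochs=200):
    """Fit a softmax layer on fixed features using cross-entropy and SGD."""
    dim = len(features[0])
    w = [[0.0] * dim for _ in range(n_categories)]
    b = [0.0] * n_categories
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(w, b, x))
            for c in range(n_categories):
                grad = probs[c] - (1.0 if c == y else 0.0)  # d(loss)/d(logit_c)
                for j in range(dim):
                    w[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return w, b


def predict(w, b, x):
    """Return the index of the most likely anomaly category."""
    scores = logits_for(w, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up static self-attention features paired with anomaly category indices.
static_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
anomaly_categories = [0, 0, 1, 1]

w, b = train_output_unit(static_features, anomaly_categories, n_categories=2)
predictions = [predict(w, b, x) for x in static_features]
```

Keeping the self-attention unit fixed during this stage means only the output layer's parameters change, which matches the description of updating an initialized fully connected output unit from the target anomaly learning template data.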

According to another aspect of the embodiments of the present invention, a data center anomaly analysis system is provided. The system includes:

an acquisition module, configured to acquire target data center operation data;

a generation module, configured to load the target data center operation data into an a priori learned anomaly analysis and prediction network to generate an operation anomaly category determined by the network, where the network is generated by knowledge learning on associated data center operation data and static scheduling data;

a determination module, configured to determine anomaly diagnosis data according to the operation anomaly category.

According to another aspect of the embodiments of the present invention, a server is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the steps of any of the data center anomaly analysis methods described above when executing the computer program.

According to another aspect of the embodiments of the present invention, a readable storage medium is provided. A computer program is stored on the readable storage medium, and when the computer program is run by a processor it performs the steps of the data center anomaly analysis method described above.

To make the above objects, features, and advantages of the embodiments of the present invention clearer and easier to understand, a detailed description is given below with reference to the embodiments and the accompanying drawings.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

Figure 1 shows a schematic component diagram of a server provided by an embodiment of the present invention;

Figure 2 shows a schematic flow chart of the data center anomaly analysis method provided by an embodiment of the present invention;

Figure 3 shows a functional module block diagram of the data center anomaly analysis system provided by an embodiment of the present invention.

Detailed description

To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The terms "first", "second", "third", and so on (if present) in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the invention described herein can, for example, be practiced in orders other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.

Figure 1 shows an exemplary component diagram of the server 100. The server 100 may include one or more processors 104, such as one or more central processing units (CPUs), each of which may implement one or more hardware threads. The server 100 may also include any storage medium 106 for storing any kind of information, such as code, settings, and data. By way of non-limiting example, the storage medium 106 may include any one or a combination of the following: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, and so on. More generally, any storage medium may use any technology to store information. Further, any storage medium may provide volatile or non-volatile retention of information, and may represent a fixed or removable component of the server 100. In one case, when the processor 104 executes associated instructions stored in any storage medium or combination of storage media, the server 100 may perform any operation of those instructions. The server 100 also includes one or more drive units 108 for interacting with any storage medium, such as a hard disk drive unit or an optical disk drive unit.

The server 100 also includes input/output 110 (I/O) for receiving various inputs (via input unit 112) and for providing various outputs (via output unit 114). One particular output mechanism may include a presentation device 116 and an associated graphical user interface (GUI) 118. The server 100 may also include one or more network interfaces 120 for exchanging data with other devices via one or more communication units 122. One or more communication buses 124 couple the components described above together.

The communication unit 122 may be implemented in any manner, for example through a local area network, a wide area network (for example, the Internet), a point-to-point connection, or any combination thereof. The communication unit 122 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, and so on, governed by any protocol or combination of protocols.

Figure 2 shows a schematic flow chart of the data center anomaly analysis method provided by an embodiment of the present invention. The method can be executed by the server 100 shown in Figure 1, and its detailed steps are as follows.

Step S110: acquire target data center operation data;

Step S120: load the target data center operation data into an a priori learned anomaly analysis and prediction network to generate an operation anomaly category determined by the network, where the network is generated by knowledge learning on associated data center operation data and static scheduling data;

Step S130: determine anomaly diagnosis data according to the operation anomaly category.

Based on the above steps, this embodiment provides a method for analyzing and predicting abnormal data center operation. First, operation data of the target data center is acquired. The operation data is then loaded into an anomaly analysis and prediction network generated through a priori learning; the network is generated by knowledge learning on associated data center operation data and static scheduling data, and it outputs a corresponding operation anomaly category. Finally, anomaly diagnosis data is determined from the determined operation anomaly category. The method is intended to analyze and predict anomalies that may arise during data center operation, supporting data center management and maintenance.
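Steps S110 to S130 can be sketched end to end as follows. Everything concrete here is hypothetical: the metric layout, the category names, the diagnosis table, and especially `toy_network`, which is a simple threshold rule standing in for the trained anomaly analysis and prediction network so that the example runs on its own.

```python
from typing import Callable, Dict, List

OperationData = List[List[float]]

# Hypothetical mapping from operation anomaly category to anomaly diagnosis data.
DIAGNOSIS_TABLE: Dict[str, str] = {
    "normal": "No action required.",
    "hardware_fault": "Inspect the affected rack and check component health logs.",
    "network_interruption": "Check switch uplinks and recent routing changes.",
}


def acquire_operation_data() -> OperationData:
    """S110: return operation data; here fixed made-up [cpu_load, packet_loss] rows."""
    return [[0.42, 0.01], [0.45, 0.35], [0.44, 0.40]]


def toy_network(data: OperationData) -> str:
    """S120 stand-in: a threshold rule in place of the trained network."""
    mean_loss = sum(row[1] for row in data) / len(data)
    return "network_interruption" if mean_loss > 0.2 else "normal"


def analyze(
    data: OperationData,
    network: Callable[[OperationData], str],
    table: Dict[str, str],
) -> Dict[str, str]:
    """S120 and S130: classify the data, then look up the diagnosis data."""
    category = network(data)
    diagnosis = table.get(category, "Unknown category; escalate to an operator.")
    return {"category": category, "diagnosis": diagnosis}


result = analyze(acquire_operation_data(), toy_network, DIAGNOSIS_TABLE)
```

Passing the network in as a callable keeps the diagnosis lookup independent of how the category is produced, so a trained model can replace `toy_network` without changing `analyze`.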

Figure 3 shows a functional module diagram of the data center anomaly analysis system 200 provided by an embodiment of the present invention. The functions implemented by the data center anomaly analysis system 200 correspond to the steps performed by the above method. The data center anomaly analysis system 200 can be understood as the server 100 or the processor of the server 100, or as a component that is independent of the server 100 or processor and implements the functions of the present invention under the control of the server 100. As shown in Figure 3, the functions of each functional module of the data center anomaly analysis system 200 are described in detail below.

The acquisition module 210 is configured to acquire target data center operation data;

The generation module 220 is configured to load the target data center operation data into an a priori learned anomaly analysis and prediction network to generate an operation anomaly category determined by the network, where the network is generated by knowledge learning on associated data center operation data and static scheduling data;

The determination module 230 is configured to determine anomaly diagnosis data according to the operation anomaly category.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the present invention.

Claims

1. A data center anomaly analysis method, characterized in that the method includes:

acquiring target data center operation data;

loading the target data center operation data into an a priori learned anomaly analysis and prediction network to generate an operation anomaly category determined by the anomaly analysis and prediction network, wherein the anomaly analysis and prediction network is generated by knowledge learning on associated data center operation data and static scheduling data;

determining anomaly diagnosis data according to the operation anomaly category.

2. The data center anomaly analysis method according to claim 1, characterized in that the anomaly analysis and prediction network includes a self-attention unit and a fully connected output unit, and loading the target data center operation data into the a priori learned anomaly analysis and prediction network to generate the operation anomaly category determined by the anomaly analysis and prediction network includes:

loading the target data center operation data into the self-attention unit to generate a target self-attention feature determined by the self-attention unit;

loading the target self-attention feature into the fully connected output unit to generate the operation anomaly category determined by the fully connected output unit.

3. The data center anomaly analysis method according to claim 2, characterized in that the training of the anomaly analysis and prediction network includes:

generating positive learning features and negative learning features from linked operation scheduling event data in data center log data, and generating template learning data from the positive learning features and the negative learning features;

updating the parameters of an initialized anomaly analysis and prediction network according to the template learning data, to generate the anomaly analysis and prediction network with updated parameters.

4. The data center anomaly analysis method according to claim 3, characterized in that generating the positive learning features and negative learning features from the linked operation scheduling event data in the data center log data includes:

obtaining the linked operation scheduling event data in the data center log data as basic training template data;

performing a regularized conversion on the basic training template data to generate the positive learning features;

randomly scrambling the data center operation data and noise features in the data center log data to generate the negative learning features.

5. The data center anomaly analysis method according to claim 3, characterized in that the self-attention unit includes a first self-attention unit and a second self-attention unit, and updating the parameters of the initialized anomaly analysis and prediction network according to the template learning data to generate the anomaly analysis and prediction network with updated parameters includes:

for each operation scheduling event data pair in the template learning data, loading the static scheduling event of the pair into the first self-attention unit to generate a target static self-attention feature determined by the first self-attention unit, and loading the dynamic scheduling event of the pair into the second self-attention unit to generate a target dynamic self-attention feature determined by the second self-attention unit;

determining a target training error parameter from the target static self-attention feature and the target dynamic self-attention feature, and updating the parameters of the self-attention unit with the goal of minimizing the target training error parameter, to generate the self-attention unit with updated parameters;

updating the parameters of the fully connected output unit according to the self-attention unit with updated parameters, to generate the fully connected output unit with updated parameters.

6. The data center anomaly analysis method according to claim 5, characterized in that the operation scheduling event data pair contains a plurality of dynamic scheduling events, and loading the dynamic scheduling events of the pair into the second self-attention unit to generate the target dynamic self-attention feature determined by the second self-attention unit includes:

integrating the dynamic scheduling events to generate integrated data center operation data;

loading the integrated data center operation data into the second self-attention unit to generate the target dynamic self-attention feature determined by the second self-attention unit.

7. The data center anomaly analysis method according to claim 5, characterized in that the operation scheduling event data pair contains a plurality of dynamic scheduling events, and loading the dynamic scheduling events of the pair into the second self-attention unit to generate the target dynamic self-attention feature determined by the second self-attention unit includes:

loading each dynamic scheduling event into the second self-attention unit separately to generate a dynamic self-attention feature of each dynamic scheduling event determined by the second self-attention unit;

summing the dynamic self-attention features of the dynamic scheduling events to generate the target dynamic self-attention feature.

8. The data center anomaly analysis method according to claim 5, characterized in that the first self-attention unit and the second self-attention unit are each connected to the fully connected output unit, and updating the parameters of the fully connected output unit according to the self-attention unit with updated parameters to generate the fully connected output unit with updated parameters includes:

for static scheduling data in the template learning data, determining a static self-attention feature of the static scheduling data using the first self-attention unit;

constructing target anomaly learning template data from the static self-attention feature and the anomaly category of the static scheduling data;

updating the parameters of an initialized fully connected output unit according to the target anomaly learning template data, to generate the fully connected output unit with updated parameters.

9. A data center anomaly analysis system, characterized in that the system includes:

an acquisition module, configured to acquire target data center operation data;

a generation module, configured to load the target data center operation data into an a priori learned anomaly analysis and prediction network to generate an operation anomaly category determined by the anomaly analysis and prediction network, wherein the anomaly analysis and prediction network is generated by knowledge learning on associated data center operation data and static scheduling data;

a determination module, configured to determine anomaly diagnosis data according to the operation anomaly category.

10. A server, characterized in that it includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the steps of the data center anomaly analysis method according to any one of claims 1-8 when executing the computer program.