CN106656636A - Cloud platform fault detection method and device

Info

Publication number: CN106656636A
Application number: CN201710096134.1A
Authority: CN
Inventors: 陈彦灵; 吴安; 石江涛
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-02-22
Filing date: 2017-02-22
Publication date: 2017-05-10

Abstract

The invention discloses a cloud platform fault detection method comprising the following steps: in each detection period, when a set fault detection trigger condition is reached, determining one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource; for each hardware resource, starting a detection process or a detection virtual machine according to the detection mode corresponding to that hardware resource; performing fault detection on the hardware resource through the detection process or the detection virtual machine, using the detection method corresponding to that hardware resource; and determining, according to the detection result, whether the hardware resource is faulty. Applying the method provided by the embodiments of the invention makes it possible to detect faults in the hardware resources of a cloud platform and find them in time, which safeguards the normal operation of the cloud platform, reduces its maintenance cost, and improves the availability of the data center. The invention also discloses a cloud platform fault detection device, which has corresponding technical effects.

Description

A cloud platform fault detection method and device

Technical field

The invention relates to the technical field of cloud computing, and in particular to a cloud platform fault detection method and device.

Background

With the rapid development of cloud computing, techniques for the unified management and orchestration of computing, storage and network resources have become increasingly mature. As a result, the elements of a cloud platform that are built on hardware resources, such as compute, storage, networking and the virtualization operating system, are tightly integrated and deployed at ever larger scale.

How to discover hardware resource faults in a cloud platform in time is a technical problem that those skilled in the art urgently need to solve.

Summary of the invention

The purpose of the invention is to provide a cloud platform fault detection method and device that detect faults in the hardware resources of a cloud platform and find them in time, thereby safeguarding the normal operation of the cloud platform, reducing its maintenance cost, and improving the availability of the data center.

To solve the above technical problem, the invention provides the following technical solutions:

A cloud platform fault detection method, comprising:

in each detection period, when a set fault detection trigger condition is reached, determining one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource;

for each hardware resource, starting a detection process or a detection virtual machine according to the detection mode corresponding to that hardware resource;

performing fault detection on the hardware resource through the detection process or the detection virtual machine, using the detection method corresponding to that hardware resource;

determining, according to the detection result, whether the hardware resource is faulty.

In a specific embodiment of the invention, determining, when the set fault detection trigger condition is reached, the one or more hardware resources currently to be detected and the detection mode and detection method corresponding to each hardware resource comprises:

when a set target detection time point is reached, determining, according to a preset coverage strategy, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource, where the target detection time point is any one of the one or more detection time points contained in the detection period, and the coverage strategy is set on the principle that fault detection of all hardware resources in the cloud platform is completed within one detection period.

Alternatively, when an abnormal event of the cloud platform is captured, analyzing the abnormal event and predicting the fault type corresponding to it;

determining, according to the fault type, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource.

In a specific embodiment of the invention, the method further comprises:

at the end of each detection period, deregistering the detection process or the detection virtual machine.

In a specific embodiment of the invention, when the hardware resource is determined to be faulty, the method further comprises:

reporting the fault to a designated system of the cloud platform.

A cloud platform fault detection device, comprising:

a detection-related determination module, configured to determine, in each detection period and when a set fault detection trigger condition is reached, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource;

a startup module, configured to start, for each hardware resource, a detection process or a detection virtual machine according to the detection mode corresponding to that hardware resource;

a fault detection module, configured to perform fault detection on the hardware resource through the detection process or the detection virtual machine, using the detection method corresponding to that hardware resource;

a fault determination module, configured to determine, according to the detection result, whether the hardware resource is faulty.

In a specific embodiment of the invention, the detection-related determination module is specifically configured to:

A deregistration module, configured to deregister the detection process or the detection virtual machine at the end of each detection period.

A fault reporting module, configured to report the fault to a designated system of the cloud platform when the hardware resource is determined to be faulty.

With the technical solution provided by the embodiments of the invention, in each detection period, when a set fault detection trigger condition is reached, one or more hardware resources currently to be detected are determined, together with the detection mode and detection method corresponding to each hardware resource. For each hardware resource, a detection process or a detection virtual machine is started according to the corresponding detection mode, and fault detection is performed on the hardware resource through that detection process or detection virtual machine using the corresponding detection method, to determine whether the hardware resource is faulty. In this way, faults in the hardware resources of the cloud platform can be detected and found in time, which safeguards the normal operation of the cloud platform, reduces its maintenance cost, and improves the availability of the data center.

Description of the drawings

To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is an implementation flowchart of a cloud platform fault detection method in an embodiment of the invention;

Fig. 2 is a schematic diagram of the hardware resource deployment structure of a cloud platform in an embodiment of the invention;

Fig. 3 is a schematic structural diagram of a cloud platform fault detection device in an embodiment of the invention.

Detailed description

To help those skilled in the art better understand the solution of the invention, the invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

Fig. 1 is an implementation flowchart of a cloud platform fault detection method provided by an embodiment of the invention. The method may include the following steps:

S110: In each detection period, when a set fault detection trigger condition is reached, determine one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource.

A cloud platform can contain many hardware resources. To keep the cloud platform running normally, fault detection can be performed on its hardware resources once per detection period, so that existing faults are found in time and handled.

As shown in Fig. 2, in a cloud platform a server generally has one or more uplinks to a network device, such as an Ethernet switch. A storage resource can act as a hard disk of a server and reach the network device through the server's uplink, or it can act as an independent storage server connected to the network device through multiple links. There can be multiple paths from the network device to which a server or storage resource is attached to its upstream network devices. These network devices can be box-type or rack-mounted devices, and a rack-mounted device can have multiple service boards connected to servers or the Internet through multiple network interfaces.

In practice, the detection period can be set as needed, for example one week or one day.

In each detection period, when a set fault detection trigger condition is reached, one or more hardware resources currently to be detected can be determined. That is, the scope of the hardware resources to be detected is determined, covering the hardware resources that applications use, such as servers, network and storage. The detection mode and detection method corresponding to each hardware resource must be determined at the same time. Which type of hardware resource corresponds to which detection mode and which detection method can be preset according to the actual situation.

In the embodiments of the invention, the detection mode may be a detection process or a detection virtual machine. The detection method may be: starting a web server and web client for detection, starting a database server and client for detection, starting a general-purpose network test tool, or starting a general-purpose hard disk test tool.

In a specific embodiment of the invention, when a set target detection time point is reached, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource, are determined according to a preset coverage strategy. The target detection time point is any one of the one or more detection time points contained in the detection period, and the coverage strategy is set on the principle that fault detection of all hardware resources in the cloud platform is completed within one detection period.

In the embodiments of the invention, a detection period may contain one or more detection time points, and the target detection time point is any one of them. When the target detection time point is reached, fault detection of the cloud platform's hardware resources is started.

The coverage strategy can be set according to the hardware resources the cloud platform actually contains, on the principle that fault detection of all hardware resources in the cloud platform is completed within one detection period. That is, the hardware resources to be detected at each detection time point of a detection period, and the detection mode and detection method of each hardware resource, are preset so that all hardware resources in the cloud platform are detected once within one detection period. Which detection mode and detection method a given hardware resource uses can be chosen at random, or assigned according to some allocation principle, such as equal allocation, weighted allocation, or using the detection modes and detection methods in turn.
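A coverage strategy of this kind can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `build_coverage_plan`, the resource names and the time points are all hypothetical. It assigns every resource to exactly one time point in round-robin order and uses the detection modes in turn.

```python
from itertools import cycle


def build_coverage_plan(resources, time_points, modes=("process", "vm")):
    """Spread resources over the time points of one detection period.

    Every resource is assigned to exactly one time point (round-robin), so
    the whole platform is covered once per period. Detection modes are
    used in turn, one of the allocation options mentioned in the text.
    """
    if not time_points:
        raise ValueError("a detection period needs at least one time point")
    plan = {tp: [] for tp in time_points}
    mode_iter = cycle(modes)
    for index, resource in enumerate(resources):
        tp = time_points[index % len(time_points)]
        plan[tp].append((resource, next(mode_iter)))
    return plan


# Hypothetical inventory and schedule, for illustration only.
resources = ["server-1", "server-2", "switch-1", "disk-1", "disk-2"]
plan = build_coverage_plan(resources, ["00:00", "08:00", "16:00"])
covered = {name for entries in plan.values() for name, _mode in entries}
```

Here `covered` contains every entry of `resources`, which is the property the coverage principle requires. A weighted or random allocation would replace the round-robin indexing.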

In another specific embodiment of the invention, when an abnormal event of the cloud platform is captured, the abnormal event is analyzed and the corresponding fault type is predicted. According to the fault type, one or more hardware resources currently to be detected, and the detection mode and detection method of each hardware resource, are determined.

Within a detection period, if an abnormal event of the cloud platform is captured, such as an abnormal temperature alarm, an abnormal humidity alarm, an abnormal vibration alarm or a hardware alarm, the set fault detection trigger condition can be considered reached. Analyzing the abnormal event makes it possible to predict the fault it may cause and the corresponding fault type, such as a resource configuration fault, physical damage to a component, or an environmental fault. According to the fault type, one or more hardware resources currently to be detected, and the detection mode and detection method of each hardware resource, can be determined. For example, if the abnormal event relates to a hard disk and the fault type is a resource configuration fault, the hardware resource to be detected is the hard disk, the detection mode for the hard disk is a detection process, and the detection method for the hard disk is starting a general-purpose hard disk test tool.

The correspondence between abnormal events and fault types, and the range of hardware resources, the detection mode and the detection method corresponding to each fault type, can be preset.
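The preset correspondences can be pictured as two lookup tables. The sketch below assumes hypothetical event names and table contents; only the hard disk entry mirrors the example given in the text, and the other entries are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTarget:
    resource: str  # hardware resource to detect
    mode: str      # detection mode: "process" or "vm"
    method: str    # detection method to run


# Hypothetical event kinds; fault type names follow the examples in the text.
EVENT_TO_FAULT_TYPE = {
    "disk_config_alarm": "resource_configuration",
    "vibration_alarm": "physical_damage",
    "temperature_alarm": "environment",
}

FAULT_TYPE_TO_TARGETS = {
    # Mirrors the hard disk example in the text.
    "resource_configuration": [
        ProbeTarget("hard_disk", "process", "disk_test_tool"),
    ],
    # The entries below are invented for illustration.
    "physical_damage": [
        ProbeTarget("hard_disk", "process", "disk_test_tool"),
    ],
    "environment": [
        ProbeTarget("server", "vm", "web_server_and_client"),
        ProbeTarget("network", "process", "network_test_tool"),
    ],
}


def targets_for_event(event_kind):
    """Return (fault_type, targets); (None, []) if the event is unknown."""
    fault_type = EVENT_TO_FAULT_TYPE.get(event_kind)
    if fault_type is None:
        return None, []
    return fault_type, FAULT_TYPE_TO_TARGETS.get(fault_type, [])


fault_type, targets = targets_for_event("disk_config_alarm")
```

Keeping the tables as plain data makes the preset mapping easy to adjust without touching the lookup logic.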

S120: For each hardware resource, start a detection process or a detection virtual machine according to the detection mode corresponding to that hardware resource.

According to step S110, the detection mode corresponding to each hardware resource currently to be detected, either a detection process or a detection virtual machine, can be determined. For each hardware resource, a detection process or a detection virtual machine is started according to the corresponding detection mode.
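Choosing between the two detection modes amounts to a dispatch on the mode. The sketch below is an assumption-laden illustration: the names `start_probe`, `start_probe_process` and `start_probe_vm` are hypothetical, the process launcher uses Python's standard `subprocess` module, and the virtual machine launcher is left as a placeholder because the text does not name a virtualization API.

```python
import subprocess
import sys


def start_probe_process(command):
    """Start a detection process and capture its standard output."""
    return subprocess.Popen(command, stdout=subprocess.PIPE, text=True)


def start_probe_vm(image):
    """Placeholder: a real deployment would call its virtualization API."""
    raise NotImplementedError("no virtualization backend in this sketch")


LAUNCHERS = {"process": start_probe_process, "vm": start_probe_vm}


def start_probe(mode, spec):
    """Start a detection process or detection VM according to the mode."""
    try:
        launcher = LAUNCHERS[mode]
    except KeyError:
        raise ValueError(f"unknown detection mode: {mode!r}") from None
    return launcher(spec)


# Illustrative run: the "probe" is a trivial Python one-liner.
proc = start_probe("process", [sys.executable, "-c", "print('probe ready')"])
output, _ = proc.communicate(timeout=30)
```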

S130: Perform fault detection on the hardware resource through the detection process or the detection virtual machine, using the detection method corresponding to that hardware resource.

According to step S110, the detection method corresponding to each hardware resource currently to be detected can be determined. Fault detection is performed on the hardware resource through the detection process or the detection virtual machine using that detection method. For example, to simulate an application, a detection process or a detection virtual machine is started on a physical machine; within it, a web server and web client for detection and a database server and client for detection are started; and, provided that normal applications are not affected, a standard network or hard disk test program is run for a short time to measure the packet loss rate of the network and the IOPS (input/output operations per second) of the hard disk.

S140: Determine, according to the detection result, whether the hardware resource is faulty.

For each hardware resource currently to be detected, after the detection process or the detection virtual machine is started, fault detection is performed on the hardware resource through it using the corresponding detection method, and the detection result shows whether the hardware resource is faulty.

When the hardware resource is determined to be faulty, the fault can be reported to a designated system of the cloud platform for follow-up processing. For example, it can be sent to a fault alarm system, which outputs the fault to operations staff so that they can handle it promptly, or to a fault repair system, which repairs the fault automatically.
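As a rough illustration of turning a detection result into a fault decision, the sketch below computes a network packet loss rate and a disk I/O rate and compares them with thresholds. The threshold values and the function names are assumptions made for this example; the text does not specify any.

```python
def packet_loss_rate(sent, received):
    """Fraction of probe packets that were not answered."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - received) / sent


def iops(operations, seconds):
    """Completed disk I/O operations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return operations / seconds


def is_faulty(result, max_loss=0.01, min_iops=100.0):
    """Compare a detection result with thresholds (assumed values)."""
    if "loss_rate" in result and result["loss_rate"] > max_loss:
        return True
    if "iops" in result and result["iops"] < min_iops:
        return True
    return False


# Hypothetical measurements from a network probe and a disk probe.
network_result = {"loss_rate": packet_loss_rate(sent=200, received=150)}
disk_result = {"iops": iops(operations=5000, seconds=10)}
```

With these numbers the network probe exceeds the assumed loss threshold and is flagged, while the disk probe is not.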
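Reporting can be sketched as handing a fault record to whichever designated systems are configured. In this hypothetical example the alarm and repair systems are stood in for by plain lists, and the record fields are invented for illustration.

```python
def report_fault(fault, sinks):
    """Send a fault record to every configured system; return their names."""
    delivered = []
    for name, send in sinks.items():
        send(fault)
        delivered.append(name)
    return delivered


alerts, repair_queue = [], []  # stand-ins for real alarm and repair systems
sinks = {"fault_alarm": alerts.append, "fault_repair": repair_queue.append}
fault = {"resource": "disk-1", "method": "disk_test_tool", "detail": "low IOPS"}
delivered = report_fault(fault, sinks)
```

Configuring a single entry in `sinks` corresponds to reporting to only one of the two systems.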

In one embodiment of the invention, the method may further include the following step:

At the end of each detection period, deregister the detection process or the detection virtual machine.

At the end of each detection period, the detection processes or detection virtual machines started in that period are deregistered, so that their resources are released and reclaimed promptly and the cloud platform's resources are not wasted.
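For detection processes, this cleanup can be sketched with a context manager that tracks what was started during a period and stops it when the period ends. The name `detection_period` is hypothetical, and the long-sleeping child process merely stands in for a real probe.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def detection_period():
    """Track probes started in one period and stop them when it ends."""
    started = []
    try:
        yield started
    finally:
        for proc in started:
            if proc.poll() is None:  # still running
                proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()  # fall back to a forced stop
                proc.wait()


with detection_period() as probes:
    # Stand-in probe that would otherwise outlive the period.
    probes.append(
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    )

still_running = [p for p in probes if p.poll() is None]
```

After the `with` block exits, `still_running` is empty, so no probe outlives its period. A detection virtual machine would need the equivalent shutdown call of the virtualization platform in use.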

With the method provided by the embodiments of the invention, in each detection period, when a set fault detection trigger condition is reached, one or more hardware resources currently to be detected are determined, together with the detection mode and detection method corresponding to each hardware resource. For each hardware resource, a detection process or a detection virtual machine is started according to the corresponding detection mode, and fault detection is performed on the hardware resource through that detection process or detection virtual machine using the corresponding detection method, to determine whether the hardware resource is faulty. In this way, faults in the hardware resources of the cloud platform can be detected and found in time, which safeguards the normal operation of the cloud platform, reduces its maintenance cost, and improves the availability of the data center.

Corresponding to the method embodiment above, an embodiment of the invention also provides a cloud platform fault detection device. The device described below and the method described above can be read with reference to each other.

As shown in Fig. 3, the device includes the following modules:

a detection-related determination module 310, configured to determine, in each detection period and when a set fault detection trigger condition is reached, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource;

a startup module 320, configured to start, for each hardware resource, a detection process or a detection virtual machine according to the detection mode corresponding to that hardware resource;

a fault detection module 330, configured to perform fault detection on the hardware resource through the detection process or the detection virtual machine, using the detection method corresponding to that hardware resource;

a fault determination module 340, configured to determine, according to the detection result, whether the hardware resource is faulty.
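The way the four modules cooperate can be sketched as a small pipeline. This is an illustrative composition only: the class `FaultDetectionDevice`, its field names and the stub callables are hypothetical, with each field standing in for one of the modules 310 to 340.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class FaultDetectionDevice:
    determine: Callable[[], Iterable[tuple]]    # stands in for module 310
    start: Callable[[str], object]              # stands in for module 320
    probe: Callable[[object, str, str], dict]   # stands in for module 330
    decide: Callable[[dict], bool]              # stands in for module 340

    def run_once(self):
        """Run one trigger: determine, start, probe, decide per resource."""
        faulty = []
        for resource, mode, method in self.determine():
            runner = self.start(mode)
            result = self.probe(runner, resource, method)
            if self.decide(result):
                faulty.append(resource)
        return faulty


# Stub modules, for illustration only.
device = FaultDetectionDevice(
    determine=lambda: [("disk-1", "process", "disk_test_tool"),
                       ("switch-1", "vm", "network_test_tool")],
    start=lambda mode: f"{mode}-runner",
    probe=lambda runner, resource, method: {"ok": resource != "disk-1"},
    decide=lambda result: not result["ok"],
)
faulty = device.run_once()
```

Passing the modules in as callables keeps each one replaceable, which matches the text's point that the functions may be realized in hardware, software or both.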

With the device provided by the embodiments of the invention, in each detection period, when a set fault detection trigger condition is reached, one or more hardware resources currently to be detected are determined, together with the detection mode and detection method corresponding to each hardware resource. For each hardware resource, a detection process or a detection virtual machine is started according to the corresponding detection mode, and fault detection is performed on the hardware resource through that detection process or detection virtual machine using the corresponding detection method, to determine whether the hardware resource is faulty. In this way, faults in the hardware resources of the cloud platform can be detected and found in time, which safeguards the normal operation of the cloud platform, reduces its maintenance cost, and improves the availability of the data center.

In a specific embodiment of the invention, the detection-related determination module 310 is specifically configured to:

when a set target detection time point is reached, determine, according to a preset coverage strategy, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource, where the target detection time point is any one of the one or more detection time points contained in the detection period, and the coverage strategy is set on the principle that fault detection of all hardware resources in the cloud platform is completed within one detection period.

Alternatively, when an abnormal event of the cloud platform is captured, analyze the abnormal event and predict the fault type corresponding to it;

determine, according to the fault type, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource.

A deregistration module, configured to deregister the detection process or the detection virtual machine at the end of each detection period.

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may implement the described functions differently for each specific application, but such implementations should not be regarded as going beyond the scope of the invention.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in software modules executed by a processor, or in a combination of the two. Software modules may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Specific examples are used herein to explain the principles and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the technical solution of the invention and its core idea. It should be noted that those of ordinary skill in the art can make improvements and modifications to the invention without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of the invention.

Claims

1. A cloud platform fault detection method, characterized by comprising:

in each detection period, when a set fault detection trigger condition is reached, determining one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource;

for each hardware resource, starting a detection process or a detection virtual machine according to the detection mode corresponding to that hardware resource;

performing fault detection on the hardware resource through the detection process or the detection virtual machine, using the detection method corresponding to that hardware resource;

determining, according to the detection result, whether the hardware resource is faulty.

2. The cloud platform fault detection method according to claim 1, characterized in that determining, when the set fault detection trigger condition is reached, the one or more hardware resources currently to be detected and the detection mode and detection method corresponding to each hardware resource comprises:

when a set target detection time point is reached, determining, according to a preset coverage strategy, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource, where the target detection time point is any one of the one or more detection time points contained in the detection period, and the coverage strategy is set on the principle that fault detection of all hardware resources in the cloud platform is completed within one detection period.

3. The cloud platform fault detection method according to claim 1, characterized in that determining, when the set fault detection trigger condition is reached, the one or more hardware resources currently to be detected and the detection mode and detection method corresponding to each hardware resource comprises:

when an abnormal event of the cloud platform is captured, analyzing the abnormal event and predicting the fault type corresponding to it;

determining, according to the fault type, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource.

4. The cloud platform fault detection method according to any one of claims 1 to 3, further comprising:

at the end of each detection period, deregistering the detection process or the detection virtual machine.

5. The cloud platform fault detection method according to claim 4, characterized in that, when the hardware resource is determined to be faulty, the method further comprises:

reporting the fault to a designated system of the cloud platform.

6. A cloud platform fault detection device, characterized by comprising:

a detection-related determination module, configured to determine, in each detection period and when a set fault detection trigger condition is reached, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource;

a startup module, configured to start, for each hardware resource, a detection process or a detection virtual machine according to the detection mode corresponding to that hardware resource;

a fault detection module, configured to perform fault detection on the hardware resource through the detection process or the detection virtual machine, using the detection method corresponding to that hardware resource;

a fault determination module, configured to determine, according to the detection result, whether the hardware resource is faulty.

7. The cloud platform fault detection device according to claim 6, characterized in that the detection-related determination module is specifically configured to:

when a set target detection time point is reached, determine, according to a preset coverage strategy, one or more hardware resources currently to be detected, and the detection mode and detection method corresponding to each hardware resource, where the target detection time point is any one of the one or more detection time points contained in the detection period, and the coverage strategy is set on the principle that fault detection of all hardware resources in the cloud platform is completed within one detection period.

8. The cloud platform fault detection device according to claim 6, characterized in that the detection-related determination module is specifically configured to:

9. The cloud platform fault detection device according to any one of claims 6 to 8, further comprising:

a deregistration module, configured to deregister the detection process or the detection virtual machine at the end of each detection period.

10. The cloud platform fault detection device according to claim 9, characterized by further comprising:

a fault reporting module, configured to report the fault to a designated system of the cloud platform when the hardware resource is determined to be faulty.