CN114428703A - IO fault location method, apparatus, device and computer readable storage medium

Info

Publication number: CN114428703A
Application number: CN202011104291.0A
Authority: CN
Inventors: 戴伟; 郭岳; 周勋; 吴天东; 陈琪
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2022-05-03
Anticipated expiration: 2040-10-15
Also published as: CN114428703B

Abstract

Embodiments of the invention relate to the technical field of distributed storage and disclose an IO fault location method, apparatus, device, and computer-readable storage medium. The method comprises: obtaining the IO delay of a front-end computing node and the IO delay threshold of an external disk of the computing node; if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a preset value range, obtaining the IO delay of the logical unit number and the IO delay threshold of the logical unit number; if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a preset value range, locating the IO fault at the front end; and if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a preset value range, locating the IO fault at the back end. In this way, embodiments of the invention locate IO faults in distributed storage quickly and accurately.

Description

IO fault location method, apparatus, device and computer readable storage medium

Technical Field

Embodiments of the present invention relate to the technical field of distributed storage, and in particular to an IO fault location method, apparatus, device, and computer-readable storage medium.

Background

Distributed storage is a currently popular storage approach. It interconnects multiple storage servers over a network so that storage space can be allocated flexibly, which can improve storage efficiency and storage capacity.

A distributed storage system generally includes front-end computing nodes and back-end storage nodes. Various applications can run on the front-end computing nodes, and the back-end storage nodes store the data of the front-end computing nodes. IO faults occur frequently while a distributed storage system is running. To keep the system running normally, each IO fault must be located so that it can be resolved. In the course of implementing the embodiments of the present invention, the inventors found that, in the related art, locating an IO fault usually relies on the alarm information provided by the distributed storage system and on the experience of technicians; the means of locating IO faults are limited, and the locating process is inefficient.

Summary of the Invention

In view of the above problems, embodiments of the present invention provide an IO fault location method, apparatus, device, and computer-readable storage medium, which address the low efficiency of the IO fault location process for distributed storage in the prior art.

According to one aspect of the embodiments of the present invention, an IO fault location method is provided, where the IO fault is an IO fault of distributed storage, and the method includes:

obtaining the IO delay of a front-end computing node and the IO delay threshold of an external disk of the computing node;

if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, further obtaining the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number;

if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, locating the IO fault at the front end;

if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range, locating the IO fault at the back end.

In an optional manner, after the IO fault is located at the front end, the method further includes:

if the computing node is a virtual machine, separately obtaining the IO delay of other computing nodes that correspond to the same logical unit number as the computing node, the IO delay of other virtual machines that correspond to the same physical machine as the computing node, and the IO delay of other physical machines that correspond to the same logical unit number as the physical machine where the computing node is located;

if the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node is determined to be abnormal, locating the IO fault in the interconnection network between the front end and the back end;

if the IO delay of the other virtual machines that correspond to the same physical machine as the computing node is determined to be normal, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine where the computing node is located is determined to be normal, locating the IO fault at the computing node;

if the IO delay of the other virtual machines that correspond to the same physical machine as the computing node is determined to be abnormal, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine where the computing node is located is determined to be normal, locating the IO fault at the physical machine where the computing node is located;

if the computing node is a physical machine, obtaining the IO delay of other computing nodes that correspond to the same logical unit number as the computing node;

if the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node is determined to be normal, locating the IO fault at the computing node.

In an optional manner, after the IO fault is located at the back end, the method further includes:

obtaining the ping network delay of each back-end storage node, the average ping network delay of all back-end storage nodes, and the ping network delay threshold of the back-end storage nodes;

if the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a fifth preset value range, and the difference between the ping network delay of every back-end storage node and the ping network delay threshold is within a sixth preset value range, locating the IO fault in the interconnection network between the back-end storage nodes;

if the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a seventh preset value range, the difference between the ping network delay of each of a preset number of back-end storage nodes and the ping network delay threshold is within an eighth preset value range, and the ping network delay of the back-end storage nodes other than the preset number of back-end storage nodes is not greater than the ping network delay threshold, locating the IO fault in the network of the preset number of back-end storage nodes;

if the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a ninth preset value range, obtaining the IO delay of the built-in disks of all back-end storage nodes and the IO delay threshold corresponding to each built-in disk, identifying as an abnormal built-in disk any built-in disk for which the difference between its IO delay and its IO delay threshold is within a tenth preset value range, and locating the IO fault at the abnormal built-in disk.
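As a non-limiting illustration, the three back-end cases above can be sketched in Python as follows. The text here does not give concrete values for the fifth to tenth preset value ranges, so the `ValueRange` class, the `ranges` dictionary keyed by ordinal, the order in which the cases are tested, and the reading of "a preset number" as an exact count are all assumptions made only for this sketch.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ValueRange:
    """Interval (low, high]; a hypothetical encoding of a 'preset value range'."""
    low: float
    high: float = float("inf")

    def contains(self, value: float) -> bool:
        return self.low < value <= self.high


def locate_backend_fault(ping_delays, ping_threshold, disk_delays,
                         disk_thresholds, ranges, preset_count):
    """Return (location, suspects) for a fault already located at the back end.

    ping_delays: {storage node name: ping network delay}
    disk_delays / disk_thresholds: {built-in disk name: IO delay / threshold}
    ranges: {5..10: ValueRange}, caller-supplied (assumed) preset ranges
    """
    avg_diff = mean(ping_delays.values()) - ping_threshold
    diffs = {node: delay - ping_threshold for node, delay in ping_delays.items()}

    # Fifth and sixth ranges: the average and every single node are off.
    if ranges[5].contains(avg_diff) and all(
            ranges[6].contains(d) for d in diffs.values()):
        return "interconnection network between storage nodes", []

    # Seventh and eighth ranges: only a preset number of nodes are off,
    # and all remaining nodes stay at or below the threshold.
    slow = sorted(n for n, d in diffs.items() if ranges[8].contains(d))
    others_ok = all(ping_delays[n] <= ping_threshold
                    for n in ping_delays if n not in slow)
    if ranges[7].contains(avg_diff) and len(slow) == preset_count and others_ok:
        return "network of specific storage nodes", slow

    # Ninth and tenth ranges: inspect the built-in disks instead.
    if ranges[9].contains(avg_diff):
        bad = sorted(disk for disk, delay in disk_delays.items()
                     if ranges[10].contains(delay - disk_thresholds[disk]))
        return "built-in disk", bad

    return "undetermined", []
```

The function returns the suspected nodes or disks alongside the location so that a display component could list them; that pairing is a design choice of the sketch, not something stated in the text.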

According to another aspect of the embodiments of the present invention, an IO fault location apparatus is provided, where the IO fault is an IO fault of distributed storage, and the apparatus includes:

a first obtaining module, configured to obtain the IO delay of a front-end computing node and the IO delay threshold of an external disk of the computing node;

a second obtaining module, configured to obtain, if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number;

a locating module, configured to locate the IO fault at the front end if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range; and to locate the IO fault at the back end if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range.

According to another aspect of the embodiments of the present invention, a computing device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus;

the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations of the above IO fault location method.

According to another aspect of the embodiments of the present invention, an IO fault location device is provided, where the IO fault is an IO fault of distributed storage, and the device includes:

a collection module, configured to collect IO delay information;

a configuration module, configured to configure IO delay thresholds;

a locating module, configured to execute the above IO fault location method so as to locate the IO fault;

a display module, configured to obtain the result of the locating module locating the IO fault and to display the result.

In an optional manner, the display module is further configured to:

display the IO delay information collected by the collection module, and display the correspondence between the external disks of the front-end computing nodes and the logical unit numbers of the back-end storage cluster.

In an optional manner, the configuration module is further configured to:

obtain the configuration information of the front-end computing nodes and the configuration information of the back-end storage cluster, and send both to the locating module;

the locating module is further configured to locate the IO fault according to the configuration information of the front-end computing nodes and the configuration information of the back-end storage cluster.

According to yet another aspect of the embodiments of the present invention, a computer-readable storage medium is provided, where at least one executable instruction is stored in the storage medium, and the executable instruction, when run on a computing device, causes the computing device to perform the operations of the above IO fault location method.

In the embodiments of the present invention, the IO delay of a front-end computing node and the IO delay threshold of an external disk of the computing node are obtained; if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number are further obtained; if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, the IO fault is located at the front end; if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range, the IO fault is located at the back end. It can be seen that the embodiments of the present invention can locate IO faults quickly and accurately by comparing the IO delay of the front-end computing node with the IO delay of the back-end logical unit number.

The above is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the contents of the description, and so that the above and other objects, features, and advantages of the embodiments are more readily apparent, specific embodiments of the present invention are set out below.

Description of Drawings

The drawings are provided only to illustrate the embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference signs denote the same components. In the drawings:

FIG. 1 shows a schematic flowchart of the IO fault location method provided by an embodiment of the present invention;

FIG. 2 shows a schematic structural diagram of the IO fault location apparatus provided by an embodiment of the present invention;

FIG. 3 shows a schematic structural diagram of the computing device provided by an embodiment of the present invention;

FIG. 4 shows a schematic structural diagram of the IO fault location device provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present invention, it should be understood that the present invention may be implemented in various forms and should not be limited by the embodiments set forth here.

FIG. 1 shows a flowchart of an embodiment of the IO fault location method of the present invention; the method is executed by a computing device. In this embodiment, executable instructions are stored in the storage space of the computing device, and the executable instructions cause the processor to execute the IO fault location method. As shown in FIG. 1, the method locates IO faults of distributed storage and includes the following steps:

Step 110: Obtain the IO delay of a front-end computing node and the IO delay threshold of an external disk of the computing node.

The front-end computing node may be any type of PC server, for example an x86 server. The external disk of a computing node resides on a storage node in the back-end storage cluster. The back-end storage cluster generally includes multiple storage nodes, each storage node may include at least one local storage disk, and a storage disk on a storage node can serve as an external disk of a front-end computing node. The external disk of a computing node incurs IO delay when it executes IO tasks. An IO delay threshold can be set for the external disk, and the IO delay of the external disk is compared with this threshold to determine whether the external disk has timed out while executing an IO task.

In one implementation of the embodiment of the present invention, the read/write delay value await of the external disk of the computing node can be obtained through a relevant operating system command, for example the iostat command, and the obtained await value is used as the IO delay of the computing node. Further, multiple await values of the external disk can be obtained at a fixed time interval, a moving average of the await values can be computed by the moving average method, and this moving average can be used as the IO delay of the computing node. The moving average method is a common data processing method that suppresses the influence of abnormal samples, so that the moving average of the await values, and therefore the resulting IO delay of the computing node, is more accurate.
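As a non-limiting illustration, a moving average over periodically sampled await values can be sketched in Python as follows. The class name, the window size, and the assumption that the await samples (in milliseconds) have already been read from a tool such as iostat are hypothetical; the sketch does not itself invoke iostat.

```python
from collections import deque


class AwaitMovingAverage:
    """Simple moving average over the most recent await samples."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._samples = deque(maxlen=window)

    def add(self, await_ms: float) -> float:
        """Record one sample and return the current moving average."""
        self._samples.append(await_ms)
        return sum(self._samples) / len(self._samples)


# Hypothetical usage: one sample per fixed collection interval.
io_delay = AwaitMovingAverage(window=3)
for sample in (2.0, 4.0, 6.0, 8.0):
    node_io_delay = io_delay.add(sample)
```

With a window of three, the value after the fourth sample is the mean of the last three samples only, so a single outlier stops affecting the result once it leaves the window.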

The distributed storage includes a front end and a back end. The front end includes multiple front-end computing nodes, the back end includes multiple back-end storage nodes, and the front end and the back end communicate through an interconnection network. Applications can run on the front-end computing nodes, and the storage disks on the back-end storage nodes store the application data of those applications. The back-end storage nodes communicate with one another through the interconnection network between the back-end storage nodes. Each back-end storage node accesses that network through its local network access equipment, which may include, for example, a network cable and a network card. The IO delay of a computing node generally includes the delay from the computing node to the interconnection network between the front end and the back end, the delay from that interconnection network to the storage node, and the internal delay of the storage node; the internal delay of the storage node generally also includes the delay of synchronizing IO data between storage nodes.

Step 120: If the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, further obtain the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number.

The IO delay of the logical unit number can be calculated by the moving average method. The difference between the IO delay of the computing node and the IO delay threshold of the external disk is calculated first, and it is then determined whether the difference is within the first preset value range. If it is, the IO delay of the computing node is determined to be abnormal, and an IO fault has occurred in the distributed storage system. The first preset value range can be set according to actual needs. For example, it can be set to the range of values more than 10 times the IO delay threshold of the external disk; that is, when the IO delay of the computing node is greater than 11 times the IO delay threshold of the external disk, the difference between the IO delay of the computing node and the IO delay threshold of the external disk is determined to be within the first preset value range.
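As a non-limiting illustration, the example setting of the first preset value range can be sketched in Python as follows. The function name and the `multiple` parameter are hypothetical; the default of 10 follows the example in the text, and treating the boundary as exclusive follows the "greater than 11 times" wording.

```python
def in_first_preset_range(node_io_delay: float,
                          disk_io_threshold: float,
                          multiple: float = 10.0) -> bool:
    """True when (delay - threshold) exceeds `multiple` times the threshold.

    With multiple = 10 this is equivalent to delay > 11 * threshold.
    """
    difference = node_io_delay - disk_io_threshold
    return difference > multiple * disk_io_threshold
```

For a threshold of 5 ms, a node IO delay of 56 ms gives a difference of 51 ms, which is more than 50 ms, so step 120 proceeds to obtain the logical unit number's IO delay; a delay of 55 ms sits exactly on the boundary and does not qualify in this sketch.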

A logical unit number (LUN) is a storage unit that the back-end storage cluster allocates to a front-end computing node. The back-end storage cluster generally includes multiple logical unit numbers. A logical unit number of the back-end storage cluster incurs IO delay when it executes IO tasks. An IO delay threshold can be set for the logical unit number, and the IO delay of the logical unit number is compared with this threshold to determine whether the logical unit number has timed out while executing an IO task.

Step 130: If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, locate the IO fault at the front end.

If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, the IO delay of the logical unit number is determined to be normal. If it is further determined that the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the second preset value range, an IO fault has occurred at the front end, and the IO fault is located at the front end.

The second preset value range can be set according to actual needs. For example, it can be set to the range of values more than 10 times the IO delay of the logical unit number; that is, when the IO delay of the computing node is greater than 11 times the IO delay of the logical unit number, the difference between the IO delay of the computing node and the IO delay of the logical unit number is determined to be within the second preset value range.
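As a non-limiting illustration, the front-end test of step 130, together with the back-end condition stated in the summary, can be sketched in Python as follows. The second preset value range uses the example multiple of 10 from the text. No concrete values are given here for the third and fourth preset value ranges, so they are passed in as caller-supplied predicates; those predicates and all names are assumptions of the sketch.

```python
from typing import Callable


def locate_front_or_back(node_io_delay: float,
                         lun_io_delay: float,
                         lun_io_threshold: float,
                         in_third_range: Callable[[float], bool],
                         in_fourth_range: Callable[[float], bool],
                         second_multiple: float = 10.0) -> str:
    """Classify an IO fault as 'front end', 'back end', or 'undetermined'."""
    gap = node_io_delay - lun_io_delay

    # Front end: the LUN itself is within its threshold, yet the node sees
    # a delay far larger than the LUN's delay (second preset value range).
    if lun_io_delay <= lun_io_threshold and gap > second_multiple * lun_io_delay:
        return "front end"

    # Back end: the LUN's excess over its threshold is in the third range
    # and the node/LUN gap is in the fourth range (both assumed predicates).
    if in_third_range(lun_io_delay - lun_io_threshold) and in_fourth_range(gap):
        return "back end"

    return "undetermined"
```

Passing the unspecified ranges as predicates keeps the sketch from committing to values that the text does not state.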

In a preferred implementation of the embodiment of the present invention, after the IO fault is located at the front end in step 130, the method further includes:

Step 131: If the computing node is a virtual machine, separately obtain the IO delay of other computing nodes that correspond to the same logical unit number as the computing node, the IO delay of other virtual machines that correspond to the same physical machine as the computing node, and the IO delay of other physical machines that correspond to the same logical unit number as the physical machine where the computing node is located; if the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node is determined to be abnormal, locate the IO fault in the interconnection network between the front end and the back end; if the IO delay of the other virtual machines that correspond to the same physical machine as the computing node is determined to be normal, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine where the computing node is located is determined to be normal, locate the IO fault at the computing node; if the IO delay of the other virtual machines that correspond to the same physical machine as the computing node is determined to be abnormal, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine where the computing node is located is determined to be normal, locate the IO fault at the physical machine where the computing node is located.

After the IO fault is located at the front end in step 130, it can be further determined whether the computing node is a virtual machine or a physical machine; if it is a virtual machine, step 131 is performed. The IO delay of the other computing nodes that correspond to the same logical unit number as the computing node, the IO delay of the other virtual machines that correspond to the same physical machine as the computing node, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine where the computing node is located can each be calculated by the moving average method. After the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node is obtained, if that IO delay is determined to be abnormal, the IO fault is located in the interconnection network between the front end and the back end. To determine whether the IO delay of these other computing nodes is abnormal, the IO delay of each of them can be compared with its corresponding IO delay threshold, and any computing node whose IO delay is greater than its IO delay threshold is identified as a computing node with abnormal IO delay. The proportion of computing nodes with abnormal IO delay among all the other computing nodes is then counted; if this proportion exceeds a preset proportion, the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node is determined to be abnormal.
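As a non-limiting illustration, the proportion test for the other computing nodes that share the logical unit number can be sketched in Python as follows. The function name, the dictionary layout, and the example value of the preset proportion are hypothetical; returning `False` when there are no other computing nodes is also an assumption, since the text does not cover that case.

```python
def lun_peers_abnormal(peer_delays: dict[str, float],
                       peer_thresholds: dict[str, float],
                       preset_proportion: float) -> bool:
    """True when the share of peers over their own threshold exceeds the preset.

    peer_delays / peer_thresholds: {computing node name: IO delay / threshold}
    """
    if not peer_delays:
        return False  # assumption: no peers means nothing to call abnormal
    abnormal = sum(1 for node, delay in peer_delays.items()
                   if delay > peer_thresholds[node])
    return abnormal / len(peer_delays) > preset_proportion
```

For four peers with a 10 ms threshold each and delays of 50, 60, 3, and 70 ms, three of four are abnormal; with a hypothetical preset proportion of 0.5 the peers as a group are judged abnormal, which in step 131 points to the interconnection network between the front end and the back end.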

After the IO delay of the other virtual machines on the same physical machine as the computing node and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node are obtained, if the IO delay of the other virtual machines is judged normal and the IO delay of the other physical machines is judged normal, the IO fault is located at the computing node. If the IO delay of the other virtual machines is judged abnormal and the IO delay of the other physical machines is judged normal, the IO fault is located at the physical machine hosting the computing node. To judge whether the IO delay of the other virtual machines on the same physical machine is normal, the IO delay of each other virtual machine may be compared with its corresponding IO delay threshold, and a virtual machine whose IO delay is greater than the threshold is identified as a virtual machine with abnormal IO delay. If none of the other virtual machines has abnormal IO delay, the IO delay of the other virtual machines is judged normal; if any of them has abnormal IO delay, it is judged abnormal. Likewise, to judge whether the IO delay of the other physical machines that correspond to the same logical unit number is normal, the IO delay of each other physical machine may be compared with its corresponding IO delay threshold, and a physical machine whose IO delay is greater than the threshold is identified as a physical machine with abnormal IO delay. If none of the other physical machines has abnormal IO delay, the IO delay of the other physical machines is judged normal; if any of them has abnormal IO delay, it is judged abnormal.
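The virtual-machine branch described above can be sketched as a small decision function. This is only an illustrative sketch: the names `FrontEndFault`, `any_abnormal` and `locate_vm_fault` are hypothetical, the order in which the conditions are checked is an assumption (the text does not state a precedence), and the `UNDETERMINED` result covers the combination the text leaves open, namely abnormal IO delay on the other physical machines.

```python
from enum import Enum
from typing import Mapping


class FrontEndFault(Enum):
    INTERCONNECT = "front-end/back-end interconnection network"
    COMPUTE_NODE = "computing node"
    HOST_MACHINE = "physical machine hosting the computing node"
    UNDETERMINED = "undetermined"  # assumption: case not covered by the text


def any_abnormal(delays_ms: Mapping[str, float],
                 thresholds_ms: Mapping[str, float]) -> bool:
    """A group is abnormal if any member exceeds its own IO delay threshold."""
    return any(delay > thresholds_ms[name] for name, delay in delays_ms.items())


def locate_vm_fault(lun_peers_abnormal: bool,
                    sibling_vms_abnormal: bool,
                    other_hosts_abnormal: bool) -> FrontEndFault:
    """Locate a front-end fault when the computing node is a virtual machine."""
    # Peers sharing the logical unit number are slow too: blame the network.
    if lun_peers_abnormal:
        return FrontEndFault.INTERCONNECT
    # Both locating rules require the other physical machines to be normal.
    if not other_hosts_abnormal:
        if sibling_vms_abnormal:
            return FrontEndFault.HOST_MACHINE
        return FrontEndFault.COMPUTE_NODE
    return FrontEndFault.UNDETERMINED
```

For example, `locate_vm_fault(False, any_abnormal(vm_delays, vm_thresholds), False)` returns `HOST_MACHINE` as soon as one sibling virtual machine exceeds its threshold.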

Step 132: If the computing node is a physical machine, obtain the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node; if the IO delay of those other computing nodes is judged abnormal, locate the IO fault in the interconnection network between the front end and the back end; if the IO delay of those other computing nodes is judged normal, locate the IO fault at the computing node.

After the IO fault is located at the front end in step 130, it may further be determined whether the computing node is a virtual machine or a physical machine; step 132 is performed if the computing node is a physical machine. The IO delay of the other computing nodes that correspond to the same logical unit number as the computing node may be calculated by the moving average method. After that IO delay is obtained, if it is judged abnormal, the IO fault is located in the interconnection network between the front end and the back end; if it is judged normal, the IO fault is located at the computing node. To make this judgment, the IO delay of each other computing node may be compared with the IO delay threshold, and a computing node whose IO delay is greater than the threshold is identified as a computing node with abnormal IO delay. If a preset proportion of all the other computing nodes that correspond to the same logical unit number have abnormal IO delay, the IO delay of those other computing nodes is judged abnormal; otherwise it is judged normal. The preset proportion may be, for example, 90%. The other computing nodes may be virtual machines or physical machines.
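A minimal sketch of step 132, assuming a single IO delay threshold shared by all peers and the 90% example ratio. The function names are hypothetical, and treating "a preset proportion" as "at least that proportion" is an assumption; the description of step 131 instead says the proportion must be exceeded.

```python
from typing import Mapping


def peers_abnormal(delays_ms: Mapping[str, float],
                   threshold_ms: float,
                   preset_ratio: float = 0.9) -> bool:
    """True when at least preset_ratio of the peers exceed the threshold."""
    if not delays_ms:
        return False  # assumption: no peers means no evidence of a shared fault
    abnormal = sum(1 for delay in delays_ms.values() if delay > threshold_ms)
    return abnormal / len(delays_ms) >= preset_ratio


def locate_physical_machine_fault(delays_ms: Mapping[str, float],
                                  threshold_ms: float,
                                  preset_ratio: float = 0.9) -> str:
    """Locate a front-end fault when the computing node is a physical machine."""
    if peers_abnormal(delays_ms, threshold_ms, preset_ratio):
        return "front-end/back-end interconnection network"
    return "computing node"
```

The peers passed in are the other computing nodes on the same logical unit number; they may be virtual or physical machines.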

Step 140: If the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range, locate the IO fault at the back end.

If the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within the third preset value range, the IO delay of the logical unit number is judged abnormal. The third preset value range may be set according to actual needs. For example, it may be set to the range of values at least 10 times the IO delay threshold of the logical unit number; that is, when the IO delay of the logical unit number is greater than 11 times its IO delay threshold, the difference between the IO delay of the logical unit number and its IO delay threshold is judged to be within the third preset value range.

If the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the fourth preset value range, the IO delay of the computing node is judged to be close to the IO delay of the logical unit number. The fourth preset value range may be set according to actual needs. For example, it may be set to the range of values no more than 0.1 times the IO delay of the logical unit number; that is, when the IO delay of the computing node is less than 1.1 times the IO delay of the logical unit number, the difference between the two is judged to be within the fourth preset value range.
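The two conditions of step 140 can be written as simple predicates. The sketch below uses the example values from the text (a factor of 10 for the third range and 0.1 for the fourth); the function and parameter names are hypothetical, and real deployments would set the ranges according to actual needs.

```python
def lun_delay_abnormal(lun_delay_ms: float,
                       lun_threshold_ms: float,
                       factor: float = 10.0) -> bool:
    """Third range: the excess over the threshold is more than factor x threshold.

    With factor=10 this is the same as lun_delay > 11 x threshold.
    """
    return lun_delay_ms - lun_threshold_ms > factor * lun_threshold_ms


def node_close_to_lun(node_delay_ms: float,
                      lun_delay_ms: float,
                      tolerance: float = 0.1) -> bool:
    """Fourth range: the node delay is less than (1 + tolerance) x LUN delay."""
    return node_delay_ms - lun_delay_ms < tolerance * lun_delay_ms


def is_back_end_fault(node_delay_ms: float,
                      lun_delay_ms: float,
                      lun_threshold_ms: float) -> bool:
    """Step 140: the LUN itself is slow and accounts for almost all of the node delay."""
    return (lun_delay_abnormal(lun_delay_ms, lun_threshold_ms)
            and node_close_to_lun(node_delay_ms, lun_delay_ms))
```

With a threshold of 10 ms, a LUN delay of 120 ms and a node delay of 125 ms, both conditions hold and the fault is attributed to the back end.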

In a preferred implementation of the embodiment of the present invention, after the IO fault is located at the back end in step 140, the method further includes:

Step 141: Obtain the ping network delay of each back-end storage node, the average of the ping network delays of all back-end storage nodes, and the ping network delay threshold of the back-end storage nodes. If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within a fifth preset value range, and the difference between the ping network delay of every back-end storage node and the ping network delay threshold is within a sixth preset value range, locate the IO fault in the interconnection network between the back-end storage nodes. If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within a seventh preset value range, the differences between the ping network delays of a preset number of back-end storage nodes and the ping network delay threshold are all within an eighth preset value range, and the ping network delay of every back-end storage node other than the preset number of back-end storage nodes is not greater than the ping network delay threshold, locate the IO fault in the network of the preset number of back-end storage nodes. If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within a ninth preset value range, obtain the IO delay of the built-in disks of all back-end storage nodes and the corresponding IO delay threshold of each built-in disk, identify as an abnormal built-in disk any built-in disk for which the difference between its IO delay and its IO delay threshold is within a tenth preset value range, and locate the IO fault at the abnormal built-in disk.

The ping network delay of each back-end storage node and the average of the ping network delays of all back-end storage nodes may be calculated by the moving average method. If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within the fifth preset value range, the average ping network delay is judged abnormal; the fifth preset value range may, for example, be set to the range of values at least 10 times the ping network delay threshold of the back-end storage nodes. If the difference between the ping network delay of every back-end storage node and the ping network delay threshold is within the sixth preset value range, the ping network delay of every back-end storage node is judged abnormal; the sixth preset value range may, for example, be set to the range of values at least 10 times the ping network delay threshold of the back-end storage nodes.
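The text names the moving average method without fixing its form. As one possible reading, the sketch below keeps a simple (unweighted) moving average over the most recent samples; the class name `MovingAverage` and the default window of 5 samples are assumptions, not values taken from the document.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the last `window` delay samples."""

    def __init__(self, window: int = 5) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque discards the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def add(self, delay_ms: float) -> float:
        """Record a new sample and return the current average."""
        self._samples.append(delay_ms)
        return sum(self._samples) / len(self._samples)
```

Smoothing the raw samples this way keeps a single slow ping or IO request from triggering a threshold comparison on its own.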

If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within the seventh preset value range, the average ping network delay is judged abnormal; the seventh preset value range may, for example, be set to the range of values at least 10 times the ping network delay threshold of the back-end storage nodes. If the differences between the ping network delays of the preset number of back-end storage nodes and the ping network delay threshold are all within the eighth preset value range, the ping network delay of the preset number of back-end storage nodes is judged abnormal; the eighth preset value range may, for example, be set to the range of values at least 10 times the ping network delay threshold, and the preset number may, for example, be set to 1. If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within the ninth preset value range, the average ping network delay is judged normal; the ninth preset value range may, for example, be set to the range of values within 0.1 times the ping network delay threshold. The tenth preset value range may, for example, be set to the range of values at least 10 times the IO delay threshold of the built-in disk.
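The three branches of step 141 can be sketched together as follows, using the example ranges given above (a factor of 10 for "abnormal", 0.1 for "normal", and a preset number of 1). All names are hypothetical; reading "a preset number of nodes" as "exactly that many nodes" and returning `"undetermined"` when no branch matches are assumptions.

```python
from statistics import mean
from typing import List, Mapping, Tuple


def locate_back_end_fault(ping_ms: Mapping[str, float],
                          ping_threshold_ms: float,
                          disk_ms: Mapping[str, float],
                          disk_thresholds_ms: Mapping[str, float],
                          preset_count: int = 1,
                          high_factor: float = 10.0,
                          normal_factor: float = 0.1) -> Tuple[str, List[str]]:
    """Return (fault location, suspect node or disk names)."""
    if not ping_ms:
        return ("undetermined", [])

    def far_above(value: float, threshold: float) -> bool:
        # "Abnormal": the excess over the threshold is more than high_factor x threshold.
        return value - threshold > high_factor * threshold

    average = mean(ping_ms.values())
    slow = sorted(n for n, v in ping_ms.items() if far_above(v, ping_threshold_ms))

    if far_above(average, ping_threshold_ms):
        # Every storage node is slow: the network between storage nodes.
        if len(slow) == len(ping_ms):
            return ("storage interconnection network", [])
        # Only the preset number of nodes are slow and the rest are within threshold.
        rest_ok = all(v <= ping_threshold_ms
                      for n, v in ping_ms.items() if n not in slow)
        if len(slow) == preset_count and rest_ok:
            return ("storage node network", slow)
    elif average - ping_threshold_ms < normal_factor * ping_threshold_ms:
        # The network looks healthy, so inspect the built-in disks.
        bad = sorted(d for d, v in disk_ms.items()
                     if far_above(v, disk_thresholds_ms[d]))
        if bad:
            return ("built-in disk", bad)
    return ("undetermined", [])
```

The disk delays are only consulted in the last branch, which mirrors the text: built-in disk data is obtained once the average ping delay has been judged normal.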

In the embodiment of the present invention, the IO delay of a front-end computing node and the IO delay threshold of the external disk of the computing node are obtained. If the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number are further obtained. If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, the IO fault is located at the front end. If the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range, the IO fault is located at the back end. It can be seen that, by comparing the IO delay of the front-end computing node with the IO delay of the back-end logical unit number, the embodiment of the present invention can locate an IO fault quickly and accurately.

FIG. 2 shows a schematic structural diagram of an embodiment of the IO fault locating apparatus of the present invention. As shown in FIG. 2, the apparatus 300 includes a first obtaining module 310, a second obtaining module 320 and a locating module 330.

The first obtaining module 310 is configured to obtain the IO delay of a front-end computing node and the IO delay threshold of the external disk of the computing node;

The second obtaining module 320 is configured to obtain the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range;

The locating module 330 is configured to locate the IO fault at the front end if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range; and to locate the IO fault at the back end if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range.

In an optional manner, the locating module 330 is further configured to, after the IO fault is located at the front end and when the computing node is a virtual machine, obtain the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node, the IO delay of the other virtual machines on the same physical machine as the computing node, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node;

If the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node is judged abnormal, locate the IO fault in the interconnection network between the front end and the back end;

If the IO delay of the other virtual machines on the same physical machine as the computing node is judged normal, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node is judged normal, locate the IO fault at the computing node;

If the IO delay of the other virtual machines on the same physical machine as the computing node is judged abnormal, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node is judged normal, locate the IO fault at the physical machine hosting the computing node.

In an optional manner, the locating module 330 is further configured to, after the IO fault is located at the front end and when the computing node is a physical machine, obtain the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node;

If the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node is judged normal, locate the IO fault at the computing node.

In an optional manner, the locating module 330 is further configured to, after the IO fault is located at the back end, obtain the ping network delay of each back-end storage node, the average of the ping network delays of all back-end storage nodes, and the ping network delay threshold of the back-end storage nodes;

If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within the fifth preset value range, and the difference between the ping network delay of every back-end storage node and the ping network delay threshold is within the sixth preset value range, locate the IO fault in the interconnection network between the back-end storage nodes;

If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within the seventh preset value range, the differences between the ping network delays of a preset number of back-end storage nodes and the ping network delay threshold are all within the eighth preset value range, and the ping network delay of every back-end storage node other than the preset number of back-end storage nodes is not greater than the ping network delay threshold, locate the IO fault in the network of the preset number of back-end storage nodes;

If the difference between the average ping network delay of all back-end storage nodes and the ping network delay threshold is within the ninth preset value range, obtain the IO delay of the built-in disks of all back-end storage nodes and the corresponding IO delay threshold of each built-in disk, identify as an abnormal built-in disk any built-in disk for which the difference between its IO delay and its IO delay threshold is within the tenth preset value range, and locate the IO fault at the abnormal built-in disk.

In the embodiment of the present invention, the first obtaining module obtains the IO delay of the front-end computing node and the IO delay threshold of the external disk of the computing node. When the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within the first preset value range, the second obtaining module obtains the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number. The locating module compares the IO delay of the logical unit number with its IO delay threshold, and compares the IO delay of the computing node with the IO delay of the logical unit number, so as to locate the IO fault at the front end or at the back end. It can be seen that the IO fault locating apparatus of the embodiment of the present invention can locate an IO fault quickly and accurately.

FIG. 3 shows a schematic structural diagram of an embodiment of the computing device of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.

As shown in FIG. 3, the computing device may include a processor 402, a communications interface 404, a memory 406 and a communication bus 408.

The processor 402, the communications interface 404 and the memory 406 communicate with one another through the communication bus 408. The communications interface 404 is used to communicate with network elements of other devices, such as clients or other servers. The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps of the above embodiments of the IO fault locating method.

Specifically, the program 410 may include program code, and the program code includes computer-executable instructions.

The processor 402 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the computing device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory 406 is used to store the program 410. The memory 406 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory.

Specifically, the program 410 may be invoked by the processor 402 to cause the computing device to perform the following operations:

In an optional manner, the program 410 is invoked by the processor 402 to cause the computing device to perform the following operations:

When the computing node is a virtual machine, obtain the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node, the IO delay of the other virtual machines on the same physical machine as the computing node, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node;

When the computing node is a physical machine, obtain the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node;

In the embodiment of the present invention, the program in the computing device is invoked by the processor so that the computing device obtains the IO delay of the front-end computing node and the IO delay threshold of the external disk of the computing node. If the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within the first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number are further obtained. If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the second preset value range, the IO fault is located at the front end. If the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within the third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the fourth preset value range, the IO fault is located at the back end. It can be seen that the computing device can locate an IO fault quickly and accurately.

图4示出了本发明IO故障定位设备实施例的结构示意图。如图4所示，该设备500包括：FIG. 4 shows a schematic structural diagram of an embodiment of an IO fault locating device according to the present invention. As shown in Figure 4, the device 500 includes:

采集模块510，用于采集IO时延信息。The collection module 510 is used for collecting IO delay information.

其中，IO时延信息可以包括：前端计算节点的IO时延、计算节点所对应的逻辑单元号的IO时延、每一个后端存储节点的ping网络时延，以及，在计算节点为虚拟机时，采集与计算节点对应于同一LUN的其它计算节点的IO时延、与计算节点对应于同一物理机的其它虚拟机的IO时延、与计算节点所在的物理机对应于同一LUN的其它物理机的IO时延，以及，在当前计算节点为物理机时，采集与计算节点对应于同一LUN的其它计算节点的IO时延。The IO delay information may include: the IO delay of the front-end computing node, the IO delay of the logical unit number corresponding to the computing node, the ping network delay of each back-end storage node, and, if the computing node is a virtual machine Collects the IO delays of other computing nodes that correspond to the same LUN as the computing node, the IO delays of other virtual machines that correspond to the same physical machine as the computing node, and other physical machines that correspond to the same LUN as the physical machine where the computing node resides. The IO delay of the computer, and, when the current computing node is a physical machine, collects the IO delay of other computing nodes corresponding to the same LUN as the computing node.

配置模块520，用于配置IO时延阈值。The configuration module 520 is configured to configure the IO delay threshold.

其中，IO时延阈值可以包括：对应于计算节点的外接盘的IO时延阈值、计算节点所对应的逻辑单元号的IO时延阈值、后端存储节点的ping网络时延阈值，以及，在计算节点为虚拟机时，用于配置与计算节点对应于同一逻辑单元号的其它计算节点的IO时延阈值、与计算节点对应于同一物理机的其它虚拟机的IO时延阈值、与计算节点所在的物理机对应于同一逻辑单元号的其它物理机的IO时延阈值，以及，在当前计算节点为物理机时，用于配置与计算节点对应于同一逻辑单元号的其它计算节点的IO时延阈值。The IO delay threshold may include: the IO delay threshold corresponding to the external disk of the computing node, the IO delay threshold of the logical unit number corresponding to the computing node, the ping network delay threshold of the back-end storage node, and, in When the computing node is a virtual machine, it is used to configure the IO delay threshold of other computing nodes corresponding to the same logical unit number as the computing node, the IO delay threshold of other virtual machines corresponding to the same physical machine as the computing node, and the computing node. The IO delay threshold of other physical machines whose physical machine corresponds to the same logical unit number, and, when the current computing node is a physical machine, is used to configure the IO of other computing nodes corresponding to the same logical unit number as the computing node. delay threshold.

The locating module 530 is configured to execute the IO fault location method described above, so as to locate the IO fault.

The display module 540 is configured to obtain the result of the locating module's location of the IO fault and to display the result.

In an optional manner, the collection module 510 is further configured to collect IOPS information. IOPS (Input/Output Operations Per Second) is the number of input/output operations a disk performs per second. The collection module may collect the IOPS information of the storage disks of the back-end storage nodes.

In an optional manner, the display module 540 is further configured to display the IO delay information collected by the collection module 510, and to display the correspondence between the external disks of the front-end computing nodes and the logical unit numbers of the back-end storage cluster.

In an optional manner, the configuration module 520 is further configured to:

obtain the configuration information of the front-end computing node and the configuration information of the back-end storage cluster, and send both to the locating module 530;

the locating module 530 is further configured to locate the IO fault according to the configuration information of the front-end computing node and the configuration information of the back-end storage cluster.

The configuration information of the front-end computing node may include: the host name of the computing node, the external disk information of the computing node, and the information of the storage cluster corresponding to the computing node. The configuration information of the back-end storage cluster may include: the name of the storage cluster corresponding to the computing node, the number of storage nodes contained in that storage cluster and the name of each storage node, the number of logical unit numbers into which that storage cluster is divided, and the node information of other computing nodes corresponding to the same storage cluster as the computing node.

In an optional manner, the configuration module 520 is further configured to obtain the correspondence between the upper and lower layers of the back-end storage cluster and to send the obtained correspondence to the locating module 530;

the locating module 530 is further configured to locate the IO fault according to the correspondence between the upper and lower layers of the back-end storage cluster.

In an optional manner, the configuration module 520 is further configured to obtain reference information and to configure the IO delay thresholds according to the reference information. The reference information may include: the hardware configuration of the computing node, the hardware configuration of the back-end storage nodes corresponding to the computing node, and the IO delay requirements of the tasks processed by the computing node.

In the embodiment of the present invention, the collection module collects the IO delay information, the configuration module configures the IO delay thresholds, the locating module executes the IO fault location method to locate the IO fault, and the display module obtains and displays the result of the locating module's location of the IO fault. It can be seen that the IO fault location device of the embodiment of the present invention can locate an IO fault quickly and accurately and display the location result, which facilitates further analysis of the IO fault.
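The cooperation of the four modules can be pictured with the following minimal sketch. It is not the implementation of the embodiment: the class `FaultLocatingDevice`, the callable signatures, and the dictionary-based data exchange are assumptions chosen only to show the order in which the modules hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class FaultLocatingDevice:
    """Hypothetical wiring of modules 510 to 540; each module is a plain callable."""
    collect: Callable[[], Metrics]                # collection module 510
    configure: Callable[[], Metrics]              # configuration module 520
    locate: Callable[[Metrics, Metrics], str]     # locating module 530
    display: Callable[[str], None]                # display module 540

    def run(self) -> str:
        delays = self.collect()                   # gather IO delay information
        thresholds = self.configure()             # obtain IO delay thresholds
        result = self.locate(delays, thresholds)  # run the fault location method
        self.display(result)                      # present the location result
        return result
```

Passing the modules in as callables keeps the sketch independent of how each module actually gathers or presents its data.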

An embodiment of the present invention provides a computer-readable storage medium. The storage medium stores at least one executable instruction which, when run on an IO fault location apparatus, causes the IO fault location apparatus to execute the IO fault location method of any of the above method embodiments.

An embodiment of the present invention provides a computer program that can be invoked by a processor to cause a computing device to execute the IO fault location method of any of the above method embodiments.

An embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions which, when run on a computer, cause the computer to execute the IO fault location method of any of the above method embodiments.

The algorithms or displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such systems is apparent. Furthermore, the embodiments of the present invention are not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above descriptions of specific languages are given to disclose the best mode of the invention.

Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim.

Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may also be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be interpreted as names. Unless otherwise specified, the steps in the above embodiments should not be construed as limiting the order of execution.

Claims

1. An IO fault location method, wherein the IO fault is an IO fault of distributed storage, the method comprising:

obtaining the IO delay of a front-end computing node and the IO delay threshold of an external disk of the computing node;

if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset numerical range, further obtaining the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number;

if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset numerical range, locating the IO fault at the front end;

and if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset numerical range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset numerical range, locating the IO fault at the back end.
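For illustration, the decision steps of claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `locate_io_fault`, the representation of each preset numerical range as an inclusive `(low, high)` pair, and the returned labels are all hypothetical, and the LUN values are passed in directly rather than obtained only after the first check.

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # assumed inclusive (low, high) bounds


def in_range(value: float, rng: Range) -> bool:
    return rng[0] <= value <= rng[1]


def locate_io_fault(node_delay: float, disk_threshold: float,
                    lun_delay: float, lun_threshold: float,
                    r1: Range, r2: Range, r3: Range, r4: Range) -> Optional[str]:
    """Return "front end", "back end", or None when no branch applies."""
    # Step 1: compare the node's IO delay with the external-disk threshold.
    if not in_range(node_delay - disk_threshold, r1):
        return None
    node_minus_lun = node_delay - lun_delay
    # Step 2: LUN delay is normal but the node is much slower than the LUN.
    if lun_delay <= lun_threshold and in_range(node_minus_lun, r2):
        return "front end"
    # Step 3: LUN delay exceeds its threshold and the node tracks the LUN.
    if in_range(lun_delay - lun_threshold, r3) and in_range(node_minus_lun, r4):
        return "back end"
    return None
```

With ranges chosen so that a large positive difference counts as anomalous, a slow node over a healthy LUN is attributed to the front end, while a slow LUN that the node's delay closely follows is attributed to the back end.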

2. The method of claim 1, wherein after locating the IO fault at the front end, the method further comprises:

if the computing node is a virtual machine, obtaining, respectively, the IO delays of other computing nodes corresponding to the same logical unit number as the computing node, the IO delays of other virtual machines corresponding to the same physical machine as the computing node, and the IO delays of other physical machines corresponding to the same logical unit number as the physical machine where the computing node is located;

if the IO delays of the other computing nodes corresponding to the same logical unit number as the computing node are determined to be abnormal, locating the IO fault in the interconnection network between the front end and the back end;

if the IO delays of the other virtual machines corresponding to the same physical machine as the computing node are determined to be normal, and the IO delays of the other physical machines corresponding to the same logical unit number as the physical machine where the computing node is located are determined to be normal, locating the IO fault at the computing node;

and if the IO delays of the other virtual machines corresponding to the same physical machine as the computing node are determined to be abnormal, and the IO delays of the other physical machines corresponding to the same logical unit number as the physical machine where the computing node is located are determined to be normal, locating the IO fault at the physical machine where the computing node is located.

3. The method of claim 1 or 2, wherein after locating the IO fault at the front end, the method further comprises:

if the computing node is a physical machine, obtaining the IO delays of other computing nodes corresponding to the same logical unit number as the computing node;

and if the IO delays of the other computing nodes corresponding to the same logical unit number as the computing node are determined to be normal, locating the IO fault at the computing node.
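The front-end refinement of claims 2 and 3 can be sketched as below. The sketch assumes that the "normal" or "abnormal" determination for each peer group has already been made and is supplied as a boolean; the function name `refine_front_end_fault`, its parameters, and the returned labels are hypothetical.

```python
from typing import Optional


def refine_front_end_fault(is_virtual_machine: bool,
                           same_lun_nodes_abnormal: bool,
                           same_host_vms_abnormal: bool = False,
                           same_lun_hosts_abnormal: bool = False) -> Optional[str]:
    """Narrow a front-end fault; returns None when no listed condition applies."""
    if is_virtual_machine:
        # Peers sharing the LUN are also slow: suspect the shared network path.
        if same_lun_nodes_abnormal:
            return "front-end/back-end interconnection network"
        if not same_lun_hosts_abnormal:
            # Sibling VMs on the same host are slow too: suspect the host.
            if same_host_vms_abnormal:
                return "physical machine hosting the computing node"
            # Every peer group is healthy: suspect the node itself.
            return "computing node"
        return None
    # Physical machine: only the peers sharing the LUN are compared.
    if not same_lun_nodes_abnormal:
        return "computing node"
    return None
```

The conditions are checked in the order in which the claims list them, so an abnormal result for the nodes sharing the LUN takes precedence over the other comparisons.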

4. The method of claim 1, wherein after locating the IO fault at the back end, the method further comprises:

obtaining the ping network delay of each back-end storage node, the average of the ping network delays of all back-end storage nodes, and the ping network delay threshold of the back-end storage nodes;

if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a fifth preset numerical range, and the difference between the ping network delay of each back-end storage node and the ping network delay threshold of the back-end storage nodes is within a sixth preset numerical range, locating the IO fault in the interconnection network between the back-end storage nodes;

if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a seventh preset numerical range, the difference between the ping network delay of each of a preset number of back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within an eighth preset numerical range, and the ping network delays of the back-end storage nodes other than the preset number of back-end storage nodes are not greater than the ping network delay threshold of the back-end storage nodes, locating the IO fault in the network of the preset number of back-end storage nodes;

if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a ninth preset numerical range, obtaining the IO delays of the built-in disks of all back-end storage nodes and the corresponding IO delay thresholds of the built-in disks, identifying as an abnormal built-in disk each built-in disk for which the difference between its IO delay and its IO delay threshold is within a tenth preset numerical range, and locating the IO fault on the abnormal built-in disks.
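The three back-end branches of claim 4 can be sketched as follows. This is an illustrative sketch only: the function name `refine_back_end_fault`, the dictionary inputs, the inclusive `(low, high)` range representation, and the reading of "a preset number" as an upper bound on the count of outlier nodes are assumptions, and the built-in disk delays are passed in up front rather than fetched on demand.

```python
from statistics import mean
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # assumed inclusive (low, high) bounds


def in_range(value: float, rng: Range) -> bool:
    return rng[0] <= value <= rng[1]


def refine_back_end_fault(ping: Dict[str, float], ping_threshold: float,
                          disk_delay: Dict[str, float],
                          disk_threshold: Dict[str, float],
                          ranges: Dict[str, Range],
                          preset_count: int) -> Tuple[str, List[str]]:
    """Return (fault location, affected nodes or disks)."""
    avg_excess = mean(ping.values()) - ping_threshold
    excess = {node: delay - ping_threshold for node, delay in ping.items()}
    # Branch 1: the average and every node exceed the threshold.
    if in_range(avg_excess, ranges["r5"]) and all(
            in_range(e, ranges["r6"]) for e in excess.values()):
        return ("interconnection network between storage nodes", [])
    # Branch 2: a few outlier nodes are slow while all others are normal.
    outliers = sorted(n for n, e in excess.items() if in_range(e, ranges["r8"]))
    others_ok = all(ping[n] <= ping_threshold for n in ping if n not in outliers)
    if (in_range(avg_excess, ranges["r7"])
            and 1 <= len(outliers) <= preset_count and others_ok):
        return ("network of specific storage nodes", outliers)
    # Branch 3: the network looks healthy, so inspect the built-in disks.
    if in_range(avg_excess, ranges["r9"]):
        bad = sorted(d for d, v in disk_delay.items()
                     if in_range(v - disk_threshold[d], ranges["r10"]))
        if bad:
            return ("built-in disk", bad)
    return ("undetermined", [])
```

Because the preset numerical ranges are configuration values, whether the branches overlap depends entirely on how they are chosen; the sketch simply evaluates them in the order in which the claim lists them.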

5. An IO fault location apparatus, wherein the IO fault is an IO fault of distributed storage, the apparatus comprising:

a first obtaining module, configured to obtain the IO delay of a front-end computing node and the IO delay threshold of an external disk of the computing node;

a second obtaining module, configured to obtain the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset numerical range;

a locating module, configured to locate the IO fault at the front end if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset numerical range; and to locate the IO fault at the back end if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset numerical range and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset numerical range.

6. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus;

the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the IO fault location method of any one of claims 1-4.

7. An IO fault location device, wherein the IO fault is an IO fault of distributed storage, the device comprising:

a collection module, configured to collect IO delay information;

a configuration module, configured to configure IO delay thresholds;

a locating module, configured to execute the IO fault location method of any one of claims 1 to 4, so as to locate the IO fault;

and a display module, configured to obtain the result of the locating module's location of the IO fault and to display the result.

8. The device of claim 7, wherein the display module is further configured to:

display the IO delay information collected by the collection module, and display the correspondence between the external disks of the front-end computing nodes and the logical unit numbers of the back-end storage cluster.

9. The device of claim 7 or 8, wherein the configuration module is further configured to:

obtain configuration information of the front-end computing node and configuration information of the back-end storage cluster, and send the configuration information of the front-end computing node and the configuration information of the back-end storage cluster to the locating module;

the locating module is further configured to locate the IO fault according to the configuration information of the front-end computing node and the configuration information of the back-end storage cluster.

10. A computer-readable storage medium having stored therein at least one executable instruction that, when executed on a computing device, causes the computing device to perform the operations of the IO fault location method of any one of claims 1-4.