CN114428703A - IO fault location method, apparatus, device and computer readable storage medium - Google Patents
- Publication number
- CN114428703A CN202011104291.0A
- Authority
- CN
- China
- Prior art keywords
- time delay
- computing node
- fault
- delay
- unit number
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3034—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3041—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is an input/output interface
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3055—Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45579—I/O management, e.g. providing access to device drivers or storage
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45591—Monitoring or debugging support
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45595—Network integration; Enabling network access in virtual machine instances
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Quality & Reliability (AREA)
- Mathematical Physics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
- Hardware Redundancy (AREA)
Abstract
Description
Technical Field
Embodiments of the present invention relate to the technical field of distributed storage, and in particular to an IO fault location method, apparatus, and device, and a computer-readable storage medium.
Background
Distributed storage is a widely used storage approach in which multiple storage servers are interconnected over a network so that storage space can be allocated flexibly, which improves storage efficiency and storage capacity.
A distributed storage system generally includes front-end computing nodes and back-end storage nodes. Various applications can run on the front-end computing nodes, and the back-end storage nodes store the data of the front-end computing nodes. IO faults occur frequently while a distributed storage system is running. To keep the system running normally, an IO fault must be located so that it can be resolved. In the course of implementing the embodiments of the present invention, the inventors found that, in the related art, locating an IO fault usually relies on the alarm information provided by the distributed storage system and on the experience of technicians; the means of locating IO faults are limited, and the location process is inefficient.
Summary of the Invention
In view of the above problems, embodiments of the present invention provide an IO fault location method, apparatus, and device, and a computer-readable storage medium, to solve the prior-art problem that locating IO faults in distributed storage is inefficient.
According to one aspect of the embodiments of the present invention, an IO fault location method is provided, where the IO fault is an IO fault of distributed storage, and the method includes:
obtaining the IO delay of a front-end computing node and the IO delay threshold of the external disk of the computing node;
if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, further obtaining the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number;
if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, locating the IO fault at the front end;
if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range, locating the IO fault at the back end.
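The four steps above can be read as a small decision procedure. The following Python sketch is only an illustration under stated assumptions: the source does not fix the preset value ranges at this point, so each range is modelled as a caller-supplied predicate, and the function name `locate_io_fault`, the result strings, and the values in `EXAMPLE_RANGES` are hypothetical. For brevity the delay of the logical unit number is passed in directly, whereas the method obtains it only after the first check succeeds.

```python
from typing import Callable, Dict

# A "preset value range" is modelled as a predicate over a delay difference (ms).
RangeCheck = Callable[[float], bool]


def locate_io_fault(
    node_delay: float,
    disk_threshold: float,
    lun_delay: float,
    lun_threshold: float,
    ranges: Dict[int, RangeCheck],
) -> str:
    """Coarse localisation: 'front end', 'back end', 'no fault' or 'undetermined'."""
    # First range: is the computing node's IO delay abnormal at all?
    if not ranges[1](node_delay - disk_threshold):
        return "no fault"
    gap = node_delay - lun_delay
    # LUN delay is normal but the node sees a much larger delay: front end.
    if lun_delay <= lun_threshold and ranges[2](gap):
        return "front end"
    # LUN delay is itself abnormal and the gap falls in the fourth range: back end.
    if ranges[3](lun_delay - lun_threshold) and ranges[4](gap):
        return "back end"
    return "undetermined"


# Hypothetical ranges for illustration only; real values are deployment-specific.
EXAMPLE_RANGES: Dict[int, RangeCheck] = {
    1: lambda d: d > 50.0,
    2: lambda d: d > 50.0,
    3: lambda d: d > 0.0,
    4: lambda d: d <= 50.0,
}

print(locate_io_fault(100.0, 5.0, 2.0, 5.0, EXAMPLE_RANGES))
print(locate_io_fault(100.0, 5.0, 90.0, 5.0, EXAMPLE_RANGES))
```

Passing the ranges as predicates keeps the sketch independent of any particular choice of bounds.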
In an optional manner, after the IO fault is located at the front end, the method further includes:
if the computing node is a virtual machine, separately obtaining the IO delays of other computing nodes that correspond to the same logical unit number as the computing node, the IO delays of other virtual machines that reside on the same physical machine as the computing node, and the IO delays of other physical machines that correspond to the same logical unit number as the physical machine on which the computing node resides;
if the IO delays of the other computing nodes that correspond to the same logical unit number as the computing node are determined to be abnormal, locating the IO fault at the interconnection network between the front end and the back end;
if the IO delays of the other virtual machines on the same physical machine as the computing node are determined to be normal, and the IO delays of the other physical machines that correspond to the same logical unit number as the physical machine on which the computing node resides are determined to be normal, locating the IO fault at the computing node;
if the IO delays of the other virtual machines on the same physical machine as the computing node are determined to be abnormal, and the IO delays of the other physical machines that correspond to the same logical unit number as the physical machine on which the computing node resides are determined to be normal, locating the IO fault at the physical machine on which the computing node resides.
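A minimal sketch of the virtual-machine refinement follows, assuming the three "normal or abnormal" judgments have already been made and are passed in as booleans. The function name, parameter names, and result strings are hypothetical, and evaluating the rules in the order they are listed is an assumption; the source does not state a precedence.

```python
def refine_front_end_fault_vm(
    same_lun_nodes_abnormal: bool,
    same_host_vms_abnormal: bool,
    same_lun_hosts_abnormal: bool,
) -> str:
    """Refine a front-end fault when the affected computing node is a virtual machine."""
    # Other nodes sharing the LUN are also slow: the shared path is suspect.
    if same_lun_nodes_abnormal:
        return "front-end/back-end interconnection network"
    # Both remaining rules require the peer physical machines to look normal.
    if not same_lun_hosts_abnormal:
        if same_host_vms_abnormal:
            # Sibling VMs on the same host are slow too: blame the host.
            return "physical machine hosting the computing node"
        # Only this VM is slow.
        return "computing node"
    # Combination not covered by the listed rules.
    return "undetermined"


print(refine_front_end_fault_vm(False, True, False))
```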
In an optional manner, after the IO fault is located at the front end, the method further includes:
if the computing node is a physical machine, obtaining the IO delays of other computing nodes that correspond to the same logical unit number as the computing node;
if the IO delays of the other computing nodes that correspond to the same logical unit number as the computing node are determined to be abnormal, locating the IO fault at the interconnection network between the front end and the back end;
if the IO delays of the other computing nodes that correspond to the same logical unit number as the computing node are determined to be normal, locating the IO fault at the computing node.
In an optional manner, after the IO fault is located at the back end, the method further includes:
obtaining the ping network delay of each back-end storage node, the average of the ping network delays of all back-end storage nodes, and the ping network delay threshold of the back-end storage nodes;
if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold is within a fifth preset value range, and the difference between the ping network delay of every back-end storage node and the ping network delay threshold is within a sixth preset value range, locating the IO fault at the interconnection network between the back-end storage nodes;
if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold is within a seventh preset value range, the difference between the ping network delay of each of a preset number of back-end storage nodes and the ping network delay threshold is within an eighth preset value range, and the ping network delays of the back-end storage nodes other than the preset number of back-end storage nodes are not greater than the ping network delay threshold, locating the IO fault at the network of the preset number of back-end storage nodes;
if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold is within a ninth preset value range, obtaining the IO delays of the built-in disks of all the back-end storage nodes and the IO delay threshold corresponding to each built-in disk, identifying as an abnormal built-in disk any built-in disk for which the difference between its IO delay and its IO delay threshold is within a tenth preset value range, and locating the IO fault at the abnormal built-in disk.
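The back-end refinement can be sketched in the same style. Everything specific here is an assumption made for illustration: the fifth to tenth preset value ranges are caller-supplied predicates, "a preset number of back-end storage nodes" is read as "exactly `preset_count` nodes fall in the eighth range", and the function name and result labels are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

RangeCheck = Callable[[float], bool]


def locate_back_end_fault(
    ping_ms: Dict[str, float],
    ping_threshold: float,
    disk_ms: Dict[str, float],
    disk_threshold: Dict[str, float],
    ranges: Dict[int, RangeCheck],
    preset_count: int,
) -> Tuple[str, List[str]]:
    """Return (fault location, affected storage nodes or built-in disks)."""
    diffs = {node: delay - ping_threshold for node, delay in ping_ms.items()}
    avg_diff = mean(ping_ms.values()) - ping_threshold

    # Every storage node is slow to ping: the storage interconnect is suspect.
    if ranges[5](avg_diff) and all(ranges[6](d) for d in diffs.values()):
        return ("storage interconnection network", [])

    # Only a few nodes are slow and all the others are within the threshold.
    suspects = sorted(node for node, d in diffs.items() if ranges[8](d))
    others_ok = all(
        delay <= ping_threshold
        for node, delay in ping_ms.items()
        if node not in suspects
    )
    if ranges[7](avg_diff) and len(suspects) == preset_count and others_ok:
        return ("network of storage nodes", suspects)

    # Network looks fine on average: inspect the built-in disks instead.
    if ranges[9](avg_diff):
        bad = sorted(
            disk
            for disk, delay in disk_ms.items()
            if ranges[10](delay - disk_threshold[disk])
        )
        if bad:
            return ("built-in disk", bad)

    return ("undetermined", [])
```

A caller would typically build `ping_ms` from periodic ping measurements and `disk_ms` from per-disk delay samples collected on each storage node.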
According to another aspect of the embodiments of the present invention, an IO fault location apparatus is provided, where the IO fault is an IO fault of distributed storage, and the apparatus includes:
a first obtaining module, configured to obtain the IO delay of a front-end computing node and the IO delay threshold of the external disk of the computing node;
a second obtaining module, configured to obtain, if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number;
a locating module, configured to locate the IO fault at the front end if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, and to locate the IO fault at the back end if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range.
According to another aspect of the embodiments of the present invention, a computing device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations of the IO fault location method described above.
According to another aspect of the embodiments of the present invention, an IO fault location device is provided, where the IO fault is an IO fault of distributed storage, and the device includes:
a collection module, configured to collect IO delay information;
a configuration module, configured to configure IO delay thresholds;
a locating module, configured to execute the IO fault location method described above so as to locate the IO fault;
a display module, configured to obtain the result of the locating module locating the IO fault and to display the result.
In an optional manner, the display module is further configured to:
display the IO delay information collected by the collection module, and display the correspondence between the external disks of the front-end computing nodes and the logical unit numbers of the back-end storage cluster.
In an optional manner, the configuration module is further configured to:
obtain the configuration information of the front-end computing nodes and the configuration information of the back-end storage cluster, and send both to the locating module;
the locating module is further configured to locate the IO fault according to the configuration information of the front-end computing nodes and the configuration information of the back-end storage cluster.
According to yet another aspect of the embodiments of the present invention, a computer-readable storage medium is provided. The storage medium stores at least one executable instruction that, when run on a computing device, causes the computing device to perform the operations of the IO fault location method described above.
In the embodiments of the present invention, the IO delay of a front-end computing node and the IO delay threshold of the external disk of the computing node are obtained. If the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number are further obtained. If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, the IO fault is located at the front end. If the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range, the IO fault is located at the back end. It can be seen that, by comparing the IO delay of the front-end computing node with the IO delay of the back-end logical unit number, the embodiments of the present invention can locate an IO fault quickly and accurately.
The above description is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of the embodiments are more readily apparent, specific embodiments of the present invention are set out below.
Brief Description of the Drawings
The drawings are only for the purpose of illustrating the embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference signs denote the same components. In the drawings:
FIG. 1 shows a schematic flowchart of the IO fault location method provided by an embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of the IO fault location apparatus provided by an embodiment of the present invention;
FIG. 3 shows a schematic structural diagram of the computing device provided by an embodiment of the present invention;
FIG. 4 shows a schematic structural diagram of the IO fault location device provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present invention, it should be understood that the present invention may be implemented in various forms and should not be limited by the embodiments set forth here.
FIG. 1 shows a flowchart of an embodiment of the IO fault location method of the present invention. The method is executed by a computing device. In this embodiment, executable instructions are stored in the storage space of the computing device, and the executable instructions cause the processor to execute the IO fault location method. As shown in FIG. 1, the method locates IO faults of distributed storage and includes the following steps:
Step 110: Obtain the IO delay of a front-end computing node and the IO delay threshold of the external disk of the computing node.
The front-end computing node may be any of various types of PC server, for example an x86 server. The external disk of the computing node resides on a storage node in the back-end storage cluster. The back-end storage cluster generally includes multiple storage nodes, each storage node may include at least one local storage disk, and a storage disk on a storage node can serve as an external disk of a front-end computing node. The external disk of the computing node incurs an IO delay when it executes IO tasks. An IO delay threshold can be set for the external disk, and the IO delay of the external disk is compared with this threshold to determine whether the external disk has timed out while executing IO tasks.
In one implementation of this embodiment, the read/write delay (await value) of the external disk of the computing node can be obtained through an operating system command, for example the iostat command, and the obtained await value is used as the IO delay of the computing node. Further, multiple await values of the external disk can be obtained at a fixed time interval, a moving average of the await values can be computed, and the moving average can be used as the IO delay of the computing node. The moving average is a common data-processing technique that effectively suppresses the influence of anomalous samples, so the resulting IO delay of the computing node is more accurate.
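The moving-average step can be sketched as follows. This is a minimal illustration and not the patent's implementation: the class name `MovingAverage` and the window size are hypothetical, and the await samples are supplied directly as numbers in milliseconds rather than parsed from iostat output.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the most recent `window` await samples."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) discards the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def add(self, await_ms: float) -> float:
        """Record one await sample and return the current moving average."""
        self._samples.append(await_ms)
        return sum(self._samples) / len(self._samples)


avg = MovingAverage(window=3)
for sample in (2.0, 4.0, 6.0, 60.0):
    io_delay = avg.add(sample)
print(io_delay)  # mean of the last three samples: 4.0, 6.0 and 60.0
```

A larger window damps a single spike more strongly but also reacts more slowly to a genuine, sustained rise in delay.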
Distributed storage includes a front end and a back end. The front end includes multiple front-end computing nodes, the back end includes multiple back-end storage nodes, and the front end and the back end communicate through an interconnection network. Applications can run on the front-end computing nodes, and the storage disks on the back-end storage nodes store the application data of those applications. The back-end storage nodes communicate with one another through the interconnection network between the back-end storage nodes. Each back-end storage node accesses that network through its local network access equipment, which may include, for example, a network cable and a network card. The IO delay of a computing node usually includes the delay from the computing node to the interconnection network between the front end and the back end, the delay from that interconnection network to the storage node, and the internal delay of the storage node; the internal delay of the storage node usually also includes the delay of synchronizing IO data between storage nodes.
Step 120: If the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within a first preset value range, further obtain the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number.
The IO delay of the logical unit number can be calculated by the moving average method. The difference between the IO delay of the computing node and the IO delay threshold of the external disk is calculated first, and it is then determined whether the difference is within the first preset value range. If it is, the IO delay of the computing node is judged to be abnormal and the distributed storage system is judged to have an IO fault. The first preset value range can be set according to actual needs. For example, it may be set to values greater than 10 times the IO delay threshold of the external disk; that is, when the IO delay of the computing node is greater than 11 times the IO delay threshold of the external disk, so that the difference exceeds 10 times the threshold, the difference between the IO delay of the computing node and the IO delay threshold of the external disk is determined to be within the first preset value range.
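The arithmetic behind the example range can be checked with a short sketch. The function name and the default `factor` of 10 are illustrative assumptions taken from the example figure in the text, and "more than 10 times" is read as a strict inequality.

```python
def in_first_preset_range(
    node_delay_ms: float,
    disk_threshold_ms: float,
    factor: float = 10.0,
) -> bool:
    """True when (node delay - threshold) exceeds `factor` times the threshold.

    With factor = 10 this is equivalent to node delay > 11 * threshold.
    """
    return (node_delay_ms - disk_threshold_ms) > factor * disk_threshold_ms


# With a 5 ms threshold the boundary sits at 11 * 5 = 55 ms.
print(in_first_preset_range(56.0, 5.0))
print(in_first_preset_range(55.0, 5.0))
```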
A logical unit number (LUN) is a storage unit that the back-end storage cluster allocates to a front-end computing node. The back-end storage cluster generally includes multiple logical unit numbers. A logical unit number of the back-end storage cluster incurs an IO delay when it executes IO tasks. An IO delay threshold can be set for the logical unit number, and the IO delay of the logical unit number is compared with this threshold to determine whether the logical unit number has timed out while executing IO tasks.
Step 130: If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a second preset value range, locate the IO fault at the front end.
If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, the IO delay of the logical unit number is judged to be normal. If it is further determined that the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the second preset value range, an IO fault has occurred at the front end, and the IO fault is located at the front end.
The second preset value range can be set according to actual needs. For example, it may be set to values greater than 10 times the IO delay of the logical unit number; that is, when the IO delay of the computing node is greater than 11 times the IO delay of the logical unit number, so that the difference exceeds 10 times the IO delay of the logical unit number, the difference between the IO delay of the computing node and the IO delay of the logical unit number is determined to be within the second preset value range.
In a preferred implementation of the embodiment of the present invention, after the IO fault is located at the front end in step 130, the method further includes:
Step 131: If the computing node is a virtual machine, separately obtain the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node, the IO delay of the other virtual machines that correspond to the same physical machine as the computing node, and the IO delay of the other physical machines that correspond to the same logical unit number as the physical machine on which the computing node resides; if the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be abnormal, the IO fault is located in the interconnection network between the front end and the back end; if the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is determined to be normal, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is determined to be normal, the IO fault is located at the computing node; if the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is determined to be abnormal, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is determined to be normal, the IO fault is located at the physical machine on which the computing node resides.
Here, after the IO fault is located at the front end in step 130, it may further be determined whether the computing node is a virtual machine or a physical machine; if the computing node is a virtual machine, step 131 is performed. The IO delay of the other computing nodes corresponding to the same logical unit number as the computing node, the IO delay of the other virtual machines corresponding to the same physical machine as the computing node, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides can each be calculated by the moving average method. After the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is obtained, if that IO delay is determined to be abnormal, the IO fault is located in the interconnection network between the front end and the back end. To determine whether the IO delay of these other computing nodes is abnormal, the IO delay of each other computing node can be compared with its corresponding IO delay threshold, and each computing node whose IO delay is greater than its IO delay threshold is identified as a computing node with abnormal IO delay. The proportion of computing nodes with abnormal IO delay among all the other computing nodes is then counted; if this proportion exceeds a preset ratio, the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be abnormal.
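As an illustrative, non-limiting sketch, the moving average method mentioned above could be realized as a simple moving average over the most recent samples. The class name `MovingAverage`, the window size, and the choice of a simple (unweighted) average are assumptions made for illustration only; the embodiment does not specify which variant of the moving average is used.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the last `window` IO delay samples (hypothetical helper)."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque silently drops the oldest sample once it is full.
        self._samples = deque(maxlen=window)

    def add(self, delay_ms: float) -> float:
        """Record one IO delay sample and return the current smoothed IO delay."""
        self._samples.append(delay_ms)
        return sum(self._samples) / len(self._samples)
```

The smoothed value returned by `add` would then be the IO delay that is compared with the corresponding IO delay threshold, so that a single slow request does not by itself mark a node as abnormal.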
Here, after the IO delay of the other virtual machines corresponding to the same physical machine as the computing node and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides are obtained, if the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is determined to be normal, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is determined to be normal, the IO fault is located at the computing node. If the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is determined to be abnormal, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is determined to be normal, the IO fault is located at the physical machine on which the computing node resides. To determine whether the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is normal, the IO delay of each other virtual machine can be compared with its corresponding IO delay threshold, and each virtual machine whose IO delay is greater than its IO delay threshold is identified as a virtual machine with abnormal IO delay. If none of the other virtual machines has abnormal IO delay, the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is determined to be normal; if any of the other virtual machines has abnormal IO delay, it is determined to be abnormal. Likewise, to determine whether the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is normal, the IO delay of each other physical machine can be compared with its corresponding IO delay threshold, and each physical machine whose IO delay is greater than its IO delay threshold is identified as a physical machine with abnormal IO delay. If none of the other physical machines has abnormal IO delay, the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is determined to be normal; if any of the other physical machines has abnormal IO delay, it is determined to be abnormal.
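The decision logic of step 131 can be sketched as follows. This is only an illustration under stated assumptions: the function names, the string labels, and the "undetermined" result are hypothetical and do not come from the embodiment, and checking the interconnection network first is an assumed ordering, since the embodiment lists the three conditions without stating a precedence.

```python
from typing import Sequence


def any_abnormal(delays_ms: Sequence[float], thresholds_ms: Sequence[float]) -> bool:
    """A group is abnormal if at least one member exceeds its own IO delay threshold."""
    return any(d > t for d, t in zip(delays_ms, thresholds_ms))


def locate_front_end_fault_vm(
    lun_peers_abnormal: bool,
    sibling_vm_delays_ms: Sequence[float],
    sibling_vm_thresholds_ms: Sequence[float],
    peer_host_delays_ms: Sequence[float],
    peer_host_thresholds_ms: Sequence[float],
) -> str:
    """Refine a front-end IO fault when the computing node is a virtual machine."""
    # Assumed ordering: other nodes on the same LUN are checked first.
    if lun_peers_abnormal:
        return "front-end/back-end interconnection network"
    vms_abnormal = any_abnormal(sibling_vm_delays_ms, sibling_vm_thresholds_ms)
    hosts_abnormal = any_abnormal(peer_host_delays_ms, peer_host_thresholds_ms)
    if not hosts_abnormal:
        # Sibling VMs slow too: blame the shared host; otherwise blame this VM.
        return "physical machine" if vms_abnormal else "computing node"
    # The embodiment does not describe this combination; label is an assumption.
    return "undetermined"
```

For example, if the sibling virtual machines on the same host are also slow while the other hosts sharing the logical unit number are healthy, the sketch returns "physical machine", matching the third condition of step 131.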
Step 132: If the computing node is a physical machine, obtain the IO delay of the other computing nodes that correspond to the same logical unit number as the computing node; if the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be abnormal, the IO fault is located in the interconnection network between the front end and the back end; if the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be normal, the IO fault is located at the computing node.
Here, after the IO fault is located at the front end in step 130, it may further be determined whether the computing node is a virtual machine or a physical machine; if the computing node is a physical machine, step 132 is performed. The IO delay of the other computing nodes corresponding to the same logical unit number as the computing node can be calculated by the moving average method. After this IO delay is obtained, if the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be abnormal, the IO fault is located in the interconnection network between the front end and the back end; if it is determined to be normal, the IO fault is located at the computing node. To determine whether the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is normal, the IO delay of each other computing node can be compared with its IO delay threshold, and each computing node whose IO delay is greater than its IO delay threshold is identified as a computing node with abnormal IO delay. If a preset proportion of all the other computing nodes corresponding to the same logical unit number as the computing node have abnormal IO delay, the IO delay of those other computing nodes is determined to be abnormal; otherwise, it is determined to be normal. The preset proportion may be, for example, 90%. The other computing nodes may be virtual machines or physical machines.
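The proportion-based judgment and the resulting step 132 decision can be sketched as below. The function names and string labels are hypothetical, the 0.9 default mirrors the 90% example in the text, and treating "a preset proportion" as "at least that proportion" is an assumption about the boundary case.

```python
from typing import Sequence


def peers_abnormal(
    delays_ms: Sequence[float],
    thresholds_ms: Sequence[float],
    preset_ratio: float = 0.9,
) -> bool:
    """True when at least `preset_ratio` of the other nodes exceed their IO delay threshold."""
    if len(delays_ms) != len(thresholds_ms):
        raise ValueError("each IO delay needs a matching threshold")
    if not delays_ms:
        return False  # no other nodes to compare against (assumed behavior)
    abnormal = sum(1 for d, t in zip(delays_ms, thresholds_ms) if d > t)
    return abnormal / len(delays_ms) >= preset_ratio


def locate_front_end_fault_physical(
    delays_ms: Sequence[float],
    thresholds_ms: Sequence[float],
    preset_ratio: float = 0.9,
) -> str:
    """Refine a front-end IO fault when the computing node is a physical machine."""
    if peers_abnormal(delays_ms, thresholds_ms, preset_ratio):
        return "front-end/back-end interconnection network"
    return "computing node"
```

The design intuition is that when nearly every node sharing the logical unit number is slow, the common path (the interconnection network) is the likely cause, whereas an isolated slow node points to the node itself.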
Step 140: If the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within a third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within a fourth preset value range, the IO fault is located at the back end.
Here, if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is determined to be within the third preset value range, the IO delay of the logical unit number is judged to be abnormal. The third preset value range can be set according to actual needs. For example, it can be set to the range of values that are 10 times or more the IO delay threshold of the logical unit number; that is, when the IO delay of the logical unit number is greater than 11 times the IO delay threshold of the logical unit number, the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is determined to be within the third preset value range.
Here, if the difference between the IO delay of the computing node and the IO delay of the logical unit number is determined to be within the fourth preset value range, the IO delay of the computing node is judged to be relatively close to the IO delay of the logical unit number. The fourth preset value range can be set according to actual needs. For example, it can be set to the range of values that are 0.1 times or less the IO delay of the logical unit number; that is, when the IO delay of the computing node is less than 1.1 times the IO delay of the logical unit number, the difference between the IO delay of the computing node and the IO delay of the logical unit number is determined to be within the fourth preset value range.
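Steps 130 and 140 together amount to a two-way classification, which can be sketched as follows using the example ranges given above (factor 10 for the second and third preset value ranges, factor 0.1 for the fourth). The function name, parameter names, string labels, and the "undetermined" fallback are assumptions for illustration; the strict inequalities follow the "greater than 11 times" and "less than 1.1 times" wording.

```python
def classify_io_fault(
    node_delay_ms: float,
    lun_delay_ms: float,
    lun_threshold_ms: float,
    large_factor: float = 10.0,  # example value for the second and third ranges
    near_factor: float = 0.1,    # example value for the fourth range
) -> str:
    """Locate an IO fault at the front end or the back end from three measurements."""
    node_lun_diff = node_delay_ms - lun_delay_ms

    # Step 130: LUN is within its threshold, yet the node is far slower than the LUN.
    if lun_delay_ms <= lun_threshold_ms and node_lun_diff > large_factor * lun_delay_ms:
        return "front end"

    # Step 140: LUN far exceeds its threshold and the node delay tracks the LUN delay.
    lun_excess = lun_delay_ms - lun_threshold_ms
    if lun_excess > large_factor * lun_threshold_ms and node_lun_diff < near_factor * lun_delay_ms:
        return "back end"

    # Combinations not covered by steps 130 and 140 (assumed label).
    return "undetermined"
```

For instance, with a LUN threshold of 20 ms, a LUN delay of 10 ms and a node delay of 120 ms yield "front end" (110 ms exceeds 10 x 10 ms), while a LUN delay of 225 ms and a node delay of 230 ms yield "back end".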
In a preferred implementation of the embodiment of the present invention, after the IO fault is located at the back end in step 140, the method further includes:
Step 141: Obtain the ping network delay of each back-end storage node, the average of the ping network delays of all back-end storage nodes, and the ping network delay threshold of the back-end storage nodes; if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a fifth preset value range, and the difference between the ping network delay of each back-end storage node and the ping network delay threshold of the back-end storage nodes is within a sixth preset value range, the IO fault is located in the interconnection network between the back-end storage nodes; if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a seventh preset value range, the difference between the ping network delay of each of a preset number of back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within an eighth preset value range, and the ping network delay of the back-end storage nodes other than the preset number of back-end storage nodes is not greater than the ping network delay threshold of the back-end storage nodes, the IO fault is located in the network of the preset number of back-end storage nodes; if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within a ninth preset value range, obtain the IO delay of the built-in disks of all back-end storage nodes and the IO delay threshold of each built-in disk, identify each built-in disk for which the difference between its IO delay and its IO delay threshold is within a tenth preset value range as an abnormal built-in disk, and locate the IO fault at the abnormal built-in disk.
Here, the ping network delay of each back-end storage node and the average of the ping network delays of all back-end storage nodes can be calculated by the moving average method. If the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the fifth preset value range, the average of the ping network delays of all back-end storage nodes is judged to be abnormal; the fifth preset value range can be set, for example, to the range of values that are 10 times or more the ping network delay threshold of the back-end storage nodes. If the difference between the ping network delay of each back-end storage node and the ping network delay threshold of the back-end storage nodes is within the sixth preset value range, the ping network delay of every back-end storage node is judged to be abnormal; the sixth preset value range can be set, for example, to the range of values that are 10 times or more the ping network delay threshold of the back-end storage nodes.
Here, if the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the seventh preset value range, the average of the ping network delays of all back-end storage nodes is judged to be abnormal; the seventh preset value range can be set, for example, to the range of values that are 10 times or more the ping network delay threshold of the back-end storage nodes. If the difference between the ping network delay of each of the preset number of back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the eighth preset value range, the ping network delay of the preset number of back-end storage nodes is judged to be abnormal; the eighth preset value range can be set, for example, to the range of values that are 10 times or more the ping network delay threshold of the back-end storage nodes, and the preset number can be set, for example, to 1. If the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the ninth preset value range, the average of the ping network delays of all back-end storage nodes is judged to be normal; the ninth preset value range can be set, for example, to the range of values within 0.1 times the ping network delay threshold of the back-end storage nodes. The tenth preset value range can be set, for example, to the range of values that are 10 times or more the IO delay threshold of the built-in disk.
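The three branches of step 141 can be sketched as follows, using the example values above (factor 10 for the fifth through eighth and the tenth preset value ranges, factor 0.1 for the ninth, preset number 1). All names and labels are hypothetical. Two interpretations are assumptions: "10 times or more" is implemented as a strict comparison, and the ninth range is read as "the average exceeds the threshold by no more than 0.1 times the threshold", so an average below the threshold also counts as normal.

```python
from statistics import mean


def locate_back_end_fault(
    ping_ms: dict[str, float],            # ping network delay per storage node
    ping_threshold_ms: float,
    disk_delay_ms: dict[str, float],      # IO delay per built-in disk
    disk_threshold_ms: dict[str, float],  # IO delay threshold per built-in disk
    preset_count: int = 1,
    large_factor: float = 10.0,
    near_factor: float = 0.1,
) -> tuple[str, list[str]]:
    """Return a (location, suspects) pair for an IO fault already located at the back end."""

    def far_above(value: float, threshold: float) -> bool:
        return value - threshold > large_factor * threshold

    average = mean(ping_ms.values())
    slow_nodes = [n for n, v in ping_ms.items() if far_above(v, ping_threshold_ms)]

    if far_above(average, ping_threshold_ms):
        # Fifth and sixth ranges: every storage node is slow.
        if len(slow_nodes) == len(ping_ms):
            return ("interconnection network between storage nodes", [])
        # Seventh and eighth ranges: a preset number of nodes is slow, the rest are fine.
        others_ok = all(
            v <= ping_threshold_ms for n, v in ping_ms.items() if n not in slow_nodes
        )
        if len(slow_nodes) == preset_count and others_ok:
            return ("network of storage nodes", slow_nodes)
        return ("undetermined", [])

    # Ninth range: the network looks normal, so inspect the built-in disks.
    if average - ping_threshold_ms <= near_factor * ping_threshold_ms:
        bad_disks = [
            d for d, v in disk_delay_ms.items() if far_above(v, disk_threshold_ms[d])
        ]
        if bad_disks:
            return ("built-in disk", bad_disks)
    return ("undetermined", [])
```

The ordering reflects the idea that network-wide slowness is ruled out first, then slowness confined to a few nodes, and only when the storage network appears healthy are individual built-in disks examined.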
In the embodiment of the present invention, the IO delay of the front-end computing node and the IO delay threshold of the external disk of the computing node are obtained; if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within the first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number are further obtained; if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the second preset value range, the IO fault is located at the front end; if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within the third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the fourth preset value range, the IO fault is located at the back end. It can be seen that, by comparing the IO delay of the front-end computing node with the IO delay of the back-end logical unit number, the embodiment of the present invention can locate an IO fault quickly and accurately.
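FIG. 2 shows a schematic structural diagram of an embodiment of the IO fault locating apparatus of the present invention. As shown in FIG. 2, the apparatus 300 includes a first obtaining module 310, a second obtaining module 320, and a locating module 330.
The first obtaining module 310 is configured to obtain the IO delay of the front-end computing node and the IO delay threshold of the external disk of the computing node;
The second obtaining module 320 is configured to obtain, if the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within the first preset value range, the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number;
The locating module 330 is configured to locate the IO fault at the front end if the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the second preset value range; and to locate the IO fault at the back end if the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within the third preset value range and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the fourth preset value range.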
In an optional manner, the locating module 330 is further configured to, after the IO fault is located at the front end and when the computing node is a virtual machine, separately obtain the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node, the IO delay of the other virtual machines corresponding to the same physical machine as the computing node, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides;
If the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be abnormal, the IO fault is located in the interconnection network between the front end and the back end;
If the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is determined to be normal, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is determined to be normal, the IO fault is located at the computing node;
If the IO delay of the other virtual machines corresponding to the same physical machine as the computing node is determined to be abnormal, and the IO delay of the other physical machines corresponding to the same logical unit number as the physical machine on which the computing node resides is determined to be normal, the IO fault is located at the physical machine on which the computing node resides.
In an optional manner, the locating module 330 is further configured to, after the IO fault is located at the front end and when the computing node is a physical machine, obtain the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node;
If the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be abnormal, the IO fault is located in the interconnection network between the front end and the back end;
If the IO delay of the other computing nodes corresponding to the same logical unit number as the computing node is determined to be normal, the IO fault is located at the computing node.
In an optional manner, the locating module 330 is further configured to, after the IO fault is located at the back end, obtain the ping network delay of each back-end storage node, the average of the ping network delays of all back-end storage nodes, and the ping network delay threshold of the back-end storage nodes;
If the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the fifth preset value range, and the difference between the ping network delay of each back-end storage node and the ping network delay threshold of the back-end storage nodes is within the sixth preset value range, the IO fault is located in the interconnection network between the back-end storage nodes;
If the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the seventh preset value range, the difference between the ping network delay of each of a preset number of back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the eighth preset value range, and the ping network delay of the back-end storage nodes other than the preset number of back-end storage nodes is not greater than the ping network delay threshold of the back-end storage nodes, the IO fault is located in the network of the preset number of back-end storage nodes;
If the difference between the average of the ping network delays of all back-end storage nodes and the ping network delay threshold of the back-end storage nodes is within the ninth preset value range, the IO delay of the built-in disks of all back-end storage nodes and the IO delay threshold of each built-in disk are obtained, each built-in disk for which the difference between its IO delay and its IO delay threshold is within the tenth preset value range is identified as an abnormal built-in disk, and the IO fault is located at the abnormal built-in disk.
In the embodiment of the present invention, the first obtaining module can obtain the IO delay of the front-end computing node and the IO delay threshold of the external disk of the computing node; the second obtaining module obtains the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number when the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within the first preset value range; and the locating module can compare the IO delay of the logical unit number with the IO delay threshold of the logical unit number, and compare the IO delay of the computing node with the IO delay of the logical unit number, so as to locate the IO fault at the front end or at the back end. It can be seen that the IO fault locating apparatus of the embodiment of the present invention can locate an IO fault quickly and accurately.
FIG. 3 shows a schematic structural diagram of an embodiment of the computing device of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.
As shown in FIG. 3, the computing device may include a processor 402, a communications interface 404, a memory 406, and a communication bus 408.
The processor 402, the communications interface 404, and the memory 406 communicate with one another through the communication bus 408. The communications interface 404 is used to communicate with network elements of other devices, such as clients or other servers. The processor 402 is used to execute the program 410, and may specifically perform the relevant steps in the above embodiments of the IO fault location method.
Specifically, the program 410 may include program code, and the program code includes computer-executable instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the computing device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 406 is used to store the program 410. The memory 406 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory.
Specifically, the program 410 can be invoked by the processor 402 to cause the computing device to perform the following operations:
Obtain the IO delay of the front-end computing node and the IO delay threshold of the external disk of the computing node;
If the difference between the IO delay of the computing node and the IO delay threshold of the external disk is within the first preset value range, further obtain the IO delay of the logical unit number corresponding to the computing node and the IO delay threshold of the logical unit number;
If the IO delay of the logical unit number is not greater than the IO delay threshold of the logical unit number, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the second preset value range, locate the IO fault at the front end;
If the difference between the IO delay of the logical unit number and the IO delay threshold of the logical unit number is within the third preset value range, and the difference between the IO delay of the computing node and the IO delay of the logical unit number is within the fourth preset value range, locate the IO fault at the back end.
In an optional manner, the program 410 is invoked by the processor 402 to cause the computing device to perform the following operations:
When the computing node is a virtual machine, separately obtain the IO latency of the other computing nodes that correspond to the same logical unit number as the computing node, the IO latency of the other virtual machines that reside on the same physical machine as the computing node, and the IO latency of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node;
If the IO latency of the other computing nodes that correspond to the same logical unit number as the computing node is judged to be abnormal, locate the IO fault in the interconnection network between the front end and the back end;
If the IO latency of the other virtual machines on the same physical machine as the computing node is judged to be normal, and the IO latency of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node is judged to be normal, locate the IO fault at the computing node;
If the IO latency of the other virtual machines on the same physical machine as the computing node is judged to be abnormal, and the IO latency of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node is judged to be normal, locate the IO fault at the physical machine hosting the computing node.
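As a non-limiting illustration, the three virtual-machine rules above could be sketched in Python as follows. The function name `locate_vm_fault`, the boolean inputs, and the returned strings are hypothetical. How each group of latencies is judged normal or abnormal is left to the caller, and the final `undetermined` branch covers combinations that the rules above do not address.

```python
def locate_vm_fault(
    same_lun_nodes_abnormal: bool,
    same_host_vms_abnormal: bool,
    same_lun_hosts_abnormal: bool,
) -> str:
    """Apply the virtual-machine rules in the order they are stated."""
    # Other computing nodes on the same LUN are also slow:
    # the shared path between front end and back end is suspected.
    if same_lun_nodes_abnormal:
        return "front-end/back-end interconnection network"
    # The remaining two rules both require the other physical machines
    # on the same LUN to be normal.
    if not same_lun_hosts_abnormal:
        if same_host_vms_abnormal:
            # Sibling VMs on the same host are slow as well.
            return "physical machine hosting the computing node"
        # Only this VM is slow.
        return "computing node"
    # Combination not covered by the rules above (hypothetical fallback).
    return "undetermined"
```

For example, `locate_vm_fault(False, True, False)` points at the hosting physical machine, because only the virtual machines sharing that host are affected.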
In an optional manner, the program 410 is invoked by the processor 402 to cause the computing device to perform the following operations:
When the computing node is a physical machine, obtain the IO latency of the other computing nodes that correspond to the same logical unit number as the computing node;
If the IO latency of the other computing nodes that correspond to the same logical unit number as the computing node is judged to be abnormal, locate the IO fault in the interconnection network between the front end and the back end;
If the IO latency of the other computing nodes that correspond to the same logical unit number as the computing node is judged to be normal, locate the IO fault at the computing node.
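The physical-machine case can be illustrated in the same spirit. The sketch below is only one possible reading: it assumes that the peer nodes are judged abnormal when any of their IO latencies exceeds a configured threshold, which the description does not prescribe. The names `peers_abnormal` and `locate_physical_fault` are hypothetical.

```python
from typing import Sequence


def peers_abnormal(latencies_ms: Sequence[float], threshold_ms: float) -> bool:
    # Assumption: one peer above its threshold is enough to call the group abnormal.
    return any(latency > threshold_ms for latency in latencies_ms)


def locate_physical_fault(peer_latencies_ms: Sequence[float], peer_threshold_ms: float) -> str:
    """Two-way decision for a computing node that is a physical machine."""
    if peers_abnormal(peer_latencies_ms, peer_threshold_ms):
        # Other nodes on the same LUN are slow too: suspect the shared network.
        return "front-end/back-end interconnection network"
    # Peers are fine (or there are none): the fault is local to this node.
    return "computing node"
```

Note that with an empty list of peers the sketch falls through to the computing node, since no peer can be shown to be abnormal.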
In an optional manner, the program 410 is invoked by the processor 402 to cause the computing device to perform the following operations:
Obtain the ping network latency of each back-end storage node, the average ping network latency of all back-end storage nodes, and the ping network latency threshold of the back-end storage nodes;
If the difference between the average ping network latency of all back-end storage nodes and the ping network latency threshold of the back-end storage nodes is within a fifth preset value range, and the difference between the ping network latency of each back-end storage node and the ping network latency threshold is within a sixth preset value range, locate the IO fault in the interconnection network between the back-end storage nodes;
If the difference between the average ping network latency of all back-end storage nodes and the ping network latency threshold of the back-end storage nodes is within a seventh preset value range, the differences between the ping network latencies of a preset number of back-end storage nodes and the ping network latency threshold are all within an eighth preset value range, and the ping network latencies of the back-end storage nodes other than the preset number of back-end storage nodes are not greater than the ping network latency threshold, locate the IO fault in the network of the preset number of back-end storage nodes;
If the difference between the average ping network latency of all back-end storage nodes and the ping network latency threshold of the back-end storage nodes is within a ninth preset value range, obtain the IO latencies of the built-in disks of all back-end storage nodes and the IO latency threshold corresponding to each built-in disk, identify as an abnormal built-in disk any built-in disk for which the difference between its IO latency and its IO latency threshold is within a tenth preset value range, and locate the IO fault at the abnormal built-in disk.
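The back-end rules above may be easier to follow as code. The following Python sketch is an illustration under stated assumptions, not the claimed implementation: each preset value range is modeled as an inclusive `(low, high)` interval in milliseconds, the "preset number" is read as an exact count, and every name (`locate_backend_fault`, `abnormal_disks`, the `ranges` dictionary keyed by range ordinal) is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # assumed inclusive (low, high) bounds


def in_range(value: float, bounds: Range) -> bool:
    low, high = bounds
    return low <= value <= high


def locate_backend_fault(
    ping_ms: Dict[str, float],   # storage node name -> ping latency
    ping_threshold: float,
    ranges: Dict[int, Range],    # keys 5..9: fifth to ninth preset ranges
    preset_count: int,
) -> Tuple[str, List[str]]:
    avg_diff = mean(ping_ms.values()) - ping_threshold
    diffs = {node: value - ping_threshold for node, value in ping_ms.items()}

    # Rule 1: average and every node exceed the threshold by the preset margins.
    if in_range(avg_diff, ranges[5]) and all(in_range(d, ranges[6]) for d in diffs.values()):
        return ("interconnection network between storage nodes", [])

    # Rule 2: only a preset number of nodes are slow, the rest are within threshold.
    slow = sorted(node for node, d in diffs.items() if in_range(d, ranges[8]))
    rest_ok = all(ping_ms[node] <= ping_threshold for node in ping_ms if node not in slow)
    if in_range(avg_diff, ranges[7]) and len(slow) == preset_count and rest_ok:
        return ("network of the listed storage nodes", slow)

    # Rule 3: the network looks healthy, so the built-in disks are inspected next.
    if in_range(avg_diff, ranges[9]):
        return ("inspect built-in disks", [])

    return ("undetermined", [])


def abnormal_disks(
    disk_latency: Dict[str, float],
    disk_threshold: Dict[str, float],
    range10: Range,
) -> List[str]:
    # A disk is abnormal when (latency - its own threshold) falls in the tenth range.
    return sorted(
        disk for disk, latency in disk_latency.items()
        if in_range(latency - disk_threshold[disk], range10)
    )
```

When the third rule fires, the caller would collect the built-in disk latencies and pass them to `abnormal_disks` to obtain the disks at which the IO fault is located.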
In this embodiment of the present invention, the program in the computing device is invoked by the processor so that the computing device obtains the IO latency of a front-end computing node and the IO latency threshold of the external disk of the computing node. If the difference between the IO latency of the computing node and the IO latency threshold of the external disk is within a first preset value range, the computing device further obtains the IO latency of the logical unit number corresponding to the computing node and the IO latency threshold of the logical unit number. If the IO latency of the logical unit number is not greater than the IO latency threshold of the logical unit number, and the difference between the IO latency of the computing node and the IO latency of the logical unit number is within a second preset value range, the IO fault is located at the front end. If the difference between the IO latency of the logical unit number and the IO latency threshold of the logical unit number is within a third preset value range, and the difference between the IO latency of the computing node and the IO latency of the logical unit number is within a fourth preset value range, the IO fault is located at the back end. It can be seen that the computing device can locate IO faults quickly and accurately.
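The front-end/back-end decision summarized above can be illustrated with a short Python sketch. It assumes that each preset value range is an inclusive `(low, high)` interval in milliseconds; the function name `locate_front_or_back` and its parameter names are hypothetical, and the function returns `None` when no rule applies.

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # assumed inclusive (low, high) bounds


def in_range(value: float, bounds: Range) -> bool:
    low, high = bounds
    return low <= value <= high


def locate_front_or_back(
    node_latency: float,     # IO latency seen by the front-end computing node
    disk_threshold: float,   # IO latency threshold of the node's external disk
    lun_latency: float,      # IO latency of the corresponding LUN
    lun_threshold: float,    # IO latency threshold of that LUN
    r1: Range, r2: Range, r3: Range, r4: Range,
) -> Optional[str]:
    # Step 1: only continue when the node's latency exceeds the external-disk
    # threshold by an amount inside the first preset range.
    if not in_range(node_latency - disk_threshold, r1):
        return None
    # Step 2: the LUN itself is healthy but the node sees much more latency.
    if lun_latency <= lun_threshold and in_range(node_latency - lun_latency, r2):
        return "front end"
    # Step 3: the LUN exceeds its threshold and accounts for the node's latency.
    if in_range(lun_latency - lun_threshold, r3) and in_range(node_latency - lun_latency, r4):
        return "back end"
    return None
```

In this sketch a healthy LUN combined with a large gap between node and LUN latency points at the front end, while a slow LUN that accounts for most of the node latency points at the back end.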
FIG. 4 shows a schematic structural diagram of an embodiment of the IO fault locating device of the present invention. As shown in FIG. 4, the device 500 includes:
A collection module 510, configured to collect IO latency information.
The IO latency information may include: the IO latency of the front-end computing node, the IO latency of the logical unit number (LUN) corresponding to the computing node, and the ping network latency of each back-end storage node. When the computing node is a virtual machine, it may also include the IO latency of the other computing nodes that correspond to the same LUN as the computing node, the IO latency of the other virtual machines on the same physical machine as the computing node, and the IO latency of the other physical machines that correspond to the same LUN as the physical machine hosting the computing node. When the current computing node is a physical machine, it may also include the IO latency of the other computing nodes that correspond to the same LUN as the computing node.
A configuration module 520, configured to configure IO latency thresholds.
The IO latency thresholds may include: the IO latency threshold of the external disk of the computing node, the IO latency threshold of the logical unit number corresponding to the computing node, and the ping network latency threshold of the back-end storage nodes. When the computing node is a virtual machine, the configuration module also configures the IO latency threshold of the other computing nodes that correspond to the same logical unit number as the computing node, the IO latency threshold of the other virtual machines on the same physical machine as the computing node, and the IO latency threshold of the other physical machines that correspond to the same logical unit number as the physical machine hosting the computing node. When the current computing node is a physical machine, the configuration module also configures the IO latency threshold of the other computing nodes that correspond to the same logical unit number as the computing node.
A locating module 530, configured to execute the IO fault locating method described above so as to locate the IO fault.
A display module 540, configured to obtain the result of the locating module's locating of the IO fault and to display the result.
In an optional manner, the collection module 510 is further configured to collect IOPS information. IOPS, that is, Input/Output Per Second, refers to the amount of input and output a disk performs per second. The collection module may collect the IOPS information of the storage disks of the back-end storage nodes.
In an optional manner, the display module 540 is further configured to display the IO latency information collected by the collection module 510, and to display the correspondence between the external disks of the front-end computing nodes and the logical unit numbers of the back-end storage cluster.
In an optional manner, the configuration module 520 is further configured to:
Obtain the configuration information of the front-end computing node and the configuration information of the back-end storage cluster, and send both to the locating module 530;
The locating module 530 is further configured to locate the IO fault according to the configuration information of the front-end computing node and the configuration information of the back-end storage cluster.
The configuration information of the front-end computing node may include: the host name of the computing node, the external disk information of the computing node, and the information of the storage cluster corresponding to the computing node. The configuration information of the back-end storage cluster may include: the name of the storage cluster corresponding to the computing node, the number of storage nodes in that storage cluster and the name of each storage node, the number of logical unit numbers into which that storage cluster is divided, and the node information of the other computing nodes that correspond to the same storage cluster as the computing node.
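One way to picture the configuration information listed above is as a pair of small records. The Python sketch below is an assumption for illustration only: the class names, field names, and the helper `peers_on_same_lun` are hypothetical, and the mapping from external disk to LUN is just one plausible form for the "external disk information".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComputeNodeConfig:
    hostname: str
    disk_to_lun: Dict[str, str]   # external disk name -> LUN identifier (assumed form)
    cluster_name: str             # storage cluster serving this node


@dataclass
class StorageClusterConfig:
    name: str
    storage_nodes: List[str]      # names of the storage nodes in the cluster
    lun_count: int                # number of LUNs the cluster is divided into
    compute_nodes: List[ComputeNodeConfig] = field(default_factory=list)

    def peers_on_same_lun(self, hostname: str) -> List[str]:
        """Other computing nodes that share at least one LUN with `hostname`."""
        own_luns = {
            lun
            for node in self.compute_nodes
            if node.hostname == hostname
            for lun in node.disk_to_lun.values()
        }
        return sorted(
            node.hostname
            for node in self.compute_nodes
            if node.hostname != hostname and own_luns & set(node.disk_to_lun.values())
        )
```

A locating module could use such a lookup to decide which peer latencies to compare when the computing node is a virtual machine or a physical machine.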
In an optional manner, the configuration module 520 is further configured to obtain the correspondence between the upper and lower layers of the back-end storage cluster and to send the obtained correspondence to the locating module 530;
The locating module 530 is further configured to locate the IO fault according to the correspondence between the upper and lower layers of the back-end storage cluster.
In an optional manner, the configuration module 520 is further configured to obtain reference information and to configure the IO latency thresholds according to the reference information. The reference information may include: the hardware configuration of the computing node, the hardware configuration of the back-end storage nodes corresponding to the computing node, and the IO latency requirements of the tasks processed by the computing node.
In this embodiment of the present invention, the collection module collects IO latency information, the configuration module configures IO latency thresholds, the locating module executes the IO fault locating method to locate the IO fault, and the display module obtains the result of the locating module's locating of the IO fault and displays it. It can be seen that the IO fault locating device of this embodiment can locate an IO fault quickly and accurately and display the locating result, which facilitates further analysis of the IO fault.
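The cooperation of the four modules can be pictured with the following Python sketch. It is a simplified assumption rather than the structure of device 500 itself: each module is reduced to a callable or a dictionary, and the names `FaultLocatingDevice` and `run_once` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FaultLocatingDevice:
    # Collection module: returns the latest IO latency readings.
    collect: Callable[[], Dict[str, float]]
    # Output of the configuration module: configured IO latency thresholds.
    thresholds: Dict[str, float]
    # Locating module: maps readings and thresholds to a fault location.
    locate: Callable[[Dict[str, float], Dict[str, float]], str]
    # Display module: presents the result (prints by default).
    show: Callable[[str], None] = print

    def run_once(self) -> str:
        latencies = self.collect()
        result = self.locate(latencies, self.thresholds)
        self.show(result)
        return result
```

Keeping the modules behind plain callables makes it straightforward to substitute a different collection source or display target without touching the locating logic.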
An embodiment of the present invention provides a computer-readable storage medium. The storage medium stores at least one executable instruction which, when run on an IO fault locating apparatus, causes the IO fault locating apparatus to execute the IO fault locating method of any of the foregoing method embodiments.
An embodiment of the present invention provides a computer program that can be invoked by a processor to cause a computing device to execute the IO fault locating method of any of the foregoing method embodiments.
An embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions which, when run on a computer, cause the computer to execute the IO fault locating method of any of the foregoing method embodiments.
The algorithms or displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such systems is apparent. Furthermore, the embodiments of the present invention are not directed to any particular programming language. It should be understood that the content of the present invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the present invention.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the present invention and aid in the understanding of one or more of the various inventive aspects, the features of the embodiments of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into a single module, unit, or component, and may also be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be interpreted as names. Unless otherwise specified, the steps in the above embodiments should not be understood as limiting the order of execution.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011104291.0A CN114428703B (en) | 2020-10-15 | 2020-10-15 | IO fault location method, device, equipment and computer readable storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011104291.0A CN114428703B (en) | 2020-10-15 | 2020-10-15 | IO fault location method, device, equipment and computer readable storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114428703A (en) | 2022-05-03 |
| CN114428703B (en) | 2025-04-29 |
Family
ID=81309297
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011104291.0A Active CN114428703B (en) | 2020-10-15 | 2020-10-15 | IO fault location method, device, equipment and computer readable storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114428703B (en) |
Citations (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101582915A (en) * | 2008-05-14 | 2009-11-18 | 株式会社日立制作所 | Storage system and method of managing a storage system using a managing apparatus |
| CN105024879A (en) * | 2015-07-15 | 2015-11-04 | 中国船舶重工集团公司第七0九研究所 | Virtual machine fault detection and recovery system and virtual machine detection, recovery and starting method |
| US20170010931A1 (en) * | 2015-07-08 | 2017-01-12 | Cisco Technology, Inc. | Correctly identifying potential anomalies in a distributed storage system |
| CN106407083A (en) * | 2016-10-26 | 2017-02-15 | 华为技术有限公司 | Fault detection method and device |
| US20170147425A1 (en) * | 2015-11-25 | 2017-05-25 | Salesforce.Com, Inc. | System and method for monitoring and detecting faulty storage devices |
| CN106874136A (en) * | 2017-02-22 | 2017-06-20 | 郑州云海信息技术有限公司 | The fault handling method and device of a kind of storage system |
| CN109088794A (en) * | 2018-08-20 | 2018-12-25 | 郑州云海信息技术有限公司 | A kind of fault monitoring method and device of node |
| CN109542704A (en) * | 2018-11-23 | 2019-03-29 | 郑州云海信息技术有限公司 | A kind of I/O request method for tracing, device and relevant device |
| CN109697193A (en) * | 2017-10-24 | 2019-04-30 | 中兴通讯股份有限公司 | A kind of method, node and the computer readable storage medium of determining abnormal nodes |
| CN109815048A (en) * | 2019-01-31 | 2019-05-28 | 新华三技术有限公司成都分公司 | Method for reading data, device and equipment |
| CN110730110A (en) * | 2019-10-18 | 2020-01-24 | 深圳市网心科技有限公司 | Node exception handling method, electronic device, system and medium |
| CN110740072A (en) * | 2018-07-20 | 2020-01-31 | 华为技术有限公司 | fault detection method, device and related equipment |
| CN110750213A (en) * | 2019-09-09 | 2020-02-04 | 华为技术有限公司 | Hard disk management method and device |
| CN110932894A (en) * | 2019-11-22 | 2020-03-27 | 北京金山云网络技术有限公司 | Network fault positioning method and device of cloud storage system and electronic equipment |
| CN110943864A (en) * | 2019-11-29 | 2020-03-31 | 北京金山云网络技术有限公司 | Network anomaly positioning method and device of distributed storage system |
| CN111104239A (en) * | 2019-11-21 | 2020-05-05 | 北京浪潮数据技术有限公司 | Hard disk fault processing method, system and device for distributed storage cluster |
| CN111756573A (en) * | 2020-05-28 | 2020-10-09 | 浪潮电子信息产业股份有限公司 | CTDB dual network card fault monitoring method and related equipment in distributed cluster |
Non-Patent Citations (3)
| Title |
|---|
| "磁盘 IO 和网络 IO 该如何评估、监控、性能定位和优化?", Retrieved from the Internet <URL:阿里云开发者社区> * |
| TSE-WEI WU; DONG-ZHEN LEE: "Layout-Based Dual-Cell-Aware Tests", 2019 IEEE 37TH VLSI TEST SYMPOSIUM(VTS), 11 July 2019 (2019-07-11) * |
| 夏晓峰;何常胜;: "LSM结合邻居干扰抵抗模型的传感器网络节点故障检测", 湘潭大学自然科学学报, no. 01, 15 March 2016 (2016-03-15) * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114428703B (en) | 2025-04-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |