CN114490157A - A fault detection method, device, equipment and storage medium - Google Patents
- Publication number
- CN114490157A (application number CN202210091905.9A)
- Authority
- CN
- China
- Prior art keywords
- node
- faulty
- fault
- main
- chain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0778—Dumping, i.e. gathering error/state information after a fault for later diagnosis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Computer Hardware Design (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
Technical Field
Embodiments of the present invention relate to the technical field of artificial intelligence, and in particular to a fault detection method, apparatus, device, and storage medium.
Background
With the rapid development of computer technology, Internet business systems contain more and more task nodes, and their technical architectures are becoming increasingly complex. When a business system fails, operation and maintenance personnel need to quickly and accurately determine the root faulty node that caused the failure and, based on the monitoring data of the root fault, such as CPU (central processing unit) usage, memory usage, disk usage, QPS (queries per second), and TPS (transactions per second, i.e. throughput), carry out subsequent analysis and fault handling.
However, when a business system fails, the monitoring data of each task node is generally analyzed based on a relationship table of the task nodes to determine a faulty node chain. This approach cannot pinpoint the root faulty node within the faulty node chain, which reduces the accuracy of fault detection.
Summary of the Invention
Embodiments of the present application provide a fault detection method, apparatus, device, and storage medium for improving the accuracy of fault detection.
In one aspect, an embodiment of the present application provides a fault detection method, the method including:
obtaining an original faulty node chain, where the original faulty node chain includes multiple original faulty nodes;
selecting main faulty nodes in sequence from the multiple original faulty nodes along the extension direction of the original faulty node chain, and determining a main faulty node chain based on the selected main faulty nodes;
determining a target root faulty node from the main faulty node chain based on the causal probabilities between the main faulty nodes in the main faulty node chain.
Optionally, obtaining the original faulty node chain includes:
obtaining an initial reference faulty node in a task node chain;
based on the reference faulty node and the fault similarity between adjacent task nodes in the task node chain, iteratively determining the original faulty node chain from the task node chain, where each iteration includes the following steps:
determining, from the task node chain, at least one candidate node adjacent to the reference faulty node;
determining the fault similarity between the reference faulty node and each of the at least one candidate node;
based on the obtained fault similarities, selecting at least one original faulty node from the at least one candidate node, and taking the at least one original faulty node as the reference faulty node.
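The iteration above can be sketched as a breadth-first expansion from the initial reference faulty node. This is only an illustrative sketch, not the patented implementation: the function name `expand_fault_chain`, the adjacency dictionary, the caller-supplied `similarity` function, and the fixed `threshold` used to decide which candidates count as original faulty nodes are all assumptions, since the text only says that nodes are selected "based on the obtained fault similarities".

```python
from collections import deque
from typing import Callable, Dict, List, Set


def expand_fault_chain(
    adjacency: Dict[str, List[str]],
    start: str,
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,  # hypothetical selection rule, not from the source
) -> Set[str]:
    """Grow the original faulty node chain outward from the initial reference node."""
    chain = {start}
    queue = deque([start])
    while queue:
        reference = queue.popleft()
        # Candidates are the task nodes adjacent to the current reference node.
        for candidate in adjacency.get(reference, []):
            if candidate in chain:
                continue
            if similarity(reference, candidate) >= threshold:
                chain.add(candidate)
                queue.append(candidate)  # a selected node becomes a new reference node
    return chain


# Hypothetical topology and similarity scores, for illustration only.
adjacency = {
    "A": ["B"], "B": ["A", "D"], "D": ["B", "E", "F"],
    "E": ["D"], "F": ["D", "G"], "G": ["F"],
}
scores = {
    frozenset("AB"): 0.2, frozenset("BD"): 0.9, frozenset("DE"): 0.85,
    frozenset("DF"): 0.9, frozenset("FG"): 0.95,
}
chain = expand_fault_chain(
    adjacency, "B", lambda a, b: scores.get(frozenset((a, b)), 0.0)
)
print(sorted(chain))
```

With these made-up scores, node A is left out because its similarity to B falls below the assumed threshold, while D, E, F, and G are added in turn.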
Optionally, determining the fault similarity between the reference faulty node and each of the at least one candidate node includes:
performing the following steps for each of the at least one candidate node:
obtaining reference node resource attribute information corresponding to the reference faulty node within a preset time period, and candidate node resource attribute information corresponding to the candidate node within the preset time period;
determining the fault similarity between the reference faulty node and the candidate node based on the reference node resource attribute information and the candidate node resource attribute information.
Optionally, determining the fault similarity between the reference faulty node and the candidate node based on the reference node resource attribute information and the candidate node resource attribute information includes:
determining a reference node resource attribute image based on the reference node resource attribute information;
determining a candidate node resource attribute image based on the candidate node resource attribute information;
using a similarity network model to determine the image similarity between the reference node resource attribute image and the candidate node resource attribute image;
taking the image similarity as the fault similarity between the reference faulty node and the candidate node.
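To make the image-based comparison concrete, the sketch below rasterizes a metric series (for example, CPU usage in percent over the preset time period) into a small binary image and compares two such images. It is a simplified stand-in under stated assumptions: the text uses a similarity network model, whose structure is not given in this passage, so plain cosine similarity over pixels is substituted here purely for illustration. The names `to_image` and `cosine_similarity` and the parameters `height` and `max_value` are hypothetical.

```python
import math
from typing import List, Sequence


def to_image(
    series: Sequence[float], height: int = 10, max_value: float = 100.0
) -> List[List[int]]:
    """Rasterize a metric series into a height x len(series) binary image of its curve."""
    image = [[0] * len(series) for _ in range(height)]
    for col, value in enumerate(series):
        clipped = min(max(value, 0.0), max_value)
        row = min(int(clipped / max_value * height), height - 1)
        image[height - 1 - row][col] = 1  # row 0 is the top of the image
    return image


def cosine_similarity(img_a: List[List[int]], img_b: List[List[int]]) -> float:
    """Cosine similarity of two same-shaped images (stand-in for the similarity network)."""
    flat_a = [pixel for row in img_a for pixel in row]
    flat_b = [pixel for row in img_b for pixel in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = math.sqrt(sum(x * x for x in flat_a)) * math.sqrt(sum(y * y for y in flat_b))
    return dot / norm if norm else 0.0


# Hypothetical CPU usage samples (percent) for a reference node and a candidate node.
reference_image = to_image([10, 10, 10, 90, 90])
candidate_image = to_image([10, 10, 10, 10, 10])
fault_similarity = cosine_similarity(reference_image, candidate_image)
print(round(fault_similarity, 3))
```

A learned model would typically compare feature embeddings of the two images rather than raw pixels; the pixel-level comparison here only shows where the image similarity plugs in as the fault similarity.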
Optionally, selecting main faulty nodes in sequence from the multiple original faulty nodes along the extension direction of the original faulty node chain includes:
obtaining an initial reference main faulty node from the original faulty node chain;
based on the reference main faulty node, iteratively selecting main faulty nodes from the multiple original faulty nodes along the extension direction of the original faulty node chain, where each iteration includes the following steps:
if the reference main faulty node is not a fork faulty node, taking the reference main faulty node as a main faulty node, and taking the original faulty node adjacent to that main faulty node in the extension direction of the original faulty node chain as the new reference main faulty node;
if the reference main faulty node is a fork faulty node, taking the reference main faulty node as a main faulty node and, based on the causal probabilities between the fork faulty node and each of its corresponding child faulty nodes, selecting one child faulty node from the multiple child faulty nodes as the new reference main faulty node.
Optionally, selecting one child faulty node from the multiple child faulty nodes as the reference main faulty node, based on the causal probabilities between the fork faulty node and each of its corresponding child faulty nodes, includes:
performing the following step for each of the multiple child faulty nodes: determining the causal probability between the fork faulty node and the child faulty node based on the target resource anomaly information corresponding to each of them within a preset time period;
determining the maximum of the obtained causal probabilities, and taking the child faulty node that corresponds to the maximum causal probability as the reference main faulty node.
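The fork-handling rule above can be sketched as a walk along the chain that follows the only successor at ordinary nodes and the child with the highest causal probability at fork nodes. This is a minimal sketch under assumptions: the chain is represented as a directed, acyclic `children` mapping along the extension direction, and `causal_probability` is a caller-supplied function; both names, and the sample topology and probabilities, are hypothetical.

```python
from typing import Callable, Dict, List


def select_main_chain(
    children: Dict[str, List[str]],
    start: str,
    causal_probability: Callable[[str, str], float],
) -> List[str]:
    """Walk the original faulty node chain and collect the main faulty nodes."""
    main_chain: List[str] = []
    reference = start
    while True:
        # The current reference node always becomes a main faulty node.
        main_chain.append(reference)
        successors = children.get(reference, [])
        if not successors:
            return main_chain  # end of the chain
        if len(successors) == 1:
            reference = successors[0]  # not a fork: move to the adjacent node
        else:
            fork = reference
            # Fork: continue with the child that has the highest causal probability.
            reference = max(
                successors, key=lambda child: causal_probability(fork, child)
            )


# Hypothetical original faulty node chain: B -> D, D forks to E and F, F -> G.
children = {"B": ["D"], "D": ["E", "F"], "F": ["G"]}
probabilities = {("D", "E"): 0.9, ("D", "F"): 0.4}
main_chain = select_main_chain(
    children, "B", lambda a, b: probabilities.get((a, b), 0.0)
)
print(main_chain)
```

With these made-up probabilities the walk takes the D to E branch, so the F and G branch is left as a secondary chain.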
Optionally, determining the target root faulty node from the main faulty node chain based on the causal probabilities between the main faulty nodes in the main faulty node chain includes:
obtaining an initial reference root faulty node from the main faulty node chain;
based on the causal probabilities between the reference root faulty node and the other main faulty nodes in the main faulty node chain, iteratively updating the reference root faulty node until the iteration ends, and taking the reference root faulty node as the target root faulty node.
Optionally, each iteration includes the following steps:
obtaining one main faulty node from the other main faulty nodes;
determining the causal probability between the reference root faulty node and the one main faulty node based on the target resource anomaly information corresponding to the reference root faulty node and the target resource anomaly information corresponding to the one main faulty node;
if the causal probability is greater than a preset causal threshold, keeping the reference root faulty node unchanged;
otherwise, taking the one main faulty node as the reference root faulty node.
Optionally, the target resource anomaly information includes the time point at which the target resource anomaly occurred, the amplitude of the target resource anomaly, and the duration of the target resource anomaly.
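The root-cause iteration and the anomaly record can be sketched together as follows. This is an illustrative sketch, not the method's actual implementation: the `AnomalyInfo` field names, the default `threshold`, and in particular the toy `earlier_onset` probability function are assumptions, because this passage does not say how the causal probability is computed from the anomaly information.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AnomalyInfo:
    time_point: float  # when the resource anomaly started (e.g. seconds)
    amplitude: float   # size of the anomaly
    duration: float    # how long the anomaly lasted (e.g. seconds)


def find_root_cause(
    main_chain: List[str],
    anomalies: Dict[str, AnomalyInfo],
    causal_probability: Callable[[AnomalyInfo, AnomalyInfo], float],
    threshold: float = 0.5,  # hypothetical preset causal threshold
) -> str:
    """Iteratively update the reference root faulty node over the main chain."""
    reference = main_chain[0]
    for candidate in main_chain[1:]:
        probability = causal_probability(anomalies[reference], anomalies[candidate])
        # Above the threshold the reference stays; otherwise the candidate replaces it.
        if probability <= threshold:
            reference = candidate
    return reference


def earlier_onset(reference: AnomalyInfo, other: AnomalyInfo) -> float:
    """Toy rule for illustration only: an earlier anomaly is the likelier cause."""
    return 0.9 if reference.time_point <= other.time_point else 0.1


# Hypothetical anomaly records for main faulty nodes B, D, and E.
anomalies = {
    "B": AnomalyInfo(time_point=120.0, amplitude=0.4, duration=60.0),
    "D": AnomalyInfo(time_point=100.0, amplitude=0.7, duration=90.0),
    "E": AnomalyInfo(time_point=110.0, amplitude=0.5, duration=70.0),
}
root = find_root_cause(["B", "D", "E"], anomalies, earlier_onset)
print(root)
```

Under the toy rule, D replaces B as the reference because D's anomaly started earlier, and D then survives the comparison with E.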
In another aspect, an embodiment of the present application provides a fault detection apparatus, the apparatus including:
an original faulty node chain acquisition module, configured to obtain an original faulty node chain, where the original faulty node chain includes multiple original faulty nodes;
a main faulty node chain acquisition module, configured to select main faulty nodes in sequence from the multiple original faulty nodes along the extension direction of the original faulty node chain, and to determine a main faulty node chain based on the selected main faulty nodes;
a target determination module, configured to determine a target root faulty node from the main faulty node chain based on the causal probabilities between the main faulty nodes in the main faulty node chain.
Optionally, the original faulty node chain acquisition module is specifically configured to:
obtain an initial reference faulty node in a task node chain;
based on the reference faulty node and the fault similarity between adjacent task nodes in the task node chain, iteratively determine the original faulty node chain from the task node chain, where each iteration includes the following steps:
determining, from the task node chain, at least one candidate node adjacent to the reference faulty node;
determining the fault similarity between the reference faulty node and each of the at least one candidate node;
based on the obtained fault similarities, selecting at least one original faulty node from the at least one candidate node, and taking the at least one original faulty node as the reference faulty node.
Optionally, the original faulty node chain acquisition module is specifically configured to:
perform the following steps for each of the at least one candidate node:
obtaining reference node resource attribute information corresponding to the reference faulty node within a preset time period, and candidate node resource attribute information corresponding to the candidate node within the preset time period;
determining the fault similarity between the reference faulty node and the candidate node based on the reference node resource attribute information and the candidate node resource attribute information.
Optionally, the original faulty node chain acquisition module is specifically configured to:
determine a reference node resource attribute image based on the reference node resource attribute information;
determine a candidate node resource attribute image based on the candidate node resource attribute information;
use a similarity network model to determine the image similarity between the reference node resource attribute image and the candidate node resource attribute image;
take the image similarity as the fault similarity between the reference faulty node and the candidate node.
Optionally, the main faulty node chain acquisition module is specifically configured to:
obtain an initial reference main faulty node from the original faulty node chain;
based on the reference main faulty node, iteratively select main faulty nodes from the multiple original faulty nodes along the extension direction of the original faulty node chain, where each iteration includes the following steps:
if the reference main faulty node is not a fork faulty node, taking the reference main faulty node as a main faulty node, and taking the original faulty node adjacent to that main faulty node in the extension direction of the original faulty node chain as the new reference main faulty node;
if the reference main faulty node is a fork faulty node, taking the reference main faulty node as a main faulty node and, based on the causal probabilities between the fork faulty node and each of its corresponding child faulty nodes, selecting one child faulty node from the multiple child faulty nodes as the new reference main faulty node.
Optionally, the main faulty node chain acquisition module is specifically configured to:
perform the following step for each of the multiple child faulty nodes: determining the causal probability between the fork faulty node and the child faulty node based on the target resource anomaly information corresponding to each of them within a preset time period;
determine the maximum of the obtained causal probabilities, and take the child faulty node that corresponds to the maximum causal probability as the reference main faulty node.
Optionally, the target determination module is specifically configured to:
obtain an initial reference root faulty node from the main faulty node chain;
based on the causal probabilities between the reference root faulty node and the other main faulty nodes in the main faulty node chain, iteratively update the reference root faulty node until the iteration ends, and take the reference root faulty node as the target root faulty node.
Optionally, the target determination module is specifically configured so that:
each iteration includes the following steps:
obtaining one main faulty node from the other main faulty nodes;
determining the causal probability between the reference root faulty node and the one main faulty node based on the target resource anomaly information corresponding to the reference root faulty node and the target resource anomaly information corresponding to the one main faulty node;
if the causal probability is greater than a preset causal threshold, keeping the reference root faulty node unchanged;
otherwise, taking the one main faulty node as the reference root faulty node.
Optionally, the target resource anomaly information includes the time point at which the target resource anomaly occurred, the amplitude of the target resource anomaly, and the duration of the target resource anomaly.
In another aspect, an embodiment of the present application provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above fault detection method when executing the program.
In another aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program executable by a computer device, where the program, when run on the computer device, causes the computer device to perform the steps of the above fault detection method.
In the embodiments of the present application, because the original faulty node chain contains a relatively large number of original faulty nodes, selecting main faulty nodes from the multiple original faulty nodes and determining a main faulty node chain from the selected main faulty nodes effectively improves the efficiency of fault detection. The target root faulty node is then determined from the main faulty node chain using the causal probabilities between the main faulty nodes in that chain, which effectively improves the accuracy of fault detection.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a fault detection method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a task node chain provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for obtaining an original faulty node chain provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a task node chain provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of a method for determining fault similarity provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of a method for determining fault similarity provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a reference node resource attribute image provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a candidate node resource attribute image provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a similarity network model provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of CPU usage provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an original faulty node chain provided by an embodiment of the present application;
FIG. 13 is a schematic flowchart of a method for determining a reference root faulty node provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a main faulty node chain provided by an embodiment of the present application;
FIG. 15 is a schematic flowchart of a fault detection method provided by an embodiment of the present application;
FIG. 16 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a fault detection apparatus provided by an embodiment of the present application;
FIG. 18 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
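Detailed Description
To make the objectives, technical solutions, and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and not to limit it.
Referring to FIG. 1, which is a system architecture diagram to which the embodiments of the present application are applicable, the system architecture includes at least a terminal device 101, a fault detection system 102, and a business system 103. The fault detection system 102 may be independent of the business system 103 or built into the business system 103.
A target application for fault detection is installed on the terminal device 101. The application may be a pre-installed client, a web application, a mini program embedded in another application, or the like. The terminal device 101 may be a smartphone, a tablet computer, a notebook computer, a desktop computer, or the like, but is not limited thereto.
The fault detection system 102 is the back-end server of the target application and provides services for the target application. The fault detection system 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms.
The business system 103 includes multiple task nodes; when a task node fails, that task node is an original faulty node. The business system 103 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms.
The terminal device 101 and the fault detection system 102 may be connected directly or indirectly through wired or wireless communication, which is not limited in this application. The fault detection system 102 and the business system 103 may likewise be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
In response to a user's fault detection operation, the terminal device 101 sends a fault detection instruction to the fault detection system 102. The fault detection system 102 receives the fault detection instruction and obtains an original faulty node chain from the business system 103, where the original faulty node chain includes multiple original faulty nodes. The fault detection system 102 selects main faulty nodes in sequence from the multiple original faulty nodes along the extension direction of the original faulty node chain, and determines a main faulty node chain based on the selected main faulty nodes. Finally, based on the causal probabilities between the main faulty nodes in the main faulty node chain, it determines the target root faulty node from the main faulty node chain.
Based on the system architecture shown in FIG. 1, an embodiment of the present application provides a flow of a fault detection method, as shown in FIG. 2. The flow is executed by a computer device, which may be the fault detection system 102 shown in FIG. 1, and includes the following steps: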
Step S201: obtain the original faulty node chain.
Specifically, the original faulty node chain includes multiple original faulty nodes.
Whether a task node is an original faulty node can be determined from the resource attribute information corresponding to that task node. The resource attribute information includes CPU usage, memory usage, disk usage, QPS, and TPS.
For example, as shown in FIG. 3, the business system includes 8 task nodes: task node A, task node B, task node C, task node D, task node E, task node F, task node G, and task node H. Suppose that, by evaluating the resource attribute information of these 8 task nodes, it is determined that task nodes B, D, E, F, and G have failed. Task nodes B, D, E, F, and G are therefore original faulty nodes, and these 5 original faulty nodes form the original faulty node chain.
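One simple way to picture this check is a per-metric threshold test over each task node's resource attribute information. The sketch below is only an assumption about how such a judgment might look: the text does not state the decision rule, and the names `find_faulty_nodes` and `LIMITS`, the threshold values, and the sample metrics are all hypothetical. Only utilization-style metrics are used here, since whether an unusual QPS or TPS indicates a fault depends on the workload.

```python
from typing import Dict, List

# Hypothetical upper limits for utilization-style metrics (fractions of capacity).
LIMITS: Dict[str, float] = {"cpu": 0.90, "memory": 0.90, "disk": 0.95}


def find_faulty_nodes(
    metrics: Dict[str, Dict[str, float]],
    limits: Dict[str, float] = LIMITS,
) -> List[str]:
    """Return the task nodes with at least one metric above its limit."""
    return [
        node
        for node, values in metrics.items()
        if any(values.get(name, 0.0) > limit for name, limit in limits.items())
    ]


# Made-up resource attribute information for the 8 task nodes in the example.
metrics = {
    "A": {"cpu": 0.30, "memory": 0.40, "disk": 0.50},
    "B": {"cpu": 0.97, "memory": 0.60, "disk": 0.50},
    "C": {"cpu": 0.20, "memory": 0.35, "disk": 0.40},
    "D": {"cpu": 0.95, "memory": 0.92, "disk": 0.50},
    "E": {"cpu": 0.50, "memory": 0.96, "disk": 0.50},
    "F": {"cpu": 0.93, "memory": 0.50, "disk": 0.50},
    "G": {"cpu": 0.40, "memory": 0.50, "disk": 0.99},
    "H": {"cpu": 0.25, "memory": 0.30, "disk": 0.45},
}
print(find_faulty_nodes(metrics))
```

With these made-up values the function flags nodes B, D, E, F, and G, matching the set of original faulty nodes in the example above.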
步骤S202,沿原始故障节点链的延伸方向,依次从多个原始故障节点中选取主故障节点,并基于选取的各个主故障节点,确定主故障节点链。Step S202 , along the extension direction of the original faulty node chain, sequentially select main faulty nodes from a plurality of original faulty nodes, and determine the main faulty node chain based on each selected main faulty node.
具体地,原始故障节点链可以包括一个主故障节点链和至少一个从故障节点链,也可以只包括一个主故障节点链。Specifically, the original faulty node chain may include a master faulty node chain and at least one slave faulty node chain, or may only include a master faulty node chain.
原始故障节点链的延伸方向可以自左向右,也可以自右向左,也可以自上向下,还可以自下向上,还可以是其他任意方向。The extension direction of the original faulty node chain can be from left to right, from right to left, from top to bottom, from bottom to top, or in any other direction.
以原始故障节点链的延伸方向为自左向右举例来说,针对图3中的原始故障节点链,设定依次从5个原始故障节点中选取的主故障节点分别为任务节点B、任务节点D和任务节点E。以上3个主故障节点组成了主故障节点链。Taking the extension direction of the original faulty node chain as an example from left to right, for the original faulty node chain in Figure 3, set the main faulty nodes selected from the five original faulty nodes as task node B and task node respectively. D and task node E. The above three main fault nodes constitute the main fault node chain.
Step S203: based on the causal probabilities between the main faulty nodes in the main faulty node chain, determine the target root-cause faulty node from the main faulty node chain.
For example, as shown in FIG. 3, the main faulty nodes are task node B, task node D, and task node E. In one possible implementation, the causal probability between task nodes B and D, between task nodes B and E, and between task nodes D and E is calculated, the three resulting causal probabilities are compared, and the target root-cause faulty node is determined from the three main faulty nodes.
In the embodiments of the present application, because the original faulty node chain contains a relatively large number of original faulty nodes, selecting main faulty nodes from them and determining the main faulty node chain from the selected nodes reduces the number of nodes to examine and thus improves the efficiency of fault detection. Determining the target root-cause faulty node from the main faulty node chain by means of the causal probabilities between the main faulty nodes then improves the accuracy of fault detection.
Optionally, in step S201 above, acquiring the original faulty node chain includes the following steps shown in FIG. 4:
Step S401: acquire the initial reference faulty node in the task node chain.
In one possible implementation, when a task node alarm occurs in the business system, the task node that raised the alarm is used directly as the reference faulty node.
For example, as shown in FIG. 5, the business system includes five task nodes: task nodes A, B, C, D, and E. When a task node alarm occurs in the business system, suppose task node B raised the alarm; task node B is then used as the reference faulty node.
In another possible implementation, when a task monitoring alarm occurs in the business system, the task that the business system is currently executing is determined, and the first task node in the business system to execute that task is used as the reference faulty node.
For example, as shown in FIG. 5, the business system includes five task nodes: task nodes A, B, C, D, and E. When a task monitoring alarm occurs in the business system, the task currently being executed is determined to be a transaction data storage task, the first task node to execute that task is determined to be task node A, and task node A is used as the reference faulty node.
Step S402: based on the reference faulty node and the fault similarity between adjacent task nodes in the task node chain, iteratively determine the original faulty node chain from the task node chain.
Each iteration includes the following steps:
First, determine from the task node chain at least one candidate node adjacent to the reference faulty node. Then determine the fault similarity between the reference faulty node and each candidate node, select at least one original faulty node from the candidate nodes based on the obtained fault similarities, and use each selected original faulty node as a new reference faulty node.
Specifically, a fault similarity threshold is set, and each obtained fault similarity is evaluated as follows:
If the fault similarity between the reference faulty node and a candidate node is less than the fault similarity threshold, the candidate node is not an original faulty node; otherwise, the candidate node is an original faulty node.
The iteration stops under either of two conditions. First, no original faulty node can be selected from the candidate nodes. Second, the number of original faulty nodes determined exceeds a preset number threshold.
For example, as shown in FIG. 5, the business system includes five task nodes: task nodes A, B, C, D, and E. These five task nodes form the task node chain. Suppose task node B is the reference faulty node.
Three candidate nodes adjacent to the reference faulty node are determined: task nodes A, C, and D. Suppose the fault similarity between the reference faulty node and task node A is 0.3, that with task node C is 0.6, that with task node D is 0.7, and the fault similarity threshold is 0.4.
Because 0.3 is less than the fault similarity threshold of 0.4, task node A is not an original faulty node. Because 0.6 and 0.7 are both greater than the fault similarity threshold of 0.4, task node C and task node D are both original faulty nodes.
Task node C is then used as the reference faulty node. Because this reference faulty node has no adjacent candidate nodes, the iteration for task node C ends.
Task node D is then used as the reference faulty node. The candidate node adjacent to it is task node E. Suppose the fault similarity between the reference faulty node and task node E is 0.2. Because 0.2 is less than the fault similarity threshold of 0.4, task node E is not an original faulty node.
The original faulty nodes finally determined are task node B, task node C, and task node D, and these three original faulty nodes form the original faulty node chain.
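The iteration of step S402 can be sketched as a breadth-first expansion from the reference faulty node. This is only an illustrative sketch: the function name `build_original_fault_chain`, the `adjacency` layout (inferred from the text, since FIG. 5 is not available here), and the dictionary of precomputed similarities are assumptions, not part of the source.

```python
from collections import deque

def build_original_fault_chain(adjacency, similarity, start, threshold, max_nodes=None):
    """Grow the set of original faulty nodes outward from the reference node."""
    faulty = [start]
    visited = {start}        # nodes already accepted or already compared
    queue = deque([start])   # reference faulty nodes still to be expanded
    while queue:
        ref = queue.popleft()
        for cand in adjacency.get(ref, []):
            if cand in visited:
                continue
            visited.add(cand)
            # A candidate is rejected only if its similarity is below the threshold.
            if similarity[frozenset((ref, cand))] >= threshold:
                faulty.append(cand)
                queue.append(cand)
                # Second stop condition: too many original faulty nodes found.
                if max_nodes is not None and len(faulty) > max_nodes:
                    return faulty
    # First stop condition: no remaining candidate qualifies.
    return faulty

# Hypothetical adjacency matching the worked example in the text.
adjacency = {"B": ["A", "C", "D"], "C": [], "D": ["E"]}
similarity = {
    frozenset(("B", "A")): 0.3,
    frozenset(("B", "C")): 0.6,
    frozenset(("B", "D")): 0.7,
    frozenset(("D", "E")): 0.2,
}
chain = build_original_fault_chain(adjacency, similarity, "B", threshold=0.4)
print(chain)  # ['B', 'C', 'D']
```

In practice the similarity values would come from the similarity network model rather than a lookup table; the table only stands in for the numbers assumed in the example.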
In the embodiments of the present application, comparing the fault similarity between the reference faulty node and each adjacent candidate node, and determining at least one original faulty node from the candidates, effectively avoids missing original faulty nodes and helps improve the accuracy of fault detection. Moreover, because each fault similarity judgment involves only the reference faulty node and its adjacent candidate nodes, the judgment is simpler and its accuracy is improved.
Optionally, in step S402 above, to determine the fault similarity between the reference faulty node and each of the at least one candidate node, the following steps shown in FIG. 6 are performed for each candidate node:
Step S601: acquire the reference node resource attribute information of the reference faulty node within a preset time period, and the candidate node resource attribute information of one candidate node within the preset time period.
Specifically, the preset time period may be determined from the fault time point and a preset duration.
For example, if the fault time point is 10:05:00 and the preset duration is 2 minutes, the preset time period may be 10:05:00-10:07:00, 10:03:00-10:05:00, or 10:04:00-10:06:00.
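The three window placements in this example (after, before, and centered on the fault time point) can be computed as follows. The function name `preset_windows`, the placement labels, and the calendar date are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

def preset_windows(fault_time, duration):
    """Return the three window placements named in the example."""
    half = duration / 2
    return {
        "after": (fault_time, fault_time + duration),
        "before": (fault_time - duration, fault_time),
        "centered": (fault_time - half, fault_time + half),
    }

# The date is arbitrary; only the time of day comes from the example.
fault_time = datetime(2022, 1, 1, 10, 5, 0)
windows = preset_windows(fault_time, timedelta(minutes=2))
for name, (start, end) in windows.items():
    print(name, start.time(), end.time())
```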
The resource attribute information includes CPU usage, memory usage, disk usage, QPS, and TPS.
The reference node resource attribute information is the resource attribute information of the reference faulty node within the preset time period, and the candidate node resource attribute information is the resource attribute information of one candidate node within the preset time period.
Step S602: based on the reference node resource attribute information and the candidate node resource attribute information, determine the fault similarity between the reference faulty node and the candidate node.
Specifically, as shown in FIG. 7, determining the fault similarity between the reference faulty node and one candidate node includes the following steps:
Step S701: determine a reference node resource attribute image based on the reference node resource attribute information.
Step S702: determine a candidate node resource attribute image based on the candidate node resource attribute information.
For example, the reference faulty node is task node B and the candidate node is task node C. The reference node resource attribute information of the reference faulty node within the preset time period 10:00:00-10:02:00 is acquired, and the candidate node resource attribute information of the candidate node within the same preset time period is acquired.
The reference node resource attribute image determined from the reference node resource attribute information is shown in FIG. 8. The candidate node resource attribute image determined from the candidate node resource attribute information is shown in FIG. 9.
Step S703: use a similarity network model to determine the image similarity between the reference node resource attribute image and the candidate node resource attribute image.
Specifically, the similarity network model includes two feature extraction modules and one similarity judgment module. The two feature extraction modules are a first feature extraction module and a second feature extraction module, which are identical.
The reference node resource attribute image is input to the first feature extraction module to obtain reference image features. At the same time, the candidate node resource attribute image is input to the second feature extraction module to obtain candidate image features. The reference image features and the candidate image features are then input to the similarity judgment module to obtain the image similarity between the two images.
The first feature extraction module includes a plurality of different convolution modules and one flatten layer, and each convolution module includes at least one convolution layer and at least one downsampling layer. The second feature extraction module has the same structure. The similarity judgment module includes a feature difference layer, at least one fully connected layer, and a normalization layer. The output of the normalization layer is a value between 0 and 1.
For example, in the similarity network model shown in FIG. 10, the first feature extraction module includes three different convolution modules and one flatten layer, and the second feature extraction module likewise includes three convolution modules and one flatten layer. The similarity judgment module includes one feature difference layer, one fully connected layer, and one normalization layer.
The reference node resource attribute image is input to the first convolution module C1 of the first feature extraction module to obtain reference image feature f11. Feature f11 is input to the second convolution module C2 of the first feature extraction module to obtain reference image feature f12. Feature f12 is input to the third convolution module C3 of the first feature extraction module to obtain reference image feature f13. Finally, feature f13 is input to the flatten layer of the first feature extraction module to obtain reference image feature f14.
At the same time, the candidate node resource attribute image is input to the first convolution module C1 of the second feature extraction module to obtain candidate image feature f21. Feature f21 is input to the second convolution module C2 of the second feature extraction module to obtain candidate image feature f22. Feature f22 is input to the third convolution module C3 of the second feature extraction module to obtain candidate image feature f23. Finally, feature f23 is input to the flatten layer of the second feature extraction module to obtain candidate image feature f24.
Reference image feature f14 and candidate image feature f24 are input to the feature difference layer of the similarity judgment module to obtain image difference feature f3. Feature f3 is input to the fully connected layer to obtain image difference feature f4. Feature f4 is input to the normalization layer, which outputs the image similarity between the reference node resource attribute image and the candidate node resource attribute image.
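The similarity judgment module alone (feature difference layer, one fully connected layer, normalization layer) can be sketched in plain Python. The sketch omits the convolution branches that produce f14 and f24. The element-wise absolute difference, the single output unit, the sigmoid used as the normalization layer, and the weight values are all assumptions; the source only states that the output lies between 0 and 1.

```python
import math

def feature_difference(f_ref, f_cand):
    # Assumed form of the feature difference layer: element-wise absolute difference.
    return [abs(a - b) for a, b in zip(f_ref, f_cand)]

def fully_connected(x, weights, bias):
    # A single output unit: weighted sum of the inputs plus a bias.
    return sum(w * v for w, v in zip(weights, x)) + bias

def sigmoid(z):
    # Assumed normalization layer; maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def similarity_head(f14, f24, weights, bias):
    f3 = feature_difference(f14, f24)        # image difference feature f3
    f4 = fully_connected(f3, weights, bias)  # image difference feature f4
    return sigmoid(f4)                       # image similarity

# Hypothetical flattened features and hand-picked (untrained) weights.
weights, bias = [-4.0, -4.0, -4.0], 2.0
same = similarity_head([0.2, 0.8, 0.5], [0.2, 0.8, 0.5], weights, bias)
different = similarity_head([0.2, 0.8, 0.5], [0.9, 0.1, 0.5], weights, bias)
print(round(same, 3), round(different, 3))
```

With negative weights on the difference features, identical inputs score highest and the score falls as the features diverge. In a trained model these weights would be learned together with the two identical feature extraction modules.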
Step S704: use the image similarity as the fault similarity between the reference faulty node and the candidate node.
In the embodiments of the present application, after the reference node resource attribute information of the reference faulty node within the preset time period and the candidate node resource attribute information of one candidate node within the preset time period are acquired, the reference node resource attribute image and the candidate node resource attribute image are determined. The similarity network model then determines the image similarity between the two images, and this image similarity is used as the fault similarity between the reference faulty node and the candidate node. Judging fault similarity directly from the reference node resource attribute information and the candidate node resource attribute information relies heavily on the prior knowledge of operations and maintenance personnel, which makes the judgment inaccurate and inefficient. Converting both kinds of information into images and judging fault similarity with the similarity network model improves the accuracy of the judgment and makes it more practical.
Because fault similarity judgment based on the similarity network model places no restrictions on the resource attribute information, it is applicable to a wider range of scenarios. Adding resource attribute information does not increase cost, so the fault detection system can be rolled out quickly and at low cost.
Optionally, in step S202 above, selecting main faulty nodes in sequence from the plurality of original faulty nodes along the extension direction of the original faulty node chain includes the following steps:
Acquire an initial reference main faulty node from the original faulty node chain. Then, based on the reference main faulty node, iteratively select main faulty nodes from the plurality of original faulty nodes along the extension direction of the original faulty node chain.
Each iteration includes the following steps:
If the reference main faulty node is not a bifurcation faulty node, the reference main faulty node is taken as a main faulty node, and the original faulty node adjacent to it in the extension direction of the original faulty node chain becomes the new reference main faulty node.
If the reference main faulty node is a bifurcation faulty node, the reference main faulty node is taken as a main faulty node, and one child faulty node is selected from its plurality of child faulty nodes as the new reference main faulty node, based on the causal probabilities between the bifurcation faulty node and each of its child faulty nodes.
Specifically, the following step is performed for each of the child faulty nodes: based on the target resource anomaly information of the bifurcation faulty node and of the child faulty node within the preset time period, determine the causal probability between the bifurcation faulty node and that child faulty node.
Finally, the maximum of the obtained causal probabilities is determined, and the child faulty node corresponding to the maximum causal probability is used as the reference main faulty node.
The target resource anomaly information includes the target resource anomaly time point, the target resource anomaly amplitude, and the target resource anomaly duration. The resource attribute information includes CPU usage, memory usage, disk usage, QPS, and TPS.
The target resource anomaly information may be determined in either of the following two ways:
In a first possible implementation, any one of the resource attribute information items is selected as the target resource attribute information, and the target resource anomaly information is then determined from the target resource attribute information.
In a second possible implementation, the resource anomaly amplitude corresponding to each resource attribute information item is determined, the item with the largest resource anomaly amplitude is taken as the target resource attribute information, and the target resource anomaly information is then determined from the target resource attribute information.
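The second implementation amounts to an argmax over per-attribute anomaly amplitudes. In the sketch below, the amplitude is taken as the peak value minus the first sample in the window, which is an assumption; the function names, metric names, and sample values are hypothetical.

```python
def anomaly_amplitude(series):
    # Assumed definition: peak value minus the value at the start of the window.
    return max(series) - series[0]

def select_target_attribute(metrics):
    """Pick the resource attribute whose anomaly amplitude is largest."""
    amplitudes = {name: anomaly_amplitude(values) for name, values in metrics.items()}
    target = max(amplitudes, key=amplitudes.get)
    return target, amplitudes[target]

# Hypothetical samples (in percent) collected over the preset time period.
metrics = {
    "cpu_usage": [40, 42, 100, 100],
    "memory_usage": [55, 56, 60, 61],
    "disk_usage": [30, 30, 31, 31],
}
target, amplitude = select_target_attribute(metrics)
print(target, amplitude)  # cpu_usage 60
```

Attributes measured in different units, such as QPS and a usage percentage, would need some normalization before their amplitudes can be compared; the source does not say how this is done.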
Specifically, the causal probability between the bifurcation faulty node and one child faulty node is expressed by Formula (1):
[Formula (1) is given as an image in the original publication and is not reproduced in this text.]
Here, Prob denotes the causal probability between the bifurcation faulty node and one child faulty node; t1, A1, and D1 denote the target resource anomaly time point, anomaly amplitude, and anomaly duration of the bifurcation faulty node; and t2, A2, and D2 denote the target resource anomaly time point, anomaly amplitude, and anomaly duration of the child faulty node.
For example, suppose the target resource attribute information is CPU usage and the preset time period is 10:00:00-10:02:00. FIG. 11 shows the CPU usage of the bifurcation faulty node, of child faulty node 1, and of child faulty node 2 within the preset time period.
For the bifurcation faulty node, the CPU usage anomaly time point is 10:00:12, the CPU usage anomaly amplitude is 100% minus 40%, that is, 60%, and the CPU usage anomaly duration is 10:02:00 minus 10:00:12, that is, 108 s.
For child faulty node 1, the CPU usage anomaly time point is 10:00:15, the CPU usage anomaly amplitude is 85% minus 25%, that is, 60%, and the CPU usage anomaly duration is 10:02:00 minus 10:00:15, that is, 105 s.
For child faulty node 2, the CPU usage anomaly time point is 10:00:15, the CPU usage anomaly amplitude is 50% minus 30%, that is, 20%, and the CPU usage anomaly duration is 10:02:00 minus 10:00:15, that is, 105 s.
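The arithmetic in these three cases can be checked with a small helper. The function `anomaly_info` and its argument names are hypothetical; the anomaly is assumed to last from its onset until the end of the preset time period, as in the example.

```python
from datetime import datetime

def anomaly_info(onset, window_end, baseline, peak):
    """Return anomaly time point, amplitude (percentage points) and duration (seconds)."""
    fmt = "%H:%M:%S"
    duration = datetime.strptime(window_end, fmt) - datetime.strptime(onset, fmt)
    return {
        "time_point": onset,
        "amplitude": peak - baseline,
        "duration_s": int(duration.total_seconds()),
    }

fork = anomaly_info("10:00:12", "10:02:00", baseline=40, peak=100)
child1 = anomaly_info("10:00:15", "10:02:00", baseline=25, peak=85)
child2 = anomaly_info("10:00:15", "10:02:00", baseline=30, peak=50)
for name, info in (("fork", fork), ("child 1", child1), ("child 2", child2)):
    print(name, info)
```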
According to Formula (1), the causal probability between the bifurcation faulty node and child faulty node 1 is 0.0162.
According to Formula (1), the causal probability between the bifurcation faulty node and child faulty node 2 is 0.0054.
Because the causal probability of 0.0162 between the bifurcation faulty node and child faulty node 1 is greater than the causal probability of 0.0054 between the bifurcation faulty node and child faulty node 2, child faulty node 1 is used as the reference main faulty node.
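The selection of main faulty nodes in step S202 above, performed in sequence along the extension direction of the original faulty node chain, is illustrated by the following example. As shown in FIG. 12, the original faulty node chain includes five original faulty nodes: original faulty nodes 1, 2, 3, 4, and 5.
Suppose the extension direction of the original faulty node chain is left to right, and original faulty node 1 is acquired from the chain as the initial reference main faulty node. Because this reference main faulty node is not a bifurcation faulty node, it is taken as a main faulty node, and original faulty node 2 becomes the reference main faulty node.
When original faulty node 2 is the reference main faulty node, it is a bifurcation faulty node. It is taken as a main faulty node, and its two child faulty nodes are determined to be original faulty node 3 and original faulty node 4. The causal probability between the reference main faulty node and original faulty node 3 is prob23, and that between the reference main faulty node and original faulty node 4 is prob24. Supposing prob23 is less than prob24, original faulty node 4 becomes the reference main faulty node.
When original faulty node 4 is the reference main faulty node, it is not a bifurcation faulty node. It is therefore taken as a main faulty node, and original faulty node 5 becomes the reference main faulty node.
When original faulty node 5 is the reference main faulty node, it is not a bifurcation faulty node. It is therefore taken as a main faulty node, and the iteration ends.
In the end, original faulty nodes 1, 2, 4, and 5 are all main faulty nodes and form the main faulty node chain.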
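The walk through FIG. 12 can be sketched as a loop that follows the single neighbor at ordinary nodes and the most probable child at bifurcation nodes. The `children` mapping, the function name `select_main_chain`, and the probability values are assumptions; only the relation prob23 < prob24 comes from the text, and the causal probability is passed in as a function because Formula (1) is not available here.

```python
def select_main_chain(children, start, causal_probability):
    """Walk the original faulty node chain and collect the main faulty nodes."""
    chain = []
    ref = start
    while ref is not None:
        chain.append(ref)  # every reference main faulty node becomes a main faulty node
        successors = children.get(ref, [])
        if not successors:
            ref = None                 # end of the chain: stop iterating
        elif len(successors) == 1:
            ref = successors[0]        # not a bifurcation: follow the adjacent node
        else:
            # Bifurcation: follow the child with the largest causal probability.
            current = ref
            ref = max(successors, key=lambda child: causal_probability(current, child))
    return chain

# Hypothetical topology for FIG. 12, extension direction left to right.
children = {1: [2], 2: [3, 4], 4: [5]}
# Hypothetical values chosen only so that prob23 < prob24.
probabilities = {(2, 3): 0.2, (2, 4): 0.5}
main_chain = select_main_chain(children, 1, lambda a, b: probabilities[(a, b)])
print(main_chain)  # [1, 2, 4, 5]
```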
In the embodiments of the present application, because the original faulty node chain contains a relatively large number of original faulty nodes, selecting main faulty nodes from them and determining the main faulty node chain from the selected nodes effectively improves the efficiency of fault detection.
When the reference main faulty node is a bifurcation faulty node, the causal probabilities between the bifurcation faulty node and each of its child faulty nodes are determined. A larger causal probability indicates stronger causality between the bifurcation faulty node and the child faulty node. Selecting the child faulty node with the maximum causal probability as the reference main faulty node therefore filters out child faulty nodes with weak causality, which ensures the accuracy of the selected main faulty nodes and hence of the main faulty node chain.
Optionally, in step S203 above, determining the target root-cause faulty node from the main faulty node chain based on the causal probabilities between the main faulty nodes includes the following steps:
Acquire an initial reference root-cause faulty node from the main faulty node chain. Then, based on the causal probabilities between the reference root-cause faulty node and the other main faulty nodes in the main faulty node chain, iteratively update the reference root-cause faulty node. When the iteration ends, the reference root-cause faulty node is used as the target root-cause faulty node.
Each iteration includes the following steps shown in FIG. 13:
Step S1301: acquire one main faulty node from the other main faulty nodes.
Step S1302: based on the target resource anomaly information of the reference root-cause faulty node and the target resource anomaly information of the main faulty node, determine the causal probability between the reference root-cause faulty node and the main faulty node.
Specifically, the target resource anomaly information includes the target resource anomaly time point, the target resource anomaly amplitude, and the target resource anomaly duration. The target resource anomaly information is determined in the same way as described above.
The causal probability between the reference root-cause faulty node and one main faulty node is given by Formula (1), where Prob denotes the causal probability between the reference root-cause faulty node and the main faulty node; t1, A1, and D1 denote the target resource anomaly time point, anomaly amplitude, and anomaly duration of the reference root-cause faulty node; and t2, A2, and D2 denote the target resource anomaly time point, anomaly amplitude, and anomaly duration of the main faulty node.
Step S1303: if the causal probability is greater than the preset causal probability threshold, perform step S1304; otherwise, perform step S1305.
Step S1304: keep the reference root-cause faulty node unchanged.
Step S1305: use the main faulty node as the reference root-cause faulty node.
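For example, as shown in FIG. 14, the main faulty node chain includes main faulty nodes 1, 2, 3, and 4. Main faulty node 1 is acquired from the main faulty node chain as the initial reference root-cause faulty node, and the other main faulty nodes are main faulty nodes 2, 3, and 4. Suppose the causal probability threshold is 0.45.
With main faulty node 1 as the reference root-cause faulty node, one main faulty node, main faulty node 2, is acquired from the other main faulty nodes. Suppose the causal probability between the reference root-cause faulty node and main faulty node 2 is 0.3. Because 0.3 is less than the causal probability threshold of 0.45, main faulty node 2 becomes the reference root-cause faulty node.
With main faulty node 2 as the reference root-cause faulty node, one main faulty node, main faulty node 3, is acquired from the other main faulty nodes. Suppose the causal probability between the reference root-cause faulty node and main faulty node 3 is 0.7. Because 0.7 is greater than the causal probability threshold of 0.45, the reference root-cause faulty node remains unchanged.
With main faulty node 2 still the reference root-cause faulty node, one main faulty node, main faulty node 4, is acquired from the other main faulty nodes. Suppose the causal probability between the reference root-cause faulty node and main faulty node 4 is 0.6. Because 0.6 is greater than the causal probability threshold of 0.45, the reference root-cause faulty node remains unchanged.
In the end, the reference root-cause faulty node, that is, main faulty node 2, is determined to be the target root-cause faulty node.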
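Steps S1301 to S1305 reduce to a single pass over the main faulty node chain. The sketch below reproduces the FIG. 14 example; the function name `find_root_cause` and the lookup table of causal probabilities are assumptions standing in for values that Formula (1) would supply.

```python
def find_root_cause(main_chain, causal_probability, threshold):
    """Iteratively update the reference root-cause faulty node (steps S1301 to S1305)."""
    reference = main_chain[0]          # initial reference root-cause faulty node
    for node in main_chain[1:]:        # S1301: take the next remaining main faulty node
        prob = causal_probability(reference, node)  # S1302
        if prob > threshold:           # S1303
            continue                   # S1304: the reference stays unchanged
        reference = node               # S1305: this node becomes the reference
    return reference

# Causal probabilities assumed in the worked example.
probabilities = {(1, 2): 0.3, (2, 3): 0.7, (2, 4): 0.6}
root = find_root_cause([1, 2, 3, 4], lambda a, b: probabilities[(a, b)], threshold=0.45)
print(root)  # 2
```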
In the embodiments of the present application, one main faulty node is selected from the other main faulty nodes at a time, and only the causal probability between the reference root-cause faulty node and that main faulty node is evaluated, which effectively improves the efficiency of fault detection. Once the reference root-cause faulty node has been compared with every one of the other main faulty nodes, the target root-cause faulty node is determined, which improves the accuracy of fault detection.
To better explain the embodiments of the present application, a fault detection method provided by the embodiments is described below with reference to a specific implementation scenario. As shown in FIG. 15, the method includes the following steps:
Step S1501: acquire the initial reference faulty node in the task node chain.
Step S1502: based on the reference faulty node and the fault similarity between adjacent task nodes in the task node chain, iteratively determine the original faulty node chain from the task node chain.
Step S1503: acquire the initial reference main faulty node from the original faulty node chain.
Step S1504: based on the reference main faulty node, iteratively select main faulty nodes from the plurality of original faulty nodes along the extension direction of the original faulty node chain.
Step S1505: determine the main faulty node chain based on the selected main faulty nodes.
Step S1506: acquire the initial reference root-cause faulty node from the main faulty node chain, and determine the other main faulty nodes.
Step S1507: determine whether the set of other main faulty nodes is empty; if so, perform step S1513; otherwise, perform step S1508.
Step S1508: acquire one main faulty node from the other main faulty nodes.
Step S1509: based on the target resource anomaly information of the reference root-cause faulty node and the target resource anomaly information of the main faulty node, determine the causal probability between the reference root-cause faulty node and the main faulty node.
Step S1510: determine whether the causal probability is greater than the preset causal probability threshold; if so, perform step S1511; otherwise, perform step S1512.
Step S1511: keep the reference root-cause faulty node unchanged.
Step S1512: use the main faulty node as the reference root-cause faulty node.
Step S1513: use the reference root-cause faulty node as the target root-cause faulty node.
In this embodiment of the application, the original fault node chain contains a relatively large number of original fault nodes. Selecting main fault nodes from these original fault nodes and determining the main fault node chain from the selected nodes reduces the number of nodes to examine and so improves fault detection efficiency. Determining the target root fault node from the causal probabilities between the reference root fault node and the other main fault nodes preserves fault detection accuracy while improving efficiency.
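The loop formed by steps S1506 to S1513 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name, the `causal_probability` callback, the default threshold, and the choice of the first chain element as the initial reference root fault node are all assumptions, and the sketch assumes that steps S1511 and S1512 both return to step S1507.

```python
from typing import Callable, Sequence


def select_root_fault_node(
    main_chain: Sequence[str],
    causal_probability: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical preset causal threshold
) -> str:
    """Pick the target root fault node from a main fault node chain."""
    if not main_chain:
        raise ValueError("main fault node chain must not be empty")

    # S1506: initial reference root (assumed: first node) and the other nodes.
    reference = main_chain[0]
    others = list(main_chain[1:])

    # S1507: keep iterating until no other main fault node is left.
    while others:
        candidate = others.pop(0)  # S1508: take one main fault node
        probability = causal_probability(reference, candidate)  # S1509
        if probability > threshold:  # S1510
            continue  # S1511: reference root stays unchanged
        reference = candidate  # S1512: candidate becomes the reference root

    return reference  # S1513: final reference is the target root fault node
```

Each main fault node is compared against the current reference exactly once, so the number of causal-probability evaluations grows linearly with the length of the main fault node chain.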
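To better illustrate the embodiments of this application, a fault detection method provided by the embodiments is described below in a specific implementation scenario. The method is executed by the fault detection system 102 in FIG. 1. As shown in FIG. 16, the fault detection system includes a task node chain acquisition module 1601, a monitoring module 1602, an algorithm module 1603, a fault handling module 1604, and a feedback module 1605.
The task node chain acquisition module 1601 obtains the task node chain in the task system 103.
The monitoring module 1602 obtains the resource attribute information of each task node in the task node chain, including CPU usage, memory usage, disk usage, QPS, and TPS.
The algorithm module 1603 receives the task node chain sent by the task node chain acquisition module 1601 and the resource attribute information of each task node sent by the monitoring module 1602, and determines the target root fault node based on the task node chain and that resource attribute information.
The fault handling module 1604 acts on the target root fault node determined by the algorithm module 1603, for example by restarting a process, restarting an application, or restarting a virtual machine.
The feedback module 1605 records and analyzes the target root fault node determined by the algorithm module 1603, and periodically updates the similarity network model in the algorithm module.
In this embodiment of the application, the task node chain acquisition module 1601 obtains and updates the task node chain in real time without manual intervention. The feedback module 1605 records and analyzes every fault detection result and periodically updates the similarity network model, which helps maintain fault detection accuracy.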
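The data flow between modules 1601 to 1605 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the class name, the callback signatures, and the representation of the task node chain as an adjacency dictionary are hypothetical, and the periodic model update performed by the feedback module is reduced here to recording each result.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Chain = Dict[str, List[str]]      # task node -> adjacent task nodes (assumed shape)
Metrics = Dict[str, List[float]]  # e.g. {"cpu": [...], "qps": [...]}


@dataclass
class FaultDetectionPipeline:
    get_chain: Callable[[], Chain]                         # role of module 1601
    get_metrics: Callable[[str], Metrics]                  # role of module 1602
    find_root: Callable[[Chain, Dict[str, Metrics]], str]  # role of module 1603
    remediate: Callable[[str], None]                       # role of module 1604
    history: List[str] = field(default_factory=list)       # record kept for module 1605

    def run_once(self) -> str:
        """Run one detection pass and return the target root fault node."""
        chain = self.get_chain()
        metrics = {node: self.get_metrics(node) for node in chain}
        root = self.find_root(chain, metrics)
        self.remediate(root)       # e.g. restart a process, application, or VM
        self.history.append(root)  # kept for later analysis and model updates
        return root
```

Passing the modules in as callables keeps the sketch independent of how the task node chain and the metrics are actually collected.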
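Based on the same technical concept, an embodiment of this application provides a fault detection apparatus. As shown in FIG. 17, the fault detection apparatus 1700 includes:
an original fault node chain acquisition module 1701, configured to obtain an original fault node chain, the original fault node chain including multiple original fault nodes;
a main fault node chain acquisition module 1702, configured to select main fault nodes in turn from the multiple original fault nodes along the extension direction of the original fault node chain, and to determine a main fault node chain based on the selected main fault nodes;
a target determination module 1703, configured to determine a target root fault node from the main fault node chain based on the causal probabilities between the main fault nodes in the main fault node chain.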
Optionally, the original fault node chain acquisition module 1701 is specifically configured to:
obtain the initial baseline fault node in the task node chain;
based on the baseline fault node and the fault similarity between adjacent task nodes in the task node chain, iteratively determine the original fault node chain from the task node chain, where each iteration includes the following steps:
determining, from the task node chain, at least one candidate node adjacent to the baseline fault node;
determining the fault similarity between the baseline fault node and each of the at least one candidate node;
based on the obtained fault similarities, selecting at least one original fault node from the at least one candidate node, and taking the at least one original fault node as the baseline fault node.
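The iteration above can be sketched as a frontier expansion over the task node chain. This is a minimal sketch under assumptions: the adjacency-dictionary representation, the `similarity` callback, and the rule of keeping candidates whose similarity reaches a threshold are hypothetical, since the text only says that original fault nodes are selected "based on" the similarities.

```python
from typing import Callable, Dict, List


def build_original_fault_chain(
    neighbors: Dict[str, List[str]],
    start: str,
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,  # hypothetical selection threshold
) -> List[str]:
    """Grow the original fault node chain outward from the baseline fault node."""
    chain = [start]
    selected = {start}
    frontier = [start]  # current baseline fault nodes

    while frontier:
        next_frontier = []
        for baseline in frontier:
            # Candidate nodes are the task nodes adjacent to the baseline node.
            for candidate in neighbors.get(baseline, []):
                if candidate in selected:
                    continue
                if similarity(baseline, candidate) >= threshold:
                    selected.add(candidate)
                    chain.append(candidate)
                    next_frontier.append(candidate)
        # Newly selected original fault nodes become the next baseline nodes.
        frontier = next_frontier

    return chain
```

The expansion stops as soon as an iteration selects no new node, so task nodes that are not reachable through sufficiently similar neighbors are never evaluated.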
Optionally, the original fault node chain acquisition module 1701 is specifically configured to:
perform the following steps for each of the at least one candidate node:
obtaining the baseline node resource attribute information corresponding to the baseline fault node within a preset time period, and the candidate node resource attribute information corresponding to one candidate node within the preset time period;
determining the fault similarity between the baseline fault node and the one candidate node based on the baseline node resource attribute information and the candidate node resource attribute information.
Optionally, the original fault node chain acquisition module 1701 is specifically configured to:
determine a baseline node resource attribute image based on the baseline node resource attribute information;
determine a candidate node resource attribute image based on the candidate node resource attribute information;
use a similarity network model to determine the image similarity between the baseline node resource attribute image and the candidate node resource attribute image;
take the image similarity as the fault similarity between the baseline fault node and the one candidate node.
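The image-based comparison can be illustrated with a deliberately simplified stand-in. In the sketch below, the "image" is a 2D grid with one row per metric and one column per sample, and cosine similarity replaces the similarity network model, which this section does not describe further. The function names, the min-max normalization, and the grid layout are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Image = List[List[float]]


def to_attribute_image(series: Dict[str, Sequence[float]], metric_order: Sequence[str]) -> Image:
    """Turn per-metric time series into a grid of values scaled to [0, 1]."""
    image = []
    for name in metric_order:
        values = series[name]
        low, high = min(values), max(values)
        span = high - low
        image.append([(v - low) / span if span else 0.0 for v in values])
    return image


def image_similarity(image_a: Image, image_b: Image) -> float:
    """Cosine similarity of flattened pixels (stand-in for the network model)."""
    a = [p for row in image_a for p in row]
    b = [p for row in image_b for p in row]
    if len(a) != len(b):
        raise ValueError("images must have the same shape")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Because each metric row is normalized separately, two nodes whose metrics rise and fall together score as similar even when their absolute values differ.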
Optionally, the main fault node chain acquisition module 1702 is specifically configured to:
obtain the initial reference main fault node from the original fault node chain;
starting from the reference main fault node and moving along the extension direction of the original fault node chain, iteratively select main fault nodes from the multiple original fault nodes, where each iteration includes the following steps:
if the reference main fault node is not a bifurcation fault node, taking the reference main fault node as a main fault node, and taking the original fault node adjacent to that main fault node in the extension direction of the original fault node chain as the reference main fault node;
if the reference main fault node is a bifurcation fault node, taking the reference main fault node as a main fault node, and selecting one sub-fault node from the corresponding multiple sub-fault nodes as the reference main fault node, based on the causal probability between the bifurcation fault node and each of those sub-fault nodes.
Optionally, the main fault node chain acquisition module 1702 is specifically configured to:
perform the following step for each of the multiple sub-fault nodes: determining the causal probability between the bifurcation fault node and one sub-fault node, based on the target resource anomaly information corresponding to each of them within a preset time period;
determine the maximum causal probability among the obtained causal probabilities, and take the sub-fault node corresponding to the maximum causal probability as the reference main fault node.
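The selection of main fault nodes, including the handling of bifurcation fault nodes, can be sketched as a walk along the original fault node chain. This is an illustrative sketch: the `children` dictionary (each original fault node mapped to its successors in the extension direction), the `causal_probability` callback, and the function name are assumptions.

```python
from typing import Callable, Dict, List, Optional


def build_main_fault_chain(
    children: Dict[str, List[str]],
    start: str,
    causal_probability: Callable[[str, str], float],
) -> List[str]:
    """Walk the original fault node chain and keep one branch at each fork."""
    main_chain: List[str] = []
    visited = set()
    reference: Optional[str] = start  # initial reference main fault node

    while reference is not None and reference not in visited:
        visited.add(reference)
        main_chain.append(reference)  # the reference becomes a main fault node
        subs = children.get(reference, [])
        if not subs:
            reference = None          # end of the chain
        elif len(subs) == 1:
            reference = subs[0]       # not a bifurcation: move to the adjacent node
        else:
            fork = reference          # bifurcation: follow the most probable sub-node
            reference = max(subs, key=lambda sub: causal_probability(fork, sub))

    return main_chain
```

At each bifurcation only one branch is followed, so the resulting main fault node chain is a single path even when the original fault node chain branches.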
Optionally, the target determination module 1703 is specifically configured to:
obtain the initial reference root fault node from the main fault node chain;
iteratively update the reference root fault node based on the causal probabilities between the reference root fault node and the other main fault nodes in the main fault node chain, and, when the iteration ends, take the reference root fault node as the target root fault node.
Optionally, the target determination module 1703 is specifically configured such that:
each iteration includes the following steps:
taking one main fault node from the other main fault nodes;
determining the causal probability between the reference root fault node and the one main fault node, based on the target resource anomaly information corresponding to the reference root fault node and the target resource anomaly information corresponding to the one main fault node;
if the causal probability is greater than the preset causal threshold, keeping the reference root fault node unchanged;
otherwise, taking the one main fault node as the reference root fault node.
Optionally, the target resource anomaly information includes the time point at which the target resource became abnormal, the amplitude of the target resource anomaly, and the duration of the target resource anomaly.
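The three fields of the target resource anomaly information map naturally onto a small record type. The sketch below also includes a toy scoring function to show how such a record could feed a causal probability. The scoring formula is purely an illustrative placeholder, not the formula of this application: it assumes that a cause tends to become abnormal earlier than its effect, that the two anomaly windows overlap, and that their amplitudes are comparable. All names and the `time_scale` parameter are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceAnomaly:
    start_time: float  # anomaly time point, in seconds
    amplitude: float   # anomaly amplitude, e.g. deviation from a baseline
    duration: float    # anomaly duration, in seconds


def toy_causal_probability(
    cause: ResourceAnomaly, effect: ResourceAnomaly, time_scale: float = 60.0
) -> float:
    """Placeholder score in [0, 1]; higher means `cause` more plausibly led to `effect`."""
    # Earlier onset of the cause raises the score (logistic of the lead time).
    lead = (effect.start_time - cause.start_time) / time_scale
    lead = max(-50.0, min(50.0, lead))  # clamp to keep exp() well within range
    time_term = 1.0 / (1.0 + math.exp(-lead))

    # Fraction of the effect's anomaly window covered by the cause's window.
    overlap_start = max(cause.start_time, effect.start_time)
    overlap_end = min(cause.start_time + cause.duration, effect.start_time + effect.duration)
    overlap = max(0.0, overlap_end - overlap_start)
    overlap_term = overlap / effect.duration if effect.duration > 0 else 0.0

    # Similar anomaly magnitudes raise the score.
    high = max(abs(cause.amplitude), abs(effect.amplitude))
    low = min(abs(cause.amplitude), abs(effect.amplitude))
    amplitude_term = low / high if high > 0 else 0.0

    return time_term * (overlap_term + amplitude_term) / 2.0
```

Any real scoring function would need to be calibrated against recorded incidents; the point here is only the shape of the inputs and the fact that the score is directional, so swapping `cause` and `effect` generally changes the result.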
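Based on the same technical concept, an embodiment of this application provides a computer device, which may be a terminal or a server. As shown in FIG. 18, it includes at least one processor 1801 and a memory 1802 connected to the at least one processor. The embodiments of this application do not limit the specific connection medium between the processor 1801 and the memory 1802; in FIG. 18, the processor 1801 and the memory 1802 are connected by a bus as an example. The bus may be divided into an address bus, a data bus, a control bus, and so on.
In this embodiment of the application, the memory 1802 stores instructions executable by the at least one processor 1801, and the at least one processor 1801 can perform the steps of the fault detection method described above by executing the instructions stored in the memory 1802.
The processor 1801 is the control center of the computer device. It can connect the various parts of the computer device through various interfaces and lines, and it performs fault detection by running or executing the instructions stored in the memory 1802 and calling the data stored in the memory 1802. Optionally, the processor 1801 may include one or more processing units, and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. The modem processor may also be left out of the processor 1801. In some embodiments, the processor 1801 and the memory 1802 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 1801 may be a general-purpose processor, for example a central processing unit (CPU), or a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 1802, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1802 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 1802 may also be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1802 in the embodiments of this application may also be a circuit or any other apparatus capable of providing a storage function, used to store program instructions and/or data.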
Based on the same inventive concept, an embodiment of this application provides a computer-readable storage medium storing a computer program executable by a computer device. When the program runs on the computer device, it causes the computer device to perform the steps of the fault detection method described above.
Those skilled in the art will appreciate that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to this application without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of this application and their technical equivalents, this application is intended to cover them as well.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210091905.9A CN114490157A (en) | 2022-01-26 | 2022-01-26 | A fault detection method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210091905.9A CN114490157A (en) | 2022-01-26 | 2022-01-26 | A fault detection method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114490157A true CN114490157A (en) | 2022-05-13 |
Family
ID=81473756
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210091905.9A Pending CN114490157A (en) | 2022-01-26 | 2022-01-26 | A fault detection method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114490157A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||