This application claims priority to Chinese patent application No. 202310729844.9, entitled "Method for Locating Link Failure" and filed with the China National Intellectual Property Administration on the 16th 2023, which is incorporated herein by reference in its entirety.
Disclosure of Invention
The application provides a link fault positioning method, which utilizes a link fault positioning model or a link fault positioning rule base constructed by a fault feature sequence marked with a fault FRU to infer the FRU corresponding to the fault feature sequence of the current fault, thereby realizing the link fault positioning of FRU level and improving the positioning accuracy. Moreover, the method is not limited to positioning of the link fault of a specific type, can realize universal link fault positioning, and improves the coverage rate of the link fault positioning. The application also provides a system, a computing device cluster, a computer readable storage medium and a computer program product corresponding to the method.
In a first aspect, the present application provides a link fault location method. The method may be performed by a link fault location system. The link fault location system may be a software system, which may be deployed online or offline. When deployed online, the software system may run on the computing system itself, for example on a baseboard management controller (also referred to as an out-of-band management system) of the computing system. When deployed offline, the software system may run as an offline tool on a third-party computing device cluster, such as a public cloud, edge computing devices, or end-side computing devices. The link fault location system may also be a hardware system, such as a baseboard management controller in a computing system or a computing device cluster. When running, a hardware system such as the baseboard management controller or the computing device cluster may execute the link fault location method.
Specifically, the link fault location system may obtain a fault log of a hardware system in the computing system to be detected, and obtain current data of fault monitoring parameters of the hardware system. The fault log records historical data of the fault monitoring parameters of the hardware system, abnormal events, and replacement records of field replaceable units (FRUs) in the hardware system. The link fault location system may then extract a fault feature sequence of the hardware system based on the fault log and the current data of the fault monitoring parameters. The link fault location system may then predict an FRU-granularity fault location based on the fault feature sequence of the hardware system. The FRU-granularity fault location is predicted using a training result obtained from labeled data, and the labeled data is labeled with the faulty FRU.
In some possible implementations, the training results include a link failure localization model. Correspondingly, the link fault location system can input a fault feature sequence of the hardware system into the link fault location model to obtain the fault probability of the FRU in the hardware system, and then determine the fault position of the FRU granularity according to the fault probability of the FRU in the hardware system.
The link fault location model can mine hidden-layer features of the fault feature sequence and map them to different FRU categories to obtain the fault probability of each FRU in the hardware system. The FRU-granularity fault location can then be predicted accurately from these fault probabilities, which gives the method high reliability.
In some possible implementations, when there is an FRU whose fault probability is greater than a set threshold, the link fault location system may determine that the fault location includes the FRU whose fault probability is greater than the set threshold. When no FRU in the hardware system has a fault probability greater than the set threshold, the link fault location system may determine the fault location to be one or more of the top n FRUs ranked by fault probability in descending order.
For a link fault whose fault source can be located accurately (for example, the fault probability is greater than the set threshold), the method can directly output the FRU-granularity fault location, enabling a precise alarm. For a link fault whose fault source is difficult to locate accurately, the method can recommend the top N risk components, which still meets the fault location requirement. Further, the method also supports generating a link fault analysis report for review by operation and maintenance personnel.
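As an illustrative, non-limiting sketch, the threshold decision and the top-N fallback described above might be expressed as follows. The function name `select_fault_location`, the threshold value 0.8, the default n = 3, and the FRU names are assumptions made for illustration only and are not specified by the application.

```python
from typing import Dict, List


def select_fault_location(
    fru_probs: Dict[str, float], threshold: float = 0.8, n: int = 3
) -> List[str]:
    """Return the FRUs above the threshold, or else the top-n FRUs by probability."""
    # Rank FRUs by fault probability in descending order.
    ranked = sorted(fru_probs.items(), key=lambda kv: kv[1], reverse=True)
    # Precise case: at least one FRU exceeds the set threshold.
    confident = [fru for fru, prob in ranked if prob > threshold]
    if confident:
        return confident
    # Fallback case: recommend the top-n risk components.
    return [fru for fru, _ in ranked[:n]]


# Hypothetical fault probabilities for the FRUs on one link.
precise = select_fault_location(
    {"cpu": 0.05, "cable": 0.90, "backplane": 0.03, "adapter_card": 0.02}
)
ambiguous = select_fault_location(
    {"cpu": 0.40, "cable": 0.35, "backplane": 0.20, "adapter_card": 0.05}, n=2
)
```

In the first call a single FRU exceeds the threshold and is reported directly; in the second call no FRU does, so the two highest-ranked FRUs are returned as risk components.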
In some possible implementations, the link fault location model is constructed by obtaining training data comprising fault logs of hardware systems in a computing system deployed in a production environment, extracting a fault feature sequence from the training data, labeling a fault FRU for the fault feature sequence, obtaining labeling data, training a classifier according to the labeling data, and obtaining the link fault location model.
The method supports learning classification patterns from the labeled data, for example through supervised learning, and applying them to fault source classification, thereby achieving FRU-granularity fault location and meeting the need for fine-grained fault location.
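The application does not specify the type of classifier. Purely as a hedged illustration, the sketch below trains a small multinomial naive Bayes classifier (a stand-in chosen for brevity, not the classifier of the application) on fault feature sequences labeled with the faulty FRU, and outputs a fault probability per FRU. The event tokens, FRU names, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple


def train(labeled: Sequence[Tuple[Sequence[str], str]]):
    """Count feature tokens per labeled faulty FRU."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, fru in labeled:
        class_counts[fru] += 1
        token_counts[fru].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_proba(model, tokens: Sequence[str]) -> Dict[str, float]:
    """Return a normalized fault probability for every FRU seen in training."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for fru, count in class_counts.items():
        # Laplace smoothing keeps unseen token/FRU pairs from zeroing the score.
        denom = sum(token_counts[fru].values()) + len(vocab)
        score = math.log(count / total)
        for tok in tokens:
            if tok in vocab:
                score += math.log((token_counts[fru][tok] + 1) / denom)
        log_scores[fru] = score
    # Convert log scores to probabilities in a numerically stable way.
    peak = max(log_scores.values())
    exp = {fru: math.exp(s - peak) for fru, s in log_scores.items()}
    norm = sum(exp.values())
    return {fru: v / norm for fru, v in exp.items()}


# Hypothetical labeled data: (fault feature sequence, faulty FRU).
labeled_data = [
    (["aer_crc_error", "link_retrain"], "cable"),
    (["aer_crc_error", "presence_lost"], "cable"),
    (["ecc_error", "temp_high"], "memory"),
    (["ecc_error", "ecc_error"], "memory"),
]
model = train(labeled_data)
probs = predict_proba(model, ["aer_crc_error", "link_retrain"])
```

The resulting per-FRU probabilities are the kind of output to which a threshold or top-N decision can then be applied.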
In some possible implementations, the training results include a link failure location rule base. The link failure location rule base includes at least one link failure location rule. Accordingly, the link failure location system may match the failure signature sequence of the hardware system with at least one link failure location rule. When the fault feature sequence of the hardware system is successfully matched with the target rule in the at least one link fault location rule, the link fault location system can determine the fault location of FRU granularity from the target rule.
By matching the fault feature sequence directly against the fault location rules in the rule base, the method achieves FRU-level fault location efficiently, meeting both accuracy and efficiency requirements.
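The form of a link fault location rule is not fixed by the application. As one possible sketch, a rule may be modeled as a set of events that must all appear in the fault feature sequence, paired with the FRU it points to. The `Rule` class, the `match_rule` function, and the event names below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Rule:
    required_events: FrozenSet[str]  # events that must all be observed
    faulty_fru: str                  # FRU the rule attributes the fault to


def match_rule(features: Iterable[str], rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule whose required events are all present, else None."""
    observed = set(features)
    for rule in rules:
        if rule.required_events <= observed:
            return rule
    return None


# Hypothetical rule base with two rules.
rule_base = [
    Rule(frozenset({"aer_crc_error", "presence_lost"}), "cable"),
    Rule(frozenset({"ecc_error"}), "memory"),
]
hit = match_rule(["temp_high", "ecc_error"], rule_base)
miss = match_rule(["temp_high"], rule_base)
```

When a target rule matches, the FRU-granularity fault location is read from that rule (`hit.faulty_fru`); when no rule matches, the function returns `None`.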
In some possible implementations, the training results include a link failure localization model and a link failure localization rule base. The link failure location rule base includes at least one link failure location rule. Accordingly, the link failure location system may match the failure signature sequence of the hardware system with at least one link failure location rule. When the matching fails, the link fault positioning system inputs the fault characteristic sequence of the hardware system into a link fault positioning model to obtain the fault position of FRU granularity.
The method supports FRU-granularity link fault location through cooperation between the link fault location rules and the link fault location model, further balancing the accuracy and efficiency of fault location.
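The rule-first, model-fallback flow can be sketched as below. The rule matcher and the model are passed in as callables so that the example stays self-contained; `toy_match_rule` and `toy_predict_proba` are hypothetical stand-ins, not components defined by the application.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple


def locate_fault(
    features: Sequence[str],
    match_rule: Callable[[Sequence[str]], Optional[str]],
    predict_proba: Callable[[Sequence[str]], Dict[str, float]],
) -> Tuple[str, str]:
    """Try the rule base first; fall back to the model when no rule matches."""
    fru = match_rule(features)
    if fru is not None:
        return fru, "rule"
    probs = predict_proba(features)
    # Take the FRU with the highest predicted fault probability.
    return max(probs, key=probs.get), "model"


def toy_match_rule(features: Sequence[str]) -> Optional[str]:
    # Stand-in rule base: a single rule mapping an ECC error to memory.
    return "memory" if "ecc_error" in features else None


def toy_predict_proba(features: Sequence[str]) -> Dict[str, float]:
    # Stand-in model: fixed probabilities, for illustration only.
    return {"cable": 0.7, "backplane": 0.2, "cpu": 0.1}


by_rule = locate_fault(["ecc_error"], toy_match_rule, toy_predict_proba)
by_model = locate_fault(["aer_crc_error"], toy_match_rule, toy_predict_proba)
```

The second element of each result records which path produced the location, which may be useful when generating an analysis report.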
In some possible implementations, the link fault location system may also generate a link topology of the hardware system from networking information of the hardware system. The link topology records the link type of each link in the hardware system and the FRUs included in a link of at least one link type. Accordingly, the link fault location system may predict the FRU-granularity fault location based on the fault feature sequence of the hardware system and the link topology of the hardware system.
By adding the link topology as prior information, the method narrows the matching range or search range for locating the faulty link, improving both the efficiency and the accuracy of link fault location.
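One way to picture how the link topology narrows the search range is sketched below. The networking information is assumed here to be a list of (link type, FRU) pairs, and candidate FRUs are restricted to those on the link where the abnormal event occurred. This data layout and the function names are illustrative assumptions only.

```python
from typing import Dict, Iterable, List, Tuple


def build_topology(networking_info: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each link type to the FRUs on links of that type."""
    topology: Dict[str, List[str]] = {}
    for link_type, fru in networking_info:
        frus = topology.setdefault(link_type, [])
        if fru not in frus:
            frus.append(fru)
    return topology


def restrict_to_link(
    fru_probs: Dict[str, float], topology: Dict[str, List[str]], link_type: str
) -> Dict[str, float]:
    """Keep only FRUs on the given link type and renormalize their probabilities."""
    on_link = set(topology.get(link_type, []))
    kept = {fru: p for fru, p in fru_probs.items() if fru in on_link}
    total = sum(kept.values())
    if not kept or total == 0:
        # Nothing to narrow down: leave the candidates unchanged.
        return dict(fru_probs)
    return {fru: p / total for fru, p in kept.items()}


# Hypothetical networking information.
topology = build_topology([
    ("pcie", "cpu"), ("pcie", "cable"), ("pcie", "backplane"), ("pcie", "pcie_card"),
    ("memory", "cpu"), ("memory", "dimm"),
])
narrowed = restrict_to_link({"cpu": 0.3, "cable": 0.3, "dimm": 0.4}, topology, "pcie")
```

Here the memory module is excluded because it is not on a PCIe link, so only the CPU and the cable remain as candidates.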
In some possible implementations, the fault monitoring parameters include one or more of temperature, current, voltage, presence (in-position) state, or seating (securely inserted) state. The abnormal events include one or more of a sideband detection anomaly, an indicator light anomaly, or a reliability, availability, and serviceability (RAS) anomaly. The RAS anomaly includes a data access anomaly during operation of a chip, a controller, a bus, or an input/output (IO) peripheral.
Parameters such as temperature, current, voltage, presence state, and seating state, and events such as sideband detection anomalies, indicator light anomalies, and RAS anomalies, are easy to collect as location inputs and impose no additional operations on maintenance personnel.
In some possible implementations, the link fault location system may extract, from the fault log, one or more of the number of occurrences of an abnormal event, its occurrence time, the event subject, or the link on which the event occurred, and construct the fault feature sequence of the hardware system from the extracted items together with the historical data and the current data of the fault monitoring parameters.
Extracting the number of occurrences, the occurrence time, the event subject, the link on which the event occurred, and the like to construct a structured fault feature sequence facilitates subsequent model inference or rule matching and assists FRU-granularity fault location.
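The construction of a structured fault feature sequence might look like the following sketch. The log record fields (`time`, `event`, `subject`, `link`), the use of the historical mean as a baseline, and the function name are assumptions; the application does not prescribe a concrete format.

```python
from typing import Dict, List, Sequence


def build_feature_sequence(
    events: Sequence[dict], history: Dict[str, List[float]], current: Dict[str, float]
) -> List[dict]:
    """Aggregate abnormal events and append monitoring-parameter features."""
    grouped: Dict[tuple, dict] = {}
    for ev in events:
        key = (ev["event"], ev["subject"], ev["link"])
        entry = grouped.setdefault(key, {"count": 0, "last_time": ev["time"]})
        entry["count"] += 1  # number of occurrences
        entry["last_time"] = max(entry["last_time"], ev["time"])  # occurrence time
    features = [
        {"event": k[0], "subject": k[1], "link": k[2], **v} for k, v in grouped.items()
    ]
    features.sort(key=lambda f: f["last_time"])
    # Pair each current parameter value with a baseline from its historical data.
    for name, value in current.items():
        past = history.get(name, [])
        baseline = sum(past) / len(past) if past else value
        features.append({"parameter": name, "current": value, "baseline": baseline})
    return features


# Hypothetical log excerpt and monitoring data.
sequence = build_feature_sequence(
    events=[
        {"time": 10, "event": "aer_crc_error", "subject": "pcie_card_1", "link": "pcie"},
        {"time": 25, "event": "aer_crc_error", "subject": "pcie_card_1", "link": "pcie"},
        {"time": 18, "event": "presence_lost", "subject": "cable_2", "link": "pcie"},
    ],
    history={"temperature": [40.0, 42.0]},
    current={"temperature": 65.0},
)
```

The event entries are ordered by their most recent occurrence time, followed by one entry per monitoring parameter.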
In a second aspect, the present application provides a link failure location system. The system comprises:
a parameter and log acquisition module, configured to acquire a fault log of a hardware system in a computing system to be detected, wherein the fault log records historical data of fault monitoring parameters of the hardware system, abnormal events, and replacement records of field replaceable units (FRUs) in the hardware system, and to acquire current data of the fault monitoring parameters of the hardware system;
a preprocessing module, configured to extract a fault feature sequence of the hardware system according to the fault log and the current data of the fault monitoring parameters of the hardware system;
and a link fault location module, configured to predict an FRU-granularity fault location according to the fault feature sequence of the hardware system, wherein the FRU-granularity fault location is predicted using a training result obtained from labeled data, and the labeled data is labeled with the faulty FRU.
In some possible implementations, the training results include a link failure localization model;
The link fault positioning module is specifically configured to:
Inputting the fault characteristic sequence of the hardware system into the link fault location model to obtain the fault probability of FRU in the hardware system;
And determining the fault position of FRU granularity according to the fault probability of FRU in the hardware system.
In some possible implementations, the link failure location module is specifically configured to:
When there is an FRU whose fault probability is greater than a set threshold, determining that the fault location includes the FRU whose fault probability is greater than the set threshold; and when no FRU in the hardware system has a fault probability greater than the set threshold, determining that the fault location is one or more of the top n FRUs ranked by fault probability in descending order.
In some possible implementations, the link failure location system further includes:
The construction module is used for acquiring training data, wherein the training data comprises fault logs of hardware systems in a computing system deployed in a production environment, extracting fault feature sequences from the training data, labeling fault FRUs for the fault feature sequences, acquiring the labeling data, training a classifier according to the labeling data, and acquiring the link fault positioning model.
In some possible implementations, the training result includes a link failure location rule base including at least one link failure location rule;
The link fault positioning module is specifically configured to:
Matching the fault feature sequence of the hardware system with the at least one link fault location rule;
And when the fault feature sequence of the hardware system is successfully matched with a target rule in the at least one link fault location rule, determining the fault position of FRU granularity from the target rule.
In some possible implementations, the training results include a link failure localization model and a link failure localization rule base including at least one link failure localization rule;
The link fault positioning module is specifically configured to:
Matching the fault feature sequence of the hardware system with the at least one link fault location rule;
And when the matching fails, inputting the fault feature sequence of the hardware system into the link fault location model to obtain the FRU-granularity fault location.
In some possible implementations, the preprocessing module is further configured to:
Generating a link topology of the hardware system according to networking information of the hardware system, wherein the link topology records the link type of each link in the hardware system and the FRUs included in a link of at least one link type;
The link fault positioning module is specifically configured to:
predicting the fault location of FRU granularity according to the fault characteristic sequence of the hardware system and the link topology of the hardware system.
In some possible implementations, the fault monitoring parameters include one or more of temperature, current, voltage, presence (in-position) state, or seating (securely inserted) state; the abnormal events include one or more of a sideband detection anomaly, an indicator light anomaly, or a reliability, availability, and serviceability (RAS) anomaly; and the RAS anomaly includes a data access anomaly during operation of a chip, a controller, a bus, or an input/output (IO) peripheral.
In some possible implementations, the preprocessing module is specifically configured to:
Extracting one or more of the occurrence times, occurrence time, event main body or links where the events are located of the abnormal events from the fault log;
and constructing a fault characteristic sequence of the hardware system according to one or more of the occurrence times, the occurrence time, the event main body or the link where the event is located, the historical data and the current data of the fault monitoring parameters.
In a third aspect, the present application provides a baseboard management controller. The baseboard management controller comprises a processor and a memory, wherein the memory stores computer readable instructions, and the processor executes the computer readable instructions to enable the baseboard management controller to execute the link fault positioning method according to the first aspect or any implementation manner of the first aspect.
In a fourth aspect, the present application provides a cluster of computing devices. The cluster of computing devices includes at least one computing device including at least one processor and at least one memory. The at least one processor and the at least one memory are in communication with each other. The at least one processor is configured to execute instructions stored in the at least one memory to cause a computing device or cluster of computing devices to perform the link failure localization method according to the first aspect or any implementation of the first aspect.
In a fifth aspect, the present application provides a computer readable storage medium having stored therein instructions for instructing a computing device or a cluster of computing devices to perform the link failure localization method according to the first aspect or any implementation manner of the first aspect.
In a sixth aspect, the present application provides a computer program product comprising instructions which, when run on a computing device or cluster of computing devices, cause the computing device or cluster of computing devices to perform the link failure localization method of the first aspect or any implementation of the first aspect.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Detailed Description
The terms "first" and "second" in embodiments of the application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features.
Some technical terms related to the embodiments of the present application will be described first.
A link refers to a hardware path between components of a computing system (e.g., a terminal, a server, or a computing device cluster), such as a hardware path from a CPU to other components, or a hardware path from a baseboard management controller (BMC) to other components. The other components may be PCIe peripherals, I2C peripherals, SAS/SATA hard disks, or other CPUs. The hardware path carries data signals and control signals among the CPU, the BMC, and the other components.
Links may be classified into high-speed links and low-speed links according to signal transmission rate. A high-speed link has a higher signal transmission rate and generally carries data signals. Examples include a memory link (such as a link from a CPU to memory), a PCIe link (such as a link from a CPU to a PCIe peripheral), a SAS/SATA link (such as a link from a CPU to a SAS/SATA hard disk), a Compute Express Link (CXL) link, a Unified Bus (UB) link, and a HiSilicon Coherent Connect Specification (HCCS) link. A low-speed link has a lower signal transmission rate and generally carries control signals. Examples include an I2C link, a Low Pin Count (LPC) bus link, and a Serial Peripheral Interface (SPI) link.
When a link fault occurs in a computing system, if the fault source is not found quickly so that proactive operation and maintenance and timely repair can be performed, the affected services cannot be recovered quickly, which may cause service interruption or even paralysis; the fault may also spread further and bring the computing system down. The industry provides a link fault location method. Specifically, when the computing system fails, PCIe link layer anomalies or message anomalies (e.g., packet loss, packet check anomalies, message timeout) may be identified according to Advanced Error Reporting (AER) fault information reported by a PCIe peripheral (which may also be referred to as an endpoint), so as to determine whether a link fault exists, and a fault source is recommended according to single-point problems recorded by the endpoint.
However, in the above link fault location method, a PCIe link fault can only be presumed based on whether the PCIe peripheral has reported a fault, and the logic for determining the fault source is fixed. As a result, the fault source is located with low accuracy, and in most scenarios it cannot be located at all: the method can only report a PCIe link fault coarsely, and it is difficult to determine whether the fault source on the link is the CPU controller, a cable, the backplane, an adapter card, or the terminal device.
In view of this, the present application provides a link fault location method. The method may be performed by a link fault location system. The link fault location system may be a software system, which may be deployed online or offline. When deployed online, the software system may run on the computing system itself, for example on a BMC (also referred to as an out-of-band management system) of the computing system. When deployed offline, the software system may run as an offline tool on a third-party computing device cluster, such as a public cloud, edge computing devices, or end-side computing devices. The link fault location system may also be a hardware system, such as a BMC in a computing system or a computing device cluster. When running, a hardware system such as the BMC or the computing device cluster may execute the link fault location method.
Specifically, the link fault location system may obtain a fault log of a hardware system in a computing system to be detected, where the fault log records historical data of fault monitoring parameters of the hardware system, abnormal events, and replacement records of field replaceable units (FRUs) in the hardware system. The system may further obtain current data of the fault monitoring parameters of the hardware system, extract a fault feature sequence of the hardware system according to the fault log and the current data, and predict an FRU-granularity fault location according to the fault feature sequence. The FRU-granularity fault location is predicted using a training result based on labeled data. The labeled data is labeled with the faulty FRU; for example, it may be fault feature sequences labeled with the faulty FRU. The training result based on the labeled data may include, but is not limited to, a link fault location model or a link fault location rule base.
According to the method, the fault feature sequence is extracted from the fault log and the current data of fault monitoring parameters when faults occur, then a training result based on marking data, such as a link fault positioning model or a link fault positioning rule base constructed based on the fault feature sequence marked with the fault FRU is utilized to infer the FRU corresponding to the fault feature sequence of the current fault, so that the FRU-level link fault positioning is realized, and the positioning accuracy is improved. In addition, the method is not limited to positioning of PCIe link faults, can realize universal link fault positioning, and improves coverage rate of link fault positioning.
In order to make the technical scheme of the application clearer and easier to understand, the architecture of the link fault locating system of the application is described with reference to the accompanying drawings.
Referring to the schematic architecture of a link fault location system shown in fig. 1, a link fault location system 10 is used to locate link faults for a computing system. The computing system may be a cross-architecture general-purpose server with a main processor, a coprocessor, or a BMC, a storage server, a multi-component server such as a high performance computing (HPC) server, or another integrated computing system, or may be a terminal with an embedded controller, including but not limited to a smartphone or a tablet computer. The computing system may also be a data center or a computing device cluster formed by a plurality of computing devices.
The computing system includes a hardware system, firmware, and an operating system. The hardware system includes a power module, a processor (e.g., CPU), a baseboard management controller BMC, and in some examples, may also include high/low speed peripherals. The power module is respectively connected with the processor, the BMC and the high/low speed peripheral equipment and is used for supplying power to the processor, the BMC and the high/low speed peripheral equipment. The processor is respectively connected with the BMC and the high/low speed peripheral equipment, and the BMC is connected with the high/low speed peripheral equipment.
The BMC is a small dedicated processor that may monitor the operating state of the computing system with sensors or by analyzing in-band logs. In-band logs include, but are not limited to, operating system logs, driver logs, or firmware logs. The BMC may also communicate with the system administrator through a separate line. A BMC is typically contained in the motherboard or main circuit board of the monitored computing device. The BMC's sensors measure internal physical variables including, but not limited to, temperature, humidity, supply voltage, and fan speed. When a variable exceeds its set range, the BMC can notify the system administrator, who can adjust the variable by remote control. It should be noted that the BMC also supports local control of the computing system.
Firmware is solidified code. When running, the firmware occupies non-volatile memory (NVM) and does not need extra memory or processor resources. The firmware is used to perform basic tests, initialize the hardware system, and then load an operating system loader from a bootable device (e.g., a hard drive) into memory. When the processor of the hardware system includes a main processor and a coprocessor, the firmware may include main processor firmware or coprocessor firmware.
An operating system (OS) is a set of interrelated system software programs that manage and control the operation, deployment, and execution of computer hardware and software resources, and provide common services to organize user interaction. The operating system needs to handle basic transactions such as managing and configuring memory, prioritizing the supply and demand of system resources, controlling input and output devices, operating networks, and managing file systems. The operating system also provides an operator interface for the user to interact with the system.
The BMC in the hardware system can also communicate with the firmware and the operating system, so that fault logs reported by the firmware and the operating system are obtained for positioning link faults. In the example of fig. 1, the link fault location system 10 is deployed in a BMC, and the BMC runs program codes of the link fault location system 10, so as to execute the link fault location method of the present application, implement the link fault location with FRU granularity, and improve the location accuracy.
The link fault location system 10 may include a parameter and log acquisition module and a link fault location module. The parameter and log acquisition module is configured to acquire a fault log of the hardware system in the computing system to be detected and to acquire current data of fault monitoring parameters of the hardware system. The fault log records historical data of the fault monitoring parameters of the hardware system, abnormal events, and replacement records of FRUs in the hardware system. The link fault location module is configured to predict the FRU-granularity fault location according to the fault feature sequence of the hardware system, which is extracted from the fault log and the current data of the fault monitoring parameters. The FRU-granularity fault location can be predicted using a training result based on labeled data (such as at least one of a link fault location model or a link fault location rule base).
It should be noted that, the fault log may be collected and reported by a hardware system, firmware, and an operating system, and based on this, the parameter and log collection module may also be deployed in the hardware system, firmware, and operating system.
Fig. 1 is merely illustrative of a link fault location system 10 deployed at a BMC, and the link fault location system 10 may be an offline link fault location tool in other possible implementations of embodiments of the present application.
The following describes the structure of the link fault location system 10 in detail from the viewpoint of functional modularization, taking the link fault location system 10 deployed in a BMC as an example.
Referring to the schematic structure of a link fault location system 10 shown in fig. 2, the link fault location system 10 includes a parameter and log acquisition module 102, a preprocessing module 104, and a link fault location module 106. Further, the link failure location system 10 may also include an output module 108, a build module 109. The functions of the respective modules are described below.
The parameter and log acquisition module 102 is configured to obtain a fault log of a hardware system in the computing system to be detected, and obtain current data of fault monitoring parameters of the hardware system. The fault log records historical data of the fault monitoring parameters of the hardware system, abnormal events, and replacement records of FRUs in the hardware system. This provides operating environment information and anomaly data that are as comprehensive as possible for link fault location.
The fault log may be collected and reported by the hardware system, the firmware, or the operating system (e.g., a driver in the operating system). The firmware may include basic input/output system (BIOS) firmware. The fault monitoring parameters of the hardware system may be fault monitoring parameters of the high-speed links and low-speed links of the hardware system. A high-speed link may include a link on which a high-speed bus device is located. As shown in FIG. 1, high-speed bus devices include, but are not limited to, a CPU, memory (Mem), PCIe peripherals, CXL peripherals, HCCS peripherals, SAS hard disks, SATA hard disks, or UB peripherals. Low-speed bus devices include, but are not limited to, I2C peripherals, LPC peripherals, SPI peripherals, serial general-purpose input/output (SGPIO) peripherals, or Universal Serial Bus (USB) peripherals.
In some possible implementations, the fault monitoring parameters may include one or more of temperature, current, voltage (e.g., supply voltage), presence (in-position) state, or seating (securely inserted) state, and the abnormal events may include one or more of a sideband detection anomaly, an indicator light anomaly, or a reliability, availability, and serviceability (RAS) anomaly. The RAS anomaly includes a data access anomaly during operation of a chip, a controller, a bus, or an IO peripheral, where the data access anomaly may be an error detection and correction anomaly or an illegal access anomaly. Error detection and correction anomalies include, but are not limited to, error checking and correction (ECC), parity, and cyclic redundancy check (CRC) anomalies. The IO peripheral may be a PCIe peripheral. A PCIe peripheral serves as an endpoint (EP) device in the PCIe subsystem, so in the PCIe subsystem a PCIe peripheral anomaly may be an EP anomaly. The PCIe subsystem may further include a root port (RP), and a root port anomaly is an RP anomaly. In some examples, the IO peripheral may also be an IO peripheral based on a protocol interface such as UB, CXL, SAS, SATA, USB, I2C, or SPI.
The preprocessing module 104 is configured to extract a fault feature sequence of the hardware system according to the fault log and current data of fault monitoring parameters of the hardware system. In order to improve the efficiency and quality of feature extraction, the preprocessing module 104 may also clean the fault log and the current data of the fault monitoring parameters. The cleaning of the fault log and the current data of the fault monitoring parameters can comprise classification, structuring and filtering. Filtering refers to filtering data which does not meet requirements, such as incomplete data or error data. Further, when the fault signature is provided to the back end for use through the form of a data interface, the preprocessing module 104 is further configured to abstract and normalize the data format of the fault signature after the fault signature is extracted, so that the link fault location can be performed based on the abstract or normalized fault signature subsequently. Where abstracting refers to separating data and implementations, presenting necessary information without presenting details, and normalizing refers to unifying data represented by different fields that have the same physical meaning.
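The cleaning and normalization described above can be illustrated with a minimal Python sketch. It is not the implementation of the preprocessing module 104: the alias table, the required fields, and the record layout are all hypothetical and chosen only to show how fields with the same physical meaning might be unified and how incomplete records might be filtered out.

```python
# Hypothetical alias table: different field names that carry the same physical meaning.
FIELD_ALIASES = {
    "temp": "temperature",
    "temperature_c": "temperature",
    "volt": "voltage",
    "vin": "voltage",
}
# Assumed completeness rule: every record must carry these fields.
REQUIRED_FIELDS = ("timestamp", "source")


def normalize(record):
    """Map aliased field names onto one canonical name."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


def is_complete(record):
    """Return True when no required field is missing or None."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def clean(records):
    """Normalize every record, then filter out incomplete ones."""
    normalized = (normalize(record) for record in records)
    return [record for record in normalized if is_complete(record)]


raw_records = [
    {"timestamp": 100, "source": "BMC", "temp": 71.5},
    {"timestamp": 101, "source": "BIOS", "temperature_c": 69.0},
    {"source": "OS", "volt": 11.8},  # incomplete: no timestamp, so it is dropped
]
cleaned = clean(raw_records)
```

In this sketch the two temperature readings end up under the same `temperature` key, and the record without a timestamp is discarded.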
The link fault location module 106 is configured to predict a fault location at FRU granularity according to the fault feature sequence of the hardware system. The fault location at FRU granularity is predicted using a training result based on labeling data. The labeling data is labeled with faulty FRUs and may be, for example, fault feature sequences labeled with faulty FRUs. The training result based on the labeling data may be a link fault location model or a link fault location rule base. In a specific implementation, the link fault location module 106 performs exact fault source matching on the fault feature sequence of the hardware system through the link fault location model or the link fault location rule base. When the matching succeeds, the fault source is determined to be the successfully matched FRU. When the matching fails, the fault source is determined to be one or more of the top n FRUs ranked by matching degree from high to low; these top n FRUs are also called risk components or suspected fault sources.
In some possible implementations, the preprocessing module 104 is further configured to generate a link topology of the hardware system according to networking information of the hardware system. The link topology records the link types of the links in the hardware system and the FRUs included in the links of at least one link type. The preprocessing module 104 may obtain the networking information in real time to generate a complete link topology. Accordingly, the link fault location module 106 is specifically configured to predict the fault location at FRU granularity according to the fault feature sequence of the hardware system and the link topology of the hardware system. For example, the link fault location module 106 may determine, according to the link topology of the hardware system, the link fault location rules related to that link topology, and match the fault feature sequence of the hardware system against those rules to obtain the fault location at FRU granularity. For another example, the link fault location module 106 may input the link topology of the hardware system into the link fault location model, which uses the link topology as prior information for inference to obtain the fault location at FRU granularity. In this way, the matching range or search range can be narrowed, improving the efficiency and accuracy of link fault location.
The output module 108 is configured to output a link failure positioning result. Specifically, when the link failure positioning module 106 performs accurate matching successfully, it may determine that the failure positioning result is a successfully matched FRU, and the output module 108 may directly output the successfully matched FRU. Further, the output module 108 may also perform a precise alarm according to the FRU successfully matched. For example, output module 108 may notify a system administrator of the FRU failure for which the match was successful, so that the system administrator may mask or replace the FRU to achieve failure recovery. When the link failure location module 106 does not match successfully, the output module 108 may output n FRUs of the suspected failure sources. Whether or not the exact match is successful, the output module 108 may generate a link failure analysis report according to the link failure positioning result, and output the link failure analysis report for review during operation and maintenance.
The construction module 109 is configured to construct a link fault location model or a link fault location rule base. Specifically, the construction module 109 may obtain training data, where the training data includes fault logs of hardware systems in computing systems deployed in a production environment (also referred to as logs of current network problem cases), then extract a fault feature sequence from the training data, label the fault feature sequence with the faulty FRU, and train a classifier on the fault feature sequences labeled with faulty FRUs to obtain the link fault location model. It should be noted that, when training the classifier, the construction module 109 may use offline fault logs to perform artificial intelligence (Artificial Intelligence, AI) training in an offline manner (e.g., offline machine learning), obtaining a link fault location model based on a machine learning algorithm. The construction module 109 may also extract link fault location rules from the fault feature sequences labeled with faulty FRUs and generate the link fault location rule base. To do so, the construction module 109 may match a fault feature sequence labeled with a faulty FRU against a rule template, thereby extracting a link fault location rule. For example, for a current network problem case with complete experience, where the link fault can be located directly (that is, the faulty FRU can be determined), the construction module 109 may generate a link fault location rule from the fault feature sequence labeled with the faulty FRU in combination with the rule template. For a current network problem case without complete experience, the construction module 109 may feed the fault feature sequence into the link fault location model as a prior feature, or after adding a label, to reduce noise in training the link fault location model.
A prior feature refers to prior knowledge about a sample or a class of samples in machine learning, which can help optimize the performance of the model during training. Prior features are generally obtained from expert knowledge or historical data, and can reduce the training difficulty of the model and improve its generalization capability.
It should be noted that, when the link fault location system 10 is deployed offline, the structure of the link fault location system 10 may refer to fig. 2, and will not be described herein.
Based on the foregoing link failure positioning system 10, the present application further provides a link failure positioning method. The link failure locating method of the present application will be described in detail with reference to the embodiments.
Referring to the flow chart of a link failure localization method shown in fig. 3, the method comprises the steps of:
S302, the link fault positioning system 10 acquires a fault log of a hardware system in the computing system to be detected.
The computing system may be a server, a terminal, or the like, or a cluster of computing devices. The terminal can be a desktop computer, a notebook computer or embedded computing equipment such as a tablet computer, a smart phone and the like. The hardware system in the computing system refers to a system formed by hardware devices, including components such as a processor, an IO peripheral device, a BMC and the like, and a hardware path between the components is called a link of the computing system.
When a link failure occurs in the hardware system, the computing system may generate a failure log. For example, a hardware system, firmware, operating system may generate a fault log. The link fault location system 10 obtains fault logs reported by a computing system, for example, obtaining fault logs reported by a hardware system, firmware, and an operating system.
The fault log records historical data of the fault monitoring parameters of the hardware system, abnormal events, and replacement records of field replaceable units (FRUs) in the hardware system. In some possible implementations, the fault monitoring parameters may include one or more of temperature, current, voltage, in-position state, or insertion-stable state. The abnormal event may include one or more of a sideband detection abnormality, a lighting abnormality, or a RAS abnormality. The RAS abnormality includes a data access abnormality that occurs while a chip, a controller, a bus, or an IO peripheral is running, where the data access abnormality may be an error detection and correction abnormality or an illegal access abnormality. Error detection and correction abnormalities include, but are not limited to, ECC, parity, and cyclic redundancy check (CRC) abnormalities.
S304, the link fault location system 10 acquires current data of fault monitoring parameters of the hardware system.
Specifically, the link fault location system 10 may obtain current data of fault monitoring parameters of the hardware system in real time through the sensor. For example, the link fault location system 10 may obtain the current value of the temperature of the hardware system in real time through a temperature sensor.
It should be noted that S302 and S304 may be executed in parallel, or sequentially in a set order. For example, the link fault location method of the present application may execute S304 first and then S302, which is not limited in the embodiments of the present application.
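As a minimal sketch of executing S302 and S304 in parallel, the two acquisition steps could be submitted to a thread pool. The two acquisition functions below are hypothetical stand-ins that return made-up data; they are not interfaces defined by the present application.

```python
from concurrent.futures import ThreadPoolExecutor


def acquire_fault_log():
    """Stand-in for S302: return fault log records (hypothetical data)."""
    return [{"event": "PHY down", "body": "disk0"}]


def acquire_current_parameters():
    """Stand-in for S304: return current sensor readings (hypothetical data)."""
    return {"temperature": 71.5, "voltage": 11.9}


def acquire_inputs():
    """Run S302 and S304 in parallel and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        log_future = pool.submit(acquire_fault_log)
        params_future = pool.submit(acquire_current_parameters)
        return log_future.result(), params_future.result()


fault_log, current_parameters = acquire_inputs()
```

Calling the two functions one after the other, in either order, gives the same inputs for S306.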
S306, the link fault location system 10 extracts a fault characteristic sequence of the hardware system according to the fault log and the current data of the fault monitoring parameters of the hardware system.
In a specific implementation, the link fault location system 10 may perform feature recognition on the fault log and the current data of the fault monitoring parameters of the hardware system to obtain features of different feature classes, and then obtain the fault feature sequence from the features of the different feature classes. In some examples, a feature class may be one of a BMC alarm, a device-specific index, a boot/run log, fault diagnosis management (fault diagnosis management, FDM) information, or a maintenance log.
The different feature categories may include the following fault features:
Table 1 Feature classes and corresponding feature examples
Feature class | Feature examples
BMC alarm | Alarm effective condition, occurrence count, subject FRU type, subject count statistics, and subject link association information
Device-specific index | SMART, GList, PHY CRC, ECC, Health Status, Self Test Status
Boot/run log | PHY down, Init fail, Reset fail
FDM information | Caterr, IERR, AER, ARER, UCE, CE exceed
Maintenance log | Voltage exceed
SMART stands for Self-Monitoring, Analysis and Reporting Technology, an automatic hard disk status detection and early-warning system and specification. Detection instructions in the hard disk hardware monitor and record the operating condition of components such as the heads, platters, motor, and circuits, and compare it with preset safety values set by the manufacturer. If a monitored condition is about to exceed, or has exceeded, the preset safety range, the monitoring hardware or software of the host can automatically warn the user and perform minor automatic repairs, so as to protect the hard disk data in advance. GList stands for Grown Defect List, in which the hard disk records defect information. In use, if the hard disk develops a problem such as a bad track, the bad track can be marked in the GList so that it is not accessed again in subsequent reads and writes, which would otherwise cause data loss or other problems. PHY in PHY CRC and PHY down denotes the physical layer. Caterr and IERR are timeout-type errors. UCE and CE stand for Uncorrectable Error and Correctable Error, respectively. exceed denotes exceeding a threshold: CE exceed means that the CE count exceeds a threshold, and Voltage exceed means that the voltage exceeds a threshold.
The link fault location system 10 may extract, from the fault log, one or more of the occurrence count, occurrence time, event body, or link of each abnormal event. For example, the link fault location system 10 may extract these items from the features of different feature classes identified from the fault log. The link fault location system 10 then constructs the fault feature sequence of the hardware system from the extracted items together with the historical data and the current data of the fault monitoring parameters.
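The construction of a fault feature sequence can be sketched in Python as follows. The event schema (`type`, `time`, `body`, `link`) and the layout of the resulting sequence are assumptions made for illustration; the present application does not prescribe a concrete data format.

```python
from collections import defaultdict


def build_feature_sequence(events, history, current):
    """Combine abnormal-event statistics with historical and current parameter data."""
    # Group occurrence times by (event type, event body, link).
    grouped = defaultdict(list)
    for event in events:
        key = (event["type"], event["body"], event["link"])
        grouped[key].append(event["time"])

    event_features = [
        {"event": etype, "body": body, "link": link,
         "count": len(times), "first_time": min(times), "last_time": max(times)}
        for (etype, body, link), times in sorted(grouped.items())
    ]
    # Pair each current parameter value with its recorded history, if any.
    parameter_features = {
        name: {"history": list(history.get(name, [])), "current": value}
        for name, value in sorted(current.items())
    }
    return {"events": event_features, "parameters": parameter_features}


# Made-up sample data.
events = [
    {"type": "PHY CRC", "time": 10, "body": "disk0", "link": "SAS"},
    {"type": "PHY CRC", "time": 25, "body": "disk0", "link": "SAS"},
    {"type": "PHY down", "time": 30, "body": "disk0", "link": "SAS"},
]
history = {"temperature": [55.0, 58.5]}
current = {"temperature": 71.5}
features = build_feature_sequence(events, history, current)
```

Here the two PHY CRC events on the same body and link are merged into one entry with an occurrence count of 2.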
S308, the link fault location system 10 predicts the fault location of FRU granularity according to the fault characteristic sequence of the hardware system.
Wherein, the fault position of FRU granularity is obtained by combining training result prediction based on labeling data. The annotation data is annotated with a failed FRU. For example, the annotation data may be a sequence of fault signatures annotated with a faulty FRU. Specifically, the link fault location system 10 may obtain training data, where the training data may include fault logs of hardware systems (also referred to as current network problem cases) in a computing system deployed in a production environment, then the link fault location system 10 may perform feature extraction on the current network problem cases to obtain a fault feature sequence, and then the link fault location system 10 may receive labeling information on the fault feature sequence, so as to obtain a fault feature sequence labeled with a faulty FRU. The training result based on the labeling data may be at least one of a link failure location model or a link failure location rule base. The link failure location rule base includes at least one link failure location rule.
The following describes a training process of the link fault location model and the link fault location rule.
The link fault location model may be a classification model obtained by training a classifier in a supervised learning manner. The classifier may be, for example, a random forest classifier (Random Forests Classifier, RFC) or a neural network (Neural Network, NN) classifier. As shown in fig. 4, the link fault location system 10 may obtain training data, such as fault logs of hardware systems in computing systems deployed in a production environment. Because a computing system in the production environment may fail multiple times over a period of time, a plurality of current network problem cases can be obtained. For each current network problem case, the link fault location system 10 may extract a fault feature sequence and label the faulty FRU. The faulty FRU may be labeled, for example, according to a FRU replacement record, by identifying it from fault-related information recorded in the fault log, according to experience, or by a detection means (such as detection with instruments and meters). This yields a case feature matrix (i.e., fault feature sequences labeled with faulty FRUs). The link fault location system 10 may train the classifier using the case feature matrix and, when the training stop condition is satisfied, determine the trained classifier as the link fault location model.
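The labeling step can be illustrated with a short Python sketch that builds a case feature matrix from FRU replacement records, which is one of the labeling options named above. The record fields (`case_id`, `features`, `fru`) are hypothetical. Classifier training itself is omitted; the resulting feature rows and labels are what would be handed to a classifier such as a random forest.

```python
def label_cases(cases, replacement_records):
    """Label each case with the FRU that was replaced for it, if one is recorded."""
    replaced = {record["case_id"]: record["fru"] for record in replacement_records}
    feature_rows, labels = [], []
    for case in cases:
        fru = replaced.get(case["case_id"])
        if fru is None:
            # No replacement record: the case needs another labeling method.
            continue
        feature_rows.append(case["features"])
        labels.append(fru)
    return feature_rows, labels


# Made-up cases; each feature row is [PHY CRC count, PHY down count, voltage exceed flag].
cases = [
    {"case_id": "c1", "features": [12, 3, 0]},
    {"case_id": "c2", "features": [0, 0, 1]},
    {"case_id": "c3", "features": [1, 0, 0]},
]
replacement_records = [
    {"case_id": "c1", "fru": "SAS cable"},
    {"case_id": "c2", "fru": "power supply"},
]
feature_rows, labels = label_cases(cases, replacement_records)
```

Case `c3` has no replacement record, so it stays out of the matrix until it is labeled by another means.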
The link fault location rule can be obtained by matching the fault feature sequence marked with the fault FRU with the rule template. Wherein, the rule templates may be formed by conditional statements (e.g., if-then statements), and correspondingly, the link fault location rules may be conditional statements that populate the rule templates with fault feature sequences and fault FRUs.
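A rule template built from an if-then statement can be sketched in Python as follows. The rule representation, and the choice that a rule matches when all of its conditions appear among the observed features, are assumptions for illustration rather than the rule format of the present application.

```python
def make_rule(feature_sequence, faulty_fru):
    """Fill the if-then template with a labeled fault feature sequence."""
    return {"if": frozenset(feature_sequence), "then": faulty_fru}


def render(rule):
    """Show the rule as a readable conditional statement."""
    conditions = " and ".join(sorted(rule["if"]))
    return f"if {conditions} then faulty_fru = {rule['then']}"


def matches(rule, observed_features):
    """Assumed matching criterion: every condition of the rule is observed."""
    return rule["if"] <= set(observed_features)


rule = make_rule(["PHY down", "PHY CRC"], "SAS cable")
text = render(rule)
hit = matches(rule, ["PHY down", "PHY CRC", "Voltage exceed"])
miss = matches(rule, ["PHY down"])
```

A stricter variant could require the observed features to equal the rule conditions exactly instead of containing them.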
In some possible implementations, the training result may be a link fault location model. Accordingly, the link fault location system 10 may input the fault feature sequence of the hardware system into the link fault location model and obtain the fault location at FRU granularity through classification by the link fault location model. When classifying through the link fault location model, the link fault location system 10 may determine the fault probability of each FRU in the hardware system according to the fault feature sequence of the hardware system, and obtain the fault location at FRU granularity based on the fault probability of each FRU.
Specifically, when the fault probability of a FRU is greater than a set threshold, the link fault location system 10 may determine that the fault location includes the FRU whose fault probability is greater than the set threshold. When the fault probability of no FRU in the hardware system is greater than the set threshold, the link fault location system 10 may determine that the fault location is one or more of the top n FRUs ranked by fault probability from high to low. Further, the link fault location system 10 may also output the fault probability of each such FRU (fault source or suspected fault source).
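The threshold and top-n logic above can be sketched in Python as follows. The threshold of 0.8, the default n of 3, and the result layout are illustrative assumptions, not values given by the present application.

```python
def locate(fault_probabilities, threshold=0.8, n=3):
    """Pick fault sources above the threshold, else the top-n suspected sources."""
    ranked = sorted(fault_probabilities.items(), key=lambda item: item[1], reverse=True)
    confirmed = [(fru, p) for fru, p in ranked if p > threshold]
    if confirmed:
        return {"status": "fault_source", "frus": confirmed}
    # No FRU exceeds the threshold: report the n most probable FRUs as suspected.
    return {"status": "suspected", "frus": ranked[:n]}


clear_case = locate({"hard disk": 0.91, "SAS cable": 0.05, "backplane": 0.04})
unclear_case = locate({"hard disk": 0.4, "SAS cable": 0.35, "backplane": 0.2, "RAID card": 0.05}, n=2)
```

Returning the probabilities alongside the FRUs corresponds to outputting the fault probability of each fault source or suspected fault source.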
In other possible implementations, the link fault location system 10 may match the sequence of fault signatures of the hardware system to link fault location rules in a link fault location rule base. The link failure location system 10 may determine the location of the failure at the FRU granularity based on the matching results. For example, when the matching result is that the matching is successful, the link failure positioning system 10 determines the failure location as the FRU for which the matching is successful. Specifically, when the failure signature sequence of the hardware system matches successfully with a target rule of the at least one link failure localization rule, the link failure localization system 10 may determine a failure location of the FRU granularity from the target rule.
Considering that the link fault location rule base includes link fault location rules that can directly determine the fault source, the link fault location system 10 may first match, for example exactly match, the fault feature sequence of the hardware system against the link fault location rules in the link fault location rule base. When the matching succeeds, the link fault location system 10 may determine that the fault location is the successfully matched FRU. When the matching fails, the link fault location system 10 may input the fault feature sequence of the hardware system into the link fault location model and obtain the fault location at FRU granularity in the hardware system through inference by the link fault location model.
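The rule-first, model-second flow can be sketched in Python as follows. The rule base layout and the `stub_model` function, which returns fixed made-up probabilities, are hypothetical stand-ins for a real rule base and a trained link fault location model.

```python
def locate_fault(features, rule_base, model, n=3):
    """Try the rule base first; fall back to the model when no rule matches."""
    observed = set(features)
    for conditions, fru in rule_base:
        if set(conditions) <= observed:
            return {"method": "rule", "frus": [fru]}
    # No rule matched: let the model rank the FRUs by fault probability.
    probabilities = model(features)
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return {"method": "model", "frus": ranked[:n]}


def stub_model(features):
    """Stand-in for a trained model; returns fixed, made-up probabilities."""
    return {"hard disk": 0.6, "backplane": 0.3, "SAS cable": 0.1}


rule_base = [(("PHY down", "PHY CRC"), "SAS cable")]
by_rule = locate_fault(["PHY down", "PHY CRC"], rule_base, stub_model)
by_model = locate_fault(["Voltage exceed"], rule_base, stub_model, n=2)
```

Placing the rules first means that cases with complete experience are resolved without running the model.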
Based on the description, the method can be used for extracting the fault characteristic sequence from the fault log when the fault occurs and the current data of the fault monitoring parameters, and then deducing the FRU corresponding to the fault characteristic sequence of the current fault by using the training result (such as a link fault positioning model or a link fault positioning rule base) based on the marking data (such as the fault characteristic sequence marked with the fault FRU), so that the link fault positioning of the FRU level is realized, and the positioning accuracy is improved. In addition, the method is not limited to positioning of PCIe link faults, can realize universal link fault positioning, and improves coverage rate of link fault positioning. In addition, the method adopts out-of-band collected information as positioning input, is easy to obtain, and does not increase the operation of maintenance personnel.
Further, a link fault location model or a link fault location rule base is constructed based on fault logs of computing systems having different link topologies in the production environment, based on which, in an inference phase, the link fault location system 10 may obtain the link topology of the hardware system of the computing system to be detected, and predict the fault location of the FRU granularity based on the link topology and the fault feature sequence. For example, the link fault location system 10 may input a link topology of a hardware system of the computing system to be detected into a link fault location model, which infers based on the link topology and a sequence of fault signatures, obtaining a fault location at FRU granularity. For another example, the link fault location system 10 may determine a link fault location rule related to a link topology of the hardware system to be detected from a link fault location rule base, and match the fault feature sequence with the link fault location rule related to the link topology of the hardware system to be detected, to obtain a fault location of the FRU granularity. Therefore, the accuracy and the efficiency of link fault positioning can be improved.
As shown in fig. 5, the link failure location system 10 may generate a link topology of the hardware system according to networking information of the hardware system, where the link topology may record a link type of a link in the hardware system and a FRU included in the link of at least one link type. Accordingly, when determining the fault location of the FRU granularity, the link fault location system 10 may determine the fault location of the FRU granularity according to the fault feature sequence of the hardware system and the link topology of the hardware system through a link fault location model or a link fault location rule base.
The link fault location system 10 may determine, according to links included in a link topology of the hardware system, link fault location rules related to links in the link topology from a link fault location rule base, and then match the fault feature sequence with the link fault location rules related to links in the link topology, so as to determine a fault location of the FRU granularity. In some examples, link fault location system 10 may also input a link fault location model using the link topology of the hardware system as a priori feature, and the link fault location model may infer based on the a priori feature and the input sequence of fault features to determine the location of the fault at the FRU granularity.
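Narrowing the rule base by link topology can be sketched in Python as follows. The topology layout (link type mapped to the FRUs on that link), the rule fields, and the sample FRU names are assumptions for illustration.

```python
def rules_for_topology(rule_base, topology):
    """Keep rules whose link type is in the topology and whose FRU is on that link."""
    return [
        rule for rule in rule_base
        if rule["fru"] in topology.get(rule["link_type"], ())
    ]


def match(rules, observed_features):
    """Return the FRUs of all rules whose conditions are all observed."""
    observed = set(observed_features)
    return [rule["fru"] for rule in rules if set(rule["if"]) <= observed]


# Made-up topology and rule base.
topology = {"SAS": ["RAID card", "SAS cable", "backplane", "hard disk"]}
rule_base = [
    {"link_type": "SAS", "if": ["PHY down", "PHY CRC"], "fru": "SAS cable"},
    {"link_type": "PCIe", "if": ["AER"], "fru": "PCIe switch"},
]
candidates = rules_for_topology(rule_base, topology)
located = match(candidates, ["PHY CRC", "PHY down", "Voltage exceed"])
```

Because the hardware system in this sketch has no PCIe link in its topology, the PCIe rule is never evaluated, which is how the matching range shrinks.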
Next, a process of generating a link topology will be described.
In one implementation, the link fault location system 10 may generate a link topology based on the hardware connection relationships. FIG. 6 illustrates link topologies of different types of computing systems, such as a computing system using SATA/SAS hard disks without an Expander, a computing system using SATA/SAS hard disks with an Expander, a computing system using non-volatile memory host controller interface specification (Non-Volatile Memory Express, NVMe) pass-through disks, and a computing system using NVMe with a PCIe switch (PCIe SW).
The link topology may also be represented by a link topology table as follows:
Table 2 Link topology table
NA in Table 2 indicates not applicable, that is, the FRU is empty for this link type.
In another implementation, the link fault location system 10 may also number all FRUs and then generate a link topology map or a link topology table from the numbered FRUs. For ease of description, generating a link topology table is used as an example.
In this example, the numbers (also referred to as identifiers, IDs) of all FRUs are shown in the following table:
Table 3 FRU numbering results
Then, the link failure positioning system 10 generates the following link topology table in terms of the hardware connection relationship:
Table 4 Link topology table
It should be noted that, the method for generating the link topology can be applied to a hardware system of flexible component networking, and also can be applied to a hardware system of fixed component networking, so that the method has higher availability.
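The numbering and table generation can be sketched in Python as follows. The FRU names, their numbering, and the connection paths are hypothetical examples and do not reproduce the contents of Table 3 or Table 4.

```python
def number_frus(fru_names):
    """Assign consecutive IDs, starting at 1, to all FRUs."""
    return {name: fru_id for fru_id, name in enumerate(fru_names, start=1)}


def build_topology_table(connections, fru_ids):
    """Map each link type to the ordered FRU IDs along that link."""
    return {
        link_type: [fru_ids[name] for name in path]
        for link_type, path in connections.items()
    }


# Made-up FRUs and hardware connection relationships.
fru_ids = number_frus(["CPU", "RAID card", "SAS cable", "backplane", "hard disk"])
connections = {
    "SAS without expander": ["CPU", "RAID card", "SAS cable", "backplane", "hard disk"],
    "NVMe pass-through": ["CPU", "backplane", "hard disk"],
}
topology_table = build_topology_table(connections, fru_ids)
```

Because the table is derived from the connection relationships at run time, the same sketch covers both fixed and flexible component networking.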
The foregoing embodiments of the present application may be implemented alone or in any combination, to solve the problem that it is difficult to locate the fault source accurately in one attempt when a link fault occurs on the various high-speed and low-speed links of a computing system (such as a server) during operation. The embodiments can more accurately identify the faulty link, diagnose the faulty FRU or the risk FRUs suspected of being faulty on that link, and raise an alarm for the faulty FRU or the top n risk FRUs.
Based on the foregoing method embodiments, the present application further provides a link failure positioning system 10. The link fault location system 10 of the present application is described below with reference to the accompanying drawings.
Referring to the schematic structure of a link failure localization system 10 shown in fig. 2, the link failure localization system 10 includes:
The parameter and log acquisition module 102 is configured to obtain a fault log of a hardware system in a computing system to be detected, where the fault log records historical data of fault monitoring parameters of the hardware system, an abnormal event, and replacement records of field replaceable units FRUs in the hardware system, and obtain current data of the fault monitoring parameters of the hardware system;
A preprocessing module 104, configured to extract a fault feature sequence of the hardware system according to the fault log and current data of fault monitoring parameters of the hardware system;
And the link fault positioning module 106 is configured to predict a fault location of the FRU granularity according to the fault feature sequence of the hardware system, where the fault location of the FRU granularity is predicted by combining training results based on labeling data, and the labeling data is labeled with a fault FRU.
The specific implementation of the parameter and log collection module 102 may be described with reference to the content related to S302 and S304 in the embodiment shown in fig. 3, the specific implementation of the preprocessing module 104 may be described with reference to the content related to S306 in the embodiment shown in fig. 3, and the specific implementation of the link failure positioning module 106 may be described with reference to the content related to S308 in the embodiment shown in fig. 3.
In some possible implementations, the training results include a link failure localization model;
the link failure location module 106 is specifically configured to:
Inputting the fault characteristic sequence of the hardware system into the link fault location model to obtain the fault probability of FRU in the hardware system;
And determining the fault position of FRU granularity according to the fault probability of FRU in the hardware system.
In some possible implementations, the link failure location module 106 is specifically configured to:
When the fault probability of a FRU is greater than a set threshold, determining that the fault location includes the FRU whose fault probability is greater than the set threshold; and when the fault probability of no FRU in the hardware system is greater than the set threshold, determining that the fault location is one or more of the top n FRUs ranked by fault probability from high to low.
In some possible implementations, the link failure location system 10 further includes:
The construction module 109 is configured to obtain training data, where the training data includes a fault log of a hardware system in a computing system deployed in a production environment, extract a fault feature sequence from the training data, label a fault FRU for the fault feature sequence, obtain the label data, train a classifier according to the label data, and obtain the link fault location model.
The specific implementation of the training of the link failure positioning model by the construction module 109 may be described with reference to fig. 4, and will not be described herein.
In some possible implementations, the training result includes a link failure location rule base including at least one link failure location rule;
the link failure location module 106 is specifically configured to:
Matching the fault feature sequence of the hardware system with the at least one link fault location rule;
And when the fault feature sequence of the hardware system is successfully matched with a target rule in the at least one link fault location rule, determining the fault position of FRU granularity from the target rule.
In some possible implementations, the training results include a link failure localization model and a link failure localization rule base including at least one link failure localization rule;
The link failure location module 106 is specifically configured to:
Matching the fault feature sequence of the hardware system with the at least one link fault location rule;
And when the matching fails, inputting the fault feature sequence of the hardware system into the link fault location model to obtain the fault location at FRU granularity.
The specific implementation of the link fault location module 106 for performing fault location with FRU granularity based on the link fault location rule and the link fault location model may be described with reference to fig. 5.
In some possible implementations, the preprocessing module 104 is further configured to:
Generating a link topology of the hardware system according to networking information of the hardware system, wherein the link topology records the link types of the links in the hardware system and the FRUs included in the links of at least one link type;
the link failure location module 106 is specifically configured to:
predicting the fault location of FRU granularity according to the fault characteristic sequence of the hardware system and the link topology of the hardware system.
In some possible implementations, the fault monitoring parameters include one or more of temperature, current, voltage, in-position state, or insertion-stable state; the abnormal event includes one or more of a sideband detection abnormality, a lighting abnormality, or a reliability, availability, and serviceability (RAS) abnormality; and the RAS abnormality includes a data access abnormality that occurs while a chip, a controller, a bus, or an input/output (IO) peripheral is running.
In some possible implementations, the preprocessing module 104 is specifically configured to:
Extracting one or more of the occurrence times, occurrence time, event main body or links where the events are located of the abnormal events from the fault log;
and constructing a fault characteristic sequence of the hardware system according to one or more of the occurrence times, the occurrence time, the event main body or the link where the event is located, the historical data and the current data of the fault monitoring parameters.
Based on the foregoing link failure positioning method and the link failure positioning system 10, the present application further provides a baseboard management controller, which is also called a motherboard management control unit. The baseboard management controller includes a processor and a memory. The memory may store computer readable instructions, such as program instructions of the link fault location system 10, and the processor executes the computer readable instructions to cause the baseboard management controller to perform the link fault location method described above.
Based on the foregoing link failure positioning method and the link failure positioning system 10, the present application further provides a computing device 700. As shown in fig. 7, computing device 700 includes a bus 702, a processor 704, a memory 706, and a communication interface 708. The processor 704, the memory 706, and the communication interface 708 communicate with one another via bus 702. Computing device 700 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in computing device 700.
Bus 702 may be a peripheral component interconnect (peripheral component interconnect, PCI) bus, an extended industry standard architecture (extended industry standard architecture, EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 7, but this does not mean that there is only one bus or only one type of bus. Bus 702 may include a path for transferring information between the components of computing device 700 (e.g., memory 706, processor 704, and communication interface 708).
The processor 704 may include any one or more of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a microprocessor (micro processor, MP), or a digital signal processor (digital signal processor, DSP).
The memory 706 may include volatile memory (volatile memory), such as random access memory (random access memory, RAM). The memory 706 may also include non-volatile memory (non-volatile memory), such as read-only memory (read-only memory, ROM), flash memory, a hard disk drive (hard disk drive, HDD), or a solid state drive (solid state drive, SSD). The memory 706 stores executable program code, and the processor 704 executes the executable program code to implement the aforementioned link failure localization method. Specifically, the memory 706 stores instructions of the link failure location system 10 for performing the link failure location method.
Communication interface 708 enables communication between computing device 700 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, or the like.
The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
As shown in fig. 8, the cluster of computing devices includes at least one computing device 700. The memory 706 in one or more computing devices 700 in the cluster of computing devices may store the same instructions of the link failure location system 10 for performing the link failure location method.
In some possible implementations, one or more computing devices 700 in the cluster of computing devices may also be used to execute some of the instructions of the link failure location system 10 for performing the link failure location method. In other words, a combination of one or more computing devices 700 may collectively execute instructions of link failure localization system 10 for performing a link failure localization method.
It should be noted that the memory 706 in different computing devices 700 in a cluster of computing devices may store different instructions for performing part of the functions of the link fault localization system 10.
Fig. 9 shows one possible implementation. As shown in fig. 9, two computing devices 700A and 700B are connected through a communication interface 708. Instructions for performing the functions of the parameter and log acquisition module 102 and the preprocessing module 104 are stored in the memory of computing device 700A. Instructions for performing the functions of the link failure location module 106 are stored in the memory of computing device 700B. In other words, the memories 706 of computing devices 700A and 700B collectively store the instructions of the link failure location system 10 for performing the link failure location method.
The connection manner between the computing devices shown in fig. 9 takes into account that the link failure positioning method provided by the present application requires a large amount of computation for inference or matching in order to predict the fault location of FRU granularity. Accordingly, the functions implemented by the link failure location module 106 are assigned to computing device 700B. Further, the functions of the output module 108 and the build module 109 may also be performed by computing device 700B.
It should be appreciated that the functionality of computing device 700A shown in fig. 9 may also be performed by multiple computing devices 700. Likewise, the functionality of computing device 700B may also be performed by multiple computing devices 700.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. The network may be a wide area network, a local area network, or the like. Fig. 10 shows one possible implementation. As shown in fig. 10, two computing devices 700C and 700D are connected by a network. Specifically, each computing device connects to the network through its communication interface. In this type of possible implementation, the memory 706 in computing device 700C stores instructions for performing the functions of the parameter and log acquisition module 102 and the preprocessing module 104. Meanwhile, the memory 706 in computing device 700D stores instructions for performing the functions of the link failure location module 106.
The connection manner between the computing devices shown in fig. 10 takes into account that the link failure positioning method provided by the present application requires a large amount of computation for inference or matching in order to predict the fault location of FRU granularity. Accordingly, the functions implemented by the link failure location module 106 are assigned to computing device 700D.
It should be appreciated that the functionality of computing device 700C shown in fig. 10 may also be performed by multiple computing devices 700. Likewise, the functionality of computing device 700D may also be performed by multiple computing devices 700.
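As a non-limiting illustration, the division of work between two computing devices might be sketched as follows. One device acquires and preprocesses data, and the other performs fault location. The JSON message layout, the function names `device_c_prepare` and `device_d_locate`, and the identifier `server-01` are assumptions made for this sketch. The network transport itself is omitted, and the application does not specify a message format.

```python
import json
from typing import Callable, List


def device_c_prepare(features: List[float], system_id: str) -> bytes:
    """Acquisition and preprocessing side: serialize the feature sequence for sending."""
    message = {"system_id": system_id, "features": features}
    return json.dumps(message).encode("utf-8")


def device_d_locate(payload: bytes,
                    locate: Callable[[List[float]], str]) -> dict:
    """Fault location side: decode the message and run the compute-heavy step."""
    message = json.loads(payload.decode("utf-8"))
    fru = locate(message["features"])
    return {"system_id": message["system_id"], "fru": fru}


# The payload would travel over the network between the two devices;
# here it is passed directly to keep the sketch self-contained.
payload = device_c_prepare([2.0, 160.0], "server-01")
result = device_d_locate(payload, lambda f: "nic" if f[0] >= 2 else "unknown")
```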
The embodiment of the application also provides a computer readable storage medium. The computer readable storage medium may be any usable medium that a computing device can store, or a data storage device, such as a data center, that contains one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc. The computer readable storage medium includes instructions that instruct a computing device to perform the above-described link failure localization method applied to the link failure localization system 10.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product that contains instructions and that is capable of running on a computing device or being stored in any usable medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform the above-described link failure localization method.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the protection scope of the technical solution of the embodiments of the present invention.