CN120011127A - Equipment failure handling method, electronic equipment, storage medium and product - Google Patents

Info

Publication number
CN120011127A
Authority
CN
China
Prior art keywords
fault
processor
faulty device
processing
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202510489529.2A
Other languages
Chinese (zh)
Other versions
CN120011127B (en
Inventor
贾帅帅
李盛新
芦飞
孙秀强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202510489529.2A priority Critical patent/CN120011127B/en
Publication of CN120011127A publication Critical patent/CN120011127A/en
Application granted granted Critical
Publication of CN120011127B publication Critical patent/CN120011127B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0745 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in an input/output transactions management context
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0778 Dumping, i.e. gathering error/state information after a fault for later diagnosis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present disclosure provides a device fault handling method, an electronic device, a storage medium and a product. The method includes: performing initialization configuration on a processor based on a fault handling mechanism supported by the processor; determining, in response to fault information of a faulty device, the connection mode between the faulty device and the processor; and performing fault handling on the faulty device based on the connection mode and/or the initialization configuration. The method configures different fault handling mechanisms for the processor according to what the processor supports, and applies different fault handling mechanisms to the faulty device according to how the faulty device is connected to the processor. This avoids the interruption of normal PCIe devices that results from applying EDPC technology to every type of fault report, and improves the granularity and accuracy of fault handling.

Description

Equipment fault processing method, electronic equipment, storage medium and product
Technical Field
The present disclosure relates to the field of computers, and in particular to a device fault handling method, an electronic device, a storage medium, and a product.
Background
With the development of computer technology, high-speed serial computer expansion bus (Peripheral Component Interconnect Express, PCIe) devices have become an integral part of current computer systems. However, as PCIe technology evolves, PCIe devices operate at ever higher speeds, and their faults become more frequent.
When a PCIe device reports a fault, the processor generally handles every type of fault report with enhanced downstream port containment (Enhanced Downstream Port Containment, EDPC) technology. As a result, when the processor is connected to multiple PCIe devices and EDPC is applied to a fault on one or more of them, all of the PCIe devices are disconnected from the processor, which affects the operation of the normal PCIe devices.
Disclosure of Invention
The present disclosure provides a device fault handling method, an electronic device, a storage medium and a product. The method includes: performing initialization configuration on a processor based on a fault handling mechanism supported by the processor; determining, in response to fault information of a faulty device, the connection mode between the faulty device and the processor; and performing fault handling on the faulty device based on the connection mode and/or the initialization configuration. Different fault handling mechanisms are configured for the processor according to what the processor supports, and different fault handling mechanisms are applied to the faulty device according to how it is connected to the processor. This avoids the interruption of normal PCIe devices that results from applying EDPC technology to every type of fault report, and improves the granularity and accuracy of fault handling.
An embodiment of a first aspect of the present disclosure provides a device fault handling method, including: performing initialization configuration on a processor based on a fault handling mechanism supported by the processor; determining, in response to fault information of a faulty device, the connection mode between the faulty device and the processor; and performing fault handling on the faulty device based on the connection mode and/or the initialization configuration.
In some embodiments of the present disclosure, performing initialization configuration on the processor based on the fault handling mechanism supported by the processor includes: when the processor supports root port programmable input/output, masking a first fault handling mechanism of the processor and configuring a second fault handling mechanism and/or a third fault handling mechanism for the processor; and when the processor does not support root port programmable input/output, configuring the first fault handling mechanism for the processor.
In some embodiments of the present disclosure, the first fault handling mechanism includes triggering port containment processing whenever the faulty device has a fault; the second fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is a configuration space unsupported request; and the third fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is a completion timeout error, and triggering port containment processing when the fault is an error other than a completion timeout error.
In some embodiments of the present disclosure, the connection mode between the faulty device and the processor includes a direct connection between the faulty device and the processor, and an indirect connection between the faulty device and the processor through a switch.
In some embodiments of the present disclosure, performing fault handling on the faulty device based on the connection mode and/or the initialization configuration includes: triggering port containment processing when the initialization configuration is the first fault handling mechanism; and performing fault handling on the faulty device based on the connection mode when the initialization configuration is the second fault handling mechanism or the third fault handling mechanism.
In some embodiments of the present disclosure, when the initialization configuration is the second fault handling mechanism or the third fault handling mechanism, performing fault handling on the faulty device based on the connection mode includes: using the second fault handling mechanism when the faulty device is directly connected to the processor; and using the third fault handling mechanism when the faulty device is indirectly connected to the processor through the switch.
In some embodiments of the present disclosure, when the faulty device is directly connected to the processor, performing fault handling on the faulty device using the second fault handling mechanism includes: determining, based on the fault information, whether the fault of the faulty device is a configuration space unsupported request; and performing fault handling on the faulty device based on the result of the determination.
In some embodiments of the present disclosure, performing fault handling on the faulty device based on the result of the determination includes: when the fault of the faulty device is a configuration space unsupported request, generating first report information for the fault and not triggering port containment processing; and when the fault of the faulty device is not a configuration space unsupported request, triggering port containment processing.
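The decision made for a directly connected faulty device can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `FaultType`, `Action` and `handle_direct_fault` are hypothetical, and a real system would read the fault type from error status registers rather than receive it as an enum.

```python
from enum import Enum, auto


class FaultType(Enum):
    """Hypothetical fault categories used only for this illustration."""
    CONFIG_UNSUPPORTED_REQUEST = auto()  # UR raised by a configuration space request
    COMPLETION_TIMEOUT = auto()
    OTHER = auto()


class Action(Enum):
    REPORT_ONLY = auto()        # generate the first report information, no containment
    PORT_CONTAINMENT = auto()   # trigger port containment processing


def handle_direct_fault(fault: FaultType) -> Action:
    """Second fault handling mechanism for a device attached directly to the processor."""
    if fault is FaultType.CONFIG_UNSUPPORTED_REQUEST:
        return Action.REPORT_ONLY
    # Any other fault on a directly connected device is contained.
    return Action.PORT_CONTAINMENT
```

Under this sketch only the configuration space unsupported request avoids containment; a completion timeout on a directly connected device is still contained.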
In some embodiments of the present disclosure, when the faulty device is indirectly connected to the processor through the switch, performing fault handling on the faulty device using the third fault handling mechanism includes: determining, based on the fault information, whether the fault of the faulty device is a completion timeout error; and performing fault handling on the faulty device based on the result of the determination.
In some embodiments of the present disclosure, performing fault handling on the faulty device based on the result of the determination includes: when the fault of the faulty device is not a completion timeout error, triggering port containment processing; and when the fault of the faulty device is a completion timeout error, generating second report information for the fault and acquiring the header information log corresponding to the fault, so as to determine from the header information log whether to perform port containment processing.
In some embodiments of the present disclosure, determining whether to perform port containment processing based on the header information log includes: determining the space address indicated by the header information log based on the type of the header information log; and determining whether the device corresponding to the space address is the faulty device, so as to decide whether to perform port containment processing.
In some embodiments of the present disclosure, determining the space address indicated by the header information log based on the type of the header information log includes: when the type of the header information log is a memory read/write type or an input/output read/write type, determining the space address indicated by the header information log from the information address of the header information log, and determining a first information set from that space address; and when the type of the header information log is a configuration space read/write type, determining the first information set from the information address offset of the header information log.
In some embodiments of the present disclosure, determining whether the device corresponding to the space address is the faulty device so as to decide whether to perform port containment processing includes: not performing port containment processing when the device corresponding to the first information set is the faulty device, or is the switch connected to the faulty device; and performing port containment processing when the device corresponding to the first information set is neither the faulty device nor the switch connected to the faulty device.
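The completion timeout branch described above can be sketched as a single predicate. This is a hedged illustration: `CompletionFault`, its fields and `needs_port_containment` are hypothetical names, devices are identified here by a (bus, device, function) tuple, and treating a missing header log as "contain" is an assumption of this sketch rather than something the disclosure states.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Bdf = Tuple[int, int, int]  # (bus, device, function); an assumed identifier format


@dataclass(frozen=True)
class CompletionFault:
    is_completion_timeout: bool
    faulty_device: Bdf                       # device that reported the fault
    upstream_switch: Bdf                     # switch the faulty device is attached to
    header_log_target: Optional[Bdf] = None  # device resolved from the header information log


def needs_port_containment(fault: CompletionFault) -> bool:
    """Third fault handling mechanism for a device reached through a switch."""
    if not fault.is_completion_timeout:
        return True  # any error other than a completion timeout is contained
    # The timed-out request was aimed at the faulty device or its switch:
    # report only, so that healthy devices keep running.
    if fault.header_log_target in (fault.faulty_device, fault.upstream_switch):
        return False
    # Assumption: an unrelated or unknown target falls back to containment.
    return True
```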
In some embodiments of the present disclosure, the method further includes determining the type of the header information log based on the memory-mapped space requested by the faulty device.
In some embodiments of the present disclosure, the first information set includes at least one of bus information of the faulty device, device information of the faulty device, and function information of the faulty device.
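As an illustration of what bus, device and function information can look like in practice, the sketch below decodes a conventional 16-bit PCIe identifier (8-bit bus, 5-bit device, 3-bit function). Whether the first information set is actually carried in this form is an assumption; `Bdf`, `decode_routing_id` and `format_bdf` are hypothetical helper names.

```python
from typing import NamedTuple


class Bdf(NamedTuple):
    bus: int
    device: int
    function: int


def decode_routing_id(routing_id: int) -> Bdf:
    """Split a 16-bit identifier into bus [15:8], device [7:3] and function [2:0]."""
    if not 0 <= routing_id <= 0xFFFF:
        raise ValueError("routing ID must fit in 16 bits")
    return Bdf(
        bus=routing_id >> 8,
        device=(routing_id >> 3) & 0x1F,
        function=routing_id & 0x7,
    )


def format_bdf(bdf: Bdf) -> str:
    """Render in the familiar bus:device.function notation."""
    return f"{bdf.bus:02x}:{bdf.device:02x}.{bdf.function}"
```

For example, `format_bdf(decode_routing_id(0x3A08))` yields `"3a:01.0"`.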
In some embodiments of the present disclosure, the method further includes: determining device identity information of the faulty device and the cause of the fault based on the first report information or the second report information; and performing fault handling with a preset solution based on the device identity information and the cause of the fault.
In some embodiments of the present disclosure, the port containment processing includes: unloading the driver of the faulty device and removing the device tag of the faulty device; releasing the link state between the faulty device and the processor; re-enumerating the faulty device; and re-enabling the driver of the faulty device.
An embodiment of a second aspect of the present disclosure proposes an electronic device comprising a processor and a memory for storing a computer program capable of running on the processor, wherein the processor is adapted to perform the method described in the embodiment of the first aspect of the present disclosure when the computer program is run.
Embodiments of a third aspect of the present disclosure propose a non-transitory computer readable storage medium storing computer commands for causing a computer to perform the method described in the embodiments of the first aspect of the present disclosure.
An embodiment of a fourth aspect of the present disclosure proposes a computer program product comprising a computer program which, when executed by a processor, implements the method described in an embodiment of the first aspect of the present disclosure.
In summary, the device fault handling method provided by the present disclosure includes: performing initialization configuration on a processor based on a fault handling mechanism supported by the processor; determining, in response to fault information of a faulty device, the connection mode between the faulty device and the processor; and performing fault handling on the faulty device based on the connection mode and/or the initialization configuration. Different fault handling mechanisms are configured for the processor according to what the processor supports, and different fault handling mechanisms are applied to the faulty device according to how it is connected to the processor. This avoids the interruption of normal PCIe devices that results from applying EDPC technology to every type of fault report, and improves the granularity and accuracy of fault handling.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
Fig. 1 is a schematic flow chart of a device fault handling method according to an embodiment of the present disclosure;
Fig. 2 is a schematic view of a scenario of a device fault handling method according to an embodiment of the present disclosure;
Fig. 3 is a schematic view of a scenario of another device fault handling method according to an embodiment of the present disclosure;
Fig. 4 is a flow chart of a device fault handling method according to an embodiment of the present disclosure;
Fig. 5 is a flow chart of another device fault handling method according to an embodiment of the present disclosure;
Fig. 6 is a flow chart of another device fault handling method according to an embodiment of the present disclosure;
Fig. 7 is a flow chart of another device fault handling method according to an embodiment of the present disclosure;
Fig. 8 is a flow chart of another device fault handling method according to an embodiment of the present disclosure;
Fig. 9 is a flow chart of another device fault handling method according to an embodiment of the present disclosure;
Fig. 10 is a flow chart of an exemplary device fault handling method according to an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of another device fault handling apparatus according to an embodiment of the present disclosure;
Fig. 12 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the development of computer technology, PCIe devices have become an integral part of current computer systems. However, as PCIe technology evolves, PCIe devices operate at ever higher speeds, and their faults become more frequent. The operating system can repair some of these faults automatically, but most of them cannot be repaired automatically, so the machine crashes or restarts and the operating system cannot run normally.
To address system downtime caused by fatal faults of PCIe devices, the industry developed EDPC technology. EDPC is an error isolation and recovery technique for the PCIe bus. When a link error is detected (for example, a malformed transaction layer packet or an unexpected link down), it disables the single affected PCIe link and forcibly terminates outstanding requests, which prevents the error from propagating and protects the operating system from potentially bad data.
In particular, the related art generally uses the PCIe advanced error reporting (Advanced Error Reporting, AER) mechanism to trigger EDPC. AER is a mechanism for detecting and reporting errors that occur in PCIe devices; it allows PCIe devices to detect and report various types of errors, such as correctable, non-fatal, and fatal errors. AER implements a set of registers and corresponding error notification mechanisms on the PCIe device, and these registers can be read to obtain information about the error. With AER, the system can better monitor and handle error conditions of PCIe devices and thereby improve data integrity and reliability.
Specifically, EDPC and AER cooperate to isolate and repair PCIe devices. As shown in Fig. 1, the method includes the following steps:
① The processor detects that the device has an uncorrectable error (i.e., an error that the operating system cannot automatically fix).
② The processor sends a signal to the operating system for error handling.
③ The operating system battery management module notifies the device driver to unload the driver of the device connected to the processor and removes the device tag of the device, to prevent subsequent memory-mapped input/output (Memory-Mapped Input/Output, MMIO) space access.
④ The operating system battery management module restores the device by releasing the link (Link) state of the device, re-enumerating the device, and re-enabling the device driver of the device.
The related art can accurately isolate and repair a faulty device when a PCIe root port of the processor corresponds to a single device. However, when the PCIe root port corresponds to multiple devices, that is, when the device attached to the PCIe root port is a PCIe switch (PCIe Switch) device (as shown in Fig. 2), and one of those devices fails (for example, device 1), PCIe switch 1 triggers DPC, which causes an associated timeout fault at the PCIe root port. The processor then triggers DPC as well, interrupting the operation of normal devices.
In addition, while a device is being enumerated as part of DPC fault handling, some PCIe configuration (cfg) instructions generate PCIe unsupported request (Unsupported Request, UR) errors because the device has not yet been initialized. The related art usually avoids these false alarms by masking UR errors, but once the operating system has finished enumerating the device and is running normally, the masked PCIe UR errors can lead to more serious errors at the processor.
Technical terms related to the present disclosure are described below:
Root Port Programmable I/O (RP PIO) is a mechanism in the PCIe architecture for managing errors encountered when a root port (Root Port) sends non-posted requests (Non-Posted Requests). It provides fine-grained control over uncorrectable errors and advisory errors. The RP PIO error control registers provide fine-grained error management for non-posted requests, allowing flexible configuration by request type (configuration, input/output, memory) and error type.
In order to solve the technical problems in the related art, embodiments of the present disclosure provide an apparatus fault handling method.
The application scenario to which the present disclosure relates is first described by way of example:
As shown in fig. 3, the processor is directly connected to the device 1 through a PCIe port, the device 2 is connected to the PCIe port of the processor through the PCIe switch 1, and the device 3 is connected to the PCIe port of the processor through the PCIe switch 2.
It should be understood that fig. 3 is only an exemplary illustration, and that the application scenario of the present disclosure may include some or all of the devices shown in fig. 3, and the present disclosure is not limited to the number of devices directly connected to the processor, and is not limited to the number of devices connected to the processor through the PCIe switch. For example, the application scenario of the present disclosure may include only one or more devices directly connected to the processor, or only one or more devices indirectly connected to the processor through a PCIe switch, or one or more devices directly connected to the processor, and one or more devices indirectly connected to the processor.
Embodiments of the present disclosure will be described in detail below.
As shown in Fig. 4, an embodiment of the present disclosure provides a device fault handling method, including the following steps:
Step 101, performing initialization configuration on the processor based on a fault handling mechanism supported by the processor.
In some embodiments, different initialization configurations may be applied to the processor depending on whether the processor supports root port programmable input/output (Root Port Programmable I/O, RP PIO).
In some embodiments, when the processor does not support root port programmable input/output, the initialization configuration of the processor may be to trigger port containment processing whenever the faulty device has a fault. That is, the fault type of the faulty device is not distinguished, and every fault is handled uniformly with port containment processing.
In some embodiments, when the processor supports root port programmable input/output, the initialization configuration of the processor may apply port containment processing to some fault types, while other fault types are only reported without port containment processing. Different fault types are thus handled in different ways, which avoids affecting the operation of normal devices.
In some embodiments, the port containment processing may be EDPC processing, downstream port containment (Downstream Port Containment, DPC) processing, or a modified process based on DPC processing, etc., which is not limited in the present disclosure.
Step 102, determining, in response to fault information of a faulty device, the connection mode between the faulty device and the processor.
In some embodiments, the fault information indicates that a device connected to the processor has a fault.
In some embodiments, when the fault information is received, the processor may query the connection information of the faulty device to determine the connection mode between the faulty device and the processor, so that a targeted fault handling approach can be applied to the faulty device in combination with the initialization configuration of the processor. The present disclosure does not limit the technical means used to determine the connection mode.
In some embodiments, the connection mode between the faulty device and the processor includes a direct connection between the faulty device and the processor, and an indirect connection between the faulty device and the processor through a switch. The switch may be a PCIe switch.
A direct connection may mean that the faulty device is connected directly to a PCIe root port of the processor. An indirect connection through a PCIe switch may mean that the faulty device is connected to a PCIe root port of the processor through the PCIe switch.
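One way to picture the connection mode lookup is a parent map of the PCIe topology. The sketch below is illustrative only: the disclosure does not limit how the connection mode is determined, and the names `Connection`, `connection_mode`, `parent_of` and `root_ports`, as well as the example port and device labels, are hypothetical.

```python
from enum import Enum
from typing import Dict, Set


class Connection(Enum):
    DIRECT = "direct"          # attached straight to a root port
    VIA_SWITCH = "via_switch"  # attached behind a PCIe switch


def connection_mode(device: str,
                    parent_of: Dict[str, str],
                    root_ports: Set[str]) -> Connection:
    """Classify a device by what sits immediately upstream of it."""
    upstream = parent_of[device]
    if upstream in root_ports:
        return Connection.DIRECT
    return Connection.VIA_SWITCH


# Topology loosely modelled on the scenario of Fig. 3 (labels are assumptions).
ROOT_PORTS = {"rp0", "rp1", "rp2"}
PARENT_OF = {
    "device1": "rp0",
    "switch1": "rp1", "device2": "switch1",
    "switch2": "rp2", "device3": "switch2",
}
```

With this map, `connection_mode("device1", PARENT_OF, ROOT_PORTS)` is `Connection.DIRECT`, while `device2` and `device3` are classified as `Connection.VIA_SWITCH`.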
Step 103, performing fault handling on the faulty device based on the connection mode and/or the initialization configuration.
In some embodiments, the initialization configuration is to trigger port containment processing whenever the faulty device has a fault. In this case the faulty device is handled with port containment processing regardless of whether it is connected to the processor directly or indirectly; that is, fault handling is performed directly on the basis of the initialization configuration.
In some embodiments, the initialization configuration of the processor applies port containment processing to some fault types, while other fault types are only reported without port containment processing. Because the fault types that require port containment processing differ between connection modes, the connection mode between the faulty device and the processor is used to decide whether the fault type of the faulty device requires port containment processing, and the faulty device is handled accordingly.
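The way step 103 combines the initialization configuration with the connection mode can be sketched as a small selector. The enum and function names below are hypothetical, and the sketch assumes that a fine-grained configuration has both the second and third mechanisms available.

```python
from enum import Enum


class InitConfig(Enum):
    FIRST = "first"                # processor without RP PIO support
    FINE_GRAINED = "second/third"  # processor with RP PIO support


class Connection(Enum):
    DIRECT = "direct"
    VIA_SWITCH = "via_switch"


def choose_mechanism(config: InitConfig, connection: Connection) -> str:
    """Return which fault handling mechanism applies to a faulty device."""
    if config is InitConfig.FIRST:
        # Connection mode is irrelevant: every fault triggers port containment.
        return "first"
    # Otherwise the connection mode selects the mechanism.
    return "second" if connection is Connection.DIRECT else "third"
```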
In some embodiments, the port containment processing may include unloading the driver of the faulty device and removing the device tag of the faulty device, releasing the link state between the faulty device and the processor, re-enumerating the faulty device, and re-enabling the driver of the faulty device, thereby isolating and repairing the faulty device.
In summary, the device fault handling method provided by the present disclosure includes: performing initialization configuration on a processor based on a fault handling mechanism supported by the processor; determining, in response to fault information of a faulty device, the connection mode between the faulty device and the processor; and performing fault handling on the faulty device based on the connection mode and/or the initialization configuration. Different fault handling mechanisms are configured for the processor according to what the processor supports, and different fault handling mechanisms are applied to the faulty device according to how it is connected to the processor. This avoids the interruption of normal PCIe devices that results from applying EDPC technology to every type of fault report, and improves the granularity and accuracy of fault handling.
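The ordering of the port containment steps can be sketched as a fixed sequence of callbacks. This is only an outline of the order described above: the step names and the `run_port_containment` function are hypothetical, and each callback stands in for platform-specific work that the sketch does not attempt to implement.

```python
from typing import Callable, Dict, List

# Order follows the description: isolate first, then recover.
STEPS = (
    "unload_driver",
    "remove_device_tag",
    "release_link_state",
    "re_enumerate",
    "re_enable_driver",
)


def run_port_containment(device: str,
                         actions: Dict[str, Callable[[str], None]]) -> List[str]:
    """Run every containment step in order and return the steps that completed."""
    completed: List[str] = []
    for step in STEPS:
        actions[step](device)  # a raised exception stops the sequence here
        completed.append(step)
    return completed


# Usage with placeholder actions that only record what would be done.
trace: List[str] = []
placeholder_actions = {
    step: (lambda dev, step=step: trace.append(f"{step}({dev})")) for step in STEPS
}
```

Calling `run_port_containment("device1", placeholder_actions)` fills `trace` with the five steps in the order listed in `STEPS`.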
Fig. 5 further illustrates a flow chart of an apparatus fault handling method set forth in the present disclosure. Further explained based on the embodiment shown in fig. 1, fig. 5 may include the following steps.
In step 201, when the processor supports root port programmable input/output (RP PIO), the first fault handling mechanism of the processor is masked, and the second fault handling mechanism and/or the third fault handling mechanism is configured for the processor.
In some embodiments, the first failure handling mechanism includes triggering a port containment process when a failed device fails.
In some embodiments, the second fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is an unsupported configuration space request, and triggering port containment processing when the fault of the faulty device is not an unsupported configuration space request.
In some embodiments, the third fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is a completion timeout error, and triggering port containment processing when the fault of the faulty device is an error other than a completion timeout error.
In some embodiments, the second failure handling mechanism is used when the failed device is directly connected to the processor and the third failure handling mechanism is used when the failed device is indirectly connected to the processor through the PCIe switch.
In some embodiments, the unsupported configuration space request may be an Unsupported Request (UR) error returned for a configuration space request, but is not limited thereto.
In some embodiments, the completion timeout (Completion Timeout) error may be a configuration space request timeout, an input output request timeout, a memory request timeout, etc., which is not limited by the present disclosure.
In some alternative embodiments, only the second fault handling mechanism may be configured for the processor when all devices connected to the processor are directly connected, only the third fault handling mechanism may be configured for the processor when all devices connected to the processor are indirectly connected, and both the second fault handling mechanism and the third fault handling mechanism may be configured for the processor when the devices connected to the processor include both direct and indirect connections.
Step 202, when the processor does not support the root port programmable input output, configuring a first failure handling mechanism for the processor.
In some embodiments, when the processor does not support the root port programmable input/output, that is, the processor does not support adopting different fault handling modes according to the fault type, the first fault handling mechanism may be directly configured for the processor, so that the fault handling can be performed when the device connected with the processor fails.
In summary, the device fault handling method provided by the present disclosure includes: when the processor supports root port programmable input/output, masking the first fault handling mechanism of the processor and configuring the second fault handling mechanism and/or the third fault handling mechanism for the processor; and when the processor does not support root port programmable input/output, configuring the first fault handling mechanism for the processor. In this method, different fault handling mechanisms are configured for the processor according to its capability, which improves both the granularity and the accuracy of the processor's fault processing.
Fig. 6 further illustrates a flow chart of a device fault handling method set forth in the present disclosure. Building on the embodiments shown in fig. 1 and 2, the method shown in fig. 6 may include the following steps.
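The selection made in steps 201 and 202 can be sketched as a small function. This is a minimal sketch under stated assumptions: the enum, the function name, and the boolean parameters are hypothetical, and the case of a processor with no attached devices is not covered by the disclosure.

```python
from enum import Enum
from typing import Set


class Mechanism(Enum):
    FIRST = 1   # any fault triggers port containment processing
    SECOND = 2  # unsupported configuration space request: report only
    THIRD = 3   # completion timeout error: report only, decide later


def initial_configuration(supports_rp_pio: bool,
                          has_direct: bool,
                          has_indirect: bool) -> Set[Mechanism]:
    """Return the mechanisms configured for the processor at initialization."""
    if not supports_rp_pio:
        # Step 202: no per-fault-type handling is possible.
        return {Mechanism.FIRST}
    # Step 201: the first mechanism is masked; pick by connection mode.
    configured = set()
    if has_direct:
        configured.add(Mechanism.SECOND)
    if has_indirect:
        configured.add(Mechanism.THIRD)
    # Assumption: with no devices attached, nothing is configured here.
    return configured
```

For instance, a processor that supports RP PIO and has both directly connected devices and devices behind a PCIe switch ends up with the second and third mechanisms configured.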
Step 301, when the processor does not support the root port programmable input output, configuring a first failure handling mechanism for the processor.
In some embodiments, the principle of step 301 is the same as that of step 202, and reference may be made to the embodiment in step 202 and the related description thereof, which will not be repeated here.
Step 302, in response to the fault information of the faulty device, triggering port containment processing.
In some embodiments, since the initialization of the processor is configured as the first fault handling mechanism, when fault information is received, the processor has no capability to adopt different fault handling modes according to different fault types, and therefore, the processor can directly perform port containment processing on the fault device to perform fault handling.
In summary, the device fault handling method provided by the present disclosure includes configuring a first fault handling mechanism for a processor when the processor does not support root port programmable input/output, and triggering port containment processing in response to fault information of a faulty device. By configuring the first fault handling mechanism for a processor that does not support root port programmable input/output, the method widens the range of processors to which the present disclosure applies, so that port containment processing can still be used for fault handling when a device directly or indirectly connected to such a processor fails.
Fig. 7 further illustrates a flow chart of a device fault handling method set forth in the present disclosure. Building on the embodiments shown in fig. 1 and 2, the method shown in fig. 7 may include the following steps:
In step 401, when the processor supports the root port programmable input/output, the first failure handling mechanism of the processor is masked, and the second failure handling mechanism is configured for the processor, or the second failure handling mechanism and the third failure handling mechanism are configured for the processor.
In some embodiments, the principle of the step 401 is the same as that of the step 201, and reference may be made to the embodiment in the step 201 and the related description thereof, which are not repeated herein.
Step 402, in response to the fault information for the faulty device, determining that the faulty device is directly connected to the processor.
In some embodiments, when the fault information is received, the processor may determine that the fault device is directly connected to the processor by querying the connection information of the fault device, but is not limited thereto, and the technical means adopted in determining the connection manner is not limited in the disclosure.
Step 403, performing fault processing on the faulty device by using the second fault handling mechanism.
In some embodiments, the fault information may further indicate the fault type of the faulty device, so that whether the fault of the faulty device is an unsupported configuration space request can be judged from the fault information, and fault processing is then performed on the faulty device based on the result of the judgment.
Specifically, when the fault of the faulty device is determined from the fault information to be an unsupported configuration space request, the faulty device is handled by generating first report information of the fault without triggering port containment processing.
When the fault of the faulty device is determined from the fault information not to be an unsupported configuration space request (that is, it is some fault other than an unsupported configuration space request), the faulty device is handled by triggering port containment processing.
Specifically, when a device is directly connected to the processor and the processor enumerates the device, a normal device that has not completed initialization can give rise to an unsupported configuration space request fault: enumeration requires the processor to read data from or write data to the device, and because the device has not completed initialization, the read or write cannot be performed, which produces the unsupported configuration space request fault. In this case the normal device has not actually failed but has merely not completed initialization, so port containment processing does not need to be performed on it.
Further, an unsupported configuration space request may also arise not because a normal device has yet to complete initialization but because the device has actually failed. The first report information therefore needs to be generated so that the specific cause of the unsupported configuration space request can be analyzed to determine whether corresponding fault processing should be performed.
In other words, the cause of the unsupported configuration space request can be analyzed from the first report information. When the unsupported configuration space request was not caused by a normal device that has not completed initialization, fault processing is performed on the faulty device using a preset fault processing mode, thereby isolating and repairing the faulty device.
In some embodiments, the first report information may be advisory error report (Advisory Error Report) information, but is not limited thereto.
In summary, the device fault handling method provided by the present disclosure includes: when the processor supports root port programmable input/output, masking the first fault handling mechanism of the processor and configuring the second fault handling mechanism, or the second and third fault handling mechanisms, for the processor; determining, in response to fault information for the faulty device, that the faulty device is directly connected to the processor; and performing fault processing on the faulty device by using the second fault handling mechanism. Because the second fault handling mechanism is configured when the processor supports root port programmable input/output, different types of faults on a device directly connected to the processor are handled differently. When the fault is an unsupported configuration space request, only report information is generated and port containment processing is not triggered, which prevents a normal device that has not completed initialization from triggering fault processing and keeps the device operating normally.
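The handling of a directly connected device described above can be sketched as two small functions: one that decides between reporting and containment, and one that uses the report to decide whether preset fault handling is still needed. The fault-type string, the report fields, and the `device_initialized` flag are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label for a UR error returned for a configuration space request.
UNSUPPORTED_CFG_REQUEST = "cfg_ur"


@dataclass
class Decision:
    trigger_containment: bool
    report: Optional[dict]


def handle_direct_fault(device: str, fault_type: str) -> Decision:
    """Second mechanism: report-only for an unsupported config request."""
    if fault_type == UNSUPPORTED_CFG_REQUEST:
        report = {"device": device, "fault_type": fault_type}
        return Decision(trigger_containment=False, report=report)
    return Decision(trigger_containment=True, report=None)


def needs_preset_handling(report: dict, device_initialized: bool) -> bool:
    """A UR from a device that had finished initializing is a real fault."""
    return report["fault_type"] == UNSUPPORTED_CFG_REQUEST and device_initialized
```

A UR raised while the device was still initializing therefore leads to no further action, whereas a UR from an initialized device is passed on to the preset fault processing mode.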
Fig. 8 further illustrates a flow chart of a device fault handling method set forth in the present disclosure. Building on the embodiments shown in fig. 1 and 2, the method shown in fig. 8 may include the following steps:
Step 501, when the processor supports root port programmable input/output, masking the first fault handling mechanism of the processor and configuring the third fault handling mechanism for the processor, or configuring the second fault handling mechanism and the third fault handling mechanism for the processor.
In some embodiments, the principle of the step 501 is the same as that of the step 201, and reference may be made to the embodiment in the step 201 and the related description thereof, which are not repeated herein.
In step 502, in response to the fault information for the faulty device, it is determined that the faulty device is indirectly connected to the processor.
In some embodiments, when the fault information is received, the processor may determine that the fault device is indirectly connected to the processor by querying the connection information of the fault device, but is not limited thereto, and the technical means adopted in determining the connection manner is not limited in the disclosure.
Step 503, performing fault processing on the faulty device by using the third fault handling mechanism.
In some embodiments, the fault information may also indicate a fault type of the fault device, so that whether the fault of the fault device is a completion timeout error may be determined according to the fault information, and then, based on a result of the determination, fault processing is performed on the fault device.
Specifically, when the fault of the faulty device is determined from the fault information not to be a completion timeout error, the faulty device is handled by triggering port containment processing.
When the fault of the faulty device is determined from the fault information to be a completion timeout error, second report information of the fault is generated, and the message header information log (RP PIO Header Log) corresponding to the fault is acquired so that whether to perform port containment processing can be determined based on the message header information log.
Further, a first information set indicated by the message header information log can be determined according to the type of the message header information log, and whether the device corresponding to the first information set is the faulty device is then judged to determine whether to perform port containment processing.
Specifically, when a device indirectly connected to the processor fails, the PCIe switch connected to the faulty device first triggers port containment processing to handle the fault, and the PCIe switch then sends fault information to the processor. Because the PCIe switch has already triggered port containment processing, the faulty device is already being handled. To prevent the processor from performing port containment processing again because of this fault information, the source of the fault information needs to be checked to determine whether the device that sent it is one for which port containment processing has already been triggered.
Specifically, when the type of the message head information log is a memory read-write type or an input-output read-write type, determining a space address indicated by the message head information log based on an information address of the message head information log to determine a first information set based on the space address, and when the type of the message head information log is a configuration space read-write type, determining the first information set based on an information address offset of the message head information log.
In some embodiments, the type of the message header information log may be determined based on the memory mapping space applied by the fault device, but is not limited thereto, and the present disclosure is not limited to the manner of determining the type of the message header information log.
In some embodiments, when the type of the message header information log is a memory read-write type or an input-output read-write type, the message header information log may directly indicate a space address of a device that has failed, and further determine the device that has failed according to the space address, so as to obtain the first information set.
In some embodiments, when the type of the header information log is a configuration space read-write type, the data of the different locations of the header information may indicate the first information set of the failed device.
In other words, when the type of the header information log is a memory read-write type or an input-output read-write type, specific equipment with faults needs to be determined according to the space address indicated by the header information log so as to determine the first information set, and when the type of the header information log is a configuration space read-write type, the header information log can directly indicate the first information set, so that the header information log can be directly analyzed so as to determine the first information set.
By way of example, taking a 9-bit message header information log of the memory read-write type, the first 3 bits can be used to indicate bus information, the middle 3 bits can be used to indicate device information, and the last 3 bits can be used to indicate function information.
Further, the first information set includes at least one of bus information of the faulty device, device information of the faulty device, and function information of the faulty device.
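The two ways of obtaining the first information set can be sketched as follows. This is an illustration only: the 3/3/3 field layout follows the 9-bit example above with the first field assumed to be the most significant, applying that layout to the configuration-space offset is an assumption of this sketch, and the `mmio_ranges` table and type labels are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class InfoSet(NamedTuple):
    bus: int
    device: int
    function: int


def decode_9bit(value: int) -> InfoSet:
    """Split a 9-bit value into 3-bit bus, device, and function fields."""
    return InfoSet((value >> 6) & 0b111, (value >> 3) & 0b111, value & 0b111)


def lookup_by_address(address: int,
                      mmio_ranges: Dict[InfoSet, Tuple[int, int]]) -> Optional[InfoSet]:
    """Find the device whose [start, end) address range contains the address."""
    for info, (start, end) in mmio_ranges.items():
        if start <= address < end:
            return info
    return None


def first_information_set(log_type: str, value: int,
                          mmio_ranges: Dict[InfoSet, Tuple[int, int]]) -> Optional[InfoSet]:
    if log_type in ("memory", "io"):
        # The log carries an address; the owning device must be looked up.
        return lookup_by_address(value, mmio_ranges)
    if log_type == "config":
        # The log identifies the device directly through its offset fields.
        return decode_9bit(value)
    raise ValueError(f"unknown log type: {log_type}")
```

With this layout, the value `0b101010011` decodes to bus 5, device 2, function 3.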
In some alternative embodiments, whether the fault information is generated due to the fact that the fault device triggers the port containment process can be further determined according to whether the device corresponding to the space address is the fault device, so that the fact that the fault information generated due to the port containment process triggers the port containment process again can be avoided.
Further, when the device corresponding to the first information set is judged to be the faulty device, or is judged to be the PCIe switch connected to the faulty device, port containment processing is not performed. When the device corresponding to the first information set is judged to be neither the faulty device nor the PCIe switch connected to the faulty device, port containment processing is performed.
In other words, when the device corresponding to the first information set is the faulty device or the PCIe switch connected to the faulty device, the fault information was generated by the PCIe switch or the faulty device while the faulty device was undergoing port containment processing. The faulty device is already being handled at this point, so the processor does not need to perform port containment processing again, and port containment processing is therefore not performed.
Conversely, when the device corresponding to the first information set is neither the faulty device nor the PCIe switch connected to the faulty device, the fault information was not generated while the faulty device was undergoing port containment processing. That is, there is currently a faulty device that still requires fault processing, and port containment processing is therefore performed.
For example, taking the devices shown in fig. 3, when device 2 sends fault information to the processor, the processor parses the message header information log. If the source of the fault indicated by the message header information log is determined to be device 2 or PCIe switch 1, port containment processing is not performed and the fault processing ends.
Likewise, taking the devices shown in fig. 3, when device 2 sends fault information to the processor, the processor parses the message header information log. If the source of the fault indicated by the message header information log is determined to be neither device 2 nor PCIe switch 1 (e.g., device 1, device 3, or another PCIe switch), port containment processing is performed.
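The third fault handling mechanism, including the source check described above, can be sketched as a single decision function. The fault-type labels, parameter names, and report fields are hypothetical; the device names follow fig. 3.

```python
from typing import Optional, Tuple

# Hypothetical labels for the three completion timeout (CTO) error kinds.
CTO_TYPES = {"cfg_cto", "io_cto", "mem_cto"}


def handle_indirect_fault(fault_type: str,
                          log_source: str,
                          faulty_device: str,
                          upstream_switch: str) -> Tuple[bool, Optional[dict]]:
    """Return (perform_port_containment, second_report_or_None)."""
    if fault_type not in CTO_TYPES:
        # Any error other than a completion timeout triggers containment.
        return True, None
    # Completion timeout: generate the report, then check the log source.
    report = {"device": faulty_device, "fault_type": fault_type}
    # If the log points at the faulty device or its switch, containment
    # has already been triggered downstream and must not be repeated.
    already_contained = log_source in (faulty_device, upstream_switch)
    return (not already_contained), report
```

In the fig. 3 topology, a completion timeout whose header log points at device 2 or PCIe switch 1 yields no further containment, while one that points at device 3 does.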
In some embodiments, the second reporting information may be fault advisory reporting information, but is not limited thereto.
In some alternative embodiments, the equipment identity information of the fault equipment and the fault reason of the fault equipment can be determined according to the first report information or the second report information, and then the fault processing is performed by using a preset solution mode based on the equipment identity information and the fault reason.
Specifically, the first report information or the second report information may include information such as a device model of the fault device, a fault reason, and the like, and further, according to a preset mapping relation table, a preset solution corresponding to the information such as the device model, the fault reason, and the like in the first report information or the second report information in the mapping relation table is searched by a table lookup manner, so that the fault processing is performed by using the corresponding preset solution.
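The table lookup described above can be sketched with a dictionary keyed by device model and fault cause. Every entry, key, and solution name below is a made-up placeholder; the disclosure does not specify the contents of the mapping relation table.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping table: (device model, fault cause) -> preset solution.
PRESET_SOLUTIONS: Dict[Tuple[str, str], str] = {
    ("model-A", "unsupported_request"): "reset_device",
    ("model-A", "completion_timeout"): "retrain_link",
    ("model-B", "completion_timeout"): "reload_driver",
}


def preset_solution(report: dict,
                    table: Dict[Tuple[str, str], str] = PRESET_SOLUTIONS) -> Optional[str]:
    """Look up the preset solution for the model and cause in a report."""
    key = (report["device_model"], report["fault_cause"])
    # None signals that the table has no entry for this combination.
    return table.get(key)
```

A report for `model-A` with cause `completion_timeout` would map to `retrain_link` in this placeholder table, and an unknown combination returns `None`.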
In summary, the device fault handling method provided by the present disclosure includes: when the processor supports root port programmable input/output, masking the first fault handling mechanism of the processor and configuring the third fault handling mechanism, or the second and third fault handling mechanisms, for the processor; determining, in response to fault information for the faulty device, that the faulty device is indirectly connected to the processor; and performing fault processing on the faulty device by using the third fault handling mechanism. Because the third fault handling mechanism is configured when the processor supports root port programmable input/output, different types of faults on a device indirectly connected to the processor are handled differently. When the fault is a completion timeout error, only report information is generated at first, and whether to trigger port containment processing is then determined from the message header information log, which verifies whether the fault information was generated because port containment processing had already been triggered for the faulty device. This prevents a faulty device for which port containment processing has already been triggered from triggering it again, and keeps non-faulty devices operating normally.
In summary, the present disclosure has the following beneficial effects:
1. Different fault handling mechanisms are configured for the processor according to its supporting capability, and the faulty device is handled with different fault handling mechanisms according to its connection relationship with the processor. This avoids the interruption of normal PCIe devices that occurs when EDPC technology is applied to every type of reported error, and improves the granularity and accuracy of fault processing.
2. When the processor supports root port programmable input/output, the second fault handling mechanism is configured for the processor, so that different types of faults on a device directly connected to the processor are handled differently. When the fault is an unsupported configuration space request, only report information is generated and port containment processing is not triggered, which prevents a normal device that has not completed initialization from triggering fault processing and keeps non-faulty devices operating normally.
3. When the processor supports root port programmable input/output, the third fault handling mechanism is configured for the processor, so that different types of faults on a device indirectly connected to the processor are handled differently. When the fault is a completion timeout error, only report information is generated at first, and whether to trigger port containment processing is then determined from the message header information log, which verifies whether the fault information was generated because port containment processing had already been triggered for the faulty device. This prevents a faulty device for which port containment processing has already been triggered from triggering it again, and keeps non-faulty devices operating normally.
The following is an exemplary description of the present disclosure:
Embodiment 1: when a fatal error occurs during enumeration of device 1 shown in fig. 3, the method shown in fig. 9 is executed to perform fault handling, and includes the following steps:
When device 1 fails, the trigger mode of the PCIe root port DPC can be configured as RP PIO (i.e., the second fault handling mechanism described above). By configuring the RP PIO related registers, some error types trigger the DPC function of the root port, while other error types do not trigger the DPC function of the root port and are only reported through advisory fault reporting.
1. It is checked whether the PCIe root port supports DPC functions of the RP PIO mechanism.
2. When the PCIe root port supports the DPC function of the RP PIO mechanism, the PCIe root port is initialized by masking the AER mechanism, configuring the RP PIO mechanism, setting UR errors of configuration space requests not to trigger DPC, and setting all other errors to trigger DPC (i.e., the second fault handling mechanism described above).
3. When the PCIe root port does not support the DPC function of the RP PIO mechanism, the PCIe root port initializes by configuring the conventional AER mechanism to trigger the DPC (i.e., the first failure handling mechanism described above).
4. The PCIe root port, as the requester, determines that device 1 has suffered a fatal fault (i.e., the processor receives fault information for the faulty device).
5. The PCIe root port determines whether the processor supports the RP PIO mechanism.
6. When the PCIe root port does not support the RP PIO mechanism, the fatal error triggers the DPC using the conventional AER mechanism.
7. When the PCIe root port supports the RP PIO mechanism, further judging whether the fatal error is a UR error of the configuration space request.
8. If the fatal error is a UR error of the configuration space request, the fatal error does not trigger the DPC and only an advisory fault report is made.
9. If the fatal error is not a UR error of the configuration space request, the fatal error triggers the DPC using an RP PIO mechanism.
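The initialization in steps 2 and 3 amounts to choosing, per error type, whether that error triggers DPC. A minimal sketch of such a per-error-type trigger mask follows. The flag names and bit positions are illustrative assumptions and are not taken from the disclosure; only the error types named in this document are modeled.

```python
from enum import IntFlag


class RpPioError(IntFlag):
    """Error types named in this document; bit positions are illustrative."""
    CFG_UR = 1 << 0    # UR error of a configuration space request
    CFG_CTO = 1 << 1   # configuration space request completion timeout
    IO_CTO = 1 << 2    # input/output request completion timeout
    MEM_CTO = 1 << 3   # memory request completion timeout


ALL_ERRORS = (RpPioError.CFG_UR | RpPioError.CFG_CTO
              | RpPioError.IO_CTO | RpPioError.MEM_CTO)


def dpc_trigger_mask(supports_rp_pio: bool) -> RpPioError:
    """Return the set of error types that trigger DPC after initialization."""
    if not supports_rp_pio:
        # Step 3: conventional AER mechanism, every error triggers DPC.
        return ALL_ERRORS
    # Step 2: everything except the configuration-space UR triggers DPC.
    return ALL_ERRORS & ~RpPioError.CFG_UR


def triggers_dpc(error: RpPioError, mask: RpPioError) -> bool:
    return bool(error & mask)
```

With the RP PIO mask in place, a configuration-space UR is reported without triggering DPC (step 8), while any other modeled error triggers DPC (step 9).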
Embodiment 2: when a fatal error occurs during enumeration of device 2 shown in fig. 3, the method shown in fig. 10 is executed to perform fault handling, and includes the following steps:
When DPC is triggered at the downstream port of PCIe switch 1 (i.e., device 2), this may be accompanied by one or more completion timeout (CTO) errors generated at the PCIe root device. The present disclosure uses the RP PIO mechanism to filter out CTO faults and accurately locates the source of each CTO fault to distinguish whether it is a CTO fault that accompanies a DPC triggered at the downstream port of PCIe switch 1. The fault can thus be handled accurately, the root port is prevented from triggering DPC because of a CTO fault that merely accompanies that DPC, and the PCIe root device can continue to operate normally with the other PCIe switch (PCIe switch 2) and its downstream port (device 3).
1. The device initialization stage checks whether the PCIe root port supports DPC functions of the RP PIO mechanism.
2. When the PCIe root port supports the DPC function of the RP PIO mechanism, the PCIe root port is initialized by masking the AER mechanism, configuring the RP PIO mechanism, setting completion timeout errors (Cfg CTO, I/O CTO, and Mem CTO errors) not to trigger DPC, and setting all other errors to trigger DPC (i.e., the third fault handling mechanism described above).
3. When the PCIe root port does not support the DPC function of the RP PIO mechanism, the PCIe root port is initialized by configuring the traditional AER mechanism to trigger the DPC.
4. The PCIe root port, as the requester, detects a fatal fault.
5. When the PCIe root port does not support the RP PIO mechanism, the fatal error triggers the DPC using the conventional AER mechanism.
6. When the PCIe root port supports the RP PIO mechanism, it is further determined whether the fatal error is a Cfg CTO, I/O CTO, or Mem CTO error.
7. If the fatal error is at least one of a Cfg CTO, I/O CTO, or Mem CTO error, the DPC is not triggered for the time being and only an advisory fault report is made, where the advisory fault report can indicate the identity information of the faulty device, the cause of the fault, and the like to facilitate subsequent analysis and processing.
8. If the fatal error is none of a Cfg CTO, I/O CTO, or Mem CTO error, the RP PIO mechanism is used to trigger DPC.
9. Further, when user software receives a Cfg CTO, I/O CTO, or Mem CTO error discovered by the PCIe root port as the requester, the processor needs to parse the message header information log (RP PIO Header Log) of the root port to determine the source of the CTO error.
10. Further, the type of the message header information log is judged. If the parsed type is the memory space read-write type, the device to whose address space the information address indicated by the message header information log belongs is determined, and the bus information, device information, and function information (bus, device, function) of that device are recorded. The type of the message header information log may be determined by collecting the memory-mapped I/O (MMIO) space requested by the devices.
11. If the analysis type is the configuration space read-write type, determining the bus information, the device information and the function information of the device indicated by the message head information log according to the information address offset of the message head information log.
12. It is judged whether the device indicated by the message header information log is device 2.
13. If the device indicated by the message header information log is device 2, no further processing is performed and the fault processing ends.
14. If the device indicated by the message header information log is not device 2, the PCIe root port executes DPC.
In order to implement the device fault handling method provided by the embodiment of the present disclosure, the embodiment of the present disclosure further provides a device fault handling apparatus, as shown in fig. 11, where the device fault handling apparatus 1100 includes:
An initialization unit 1101, configured to perform initialization configuration on a processor based on a fault handling mechanism supported by the processor;
A determining unit 1102, configured to determine a connection mode between the faulty device and the processor in response to fault information of the faulty device;
A processing unit 1103, configured to perform fault processing on the faulty device based on the connection mode and/or the initialization configuration.
In some embodiments, the initialization unit 1101 is further configured to mask the first failure handling mechanism of the processor and configure the second failure handling mechanism and/or the third failure handling mechanism for the processor when the processor supports the root port programmable input/output, and configure the first failure handling mechanism for the processor when the processor does not support the root port programmable input/output.
In some embodiments, the first fault handling mechanism includes triggering port containment processing when the faulty device has a fault; the second fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is an unsupported configuration space request; and the third fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is a completion timeout error, and triggering port containment processing when the fault of the faulty device is an error other than a completion timeout error.
In some embodiments, the connection mode between the faulty device and the processor includes a direct connection of the faulty device to the processor and an indirect connection of the faulty device to the processor through the switch.
In some embodiments, the processing unit 1103 is further configured to trigger port containment processing when the initialization configuration is the first fault handling mechanism, and to perform fault processing on the faulty device based on the connection mode when the initialization configuration is the second fault handling mechanism or the third fault handling mechanism.
In some embodiments, the processing unit 1103 is further configured to perform fault handling on the fault device by using the second fault handling mechanism when the fault device is directly connected to the processor, and perform fault handling on the fault device by using the third fault handling mechanism when the fault device is indirectly connected to the processor through the switch.
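The selection between mechanisms can be sketched as follows. This is a minimal illustration rather than the disclosed implementation; the string labels (`"first"`, `"direct"`, `"via_switch"`, and so on) and the function name `select_handling` are assumptions introduced here, and the sketch assumes that both finer-grained mechanisms are available whenever the first mechanism is masked.

```python
def select_handling(configured: set, connection: str) -> str:
    """Pick how a fault is handled from the initialization result and the connection mode."""
    if "first" in configured:
        # First mechanism configured: any fault triggers port containment.
        return "port_containment"
    if connection == "direct":
        # Faulty device attached directly to the processor.
        return "second"
    if connection == "via_switch":
        # Faulty device attached to the processor through a switch.
        return "third"
    raise ValueError(f"unknown connection mode: {connection!r}")
```

For example, `select_handling({"second", "third"}, "via_switch")` selects the third mechanism, while any fault on a processor configured with only the first mechanism goes straight to containment.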
In some embodiments, the processing unit 1103 is further configured to determine, based on the fault information, whether the fault of the faulty device is an unsupported configuration space request, and to perform fault handling on the faulty device based on the result of the determination.
In some embodiments, the processing unit 1103 is further configured to: when the result of the determination is that the fault of the faulty device is an unsupported configuration space request, generate first report information of the fault and not trigger the port containment processing; and when the result of the determination is that the fault of the faulty device is not an unsupported configuration space request, trigger the port containment processing.
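A minimal sketch of the second fault handling mechanism is given below. The fault label `CFG_UNSUPPORTED_REQUEST`, the `Outcome` type, and the text of the report are hypothetical placeholders; the disclosure does not specify how the fault kind is encoded or what the first report information contains.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label for "unsupported configuration space request".
CFG_UNSUPPORTED_REQUEST = "cfg_unsupported_request"


@dataclass(frozen=True)
class Outcome:
    contain_port: bool      # whether port containment processing is triggered
    report: Optional[str]   # report information generated for the fault, if any


def handle_direct(fault_kind: str) -> Outcome:
    """Second mechanism: applied when the faulty device is directly connected."""
    if fault_kind == CFG_UNSUPPORTED_REQUEST:
        # Report the fault but leave the port running.
        return Outcome(False, "first report: unsupported configuration space request")
    # Any other fault triggers port containment processing.
    return Outcome(True, None)
```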
In some embodiments, the processing unit 1103 is further configured to determine, based on the fault information, whether the fault of the faulty device is a completion timeout error, and to perform fault handling on the faulty device based on the result of the determination.
In some embodiments, the processing unit 1103 is further configured to: when the result of the determination is that the fault of the faulty device is not a completion timeout error, trigger the port containment processing; and when the result of the determination is that the fault of the faulty device is a completion timeout error, generate second report information of the fault and obtain a header information log corresponding to the fault, so as to determine, based on the header information log, whether to perform the port containment processing.
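The third fault handling mechanism can be sketched in the same style. The function name, the fault label, and the two callables are assumptions made for illustration: `fetch_header_log` stands in for however the header information log is read, and `decide` stands in for the header-based containment decision.

```python
from typing import Callable, Optional, Tuple

COMPLETION_TIMEOUT = "completion_timeout"  # hypothetical fault label


def handle_via_switch(
    fault_kind: str,
    fetch_header_log: Callable[[], bytes],
    decide: Callable[[bytes], bool],
) -> Tuple[bool, Optional[str]]:
    """Third mechanism: returns (contain_port, report)."""
    if fault_kind != COMPLETION_TIMEOUT:
        # Errors other than completion timeout trigger containment directly.
        return True, None
    # Completion timeout: generate the second report, then let the header
    # information log decide whether containment is still needed.
    report = "second report: completion timeout"
    header_log = fetch_header_log()
    return decide(header_log), report
```

Passing the log reader and the decision as callables keeps this step independent of how the log is stored, so the header-based check can be tested separately.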
In some embodiments, the processing unit 1103 is further configured to determine, based on the type of the header information log, a space address indicated by the header information log, so as to determine the first information set based on the space address; and, when the type of the header information log is a configuration space read-write type, to determine the first information set based on an information address offset of the header information log.
In some embodiments, the processing unit 1103 is further configured to: when the type of the header information log is a memory read-write type or an input-output read-write type, determine the space address indicated by the header information log based on the information address of the header information log; and when the type of the header information log is a configuration space read-write type, make the determination based on an information address offset of the header information log.
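One way to picture the two decoding paths is sketched below, with the first information set modelled as a (bus, device, function) tuple. The sketch rests on assumptions that the text does not state: the `windows` table mapping each device to an address range is hypothetical, and the configuration path assumes the logged value is the third header doubleword of a configuration request, with the bus number in bits 31:24, the device number in bits 23:19, and the function number in bits 18:16.

```python
from typing import Dict, Optional, Tuple

BDF = Tuple[int, int, int]  # (bus, device, function)


def bdf_from_config_dword(dw2: int) -> BDF:
    """Extract bus, device and function from a configuration request doubleword."""
    return ((dw2 >> 24) & 0xFF, (dw2 >> 19) & 0x1F, (dw2 >> 16) & 0x07)


def bdf_from_address(address: int, windows: Dict[BDF, Tuple[int, int]]) -> Optional[BDF]:
    """Find the device whose (base, size) address window contains the address."""
    for bdf, (base, size) in windows.items():
        if base <= address < base + size:
            return bdf
    return None


def first_information_set(log_type: str, value: int,
                          windows: Dict[BDF, Tuple[int, int]]) -> Optional[BDF]:
    """Resolve the device targeted by the logged request."""
    if log_type in ("memory", "io"):
        # Memory or input-output read-write: resolve through the space address.
        return bdf_from_address(value, windows)
    if log_type == "config":
        # Configuration space read-write: decode the target directly.
        return bdf_from_config_dword(value)
    raise ValueError(f"unknown header log type: {log_type!r}")
```

For instance, the doubleword `0x3AA90010` decodes to bus 58, device 21, function 1 under the layout assumed above.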
In some embodiments, the processing unit 1103 is further configured to: not perform the port containment processing when it is determined that the device corresponding to the first information set is the faulty device, or is a switch connected to the faulty device; and perform the port containment processing when it is determined that the device corresponding to the first information set is neither the faulty device nor a switch connected to the faulty device.
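The containment decision reduces to a comparison of identifiers, sketched below. The parameter names and the use of (bus, device, function) tuples as identifiers are assumptions; treating an unresolved target (`None`) as "not the faulty device" is also an assumption of this sketch.

```python
from typing import Optional, Tuple

BDF = Tuple[int, int, int]  # (bus, device, function)


def should_contain(target: Optional[BDF], faulty: BDF, switch: BDF) -> bool:
    """Return True when port containment processing should be performed.

    target: device indicated by the first information set (None if unresolved)
    faulty: the faulty device
    switch: the switch connected to the faulty device
    """
    # Containment is skipped only when the logged request was aimed at the
    # faulty device itself or at the switch it hangs off.
    return target != faulty and target != switch
```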
In some embodiments, the processing unit 1103 is further configured to determine the type of the header information log based on the memory-mapped space requested by the faulty device.
In some embodiments, the processing unit 1103 is further configured to determine device identity information of the faulty device and a fault cause of the faulty device based on the first report information or the second report information, and to perform fault handling by using a preset solution based on the device identity information and the fault cause.
In some embodiments, the processing unit 1103 is further configured, in the port containment processing, to unload the driver of the faulty device and remove the device tag of the faulty device, release the link state between the faulty device and the processor, enumerate the faulty device, and re-enable the driver of the faulty device.
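The order of these steps can be made explicit with a dry-run planner that only lists the actions and performs none of them. Mapping the steps onto the Linux PCI sysfs files (`driver/unbind`, `remove`, `/sys/bus/pci/rescan`) is an assumption of this sketch and is not stated in the disclosure; the link release step is left as a platform-specific placeholder because no particular interface is described.

```python
from typing import List, Tuple


def containment_recovery_plan(bdf: str) -> List[Tuple[str, str]]:
    """List the recovery steps for one device, e.g. bdf = "0000:3a:00.0".

    Nothing is written to the system; the result is an ordered plan only.
    """
    dev = f"/sys/bus/pci/devices/{bdf}"
    return [
        ("unload_driver", f"{dev}/driver/unbind"),   # detach the driver
        ("remove_device", f"{dev}/remove"),          # drop the device entry
        ("release_link", "platform-specific"),       # placeholder, no interface assumed
        ("enumerate", "/sys/bus/pci/rescan"),        # rediscover the device
        ("enable_driver", "driver probe after rescan"),
    ]
```

Keeping the plan as data makes it easy to log or review the sequence before anything touches the hardware.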
In summary, the device fault handling apparatus provided according to the present disclosure includes: an initialization unit configured to perform initialization configuration on a processor based on a fault handling mechanism supported by the processor; a determining unit configured to determine, in response to fault information of a faulty device, a connection mode between the faulty device and the processor; and a processing unit configured to perform fault handling on the faulty device based on the connection mode and/or the initialization configuration. With the apparatus of the present disclosure, different fault handling mechanisms are configured for the processor according to the capabilities the processor supports, and fault handling is performed on the faulty device using different fault handling mechanisms according to the connection relationship between the faulty device and the processor. This avoids interrupting the operation of normal PCIe devices, which can occur when EDPC technology is used to handle reported errors of different fault types, and improves the granularity and accuracy of fault handling.
It should be noted that, when the device fault handling apparatus provided in the foregoing embodiment performs device fault handling, the division into the program modules described above is merely an example. In practical applications, the processing may be allocated to different program modules as needed; that is, the internal structure of the device fault handling apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the device fault handling apparatus provided in the foregoing embodiments and the device fault handling method embodiments provided in the present disclosure belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.
Fig. 12 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present disclosure. As shown in Fig. 12, the electronic device 1200 includes at least one processor 1202 and a memory 1201 communicatively connected to the at least one processor 1202, where the memory 1201 stores instructions executable by the at least one processor 1202, and the instructions are executed by the at least one processor 1202 to implement the steps of the device fault handling method according to the embodiments of the present disclosure.
Optionally, the electronic device may specifically be the device fault handling apparatus in the embodiments of the present application, and the electronic device may implement the corresponding flows implemented by the device fault handling apparatus in each method of the embodiments of the present application, which are not described here for brevity.
It is understood that the electronic device also includes a communication interface 1203. The various components in the electronic device are coupled together by a bus system 1204. It is appreciated that the bus system 1204 is used to facilitate connected communications between these components. The bus system 1204 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration, the various buses are labeled as bus system 1204 in fig. 12.
It is to be appreciated that the memory 1201 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memory 1201 described in the embodiments of the present invention is intended to include, without being limited to, these and any other suitable types of memory.
The methods disclosed in the embodiments of the present disclosure described above may be applied to the processor 1202 or implemented by the processor 1202. The processor 1202 has signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuits of hardware in the processor 1202 or by instructions in the form of software. The processor 1202 may be a general purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 1202 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. The general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be embodied directly as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium in the memory 1201; the processor 1202 reads information in the memory 1201 and performs the steps of the method in combination with its hardware.
In an exemplary embodiment, the electronic device may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), FPGAs, general purpose processors, controllers, MCUs, microprocessors, or other electronic elements for performing the aforementioned methods.
The embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer commands, where the computer commands are used to cause a computer to execute the steps of the device fault handling method of the embodiments of the present disclosure.
The embodiments of the present disclosure also provide a computer program product including a computer program which, when executed by a processor, implements the steps of the device fault handling method of the embodiments of the present disclosure.
Optionally, the computer-readable storage medium may be applied to the device fault handling apparatus in the embodiments of the present application, and the computer commands cause a computer to execute the corresponding flows implemented by the device fault handling apparatus in each method of the embodiments of the present application, which are not described here for brevity.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communicative connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communicative connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated in one processing unit, each unit may serve separately as one unit, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The foregoing is merely illustrative of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

1. A device fault handling method, characterized in that the method comprises:
performing initialization configuration on a processor based on a fault handling mechanism supported by the processor;
in response to fault information of a faulty device, determining a connection mode between the faulty device and the processor; and
performing fault handling on the faulty device based on the connection mode and/or the initialization configuration.

2. The method according to claim 1, characterized in that performing initialization configuration on the processor based on the fault handling mechanism supported by the processor comprises:
when the processor supports root port programmable input/output, masking a first fault handling mechanism of the processor, and configuring a second fault handling mechanism and/or a third fault handling mechanism for the processor; and
when the processor does not support the root port programmable input/output, configuring the first fault handling mechanism for the processor.

3. The method according to claim 2, characterized in that the first fault handling mechanism comprises: triggering port containment processing when the faulty device has a fault;
the second fault handling mechanism comprises: not triggering the port containment processing when the fault of the faulty device is an unsupported configuration space request, and triggering the port containment processing when the fault of the faulty device is a fault other than the unsupported configuration space request; and
the third fault handling mechanism comprises: not triggering the port containment processing when the fault of the faulty device is a completion timeout error, and triggering the port containment processing when the fault of the faulty device is an error other than the completion timeout error.

4. The method according to claim 1, characterized in that the connection mode between the faulty device and the processor comprises: the faulty device being directly connected to the processor, and the faulty device being indirectly connected to the processor through a switch.

5. The method according to claim 3, characterized in that performing fault handling on the faulty device based on the connection mode and/or the initialization configuration comprises:
triggering the port containment processing when the first fault handling mechanism is configured in the initialization configuration; and
performing fault handling on the faulty device based on the connection mode when the second fault handling mechanism or the third fault handling mechanism is configured in the initialization configuration.

6. The method according to claim 5, characterized in that performing fault handling on the faulty device based on the connection mode when the second fault handling mechanism or the third fault handling mechanism is configured in the initialization configuration comprises:
when the faulty device is directly connected to the processor, performing fault handling on the faulty device by using the second fault handling mechanism; and
when the faulty device is indirectly connected to the processor through a switch, performing fault handling on the faulty device by using the third fault handling mechanism.

7. The method according to claim 6, characterized in that performing fault handling on the faulty device by using the second fault handling mechanism when the faulty device is directly connected to the processor comprises:
determining, based on the fault information, whether the fault of the faulty device is an unsupported configuration space request; and
performing fault handling on the faulty device based on a result of the determination.

8. The method according to claim 7, characterized in that performing fault handling on the faulty device based on the result of the determination comprises:
when the result of the determination is that the fault of the faulty device is the unsupported configuration space request, generating first report information of the fault, and not triggering the port containment processing; and
when the result of the determination is that the fault of the faulty device is not the unsupported configuration space request, triggering the port containment processing.

9. The method according to claim 6, characterized in that performing fault handling on the faulty device by using the third fault handling mechanism when the faulty device is indirectly connected to the processor through the switch comprises:
determining, based on the fault information, whether the fault of the faulty device is a completion timeout error; and
performing fault handling on the faulty device based on a result of the determination.

10. The method according to claim 9, characterized in that performing fault handling on the faulty device based on the result of the determination comprises:
when the result of the determination is that the fault of the faulty device is not the completion timeout error, triggering the port containment processing; and
when the result of the determination is that the fault of the faulty device is the completion timeout error, generating second report information of the fault, and obtaining a header information log corresponding to the fault, so as to determine, based on the header information log, whether to perform the port containment processing.

11. The method according to claim 10, characterized in that determining, based on the header information log, whether to perform the port containment processing comprises:
determining, based on the type of the header information log, a first information set indicated by the header information log; and
determining whether the device corresponding to the first information set is the faulty device, so as to determine whether to perform the port containment processing.

12. The method according to claim 11, characterized in that determining, based on the type of the header information log, the space address indicated by the header information log comprises:
when the type of the header information log is a memory read-write type or an input-output read-write type, determining, based on the information address of the header information log, the space address indicated by the header information log, so as to determine the first information set based on the space address; and
when the type of the header information log is a configuration space read-write type, determining the first information set based on the information address offset of the header information log.

13. The method according to claim 11, characterized in that determining whether the device corresponding to the first information set is the faulty device, so as to determine whether to perform the port containment processing, comprises:
when it is determined that the device corresponding to the first information set is the faulty device, or that the device corresponding to the first information set is a switch connected to the faulty device, not performing the port containment processing; and
when it is determined that the device corresponding to the first information set is not the faulty device, and that the device corresponding to the first information set is not a switch connected to the faulty device, performing the port containment processing.

14. The method according to claim 11, characterized in that the method further comprises:
determining the type of the header information log based on the memory-mapped space requested by the faulty device.

15. The method according to claim 14, characterized in that the first information set comprises at least one of the following: bus information of the faulty device, device information of the faulty device, and function information of the faulty device.

16. The method according to claim 8 or 10, characterized in that the method further comprises:
determining, based on the first report information or the second report information, device identity information of the faulty device and a fault cause of the faulty device; and
performing fault handling by using a preset solution based on the device identity information and the fault cause.

17. The method according to claim 3 or 5, characterized in that the port containment processing comprises:
unloading the driver of the faulty device, and removing the device tag of the faulty device;
releasing the link state between the faulty device and the processor; and
enumerating the faulty device, and re-enabling the driver of the faulty device.

18. An electronic device, characterized by comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein the processor is configured to execute the device fault handling method according to any one of claims 1 to 17 when running the computer program.

19. A non-transitory computer-readable storage medium storing computer commands, characterized in that the computer commands are used to cause a computer to execute the device fault handling method according to any one of claims 1 to 17.

20. A computer program product, characterized by comprising a computer program, wherein the computer program, when executed by a processor, implements the device fault handling method according to any one of claims 1 to 17.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510489529.2A CN120011127B (en) 2025-04-18 2025-04-18 Equipment failure handling method, electronic equipment, storage medium and product

Publications (2)

Publication Number Publication Date
CN120011127A true CN120011127A (en) 2025-05-16
CN120011127B CN120011127B (en) 2025-07-25

Family

ID=95668215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510489529.2A Active CN120011127B (en) 2025-04-18 2025-04-18 Equipment failure handling method, electronic equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN120011127B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246686A1 (en) * 2010-04-01 2011-10-06 Cavanagh Jr Edward T Apparatus and system having pci root port and direct memory access device functionality
US20180089047A1 (en) * 2016-09-29 2018-03-29 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Detecting and handling an expansion card fault during system initialization
CN112306913A (en) * 2019-07-30 2021-02-02 华为技术有限公司 A management method, device and system for an endpoint device
CN112470158A (en) * 2018-05-11 2021-03-09 美国莱迪思半导体公司 Fault characterization system and method for programmable logic device
CN119512758A (en) * 2024-11-20 2025-02-25 杭州义益钛迪信息技术有限公司 Programmable logic controller memory dynamic allocation method, device and electronic equipment
CN119621400A (en) * 2024-11-27 2025-03-14 苏州元脑智能科技有限公司 A device failure handling method, device, equipment and medium

Also Published As

Publication number Publication date
CN120011127B (en) 2025-07-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant