Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the development of computer technology, PCIe devices have become an integral part of current computer systems. However, as PCIe technology evolves, PCIe devices run at ever higher speeds and fail more frequently. For some faults, the operating system hosting the PCIe device can recover automatically, but for most faults it cannot, so the machine goes down or restarts and the operating system cannot run normally.
To address system downtime caused by fatal faults of PCIe devices, the industry has developed EDPC technology. EDPC is an error isolation and recovery technique for the PCIe bus: when a link error is detected (e.g., a malformed transaction layer packet, an unexpected link down, etc.), it disables the single affected PCIe link and forces outstanding requests to terminate, which prevents error propagation and protects the operating system from potentially bad data.
In particular, the related art generally uses the Advanced Error Reporting (AER) mechanism of PCIe to trigger EDPC. AER is a mechanism for detecting and reporting errors that occur in PCIe devices; it allows PCIe devices to detect and report various types of errors, such as non-fatal, recoverable, and severe errors. AER implements a set of registers and a corresponding error notification mechanism on the PCIe device, and the registers can be read to obtain information about an error. With AER, the system can better monitor and handle error conditions of PCIe devices and thereby improve data integrity and reliability.
Specifically, EDPC and AER cooperate to isolate and repair PCIe devices. As shown in fig. 1, the procedure includes the following steps:
① The processor detects that the device has an uncorrectable error (i.e., an error that the operating system cannot automatically fix). ② The processor sends a signal to the operating system for error handling. ③ The operating system battery management module notifies the device driver to unload the driver of the device connected to the processor and removes the device tag of the device to prevent subsequent Memory-Mapped Input/Output (MMIO) space access. ④ The operating system battery management module restores the device by releasing the device's link (Link) state, re-enumerating the device, and re-enabling the device's driver.
The related art can accurately isolate and repair a faulty device when a PCIe root port of the processor corresponds to a single device. However, when the PCIe root port corresponds to a plurality of devices, that is, when the device attached to the PCIe root port is a PCIe switch (PCIe Switch) device (as shown in fig. 2), a fault in one of the plurality of devices (for example, device 1) causes PCIe switch 1 to trigger DPC and also causes an associated timeout fault at the PCIe root port. The processor then triggers DPC as well, which interrupts the operation of the normal devices.
Moreover, while a device is being enumerated as part of DPC fault handling, some PCIe configuration (cfg) requests generate PCIe Unsupported Request (UR) errors because the device has not yet been initialized. In the related art, such false alarms are usually avoided by masking UR errors. However, once the operating system has completed enumeration and the device is operating normally, PCIe UR errors remain masked, which can lead to more serious errors in the processor.
Technical terms related to the present disclosure are described below:
Root port programmable input output (Root Port Programmable I/O, RP PIO) is a mechanism in the PCIe architecture for managing errors encountered when a root port (Root Port) sends non-posted requests (Non-Posted Requests). It provides fine-grained control over uncorrectable errors and advisory errors. The RP PIO error control registers provide fine-grained error management for non-posted requests, allowing flexible configuration by request type (configuration, input output, memory) and by error type.
In order to solve the technical problems in the related art, embodiments of the present disclosure provide an apparatus fault handling method.
The application scenario to which the present disclosure relates is first described by way of example:
As shown in fig. 3, the processor is directly connected to the device 1 through a PCIe port, the device 2 is connected to the PCIe port of the processor through the PCIe switch 1, and the device 3 is connected to the PCIe port of the processor through the PCIe switch 2.
It should be understood that fig. 3 is only an exemplary illustration, and that the application scenario of the present disclosure may include some or all of the devices shown in fig. 3, and the present disclosure is not limited to the number of devices directly connected to the processor, and is not limited to the number of devices connected to the processor through the PCIe switch. For example, the application scenario of the present disclosure may include only one or more devices directly connected to the processor, or only one or more devices indirectly connected to the processor through a PCIe switch, or one or more devices directly connected to the processor, and one or more devices indirectly connected to the processor.
Embodiments of the present disclosure will be described in detail below.
As shown in fig. 4, an embodiment of the present disclosure provides an apparatus failure processing method, including the steps of:
Step 101, initializing and configuring the processor based on a fault handling mechanism supported by the processor.
In some embodiments, different initialization configurations may be performed for the processor depending on whether the processor supports root port programmable input output (Root Port Programmable I/O, RP PIO).
In some embodiments, when the processor does not support root port programmable input output, the initialization configuration of the processor may be to trigger port containment processing whenever the faulty device has a fault. That is, the fault type of the faulty device is not distinguished, and all faults are handled uniformly by port containment processing.
In some embodiments, when the processor supports root port programmable input output, the initialization configuration of the processor may be to use port containment processing for some fault types and to perform only fault reporting, without port containment processing, for other fault types. Different fault types are thus handled in different ways, which avoids affecting the operation of normal devices.
In some embodiments, the port containment process may be EDPC processes, may be downstream port containment (Downstream Port Containment, DPC) processes, or may be modified processes based on DPC processes, etc., which are not limited in this disclosure.
Step 102, in response to fault information of a faulty device, determining the connection mode between the faulty device and the processor.
In some embodiments, the fault information is used to indicate that a device connected to the processor is faulty.
In some embodiments, when the fault information is received, the processor may query the connection information of the faulty device to determine the connection mode between the faulty device and the processor, so that the faulty device can be handled in a targeted way in combination with the initialization configuration of the processor. The present disclosure does not limit the technical means used to determine the connection mode.
In some embodiments, the connection mode between the faulty device and the processor is either a direct connection or an indirect connection through a switch. The switch may be a PCIe switch.
A direct connection may mean that the faulty device is connected directly to a PCIe root port of the processor. An indirect connection through a PCIe switch may mean that the faulty device is connected to the PCIe root port of the processor through the PCIe switch.
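The distinction between the two connection modes can be illustrated with a minimal sketch. The topology table, the device names, and the `connection_mode` helper are hypothetical; they only mirror the scenario of fig. 3 and are not taken from the disclosure.

```python
from enum import Enum
from typing import Dict


class Connection(Enum):
    DIRECT = "direct"      # attached straight to a PCIe root port
    INDIRECT = "indirect"  # reached through a PCIe switch


# Hypothetical topology: component name -> upstream component.
UPSTREAM: Dict[str, str] = {
    "device1": "root_port",
    "switch1": "root_port",
    "device2": "switch1",
    "switch2": "root_port",
    "device3": "switch2",
}


def connection_mode(device: str, upstream: Dict[str, str] = UPSTREAM) -> Connection:
    """Classify a device by looking at what sits directly above it."""
    parent = upstream[device]
    return Connection.DIRECT if parent == "root_port" else Connection.INDIRECT
```

With this table, `connection_mode("device1")` yields the direct mode and `connection_mode("device2")` yields the indirect mode. A real implementation would obtain the topology from the enumerated PCIe hierarchy rather than from a static dictionary.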
Step 103, performing fault processing on the faulty device based on the connection mode and/or the initialization configuration.
In some embodiments, the initialization configuration is to trigger port containment processing whenever the faulty device has a fault. In this case, the faulty device is handled by port containment processing regardless of whether it is connected to the processor directly or indirectly; that is, fault processing is performed directly based on the initialization configuration.
In some embodiments, the initialization configuration of the processor is to use port containment processing for some fault types and to perform only fault reporting, without port containment processing, for other fault types. Because the fault types that require port containment processing differ between connection modes, the connection mode between the faulty device and the processor is used to judge whether the fault type of the faulty device requires port containment processing, and fault processing is performed on the faulty device accordingly.
In some embodiments, the port containment processing may include unloading the driver of the faulty device and removing the device tag of the faulty device, releasing the link state between the faulty device and the processor, re-enumerating the faulty device, and re-enabling the driver of the faulty device, thereby isolating and repairing the faulty device.
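The order of operations in port containment processing can be sketched as follows. The `Device` class and its fields are hypothetical stand-ins for operating system state, and the assumption that re-enumeration restores the device tag is ours; the sketch only records the sequence and does not touch hardware.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Device:
    name: str
    driver_loaded: bool = True
    tag_present: bool = True
    steps: List[str] = field(default_factory=list)


def port_containment(dev: Device) -> Device:
    """Isolate the device, then repair it, recording each step."""
    # Isolation: stop the driver and block further MMIO access.
    dev.driver_loaded = False
    dev.steps.append("unload driver")
    dev.tag_present = False
    dev.steps.append("remove device tag")
    # Repair: release the link, rediscover the device, restart the driver.
    dev.steps.append("release link state")
    dev.tag_present = True  # assumption: re-enumeration restores the tag
    dev.steps.append("re-enumerate device")
    dev.driver_loaded = True
    dev.steps.append("re-enable driver")
    return dev
```

Calling `port_containment(Device("device1"))` returns the device with its driver loaded again and the five steps logged in order.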
In summary, the device fault handling method provided by the disclosure includes performing initialization configuration of a processor based on a fault handling mechanism supported by the processor, determining the connection mode between a faulty device and the processor in response to fault information of the faulty device, and performing fault processing on the faulty device based on the connection mode and/or the initialization configuration. According to the method, different fault handling mechanisms are configured for the processor according to the capability of the processor, and the faulty device is handled with different fault handling mechanisms according to its connection relation with the processor. This avoids the interruption of normal PCIe devices that occurs when EDPC is applied uniformly to all types of reported faults, and improves the granularity and accuracy of fault handling.
Fig. 5 further illustrates a flow chart of an apparatus fault handling method set forth in the present disclosure. Further explained based on the embodiment shown in fig. 1, fig. 5 may include the following steps.
In step 201, when the processor supports the root port programmable input/output, the first failure handling mechanism of the processor is masked, and the second failure handling mechanism and/or the third failure handling mechanism is configured for the processor.
In some embodiments, the first failure handling mechanism includes triggering a port containment process when a failed device fails.
In some embodiments, the second fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is an unsupported configuration space request, and triggering port containment processing when the fault of the faulty device is any fault other than an unsupported configuration space request.
In some embodiments, the third fault handling mechanism includes not triggering port containment processing when the fault of the faulty device is a completion timeout error, and triggering port containment processing when the fault of the faulty device is an error other than a completion timeout error.
In some embodiments, the second failure handling mechanism is used when the failed device is directly connected to the processor and the third failure handling mechanism is used when the failed device is indirectly connected to the processor through the PCIe switch.
In some embodiments, the unsupported configuration space request may be a UR error of the configuration space request, but is not limited thereto.
In some embodiments, the completion timeout (Completion Timeout) error may be a configuration space request timeout, an input output request timeout, a memory request timeout, etc., which is not limited by the present disclosure.
In some alternative embodiments, only the second fault handling mechanism may be configured for the processor when all devices connected to the processor are directly connected, only the third fault handling mechanism may be configured for the processor when all devices connected to the processor are indirectly connected, and both the second fault handling mechanism and the third fault handling mechanism may be configured for the processor when the devices connected to the processor include both directly connected and indirectly connected devices.
Step 202, when the processor does not support the root port programmable input output, configuring a first failure handling mechanism for the processor.
In some embodiments, when the processor does not support the root port programmable input/output, that is, the processor does not support adopting different fault handling modes according to the fault type, the first fault handling mechanism may be directly configured for the processor, so that the fault handling can be performed when the device connected with the processor fails.
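The selection logic of steps 201 and 202 can be summarized in a short sketch. The `Mechanism` flags and the `init_config` function are illustrative names, and the three boolean inputs are assumed to have been detected beforehand; how a platform actually reports RP PIO support is outside the scope of this sketch.

```python
from enum import Flag


class Mechanism(Flag):
    NONE = 0
    FIRST = 1   # every fault triggers port containment
    SECOND = 2  # unsupported configuration space request: report only
    THIRD = 4   # completion timeout: report, then inspect the header log


def init_config(supports_rp_pio: bool, has_direct: bool, has_indirect: bool) -> Mechanism:
    """Choose which fault handling mechanisms to configure."""
    if not supports_rp_pio:
        # Step 202: no per-type control is available.
        return Mechanism.FIRST
    # Step 201: the first mechanism is masked; enable per-topology mechanisms.
    config = Mechanism.NONE
    if has_direct:
        config |= Mechanism.SECOND
    if has_indirect:
        config |= Mechanism.THIRD
    return config
```

For a processor that supports RP PIO and has both kinds of attached devices, `init_config(True, True, True)` returns the combination of the second and third mechanisms.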
In summary, the device fault handling method provided by the disclosure includes masking a first fault handling mechanism of a processor and configuring a second fault handling mechanism and/or a third fault handling mechanism for the processor when the processor supports root port programmable input output, and configuring the first fault handling mechanism for the processor when the processor does not support root port programmable input output. According to the method, different fault handling mechanisms are configured for the processor according to the capability of the processor, which improves the granularity and accuracy of fault handling by the processor.
Fig. 6 further illustrates a flow chart of an apparatus fault handling method set forth in the present disclosure. Further explained based on the embodiments shown in fig. 1 and 2, fig. 6 may include the following steps.
Step 301, when the processor does not support the root port programmable input output, configuring a first failure handling mechanism for the processor.
In some embodiments, the principle of step 301 is the same as that of step 202, and reference may be made to the embodiment in step 202 and the related description thereof, which will not be repeated here.
Step 302, in response to fault information of a faulty device, triggering port containment processing.
In some embodiments, since the initialization of the processor is configured as the first fault handling mechanism, when fault information is received, the processor has no capability to adopt different fault handling modes according to different fault types, and therefore, the processor can directly perform port containment processing on the fault device to perform fault handling.
In summary, the device fault handling method provided by the disclosure includes configuring a first fault handling mechanism for a processor when the processor does not support root port programmable input output, and triggering port containment processing in response to fault information of a faulty device. By configuring the first fault handling mechanism for a processor that does not support root port programmable input output, the method broadens the range of processors to which the disclosure applies, so that port containment processing can still be used to handle a fault in a device directly or indirectly connected to such a processor.
Fig. 7 further illustrates a flow chart of an apparatus fault handling method set forth in the present disclosure. Further explained based on the embodiments shown in fig. 1 and 2, fig. 7 may include the following steps:
In step 401, when the processor supports the root port programmable input/output, the first failure handling mechanism of the processor is masked, and the second failure handling mechanism is configured for the processor, or the second failure handling mechanism and the third failure handling mechanism are configured for the processor.
In some embodiments, the principle of the step 401 is the same as that of the step 201, and reference may be made to the embodiment in the step 201 and the related description thereof, which are not repeated herein.
Step 402, in response to the fault information for the faulty device, determining that the faulty device is directly connected to the processor.
In some embodiments, when the fault information is received, the processor may determine that the fault device is directly connected to the processor by querying the connection information of the fault device, but is not limited thereto, and the technical means adopted in determining the connection manner is not limited in the disclosure.
Step 403, performing fault processing on the faulty device by using the second fault handling mechanism.
In some embodiments, the fault information may further indicate a fault type of the fault device, so that whether the fault of the fault device is a request for unsupported configuration space may be determined according to the fault information, and then, based on a result of the determination, fault processing is performed on the fault device.
Specifically, when it is determined from the fault information that the fault of the faulty device is an unsupported configuration space request, the faulty device is handled by generating first report information of the fault without triggering port containment processing.
When it is determined from the fault information that the fault of the faulty device is not an unsupported configuration space request (namely, it is some other fault), the faulty device is handled by triggering port containment processing.
Specifically, when a device is directly connected to the processor and the processor enumerates the device, a normal device that has not been initialized produces an unsupported configuration space request fault. During enumeration the processor reads data from or writes data to the device; because the device has not completed initialization, the read or write cannot be performed, which produces the fault. The normal device has not actually failed at this time; it has merely not completed initialization, so port containment processing does not need to be performed on it.
Further, the unsupported configuration space request fault may not have been caused by a normal device that has not completed initialization; that is, the device may actually have failed. The first report information is therefore generated so that the specific cause of the unsupported configuration space request can be analyzed to determine whether corresponding fault processing should be executed.
In other words, the cause of the unsupported configuration space request can be analyzed according to the first report information, so that when the fault was not caused by incomplete initialization of a normal device, the faulty device is handled with a preset fault processing mode, and the faulty device is isolated and repaired.
In some embodiments, the first reporting information may be fault advisory reporting (Advisory Error report) information, but is not limited thereto.
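The decision made by the second fault handling mechanism reduces to a two-way branch on the fault type. The following sketch uses hypothetical `Fault` and `Action` enumerations; the set of fault types is simplified for illustration.

```python
from enum import Enum


class Fault(Enum):
    CFG_UR = "unsupported configuration space request"
    COMPLETION_TIMEOUT = "completion timeout"
    OTHER = "other uncorrectable error"


class Action(Enum):
    REPORT_ONLY = "generate first report information"
    CONTAIN = "trigger port containment processing"


def second_mechanism(fault: Fault) -> Action:
    """Handle a fault from a device directly connected to the processor."""
    if fault is Fault.CFG_UR:
        # Possibly just an uninitialized device: report, do not contain.
        return Action.REPORT_ONLY
    return Action.CONTAIN
```

Under this sketch, only `Fault.CFG_UR` avoids containment; the report it produces is what a later analysis step would use to decide whether the device has really failed.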
In summary, the device fault handling method provided by the disclosure includes, when the processor supports root port programmable input output, masking a first fault handling mechanism of the processor and configuring a second fault handling mechanism for the processor, or configuring the second fault handling mechanism and a third fault handling mechanism for the processor; determining, in response to fault information for a faulty device, that the faulty device is directly connected to the processor; and performing fault processing on the faulty device by using the second fault handling mechanism. According to the method, when the processor supports root port programmable input output, the second fault handling mechanism is configured for the processor, so that different types of faults in a device directly connected to the processor are handled in different ways. When the fault is an unsupported configuration space request, only report information is generated and port containment processing is not triggered, which prevents a normal device that has not been initialized from triggering fault processing and ensures normal operation of the device.
Fig. 8 further illustrates a flow chart of an apparatus fault handling method set forth in the present disclosure. Further explained based on the embodiments shown in fig. 1 and 2, fig. 8 may include the following steps:
Step 501, when the processor supports root port programmable input output, masking the first fault handling mechanism of the processor and configuring a third fault handling mechanism for the processor, or configuring a second fault handling mechanism and the third fault handling mechanism for the processor.
In some embodiments, the principle of the step 501 is the same as that of the step 201, and reference may be made to the embodiment in the step 201 and the related description thereof, which are not repeated herein.
In step 502, in response to the fault information for the faulty device, it is determined that the faulty device is indirectly connected to the processor.
In some embodiments, when the fault information is received, the processor may determine that the fault device is indirectly connected to the processor by querying the connection information of the fault device, but is not limited thereto, and the technical means adopted in determining the connection manner is not limited in the disclosure.
Step 503, performing fault processing on the faulty device by using the third fault handling mechanism.
In some embodiments, the fault information may also indicate a fault type of the fault device, so that whether the fault of the fault device is a completion timeout error may be determined according to the fault information, and then, based on a result of the determination, fault processing is performed on the fault device.
Specifically, when it is determined from the fault information that the fault of the faulty device is not a completion timeout error, the faulty device is handled by triggering port containment processing.
When it is determined from the fault information that the fault of the faulty device is a completion timeout error, second report information of the fault is generated, and the message header information log (RP PIO Header Log) corresponding to the fault is acquired, so that whether to perform port containment processing is determined based on the message header information log.
Further, a first information set indicated by the message header information log can be determined according to the type of the message header information log, and whether the device corresponding to the first information set is the faulty device is then judged to determine whether to perform port containment processing.
Specifically, when a faulty device indirectly connected to the processor fails, the PCIe switch connected to the faulty device first triggers port containment processing to handle the fault, and the PCIe switch then sends fault information to the processor. Because the PCIe switch has already triggered port containment processing, the faulty device is already undergoing fault handling at this time. To prevent the processor from performing port containment processing again because of this fault information, the source of the fault information needs to be checked to determine whether the device that sent it is a device for which port containment processing has already been triggered.
Specifically, when the type of the message header information log is a memory read-write type or an input-output read-write type, the space address indicated by the message header information log is determined based on the information address of the message header information log, and the first information set is determined based on the space address; when the type of the message header information log is a configuration space read-write type, the first information set is determined based on the information address offset of the message header information log.
In some embodiments, the type of the message header information log may be determined based on the memory mapping space applied by the fault device, but is not limited thereto, and the present disclosure is not limited to the manner of determining the type of the message header information log.
In some embodiments, when the type of the message header information log is a memory read-write type or an input-output read-write type, the message header information log may directly indicate a space address of a device that has failed, and further determine the device that has failed according to the space address, so as to obtain the first information set.
In some embodiments, when the type of the header information log is a configuration space read-write type, the data of the different locations of the header information may indicate the first information set of the failed device.
In other words, when the type of the header information log is a memory read-write type or an input-output read-write type, specific equipment with faults needs to be determined according to the space address indicated by the header information log so as to determine the first information set, and when the type of the header information log is a configuration space read-write type, the header information log can directly indicate the first information set, so that the header information log can be directly analyzed so as to determine the first information set.
By way of example, for storage read-write type information whose message header information log is 9 bits long, the first 3 bits may indicate bus information, the middle 3 bits may indicate device information, and the last 3 bits may indicate function information.
Further, the first information set includes at least one of bus (bus) information of the faulty device, device (device) information of the faulty device, and function (function) information of the faulty device.
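The two ways of deriving the first information set can be sketched as follows. The address windows in `ADDRESS_MAP` are invented for illustration, the string type labels are hypothetical, and applying the 9-bit layout from the example above to the configuration space branch (with the "first" bits taken as the most significant) is an assumption of this sketch rather than something the disclosure states.

```python
from typing import Dict, Tuple

BDF = Tuple[int, int, int]  # (bus, device, function)

# Hypothetical address windows claimed by devices: (start, end) -> BDF.
ADDRESS_MAP: Dict[Tuple[int, int], BDF] = {
    (0x9000_0000, 0x9000_FFFF): (1, 0, 0),
    (0xA000_0000, 0xA000_FFFF): (3, 0, 0),
}


def first_info_set(log_type: str, value: int) -> BDF:
    """Derive (bus, device, function) from a header log value."""
    if log_type in ("memory", "io"):
        # The log carries an address: find the device that owns it.
        for (start, end), bdf in ADDRESS_MAP.items():
            if start <= value <= end:
                return bdf
        raise LookupError("address not claimed by any known device")
    if log_type == "config":
        # The log carries the identifiers directly: 3 bits per field.
        if not 0 <= value < (1 << 9):
            raise ValueError("expected a 9-bit value")
        return ((value >> 6) & 0b111, (value >> 3) & 0b111, value & 0b111)
    raise ValueError(f"unknown log type: {log_type}")
```

For instance, `first_info_set("config", 0b001_010_011)` decodes to bus 1, device 2, function 3, while a memory-type log is resolved through the address table.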
In some alternative embodiments, whether the fault information is generated due to the fact that the fault device triggers the port containment process can be further determined according to whether the device corresponding to the space address is the fault device, so that the fact that the fault information generated due to the port containment process triggers the port containment process again can be avoided.
Further, when the device corresponding to the first information set is judged to be the faulty device or the PCIe switch connected to the faulty device, port containment processing is not executed; when the device corresponding to the first information set is judged to be neither the faulty device nor the PCIe switch connected to the faulty device, port containment processing is executed.
In other words, when the device corresponding to the first information set is the faulty device, or is the PCIe switch connected to the faulty device, the fault information was generated by the PCIe switch or the faulty device while port containment processing was being performed on the faulty device. The faulty device is already undergoing fault handling at this time, so the processor does not need to perform port containment processing again, and port containment processing is therefore not performed.
Conversely, when the device corresponding to the first information set is neither the faulty device nor the PCIe switch connected to the faulty device, the fault information was not generated while port containment processing was being performed on the faulty device; that is, there is currently a faulty device that still requires fault handling, and port containment processing is therefore performed.
For example, taking the devices included in fig. 3 as an example, when the device 2 sends fault information to the processor, a register in the processor parses the header information log; when it is determined that the source of the fault indicated by the header information log is the device 2 or the PCIe switch 1, port containment processing is not executed and the fault processing ends.
For example, taking the devices included in fig. 3 as an example, when the device 2 sends fault information to the processor, a register in the processor parses the header information log; when it is determined that the source of the fault indicated by the header information log is not the device 2 or the PCIe switch 1 (e.g., the device 1, the device 3, the PCIe switch, etc.), port containment processing is performed.
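The two examples above follow from one decision rule, sketched below. The function name, the string device identifiers, and the `Action` enumeration are hypothetical; the source indicated by the header log is assumed to have been resolved to a device name already.

```python
from enum import Enum


class Action(Enum):
    CONTAIN = "trigger port containment processing"
    SKIP = "report only, no port containment processing"


def third_mechanism(is_completion_timeout: bool, log_source: str,
                    faulty_device: str, upstream_switch: str) -> Action:
    """Handle a fault from a device connected through a PCIe switch."""
    if not is_completion_timeout:
        return Action.CONTAIN
    # Completion timeout: second report information would be generated
    # here, then the header log source decides whether to contain.
    if log_source in (faulty_device, upstream_switch):
        # The switch has already contained this device; do not repeat.
        return Action.SKIP
    return Action.CONTAIN
```

With device 2 behind PCIe switch 1, a completion timeout whose log points at `"device2"` or `"switch1"` is skipped, while one whose log points at `"device3"` triggers containment.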
In some embodiments, the second reporting information may be fault advisory reporting information, but is not limited thereto.
In some alternative embodiments, the equipment identity information of the fault equipment and the fault reason of the fault equipment can be determined according to the first report information or the second report information, and then the fault processing is performed by using a preset solution mode based on the equipment identity information and the fault reason.
Specifically, the first report information or the second report information may include information such as a device model of the fault device, a fault reason, and the like, and further, according to a preset mapping relation table, a preset solution corresponding to the information such as the device model, the fault reason, and the like in the first report information or the second report information in the mapping relation table is searched by a table lookup manner, so that the fault processing is performed by using the corresponding preset solution.
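The table lookup described above can be sketched with a dictionary keyed by device model and fault cause. Every entry in `SOLUTIONS` is invented for illustration; the disclosure does not specify the contents of the mapping relation table.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping relation table: (device model, fault cause) -> solution.
SOLUTIONS: Dict[Tuple[str, str], str] = {
    ("model-A", "initialization incomplete"): "wait and retry enumeration",
    ("model-A", "link down"): "retrain link",
    ("model-B", "link down"): "reset device",
}


def lookup_solution(model: str, cause: str,
                    table: Dict[Tuple[str, str], str] = SOLUTIONS) -> Optional[str]:
    """Return the preset solution for a report, or None if none is mapped."""
    return table.get((model, cause))
```

A report for `("model-A", "link down")` maps to `"retrain link"`; an unmapped pair returns `None`, leaving the caller to choose a fallback.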
In summary, the device fault handling method provided by the disclosure includes: when the processor supports root port programmable input and output, shielding a first fault handling mechanism of the processor and configuring a third fault handling mechanism for the processor, or configuring a second fault handling mechanism and the third fault handling mechanism for the processor; determining, in response to fault information for the fault device, that the fault device is directly connected with the processor; and performing fault handling on the fault device by using the third fault handling mechanism. According to the method, a third processing mechanism is configured for the processor when the processor supports root port programmable input and output, so that when equipment directly connected with the processor fails, different fault processing is adopted for different types of faults. When the fault is a completion timeout error, only report information is generated, and whether to trigger port containment processing is then determined according to the message header information log, which verifies whether the fault information was generated by port containment processing already triggered for the fault equipment. This prevents fault equipment that has already triggered port containment processing from triggering it again, and ensures normal operation of non-fault equipment.
In summary, the present disclosure has the following beneficial effects:
1. Different fault processing mechanisms are configured for the processor according to the supporting capability of the processor, and fault processing is performed on the fault equipment by using different fault processing mechanisms according to the connection relation between the fault equipment and the processor. This avoids the interruption of normal PCIe equipment that results when EDPC technology is applied uniformly to every type of reported fault, and improves the granularity and accuracy of the fault processing.
2. When the processor supports root port programmable input and output, a second processing mechanism is configured for the processor, so that when equipment directly connected with the processor fails, different fault processing is adopted for different types of faults. When the fault is a configuration space unsupported request, only report information is generated and port containment processing is not triggered, which prevents normal equipment that has not yet been initialized from triggering fault processing and ensures normal operation of non-fault equipment.
3. When the processor supports root port programmable input and output, a third processing mechanism is configured for the processor, so that when equipment directly connected with the processor fails, different fault processing is adopted for different types of faults. When the fault is a completion timeout error, only report information is generated, and whether to trigger port containment processing is then determined according to the message header information log, which verifies whether the fault information was generated by port containment processing already triggered for the fault equipment. This prevents fault equipment that has already triggered port containment processing from triggering it again, and ensures normal operation of non-fault equipment.
The following is an exemplary description of the present disclosure:
Embodiment 1: when a fatal error occurs during enumeration of the device 1 shown in fig. 3, the method shown in fig. 9 is executed to perform fault handling, and the method includes the following steps:
When the device 1 fails, the trigger mode of the PCIe root port DPC can be configured as RP PIO (i.e., the second fault handling mechanism described above). By configuring the RP PIO related registers, some error types trigger the DPC function of the root port, while other error types do not trigger it and only produce an advisory fault report.
1. It is checked whether the PCIe root port supports DPC functions of the RP PIO mechanism.
2. When the PCIe root port supports the DPC function of the RP PIO mechanism, the PCIe root port is initialized to shield the AER mechanism, configure the RP PIO mechanism, set UR errors of configuration space requests not to trigger the DPC, and set the remaining errors to trigger the DPC (i.e., the second fault handling mechanism described above).
3. When the PCIe root port does not support the DPC function of the RP PIO mechanism, the PCIe root port initializes by configuring the conventional AER mechanism to trigger the DPC (i.e., the first failure handling mechanism described above).
4. The PCIe root port, as a requester, determines that a fatal fault has occurred in the device 1 (i.e., the processor receives fault information for the failed device).
5. The PCIe root port determines whether the processor supports the RP PIO mechanism.
6. When the PCIe root port does not support the RP PIO mechanism, the fatal error triggers the DPC using the conventional AER mechanism.
7. When the PCIe root port supports the RP PIO mechanism, further judging whether the fatal error is a UR error of the configuration space request.
8. If the fatal error is a UR error of the configuration space request, the fatal error does not trigger the DPC and only an advisory fault report is made.
9. If the fatal error is not a UR error of the configuration space request, the fatal error triggers the DPC using an RP PIO mechanism.
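The steps of Embodiment 1 can be sketched in code as follows. This is a simplified model under stated assumptions, not register-level code: the `RootPort` class, the enum names, and the boolean `is_cfg_ur` input are hypothetical stand-ins for the capability check, the register configuration, and the error classification performed by the root port.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mechanism(Enum):
    AER = auto()     # first fault handling mechanism (conventional AER)
    RP_PIO = auto()  # second fault handling mechanism (AER shielded)


class Action(Enum):
    TRIGGER_DPC = auto()
    ADVISORY_REPORT_ONLY = auto()


@dataclass
class RootPort:
    supports_rp_pio: bool                # hypothetical capability flag (step 1)
    mechanism: Mechanism = Mechanism.AER


def init_root_port(port: RootPort) -> None:
    """Steps 1-3: choose the DPC trigger mechanism at initialization."""
    port.mechanism = Mechanism.RP_PIO if port.supports_rp_pio else Mechanism.AER


def on_fatal_error(port: RootPort, is_cfg_ur: bool) -> Action:
    """Steps 4-9: decide how a fatal error seen by the root port is handled."""
    if port.mechanism is Mechanism.AER:
        return Action.TRIGGER_DPC           # step 6: conventional AER path
    if is_cfg_ur:
        return Action.ADVISORY_REPORT_ONLY  # step 8: configuration space UR
    return Action.TRIGGER_DPC               # step 9: any other fatal error
```

For instance, a root port that supports RP PIO and sees a configuration space UR during enumeration of the device 1 would only produce an advisory report, while the same error on a root port without RP PIO support would trigger the DPC.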
Embodiment 2: when a fatal error occurs during enumeration of the device 2 shown in fig. 3, the method shown in fig. 10 is executed to perform fault handling, and the method includes the following steps:
If DPC is triggered at the downstream port of PCIe switch 1 (i.e., device 2), this may be accompanied by one or more completion timeout (CTO) errors generated at the PCIe root device. The present disclosure filters out CTO failures by using the RP PIO mechanism, and accurately locates the source of each CTO failure to distinguish whether it is a CTO failure accompanying a DPC triggered at the downstream port of PCIe switch 1. The failure can thus be processed accurately: the root port is prevented from triggering DPC because of a CTO failure that merely accompanies that DPC, and the PCIe root device can continue to operate normally with the other PCIe switch (PCIe switch 2) and its downstream port (device 3).
1. The device initialization stage checks whether the PCIe root port supports DPC functions of the RP PIO mechanism.
2. When the PCIe root port supports the DPC function of the RP PIO mechanism, the PCIe root port is initialized to mask the AER mechanism, configure the RP PIO mechanism, set completion timeout errors (Cfg CTO, I/O CTO, and Mem CTO errors) not to trigger the DPC, and set the remaining errors to trigger the DPC (i.e., the third failure handling mechanism described above).
3. When the PCIe root port does not support the DPC function of the RP PIO mechanism, the PCIe root port is initialized by configuring the traditional AER mechanism to trigger the DPC.
4. The PCIe root port, as a requester, finds a fatal fault.
5. When the PCIe root port does not support the RP PIO mechanism, the fatal error triggers the DPC using the conventional AER mechanism.
6. When the PCIe root port supports the RP PIO mechanism, it is further determined whether the fatal error is a Cfg CTO, I/O CTO, or Mem CTO error.
7. If the fatal error is at least one of a Cfg CTO, I/O CTO, and Mem CTO error, the DPC is not triggered for the time being and only an advisory fault report is made; the advisory fault report can indicate the identity information, fault reason, and the like of the fault equipment to facilitate subsequent analysis and processing.
8. If the fatal error is not at least one of Cfg CTO, I/O CTO and Mem CTO error, the RP PIO mechanism is used to trigger DPC.
9. Further, when the user software receives the Cfg CTO, I/O CTO, or Mem CTO error discovered by the PCIe root port as the requester, the processor of the root port parses the message header information log (RP PIO Header Log) to determine the source of the CTO error.
10. Further, the type of the message header information log is determined. If the parsed type is a memory space read-write type, the device whose space address range contains the information address indicated by the message header information log is determined, and the bus information, device information, and function information (BDF) of that device are recorded. The type of the message header information log may be determined by collecting the memory-mapped I/O (MMIO) space applied for by the device.
11. If the parsed type is the configuration space read-write type, the bus information, device information, and function information of the device indicated by the message header information log are determined according to the information address offset of the message header information log.
12. It is determined whether the device indicated by the message header information log is the device 2.
13. If the device indicated by the message header information log is the device 2, no processing is performed and the fault processing ends.
14. If the device indicated by the message header information log is not the device 2, the PCIe root port executes the DPC.
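Steps 9 to 14 can be sketched as follows. The sketch assumes that the header log has already been reduced to a type tag and a single integer; the `HeaderLog` class, the `"mem"`/`"cfg"` tags, and the MMIO window table are hypothetical, and the real RP PIO Header Log layout is not modeled. The only layout assumed is the conventional 16-bit packing of bus, device, and function numbers (8, 5, and 3 bits).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

BDF = Tuple[int, int, int]  # (bus, device, function)


@dataclass(frozen=True)
class HeaderLog:
    kind: str   # "mem" or "cfg" (hypothetical type tags)
    value: int  # memory address for "mem"; 16-bit packed BDF for "cfg"


def bdf_from_cfg(value: int) -> BDF:
    """Step 11: unpack bus[15:8], device[7:3], function[2:0]."""
    return ((value >> 8) & 0xFF, (value >> 3) & 0x1F, value & 0x7)


def bdf_from_mem(addr: int, mmio: Dict[BDF, Tuple[int, int]]) -> Optional[BDF]:
    """Step 10: find the device whose [start, end) MMIO window holds addr."""
    for bdf, (start, end) in mmio.items():
        if start <= addr < end:
            return bdf
    return None


def cto_source(log: HeaderLog, mmio: Dict[BDF, Tuple[int, int]]) -> Optional[BDF]:
    """Step 9: resolve the device that the timed-out request targeted."""
    if log.kind == "cfg":
        return bdf_from_cfg(log.value)
    if log.kind == "mem":
        return bdf_from_mem(log.value, mmio)
    return None


def should_trigger_dpc(log: HeaderLog, mmio: Dict[BDF, Tuple[int, int]],
                       contained: BDF) -> bool:
    """Steps 12-14: skip DPC only if the CTO source is the contained device.

    Assumption of this sketch: an unresolvable source is treated like any
    other device, so the DPC is triggered.
    """
    return cto_source(log, mmio) != contained
```

With the device 2 at a hypothetical address of bus 2, device 0, function 0, a configuration-type log whose packed value is `0x0200` resolves to that device and the DPC is skipped, whereas a log resolving to any other device leads the root port to execute the DPC.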
In order to implement the device fault handling method provided by the embodiment of the present disclosure, the embodiment of the present disclosure further provides a device fault handling apparatus, as shown in fig. 11, where the device fault handling apparatus 1100 includes:
an initialization unit 1101, configured to perform an initialization configuration on a processor based on a fault handling mechanism supported by the processor;
a determining unit 1102, configured to determine a connection manner of the fault device and the processor in response to the fault information of the fault device; and
a processing unit 1103, configured to perform fault processing on the fault device based on the connection mode and/or the initialization configuration.
In some embodiments, the initialization unit 1101 is further configured to mask the first failure handling mechanism of the processor and configure the second failure handling mechanism and/or the third failure handling mechanism for the processor when the processor supports the root port programmable input/output, and configure the first failure handling mechanism for the processor when the processor does not support the root port programmable input/output.
In some embodiments, the first failure handling mechanism includes triggering port containment processing when the failed device has a fault; the second failure handling mechanism includes not triggering port containment processing when the fault of the failed device is a configuration space unsupported request; and the third failure handling mechanism includes not triggering port containment processing when the fault of the failed device is a completion timeout error, and triggering port containment processing when the fault of the failed device is an error other than a completion timeout error.
In some embodiments, the fault device is connected to the processor in a manner that includes a direct connection of the fault device to the processor and an indirect connection of the fault device to the processor through the switch.
In some embodiments, the processing unit 1103 is further configured to trigger port containment processing when the initialization configuration is the first failure handling mechanism, and to perform fault handling on the failed device based on the connection mode when the initialization configuration is the second failure handling mechanism or the third failure handling mechanism.
In some embodiments, the processing unit 1103 is further configured to perform fault handling on the fault device by using the second fault handling mechanism when the fault device is directly connected to the processor, and perform fault handling on the fault device by using the third fault handling mechanism when the fault device is indirectly connected to the processor through the switch.
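The way the units described above choose a mechanism and then decide on port containment can be sketched as follows. The sketch is illustrative only: the string labels for the mechanisms and error types are hypothetical, and it assumes that under the second mechanism every error other than a configuration space unsupported request triggers containment, as in Embodiment 1.

```python
from enum import Enum


class Connection(Enum):
    DIRECT = "direct"          # fault device attached directly to the processor
    VIA_SWITCH = "via_switch"  # fault device attached through a switch


# Error types that do NOT trigger port containment under each mechanism.
# The labels are hypothetical names for the error types in the description.
NON_TRIGGERING = {
    "first": frozenset(),
    "second": frozenset({"cfg_ur"}),
    "third": frozenset({"cfg_cto", "io_cto", "mem_cto"}),
}


def select_mechanism(supports_rp_pio: bool, connection: Connection) -> str:
    """Pick the mechanism from processor support and connection mode."""
    if not supports_rp_pio:
        return "first"
    return "second" if connection is Connection.DIRECT else "third"


def triggers_containment(mechanism: str, error: str) -> bool:
    """Return True when the error triggers port containment immediately."""
    return error not in NON_TRIGGERING[mechanism]
```

Under the third mechanism a completion timeout error does not trigger containment immediately; the later decision based on the header information log is outside the scope of this sketch.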
In some embodiments, the processing unit 1103 is further configured to determine, based on the fault information, whether the fault of the faulty device is a configuration space unsupported request, and perform fault processing on the faulty device based on a result of the determination.
In some embodiments, the processing unit 1103 is further configured to, when the result of the determination is that the fault of the faulty device is a configuration space unsupported request, generate first report information of the fault and not trigger the port containment processing, and, when the result of the determination is that the fault of the faulty device is not a configuration space unsupported request, trigger the port containment processing.
In some embodiments, the processing unit 1103 is further configured to determine whether the failure of the failed device is a completion timeout error based on the failure information, and perform failure processing on the failed device based on the result of the determination.
In some embodiments, the processing unit 1103 is further configured to trigger the port containment process when the determined result is that the failure of the failed device is not a completion timeout error, generate second report information of the failure when the determined result is that the failure of the failed device is a completion timeout error, and obtain a header information log corresponding to the failure, so as to determine whether to perform the port containment process based on the header information log.
In some embodiments, the processing unit 1103 is further configured to determine a space address indicated by the header information log based on the type of the header information log, so as to determine the first information set based on the space address, and to determine the first information set based on an information address offset of the header information log when the type of the header information log is a configuration space read-write type.
In some embodiments, the processing unit 1103 is further configured to determine the space address indicated by the header information log when the type of the header information log is a memory read-write type or an input-output read-write type, and to determine the space address based on an information address offset of the header information log when the type of the header information log is a configuration space read-write type.
In some embodiments, the processing unit 1103 is further configured to not perform the port containment process when it is determined that the device corresponding to the first information set is the failed device or is a switch connected to the failed device, and to perform the port containment process when it is determined that the device corresponding to the first information set is neither the failed device nor a switch connected to the failed device.
In some embodiments, the processing unit 1103 is further configured to determine a type of the header information log based on the memory mapped space applied by the failed device.
In some embodiments, the processing unit 1103 is further configured to determine equipment identity information of the failed equipment and a failure cause of the failed equipment based on the first reporting information or the second reporting information, and perform failure processing by using a preset solution based on the equipment identity information and the failure cause.
In some embodiments, the processing unit 1103 is further configured to uninstall the driver of the failed device and remove the device tag of the failed device, release the link state of the failed device and the processor, enumerate the failed device, and enable the driver of the failed device anew.
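The recovery sequence above can be sketched as an ordered series of operations. The `RecordingOps` class and its method names are hypothetical placeholders that only record the calls; a real implementation would replace them with platform-specific driver and link operations, which are not specified here.

```python
from typing import List


class RecordingOps:
    """Stand-in for platform operations; records calls, touches no hardware."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def unload_driver(self, dev: str) -> None:
        self.calls.append(f"unload_driver:{dev}")

    def remove_device_tag(self, dev: str) -> None:
        self.calls.append(f"remove_device_tag:{dev}")

    def release_link(self, dev: str) -> None:
        self.calls.append(f"release_link:{dev}")

    def enumerate(self, dev: str) -> None:
        self.calls.append(f"enumerate:{dev}")

    def load_driver(self, dev: str) -> None:
        self.calls.append(f"load_driver:{dev}")


def recover_device(ops: RecordingOps, dev: str) -> None:
    """Run the recovery steps in the order given in the description."""
    ops.unload_driver(dev)      # uninstall the driver of the failed device
    ops.remove_device_tag(dev)  # remove its device tag
    ops.release_link(dev)       # release the link state with the processor
    ops.enumerate(dev)          # enumerate the device again
    ops.load_driver(dev)        # enable the driver anew
```

Passing the operations in as an object keeps the ordering logic separate from the platform-specific actions, so the sequence can be exercised without a physical device.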
In summary, the device fault processing apparatus provided according to the present disclosure includes an initialization unit configured to perform an initialization configuration on a processor based on a fault processing mechanism supported by the processor, a determination unit configured to determine a connection manner between the fault device and the processor in response to fault information of the fault device, and a processing unit configured to perform fault processing on the fault device based on the connection manner and/or the initialization configuration. According to the apparatus of the present disclosure, different fault processing mechanisms are configured for the processor according to the supporting capability of the processor, and fault processing is performed on the fault equipment by using different fault processing mechanisms according to the connection relation between the fault equipment and the processor. This avoids the interruption of normal PCIe equipment that results when EDPC technology is applied uniformly to every type of reported fault, and improves the granularity and accuracy of the fault processing.
It should be noted that, when the device fault processing apparatus provided in the foregoing embodiment performs device fault processing, only the division of each program module is used as an example, in practical application, the processing allocation may be performed by different program modules according to needs, that is, the internal structure of the device fault processing apparatus is divided into different program modules, so as to complete all or part of the processing described above. In addition, the device fault handling apparatus provided in the foregoing embodiments and the device fault handling method embodiment provided in the embodiments of the present disclosure belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment, and are not described herein again.
Fig. 12 is a schematic diagram of a hardware composition structure of an electronic device according to an embodiment of the disclosure, as shown in fig. 12, where the electronic device 1200 includes at least one processor 1202, and a memory 1201 communicatively connected to the at least one processor 1202, where the memory 1201 stores a command executable by the at least one processor 1202, and the command is executed by the at least one processor 1202 to implement the steps of the device failure processing method according to the embodiment of the disclosure.
Optionally, the electronic device may specifically be the device fault handling apparatus in the embodiment of the present application, and the electronic device may implement the corresponding flows implemented by the device fault handling apparatus in each method of the embodiment of the present application, which are not described herein for brevity.
It is understood that the electronic device also includes a communication interface 1203. The various components in the electronic device are coupled together by a bus system 1204. It is appreciated that the bus system 1204 is used to facilitate connected communications between these components. The bus system 1204 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration, the various buses are labeled as bus system 1204 in fig. 12.
It is to be appreciated that memory 1201 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The non-volatile memory may be, among other things, a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a magnetic random access memory (FRAM, Ferromagnetic Random Access Memory), a flash memory (Flash Memory), a magnetic surface memory, an optical disk, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be random access memory (RAM, Random Access Memory), which acts as external cache memory. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synchronous link dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct Rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memory 1201 described by embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The methods disclosed in the embodiments of the present disclosure described above may be applied to the processor 1202 or implemented by the processor 1202. The processor 1202 has signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware in the processor 1202 or by commands in software. The processor 1202 may be a general purpose processor, DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 1202 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiment of the invention can be directly embodied in the hardware of the decoding processor or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium in the memory 1201 and the processor 1202 reads information in the memory 1201 to perform the steps of the method in combination with its hardware.
In an exemplary embodiment, the electronic device may be implemented by one or more Application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs, programmable Logic Device), complex programmable logic devices (CPLDs, complex Programmable Logic Device), FPGAs, general purpose processors, controllers, MCUs, microprocessors, or other electronic elements for performing the aforementioned methods.
The disclosed embodiments also provide a non-transitory computer-readable storage medium storing computer commands for causing the computer to execute the steps of the device failure handling method of the disclosed embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the device fault handling method of the disclosed embodiments.
Optionally, the computer readable storage medium may be applied to the device fault handling apparatus in the embodiment of the present application, and the computer command causes a computer to execute a corresponding flow implemented by the device fault handling apparatus in each method of the embodiment of the present application, which is not described herein for brevity.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communicative connection between the components shown or discussed may be through some interface, and the indirect coupling or communicative connection between devices or units may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as a unit, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of hardware plus a form of software functional unit.
It will be appreciated by those of ordinary skill in the art that implementing all or part of the steps of the above method embodiments may be accomplished by hardware associated with program instructions, and that the above program may be stored on a computer readable storage medium which, when executed, performs the steps comprising the above method embodiments, where the above storage medium includes various media that can store program code, such as removable storage devices, ROM, RAM, magnetic or optical disks.
Or the above-described integrated units of the invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium, comprising several commands for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program codes such as a removable storage device, a ROM, a RAM, a magnetic disk or an optical disk.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.