CN118245269B - PCI equipment fault processing method and device and fault processing system - Google Patents
- Publication number
- CN118245269B (application CN202410671171.0A)
- Authority
- CN
- China
- Prior art keywords
- pci
- failure
- fault
- bridge link
- devices
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0745—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in an input/output transactions management context
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/40—Bus structure
- G06F13/4004—Coupling between buses
- G06F13/4027—Coupling between buses using bus bridges
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
- G06F13/4204—Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
- G06F13/4221—Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being an input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computer Hardware Design (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Debugging And Monitoring (AREA)
- Small-Scale Networks (AREA)
Abstract
The embodiments of the application provide a fault handling method and apparatus for a PCI device, and a fault handling system. The method comprises: detecting the PCI devices connected on a PCI bridge link; determining a fault handling mode for the PCI devices according to the detected PCI devices, and handling faults of the PCI devices according to that mode; in the case where the only PCI device connected on the PCI bridge link is an EP device, starting the fault handling function of the EP device; and in the case where the PCI devices connected on the PCI bridge link include a PCI SWITCH device, setting a failure threshold for the PCI SWITCH device and adjusting the transmission rate of the PCI bridge link based on the failure threshold. The application solves the problem in the related art that the faulty PCI device cannot be accurately identified, so that an accurate fault handling mode cannot be determined.
Description
Technical Field
The embodiments of the application relate to the field of servers, and in particular to a fault handling method and apparatus for a PCI device, and a fault handling system.
Background
A server of any architecture must support basic core modules such as memory and Peripheral Component Interconnect (PCI) bus devices. The server connects to external PCI devices through the PCI bus; conventional PCI devices include network cards, storage cards, accelerator cards, and graphics processing unit (GPU) cards with a PCI interface. Whatever the component, a PCI device must conform to the PCI bus protocol standard. However, because the PCI bus runs at high speed (currently up to 32 GT/s), a PCI bus protocol failure can cause serious or fatal problems such as loss of a PCI device or server downtime. Faults on a PCI link fall into several error types, and a single PCI link can carry several devices. The related art cannot accurately locate which device on the PCI link caused a fault, so the fault cannot be handled accurately.
Disclosure of Invention
The embodiments of the application provide a fault handling method and apparatus for a PCI device, and a fault handling system, which at least solve the problem in the related art that the faulty PCI device cannot be accurately identified, so that an accurate fault handling mode cannot be determined.
According to an embodiment of the present application, there is provided a fault handling method for a PCI device, applied to a fault handling system, where the fault handling system is disposed on a motherboard of a server and is connected to a PCI bridge link deployed in the server. The method includes: detecting the PCI devices connected on the PCI bridge link, where the PCI devices include an EP device and/or a PCI SWITCH device, and a plurality of EP devices are allowed to be connected to the PCI SWITCH device; determining a fault handling mode for the PCI devices according to the detected PCI devices, and handling faults of the PCI devices according to the fault handling mode; in the case where the only PCI device connected on the PCI bridge link is the EP device, starting a fault handling function of the EP device, where the fault handling function is a function provided in the EP device and is used to handle faults of the EP device; and in the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device, setting a failure threshold of the PCI SWITCH device and adjusting the transmission rate of the PCI bridge link based on the failure threshold, where the failure threshold is a threshold on the recorded number of failures of the PCI SWITCH device, and the fault handling mode includes either starting the fault handling function or setting the failure threshold of the PCI SWITCH device.
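The choice between the two fault handling modes described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `DeviceKind`, `Mode`, and `select_mode` are hypothetical, and the detected devices are assumed to be available as a simple list of kinds.

```python
from enum import Enum


class DeviceKind(Enum):
    EP = "ep"
    SWITCH = "switch"


class Mode(Enum):
    # Hypothetical labels for the two modes named in the text.
    EP_FAULT_FUNCTION = "start the EP device's own fault handling function"
    SWITCH_THRESHOLD = "set a failure threshold and adjust the link rate"


def select_mode(detected):
    """Pick a fault handling mode from the kinds of devices found on the link."""
    kinds = set(detected)
    if not kinds:
        raise ValueError("no PCI device detected on the bridge link")
    # Any PCI SWITCH device on the link selects the threshold-based mode,
    # whether or not EP devices are also present.
    if DeviceKind.SWITCH in kinds:
        return Mode.SWITCH_THRESHOLD
    # Otherwise only EP devices were found.
    return Mode.EP_FAULT_FUNCTION
```

The sketch treats "includes a PCI SWITCH device" as taking priority over "only EP devices", which is how the two cases in the text partition the possible topologies.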
In an exemplary embodiment, starting the fault handling function of the EP device in the case where the only PCI device connected on the PCI bridge link is the EP device includes: setting the AER function of the EP device to an enabled state or a started state so as to start the fault handling function of the EP device, where the AER function includes the fault handling function.
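The text does not say which registers are written to put the AER function into an enabled state. As one hedged illustration, PCI Express devices expose four error-reporting enable bits in the low bits of the 16-bit Device Control register; the sketch below sets or clears those bits on a register value. Treating these bits as the "enabled state" is an assumption, and the function names are hypothetical. No hardware access is performed.

```python
# Error-reporting enable bits of the PCI Express Device Control register.
CORRECTABLE_ERR_REPORT_EN = 1 << 0
NONFATAL_ERR_REPORT_EN = 1 << 1
FATAL_ERR_REPORT_EN = 1 << 2
UNSUPPORTED_REQ_REPORT_EN = 1 << 3

ERROR_REPORT_MASK = (
    CORRECTABLE_ERR_REPORT_EN
    | NONFATAL_ERR_REPORT_EN
    | FATAL_ERR_REPORT_EN
    | UNSUPPORTED_REQ_REPORT_EN
)


def enable_error_reporting(devctl):
    """Return the 16-bit register value with all four reporting bits set."""
    return (devctl | ERROR_REPORT_MASK) & 0xFFFF


def disable_error_reporting(devctl):
    """Return the 16-bit register value with all four reporting bits cleared."""
    return devctl & ~ERROR_REPORT_MASK & 0xFFFF
```

Both helpers leave the remaining register bits untouched, so they can be applied to a value read back from the device without disturbing unrelated settings.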
In an exemplary embodiment, in the case where the only PCI device connected on the PCI bridge link is the EP device, after the fault handling function of the EP device is started, the method further includes: monitoring a first fault event of the EP device, where the first fault event includes at least one of: a first fault event that is allowed to be repaired; a first fault event that is not allowed to be repaired and that leaves the EP device unable to operate; and a first fault event that is not allowed to be repaired but under which the EP device can continue to operate; and, in the case where the first fault event is detected, reporting the first fault event to a target operating system and a baseboard management controller in the server. The target operating system is configured to perform one of the following operations based on the first fault event: restarting the server, executing a self-check operation, or generating fault prompt information. The baseboard management controller is configured to perform at least one of the following operations based on the first fault event: recording the first fault event, executing a fault check operation, and repairing the first fault event.
In an exemplary embodiment, setting the failure threshold of the PCI SWITCH device in the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device includes: determining a target register provided in the PCI SWITCH device, where the target register is used to record the number of failures of the PCI SWITCH device; and setting the failure threshold of the target register according to preset conditions.
In an exemplary embodiment, the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device is one of the following: the only PCI device connected on the PCI bridge link is the PCI SWITCH device; the only PCI device connected on the PCI bridge link is the PCI SWITCH device, and one or more EP devices are connected to the PCI SWITCH device; the PCI devices connected on the PCI bridge link include the PCI SWITCH device and the EP device, where the PCI SWITCH device is connected to a first processor in the server and the EP device is connected to a second processor in the server; or the PCI devices connected on the PCI bridge link include the PCI SWITCH device and the EP device, where the PCI SWITCH device is connected to a first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP device is connected to a second processor in the server.
In an exemplary embodiment, setting the failure threshold of the target register according to preset conditions includes setting the failure threshold according to the preset conditions in at least one of the following ways: determining operation information of the PCI SWITCH device and setting the failure threshold based on the operation information, where the operation information includes at least one of: the operating environment, the operating performance, the operating security, and the operating stability; and determining a first number of PCI SWITCH devices and a second number of EP devices connected to the PCI SWITCH devices, and setting the failure threshold based on the first number and the second number.
In an exemplary embodiment, after setting the failure threshold of the target register according to preset conditions, the method further includes: monitoring the PCI SWITCH device and the one or more EP devices connected to the PCI SWITCH device; and, when the PCI SWITCH device fails or any EP device connected to the PCI SWITCH device fails, triggering a counter in the target register to perform a failure accumulation operation, where the failure accumulation operation accumulates the number of failures of the PCI SWITCH device.
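The accumulation step can be illustrated with a small software model of the target register's counter. This is only a sketch: the class `SwitchFaultCounter` and its fields are hypothetical, and a real device would hold the count in hardware.

```python
class SwitchFaultCounter:
    """Software model of a per-switch failure counter with a threshold."""

    def __init__(self, threshold, switch, downstream_eps):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        # Faults from the switch itself and from any EP device attached to it
        # are all counted against the same switch counter.
        self.sources = {switch, *downstream_eps}
        self.count = 0

    def on_fault(self, source):
        """Accumulate one fault; return True once the threshold is reached."""
        if source not in self.sources:
            return False  # fault belongs to a device outside this switch
        self.count += 1
        return self.count >= self.threshold
```

A single counter per switch means the count alone does not say which downstream device failed; it only signals that the link under the switch has become unreliable enough to act on.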
In an exemplary embodiment, in the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device, before setting the failure threshold of the PCI SWITCH device, the method further includes one of: in the case where one or more EP devices are connected to the PCI SWITCH device, starting the fault handling function of those EP devices, and receiving, through the PCI SWITCH device, the result of the EP devices handling faults according to the fault handling function; in the case where the PCI SWITCH device is connected to a first processor in the server and the EP device is connected to a second processor in the server, turning off the fault handling function of the EP device; and in the case where the PCI SWITCH device is connected to the first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP device is connected to the second processor in the server, turning off the fault handling function of the EP device, turning on the fault handling functions of the other EP devices, and receiving, through the PCI SWITCH device, the results of the other EP devices handling faults according to their fault handling functions.
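The three cases above reduce to one rule: EP devices attached to the PCI SWITCH device have their fault handling function turned on, and an EP device attached to the other processor has it turned off. A minimal sketch, with the hypothetical function `fault_function_policy` and device names chosen only for illustration:

```python
def fault_function_policy(eps_under_switch, eps_on_other_processor):
    """Map each EP device name to True (function on) or False (function off)."""
    policy = {}
    for ep in eps_under_switch:
        # Results from these devices are received through the switch.
        policy[ep] = True
    for ep in eps_on_other_processor:
        # EP devices on the second processor are handled without
        # their own fault handling function in these cases.
        policy[ep] = False
    return policy
```

For example, with `"ep1"` and `"ep2"` under the switch and `"ep0"` on the second processor, the policy turns the function on for the first two and off for the third.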
In an exemplary embodiment, adjusting the transmission rate of the PCI bridge link based on the failure threshold includes: in the case where the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold, triggering adjustment of the transmission rate of the PCI bridge link to obtain a first adjusted transmission rate.
In an exemplary embodiment, triggering adjustment of the transmission rate of the PCI bridge link to obtain the first adjusted transmission rate, in the case where the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold, includes at least one of: in the case where the first number of failures is greater than or equal to the failure threshold, triggering adjustment of the transmission rate of the PCI SWITCH device to obtain the first adjusted transmission rate; and in the case where the first number of failures is greater than or equal to the failure threshold, triggering adjustment of the transmission rate of one or more EP devices connected to the PCI SWITCH device to obtain the first adjusted transmission rate.
In an exemplary embodiment, after triggering adjustment of the transmission rate of the PCI bridge link and obtaining the first adjusted transmission rate in the case where the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold, the method further includes: clearing the first number of failures; and recording the number of failures of the PCI bridge link again to obtain a second number of failures.
In an exemplary embodiment, after recording the number of failures of the PCI bridge link again and obtaining the second number of failures, the method further includes: in the case where the second number of failures is greater than or equal to the failure threshold, again triggering adjustment of the transmission rate of the PCI bridge link to obtain a second adjusted transmission rate; and controlling the operation of the PCI bridge link based on the second adjusted transmission rate.
In an exemplary embodiment, after controlling the operation of the PCI bridge link based on the second adjusted transmission rate, the method further includes: stopping adjustment of the transmission rate of the PCI bridge link when the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link; and, in the case where the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and a recorded third number of failures of the PCI bridge link is greater than or equal to the failure threshold, triggering adjustment of the network bandwidth of the PCI bridge link to obtain an adjusted network bandwidth, where the third number of failures is the number of failures recorded while the PCI bridge link operates at the second adjusted transmission rate.
In an exemplary embodiment, after triggering adjustment of the network bandwidth of the PCI bridge link and obtaining the adjusted network bandwidth, in the case where the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third number of failures of the PCI bridge link is greater than or equal to the failure threshold, the method further includes: generating a second fault event if the adjusted network bandwidth is equal to the lowest network bandwidth of the PCI bridge link, where the second fault event includes at least one of: a second fault event that is allowed to be repaired; a second fault event that is not allowed to be repaired and that leaves the PCI bridge link unable to operate; and a second fault event that is not allowed to be repaired but under which the PCI bridge link can continue to operate; and sending the second fault event to a baseboard management controller in the server, where the baseboard management controller is used to record the second fault event and generate fault handling information, and the fault handling information indicates that the faulty PCI SWITCH device and/or the faulty EP device connected to the PCI SWITCH device should be replaced.
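The escalation described in the preceding embodiments (lower the rate each time the count reaches the threshold, clear the count, and once the lowest rate is reached reduce the bandwidth instead, finally raising a second fault event) can be sketched as a small state machine. Several points are assumptions made for illustration: the rate steps are the per-lane PCI Express generation rates, "network bandwidth" is read as link width in lanes, each adjustment moves one step, and the class name `LinkDegrader` is hypothetical.

```python
from dataclasses import dataclass, field

# Assumed step tables: PCIe per-lane rates (GT/s) and lane widths.
RATES_GT_S = [32.0, 16.0, 8.0, 5.0, 2.5]
WIDTHS = [16, 8, 4, 2, 1]


@dataclass
class LinkDegrader:
    threshold: int
    rate_idx: int = 0
    width_idx: int = 0
    count: int = 0
    bmc_events: list = field(default_factory=list)

    def record_fault(self):
        """Count one link fault and apply the next adjustment if due."""
        self.count += 1
        if self.count < self.threshold:
            return "counting"
        self.count = 0  # clear the count and start recording again
        if self.rate_idx < len(RATES_GT_S) - 1:
            self.rate_idx += 1
            return "rate_lowered"
        # Rate is already at its lowest: adjust the width instead.
        if self.width_idx < len(WIDTHS) - 1:
            self.width_idx += 1
            if self.width_idx == len(WIDTHS) - 1:
                # Lowest width reached: report to the BMC (modelled as a list).
                self.bmc_events.append("second_fault_event")
            return "width_lowered"
        return "exhausted"
```

With a threshold of 2, the first eight faults walk the rate from 32 GT/s down to 2.5 GT/s, the next eight walk the width from x16 down to x1, and the second fault event is recorded when x1 is reached.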
In an exemplary embodiment, after adjusting the transmission rate of the PCI bridge link based on the failure threshold, the method further includes: generating target information and sending the target information to a baseboard management controller deployed in the server, where the target information includes a target transmission rate, the target transmission rate is the rate of the PCI bridge link after adjustment, and the target transmission rate is lower than the transmission rate of the PCI bridge link before adjustment; and recording the target transmission rate through the baseboard management controller, and generating prompt information to indicate that the first number of failures of the PCI bridge link is greater than the failure threshold.
In an exemplary embodiment, detecting the PCI devices connected on the PCI bridge link includes: starting the fault handling system when the server is started, and detecting N tiers on the PCI bridge link, where N is a natural number greater than or equal to 1; and determining the type of the PCI device in each of the tiers.
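Tier-by-tier detection can be illustrated as a breadth-first walk over a description of the link topology. The sketch assumes the topology has already been read into a dictionary mapping each device name to its kind and its downstream devices; that layout, the device names, and the function `classify_tiers` are all hypothetical.

```python
def classify_tiers(topology, first_tier):
    """Return one list per tier of (device name, device kind) pairs.

    topology maps a device name to (kind, [names of downstream devices]).
    first_tier lists the devices attached directly to the bridge link.
    """
    tiers = []
    frontier = list(first_tier)
    while frontier:
        tiers.append([(name, topology[name][0]) for name in frontier])
        # The next tier is everything attached below the current tier.
        frontier = [child for name in frontier for child in topology[name][1]]
    return tiers
```

The number of lists returned corresponds to N, and the kinds found in the tiers (EP only, or at least one switch) are what the fault handling mode is then chosen from.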
According to another embodiment of the present application, there is provided a fault handling apparatus for a PCI device, applied to a fault handling system, where the fault handling system is disposed on a motherboard of a server and is connected to a PCI bridge link deployed in the server. The apparatus includes: a first detection module, configured to detect the PCI devices connected on the PCI bridge link, where the PCI devices include an EP device and/or a PCI SWITCH device, and a plurality of EP devices are allowed to be connected to the PCI SWITCH device; and a first processing module, configured to determine a fault handling mode for the PCI devices according to the detected PCI devices and to handle faults of the PCI devices according to the fault handling mode. The first processing module includes: a first processing unit, configured to start a fault handling function of the EP device in the case where the only PCI device connected on the PCI bridge link is the EP device, where the fault handling function is a function provided in the EP device and is used to handle faults of the EP device; and a second processing unit, configured to, in the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device, set a failure threshold of the PCI SWITCH device and adjust the transmission rate of the PCI bridge link based on the failure threshold, where the failure threshold is a threshold on the recorded number of failures of the PCI SWITCH device, and the fault handling mode includes either starting the fault handling function or setting the failure threshold of the PCI SWITCH device.
In an exemplary embodiment, the first processing unit includes: and a first setting subunit, configured to set, when the PCI device connected to the PCI bridge link is only the EP device, an AER function of the EP device to an enabled state or a started state, so as to start a fault handling function of the EP device, where the AER function includes the fault handling function.
In an exemplary embodiment, the apparatus further includes: a first monitoring module, configured to monitor a first fault event of the EP device after the fault handling function of the EP device is started in the case where the only PCI device connected on the PCI bridge link is the EP device, where the first fault event includes at least one of: a first fault event that is allowed to be repaired; a first fault event that is not allowed to be repaired and that leaves the EP device unable to operate; and a first fault event that is not allowed to be repaired but under which the EP device can continue to operate; and a first reporting module, configured to report the first fault event to a target operating system and a baseboard management controller in the server in the case where the first fault event is detected. The target operating system is configured to perform one of the following operations based on the first fault event: restarting the server, executing a self-check operation, or generating fault prompt information. The baseboard management controller is configured to perform at least one of the following operations based on the first fault event: recording the first fault event, executing a fault check operation, and repairing the first fault event.
In an exemplary embodiment, the second processing unit includes: a first determining subunit, configured to determine a target register provided in the PCI SWITCH device in the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device, where the target register is used to record the number of failures of the PCI SWITCH device; and a first setting subunit, configured to set the failure threshold of the target register according to preset conditions.
In an exemplary embodiment, the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device is one of the following: the only PCI device connected on the PCI bridge link is the PCI SWITCH device; the only PCI device connected on the PCI bridge link is the PCI SWITCH device, and one or more EP devices are connected to the PCI SWITCH device; the PCI devices connected on the PCI bridge link include the PCI SWITCH device and the EP device, where the PCI SWITCH device is connected to a first processor in the server and the EP device is connected to a second processor in the server; or the PCI devices connected on the PCI bridge link include the PCI SWITCH device and the EP device, where the PCI SWITCH device is connected to a first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP device is connected to a second processor in the server.
In an exemplary embodiment, the first setting subunit includes: a first setting sub-module, configured to set the fault threshold of the target register according to the preset condition and by at least one of: determining operation information of the PCI SWITCH equipment, and setting the fault threshold based on the operation information, wherein the operation information comprises at least one of the following: the environment of operation, the performance of operation, the safety performance of operation and the stability performance of operation; determining a first number of said PCI SWITCH devices and a second number of said EP devices connected in said PCI SWITCH devices, and setting said failure threshold based on said first number and said second number.
In an exemplary embodiment, the above apparatus further includes: a first detection module, configured to monitor the PCI SWITCH device and the one or more EP devices connected to the PCI SWITCH device after the failure threshold of the target register is set according to preset conditions; and a first triggering module, configured to trigger a counter in the target register to perform a failure accumulation operation when the PCI SWITCH device fails or any one of the EP devices connected to the PCI SWITCH device fails, where the failure accumulation operation accumulates the number of failures of the PCI SWITCH device.
In an exemplary embodiment, the apparatus further includes one of the following modules, each of which acts before the failure threshold of the PCI SWITCH device is set in the case where the PCI devices connected on the PCI bridge link include the PCI SWITCH device: a first opening module, configured to, in the case where one or more EP devices are connected to the PCI SWITCH device, turn on the fault handling function of those EP devices and receive, through the PCI SWITCH device, the result of the EP devices handling faults according to the fault handling function; a first shutdown module, configured to turn off the fault handling function of the EP device in the case where the PCI SWITCH device is connected to a first processor in the server and the EP device is connected to a second processor in the server; and a second processing module, configured to, in the case where the PCI SWITCH device is connected to the first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP device is connected to the second processor in the server, turn off the fault handling function of the EP device, turn on the fault handling functions of the other EP devices, and receive, through the PCI SWITCH device, the results of the other EP devices handling faults according to their fault handling functions.
In an exemplary embodiment, the second processing unit includes: and the first triggering subunit is used for triggering and adjusting the transmission rate of the PCI bridge link to obtain a first adjusted transmission rate under the condition that the accumulated first failure times of the PCI bridge link failure is larger than or equal to the failure threshold value.
In an exemplary embodiment, the first triggering subunit includes at least one of: the first triggering sub-module is used for triggering and adjusting the transmission rate of the PCI SWITCH equipment to obtain the first adjusted transmission rate under the condition that the first failure times are larger than or equal to the failure threshold value; and the second triggering sub-module is used for triggering and adjusting the transmission rate of one or more EP devices connected in the PCI SWITCH devices under the condition that the first failure times are greater than or equal to the failure threshold value, so as to obtain the first adjusted transmission rate.
In an exemplary embodiment, the above apparatus further includes: the first clearing module is used for triggering and adjusting the transmission rate of the PCI bridge link under the condition that the accumulated first failure times of the PCI bridge link failure is larger than or equal to the failure threshold value, and clearing the first failure times after the first adjusted transmission rate is obtained; and the first recording module is used for recording the times of the faults of the PCI bridge link again to obtain second times of faults.
In an exemplary embodiment, the above apparatus further includes: the first adjusting module is used for re-recording the times of the faults of the PCI bridge link, obtaining a second fault times, and continuously triggering and adjusting the transmission rate of the PCI bridge link to obtain a second adjusted transmission rate under the condition that the second fault times are larger than or equal to the fault threshold value; and the first control module is used for controlling the operation of the PCI bridge link based on the second adjusted transmission rate.
In an exemplary embodiment, the above apparatus further includes: a first stopping module, configured to stop adjustment of the transmission rate of the PCI bridge link when the second adjusted transmission rate is equal to a minimum rate of the PCI bridge link after controlling the operation of the PCI bridge link based on the second adjusted transmission rate; and the second triggering module is used for triggering and adjusting the network bandwidth of the PCI bridge link to obtain the adjusted network bandwidth under the condition that the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third failure times of the PCI bridge link failure is greater than or equal to the failure threshold value, wherein the third failure times are the failure times recorded in the process that the PCI bridge link operates according to the second adjusted transmission rate.
In an exemplary embodiment, the above apparatus further includes: a first generating module, configured to, after adjustment of the network bandwidth of the PCI bridge link has been triggered and the adjusted network bandwidth obtained in the case where the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third number of failures of the PCI bridge link is greater than or equal to the failure threshold, generate a second fault event when the adjusted network bandwidth is equal to the lowest network bandwidth of the PCI bridge link, where the second fault event includes at least one of: a second fault event that is allowed to be repaired; a second fault event that is not allowed to be repaired and that leaves the PCI bridge link unable to operate; and a second fault event that is not allowed to be repaired but under which the PCI bridge link can continue to operate; and a first sending module, configured to send the second fault event to a baseboard management controller in the server, where the baseboard management controller is configured to record the second fault event and generate fault handling information, and the fault handling information indicates that the faulty PCI SWITCH device and/or the faulty EP device connected to the PCI SWITCH device should be replaced.
In an exemplary embodiment, the above apparatus further includes: a second generating module, configured to generate target information after adjusting a transmission rate of the PCI bridge link based on the failure threshold, and send the target information to a baseboard management controller disposed in the server, where the target information includes a target transmission rate, the target transmission rate is a rate after adjusting the transmission rate of the PCI bridge link, and the target transmission rate is smaller than a transmission rate before adjusting the PCI bridge link; and the second recording module is used for recording the target transmission rate through the baseboard management controller and generating prompt information so as to prompt that the first failure frequency of the PCI bridge link failure is larger than the failure threshold value.
In an exemplary embodiment, the first detection module includes: a first detection unit, configured to start the fault handling system when the server is started, and detect N levels on the PCI bridge link, where N is a natural number greater than or equal to 1; and a first determining unit, configured to determine the type of the PCI device in each of the levels.
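As an illustration of detecting the N levels and determining the device type in each level, the following sketch walks a device tree breadth first. The `PciNode` structure and the type labels `"EP"` and `"SWITCH"` are assumptions made for this example; real firmware would obtain the hierarchy by scanning configuration space rather than from an in-memory tree.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PciNode:
    name: str
    kind: str  # "EP" or "SWITCH" (labels assumed for this sketch)
    children: list = field(default_factory=list)


def types_per_level(roots):
    """Return one list per level, each holding (name, kind) pairs."""
    levels = []
    queue = deque((node, 0) for node in roots)
    while queue:
        node, depth = queue.popleft()
        if depth == len(levels):
            # First device seen at this depth: open a new level.
            levels.append([])
        levels[depth].append((node.name, node.kind))
        for child in node.children:
            queue.append((child, depth + 1))
    return levels
```

The length of the returned list is N, and each entry gives the device types found at that level.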
According to yet another embodiment of the present application, there is also provided a fault handling system, provided on a motherboard of a server, the fault handling system including a target processor, the target processor being connected to a PCI bridge link deployed in the server, the target processor being configured to implement the steps of the above-described method.
According to still another embodiment of the present application, there is also provided a server including: a PCI bridge link, and a fault handling system as described above, wherein a PCI device is connected to the PCI bridge link.
According to a further embodiment of the application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to a further embodiment of the present application, there is also provided a non-transitory computer-readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the application, the PCI devices connected to the PCI bridge link are detected first, a fault processing mode is then determined according to the detected PCI devices, and faults of the PCI devices are processed according to that mode. When the PCI devices connected to the PCI bridge link include a PCI SWITCH device, a failure threshold of the PCI SWITCH device is set, and the transmission rate of the PCI bridge link is adjusted based on the failure threshold. This solves the problem in the related art that the failed PCI device cannot be accurately determined, so that an accurate fault processing mode cannot be determined, and improves the accuracy of processing PCI device faults.
Drawings
FIG. 1 is a block diagram of a hardware configuration of a mobile terminal of a fault handling method for PCI devices according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of fault handling of a PCI device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a connection of a PCI device according to an embodiment of the present application;
FIG. 4 is a second schematic diagram of a PCI device connection according to an embodiment of the present application;
FIG. 5 is a third schematic connection diagram of a PCI device according to an embodiment of the present application;
FIG. 6 is an interaction diagram of various devices performing error information processing in accordance with a specific embodiment of the present application;
FIG. 7 is a flowchart of BIOS error message processing according to an embodiment of the present application;
FIG. 8 is a block diagram of a fault handling system in accordance with an embodiment of the present application;
FIG. 9 is a block diagram of a server according to an embodiment of the present application;
FIG. 10 is a block diagram of a failure handling apparatus of a PCI device according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a non-transitory computer-readable storage medium for a failure handling method of a PCI device according to an embodiment of the application;
FIG. 12 is a schematic hardware configuration diagram of an electronic device for a failure processing method of a PCI device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking the mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a fault handling method of a PCI device according to an embodiment of the present application. As shown in fig. 1, a mobile terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, wherein the mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store computer programs, such as software programs of application software and modules, such as computer programs corresponding to the fault handling method of the PCI device in the embodiment of the present application, and the processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a fault handling method of a PCI device is provided, which is applied to a fault handling system, where the fault handling system is disposed on a motherboard of a server, and the fault handling system is connected to a PCI bridge link deployed in the server, and fig. 2 is a flowchart of the fault handling method of the PCI device according to an embodiment of the present application, as shown in fig. 2, where the flowchart includes the following steps:
step S202, detecting PCI devices connected on the PCI bridge link, wherein the PCI devices comprise EP devices and/or PCI SWITCH devices, and a plurality of EP devices are allowed to be connected on the PCI SWITCH device;
step S204, determining a fault processing mode of the PCI equipment according to the detected PCI equipment, and processing the fault of the PCI equipment according to the fault processing mode;
Wherein, in the case that the PCI device connected to the PCI bridge link is only the EP device, a failure processing function of the EP device is started, the failure processing function being a function provided in the EP device for processing a failure of the EP device; in the case that the PCI device connected to the PCI bridge link includes the PCI SWITCH device, a failure threshold of the PCI SWITCH device is set, and a transmission rate of the PCI bridge link is adjusted based on the failure threshold, the failure threshold being a threshold on the number of failures of the PCI SWITCH device; the fault processing mode includes the failure processing function and the manner of setting the failure threshold of the PCI SWITCH device.
Through the above steps, the PCI devices connected to the PCI bridge link are detected first, a fault processing mode is then determined according to the detected PCI devices, and faults of the PCI devices are processed according to that mode. When the PCI devices connected to the PCI bridge link include a PCI SWITCH device, a failure threshold of the PCI SWITCH device is set, and the transmission rate of the PCI bridge link is adjusted based on the failure threshold. This solves the problem in the related art that the failed PCI device cannot be accurately determined, so that an accurate fault processing mode cannot be determined, and improves the accuracy of processing PCI device faults.
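The choice made in steps S202 and S204 can be summarized in a minimal sketch. The function name, the type labels `"EP"` and `"SWITCH"`, and the returned mode strings are hypothetical and serve only to make the two branches explicit.

```python
def select_fault_handling(device_kinds):
    """Pick the fault processing mode from the detected device types."""
    kinds = set(device_kinds)
    if not kinds:
        raise ValueError("no PCI device detected on the link")
    unknown = kinds - {"EP", "SWITCH"}
    if unknown:
        raise ValueError(f"unknown device kind(s): {sorted(unknown)}")
    if "SWITCH" in kinds:
        # Any PCI SWITCH device present: use the failure threshold and
        # rate adjustment path.
        return "set_switch_failure_threshold"
    # Only EP devices present: start the EP failure processing function.
    return "enable_ep_aer"
```

For example, `select_fault_handling(["EP", "EP"])` selects the EP branch, while any list containing `"SWITCH"` selects the threshold branch.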
Alternatively, the fault handling system may be a basic input output system (Basic Input Output System, abbreviated as BIOS) deployed in a server, or may be a device including a processor. The BIOS is firmware in a computer system, located on the motherboard of the computer. It is responsible for hardware initialization, self-checking, and booting the operating system at computer start-up. The BIOS provides an interface that enables the operating system and application programs to communicate with the hardware of the computer.
Alternatively, a PCI bridge link refers to a communication link that connects between two or more PCI bridges. A PCI bridge is a device used to connect different PCI links that can transfer data from one PCI link to another PCI link. The PCI bridge links are responsible for transferring data and control information between different PCI links to enable communication and data exchange between different PCI devices.
Alternatively, the PCI device refers to a device having a PCI interface. PCI is an expansion bus intended to connect devices external to the computer to the inside of the computer. The PCI device may be a network card, a storage medium card, an accelerator card, a GPU card, or the like having a PCI interface.
Optionally, the PCI SWITCH device is a device for connecting a computer system to external devices. A PCI SWITCH device extends the PCI bus interface of the computer system so that it can connect to more external devices, such as network cards, graphics cards, and storage devices. Multiple PCI bus interfaces are connected together to manage and control multiple external devices, improving the expansibility and flexibility of the system. The PCI SWITCH device thus implements time division multiplexing of the PCI bus, so that multiple PCI devices can communicate over the PCI bus at the same time, and provides data flow control and routing functions so that PCI devices can exchange data efficiently. PCI SWITCH chip devices are commonly used in servers, workstations, and other systems that require a large number of PCI device connections.
The PCI SWITCH device has a fault threshold register for setting a fault threshold; when the number of faults reaches or exceeds the fault threshold, a corresponding fault handling mode is triggered. Fault thresholds may cover various types of faults, such as transmission faults, reception faults, and clock faults. By setting the fault thresholds, faults of the device can be detected and processed, ensuring the reliability and stability of the device.
Alternatively, an EP (Endpoint) device refers to a device connected to a PCI SWITCH device or to a PCI link, such as a network card, graphics card, sound card, or memory controller. The EP device communicates and exchanges data with the motherboard or other devices through the PCI bus or a PCI SWITCH device.
Optionally, the fault handling functions in the EP device include, but are not limited to, the advanced error reporting function (Advanced Error Reporting, abbreviated as AER), a function in the PCI device for reporting and handling errors of the device. The AER function allows the PCI device to generate an error report when an error occurs and send it to the operating system or another management entity of the host system. These error reports may include error types, error locations, and other relevant information to assist system administrators in diagnosing and handling device errors. The AER function may also notify the host system of link errors through an error reporting mechanism on the PCI link, so that the system can take appropriate action to handle the error, such as resetting or reconfiguring devices or rerouting data flows. This helps to improve the reliability and stability of the system and reduces the impact of errors on system performance and data integrity.
Optionally, the failure of the PCI device includes, but is not limited to:
1) Correctable errors, or failure events that allow repair: errors that can be corrected by error correction codes or other techniques, such as check errors in data transmission.
2) Uncorrectable errors, or failure events that do not allow repair: errors that cannot be corrected by error correction codes or other techniques, such as serious errors during data transmission.
3) Fatal errors, or failure events that cause the PCI device to be unable to operate: errors that severely affect the function of the device, so that the device cannot function properly.
4) Non-fatal errors, or failure events under which the PCI device can continue to operate: errors that affect the function of the device while the device can still continue to operate.
In an exemplary embodiment, in a case where the PCI device connected to the PCI bridge link is only the EP device, starting a failure processing function of the EP device includes: in the case that the PCI device connected to the PCI bridge link is only the EP device, setting the AER function of the EP device to an enabled state or a started state so as to start the failure processing function of the EP device, wherein the AER function includes the failure processing function.
Alternatively, the EP device may be a device in a server connected to a plurality of processors via PCI links. For example, as shown in fig. 3, when the failure processing system includes the BIOS, the devices to which CPU0 and CPU1 are connected include only EP devices. When the BIOS scans all PCI devices on the current PCI links and finds only EP devices, the BIOS sets the AER function of the EP devices on all PCI links to enabled.
Alternatively, setting the AER function of the EP device to an enabled state or a start-up state may be performed according to the following steps:
1) After the BIOS system is started, entering a setting interface of the BIOS system;
2) In the BIOS setup interface, find "PCI device configuration" or similar options;
3) In the PCI device configuration menu, find "Advanced Error Reporting (AER)" or similar options;
4) Setting the "Advanced Error Reporting (AER)" option to "enabled" or "on";
5) Saving the setting and exiting the setting interface of the BIOS system;
6) Ensuring that the AER functions of all PCI devices have been successfully set to enabled, which can be checked in the operating system using corresponding tools or commands.
In this embodiment, in the case where the PCI device connected to the PCI bridge link is only an EP device, by setting the AER function of the EP device to an enabled state or a started state, the failure of the EP device can be accurately detected and handled.
In an exemplary embodiment, in a case where the PCI device connected to the PCI bridge link is only the EP device, after starting a failure handling function of the EP device, the method further includes: monitoring a first failure event of the EP device, wherein the first failure event comprises at least one of: a first failure event that allows repair, a first failure event that does not allow repair, a first failure event that causes the EP device to be unable to operate, and a first failure event under which the EP device can continue to operate; reporting the first failure event to a target operating system and a baseboard management controller in the server when the first failure event is monitored; wherein the target operating system is configured to perform one of the following operations based on the first failure event: restarting the server, executing a self-checking operation, and generating fault prompt information; and the baseboard management controller is configured to perform at least one of the following operations based on the first failure event: recording the first failure event, executing a fault checking operation, and repairing the first failure event.
Optionally, the first failure event that allows repair includes a correctable error occurring in the EP device, e.g., a check error in data transmission; the first failure event that does not allow repair includes an uncorrectable error in the EP device, e.g., a serious error during data transmission; the first failure event that causes the EP device to be unable to operate includes a fatal error of the EP device; and the first failure event under which the EP device can continue to operate includes a non-fatal error of the EP device. For example, when any one of a correctable error, an uncorrectable error, a fatal error, and a non-fatal error occurs in an EP device connected to the PCI link in the server, the BIOS immediately reports the fault information to the target operating system and the baseboard management controller (Baseboard Management Controller, abbreviated as BMC), and notifies them to perform the above fault processing operations or to prompt replacement of the EP device.
In this embodiment, the failure events of the EP device are monitored and reported to the target operating system and the baseboard management controller in the server, so that faults can be processed in time.
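The reporting path described above can be sketched as follows. The text lists which operations the target operating system and the baseboard management controller may perform, but it does not say which event leads to which operation; the mapping in `os_action` and `bmc_actions` is therefore an assumption made purely for illustration, as are all names in the block.

```python
from enum import Enum


class FirstFailure(Enum):
    REPAIR_ALLOWED = "correctable error"
    REPAIR_NOT_ALLOWED = "uncorrectable error"
    CANNOT_OPERATE = "fatal error"
    CAN_CONTINUE = "non-fatal error"


def os_action(event):
    """Target operating system: exactly one operation (assumed mapping)."""
    if event is FirstFailure.CANNOT_OPERATE:
        return "restart_server"
    if event is FirstFailure.REPAIR_NOT_ALLOWED:
        return "run_self_check"
    return "generate_fault_prompt"


def bmc_actions(event):
    """BMC: at least one operation; recording is always done in this sketch."""
    actions = ["record_event"]
    if event is FirstFailure.REPAIR_ALLOWED:
        actions.append("repair_event")
    else:
        actions.append("run_fault_check")
    return actions


def report_first_failure(event):
    # The event is reported to both recipients.
    return {"os": os_action(event), "bmc": bmc_actions(event)}
```

A real system would replace the returned strings with calls into the operating system's error handler and the BMC's event log interface.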
In an exemplary embodiment, in a case where the PCI device connected on the PCI bridge link includes the PCI SWITCH device, setting the failure threshold of the PCI SWITCH device includes:
Determining a target register set in the PCI SWITCH device in a case that the PCI device connected to the PCI bridge link includes the PCI SWITCH device, wherein the target register is used for recording the number of failures of the PCI SWITCH device; and setting the fault threshold of the target register according to a preset condition.
Optionally, the target register is a fault threshold register provided in the PCI SWITCH device and is used for setting the fault threshold; when the number of faults reaches or exceeds the fault threshold, a corresponding fault handling mode is triggered. The target register includes a counter for accumulating the number of failures of the PCI SWITCH device.
Optionally, the PCI device connected on the PCI bridge link includes the PCI SWITCH device, including one of:
The PCI devices connected to the PCI bridge link are only PCI SWITCH devices; for example, as shown in fig. 4, CPU0 and CPU1 are connected to PCI SWITCH devices.
The PCI devices connected to the PCI bridge link are only PCI SWITCH devices, and one or more EP devices are connected to the PCI SWITCH devices; for example, as shown in fig. 4, 3 EP devices are connected to the PCI SWITCH device connected to CPU0.
The PCI devices connected to the PCI bridge link include a PCI SWITCH device and EP devices, the PCI SWITCH device is connected to a first processor in the server, and the EP devices are connected to a second processor in the server; for example, as shown in fig. 5, a PCI SWITCH device is connected to CPU0, and EP devices are connected to CPU1.
The PCI devices connected to the PCI bridge link include a PCI SWITCH device and EP devices, the PCI SWITCH device is connected to a first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP devices are connected to a second processor in the server. For example, as shown in fig. 5, a PCI SWITCH device is connected to CPU0, 3 EP devices are connected to the PCI SWITCH device, and 3 EP devices are connected to CPU1.
It should be noted that there may be one or more PCI SWITCH devices, and each PCI SWITCH device may be connected to a plurality of EP devices. By detecting the types of the connected devices, this embodiment can accurately determine the fault processing mode for each type of PCI device.
Optionally, setting the fault threshold of the target register according to a preset condition includes: setting said fault threshold of said target register according to said preset conditions and by at least one of:
Determining operation information of the PCI SWITCH equipment and setting the fault threshold based on the operation information, wherein the operation information comprises at least one of the following: the environment of operation, the performance of operation, the safety performance of operation and the stability performance of operation;
For example, the environment of operation: this includes the temperature range, humidity range, and other environmental conditions in which the PCI SWITCH device operates. Corresponding fault thresholds may be set according to the stability of the device under different environmental conditions, to ensure the reliability of the device in different environments.
Performance of operation: this includes performance indicators of the PCI SWITCH device such as data transmission speed and bandwidth utilization. Corresponding fault thresholds may be set according to the behavior of the PCI SWITCH device under different performance requirements, to ensure the stability of the device under different loads.
Safety performance of operation: this includes security functions of the PCI SWITCH device such as security authentication, data encryption, and access control. Corresponding fault thresholds may be set according to the behavior of the device under different security requirements, to ensure the reliability of the PCI SWITCH device under different security threats.
Stability performance of operation: this includes the stability of the PCI SWITCH device during long-term operation, for example whether disconnections or hangs occur. Corresponding fault thresholds may be set according to the behavior of the PCI SWITCH device under different stability requirements, to ensure the reliability of the device during long-term operation.
Based on the operation information, an appropriate fault threshold can be set according to actual conditions, to ensure the stable operation of the PCI SWITCH device under different conditions.
Determining a first number of the PCI SWITCH devices and a second number of the EP devices connected to the PCI SWITCH devices, and setting the failure threshold based on the first number and the second number. In this embodiment, the failure threshold may be set according to the first number of PCI SWITCH devices and the second number of connected EP devices. For example, if there are 5 PCI SWITCH devices and 20 connected EP devices, an appropriate failure threshold (e.g., 100) may be set according to the actual situation to ensure the stability and reliability of the network. By monitoring and analyzing fault conditions in the network, an optimal failure threshold may be determined so that problems are identified and resolved in time.
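The text gives only the example values (5 PCI SWITCH devices, 20 EP devices, threshold 100) and no formula. The sketch below shows one possible rule that reproduces that example: a fixed failure budget per device multiplied by the total device count. Both the rule and the `budget_per_device` parameter are assumptions, not part of the described method.

```python
def failure_threshold(first_number, second_number, budget_per_device=4):
    """Hypothetical rule: threshold grows linearly with the device count.

    first_number  -- number of PCI SWITCH devices
    second_number -- number of EP devices connected to them
    """
    if first_number < 1:
        raise ValueError("at least one PCI SWITCH device is required")
    if second_number < 0 or budget_per_device < 1:
        raise ValueError("counts must be non-negative and budget positive")
    return budget_per_device * (first_number + second_number)
```

With the default budget of 4, `failure_threshold(5, 20)` evaluates to 4 × 25 = 100, matching the example in the text; a deployment would tune the budget from observed fault rates.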
In an exemplary embodiment, after setting the failure threshold of the target register according to a preset condition, the method further includes: detecting the PCI SWITCH device and the one or more EP devices connected to the PCI SWITCH device; and triggering a counter in the target register to execute a fault accumulation operation when the PCI SWITCH device fails or any EP device connected to the PCI SWITCH device fails, wherein the fault accumulation operation is used for accumulating the number of failures of the PCI SWITCH device. In this embodiment, when the PCI SWITCH device or any EP device connected to it fails, the counter in the target register is triggered to perform the fault accumulation operation. The counter is incremented for each fault event, so that the count is available for subsequent analysis and troubleshooting. This helps to quickly identify the frequency and pattern of failures, so that appropriate action can be taken to repair them and improve the reliability and stability of the system.
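The fault accumulation operation can be modeled in software as follows. The real target register is a hardware register inside the PCI SWITCH device; the `TargetRegister` class and the `on_failure` helper are hypothetical stand-ins used only to show when the counter is incremented.

```python
class TargetRegister:
    """Software model of the fault threshold register and its counter."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.count = 0

    def accumulate(self):
        self.count += 1

    def threshold_reached(self):
        return self.count >= self.threshold


def on_failure(register, failed_device, switch_name, ep_names):
    """Count a failure of the switch itself or of any EP connected to it."""
    if failed_device == switch_name or failed_device in ep_names:
        register.accumulate()
        return True
    # Failures of unrelated devices do not touch this switch's counter.
    return False
```

A failure of either the switch or one of its EP devices advances the same counter, which is what lets a single threshold govern the whole branch of the link.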
In an exemplary embodiment, in a case where the PCI device connected on the PCI bridge link includes the PCI SWITCH device, before setting the failure threshold of the PCI SWITCH device, the method further includes one of:
In the case that one or more EP devices are connected to the PCI SWITCH device, starting the fault processing function of the EP devices, and receiving, through the PCI SWITCH device, the result of the EP devices processing faults according to the fault processing function. For example, as shown in fig. 4, when 3 EP devices are connected to the PCI SWITCH device connected to CPU0, the fault processing function (AER function) of the 3 EP devices is turned on. The fault processing function of the EP devices can be started through the PCI SWITCH device, and may include automatic restart, fault alerting, error logging, and the like. When an EP device processes a fault according to the fault processing function, the PCI SWITCH device may receive the processing result sent by the EP device, such as information that a restart succeeded or that an alarm was sent. By receiving the processing results of the EP devices, the PCI SWITCH device learns the fault processing status of the EP devices in time, which helps an administrator take corresponding measures promptly and ensures the normal operation of the system.
Turning off the fault processing function of the EP device in the case that the PCI SWITCH device is connected to a first processor in the server and the EP device is connected to a second processor in the server. For example, as shown in fig. 5, CPU0 is connected to a PCI SWITCH device and CPU1 is connected to 3 EP devices, and the AER functions of the 3 EP devices connected to CPU1 are turned off. The error handling mechanism of the PCI link may then be implemented by setting the failure threshold of the PCI SWITCH device.
In the case that the PCI SWITCH device is connected to a first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP device is connected to a second processor in the server, turning off the fault processing function of the EP device, turning on the fault processing functions of the other EP devices, and receiving, through the PCI SWITCH device, the result of the other EP devices processing faults according to their fault processing functions. For example, as shown in fig. 5, CPU0 is connected to a PCI SWITCH device, the PCI SWITCH device is connected to 3 EP devices, and CPU1 is connected to 3 EP devices; the AER functions of the 3 EP devices connected to CPU1 are turned off, and the AER functions of the 3 EP devices connected to the PCI SWITCH device are turned on. By receiving the processing results of the EP devices, the PCI SWITCH device learns their fault processing status in time, which helps an administrator take corresponding measures promptly and ensures the normal operation of the system.
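The three cases above amount to a per-device AER policy, sketched below. The function and parameter names are hypothetical; the policy itself follows the text: EP devices attached to a PCI SWITCH device have AER turned on, and EP devices attached directly to a processor have AER turned off whenever a PCI SWITCH device is present on the link (and turned on when only EP devices are present, as in the EP-only embodiment).

```python
def aer_settings(eps_under_switch, eps_direct, has_switch):
    """Return {ep_name: True/False} giving the AER state for each EP device."""
    if eps_under_switch and not has_switch:
        raise ValueError("EP devices cannot be under a switch that is absent")
    settings = {}
    for ep in eps_under_switch:
        # EP devices behind the PCI SWITCH device keep AER on; the switch
        # receives their fault processing results.
        settings[ep] = True
    for ep in eps_direct:
        # Directly attached EP devices: AER off if a switch is on the link,
        # AER on if the link has only EP devices.
        settings[ep] = not has_switch
    return settings
```

For the fig. 5 topology, this turns AER on for the three EP devices behind the switch on CPU0 and off for the three EP devices on CPU1.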
In one exemplary embodiment, adjusting the transmission rate of the PCI bridge link based on the failure threshold includes: triggering adjustment of the transmission rate of the PCI bridge link to obtain a first adjusted transmission rate in the case that the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold.
Alternatively, the adjustment of the transmission rate may reduce the data transmission rate in order to reduce the link load, for example from the original 32 GT/s to 16 GT/s. In this way, link faults can be dealt with in time, interruption and delay of data transmission are reduced, and the stability and reliability of the PCI bridge link are improved.
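A minimal sketch of the rate reduction, assuming the link supports the standard PCI Express per-lane rates of 2.5, 5, 8, 16, and 32 GT/s and that each adjustment steps down by one rate, which is consistent with the 32 GT/s to 16 GT/s example. The function name and the one-step policy are assumptions; the text does not say how large each reduction is.

```python
# Standard PCI Express per-lane transfer rates in GT/s, lowest first.
PCIE_RATES_GT_S = (2.5, 5.0, 8.0, 16.0, 32.0)


def lower_rate(current):
    """Return the next lower supported rate, or the same rate at the minimum."""
    idx = PCIE_RATES_GT_S.index(current)  # ValueError for unsupported rates
    return PCIE_RATES_GT_S[max(idx - 1, 0)]
```

Calling `lower_rate(32.0)` gives 16.0, and `lower_rate(2.5)` stays at 2.5 because no lower rate exists.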
Optionally, triggering adjustment of the transmission rate of the PCI bridge link to obtain a first adjusted transmission rate, when the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold, includes at least one of the following:
Triggering adjustment of the transmission rate of the PCI SWITCH device in the case that the first number of failures is greater than or equal to the failure threshold, so as to obtain the first adjusted transmission rate;
Triggering adjustment of the transmission rate of one or more EP devices connected to the PCI SWITCH device in the case that the first number of failures is greater than or equal to the failure threshold, so as to obtain the first adjusted transmission rate.
Alternatively, adjustment of the transmission rate of the PCI SWITCH device, or of the EP devices connected to the PCI SWITCH device, may be triggered in several ways. Software control: using the management software or drivers of the PCI SWITCH device or EP device, the transmission rate can be adjusted by setting parameters or configuration files. Hardware control: some PCI SWITCH devices or EP devices may have physical switches or buttons that can be operated manually on the device to adjust the transmission rate. Remote control: over a network or remote connection, the PCI SWITCH device or EP device may be controlled through a remote management tool or protocol to adjust the transmission rate. Whichever way is used, the adjustment operation must correctly identify and act on the PCI SWITCH device or EP device that needs to be adjusted, to ensure that the adjustment is accurate and effective. The first adjusted transmission rate can be obtained in any of the above ways. It should be noted that an adjustment of the transmission rate may require restarting the PCI SWITCH device to take effect.
Optionally, when the accumulated first failure number of the PCI bridge link failure is greater than or equal to the failure threshold, triggering and adjusting the transmission rate of the PCI bridge link, and after obtaining the first adjusted transmission rate, the method further includes: clearing the first failure times; and re-recording the times of the faults of the PCI bridge link to obtain a second times of faults. If the second failure times are smaller than the failure threshold value, keeping the current transmission rate unchanged; and if the second failure times are greater than or equal to the failure threshold value, triggering and adjusting the transmission rate of the PCI bridge link again. And so on until the transmission rate stabilizes at a suitable value.
Optionally, after re-recording the number of failures of the PCI bridge link to obtain the second failure number, the method further includes: continuing to trigger adjustment of the transmission rate of the PCI bridge link when the second failure number is greater than or equal to the failure threshold, so as to obtain a second adjusted transmission rate; and controlling the operation of the PCI bridge link based on the second adjusted transmission rate. If the second failure number does not reach the failure threshold, the operating state of the PCI bridge link continues to be monitored, and adjustment of the transmission rate is triggered when necessary. The PCI bridge link can therefore be adjusted promptly when it fails, which helps keep the system running stably.
Optionally, after controlling operation of the PCI bridge link based on the second adjusted transmission rate, the method further comprises: stopping the adjustment of the transmission rate of the PCI bridge link under the condition that the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link; and triggering adjustment of the network bandwidth of the PCI bridge link to obtain an adjusted network bandwidth under the condition that the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third failure times of the PCI bridge link failure is greater than or equal to the failure threshold value, wherein the third failure times are the failure times recorded in the process that the PCI bridge link operates according to the second adjusted transmission rate. When the transmission rate after the second adjustment is equal to the lowest rate of the PCI bridge link, and the recorded number of times of PCI bridge link faults is greater than or equal to a set fault threshold value, the network bandwidth of the PCI bridge link is triggered to be adjusted. This means that the system will automatically reallocate the bandwidth of the PCI bridge link to ensure the stability and reliability of the network transmission. Such automatic adjustment may help the system react quickly in the event of a failure, thereby minimizing impact on network performance.
Optionally, after adjustment of the network bandwidth of the PCI bridge link has been triggered and the adjusted network bandwidth has been obtained, in the case that the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third failure number of the PCI bridge link is greater than or equal to the failure threshold, the method further includes: generating a second failure event in the case that the adjusted network bandwidth is equal to the lowest network bandwidth of the PCI bridge link, wherein the second failure event includes at least one of: a second failure event that allows repair, a second failure event that does not allow repair, a second failure event that causes the PCI bridge link to be unable to operate, and a second failure event under which the PCI bridge link can continue to operate; and sending the second failure event to a baseboard management controller in the server, wherein the baseboard management controller is used for recording the second failure event and generating fault processing information, and the fault processing information is used for indicating replacement of the failed PCI SWITCH device and/or replacement of the EP device connected to the failed PCI SWITCH device.
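The order of measures described above (lower the rate first, then the bandwidth, then report) can be illustrated with a minimal Python sketch. It is an assumption-laden illustration rather than the claimed implementation: the function name `degrade`, the rate ladder, and the width ladder are hypothetical, and "network bandwidth" is modeled here as the link width in lanes.

```python
from typing import Tuple

# Assumed ladders, highest first. Only 32, 16 and 2.5 GT/s and x16, x8, x1
# appear in the text; the intermediate steps are illustrative.
RATES_GT_S = [32.0, 16.0, 8.0, 5.0, 2.5]
WIDTHS = [16, 8, 4, 2, 1]


def degrade(rate_idx: int, width_idx: int) -> Tuple[int, int, bool]:
    """Apply one adjustment after the fault count reaches the threshold.

    Returns the new (rate_idx, width_idx) and a flag that is True when a
    second failure event should be generated for the BMC, i.e. when the
    adjusted bandwidth equals the lowest bandwidth.
    """
    last_rate = len(RATES_GT_S) - 1
    last_width = len(WIDTHS) - 1
    if rate_idx < last_rate:
        # The rate can still be lowered, so the bandwidth is left alone.
        return rate_idx + 1, width_idx, False
    if width_idx < last_width:
        # The rate is already at its lowest value: lower the bandwidth.
        width_idx += 1
    return rate_idx, width_idx, width_idx == last_width
```

Starting from the highest rate and width, eight consecutive threshold crossings walk the link down to 2.5 GT/s and x1 under these assumed ladders, and only the last one sets the report flag.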
Optionally, the second failure event that allows repair refers to an error that can be corrected by error correction codes or other techniques, such as a check error in the data transmission. A second failure event that does not allow repair refers to an error that cannot be corrected by error correction codes or other techniques, such as a critical error during data transmission. The second failure event that causes the PCI bridge link to fail refers to an error that severely affects the function of the PCI bridge link, resulting in the device failing to function properly. The second failure event that the PCI bridge link can continue to operate refers to an error that has some effect on the device's functionality, but the device can still continue to operate. The second failure event may be a disruption or loss of network connectivity, resulting in incomplete or severely delayed data transmission.
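The four kinds of second failure event can be modeled as combinable flags, since the text says the event includes "at least one of" them. The following Python sketch is illustrative only; the names `FaultEvent` and `link_can_operate` are hypothetical and do not come from the original.

```python
from enum import Flag, auto


class FaultEvent(Flag):
    """Hypothetical labels for the four kinds of failure event."""
    CORRECTABLE = auto()    # repair allowed, e.g. a check error
    UNCORRECTABLE = auto()  # repair not allowed
    FATAL = auto()          # the PCI bridge link cannot operate
    NON_FATAL = auto()      # the PCI bridge link can continue to operate


def link_can_operate(event: FaultEvent) -> bool:
    """The link keeps running unless the event carries the FATAL flag."""
    return not (event & FaultEvent.FATAL)
```

A flag type is used instead of a plain enumeration so that one event can carry several of the four kinds at once.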
Optionally, the baseboard management controller records detailed information of the second fault event, including the time, place, reason, influence, etc. of the fault. The baseboard management controller then generates fault handling information including handling measures for the fault, repair plans, responsible persons, and predicted completion times, etc. The information can help managers and technicians to better understand the fault condition, timely take countermeasures and ensure timely repair of the fault.
In an exemplary embodiment, after adjusting the transmission rate of the PCI bridge link based on the failure threshold, the method further comprises: generating target information and sending the target information to a baseboard management controller deployed in the server, wherein the target information comprises a target transmission rate, the target transmission rate is a rate after the transmission rate of the PCI bridge link is adjusted, and the target transmission rate is smaller than the transmission rate before the PCI bridge link is adjusted; and recording the target transmission rate through the baseboard management controller, and generating prompt information to prompt that the first failure times of the PCI bridge link failure is larger than the failure threshold value.
Optionally, the process of generating the target information includes determining a value of the target transmission rate, encapsulating the value into a data packet or message, and then sending the data packet or message to a server where the baseboard management controller is located via the network. The baseboard management controller can adjust the transmission rate of the PCI bridge link according to the received target information so as to achieve the required target transmission rate. This ensures that data transmissions in the system can be made at a set rate to meet specific performance requirements.
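As an illustration of encapsulating the target transmission rate into a message, the sketch below assumes a JSON payload. The text does not specify a message format, so the function name `build_target_info` and the field names are hypothetical; the only constraint taken from the text is that the target rate is lower than the rate before adjustment.

```python
import json


def build_target_info(previous_rate_gt_s: float, target_rate_gt_s: float) -> bytes:
    """Package the target transmission rate for the baseboard management controller.

    The JSON layout and field names are assumptions made for illustration.
    """
    if target_rate_gt_s >= previous_rate_gt_s:
        # The text requires the target rate to be smaller than the rate
        # before adjustment.
        raise ValueError("target rate must be lower than the previous rate")
    payload = {
        "previous_rate_gt_s": previous_rate_gt_s,
        "target_rate_gt_s": target_rate_gt_s,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")
```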
Optionally, the baseboard management controller may record the target transmission rate and generate corresponding prompt information to help the user monitor and manage the transmission rate. The hint information may include, but is not limited to:
1) The current transmission rate reaches or exceeds a preset threshold value, the user is reminded that network congestion or bandwidth limitation possibly exists, and corresponding measures are recommended to be taken to optimize the network environment.
2) The current transmission rate is lower than expected, alerting the user to the possible device failure or network problem, suggesting to check the relevant device and network connection to increase the transmission rate.
3) For a specific task or application, according to the set target transmission rate, reminding the user whether the current transmission rate meets the expectations or not so as to adjust the setting or take other measures in time.
Through recording and generating prompt information, the baseboard management controller can help users to find and solve problems in terms of transmission rate in time, and transmission efficiency and user experience are improved.
In one exemplary embodiment, detecting a PCI device connected on the PCI bridge link includes: when the server is started, starting the fault processing system, and detecting N levels on the PCI bridge link, wherein N is a natural number which is greater than or equal to 1; determining the type of the PCI devices in each of the tiers.
Optionally, when the server is started, the fault handling system is started, and N levels on the PCI bridge link are detected, so as to ensure stability and reliability of connection.
Optionally, the types of the PCI devices in each of the tiers include: hardware level: PCI device types may include graphics cards, network cards, sound cards, RAID cards, and the like. Driver level: PCI device types may include graphics drivers, network drivers, sound card drivers, and the like. Application level: PCI device types may include graphics processors, network interface cards, voice input output devices, and so forth.
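Detecting N levels on the PCI bridge link and determining the device type in each level can be sketched as a level-by-level walk over a device tree. This is a minimal illustration under assumptions: the `PciNode` class, the `"EP"` and `"SWITCH"` labels, and the function `kinds_by_level` are hypothetical, and a real implementation would enumerate devices through configuration space rather than an in-memory tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PciNode:
    kind: str  # assumed labels: "EP" or "SWITCH"
    children: List["PciNode"] = field(default_factory=list)


def kinds_by_level(first_level: List[PciNode]) -> List[List[str]]:
    """Return the device kinds found in each level, starting at level 1."""
    levels: List[List[str]] = []
    current = first_level
    while current:
        levels.append([node.kind for node in current])
        # The next level consists of everything attached below this one.
        current = [child for node in current for child in node.children]
    return levels
```

For a link with one switch carrying two EP devices plus one directly attached EP device, the function yields two levels, so N is 2 in that case.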
The application is illustrated below with reference to specific examples:
The present embodiment is mainly a method for processing error information in the case that a PCI bridge link at a CPU end is connected to different devices, and is described by taking a method for controlling a PCI link to process an error by using a BIOS as an example.
The PCI link bus can support both PCI terminal devices (i.e., EP devices) and PCI SWITCH chips (i.e., PCI SWITCH devices). A PCI terminal device only supports the AER function, which covers correctable errors, uncorrectable errors, fatal errors, and non-fatal errors. The downstream side of a PCI SWITCH chip can in turn connect PCI terminal devices or further PCI SWITCH chips, and the PCI SWITCH chip itself is a bridge-type device. An error threshold parameter can therefore be set for error repair of the terminal devices below the PCI SWITCH chip; when the number of bridge errors of the PCI SWITCH chip reaches the threshold setting value, the PCI SWITCH chip performs a speed-down mechanism to reduce the number of errors. A PCI link may be in one of three cases: in the first, there are only PCI terminal devices and no PCI SWITCH chips; in the second, there are only PCI SWITCH chips and no PCI terminal devices; in the third, both PCI terminal devices and PCI SWITCH chips are present. For the first and second cases, error handling is straightforward for the specific PCI link condition. For the third case, the error processing mechanism of the PCI SWITCH chip is turned on while the error processing mechanism of the PCI terminal devices is turned off, which implements the error processing mechanism of the PCI link. Because CPUs of different architectures support different numbers of PCI bridges, this embodiment describes the error information processing mechanism using only one PCI bridge link.
As shown in fig. 6, an interaction diagram of each device performing error information processing in this embodiment is shown, where the interaction procedure is as follows:
S601, powering up a server and starting a BIOS;
S602, the BIOS scans the PCI devices under the PCI bridge link and confirms whether a PCI SWITCH chip exists in the current link. If no PCI SWITCH chip exists, the BIOS enables the AER function of the PCI terminal devices under the current link and reports errors. If a PCI SWITCH chip is detected, the BIOS turns off the AER function of the PCI terminal EP devices that are not under the Switch chip bridge, enables the AER function only for the PCI terminal devices under the current Switch chip bridge, and sets an error threshold; when the error counter reaches the threshold, link speed reduction and bandwidth reduction measures are taken, and an SMI interrupt signal is used to notify the BMC end to record;
S603, the BMC receives error information reported by the BIOS and records information of link speed reduction and bandwidth reduction.
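The decision made in step S602 can be sketched as follows. This is a hedged illustration, not BIOS code: the function name `plan_aer` and its inputs are hypothetical, and the default threshold of 1000 is only the example value mentioned elsewhere in this embodiment. The sketch follows S602 as written, enabling AER only for EP devices below the Switch chip bridge when a switch is present.

```python
from typing import Dict, List, Optional, Tuple


def plan_aer(
    eps: List[Tuple[str, bool]],
    has_switch: bool,
    threshold: int = 1000,
) -> Tuple[Dict[str, bool], Optional[int]]:
    """Decide per-EP AER enablement and the switch error threshold.

    eps is a list of (name, under_switch) pairs. The result maps each EP
    name to True (AER enabled) or False (AER off), together with the
    threshold to program, or None when no switch is present.
    """
    if not has_switch:
        # Only EP devices on the link: enable AER everywhere, no threshold.
        return {name: True for name, _ in eps}, None
    # A switch is present: AER stays on only below the Switch chip bridge.
    return {name: under_switch for name, under_switch in eps}, threshold
```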
This embodiment takes different measures for different link conditions. It ensures that errors of PCI devices are reported in time, reduces the time lost to frequent replacement of PCI terminal devices in the server through the measures taken at the Switch chip bridge, improves the safety, reliability, and stability of the server, and helps the cost-effectiveness of the server's RAS functions.
As shown in fig. 7, a flowchart of the BIOS of the present embodiment for error information processing includes the following steps:
S701, the BIOS determines that the server is powered on;
S702, the BIOS scans and detects all PCI devices of the PCI bridge link, including PCI terminal devices, PCI SWITCH chip devices, and other PCI devices;
S703, the BIOS judges whether the scanned PCI devices include only EP devices;
S704, when the BIOS scans all PCI devices of the current PCI bridge link and only the EP device exists, the BIOS sets the AER function of the EP device of all PCI bridge links to be enabled;
S705, when the EP device generates any one of a correctable error, an uncorrectable error, a fatal error, or a non-fatal error, the BIOS immediately reports the fault information to the OS and BMC ends and notifies them that the device is to be replaced;
S706, when BIOS scans that only PCI SWITCH chips are arranged at the first level of all PCI devices of the current PCI bridge link, even if PCI SWITCH chips are connected with the EP device at the downstream port, the AER function of the EP device is closed, and meanwhile, an error threshold value register of PCI SWITCH chips is set to be a fault threshold value, for example, the set fault threshold value is 1000;
S707, when the EP device at the lower end of the PCI SWITCH chip generates any AER error, the fault count in the threshold register of the PCI SWITCH chip is incremented by 1;
S708, judging whether the fault count of the threshold value register of the PCI SWITCH chip reaches a fault threshold value or not;
S709, when the count of the threshold register reaches or exceeds the failure threshold, the flow goes to S710: the PCI bridge link automatically triggers the speed-down function (e.g., from 32 GT/s to 16 GT/s), and the threshold register starts counting again;
S711, when S704 still exceeds the set fault threshold, the flow goes to S715 and continues to slow down until the lowest transmission rate of the current PCI link is reached; if the set threshold is still exceeded at the lowest transmission rate, the PCI link bandwidth is decreased (e.g., from x16 to x8) until the lowest speed and minimum bandwidth (e.g., 2.5 GT/s, x1) are reached, and steps S704-S705 continue to be performed;
S712, judging whether the PCI link bandwidth has been reduced to the lowest bandwidth;
S713, the BIOS notifies the BMC end through an SMI interrupt signal to record the speed-reduction information of the current PCI link, so as to inform the user that the number of errors on the current PCI link exceeds the threshold setting;
S714, the threshold register starts counting again.
It should be noted that, when the BIOS finds both PCI SWITCH chips and EP devices in the first level of all PCI devices of the current PCI link, the AER function of the EP device needs to be turned off even if an EP device is connected to the downstream port of a PCI SWITCH chip; only the error threshold of the PCI SWITCH chip is retained, and processing proceeds according to steps S706-S705.
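The counting and downgrade portion of the flow (roughly S707 to S714) can be simulated with the short Python sketch below. It is a sketch under stated assumptions rather than the BIOS implementation: the function name `run_flow`, the rate and width ladders, and the representation of the BMC record as a list of (rate, width) tuples are all hypothetical.

```python
from typing import List, Tuple

# Assumed ladders, highest first; the intermediate values are illustrative.
RATES_GT_S = [32.0, 16.0, 8.0, 5.0, 2.5]
WIDTHS = [16, 8, 4, 2, 1]


def run_flow(error_count: int, threshold: int) -> List[Tuple[float, int]]:
    """Feed error_count AER errors into the counter and return the BMC log.

    Each log entry is the (rate in GT/s, width in lanes) in effect after
    one downgrade.
    """
    rate_idx = width_idx = counter = 0
    bmc_log: List[Tuple[float, int]] = []
    for _ in range(error_count):
        counter += 1                          # S707: count the error
        if counter < threshold:               # S708: threshold not reached
            continue
        if rate_idx < len(RATES_GT_S) - 1:    # S709/S710: lower the rate
            rate_idx += 1
        elif width_idx < len(WIDTHS) - 1:     # S711/S712: lower the bandwidth
            width_idx += 1
        else:
            continue                          # already at lowest rate and width
        bmc_log.append((RATES_GT_S[rate_idx], WIDTHS[width_idx]))  # S713
        counter = 0                           # S714: count again
    return bmc_log
```

With the example threshold of 1000, the 1000th error produces the first log entry, recording the drop from 32 GT/s to 16 GT/s at full width.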
In summary, in this embodiment, when the server starts, the BIOS scans the PCI devices under the PCI bridge link to confirm whether a PCI SWITCH chip bridge exists in the current link. If no PCI SWITCH chip bridge exists, the AER function of the PCI terminal devices under the current link is enabled and errors are reported. If a PCI SWITCH chip is detected, the AER function of the PCI terminal EP devices that are not under the Switch chip bridge is turned off, the AER function is enabled only for the PCI terminal devices under the current Switch chip bridge, and an error threshold is set; when the error counter reaches the threshold, link speed reduction and bandwidth reduction measures are taken, and the BIOS notifies the BMC end to record by means of an SMI interrupt. Handling errors actively in this way reduces frequent replacement of terminal devices and helps guarantee the reliability of the server, which benefits the stability and safety of the server, the operation and maintenance of the data center, and the service life of the whole server.
In this embodiment, as shown in fig. 8, a structural block diagram of the fault handling system in this embodiment is further provided, where the fault handling system is disposed on a motherboard of a server, and the fault handling system includes a target processor, where the target processor is connected to a PCI bridge link disposed in the server, and the target processor is configured to implement the steps of the method described above.
Also provided in this embodiment is a server, as shown in fig. 9, which is a block diagram of the server in this embodiment, including: a PCI bridge link, and a fault handling system as described above, wherein a PCI device is connected to the PCI bridge link.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiment also provides a fault handling device of a PCI device, which is used to implement the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 10 is a block diagram of a failure handling apparatus of a PCI device according to an embodiment of the present application, which is applied to a failure handling system provided on a motherboard of a server, the failure handling system being connected to a PCI bridge link disposed in the server, as shown in fig. 10, and includes:
A first detection module 1002, configured to detect a PCI device connected to the PCI bridge link, where the PCI device includes an EP device and/or PCI SWITCH devices, and a plurality of EP devices are allowed to be connected to the PCI SWITCH device; a first processing module 1004, configured to determine a failure handling manner of the PCI device according to the detected PCI device, and process a failure of the PCI device according to the failure handling manner; wherein the first processing module comprises: a first processing unit, configured to, in a case where the PCI device connected to the PCI bridge link is only the EP device, start a fault handling function of the EP device, where the fault handling function is a function provided in the EP device, and the fault handling function is configured to handle a fault of the EP device; and a second processing unit, configured to, in a case where the PCI device connected to the PCI bridge includes the PCI SWITCH device, set a failure threshold of the PCI SWITCH device, and adjust a transmission rate of the PCI bridge based on the failure threshold, where the failure threshold is used to represent a threshold for recording a number of failures of the PCI SWITCH device, and the failure processing manner includes a manner in which the failure processing function and the failure threshold of the PCI SWITCH device are set.
In an exemplary embodiment, the first processing unit includes: and a first setting subunit, configured to set, when the PCI device connected to the PCI bridge link is only the EP device, an AER function of the EP device to an enabled state or a started state, so as to start a fault handling function of the EP device, where the AER function includes the fault handling function.
In an exemplary embodiment, the apparatus further includes: a first monitoring module, configured to monitor a first failure event of the EP device after the failure processing function of the EP device is started in a case where the PCI device connected to the PCI bridge link is only the EP device, where the first failure event includes at least one of: a first failure event that allows repair, a first failure event that does not allow repair, a first failure event that causes the EP device to be unable to operate, and a first failure event under which the EP device can continue to operate; and a first reporting module, configured to report the first failure event to a target operating system and a baseboard management controller in the server when the first failure event is monitored; wherein the target operating system is configured to perform one of the following operations based on the first failure event: restarting the server, executing a self-checking operation, and generating fault prompt information; and the baseboard management controller is configured to perform at least one of the following operations based on the first failure event: recording the first failure event, executing a fault checking operation, and repairing the first failure event.
In an exemplary embodiment, the second processing unit includes: a first determining subunit, configured to determine a target register set in the PCI SWITCH device, where the target register is used to record the number of failures of the PCI SWITCH device, where the PCI device connected to the PCI bridge includes the PCI SWITCH device; and the first setting subunit is used for setting the fault threshold value of the target register according to preset conditions.
In an exemplary embodiment, the PCI device connected to the PCI bridge link includes the PCI SWITCH device, including one of: the PCI devices connected to the PCI bridge link are only PCI SWITCH devices; the PCI devices connected to the PCI bridge link are only PCI SWITCH devices, and one or more EP devices are connected to PCI SWITCH devices; the PCI devices connected to the PCI bridge include the PCI SWITCH devices and the EP devices, and the PCI SWITCH devices are connected to a first processor in the server, and the EP devices are connected to a second processor in the server; the PCI devices connected to the PCI bridge include the PCI SWITCH device and the EP device, and the PCI SWITCH device is connected to a first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP device is connected to a second processor in the server.
In an exemplary embodiment, the first setting subunit includes: a first setting sub-module, configured to set the fault threshold of the target register according to the preset condition and by at least one of: determining operation information of the PCI SWITCH equipment, and setting the fault threshold based on the operation information, wherein the operation information comprises at least one of the following: the environment of operation, the performance of operation, the safety performance of operation and the stability performance of operation; determining a first number of said PCI SWITCH devices and a second number of said EP devices connected in said PCI SWITCH devices, and setting said failure threshold based on said first number and said second number.
In an exemplary embodiment, the above apparatus further includes: a first detection module, configured to detect one or more EP devices connected to the PCI SWITCH devices and the PCI SWITCH devices after setting the fault threshold of the target register according to a preset condition; and the first triggering module is used for triggering a counter in the target register to execute a fault accumulation operation when the PCI SWITCH equipment fails or any one of the EP equipment connected with the PCI SWITCH equipment fails, wherein the fault accumulation operation is used for accumulating the times of the faults of the PCI SWITCH equipment.
In an exemplary embodiment, the apparatus further comprises one of: a first opening module, configured to, in a case where one or more EP devices are connected to the PCI SWITCH device before setting a failure threshold of the PCI SWITCH device in a case where the PCI device connected to the PCI bridge includes the PCI SWITCH device, open a failure processing function of the EP device, and receive, through the PCI SWITCH device, a result of the EP device processing a failure according to the failure processing function; a first shutdown module configured to shut down a failure handling function of the EP device when the PCI SWITCH device is connected to a first processor in the server and the EP device is connected to a second processor in the server, before setting a failure threshold of the PCI SWITCH device when the PCI device connected to the PCI bridge includes the PCI SWITCH device; and a second processing module, configured to, before setting a failure threshold of the PCI SWITCH device when the PCI device connected to the PCI bridge includes the PCI SWITCH device, connect the PCI SWITCH device to a first processor in the server, connect the PCI SWITCH device to one or more other EP devices, and, when the EP device is connected to a second processor in the server, turn off a failure processing function of the EP device, turn on other failure processing functions of the other EP devices, and receive, via the PCI SWITCH device, a result of the other EP devices processing a failure according to the other failure processing functions.
In an exemplary embodiment, the second processing unit includes: and the first triggering subunit is used for triggering and adjusting the transmission rate of the PCI bridge link to obtain a first adjusted transmission rate under the condition that the accumulated first failure times of the PCI bridge link failure is larger than or equal to the failure threshold value.
In an exemplary embodiment, the first triggering subunit includes at least one of: the first triggering sub-module is used for triggering and adjusting the transmission rate of the PCI SWITCH equipment to obtain the first adjusted transmission rate under the condition that the first failure times are larger than or equal to the failure threshold value; and the second triggering sub-module is used for triggering and adjusting the transmission rate of one or more EP devices connected in the PCI SWITCH devices under the condition that the first failure times are greater than or equal to the failure threshold value, so as to obtain the first adjusted transmission rate.
In an exemplary embodiment, the above apparatus further includes: the first clearing module is used for triggering and adjusting the transmission rate of the PCI bridge link under the condition that the accumulated first failure times of the PCI bridge link failure is larger than or equal to the failure threshold value, and clearing the first failure times after the first adjusted transmission rate is obtained; and the first recording module is used for recording the times of the faults of the PCI bridge link again to obtain second times of faults.
In an exemplary embodiment, the above apparatus further includes: the first adjusting module is used for re-recording the times of the faults of the PCI bridge link, obtaining a second fault times, and continuously triggering and adjusting the transmission rate of the PCI bridge link to obtain a second adjusted transmission rate under the condition that the second fault times are larger than or equal to the fault threshold value; and the first control module is used for controlling the operation of the PCI bridge link based on the second adjusted transmission rate.
In an exemplary embodiment, the above apparatus further includes: a first stopping module, configured to stop adjustment of the transmission rate of the PCI bridge link when the second adjusted transmission rate is equal to a minimum rate of the PCI bridge link after controlling the operation of the PCI bridge link based on the second adjusted transmission rate; and the second triggering module is used for triggering and adjusting the network bandwidth of the PCI bridge link to obtain the adjusted network bandwidth under the condition that the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third failure times of the PCI bridge link failure is greater than or equal to the failure threshold value, wherein the third failure times are the failure times recorded in the process that the PCI bridge link operates according to the second adjusted transmission rate.
In an exemplary embodiment, the above apparatus further includes: a first generating module, configured to generate a second failure event when the adjusted network bandwidth is equal to the minimum network bandwidth of the PCI bridge link, after adjustment of the network bandwidth of the PCI bridge link has been triggered and the adjusted network bandwidth has been obtained in the case that the second adjusted transmission rate is equal to the minimum rate of the PCI bridge link and the recorded third number of failures of the PCI bridge link is greater than or equal to the failure threshold, where the second failure event includes at least one of: a second failure event that allows repair, a second failure event that does not allow repair, a second failure event that causes the PCI bridge link to be unable to operate, and a second failure event under which the PCI bridge link can continue to operate; and a first sending module, configured to send the second failure event to a baseboard management controller in the server, where the baseboard management controller is configured to record the second failure event and generate fault processing information, where the fault processing information is used to instruct replacement of the failed PCI SWITCH device and/or replacement of the EP device connected to the failed PCI SWITCH device.
In an exemplary embodiment, the above apparatus further includes: a second generating module, configured to generate target information after adjusting a transmission rate of the PCI bridge link based on the failure threshold, and send the target information to a baseboard management controller disposed in the server, where the target information includes a target transmission rate, the target transmission rate is a rate after adjusting the transmission rate of the PCI bridge link, and the target transmission rate is smaller than a transmission rate before adjusting the PCI bridge link; and the second recording module is used for recording the target transmission rate through the baseboard management controller and generating prompt information so as to prompt that the first failure frequency of the PCI bridge link failure is larger than the failure threshold value.
In an exemplary embodiment, the first detection module includes: the first detection unit is used for starting the fault processing system when the server is started, and detecting N levels on the PCI bridge link, wherein N is a natural number which is greater than or equal to 1; and a first determining unit, configured to determine a type of the PCI device in each of the tiers.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; or the above modules may be located in different processors in any combination.
Embodiments of the application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
Embodiments of the present application also provide another computer program product comprising a computer non-volatile readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
Embodiments of the present application also provide a computer program comprising computer instructions stored in a computer non-volatile readable storage medium; a processor of a computer device reads the computer instructions from the computer non-volatile readable storage medium and executes them, causing the computer device to perform the steps of any of the method embodiments described above.
The present application also provides a computer non-volatile readable storage medium storing a computer program that when executed by a processor performs a failure processing method of a PCI device.
Referring to fig. 11, a schematic diagram of an embodiment of a computer non-volatile readable storage medium of the method for handling a failure of a PCI device according to the present application is shown. Taking a computer non-volatile readable storage medium as shown in fig. 11 as an example, the computer non-volatile readable storage medium 1101 stores a computer program 1102 that when executed by a processor performs the above method.
Fig. 12 is a schematic hardware structure of an embodiment of an electronic device according to the above-mentioned method for handling a failure of a PCI device.
Taking the example of the apparatus shown in fig. 12, a processor 1201 and a memory 1202 are included in the apparatus.
The processor 1201 and the memory 1202 may be connected by a bus or by other means; fig. 12 takes a bus connection as an example.
The memory 1202 is a nonvolatile computer readable storage medium that is used to store nonvolatile software programs, nonvolatile computer executable programs, and modules, such as program instructions/modules corresponding to the failure processing method of the PCI device in the embodiment of the present application. Processor 1201 performs various functional applications of the server and data processing, i.e., implements a failure processing method of the PCI device, by running nonvolatile software programs, instructions, and modules stored in memory 1202.
Memory 1202 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the failure processing method of the PCI device, or the like. In addition, memory 1202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 1202 optionally includes memory located remotely from processor 1201, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Computer instructions 1203 corresponding to a failure handling method of one or more PCI devices are stored in memory 1202 that, when executed by processor 1201, perform the failure handling method of a PCI device in any of the method embodiments described above.
Any one embodiment of a computer device that performs the above-described method for handling a failure of a PCI device may achieve the same or similar effects as any one of the foregoing method embodiments corresponding thereto.
Finally, it should be noted that, as will be understood by those skilled in the art, implementing all or part of the above-described methods in the embodiments may be implemented by a computer program to instruct related hardware, and the program of the fault handling method of the PCI device may be stored in a computer non-volatile readable storage medium, where the program may include the flow of the embodiments of the above-described methods when executed. The non-volatile readable storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the method embodiments described above.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above embodiments may be implemented by hardware, or may be implemented by a program to instruct related hardware, and the program may be stored in a computer non-volatile readable storage medium, where the non-volatile readable storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the application, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the application, and many other variations of the different aspects of the embodiments of the application as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present application.
Claims (22)
1. A fault handling method for a PCI device, applied to a fault handling system, the fault handling system being disposed on a motherboard of a server, the fault handling system being connected to a PCI bridge link deployed in the server, the method comprising:
detecting PCI devices connected on the PCI bridge link, wherein the PCI devices comprise EP devices and/or PCI SWITCH devices, and a plurality of EP devices are allowed to be connected on the PCI SWITCH device;
Determining a fault processing mode of the PCI device according to the detected PCI device, and processing the fault of the PCI device according to the fault processing mode;
wherein, in the case that the PCI device connected to the PCI bridge link is only the EP device, a failure processing function of the EP device is started, the failure processing function being a function provided in the EP device and used for processing a failure of the EP device; in the case that the PCI device connected to the PCI bridge link includes the PCI SWITCH device, a failure threshold of the PCI SWITCH device is set, and a transmission rate of the PCI bridge link is adjusted based on the failure threshold, the failure threshold being a threshold on the recorded number of failures of the PCI SWITCH device; and the fault processing mode includes the failure processing function and the manner of setting the failure threshold of the PCI SWITCH device.
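The selection between the two fault processing modes in claim 1 can be sketched as a simple dispatch on the detected device types. The function name `choose_mode`, the mode strings and the `"EP"` / `"SWITCH"` labels are assumptions introduced for this illustration.

```python
from typing import Iterable


def choose_mode(device_kinds: Iterable[str]) -> str:
    """Pick a fault processing mode from the detected device types.

    - only EP devices on the link     -> start the EP's own fault function
    - a SWITCH device is on the link  -> set a failure threshold on the switch
    """
    kinds = list(device_kinds)
    if not kinds:
        raise ValueError("no PCI device detected on the link")
    if "SWITCH" in kinds:
        return "switch_threshold"
    return "ep_fault_function"
```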
2. The method according to claim 1, wherein, in case the PCI device connected on the PCI bridge link is only the EP device, starting a failure handling function of the EP device, comprising:
setting an AER function of the EP device to an enabled state or a started state, so as to start the fault processing function of the EP device, in the case that the PCI device connected on the PCI bridge link is only the EP device, wherein the AER function comprises the fault processing function.
3. The method according to claim 1, wherein, in case the PCI device connected on the PCI bridge link is only the EP device, after starting a failure handling function of the EP device, the method further comprises:
Monitoring a first failure event of the EP device, wherein the first failure event comprises at least one of: a first failure event that is allowed to be repaired, a first failure event that is not allowed to be repaired, a first failure event that renders the EP device unable to operate, and a first failure event under which the EP device can continue to operate;
Reporting the first fault event to a target operating system and a baseboard management controller in the server under the condition that the first fault event is monitored;
wherein the target operating system is configured to perform one of the following operations based on the first failure event: restarting the server, executing self-checking operation, and generating fault prompt information; the baseboard management controller is configured to perform at least one of the following operations based on the first failure event: recording the first fault event, executing fault checking operation, and repairing the first fault event.
4. The method of claim 1, wherein setting the failure threshold of the PCI SWITCH device in the case where the PCI device connected on the PCI bridge link includes the PCI SWITCH device comprises:
Determining a target register set in the PCI SWITCH device in a case that the PCI device connected to the PCI bridge link includes the PCI SWITCH device, wherein the target register is used for recording the number of failures of the PCI SWITCH device;
setting the fault threshold of the target register according to a preset condition.
5. The method of claim 4, wherein the PCI devices connected on the PCI bridge link comprise the PCI SWITCH devices, comprising one of:
The PCI devices connected on the PCI bridge link are only PCI SWITCH devices;
the PCI devices connected on the PCI bridge link are only PCI SWITCH devices, and one or more EP devices are connected in the PCI SWITCH devices;
The PCI devices connected on the PCI bridge link comprise PCI SWITCH devices and EP devices, the PCI SWITCH devices are connected with a first processor in the server, and the EP devices are connected with a second processor in the server;
The PCI devices connected on the PCI bridge link comprise PCI SWITCH devices and EP devices, the PCI SWITCH devices are connected with a first processor in the server, one or more other EP devices are connected in PCI SWITCH devices, and the EP devices are connected with a second processor in the server.
6. The method of claim 4, wherein setting the fault threshold of the target register according to a preset condition comprises:
setting said fault threshold of said target register according to said preset conditions and by at least one of:
Determining operation information of the PCI SWITCH device and setting the fault threshold based on the operation information, wherein the operation information comprises at least one of the following: the operating environment, the operating performance, the operating security and the operating stability;
Determining a first number of the PCI SWITCH devices and a second number of the EP devices connected to the PCI SWITCH devices, and setting the fault threshold based on the first number and the second number.
7. The method of claim 4, wherein after setting the failure threshold of the target register according to a preset condition, the method further comprises:
detecting the PCI SWITCH device and one or more EP devices connected to the PCI SWITCH device;
triggering a counter in the target register to execute a fault accumulation operation when the PCI SWITCH device fails or any EP device connected to the PCI SWITCH device fails, wherein the fault accumulation operation is used for accumulating the number of failures of the PCI SWITCH device.
8. The method of claim 1, wherein, in the case where the PCI device connected on the PCI bridge link includes the PCI SWITCH device, before setting the failure threshold of the PCI SWITCH device, the method further comprises one of:
in the case that one or more EP devices are connected to the PCI SWITCH device, starting the fault processing function of the EP devices, and receiving, through the PCI SWITCH device, the result of the EP devices processing faults according to the fault processing function;
closing the fault processing function of the EP device in the case that the PCI SWITCH device is connected to a first processor in the server and the EP device is connected to a second processor in the server;
in the case that the PCI SWITCH device is connected to a first processor in the server, one or more other EP devices are connected to the PCI SWITCH device, and the EP device is connected to a second processor in the server, closing the fault processing function of the EP device, starting the fault processing functions of the other EP devices, and receiving, through the PCI SWITCH device, the result of the other EP devices processing faults according to those fault processing functions.
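One way to read the three cases of claim 8 is that, once a PCI SWITCH device is present, an EP device behind the switch has its fault processing function started, while an EP device attached directly to another processor has it closed. The sketch below encodes only that reading; the `Ep` class and the `behind_switch` flag are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Ep:
    name: str
    behind_switch: bool  # True if the EP is connected to the PCI SWITCH device


def fault_function_settings(eps: Iterable[Ep]) -> Dict[str, bool]:
    """Map each EP name to whether its fault processing function is started.

    Assumes a PCI SWITCH device is present on the link: EPs behind the switch
    are started (their results are collected through the switch), and EPs
    attached directly to a processor are closed.
    """
    return {ep.name: ep.behind_switch for ep in eps}
```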
9. The method of claim 1, wherein adjusting the transmission rate of the PCI bridge link based on the failure threshold comprises:
triggering adjustment of the transmission rate of the PCI bridge link to obtain a first adjusted transmission rate in the case that the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold.
10. The method of claim 9, wherein, in the case where the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold, triggering an adjustment of the transmission rate of the PCI bridge link to obtain a first adjusted transmission rate, comprising at least one of:
triggering adjustment of the transmission rate of the PCI SWITCH device in the case that the first number of failures is greater than or equal to the failure threshold, so as to obtain the first adjusted transmission rate;
triggering adjustment of the transmission rate of one or more EP devices connected to the PCI SWITCH device in the case that the first number of failures is greater than or equal to the failure threshold, so as to obtain the first adjusted transmission rate.
11. The method of claim 9, wherein after triggering adjustment of the transmission rate of the PCI bridge link to obtain the first adjusted transmission rate in the case where the accumulated first number of failures of the PCI bridge link is greater than or equal to the failure threshold, the method further comprises:
clearing the first failure times;
and re-recording the times of the faults of the PCI bridge link to obtain a second times of faults.
12. The method of claim 11, wherein after re-recording the number of failures of the PCI bridge link to obtain the second number of failures, the method further comprises:
continuing to trigger adjustment of the transmission rate of the PCI bridge link in the case that the second number of failures is greater than or equal to the failure threshold, so as to obtain a second adjusted transmission rate;
controlling the operation of the PCI bridge link based on the second adjusted transmission rate.
13. The method of claim 12, wherein after controlling operation of the PCI bridge link based on the second adjusted transmission rate, the method further comprises:
stopping the adjustment of the transmission rate of the PCI bridge link in the case that the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link;
triggering adjustment of the network bandwidth of the PCI bridge link to obtain an adjusted network bandwidth in the case that the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third number of failures of the PCI bridge link is greater than or equal to the failure threshold, wherein the third number of failures is the number of failures recorded while the PCI bridge link operates at the second adjusted transmission rate.
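The sequence described in claims 9 to 13 (count failures, lower the rate when the threshold is reached, clear the count, repeat, and hand over to bandwidth adjustment once the lowest rate is in use) can be sketched as a small state machine. `LinkGovernor`, the returned action strings and the rate values used in the usage note are illustrative assumptions; clearing the counter when bandwidth adjustment is triggered is also an assumption, since the claims only state the clearing step after a rate adjustment.

```python
from typing import Iterable, List


class LinkGovernor:
    """Threshold-driven rate step-down for one link (illustrative sketch)."""

    def __init__(self, rates: Iterable[float], threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        # Supported rates, highest first; the last entry is the lowest rate.
        self.rates: List[float] = sorted(rates, reverse=True)
        if not self.rates:
            raise ValueError("at least one rate is required")
        self.threshold = threshold
        self.index = 0
        self.failures = 0

    @property
    def rate(self) -> float:
        return self.rates[self.index]

    def record_failure(self) -> str:
        """Accumulate one failure and return the action that was triggered."""
        self.failures += 1
        if self.failures < self.threshold:
            return "none"
        # Threshold reached: clear the count and start re-recording.
        self.failures = 0
        if self.index < len(self.rates) - 1:
            self.index += 1
            return "rate_lowered"
        # Already at the lowest rate: rate adjustment stops and the
        # bandwidth adjustment takes over (counter clearing here is assumed).
        return "adjust_bandwidth"
```

For example, with hypothetical rates `[16.0, 8.0, 5.0, 2.5]` and a threshold of 2, every second failure lowers the rate by one step until 2.5 is reached, after which the next threshold crossing returns `"adjust_bandwidth"` and the rate stays unchanged.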
14. The method of claim 12, wherein after triggering adjustment of the network bandwidth of the PCI bridge link to obtain the adjusted network bandwidth in the case where the second adjusted transmission rate is equal to the lowest rate of the PCI bridge link and the recorded third number of failures of the PCI bridge link is greater than or equal to the failure threshold, the method further comprises:
Generating a second failure event in the case that the adjusted network bandwidth is equal to the lowest network bandwidth of the PCI bridge link, wherein the second failure event includes at least one of: a second failure event that is allowed to be repaired, a second failure event that is not allowed to be repaired, a second failure event that renders the PCI bridge link unable to operate, and a second failure event under which the PCI bridge link can continue to operate;
sending the second failure event to a baseboard management controller in the server, wherein the baseboard management controller is used for recording the second failure event and generating fault processing information, and the fault processing information is used for instructing replacement of the failed PCI SWITCH device and/or replacement of the EP device connected to the failed PCI SWITCH device.
15. The method of claim 1, wherein after adjusting the transmission rate of the PCI bridge link based on the failure threshold, the method further comprises:
Generating target information and sending the target information to a baseboard management controller deployed in the server, wherein the target information comprises a target transmission rate, the target transmission rate is the rate obtained after the transmission rate of the PCI bridge link is adjusted, and the target transmission rate is lower than the transmission rate of the PCI bridge link before the adjustment;
recording the target transmission rate through the baseboard management controller, and generating prompt information indicating that the first number of failures of the PCI bridge link is greater than the failure threshold.
16. The method of claim 1, wherein detecting a PCI device connected on the PCI bridge link comprises:
When the server is started, starting the fault processing system, and detecting N levels on the PCI bridge link, wherein N is a natural number which is greater than or equal to 1;
determining the type of the PCI devices in each of the levels.
17. A failure handling system, characterized in that it is arranged on a motherboard of a server, the failure handling system comprising a target processor, which is connected to a PCI bridge link deployed in the server, the target processor being adapted to implement the steps of the method as claimed in any of claims 1-16.
18. A server, characterized by comprising: the fault handling system as claimed in claim 17, wherein PCI devices are connected on the PCI bridge link.
19. A failure processing apparatus of a PCI device, characterized by being applied to a failure processing system, the failure processing system being provided on a motherboard of a server, the failure processing system being connected to a PCI bridge link deployed in the server, the apparatus comprising:
A first detection module, configured to detect a PCI device connected to the PCI bridge link, where the PCI device includes an EP device and/or PCI SWITCH devices, and a plurality of EP devices are allowed to be connected to the PCI SWITCH device;
the first processing module is used for determining a fault processing mode of the PCI equipment according to the detected PCI equipment and processing the fault of the PCI equipment according to the fault processing mode;
wherein the first processing module comprises: a first processing unit, configured to, in a case where the PCI device connected to the PCI bridge link is only the EP device, start a fault handling function of the EP device, where the fault handling function is a function provided in the EP device and is configured to handle a fault of the EP device; and a second processing unit, configured to, in a case where the PCI device connected to the PCI bridge link includes the PCI SWITCH device, set a failure threshold of the PCI SWITCH device, and adjust a transmission rate of the PCI bridge link based on the failure threshold, where the failure threshold is a threshold on the recorded number of failures of the PCI SWITCH device, and the fault processing mode includes the fault handling function and the manner of setting the failure threshold of the PCI SWITCH device.
20. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method as claimed in any one of claims 1 to 16.
21. A computer non-transitory readable storage medium, characterized in that the computer non-transitory readable storage medium has stored therein a computer program, wherein the computer program, when executed by a processor, implements the steps of the method as claimed in any one of claims 1 to 16.
22. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410671171.0A CN118245269B (en) | 2024-05-28 | 2024-05-28 | PCI equipment fault processing method and device and fault processing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118245269A (en) | 2024-06-25 |
CN118245269B (en) | 2024-09-17 |
Family
ID=91559323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410671171.0A Active CN118245269B (en) | 2024-05-28 | 2024-05-28 | PCI equipment fault processing method and device and fault processing system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118245269B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118733322B (en) * | 2024-08-30 | 2024-12-20 | 苏州元脑智能科技有限公司 | Link fault identification and repair method, device, server and storage medium |
CN120276906B (en) * | 2025-04-18 | 2025-08-08 | 苏州元脑智能科技有限公司 | Fault detection system, method, storage medium, electronic device, and program product |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662808A (en) * | 2012-03-21 | 2012-09-12 | 北京星网锐捷网络技术有限公司 | Method and device for realizing hardware fault detection on PCIE (peripheral component interconnect express) |
CN117499214A (en) * | 2023-12-19 | 2024-02-02 | 苏州元脑智能科技有限公司 | Method and device for determining fault equipment, storage medium and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9590892B2 (en) * | 2013-12-02 | 2017-03-07 | University Of Ontario Institute Of Technology | Proactive controller for failure resiliency in communication networks |
Also Published As
Publication number | Publication date |
---|---|
CN118245269A (en) | 2024-06-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||