CN106598790A

CN106598790A - Server hardware failure detection method, apparatus of server, and server

Info

Publication number: CN106598790A
Application number: CN201510673005.5A
Authority: CN
Inventors: 李存龙
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2015-10-16
Filing date: 2015-10-16
Publication date: 2017-04-26
Also published as: WO2017063505A1

Abstract

The present invention provides a server hardware fault detection method, a corresponding device, and a server. In the method, the basic input output system (BIOS) device of the server detects that the server has entered the startup phase and then performs hardware fault detection and analysis in each working phase of the server, the working phases including the startup phase. Because the hardware fault information detected by the BIOS device covers the entire operating cycle of the server, faults in the server hardware can be pre-detected and handled in time, which improves the stability and reliability of server operation. Further, the BIOS device stores the hardware fault information obtained by detection and analysis, which makes it easier for maintenance personnel to handle faults and provides unified storage and management of the hardware fault information.

Description

Server hardware fault detection method, device thereof, and server

Technical field

The invention relates to the field of computers and communications, and in particular to a server hardware fault detection method, a device thereof, and a server.

Background

Current mid-range and high-end servers generally provide some "black box" functions for recording fault information when the operating system crashes. These functions can record kernel exceptions of the OS (Operating System) such as kernel errors, restarts and resets, and abnormal print output. Some simple hardware errors can also be recorded through the SEL (System Event Log); errors can be collected on site after a fault has occurred by out-of-band means (for example, a joint test link); or device abnormalities can be monitored passively through an in-band exception trigger mechanism, which records only when an abnormal condition triggers its exception recording module. These methods help maintenance personnel determine the cause of a fault to some extent, but they have the following defects:

1. The above methods rely on passively triggered detection and recording and lack active detection of the server, in particular active screening and monitoring of server hardware faults. When the system starts and runs normally but service quality drops significantly, the system does not trigger fault recording, so fault information is missed and is difficult for maintenance personnel to trace.

2. Because fault information is detected and recorded only when the system crashes or an exception trigger occurs, the ability to collect and analyze hardware faults while the system (service) is running is seriously insufficient. The early-warning capability of the system is therefore inadequate, which reduces its stability and reliability.

3. The recorded fault information is too simple and scattered and is not managed in an accurate, unified way. The fault information cannot be analyzed in one step; extensive later analysis, screening, and cross-validation are needed to find the main fault source.

4. Collecting fault information out of band is constrained by the availability of specialists, the site environment, information security, and so on, and the costs of environment deployment, personnel coordination, and environment restoration are high.

Therefore, existing schemes for recording server fault information can detect and record fault information only under specific conditions, and the recorded information is simple and scattered and requires extensive later analysis.

Summary of the invention

The main technical problem to be solved by the present invention is to provide a server hardware fault detection method, a device thereof, and a server, so as to solve the problem that the prior art cannot detect, record, and store real-time fault information for the hardware in each working phase of the server.

To solve the above technical problem, the present invention provides a server hardware fault detection method, comprising:

the basic input output system (BIOS) device of the server detects that the server has entered a startup phase;

the BIOS device starts to perform hardware fault detection on the server in each working phase, the working phases including the startup phase;

the BIOS device stores the detected hardware fault information.

In an embodiment of the present invention, the startup phase includes an initialization phase, and the hardware fault detection performed by the BIOS device on the server in the initialization phase comprises:

the BIOS device, using the hardware detection mechanism provided by the server, pre-detects at least one of the CPU, memory, chipset, and power supply of the server to obtain current hardware information, screens out the information of faulty hardware from that hardware information, and analyzes it to obtain the corresponding hardware fault information.

In another embodiment of the present invention, the startup phase further includes a device enumeration phase, and the hardware fault detection performed by the BIOS device on the server in the device enumeration phase comprises:

the BIOS device obtains the status information and resource information of each hardware component on the server and identifies from it the fault information of the faulty hardware.

In another embodiment of the present invention, the startup phase is a cold startup phase or a warm startup phase.

In another embodiment of the present invention, the working phases further include at least one of an operating system pre-boot phase and an operating system service running phase.

In another embodiment of the present invention, when the working phases include the operating system pre-boot phase, the hardware fault detection performed by the BIOS device on the server in the operating system pre-boot phase comprises:

the BIOS device pre-detects the out-of-band hardware devices of the server that is about to boot;

obtaining the current hardware information of the hardware devices;

screening out, from the current hardware information, the fault information of the faulty hardware devices;

when the working phases include the operating system service running phase, the hardware fault detection performed by the BIOS device on the server in the operating system service running phase comprises: the BIOS device determines whether a hardware interrupt signal of the server has arrived and, if so, the BIOS device checks the hardware related to the operating system and obtains the fault information of that hardware.

In another embodiment of the present invention, before the BIOS device stores the detected fault information, the method further comprises allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information.

To solve the above technical problem, the present invention further provides a BIOS device, comprising:

a fault information detection trigger module, configured to detect whether the server has entered the startup phase;

a fault information detection module, configured to start hardware fault detection on the server in each working phase when the fault information detection trigger module detects that the server has entered the startup phase, the working phases including the startup phase;

a fault information storage module, configured to store the hardware fault information detected by the fault information detection module.

In another embodiment of the present invention, the device further comprises a storage setting module, configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the hardware fault information.

To solve the above technical problem, the present invention further provides a server comprising the BIOS device described above.

The beneficial effects of the present invention are as follows:

In the server hardware fault detection method, device, and server provided by the present invention, when the Basic Input Output System (BIOS) device of the server detects that the server has entered the startup phase, it starts to perform fault detection and analysis on the hardware in each working phase of the server and thereby obtains the corresponding hardware fault information. Because the server's own BIOS device is used, all hardware faults that may occur throughout the entire operating cycle of the server can be detected, which improves the comprehensiveness and accuracy of hardware fault detection and makes unified storage and management of the server's hardware fault information easier. When maintaining the server, maintenance personnel can accurately obtain the hardware fault information and learn the location of the hardware that needs handling and the cause of the fault, which further improves the stability and reliability of the server.

Description of the drawings

Fig. 1 is a flowchart of the server hardware fault detection method provided by the present invention;

Fig. 2 is a flowchart of hardware fault detection in the server initialization phase provided by the present invention;

Fig. 3 is a flowchart of hardware fault detection in the device enumeration phase of the present invention;

Fig. 4 is a flowchart of hardware fault detection in the operating system pre-boot phase of the present invention;

Fig. 5 is a flowchart of hardware fault detection in the operating system service running phase of the present invention;

Fig. 6 is a structural block diagram of the BIOS device provided by the present invention.

Detailed description

The present invention is described in further detail below through specific embodiments with reference to the accompanying drawings.

Embodiment one:

Please refer to Fig. 1, which is a flowchart of the server hardware fault detection method provided by the present invention. In the method of this embodiment, the BIOS device actively performs fault detection on the hardware of the server. "Actively" here means that, according to a detection mechanism preset on the server, the BIOS device performs fault detection on the running hardware of the server immediately when the server starts, or performs hardware fault detection in every working phase of the server. The method comprises the following steps:

S101: the BIOS device of the server detects that the server has entered the startup phase;

In this embodiment, the startup phase of the server is a cold startup phase or a warm startup phase. When the startup phase is a cold startup phase, the BIOS device can detect entry into the startup phase in ways including, but not limited to, the following: detecting whether the power button on the server has been pressed, detecting whether the power supply circuit of the server is connected to the server's power interface, or checking the status flag of the power supply. If so, the server has entered the startup phase and is running, and step S102 is performed; otherwise, detection continues;

When the startup phase is a warm startup phase, the BIOS device detects whether a reset signal has been input to the server. If so, the server begins a warm start and step S102 is performed; otherwise, detection continues. The reset signal may be triggered by hardware, for example by pressing a reset button; it may be input by software, for example by code or a tool that sends it to the server on a schedule; or it may be input by a user through a command or a "restart" button.

S102: the BIOS device starts to perform fault detection on the hardware of the server in each working phase, the working phases including the startup phase;

S103: the BIOS device stores the detected hardware fault information.
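Steps S101 to S103 can be sketched in a few lines of Python. This is only an illustrative model, not firmware code: the `PlatformSignals` fields, the `detect_boot` and `run_detection` helpers, and the dictionary-based record are all hypothetical names introduced here. The sketch assumes that a reset signal takes priority over the power indications, since the power flags normally remain set during a warm start.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class BootKind(Enum):
    COLD = "cold"
    WARM = "warm"


@dataclass
class PlatformSignals:
    """Hypothetical snapshot of the signals named in S101."""
    power_button_pressed: bool = False
    power_rail_connected: bool = False
    power_status_flag: bool = False
    reset_signal: bool = False


def detect_boot(sig: PlatformSignals) -> Optional[BootKind]:
    """S101: return the kind of startup, or None if the server has not started."""
    # Assumption: a reset signal (button, scheduled tool, or user command)
    # is checked first, because power is already on during a warm start.
    if sig.reset_signal:
        return BootKind.WARM
    # Any one of the three cold-start indications is sufficient.
    if sig.power_button_pressed or sig.power_rail_connected or sig.power_status_flag:
        return BootKind.COLD
    return None


# A phase detector returns a list of fault descriptions for its phase.
PhaseDetector = Callable[[], List[str]]


def run_detection(sig: PlatformSignals,
                  detectors: Dict[str, PhaseDetector],
                  store: List[dict]) -> bool:
    """Run S101 to S103 once; return False if the server has not started."""
    kind = detect_boot(sig)
    if kind is None:
        return False  # the caller keeps polling
    for phase, detect in detectors.items():      # S102: every working phase
        for fault in detect():
            store.append({"boot": kind.value, "phase": phase, "fault": fault})  # S103
    return True
```

A caller would pass one detector per working phase (initialization, device enumeration, pre-boot, service running) and poll `run_detection` until it returns `True`.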

In this embodiment, before step S103, the method further comprises allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information. Further, each fault record stored by the BIOS device contains: the time, the event that occurred, the severity, the specific location or fault details, and the suggested handling.
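As a rough illustration of the record fields and the pre-allocated storage area, the Python sketch below models the area as a fixed-size byte region. The on-flash layout is not specified in this document; the 2-byte length prefix, the JSON encoding, and the names `FaultRecord` and `FaultArea` are assumptions made purely for the example. Erased flash is modeled as 0xFF bytes.

```python
import json
import struct
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FaultRecord:
    time: str        # when the fault was detected
    event: str       # the event that occurred
    severity: str    # illustrative values: "warning", "error"
    location: str    # specific location or fault details
    suggestion: str  # suggested handling


class FaultArea:
    """Fault storage area modeled as a fixed-size byte region.

    Layout (an assumption for illustration): each record is a 2-byte
    little-endian length followed by UTF-8 JSON. A length of 0xFFFF
    means "erased", i.e. no further records.
    """

    def __init__(self, size: int) -> None:
        self.buf = bytearray(b"\xff" * size)
        self.offset = 0

    def append(self, rec: FaultRecord) -> bool:
        payload = json.dumps(asdict(rec)).encode("utf-8")
        need = 2 + len(payload)
        if len(payload) >= 0xFFFF or self.offset + need > len(self.buf):
            return False  # area full or record too large; caller decides what to do
        struct.pack_into("<H", self.buf, self.offset, len(payload))
        self.buf[self.offset + 2:self.offset + need] = payload
        self.offset += need
        return True

    def read_all(self) -> List[FaultRecord]:
        out: List[FaultRecord] = []
        pos = 0
        while pos + 2 <= len(self.buf):
            (n,) = struct.unpack_from("<H", self.buf, pos)
            if n == 0xFFFF:  # erased region reached
                break
            raw = bytes(self.buf[pos + 2:pos + 2 + n]).decode("utf-8")
            out.append(FaultRecord(**json.loads(raw)))
            pos += 2 + n
        return out
```

An append-only layout is used here because it matches how flash is typically written (bits are cleared, not set, until an erase); a real design would also need an erase or wrap-around policy, which this sketch leaves out.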

In this embodiment, once the above steps have detected and stored the hardware fault information, maintenance personnel who need to service the server can obtain the stored information directly through an out-of-band control platform or web user interface connected to the storage area. This makes it easy to trace how the fault developed and to restore or replace the faulty hardware on site (for example, directly replacing a particular CPU, a particular memory module, or a faulty bus interface card). On mid-range and high-end servers, hot-plug technology (including but not limited to CPU hot-plug, memory hot-plug, and bus interface hot-plug) can keep the system running without interruption, achieving early detection, early warning, early prevention, and early handling. Even if the server hangs during a cold or warm start and its peripherals are unusable (for example, the network port is down, the screen stays dark, or the keyboard and mouse do not respond), valid fault information can still be obtained.

In this embodiment, operation and maintenance personnel who obtain the hardware fault information through the control platform can, in addition to handling it promptly, dump it to a separate out-of-band storage device.

In this embodiment, the startup phase includes an initialization phase. The steps by which the BIOS device performs fault detection on the hardware of the server in the initialization phase are shown in Fig. 2 and comprise:

S201: the BIOS device initializes the CPU, memory, chipset, and power supply;

S202: the BIOS device detects and obtains the current hardware information of at least one of the CPU, memory, chipset, and power supply;

In this embodiment, the BIOS device uses the hardware detection mechanism provided by the server to pre-detect at least one of the CPU, memory, chipset, and power supply of the server and obtain current hardware information, then screens out the information of faulty hardware and analyzes it to obtain the corresponding hardware fault information.

Specifically, in this embodiment, when the BIOS device detects that the server has entered the initialization phase, it actively initiates detection of faults and configuration in server hardware such as the CPU, memory, chipset, and power supply. It can do so by using the BIOS device itself, by actively applying stress, by using the hardware detection mechanisms provided by the CPU and chipset, or by using integrated in-band tools (such as a memory test tool or a system event log test tool). It obtains the corresponding hardware information, performs pre-analysis, pre-statistics, pre-screening, scanning, and measurement on it, collects the test results, and filters out the valid fault information (including information that may trigger later system exceptions), which it records in detail and stores. If a system exception occurs on the server in this phase, more detailed hardware fault information has therefore already been obtained and recorded before the exception.

In this phase, the recorded fault information includes but is not limited to: CPU errors and alarms; CBO (Caching Agent) errors and alarms; QPI (QuickPath Interconnect) errors and alarms; IIO (Integrated I/O) port errors and alarms; HA (Home Agent) errors and alarms; IMC (Integrated Memory Controller) errors and alarms; PCU (Power Control Unit) errors and alarms; power supply and voltage errors and alarms; and memory errors and alarms (including errors and alarms for the memory module itself, memory channels, memory population, memory voltage, memory incompatibility, and configuration).
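The screening step (obtain current hardware information, then keep only what is faulty or likely to cause a later exception) can be illustrated with a small Python sketch. The `Reading` structure, the limit values, and the `warn_margin` idea are assumptions for the example; the document does not define how readings are represented or which thresholds apply.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Reading:
    """One hypothetical measurement taken during initialization."""
    component: str            # e.g. "CPU0", "DIMM3", "PSU1"
    metric: str               # e.g. "vcore_v", "vdd_v"
    value: float
    low: float                # lowest acceptable value
    high: float               # highest acceptable value
    warn_margin: float = 0.0  # distance from a limit at which to raise an alarm


def screen(readings: List[Reading]) -> List[Tuple[str, str, str]]:
    """Keep only readings that are out of range or close to a limit."""
    faults = []
    for r in readings:
        if r.value < r.low or r.value > r.high:
            # Out of range: a fault to record as an error.
            faults.append((r.component, r.metric, "error"))
        elif r.value - r.low < r.warn_margin or r.high - r.value < r.warn_margin:
            # In range but near a limit: may trigger a later exception.
            faults.append((r.component, r.metric, "warning"))
    return faults
```

Healthy readings are dropped, so only the entries worth recording reach the fault storage area.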

In this embodiment, the startup phase further includes a device enumeration phase. The flowchart of hardware fault detection in this phase is shown in Fig. 3 and comprises the following steps:

S301: the BIOS device starts device enumeration;

S302: the BIOS device detects and obtains the current information of the devices;

Further, the BIOS device obtains the status information and resource information of each hardware component on the server and identifies from it the fault information of the faulty software and hardware. In this phase, the fault information includes but is not limited to: device access errors (including invalid memory and I/O resource requests), third-party firmware (option ROM) not executed (including insufficient space or wrong format), and damaged devices being disabled. Specifically, in this embodiment, when the server issues probe tasks to PCIe (Peripheral Component Interconnect Express) peripherals and calculates their resource requirements, the BIOS device, following the detection mechanism, identifies the identifier, vendor information, device class information, and size of the third-party firmware (option ROM) as defined by the industry specification, checks hardware status indications (such as link status and bandwidth information), and identifies from this information the fault information of the faulty hardware for storage.
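The option ROM and link checks described above might look roughly like the following Python sketch. It relies on the commonly documented PCI expansion ROM layout (a 0x55 0xAA signature at offset 0, a pointer at offset 0x18 to a data structure beginning with "PCIR", and an image length in 512-byte units at offset 0x10 of that structure); treat those offsets as an assumption to verify against the PCI Firmware Specification. The function names and message strings are hypothetical.

```python
import struct
from typing import Optional


def check_option_rom(image: bytes, free_bytes: int) -> Optional[str]:
    """Return a fault description for an option ROM image, or None if it looks usable."""
    # Wrong format: the ROM header must start with the 0x55 0xAA signature.
    if len(image) < 0x1A or image[0:2] != b"\x55\xaa":
        return "option ROM: bad format (missing 0x55AA signature)"
    # Offset 0x18 holds a little-endian pointer to the PCI data structure.
    (pcir,) = struct.unpack_from("<H", image, 0x18)
    if pcir + 0x12 > len(image) or image[pcir:pcir + 4] != b"PCIR":
        return "option ROM: bad format (missing PCIR data structure)"
    vendor, device = struct.unpack_from("<HH", image, pcir + 4)
    (blocks,) = struct.unpack_from("<H", image, pcir + 0x10)  # 512-byte units
    # Insufficient space: the image does not fit in the remaining shadow area.
    if blocks * 512 > free_bytes:
        return f"option ROM {vendor:04x}:{device:04x}: not executed (insufficient space)"
    return None


def check_link(link_up: bool, width: int, expected_width: int) -> Optional[str]:
    """Return a fault description for a link status indication, or None if healthy."""
    if not link_up:
        return "link down"
    if width < expected_width:
        return f"link width degraded: x{width} < x{expected_width}"
    return None
```

Each function returns `None` for healthy hardware, so the caller can store only the non-`None` results as fault information.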

In this embodiment, the working phases further include at least one of an operating system pre-boot phase and an operating system service running phase. Figs. 4 and 5 are flowcharts of hardware fault detection in the operating system pre-boot phase and the operating system service running phase, respectively;

As shown in Fig. 4, hardware fault detection and analysis in the operating system pre-boot phase comprises the following steps:

S401: the BIOS device pre-detects the out-of-band hardware devices of the server that is about to boot;

S402: obtaining the current hardware information of the hardware devices;

S403: screening out, from the current hardware information, the fault information of the faulty hardware devices;

In this embodiment, the out-of-band hardware devices of the server include but are not limited to: hard disks, server network ports, and device boot attributes. The fault information includes but is not limited to: no bootable device, damaged hard disk (or USB flash drive) (including a corrupted MBR), PXE network boot failure (including port information and failed network ping), and abnormal ME (Management Engine) working status. Preferably, when the BIOS device performs fault detection on the hard disk partitions in this phase, it actively initiates detection, reads the master boot record (MBR) of the hard disk (or USB flash drive), analyzes the boot flag, the end signature, and the error message data area, and determines, according to the hardware detection mechanism provided by the server, whether the disk is bootable or damaged. It also determines the status and working mode of the communication link between the server and the host by issuing a self-test command, checks network connectivity through DHCP (Dynamic Host Configuration Protocol) communication, and lists the boot devices of the board to check whether a bootable device exists.
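The MBR analysis can be illustrated with a short Python sketch. It uses the classic MBR layout: a 512-byte sector with four 16-byte partition entries starting at offset 446 and the end signature 0x55 0xAA at offsets 510 and 511; in each entry, byte 0 is the boot flag (0x80 for active, 0x00 for inactive) and byte 4 is the partition type. The function name and result strings are hypothetical, and the error message data area mentioned above is not examined here.

```python
def check_mbr(sector: bytes) -> str:
    """Classify a disk's first sector as bootable, not bootable, or damaged."""
    if len(sector) < 512:
        return "damaged: short read"
    # End signature: the last two bytes of the sector must be 0x55 0xAA.
    if sector[510:512] != b"\x55\xaa":
        return "damaged: missing end signature"
    # Four 16-byte partition entries start at offset 446.
    entries = [sector[446 + 16 * i:446 + 16 * (i + 1)] for i in range(4)]
    # Boot flag: only 0x00 (inactive) and 0x80 (active) are valid.
    if any(e[0] not in (0x00, 0x80) for e in entries):
        return "damaged: invalid boot flag"
    # Bootable if at least one non-empty partition is marked active.
    if not any(e[0] == 0x80 and e[4] != 0x00 for e in entries):
        return "not bootable: no active partition"
    return "bootable"
```

Anything other than `"bootable"` would be recorded as fault information for the disk before the operating system is booted.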

As shown in Fig. 5, hardware fault detection and analysis in the operating system service running phase comprises the following steps:

S501: determining whether a hardware interrupt signal of the server has arrived;

S502: if so, the BIOS device checks the hardware related to the operating system;

S503: obtaining the fault information of the hardware;

In the above detection and analysis, when it is determined that the hardware interrupt signal has arrived, the BIOS device performs fault detection on the hardware related to the running service, analyzes, classifies, and counts the detected hardware fault information, and then stores it. The fault information detected in this phase includes but is not limited to: CPU errors and alarms, CBO errors and alarms, QPI errors and alarms, VT-d errors and alarms, IIO port errors and alarms, memory errors and alarms, PCIe errors and alarms, PCU errors and alarms, and Ubox (Utility Box) errors and alarms. Preferably, for hardware fault detection in this phase, the BIOS device enables the MCA (Machine Check Architecture) function and the AER (Advanced Error Reporting) function, turns on the error detection bank (Machine Check Error Bank) of each component, and registers the fault identification and classification function together with the error-handling hook function of each component. When an MCE (Machine-Check Exception) occurs, the hardware pulls the error status pin low and generates a system management interrupt (SMI). The BIOS device then gains control, reads the error status registers of the CPU and the bridge chip through the hardware fault identification and classification function, obtains the details of the Machine Check Error Bank, and decodes them according to the chip manual to separate out and interpret the specific hardware error information.
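To make the decoding step concrete, the Python sketch below decodes one machine-check bank status value using the architectural bit positions of the IA32_MCi_STATUS register (VAL at bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, model-specific error code in bits 31:16, MCA error code in bits 15:0). The function name, the output dictionary, and the three-level severity mapping are assumptions for illustration; the document does not say how the BIOS device classifies severity.

```python
from typing import Optional


def bit(value: int, n: int) -> bool:
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mc_status(bank: int, status: int) -> Optional[dict]:
    """Decode one bank's 64-bit status value; None if nothing is logged."""
    if not bit(status, 63):          # VAL clear: the bank holds no valid error
        return None
    uncorrected = bit(status, 61)    # UC
    context_corrupt = bit(status, 57)  # PCC
    # Hypothetical severity mapping, not taken from the document.
    if context_corrupt:
        severity = "fatal"
    elif uncorrected:
        severity = "error"
    else:
        severity = "corrected"
    return {
        "bank": bank,
        "overflow": bit(status, 62),     # OVER: an earlier error was overwritten
        "uncorrected": uncorrected,
        "enabled": bit(status, 60),      # EN
        "misc_valid": bit(status, 59),   # MISCV: the MISC register holds data
        "addr_valid": bit(status, 58),   # ADDRV: the ADDR register holds an address
        "context_corrupt": context_corrupt,
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
        "severity": severity,
    }
```

In an SMI handler the status value would come from reading the bank's status register; here it is passed in as a plain integer so the decoding logic can be exercised on its own.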

Embodiment two:

This embodiment provides a BIOS device. It should be understood that the BIOS device can be installed in any server to detect hardware faults in any working phase of that server. As shown in Fig. 6, the BIOS device 60 comprises:

a fault information detection trigger module 61, configured to detect that the server has entered the startup phase;

a fault information detection module 62, configured to start fault detection on the hardware of the server in each working phase, the working phases including the startup phase;

a fault information storage module 63, configured to store the hardware fault information obtained through detection and analysis.

In this embodiment, in the startup phase of the server, the fault information detection module 62 uses the hardware detection mechanism provided by the server to pre-detect at least one of the CPU, memory, chipset, and power supply of the server and obtain current hardware information, then screens out the information of faulty hardware and analyzes it to obtain the corresponding hardware fault information.

In the device enumeration phase of the server, the fault information detection module 62 obtains the status information and resource information of each hardware component on the server and identifies from it the fault information of the faulty hardware.

In the operating system pre-boot phase of the server, the fault information detection module 62 pre-detects the out-of-band hardware devices of the server that is about to boot;

In the operating system service running phase of the server, the hardware fault detection performed by the fault information detection module 62 on the server comprises: the BIOS device determines whether a hardware interrupt signal of the server has arrived and, if so, the BIOS device checks the hardware related to the operating system and obtains the fault information of that hardware.

In this embodiment, the BIOS device further comprises a storage setting module 64, configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the fault information.
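The relationship between modules 61 to 64 can be sketched as plain Python classes. The class and method names are hypothetical and the storage area is modeled as an in-memory list rather than serial flash; the point of the sketch is only the wiring, in particular that the storage setting module must allocate the area before the storage module is used.

```python
from typing import Callable, Dict, List, Optional


class TriggerModule:                      # module 61
    def __init__(self, probe: Callable[[], bool]) -> None:
        self.probe = probe

    def server_started(self) -> bool:
        return self.probe()


class DetectionModule:                    # module 62
    def __init__(self, detectors: Dict[str, Callable[[], List[str]]]) -> None:
        self.detectors = detectors        # one detector per working phase

    def detect(self) -> List[str]:
        return [f"{phase}: {fault}"
                for phase, run in self.detectors.items()
                for fault in run()]


class StorageModule:                      # module 63
    def __init__(self) -> None:
        self.area: Optional[List[str]] = None
        self.capacity = 0

    def store(self, faults: List[str]) -> int:
        if self.area is None:
            raise RuntimeError("fault storage area not allocated")
        room = self.capacity - len(self.area)
        kept = faults[:room]              # drop what does not fit
        self.area.extend(kept)
        return len(kept)


class StorageSetupModule:                 # module 64
    def allocate(self, storage: StorageModule, capacity: int) -> None:
        storage.area, storage.capacity = [], capacity


class BiosDevice:                         # device 60
    def __init__(self, trigger: TriggerModule, detection: DetectionModule,
                 capacity: int = 64) -> None:
        self.trigger = trigger
        self.detection = detection
        self.storage = StorageModule()
        # The area is allocated before any fault is stored.
        StorageSetupModule().allocate(self.storage, capacity)

    def run(self) -> int:
        """Return the number of faults stored, or 0 if the server has not started."""
        if not self.trigger.server_started():
            return 0
        return self.storage.store(self.detection.detect())
```

Keeping the four modules separate mirrors Fig. 6; a real implementation could equally merge the storage and storage setting roles.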

The present invention further provides a server, which includes the basic input output system device described above.

The technical solution provided by the present invention can be widely applied to computers, network communication equipment, and similar devices. Because the basic input output system device performs fault detection on the hardware throughout the entire operating cycle of the server, faults can be prevented during server operation, which improves the stability and reliability of server operation.

The above is a further detailed description of the present invention in conjunction with specific embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the present invention belongs may make a number of simple deductions or substitutions without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims

1. A server hardware fault detection method, characterized by comprising:

a basic input output system device of a server detecting that the server enters a startup stage;

the basic input output system device starting to perform hardware fault detection on the server in each working stage, the working stage including the startup stage;

the basic input output system device storing the hardware fault information obtained by the detection.

2. The server hardware fault detection method as claimed in claim 1, characterized in that the startup stage includes an initialization stage, and the basic input output system device performing hardware fault detection on the server in the initialization stage includes:

the basic input output system device, according to the hardware detection mechanism provided by the server, pre-detecting at least one of the CPU, memory, chipset, and power supply of the server to obtain current hardware information, and screening out faulty hardware information from the hardware information for analysis and processing to obtain corresponding hardware fault information.

3. The server hardware fault detection method as claimed in claim 2, characterized in that the startup stage further includes a device enumeration stage, and the basic input output system device performing hardware fault detection on the server in the device enumeration stage includes:

the basic input output system device obtaining the status information and resource information of each hardware component on the server, and identifying from it the fault information of the faulty hardware.

4. The server hardware fault detection method as claimed in any one of claims 1-3, characterized in that the startup stage is a cold start stage or a warm start stage.

5. The server hardware fault detection method as claimed in any one of claims 1-3, characterized in that the working stage further includes at least one of an operating system pre-boot stage and an operating system service running stage.

6. The server hardware fault detection method as claimed in claim 5, characterized in that, when the working stage includes the operating system pre-boot stage, the basic input output system device performing hardware fault detection on the server in the operating system pre-boot stage includes:

the basic input output system device pre-detecting the out-of-band hardware devices of the server that are about to be booted;

obtaining the current hardware information of the hardware devices;

screening out, from the current hardware information, the fault information of the faulty hardware devices;

when the working stage includes the operating system service running stage, the basic input output system device performing hardware fault detection on the server in the operating system service running stage includes: the basic input output system device determining whether a hardware interrupt signal of the server has arrived; if so, the basic input output system device detecting the hardware related to the operating system, and obtaining the fault information of the hardware.

7. The server hardware fault detection method as claimed in any one of claims 1-3, characterized in that, before the basic input output system device stores the fault information obtained by the detection, the method further includes allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information.

8. A basic input output system device, characterized by comprising:

a fault information detection trigger module, configured to detect whether a server enters a startup stage;

a fault information detection module, configured to start performing hardware fault detection on the server in each working stage when the fault information detection trigger module detects that the server enters the startup stage, the working stage including the startup stage;

a fault information storage module, configured to store the hardware fault information obtained by the detection of the fault information detection module.

9. The basic input output system device as claimed in claim 8, characterized by further comprising a storage setting module, configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the hardware fault information.

10. A server, characterized by comprising the basic input output system device as claimed in claim 8 or 9.