WO2012119432A1

WO2012119432A1 - Method for improving stability of computer system, and computer system

Info

Publication number: WO2012119432A1
Application number: PCT/CN2011/079198
Authority: WO
Inventors: 张斌
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2011-08-31
Filing date: 2011-08-31
Publication date: 2012-09-13
Anticipated expiration: 2014-02-28
Also published as: CN102369513A

Abstract

The present invention relates to a method for improving the stability of a computer system, and to a computer system. The method includes: when a computer system is started or running, collecting error data generated by devices in the computer system; storing the error data in a non-volatile memory; and, when the computer system is restarted, performing state recovery, according to the error data, on the devices that generated the error data. Error data is recorded in the non-volatile memory while the computer system is running and is read back during restart to perform state recovery on the corresponding devices. This solves the problem that device state initialization after a restart invalidates the disabling, isolation, and similar handling that the computer system previously applied to abnormal devices or devices predicted to fail, which directly degrades system stability; the stability of the computer system is thereby improved.

Description

Method for improving stability of computer system, and computer system

Technical field

The present invention relates to computer technology, and in particular to a method for improving the stability of a computer system, and to a computer system.

Background

High-end fault-tolerant computer systems carry mission-critical services in industries such as finance, telecommunications, aviation, and electric power. They must run without interruption 24 hours a day, 365 days a year, and must guarantee data correctness, so they need a high degree of stability, availability, and serviceability (Reliability, Availability and Serviceability, RAS). Specifically, stability requires that the computer run continuously and automatically detect and correct system errors. Availability requires that the important resources of the computer system have backups, that potential problems be detected before they occur, and that tasks running on an affected resource can be migrated to a backup resource, so that the computer system keeps running normally and downtime is reduced. Serviceability requires that the computer system support real-time online diagnosis that pinpoints the root cause of a problem, so that it can be repaired quickly and accurately.

In the prior art, device error data of a running computer system is usually collected through an Onboard Administrator (OA), and the error data is used for fault prediction. When the number of failures of a device reaches a set threshold, a backup device is enabled or a hot replacement is performed. These error data seriously affect the stability of a restarted computer system, and of a device that has gone offline and is later brought back online.

Summary of the invention

Embodiments of the present invention provide a method for improving the stability of a computer system, and a computer system, so as to improve the stability of the computer system.

An embodiment of the present invention provides a method for improving the stability of a computer system, including:

when a computer system is started or running, collecting error data generated by devices of the computer system;

storing the error data in a non-volatile memory; and

when the computer system is restarted, performing state recovery processing, according to the error data, on the device that generated the error data.

An embodiment of the present invention further provides a method for improving the stability of a computer system, including:

while a computer system is running, collecting abnormality information of a device, among the devices of the computer system, in which an abnormality occurs;

storing the abnormality information of the abnormal device in a non-volatile memory; and

performing, by the computer system according to the abnormality information, state recovery on the device when the device has gone offline and requests to go back online.

An embodiment of the present invention further provides a computer system, including:

an error collection unit, configured to collect error data generated by devices of the computer system when the computer system is running or started;

a storage unit, configured to store the error data in a non-volatile memory; and

a recovery processing unit, configured to perform, when the computer system is restarted, state recovery processing according to the error data on the device that generated the error data.

An embodiment of the present invention further provides a computer system, including:

an abnormality information collection unit, configured to collect, while the computer system is running, abnormality information of a device, among the devices of the computer system, in which an abnormality occurs;

a storage unit, configured to store the abnormality information of the abnormal device in a non-volatile memory; and

a state recovery unit, configured to perform, according to the abnormality information, state recovery on the device when the device has gone offline and requests to go back online.

In the method for improving the stability of a computer system and the computer system provided by the embodiments of the present invention, error data is recorded in a non-volatile memory while the computer system is running, and during restart the error data is read from the non-volatile memory to perform state recovery processing on the corresponding devices. This solves the problem that device state initialization after a restart invalidates the disabling, isolation, and similar handling previously applied to devices that are abnormal or predicted to fail, which directly reduces system stability; the stability of the computer system is thereby improved.

Brief description of the drawings

FIG. 1 is a flowchart of a method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 2 is a flowchart of another method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of device state recovery when the computer system restarts, in the method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the BIOS policy configuration menu of the computer system in the method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 5 is a flowchart of DIMM isolation state recovery in the method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 6 is a flowchart of processor core disable recovery processing in the method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 7 is a flowchart of state recovery processing for cache disable information in the method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 8 is a flowchart of state recovery processing when an abnormal node that has gone offline is brought back online, in the method for improving the stability of a computer system according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a computer system according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of another computer system according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart of a method for improving the stability of a computer system according to an embodiment of the present invention. In this embodiment, for a restarted computer system, state recovery is performed on the devices in the computer system using the error data collected during the previous run, so as to improve the stability of the computer system. As shown in FIG. 1, the method includes:

Step 11: When the computer system is started or running, collect error data generated by devices of the computer system. The error data may be abnormality information of a device, isolation information of a dual inline memory module (Dual Inline Memory Modules, DIMM), disable information of a processor core, cache disable information, and the like.

Step 12: Store the error data in a non-volatile memory (Non-Volatile Memory, NVM).

When the error data is abnormality information of a device, the computer system stores, at runtime, the abnormality information of the abnormal device in the non-volatile memory.

The method further includes: at runtime, the computer system performs state recovery, according to the abnormality information, on the device when it has gone offline and requests to go back online.

When the error data is DIMM isolation information, the computer system determines at runtime whether the DIMM has been replaced. If so, the DIMM isolation information stored in the non-volatile memory is cleared; otherwise, the DIMM is isolated when the computer system restarts.

Step 13: When the computer system restarts, perform state recovery processing, according to the error data, on the device that generated the error data.

For example, the computer system isolates the corresponding DIMM in the computer system according to the DIMM isolation information. As another example, according to the disable information of a processor core, the computer system prevents the corresponding processor core from taking part in the selection of the processor boot thread (Processor Boot Strap Processor, PBSP), or disables the corresponding processor core. As yet another example, the computer system re-disables the corresponding cache in the computer system according to the cache disable information.

In this embodiment, the computer system stores the error data collected at runtime in the NVM and, at restart, performs state recovery processing on the corresponding devices according to the error data. This prevents the computer system, after initialization, from enabling faulty or unstable devices as if they were normal devices, and so improves the stability of the computer system.
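Steps 11 to 13 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described in the embodiments: a JSON file stands in for the NVM, and `NvmStore`, `collect_and_store`, `restore_on_restart`, the record fields, and the device names are all hypothetical.

```python
import json
import os
import tempfile


class NvmStore:
    """File-backed stand-in for the NVM (assumption; real firmware would use flash or NVRAM)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        # An empty NVM simply has no records yet.
        if not os.path.exists(self.path):
            return []
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)

    def append(self, record):
        records = self.load()
        records.append(record)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(records, f)


def collect_and_store(nvm, device_id, kind):
    """Steps 11 and 12: record one piece of error data for a device."""
    nvm.append({"device": device_id, "kind": kind})


def restore_on_restart(nvm, device_states):
    """Step 13: re-apply the recorded handling to devices that initialization re-enabled."""
    for record in nvm.load():
        device_states[record["device"]] = "disabled"
    return device_states


nvm_path = os.path.join(tempfile.mkdtemp(), "nvm.json")
collect_and_store(NvmStore(nvm_path), "dimm3", "dimm_isolation")

# Simulated restart: initialization marks every device as enabled again,
# and a fresh NvmStore object reads what the previous run persisted.
states_after_init = {"dimm0": "enabled", "dimm3": "enabled"}
restored = restore_on_restart(NvmStore(nvm_path), states_after_init)
print(restored)
```

The point of the sketch is only that the record outlives the in-memory device state, so the handling applied before the restart can be applied again after it.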

FIG. 2 is a flowchart of another method for improving the stability of a computer system according to an embodiment of the present invention. This embodiment separately addresses the problem that, in a running computer system, an abnormal device that goes offline and then comes back online makes the computer system unstable. As shown in FIG. 2, the method includes:

Step 21: While the computer system is running, collect abnormality information of a device, among the devices of the computer system, in which an abnormality occurs;

Step 22: Store the abnormality information of the abnormal device in the non-volatile memory;

Step 23: The computer system performs state recovery, according to the abnormality information, on the device when it has gone offline and requests to go back online.

After the computer system stores, at runtime, the abnormality information of the abnormal device in the non-volatile memory, the method may further include: the computer system determines whether the device has been replaced; if so, the abnormality information in the non-volatile memory is deleted; otherwise, the state recovery is performed.

In this embodiment, the computer system performs state recovery, according to the abnormality information stored in the NVM, on the corresponding offline device that requests to go back online. This avoids the system instability that results when a device taken offline because of an abnormality comes back online as a normal device, and so improves the stability of the computer system.

For example, while the computer system is running, if a device in it fails or is disabled by the system, the basic input/output system (Basic Input Output System, BIOS) saves this information in the NVM. When the computer system restarts, this information is analyzed and processed; before the unstable devices are started, the system is configured according to this information so that the unstable devices are isolated or disabled. This restores the unstable devices to their state before the restart and thereby ensures the stability of the computer system.

FIG. 3 is a schematic diagram of device state recovery when the computer system restarts, in the method for improving the stability of a computer system according to an embodiment of the present invention. As shown in FIG. 3, while the computer system is running, collected error data such as DIMM isolation information, processor core health status information, and cache disable information is stored in the NVM. When the computer system restarts, the BIOS configuration policy is enabled, that is, the device state recovery function is turned on in the BIOS configuration interface; the error data is obtained from the NVM, and the device state recovery dispatcher corresponding to each kind of error data is called to perform state recovery processing on the corresponding device.

Further, the BIOS configuration interface may also provide switches so that the user can apply device state recovery flexibly as needed. As shown in FIG. 4, the configuration policy switch of each kind of device is turned on as required, and only devices whose switch is on undergo state recovery according to the corresponding policy. When the restart begins and the system is initialized, the system enters the device state recovery dispatcher.
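The combination of per-device policy switches and a recovery dispatcher might look like the following Python sketch. It is illustrative only: the record format, the handler names, and the switch keys are assumptions, and the handlers merely log the action that real firmware would perform.

```python
def isolate_dimm(record, log):
    # Placeholder for the DIMM isolation policy.
    log.append(("isolate_dimm", record["device"]))


def restrict_core(record, log):
    # Placeholder for the processor core policy (exclude from PBSP selection or disable).
    log.append(("restrict_core", record["device"]))


def disable_cache_line(record, log):
    # Placeholder for the cache policy.
    log.append(("disable_cache_line", record["device"]))


# One handler per kind of error data (hypothetical kind names).
HANDLERS = {
    "dimm_isolation": isolate_dimm,
    "core_disable": restrict_core,
    "cache_disable": disable_cache_line,
}


def dispatch_recovery(records, policy_switches):
    """Apply each record's policy only if the matching switch is on."""
    log = []
    for record in records:
        kind = record["kind"]
        handler = HANDLERS.get(kind)
        if handler is None or not policy_switches.get(kind, False):
            continue  # unknown kind, or the user left this policy switched off
        handler(record, log)
    return log


records = [
    {"kind": "dimm_isolation", "device": "dimm3"},
    {"kind": "core_disable", "device": "core5"},
    {"kind": "cache_disable", "device": "cacheline_0x1f40"},
]
switches = {"dimm_isolation": True, "core_disable": False, "cache_disable": True}
actions = dispatch_recovery(records, switches)
print(actions)
```

Keeping the switch check in the dispatcher rather than in each handler means a policy can be turned off without touching the recovery logic for that device type.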

Taking the DIMM isolation information, processor core health status information, and cache disable information shown in FIG. 4 as examples, the method for improving the stability of a computer system is described in further detail below.

After such error data is generated, the BIOS saves it in the non-volatile memory. When the computer system restarts and is initialized, the device state recovery dispatcher is called to obtain the error data from the NVM and process it according to the different policies.

Specifically, the DIMM isolation state recovery flow is shown in FIG. 5 and includes:

Step 51: While the computer system is running, the number of error checking and correcting (Error Correcting Code, ECC) errors of a DIMM reaches the specified threshold;

Step 52: The computer system identifies the source of the errors from the log information;

Step 53: The computer system marks the DIMM as about to fail and disables it;

Step 54: The computer system saves the failed-DIMM information to the NVM;

Step 55: The computer system determines whether the DIMM has been replaced; if so, Step 58 is performed; otherwise, Step 56 is performed;

Step 56: After the computer system restarts, it reads the DIMM information stored in the NVM;

Step 57: The computer system calls the device state recovery dispatcher to isolate the corresponding DIMM, continues to start, and boots the operating system (Operation System, OS);

Step 58: The computer system clears the failed-DIMM information from the NVM, and the computer system returns to a healthy state.

In this embodiment, the DIMMs in the computer system protect their data through mechanisms such as ECC. When the number of ECC errors reaches the set upper threshold, the system marks the DIMM as failed, disables it, and saves the failure information in the NVM. Between that point and the next restart, while the computer system continues to run normally, if the DIMM is replaced through the hot-plug procedure, the failure information of that DIMM is cleared from the NVM. Otherwise, when the system restarts, the device state recovery dispatcher is called during startup; it reads the failure information of all DIMMs in the system and uses the configuration program to isolate them.
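The DIMM flow of Steps 51 to 58 can be illustrated with a small Python model. It is a sketch under assumptions: the threshold value, the class and function names, and the slot labels are hypothetical, and a plain dictionary stands in for the NVM.

```python
ECC_THRESHOLD = 3  # assumed value; the text only speaks of a set threshold


class DimmMonitor:
    def __init__(self, nvm):
        self.nvm = nvm          # dict standing in for the NVM: slot -> failure record
        self.ecc_counts = {}
        self.disabled = set()

    def report_ecc_error(self, slot):
        """Steps 51 to 54: count errors; at the threshold, disable the DIMM and persist the fact."""
        self.ecc_counts[slot] = self.ecc_counts.get(slot, 0) + 1
        if self.ecc_counts[slot] >= ECC_THRESHOLD and slot not in self.disabled:
            self.disabled.add(slot)
            self.nvm[slot] = {"state": "failing"}

    def hot_replace(self, slot):
        """Steps 55 and 58: a replaced DIMM is healthy again, so its record is cleared."""
        self.nvm.pop(slot, None)
        self.disabled.discard(slot)
        self.ecc_counts[slot] = 0


def dimms_to_isolate_on_restart(nvm):
    """Steps 56 and 57: every DIMM still recorded in the NVM is isolated at startup."""
    return sorted(nvm)


nvm = {}
monitor = DimmMonitor(nvm)
for _ in range(ECC_THRESHOLD):
    monitor.report_ecc_error("DIMM_A1")
    monitor.report_ecc_error("DIMM_B2")

monitor.hot_replace("DIMM_B2")          # only B2 is swapped before the restart
isolated = dimms_to_isolate_on_restart(nvm)
print(isolated)
```

In this run both DIMMs cross the threshold, but only the one that was not replaced is isolated after the restart.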

The processor core disable recovery processing flow is shown in FIG. 6 and includes:

Step 61: At its first startup, the computer system checks the health of the processor cores;

Step 62: The computer system disables the processor cores that fail the check;

Step 63: The computer system records the disable information in the NVM;

Step 64: PBSP selection is performed, and the computer system continues to start and boots the OS;

Step 65: At its second startup, the computer system calls the device state recovery dispatcher and reads the disable information from the NVM;

Step 66: According to the disable information read, the computer system prevents the corresponding processor cores from taking part in PBSP selection, or disables them directly;

Step 67: The computer system continues to start and checks the health of the processor cores;

Step 68: The computer system disables the processor cores that fail the check;

Step 69: The computer system records the disable information in the NVM.
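The two-boot sequence of Steps 61 to 69 can be sketched as follows. This is a simplified model, not firmware code: the `boot` function, the `bad_cores` key, and the rule of picking the lowest-numbered eligible core as PBSP are assumptions made for illustration (a real platform selects the boot processor by its own arbitration).

```python
def boot(cores, failing_this_boot, nvm):
    """Model one startup and return the core chosen as PBSP (or None)."""
    # Steps 65 and 66: read cores recorded as bad by earlier boots and exclude them.
    recorded = set(nvm.get("bad_cores", []))

    # Steps 61 and 62 (or 67 and 68): health check; cores failing now are disabled.
    failed_now = {core for core in cores if core in failing_this_boot}

    # Steps 63 and 69: persist the updated disable information.
    nvm["bad_cores"] = sorted(recorded | failed_now)

    # PBSP selection among the remaining cores (placeholder rule: lowest core number).
    excluded = recorded | failed_now
    candidates = [core for core in cores if core not in excluded]
    return candidates[0] if candidates else None


cores = [0, 1, 2, 3]
nvm = {}

first = boot(cores, {0}, nvm)        # core 0 fails the check on the first boot
second = boot(cores, set(), nvm)     # core 0 happens to pass on the second boot

# For comparison: the same second boot with no persisted record.
without_record = boot(cores, set(), {})
print(first, second, without_record)
```

With the persisted record, the intermittently failing core 0 stays out of PBSP selection on the second boot; without it, core 0 would be chosen.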

After Step 69, the computer system performs PBSP selection, continues to start, and boots the OS. At startup the system may detect some unstable cores. Such a core is masked during the current startup, but at the next startup it may still pass the check and go on to compete for PBSP and SBSP to become the system's main thread; if the core then develops a problem, the whole system crashes immediately. In this processing flow, information about the detected unstable cores is stored in the NVM, and at the next startup the device state recovery flow identifies the unstable cores and either restricts them from competing for PBSP or disables them directly.

The state recovery processing flow for cache disable information is shown in FIG. 7 and includes:

Step 71: When the number of ECC errors of a cache line (Cacheline) in the computer system reaches a threshold, the computer system disables that cache line;

Step 72: The computer system records the disable information in the NVM;

Step 73: When the computer system restarts, it calls the device state recovery dispatcher and reads the cache disable information;

Step 74: The computer system uses a system service to reconfigure the cache.

After that, the computer system continues to start and boots the OS.

Cache safety technology stipulates that when the number of ECC errors of a cache line reaches a threshold, the system disables that cache line. This processing flow saves that information in the NVM. At the next startup of the computer system, the device state recovery dispatcher is called to read the information and configure the cache lines, turning off again the cache lines that were previously marked as disabled.
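The cache line flow can be modelled in the same spirit. The sketch below is an assumption-laden illustration: the threshold value, the `CacheController` class, and the use of a list as the NVM are all hypothetical, and cache lines are identified by arbitrary integers.

```python
from collections import Counter

CACHELINE_ECC_THRESHOLD = 2  # assumed value; the text only says "a threshold"


class CacheController:
    def __init__(self):
        self.errors = Counter()
        self.disabled = set()

    def record_ecc_error(self, line, nvm):
        """Steps 71 and 72: disable a line at the threshold and persist that decision."""
        self.errors[line] += 1
        if self.errors[line] >= CACHELINE_ECC_THRESHOLD:
            self.disabled.add(line)
            if line not in nvm:
                nvm.append(line)

    def reconfigure_from_nvm(self, nvm):
        """Steps 73 and 74: after a restart, turn the recorded lines off again."""
        self.disabled.update(nvm)


nvm = []  # list standing in for the NVM
before_restart = CacheController()
before_restart.record_ecc_error(0x40, nvm)
before_restart.record_ecc_error(0x40, nvm)   # reaches the threshold
before_restart.record_ecc_error(0x80, nvm)   # one error only, stays enabled

after_restart = CacheController()            # restart: in-memory state is lost
after_restart.reconfigure_from_nvm(nvm)
print(sorted(after_restart.disabled))
```

The fresh controller created after the simulated restart has no error counters, so the persisted list is the only reason the line at 0x40 is disabled again.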

In a running computer system, information about device abnormalities is saved to the NVM. Consider a scenario in which, for energy saving or other reasons, the user requests that a node in which an abnormality has occurred be taken offline from the running computer system. After the computer system completes the offline operation, the user, because of service needs, requests that the offline node be brought back online. The system then calls the device state recovery program to restore the node state, as shown in FIG. 8, including:

Step 81: When an abnormality occurs in a device of the running computer system, the abnormal condition is handled;

Step 82: The abnormality information is saved in the NVM;

Step 83: The user requests that a node or device be taken offline;

Step 84: The offline operation completes, and the computer system continues to run;

Step 85: The user requests that the offline node or device be brought back online;

Step 86: During the online operation, the computer system calls the device state recovery dispatcher to perform state recovery on each device;

Step 87: The device goes online;

Step 88: The computer system continues to run.

This recovery operation is optional. As shown in FIG. 4, it can be configured for the online operation using the switches provided in the BIOS configuration interface.

The biggest difference between this processing flow and the flows above is that no restart of the computer system is involved; the whole operation is performed dynamically.
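The offline and re-online sequence of Steps 81 to 88, together with the optional switch and the replaced-device check described for the FIG. 2 method, can be sketched like this. The `Node` class, the function names, and the part labels are hypothetical, and a dictionary stands in for the NVM.

```python
class Node:
    def __init__(self, name, parts):
        self.name = name
        self.online = True
        self.part_enabled = {part: True for part in parts}

    def initialize(self):
        # Bringing a node online re-initializes it, which re-enables every part.
        for part in self.part_enabled:
            self.part_enabled[part] = True


def handle_abnormality(node, part, nvm):
    """Steps 81 and 82: disable the abnormal part and save the information."""
    node.part_enabled[part] = False
    nvm.setdefault(node.name, []).append(part)


def take_offline(node):
    """Steps 83 and 84."""
    node.online = False


def bring_online(node, nvm, recovery_enabled=True, replaced=False):
    """Steps 85 to 87, with the optional switch and the replaced-device check."""
    node.initialize()
    if replaced:
        nvm.pop(node.name, None)          # a replaced device starts with a clean record
    elif recovery_enabled:
        for part in nvm.get(node.name, []):
            node.part_enabled[part] = False
    node.online = True


nvm = {}
node = Node("node2", ["cpu0", "dimm1"])
handle_abnormality(node, "dimm1", nvm)
take_offline(node)
bring_online(node, nvm)
print(node.online, node.part_enabled)
```

No restart is modelled here: the recovery runs inside the online operation, which is the difference from the restart flows that the text points out.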

The foregoing method embodiments record error data in the NVM while the computer system is running, and read the error data from the NVM during restart to perform state recovery processing on the corresponding devices. This solves the problem that device state initialization after a restart invalidates the disabling, isolation, and similar handling previously applied to devices that are abnormal or predicted to fail, which directly reduces system stability; the stability of the computer system is thereby improved. In addition, abnormality information is saved to the NVM while the computer system is running, and when a device goes back online it is restored according to the abnormality information saved in the NVM. This solves the problem that, when the user takes a device offline and then brings it back online while the computer system is running, the state the device had before going offline is reset by initialization, so that an unstable part of the device that had been disabled (possibly a sub-device) may run in the system again and reduce system stability; the stability of the computer system is thereby improved.

A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be implemented by program instructions and related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

FIG. 9 is a schematic structural diagram of a computer system according to an embodiment of the present invention. As shown in FIG. 9, the computer system includes an error collection unit 91, a storage unit 92, and a recovery processing unit 93.

The error collection unit 91 is configured to collect error data generated by devices of the computer system when the computer system is running or started. The storage unit 92 is configured to store the error data in the NVM. The recovery processing unit 93 is configured to perform, when the computer system restarts, state recovery processing according to the error data on the device that generated the error data. When the error data is abnormality information of the device, the recovery processing unit 93 is further configured to perform, while the computer system is running and according to the abnormality information, state recovery on the device when it has gone offline and requests to go back online.

As another example, the recovery processing unit 93 is specifically configured to isolate the corresponding DIMM in the computer system according to the DIMM isolation information.

The computer system may further include a replacement determination unit and an information clearing unit.

The replacement determination unit is configured to determine, while the computer system is running, whether the DIMM has been replaced. The information clearing unit is configured to clear the DIMM isolation information stored in the non-volatile memory if the replacement determination unit determines that the DIMM has been replaced. Correspondingly, the recovery processing unit is configured to isolate the DIMM when the computer system restarts if the replacement determination unit determines that the DIMM has not been replaced.

As yet another example, the recovery processing unit 93 may be specifically configured to prevent, according to the disable information of a processor core, the corresponding processor core in the computer system from taking part in PBSP selection, or to disable the corresponding processor core.

As yet another example, the recovery processing unit 93 may be specifically configured to turn off again the corresponding cache in the computer system according to the cache disable information. For details of the error data collected by the error collection unit and of the recovery processing performed by the recovery processing unit, see the description in the foregoing method embodiments.

In this embodiment, the computer system uses the error collection unit to store the error data collected at runtime in the NVM, and uses the recovery processing unit to perform, at restart, state recovery processing on the corresponding devices according to the error data. This prevents the computer system, after initialization, from enabling faulty or unstable devices as if they were normal devices, and so improves the stability of the computer system.

图 10为本发明实施例提供的另一种计算机系统的结构示意图。如图 10 所示，计算机系统包括：异常信息收集单元 101、存储单元 102及状态恢复单元 103。 FIG. 10 is a schematic structural diagram of another computer system according to an embodiment of the present invention. As shown in FIG. 10, the computer system includes: an abnormality information collecting unit 101, a storage unit 102, and a state restoring unit 103.

The abnormality information collecting unit 101 is configured to collect, while the computer system is running, abnormality information of a device of the computer system in which an abnormality occurs. The storage unit 102 is configured to store the abnormality information of that device in the NVM. The state recovery unit 103 is configured to perform, according to the abnormality information, state recovery on an offline device that requests to come back online.

The computer system provided by this embodiment of the present invention may further include a replacement determining unit and an information deleting unit.

The replacement determining unit is configured to determine whether the device has been replaced, after the abnormality information of the device in which the abnormality occurs has been stored in the non-volatile memory while the computer system is running. The information deleting unit is configured to delete the abnormality information in the non-volatile memory if the replacement determining unit determines that the device has been replaced. Correspondingly, the state recovery unit is configured to perform the state recovery if the replacement determining unit determines that the device has not been replaced.
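The handling of a request to come back online might be sketched as below. The function name, the dictionary used in place of the NVM, the `state` field, and the way the `replaced` result is supplied are assumptions for the example; how replacement is actually detected is not specified in this passage.

```python
def handle_online_request(nvm_records, device_id, replaced):
    """Decide the state of a device that requests to come back online.

    nvm_records: dict mapping device id -> abnormality information,
                 standing in for the non-volatile memory
    replaced:    result of the replacement determination for this device
    Returns the state the device ends up in.
    """
    info = nvm_records.get(device_id)
    if info is None:
        # No abnormality was recorded: the device comes online normally.
        return "online"
    if replaced:
        # A new device occupies the position: the stale record is deleted.
        del nvm_records[device_id]
        return "online"
    # Same faulty device: restore the state recorded before it went offline.
    return info["state"]
```

A device that was taken offline for an abnormality therefore cannot return as a normal device unless it has been replaced.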

In this embodiment, the computer system, through the state recovery unit, performs state recovery on an offline device that requests to come back online, according to the abnormality information stored in the NVM. This prevents a device that went offline because of an abnormality from coming back online as a normal device and destabilizing the system, which improves the stability of the computer system.

Each unit in the above system embodiments may be provided in the BIOS.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for improving the stability of a computer system, comprising:

collecting, by the computer system when it is started or running, error data generated by a device of the computer system;

storing the error data in a non-volatile memory; and

when the computer system is restarted, performing state recovery processing on the device that generated the error data according to the error data.

2. The method for improving the stability of a computer system according to claim 1, wherein the error data is abnormality information of the device; and

the method further includes: performing, while the computer system is running, state recovery on an offline device that requests to come back online, according to the abnormality information.

3. The method for improving the stability of a computer system according to claim 1 or 2, wherein the error data is isolation information of a dual in-line memory module (DIMM); and

performing state recovery processing on the corresponding device in the computer system according to the error data includes:

isolating the corresponding DIMM in the computer system according to the DIMM isolation information.

4. The method for improving the stability of a computer system according to claim 3, wherein the computer system determines, while running, whether the DIMM has been replaced; if so, the DIMM isolation information stored in the non-volatile memory is cleared; otherwise, the DIMM is isolated when the computer system is restarted.

5. The method for improving the stability of a computer system according to claim 1 or 2, wherein the error data is disable information of a processor core; and

according to the disable information of the processor core, the corresponding processor core in the computer system is prohibited from participating in the selection of the PBSP, or the corresponding processor core in the computer system is disabled.

6. The method for improving the stability of a computer system according to claim 1 or 2, wherein the error data is cache disable information; and

the corresponding cache in the computer system is disabled again according to the cache disable information.

7. A method for improving the stability of a computer system, comprising:

collecting, while the computer system is running, abnormality information of a device of the computer system in which an abnormality occurs;

storing the abnormality information of the device in which the abnormality occurs in a non-volatile memory; and

performing, by the computer system according to the abnormality information, state recovery on an offline device that requests to come back online.

8. The method for improving the stability of a computer system according to claim 7, wherein, after the computer system, while running, stores the abnormality information of the device in which the abnormality occurs in the non-volatile memory, the method further includes: determining, by the computer system, whether the device has been replaced; if so, deleting the abnormality information in the non-volatile memory; otherwise, performing the state recovery.

9. A computer system, comprising:

an error collection unit, configured to collect error data generated by a device of the computer system when the computer system is started or running;

a storage unit, configured to store the error data in a non-volatile memory; and

a recovery processing unit, configured to perform, when the computer system is restarted, state recovery processing on the device that generated the error data according to the error data.

10. The computer system according to claim 9, wherein the error data is abnormality information of the device; and

the recovery processing unit is further configured to perform, while the computer system is running, state recovery on an offline device that requests to come back online, according to the abnormality information.

11. The computer system according to claim 9 or 10, wherein the error data is isolation information of a dual in-line memory module (DIMM); and the recovery processing unit is specifically configured to isolate the corresponding DIMM in the computer system according to the DIMM isolation information.

12. The computer system according to claim 11, further comprising: a replacement determining unit, configured to determine, while the computer system is running, whether the DIMM has been replaced; and

an information clearing unit, configured to clear the DIMM isolation information stored in the non-volatile memory if the replacement determining unit determines that the DIMM has been replaced;

wherein the recovery processing unit is configured to isolate the DIMM when the computer system is restarted if the replacement determining unit determines that the DIMM has not been replaced.

13. The computer system according to claim 9 or 10, wherein the error data is disable information of a processor core; and the recovery processing unit is specifically configured to, according to the disable information of the processor core, prohibit the corresponding processor core in the computer system from participating in the selection of the PBSP, or disable the corresponding processor core in the computer system.

14. The computer system according to claim 9 or 10, wherein the error data is cache disable information; and the recovery processing unit is specifically configured to disable again the corresponding cache in the computer system according to the cache disable information.

15. A computer system, comprising:

an abnormality information collecting unit, configured to collect, while the computer system is running, abnormality information of a device of the computer system in which an abnormality occurs;

a storage unit, configured to store the abnormality information of the device in which the abnormality occurs in a non-volatile memory; and

a state recovery unit, configured to perform, according to the abnormality information, state recovery on an offline device that requests to come back online.

16. The computer system according to claim 15, further comprising: a replacement determining unit, configured to determine whether the device has been replaced, after the abnormality information of the device in which the abnormality occurs has been stored in the non-volatile memory while the computer system is running; and

an information deleting unit, configured to delete the abnormality information in the non-volatile memory if the replacement determining unit determines that the device has been replaced;

wherein the state recovery unit is configured to perform the state recovery if the replacement determining unit determines that the device has not been replaced.