
CN116136805A - Memory channel fault detection method and device, memory system and computer system - Google Patents

Memory channel fault detection method and device, memory system and computer system

Info

Publication number
CN116136805A
Authority
CN
China
Prior art keywords
memory
memory channel
channel link
crc transmission
crc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111364493.3A
Other languages
Chinese (zh)
Inventor
韩林
刁阳彬
张文桂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202111364493.3A
Publication of CN116136805A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08: Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08: Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The application discloses a memory channel fault detection method and device, a memory system, and a computer system, and belongs to the field of computer technology. A control device obtains the ECC error statistics that a memory controller keeps for a memory channel. When the number of ECC errors on the memory channel exceeds a first count threshold, the control device directs the memory controller to perform a CRC transmission check on the memory channel link in that memory channel, to determine whether the link is faulty. Based on the ECC error statistics, the control device can judge whether errors occur frequently on the memory channel, and it can trigger the CRC transmission check procedure for the link of a channel on which errors occur frequently. This determines whether the memory channel link is faulty, and therefore whether the fault point in the memory channel lies on the memory channel link or on a memory module, which improves fault localization efficiency.

Description

Memory channel fault detection method and device, memory system, and computer system

Technical Field

The present application relates to the field of computer technology, and in particular to a memory channel fault detection method and device, a memory system, and a computer system.

Background

A memory system includes a memory controller and memory channels. One memory controller can control one or more memory channels. Each memory channel includes one memory channel link and one or more memory modules. The memory channel link is a physical link: one end connects to the memory controller and the other end connects to the one or more memory modules, providing the electrical connection between the memory controller and the memory modules. The memory controller writes data to and reads data from the memory modules over the memory channel link.

Because errors can occur both while data is transmitted over the memory channel link and while it is stored in a memory module, error detection and correction functions are usually built into the memory module and the memory controller. At present, error checking and correction (ECC) is commonly used to check and correct data. The memory controller computes an ECC code for the data to be transmitted to the memory module; when the data is written, the ECC code is stored in the memory module together with the data. When the memory controller reads the data back, it also reads the stored ECC code, uses it to verify the data that was read, and attempts to correct any errors.

However, ECC can only detect that an error occurred somewhere in the overall process of transmitting and storing the data. It cannot distinguish whether the error arose during transmission or during storage, so the fault location within the memory channel has to be investigated manually, and fault localization is inefficient.
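To make this limitation concrete, the following minimal sketch uses a toy Hamming(7,4) single-error-correcting code. It is an illustrative stand-in, not the ECC scheme of any particular memory controller (real memory ECC typically protects much wider words), and the function names are hypothetical. The decoder corrects a flipped bit in exactly the same way whether the flip happened on the link or in storage, so the ECC result alone says nothing about where the fault lies.

```python
from typing import Tuple


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are numbered 1..7; parity bits sit at 1, 2 and 4,
    data bits at 3, 5, 6 and 7. Position p is stored in integer bit p - 1.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8  # index 0 unused
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    return sum(bits[p] << (p - 1) for p in range(1, 8))


def hamming74_decode(word: int) -> Tuple[int, bool]:
    """Return (data nibble, whether a single-bit error was corrected)."""
    bits = [0] + [(word >> (p - 1)) & 1 for p in range(1, 8)]
    # The XOR of the positions of all set bits is 0 for a valid codeword
    # and equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for p in range(1, 8):
        if bits[p]:
            syndrome ^= p
    corrected = syndrome != 0
    if corrected:
        bits[syndrome] ^= 1
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, corrected


if __name__ == "__main__":
    stored = hamming74_encode(0b1011)
    # The same flip could have happened in transit or in the memory cell;
    # the decoder reports only that an error was corrected.
    print(hamming74_decode(stored ^ (1 << 2)))
```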

Summary

The present application provides a memory channel fault detection method and device, a memory system, and a computer system, which can address the current low efficiency of fault localization in memory channels.

In a first aspect, a memory channel fault detection method is provided. The method includes: a control device obtains the ECC error statistics that a memory controller keeps for a memory channel. The ECC error statistics reflect the number of ECC errors that have occurred on the memory channel. When the number of ECC errors on the memory channel exceeds a first count threshold, the control device directs the memory controller to perform a cyclic redundancy check (CRC) transmission check on the memory channel link in the memory channel, to determine whether the memory channel link is faulty.

In the present application, the control device can judge, from the ECC error statistics the memory controller keeps for a memory channel, whether errors occur frequently on that channel. If they do, the control device can trigger the CRC transmission check procedure for the memory channel link in that channel to determine whether the link is faulty. This establishes whether the fault point lies on the memory channel link or on a memory module without manual investigation of the fault location, which improves fault localization efficiency.

Optionally, the ECC error statistics include the number of ECC errors that have occurred on the memory channel, or a status indication that this number exceeds the first count threshold.
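The trigger decision described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `EccErrorStats` type, its field names, and `needs_link_crc_check` are hypothetical and not taken from the application, which says only that the statistics carry either an error count or a status indication.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EccErrorStats:
    """Hypothetical container for the ECC error statistics of one channel."""
    channel_id: int
    error_count: Optional[int] = None          # raw count, if reported
    threshold_exceeded: Optional[bool] = None  # status indication, if reported


def needs_link_crc_check(stats: EccErrorStats, first_count_threshold: int) -> bool:
    """Decide whether the control device should trigger the CRC transmission check."""
    # A status indication recorded by the memory controller takes the
    # comparison out of the control device's hands.
    if stats.threshold_exceeded is not None:
        return stats.threshold_exceeded
    # Otherwise compare the raw count; "exceeds" is read as strictly greater.
    if stats.error_count is not None:
        return stats.error_count > first_count_threshold
    return False


if __name__ == "__main__":
    print(needs_link_crc_check(EccErrorStats(channel_id=0, error_count=12), 10))
```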

Optionally, the memory controller performing a CRC transmission check on the memory channel link includes: the memory controller performs a CRC transmission check on the link in the write direction; and/or, where the memory module in the memory channel is capable of computing CRC transmission check codes, the memory controller directs the memory module to perform a CRC transmission check on the link in the read direction.

Optionally, in response to the memory channel link being faulty, or to the time the memory controller has spent performing the CRC transmission check on the link reaching a duration threshold, the control device directs the memory controller to stop the CRC transmission check on that link. The memory channel link is determined to be faulty when the number of CRC transmission check errors on the link exceeds a second count threshold. Optionally, this number is the number of CRC transmission check errors on the link in the write direction, the number in the read direction, or the total number in both directions.
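A minimal sketch of the end condition follows. The names (`CountMode`, `should_stop_crc_check`) and the use of seconds for the duration are assumptions made for illustration; the application specifies only the two conditions and the three ways of counting CRC errors.

```python
from enum import Enum


class CountMode(Enum):
    """Which CRC transmission check errors count toward the second threshold."""
    WRITE = "write"
    READ = "read"
    TOTAL = "total"


def crc_error_count(write_errors: int, read_errors: int, mode: CountMode) -> int:
    if mode is CountMode.WRITE:
        return write_errors
    if mode is CountMode.READ:
        return read_errors
    return write_errors + read_errors


def link_is_faulty(count: int, second_count_threshold: int) -> bool:
    # The link is judged faulty once the error count exceeds the threshold.
    return count > second_count_threshold


def should_stop_crc_check(count: int, second_count_threshold: int,
                          elapsed_s: float, duration_threshold_s: float) -> bool:
    """Stop when the link is found faulty or the time budget is used up."""
    return (link_is_faulty(count, second_count_threshold)
            or elapsed_s >= duration_threshold_s)


if __name__ == "__main__":
    n = crc_error_count(write_errors=3, read_errors=4, mode=CountMode.TOTAL)
    print(should_stop_crc_check(n, second_count_threshold=5,
                                elapsed_s=1.0, duration_threshold_s=60.0))
```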

In the present application, an end mechanism for the CRC transmission check allows the memory controller to stop the check promptly while the memory system is running, so that memory system performance is not degraded continuously. Fault localization is thus achieved with as little performance impact as possible, which improves the availability of the memory system.

Optionally, after determining that the memory channel link is faulty, the control device outputs a first fault detection result, which indicates that the memory channel link is faulty.

In the present application, when a memory channel fails, the control device can output the specific location of the fault point. Providing operations and maintenance personnel with more accurate alarm information makes repair easier, which shortens fault recovery time and improves system availability.

Optionally, the ECC error statistics include the memory addresses in the memory channel at which ECC errors occurred. After determining that the memory channel link is not faulty, the control device outputs a second fault detection result, which indicates that the memory module corresponding to those memory addresses is faulty.

In the present application, when a memory channel fails, the control device can output the specific location of the fault point. Providing operations and maintenance personnel with more accurate alarm information makes repair easier, which shortens fault recovery time and improves system availability.
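The choice between the two fault detection results can be sketched as below. The address-to-module mapping (`module_ranges`), the module names, and the result strings are hypothetical; the application does not describe how an address is mapped to a module.

```python
from typing import Dict, List


def fault_detection_result(link_faulty: bool,
                           ecc_error_addresses: List[int],
                           module_ranges: Dict[str, range]) -> str:
    """Return a first result (link faulty) or a second result (module faulty)."""
    if link_faulty:
        return "memory channel link faulty"
    # Link is healthy, so attribute the ECC errors to the modules that own
    # the failing addresses.
    modules = sorted({name
                      for addr in ecc_error_addresses
                      for name, addr_range in module_ranges.items()
                      if addr in addr_range})
    if not modules:
        return "memory module faulty: unknown module"
    return "memory module faulty: " + ", ".join(modules)


if __name__ == "__main__":
    ranges = {"DIMM0": range(0x0000, 0x8000), "DIMM1": range(0x8000, 0x10000)}
    print(fault_detection_result(False, [0x8010, 0x9000], ranges))
```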

Optionally, when the number of ECC errors on the memory channel exceeds the first count threshold, the control device outputs a fault indication, which indicates that the memory channel is faulty. In this case, the control device directing the memory controller to perform the CRC transmission check includes: in response to a memory channel link diagnostic command for the memory channel, the control device directs the memory controller to perform the CRC transmission check on the memory channel link.

In the present application, the control device can output a fault indication and leave it to the user to decide whether to trigger the memory controller's CRC transmission check on the memory channel link in that channel.

Optionally, the memory channel link diagnostic command includes the second count threshold and/or the duration threshold. The second count threshold is used to determine that the memory channel link is faulty when the number of CRC transmission check errors on the link exceeds it; the duration threshold is the maximum time allowed for the CRC transmission check on the link.
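One way to model such a diagnostic command is sketched below. The class name, field names, and the fallback defaults are all assumptions: the application states only that the command may carry either or both thresholds and does not say what applies when one is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical fallbacks used when the command omits a threshold.
DEFAULT_SECOND_COUNT_THRESHOLD = 10
DEFAULT_DURATION_THRESHOLD_S = 60.0


@dataclass(frozen=True)
class LinkDiagnosticCommand:
    """Hypothetical memory channel link diagnostic command."""
    channel_id: int
    second_count_threshold: Optional[int] = None
    duration_threshold_s: Optional[float] = None

    def effective_thresholds(self) -> Tuple[int, float]:
        """Return (count threshold, duration threshold) with defaults filled in."""
        count = (self.second_count_threshold
                 if self.second_count_threshold is not None
                 else DEFAULT_SECOND_COUNT_THRESHOLD)
        duration = (self.duration_threshold_s
                    if self.duration_threshold_s is not None
                    else DEFAULT_DURATION_THRESHOLD_S)
        return count, duration


if __name__ == "__main__":
    cmd = LinkDiagnosticCommand(channel_id=2, second_count_threshold=3)
    print(cmd.effective_thresholds())
```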

In a second aspect, a control device is provided. The control device includes a plurality of functional modules that interact to implement the method of the first aspect and its implementations. The functional modules may be implemented in software, hardware, or a combination of the two, and may be combined or divided in any way depending on the specific implementation.

In a third aspect, a control device is provided, including: a processor and a memory;

the memory is configured to store a computer program, the computer program including program instructions;

the processor is configured to invoke the computer program to implement the method of the first aspect and its implementations.

In a fourth aspect, a memory system is provided, including a memory controller and a memory channel. The memory controller is configured to perform ECC checks on the memory channel and to record the number of ECC errors that occur on it. The memory controller is further configured to perform a CRC transmission check on the memory channel link in the memory channel after receiving a CRC transmission check command for that channel.

Optionally, the memory controller is further configured to record a status indication once the number of ECC errors on the memory channel exceeds the first count threshold; the status indication indicates that the number of ECC errors on the channel exceeds the first count threshold.
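The bookkeeping on the memory controller side can be sketched as a small counter per channel. The class and attribute names are hypothetical; a real controller would hold these values in hardware registers, which this sketch does not model.

```python
class ChannelEccCounter:
    """Hypothetical per-channel ECC error record kept by the memory controller."""

    def __init__(self, first_count_threshold: int) -> None:
        self.first_count_threshold = first_count_threshold
        self.error_count = 0
        self.threshold_exceeded = False  # the recorded status indication

    def record_ecc_error(self) -> None:
        """Count one ECC error and latch the status indication if needed."""
        self.error_count += 1
        if self.error_count > self.first_count_threshold:
            self.threshold_exceeded = True


if __name__ == "__main__":
    counter = ChannelEccCounter(first_count_threshold=2)
    for _ in range(3):
        counter.record_ecc_error()
    print(counter.error_count, counter.threshold_exceeded)
```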

Optionally, the memory controller performing a CRC transmission check on the memory channel link includes: the memory controller performs a CRC transmission check on the link in the write direction; and/or, where the memory module in the memory channel is capable of computing CRC transmission check codes, the memory controller directs the memory module to perform a CRC transmission check on the link in the read direction.

Optionally, the memory controller is further configured to record the number of CRC transmission check errors on the memory channel link, and to stop the CRC transmission check on the link once that number exceeds the second count threshold or the time spent on the check reaches the duration threshold.

In the present application, when the memory controller determines that the end condition for the CRC transmission check on a memory channel link is met, it can stop the check on its own initiative, which avoids continuously degrading memory system performance and preserves the availability of the memory system.

Optionally, the memory controller is configured to perform a write-direction CRC transmission check on the memory channel link. The memory module is configured to record the number of CRC transmission check errors on the link in the write direction and, when that number exceeds the second count threshold, to stop verifying the CRC transmission check codes received over the link.

In the present application, while the memory controller performs a write-direction CRC transmission check on the memory channel link, the memory module can, once the number of write-direction CRC transmission check errors on the link exceeds the second count threshold, stop verifying the CRC transmission check codes received over the link on its own initiative. This frees the module's own memory resources promptly and improves the availability of the memory system.
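The module-side behaviour for the write direction can be sketched as follows. The class and method names are hypothetical, and the per-burst CRC comparison is reduced to a boolean input so that only the counting and self-stop logic is shown.

```python
class WriteCrcChecker:
    """Hypothetical write-direction CRC check state inside a memory module."""

    def __init__(self, second_count_threshold: int) -> None:
        self.second_count_threshold = second_count_threshold
        self.error_count = 0
        self.checking = True

    def on_write_burst(self, crc_ok: bool) -> None:
        """Record the outcome of verifying one received CRC transmission check code."""
        if not self.checking:
            return  # verification already stopped; bursts are no longer checked
        if not crc_ok:
            self.error_count += 1
            if self.error_count > self.second_count_threshold:
                # Threshold exceeded: stop verifying on the module's own initiative.
                self.checking = False


if __name__ == "__main__":
    checker = WriteCrcChecker(second_count_threshold=1)
    for ok in (True, False, False, False):
        checker.on_write_burst(ok)
    print(checker.error_count, checker.checking)
```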

In a fifth aspect, a computer system is provided, including the control device of the third aspect and the memory system of the fourth aspect.

In a sixth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when executed by a processor, implement the method of the first aspect and its implementations.

In a seventh aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the method of the first aspect and its implementations.

In an eighth aspect, a chip is provided. The chip includes a programmable logic circuit and/or program instructions and, when running, implements the method of the first aspect and its implementations.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the basic architecture of a computer system according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a memory channel fault detection method according to an embodiment of the present application;

FIG. 3 is a schematic flowchart of the fault detection triggering process according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of the fault detection process according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of the fault detection ending process according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a control device according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of another control device according to an embodiment of the present application;

FIG. 8 is a block diagram of a control device according to an embodiment of the present application.

Detailed Description

To make the purpose, technical solutions, and advantages of the present application clearer, the implementations of the present application are described in further detail below with reference to the accompanying drawings.

At present, ECC can only detect that an error occurred somewhere in the overall process of transmitting and storing data, and errors can arise either while data is transmitted over the memory channel link or while it is stored in a memory module. Because ECC cannot distinguish between the two, it can establish only that a fault point exists in the memory channel, not where it is. When the memory system raises a fault alarm, it can report only that a given memory channel failed while reading or writing a given memory module, and operations and maintenance personnel usually try to clear the fault by replacing that module. If the fault point is actually on the memory channel link, they may have to replace the memory module several times before concluding that the fault lies on the link rather than on the module. Locating the fault by manual investigation is inefficient, lengthens fault recovery time, and reduces system availability.

To determine whether the fault point is on the memory channel link or on a memory module, the related art also builds a CRC transmission check function into the memory controller and the memory module, to establish whether the memory channel link is at fault. A CRC transmission check code is usually generated per burst of memory reads or writes. Each burst reads or writes memory 8 or 16 times without interruption, in address order, to form a data frame. The memory controller or the memory module generates an 8-bit CRC transmission check code for the data frame of one burst, used solely to verify that burst. After the data frame has been transferred, the check code is sent from the memory controller or the memory module to the peer in two additional clock cycles. The check code is not stored in the memory module and is not passed on beyond the memory controller; it serves only to verify the transfer. A failed CRC transmission check therefore shows that the data was corrupted in transit. Attaching CRC transmission check codes to data transfers makes it possible to determine accurately whether the memory channel link is faulty, and so to give operations and maintenance personnel more accurate alarm information.
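As an illustration of an 8-bit check code computed over one burst frame, the sketch below implements a plain bitwise CRC-8. The generator polynomial x^8 + x^2 + x + 1 (0x07) with a zero initial value is an assumption chosen for the example; the application does not name a polynomial, and the byte-wise frame layout here is likewise hypothetical.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0, no final XOR.

    The default polynomial 0x07 (x^8 + x^2 + x + 1) is an assumed choice.
    """
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


if __name__ == "__main__":
    # A hypothetical burst frame of 8 transfers, one byte each.
    frame = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
    print(f"check code: 0x{crc8(frame):02X}")
```

The check code travels with the frame and is discarded after verification, so a mismatch at the receiver points at the transfer itself rather than at the stored data.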

Although the CRC transmission check helps locate memory channel link faults accurately, transmitting the check code takes additional clock cycles, so each burst takes longer with the function enabled than without it, and computing the check code in the memory controller or memory module adds further latency. This seriously degrades memory system performance. In current computer systems, the CRC transmission check is usually an optional function that the user chooses to enable or not before the computer system starts. If the user enables it, memory system performance is degraded for as long as the computer system runs. If the user does not, the fault location in the memory channel still cannot be determined.
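A rough feel for the transfer-time cost can be obtained with the arithmetic below. It assumes, purely for illustration, that each of the 8 or 16 transfers in a burst occupies one cycle and that the check code adds two cycles, as described above; how cycles map to transfers on a real interface may differ, and the extra latency of computing the code is not modelled. The function name is hypothetical.

```python
def crc_burst_overhead(burst_transfers: int, crc_cycles: int = 2) -> float:
    """Fraction by which one burst is lengthened when the check code is appended.

    Assumes one cycle per transfer, which is a simplification.
    """
    if burst_transfers <= 0:
        raise ValueError("burst_transfers must be positive")
    return crc_cycles / burst_transfers


if __name__ == "__main__":
    for n in (8, 16):
        print(f"burst of {n}: +{crc_burst_overhead(n):.1%} transfer time")
```

Under these assumptions a burst of 8 is lengthened by a quarter and a burst of 16 by an eighth, which is why leaving the check enabled permanently is costly.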

On this basis, the present application proposes a technical solution that combines ECC with the CRC transmission check. When the number of ECC errors on a memory channel in the memory system exceeds a count threshold, the memory controller starts a CRC transmission check on the memory channel link in that channel. This includes: the memory controller performs a write-direction CRC transmission check on the link; and/or, where the memory module in the channel is capable of computing CRC transmission check codes, the memory controller directs the memory module to perform a read-direction CRC transmission check on the link. In a write-direction check, when the memory system executes a memory write command, the memory controller computes the CRC transmission check code for the transmitted data and sends it to the memory module, and the memory module verifies the transfer against the received check code. In a read-direction check, when the memory system executes a memory read command, the memory module computes the CRC transmission check code for the transmitted data and sends it to the memory controller, and the memory controller verifies the transfer against the received check code.
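The two directions differ only in which end generates the check code and which end verifies it. The sketch below simulates one checked transfer over a link that can optionally flip a bit. The CRC-8 polynomial 0x07, the function names, and the returned dictionary are assumptions made for illustration, and the check code itself is treated as arriving intact.

```python
from typing import Dict, Optional


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0 (polynomial is an assumed choice)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


# Which end computes the check code and which end verifies it.
ROLES = {
    "write": {"generator": "memory controller", "checker": "memory module"},
    "read": {"generator": "memory module", "checker": "memory controller"},
}


def crc_checked_transfer(direction: str, frame: bytes,
                         flip_bit: Optional[int] = None) -> Dict[str, object]:
    """Send one burst frame plus its check code across a simulated link."""
    roles = ROLES[direction]
    sent_crc = crc8(frame)  # computed by the generating end
    received = bytearray(frame)
    if flip_bit is not None:
        # Simulate a link fault by flipping one bit of the frame in transit.
        received[flip_bit // 8] ^= 1 << (flip_bit % 8)
    passed = crc8(bytes(received)) == sent_crc  # verified by the checking end
    return {"generated_by": roles["generator"],
            "checked_by": roles["checker"],
            "passed": passed}


if __name__ == "__main__":
    frame = bytes(range(8))
    print(crc_checked_transfer("write", frame))
    print(crc_checked_transfer("read", frame, flip_bit=5))
```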

本申请中,在内存通道频繁发生ECC错误时,通过控制设备触发对该内存通道中的内存通道链路的CRC传输校验流程,以确定内存通道链路是否发生故障,进而能够明确内存通道中的故障点在内存通道链路上还是内存模块上,无需人工排查内存通道中的故障位置,提高了故障定位效率。另外,本申请在内存通道发生故障时,能够向运维人员提供更准确的告警信息,以便运维人员进行故障修复,从而缩短故障恢复时间,提高系统可用性。In this application, when ECC errors occur frequently on the memory channel, the control device triggers the CRC transmission verification process of the memory channel link in the memory channel to determine whether the memory channel link is faulty, and then it is possible to determine the error in the memory channel. Whether the fault point is on the memory channel link or the memory module, there is no need to manually check the fault location in the memory channel, which improves the fault location efficiency. In addition, the present application can provide operation and maintenance personnel with more accurate alarm information when the memory channel fails, so that the operation and maintenance personnel can perform fault repair, thereby shortening the fault recovery time and improving system availability.

在一些实施例中,在内存控制器对内存通道链路进行CRC传输校验的过程中,一旦确定了该内存通道链路发生故障,或者,内存控制器对该内存通道链路进行CRC传输校验的时长达到了时长阈值,则内存控制器停止对该内存通道链路进行CRC传输校验。其中,当某个内存通道链路发生CRC传输校验错误的次数超出设置的次数阈值,可以判定该内存通道链路发生故障。本申请通过设置内存控制器进行CRC传输校验的结束机制,使得内存控制器能够在内存系统的运行过程中及时停止进行CRC传输校验,避免持续降低内存系统的性能,在实现对内存通道故障定位的同时,尽可能降低对内存系统的性能影响,提高了内存系统的可用性。In some embodiments, in the process of the memory controller performing CRC transmission verification on the memory channel link, once it is determined that the memory channel link has failed, or the memory controller performs the CRC transmission verification on the memory channel link If the time length of the verification reaches the time length threshold, the memory controller stops performing the CRC transmission verification on the memory channel link. Wherein, when the number of CRC transmission check errors occurring on a memory channel link exceeds a set number threshold, it can be determined that the memory channel link is faulty. This application sets the end mechanism for the memory controller to perform CRC transmission verification, so that the memory controller can stop the CRC transmission verification in time during the operation of the memory system, avoiding continuous degradation of the performance of the memory system, and realizing memory channel failure While positioning, the performance impact on the memory system is reduced as much as possible, and the availability of the memory system is improved.

The solution of this application is described in detail below from several perspectives, including the system, the hardware apparatus, the method flow, and the software apparatus.

The solution of this application can be applied to memory channel fault detection in a variety of general-purpose and special-purpose computing scenarios, for example in computer systems that contain a processor and memory, such as general-purpose servers, storage controllers, network controllers, and graphics workstations. FIG. 1 is a schematic diagram of the basic architecture of a computer system provided by an embodiment of this application. As shown in FIG. 1, the computer system includes a control device 11 and a memory system. The memory system is integrated on a baseboard or processor package 12. The baseboard may also be called a motherboard. The memory system includes a memory controller 121 and a memory channel 122. The memory channel 122 includes a memory channel link 122a and a memory module 122b.

The control device 11 is a computer device independent of the baseboard or processor package 12. The control device 11 connects to the baseboard or processor package 12 through an intelligent platform management interface (IPMI) implemented over a physical connection bus such as an inter-integrated circuit (I2C) bus. An external party can indirectly control the baseboard or processor package 12 by operating the control device 11.

The control device 11 is also called a management unit and may be, for example, a baseboard management controller (BMC). The control device 11 monitors and records the working state of the baseboard or processor package 12, including hardware health and performance status, hardware fault alarms, and power status. The control device 11 can also perform basic control operations on the baseboard or processor package 12, such as powering on and off, restarting, and querying the working state of the hardware.

The baseboard or processor package 12 includes a processor and memory. Optionally, the processor includes, but is not limited to, one or more of a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processing unit (NPU) of any architecture. The memory includes a basic input and output system (BIOS) read-only memory (ROM) and random access memory (RAM). The BIOS ROM stores the BIOS program. When the processor starts, it reads the BIOS program from the BIOS ROM and runs it. The BIOS program provides the low-level operation interface of the baseboard or processor package 12.

The processor contains, or is externally connected to, one or more memory controllers 121. Each memory controller 121 controls one or more memory channels 122. Each memory channel 122 includes one memory channel link 122a and one or more memory modules 122b. The memory controller 121 is a bus circuit controller that manages and schedules the transfer speed between the memory and the processor. The memory channel link 122a can carry memory access commands, addresses, and data. The memory module 122b is a printed circuit board carrying memory integrated circuits and is usually RAM. Optionally, each memory module 122b includes one or more memory chips. For example, the memory module 122b may be a single memory chip, such as a double data rate (DDR) synchronous dynamic random-access memory (SDRAM) chip or a high-bandwidth memory (HBM) chip, or it may be a dual in-line memory module (DIMM) composed of multiple memory chips.

The functions supported by the memory controller and the memory module in the memory system of the embodiments of this application are described below.

The memory controller can compute an ECC code for data transferred to the memory module. When the memory controller writes data to the memory module, the ECC code computed for that data is stored in the memory module together with the data. Optionally, the ECC code and the data are stored in the same memory chip. Alternatively, the data is stored in one memory chip on the memory module, and the ECC code is stored in a memory chip on that module dedicated to storing ECC codes. The memory chip used to store ECC codes and the memory chip used to store data may have the same structure or different structures.

When the memory controller reads data from the memory module, it also reads the corresponding ECC code, uses that code to verify the correctness of the data read, and attempts to correct any error found. The memory controller has a set of registers that hold information about ECC errors occurring during reads, including but not limited to the number of ECC errors on each memory channel, the memory address in the memory channel at which an ECC error occurred, and whether the ECC error was corrected successfully. The memory address may be a physical address of the memory module and a system address. Optionally, the memory controller has a count register accessible to the control device, in which it records the number of ECC errors on each memory channel. Additionally or alternatively, the memory controller has a status register accessible to the control device; when the number of ECC errors on a memory channel exceeds a configured count threshold, the memory controller marks the threshold-exceeded state in the corresponding status register.
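
The count register and status register described above can be modeled in a few lines. This is a minimal sketch, not the patent's register layout: the class name `EccChannelRegisters` and all field names are hypothetical, and the patent does not specify widths, addresses, or reset behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EccChannelRegisters:
    """Hypothetical per-channel register model; all names are illustrative."""

    threshold: int                               # configured count threshold
    count: int = 0                               # count register
    over_threshold: bool = False                 # status register flag
    last_error_address: Optional[int] = None     # address of the latest ECC error
    last_error_corrected: Optional[bool] = None  # whether it was corrected

    def record_error(self, address: int, corrected: bool) -> None:
        """Record one ECC error seen while reading from the memory module."""
        self.count += 1
        self.last_error_address = address
        self.last_error_corrected = corrected
        # The status flag is set only once the count exceeds the threshold.
        if self.count > self.threshold:
            self.over_threshold = True


regs = EccChannelRegisters(threshold=2)
for addr in (0x1000, 0x1040, 0x1080):
    regs.record_error(addr, corrected=True)
```

With a threshold of 2, the flag stays clear after the second error and is set by the third, matching the "exceeds" wording in the text.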

Optionally, the memory system provided by the embodiments of this application supports CRC transmission checking in the memory-write direction and/or in the memory-read direction.

When the memory system supports CRC transmission checking in the memory-write direction:

The memory controller is able to compute a CRC transmission check code. When executing a memory write command, the memory controller can compute a CRC transmission check code for the data transferred to the memory module. After transferring the data to be stored in the memory module, the memory controller sends the CRC transmission check code to the memory module in several additional transfer cycles. Correspondingly, the memory module is able to perform the CRC transmission check. When executing a memory write command, the memory module receives the data and the CRC transmission check code from the memory controller and can use the check code to verify that the transfer was correct. If the check fails (that is, a CRC transmission check error has occurred), the memory module can notify the memory controller that an error occurred during this transfer, either by changing the level of a physical connection line or by saving the check result in a register on the memory module that the memory controller can read.
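
The write-direction exchange above can be sketched as follows. The patent does not name a CRC polynomial or frame layout, so this example assumes an 8-bit CRC with polynomial 0x07 and a one-byte check code appended to the data; the function names `crc8`, `controller_send`, and `module_receive` are hypothetical.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0 (polynomial is an assumption)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def controller_send(data: bytes) -> bytes:
    """Memory controller side: append the check code after the data."""
    return data + bytes([crc8(data)])


def module_receive(frame: bytes) -> bool:
    """Memory module side: recompute the CRC and compare with the received code."""
    data, received_code = frame[:-1], frame[-1]
    return crc8(data) == received_code


frame = controller_send(b"\x12\x34\x56\x78")
ok = module_receive(frame)                           # intact transfer
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]     # one bit flipped on the link
error_detected = not module_receive(corrupted)
```

A failed comparison in `module_receive` corresponds to the CRC transmission check error that the memory module reports back to the memory controller.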

In a first implementation, the memory module can record the number of CRC transmission check errors that occur in the memory-write direction.

Optionally, the memory module has a count register accessible to the memory controller, in which it records the number of CRC transmission check errors in the memory-write direction. Additionally or alternatively, the memory module has a status register accessible to the memory controller; when the number of CRC transmission check errors in the memory-write direction exceeds a configured count threshold, the memory module marks the threshold-exceeded state in the status register. The memory module can notify the memory controller to read the corresponding register by changing the level of a physical connection line, or it can simply wait passively for the memory controller to read the register. Optionally, when the number of CRC transmission check errors in the memory-write direction exceeds the configured count threshold, the memory module can stop checking received CRC transmission check codes on its own initiative.

In a second implementation, the memory controller can record the number of CRC transmission check errors that occur in the memory-write direction. Each time the memory module obtains a failed check result using the CRC transmission check code, it notifies the memory controller that an error occurred during this transfer, and the memory controller takes the number of failed check results it has received as the number of CRC transmission check errors in the memory-write direction. Optionally, the memory controller can use a count register to record the number of CRC transmission check errors in the memory-write direction, or use a status register to mark whether that number exceeds the configured count threshold.

The first and second implementations can also be combined, so that both the memory module and the memory controller record the number of CRC transmission check errors in the memory-write direction.

Optionally, when the memory controller determines that the number of CRC transmission check errors in the memory-write direction exceeds the configured count threshold, the memory controller can stop computing and transmitting CRC transmission check codes on its own initiative.

Optionally, the CRC transmission check function for memory writes is enabled or disabled by writing one command or a group of commands to the memory controller, and the change takes effect without reinitializing the memory system.

When the memory system supports CRC transmission checking in the memory-read direction:

The memory module is able to compute a CRC transmission check code. When executing a memory read command, the memory module can compute a CRC transmission check code for the data transferred to the memory controller. After transferring the data read by the memory controller, the memory module sends the CRC transmission check code to the memory controller in several additional transfer cycles. Correspondingly, the memory controller is able to perform the CRC transmission check. When executing a memory read command, the memory controller receives the data and the CRC transmission check code from the memory module and can use the check code to verify that the transfer was correct.

The memory controller can record the number of CRC transmission check errors that occur in the memory-read direction. Optionally, the memory controller has a count register accessible to the processor, in which it records the number of CRC transmission check errors in the memory-read direction. Additionally or alternatively, the memory controller has a status register accessible to the processor; when the number of CRC transmission check errors in the memory-read direction exceeds a configured count threshold, the memory controller marks the threshold-exceeded state in the status register. When that threshold is exceeded, the memory controller can raise an interrupt to notify the processor to execute the BIOS program and read the corresponding register, or it can simply wait passively for the processor to read the register.

Optionally, when the memory controller determines that the number of CRC transmission check errors in the memory-read direction exceeds the configured count threshold, it can stop checking received CRC transmission check codes on its own initiative and control the memory module to stop computing and transmitting CRC transmission check codes.

Optionally, the CRC transmission check function for memory reads is enabled or disabled by writing one command or a group of commands to the memory controller. The memory controller in turn writes the corresponding command to the memory module on the specified memory channel so that the module starts or stops computing and transmitting CRC transmission check codes for memory reads. The change takes effect without reinitializing the memory system.

The software functions that the BIOS program can implement in the embodiments of this application are described below. The BIOS program provides the low-level operation interface of the baseboard or processor package.

(1) The BIOS program contains a set of software interfaces that can be called to enable or disable the CRC transmission check function of the memory controller and the memory module, and to query whether the function is currently on or off. These interfaces can be called repeatedly while the memory system is running. Calling them does not interrupt the memory system's execution of other computer programs, and the memory system does not need to be restarted for a change to the on/off state of the CRC transmission check function to take effect.

(2) The BIOS program contains a set of software interfaces that can be called to read specific registers in the memory controller, in order to query, for a specified memory module on a specified memory channel, the number of ECC errors, the status of whether that number exceeds the configured count threshold, the memory address at which an ECC error occurred, and so on.

(3) The BIOS program contains a set of software interfaces that can be called to read specific registers in the memory controller and/or the memory module, in order to query, for a specified memory module on a specified memory channel, the number of CRC transmission check errors in the memory-write direction or the status of whether that number exceeds the configured count threshold.

(4) If the memory system supports CRC transmission checking in the memory-read direction, the BIOS program contains a set of software interfaces that can be called to read specific registers in the memory controller, in order to query, for a specified memory channel, the number of CRC transmission check errors in the memory-read direction or the status of whether that number exceeds the configured count threshold.
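
The four interface groups (1)-(4) can be pictured as one small API surface. The sketch below is an in-memory stand-in only: the class `BiosInterfaceStub` and every method name are hypothetical, since the patent describes what the interfaces do but not their names, signatures, or register addresses.

```python
class BiosInterfaceStub:
    """In-memory stand-in for interface groups (1)-(4); all names are hypothetical."""

    def __init__(self, channels: int) -> None:
        self._crc_enabled = [False] * channels   # CRC checking starts disabled
        self.ecc_errors = [0] * channels
        self.write_crc_errors = [0] * channels
        self.read_crc_errors = [0] * channels

    # (1) Enable, disable, or query the CRC transmission check function.
    #     No reinitialization of the memory system is modeled or required.
    def set_crc_check(self, channel: int, enabled: bool) -> None:
        self._crc_enabled[channel] = enabled

    def crc_check_enabled(self, channel: int) -> bool:
        return self._crc_enabled[channel]

    # (2) Query the ECC error count of a channel.
    def ecc_error_count(self, channel: int) -> int:
        return self.ecc_errors[channel]

    # (3) Query the write-direction CRC transmission check error count.
    def write_crc_error_count(self, channel: int) -> int:
        return self.write_crc_errors[channel]

    # (4) Query the read-direction CRC transmission check error count.
    def read_crc_error_count(self, channel: int) -> int:
        return self.read_crc_errors[channel]


bios = BiosInterfaceStub(channels=2)
bios.set_crc_check(channel=1, enabled=True)
```

Keeping the enable/disable call separate from the query calls mirrors the text, where the switch can be toggled repeatedly at run time while the counters are read independently.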

The functions supported by the control device in the embodiments of this application are described below.

The control device can call the software interfaces (1)-(4) provided by the BIOS program to enable or disable the CRC transmission check function, query whether the function is on or off, read the ECC error count on a specified memory channel or the status of whether that count exceeds the configured count threshold, read ECC error address information, and read the CRC transmission check error count or the status of whether that count exceeds the configured count threshold.

The control device can provide the user with a user interface for viewing or diagnosing memory channel fault information. This includes, but is not limited to, notifying the user in the form of an alarm or a log entry that a memory channel or a memory channel link is faulty, and receiving the user's memory channel link diagnosis command for a given memory channel.

FIG. 2 is a schematic flowchart of a memory channel fault detection method provided by an embodiment of this application. The method can be applied to the control device 11 in the computer system shown in FIG. 1. As shown in FIG. 2, the method includes:

Step 201: Obtain the memory controller's ECC error statistics for a memory channel.

The ECC error statistics reflect the number of ECC errors that have occurred on the memory channel. Optionally, if the memory controller controls multiple memory channels, the control device can obtain the memory controller's ECC error statistics for each memory channel separately.

Optionally, the ECC error statistics include the number of ECC errors on the memory channel, or a status indication that the number of ECC errors on the memory channel exceeds a first count threshold. For example, the control device can read the number of ECC errors on the memory channel from a count register on the memory controller, or it can read the threshold-exceeded status indication from a status register on the memory controller. The first count threshold is a positive integer. It may be preset in the system, or it may be set or changed by the user through the control device. The embodiments of this application do not limit its specific value.

Optionally, the memory controller counts the ECC errors that occur on the memory channel within a time window. For example, the memory controller can continuously track the number of ECC errors on the memory channel during the past 5 minutes. The size of the time window may be preset in the memory controller, or it may be set or changed by the user through the control device. The embodiments of this application do not limit its specific value.
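
One way to count errors over "the past 5 minutes" is a sliding window of timestamps. The sketch below is an illustration in software, assuming timestamps in seconds are supplied by the caller; the class `WindowedErrorCounter` is hypothetical, and a hardware memory controller would likely use a different mechanism that the patent does not describe.

```python
from collections import deque


class WindowedErrorCounter:
    """Counts events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()  # error timestamps, oldest first

    def record(self, now: float) -> None:
        """Record one ECC error observed at time `now` (seconds)."""
        self._timestamps.append(now)

    def count(self, now: float) -> int:
        """Return the number of errors in the interval (now - window, now]."""
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = WindowedErrorCounter(window_seconds=300)  # 5-minute window
for t in (0, 100, 250):
    counter.record(t)
recent = counter.count(now=301)  # the error at t=0 has aged out
```

The windowed count, rather than a lifetime total, is what would be compared with the first count threshold in this variant.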

Optionally, the control device obtains the memory controller's ECC error statistics for the memory channel in one of the following ways: the control device periodically queries the memory controller for the number of ECC errors on the memory channel; or, when the number of ECC errors on a memory channel exceeds the first count threshold, the memory controller sends the control device a status indication that the number of ECC errors on that memory channel exceeds the first count threshold.

Step 202: When the number of ECC errors on the memory channel exceeds the first count threshold, control the memory controller to perform a CRC transmission check on the memory channel link in that memory channel to determine whether the link is faulty.

When the memory system starts running, its CRC transmission check function is disabled. Only when the number of ECC errors on a memory channel exceeds the first count threshold does the control device enable the CRC transmission check function for that memory channel, that is, trigger the memory controller that controls the channel to perform a CRC transmission check on the memory channel link in the channel. This avoids the performance loss that the memory system would incur if the CRC transmission check function were left enabled for a long time.

In the embodiments of this application, the control device can judge from the memory controller's ECC error statistics for a memory channel whether errors occur frequently on that channel. If they do, the control device can trigger a CRC transmission check of the memory channel link in that channel to determine whether the link is faulty. This makes it possible to establish whether the fault point in the memory channel lies on the memory channel link or on a memory module, so the fault location no longer has to be found by manual inspection, which improves fault-localization efficiency.
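
The localization logic described above can be summarized as a small decision function. This is a sketch under assumptions: the names `Diagnosis` and `diagnose_channel` and the parameter `crc_threshold` are hypothetical, and the "memory module suspected" outcome is an inference from the text (frequent ECC errors with a healthy link), not a rule the patent states in this form.

```python
from enum import Enum


class Diagnosis(Enum):
    NO_FAULT = "no channel fault detected"
    LINK_FAULT = "memory channel link fault"
    MODULE_SUSPECTED = "memory module suspected"


def diagnose_channel(ecc_errors: int, first_threshold: int,
                     crc_errors: int, crc_threshold: int) -> Diagnosis:
    """Classify a memory channel from its ECC and CRC error counts."""
    if ecc_errors <= first_threshold:
        # ECC errors are not frequent, so the CRC check is never enabled.
        return Diagnosis.NO_FAULT
    if crc_errors > crc_threshold:
        # Transfers themselves are failing: the link is at fault.
        return Diagnosis.LINK_FAULT
    # Assumption: frequent ECC errors over a clean link point at the module.
    return Diagnosis.MODULE_SUSPECTED


result = diagnose_channel(ecc_errors=50, first_threshold=10,
                          crc_errors=0, crc_threshold=5)
```

In this sketch `crc_errors` stands for the count gathered while the CRC transmission check was enabled for the channel.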

When the number of ECC errors on a memory channel exceeds the first count threshold, the control device can determine that the memory channel is faulty. Optionally, in that case the control device can also output a fault indication, which indicates that the memory channel is faulty. One implementation of step 202 is as follows: in response to a memory channel link diagnosis command for the memory channel, the control device controls the memory controller to perform a CRC transmission check on the memory channel link in that channel. The control device may output the fault indication through the user interface so that the fault information is shown on a display device for operation and maintenance personnel to view. Once they know that a memory channel is faulty, operation and maintenance personnel can choose to issue a memory channel link diagnosis command for that channel.

In the embodiments of this application, after determining that a memory channel is faulty, the control device can directly trigger the memory controller to perform a CRC transmission check on the memory channel link in that channel, or it can output a fault indication and let the user decide whether to trigger the check.

Optionally, the memory channel link diagnosis command includes a second count threshold and/or a duration threshold. The second count threshold is used as follows: when the number of CRC transmission check errors on the memory channel link exceeds the second count threshold, the link is determined to be faulty. The duration threshold is the maximum time allowed for performing the CRC transmission check on the memory channel link. Both thresholds can serve as conditions for deciding whether to end the CRC transmission check. When the number of CRC transmission check errors on the memory channel link exceeds the second count threshold, that is, when the link is faulty, the link satisfies the CRC transmission check end condition. When the time the memory controller has spent performing the CRC transmission check on the link reaches the duration threshold, the link also satisfies the end condition. In other words, the memory channel link satisfies the CRC transmission check end condition when the link is faulty or when the duration of the CRC transmission check on the link reaches the duration threshold.
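
The end condition combines the two thresholds with a logical OR. A minimal sketch, assuming the elapsed time is measured in seconds; the function name `crc_check_end_condition` and its parameter names are hypothetical.

```python
from typing import Tuple


def crc_check_end_condition(crc_errors: int, second_threshold: int,
                            elapsed_s: float,
                            duration_threshold_s: float) -> Tuple[bool, bool]:
    """Return (should_stop, link_faulty) for an ongoing CRC transmission check."""
    # The link is judged faulty when the error count exceeds the threshold.
    link_faulty = crc_errors > second_threshold
    # The check also ends once it has run for the maximum allowed duration.
    timed_out = elapsed_s >= duration_threshold_s
    return (link_faulty or timed_out), link_faulty


stop, faulty = crc_check_end_condition(crc_errors=2, second_threshold=5,
                                       elapsed_s=600.0,
                                       duration_threshold_s=600.0)
```

Returning the fault verdict separately from the stop decision keeps the two outcomes distinct: a check that ends by timeout stops without implying that the link is faulty.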

Optionally, the second count threshold and the duration threshold may be preset in the system, or they may be set or changed by the user through the control device, for example by being carried in the memory channel link diagnosis command. The embodiments of this application do not limit the specific values of either threshold.

By providing a termination mechanism for the CRC transmission check performed by the memory controller, the embodiments of this application allow the memory controller to stop the check promptly while the memory system is running, which avoids a sustained reduction in memory system performance. The memory channel fault is localized while the performance impact on the memory system is kept as low as possible, which improves the availability of the memory system.

In one implementation, when the control device determines that the memory channel link satisfies the CRC transmission check end condition, the control device controls the memory controller to stop performing the CRC transmission check on the link. In other words, in response to the memory channel link being faulty, or to the duration of the memory controller's CRC transmission check on the link reaching the duration threshold, the control device controls the memory controller to stop the check.

After the memory controller starts performing CRC transmission checks on the memory channel link, the control device may obtain, in real time or periodically, the CRC error statistics that the memory controller and/or the memory module keep for the memory channel link, until it determines that the memory channel link has failed or that the duration of CRC transmission checking on the link has reached the duration threshold. Optionally, the CRC error statistics include the number of CRC transmission check errors that have occurred on the memory channel link, or a status indication that this number exceeds the second count threshold. If, in step 202, only the memory controller performs CRC transmission checks on the memory channel link, in the memory-write direction, the number of CRC transmission check errors on the memory channel link is the number of such errors in the memory-write direction. If, in step 202, only the memory module performs CRC transmission checks on the memory channel link in the memory channel, in the memory-read direction, the number of CRC transmission check errors on the memory channel link is the number of such errors in the memory-read direction. If, in step 202, the memory controller performs CRC transmission checks on the memory channel link in the memory-write direction and the memory module also performs CRC transmission checks on the link in the memory-read direction, the number of CRC transmission check errors on the memory channel link may be the total number of such errors in the memory-write and memory-read directions.
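The three counting rules above can be sketched in a few lines of Python. This is only an illustration: the function name `link_crc_error_count` and the convention of passing `None` for a direction in which checking is not enabled are assumptions, not part of the described method.

```python
from typing import Optional


def link_crc_error_count(write_errors: Optional[int],
                         read_errors: Optional[int]) -> int:
    """Return the CRC transmission check error count for one memory channel link.

    None means checking is not enabled in that direction
    (an illustrative convention, not something the source specifies).
    """
    if write_errors is None and read_errors is None:
        raise ValueError("CRC transmission checking is not enabled in either direction")
    # Only enabled directions contribute; with both enabled this is the total.
    return (write_errors or 0) + (read_errors or 0)
```

With only one direction enabled the result is that direction's count; with both enabled it is their sum, which is the value compared against the second count threshold.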

Optionally, the control device obtains the memory controller's CRC error statistics for the memory channel link in one of the following ways: the control device queries the memory controller for the number of CRC transmission check errors that have occurred on the memory channel link; or, when the number of CRC transmission check errors on the memory channel link exceeds the second count threshold, the memory controller sends the control device a status indication that the number of CRC transmission check errors on the link exceeds the second count threshold.

In another implementation, when the memory controller determines that the memory channel link satisfies the end condition for CRC transmission checking, the memory controller stops performing CRC transmission checks on that link on its own initiative. In this implementation, the control device can determine whether the memory channel link has failed from the CRC transmission check status information of the memory controller and/or the memory module. If, in a memory system with the memory-write-direction CRC transmission check function enabled, the status information of the memory controller indicates that CRC transmission checking has stopped, or, in a memory system with the memory-read-direction CRC transmission check function enabled, the status information of the memory module indicates that CRC transmission checking has stopped, the control device can determine that the memory channel link has failed.

The second count threshold and the duration threshold are stored in the memory controller in advance. The memory controller can record the number of CRC transmission check errors that occur on the memory channel link and, once that number exceeds the second count threshold or the duration of CRC transmission checking on the link reaches the duration threshold, stop performing CRC transmission checks on the link on its own initiative. Here, the memory controller performing transmission checks on the memory channel link includes: the memory controller performing CRC transmission checks on the link in the memory-write direction, and/or controlling the memory module to perform CRC transmission checks on the link in the memory-read direction. Correspondingly, the memory controller stopping transmission checks on the memory channel link includes: the memory controller stopping CRC transmission checks on the link in the memory-write direction, and/or controlling the memory module to stop CRC transmission checks on the link in the memory-read direction.
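The automatic-stop behaviour described above can be modelled as a small state holder. The following is a minimal sketch under stated assumptions: the class name `CrcAutoStop`, its fields, the `stop_reason` strings, and the use of simulated time intervals are all hypothetical and chosen only to illustrate the two stop conditions (error count exceeding the second count threshold, or checking duration reaching the duration threshold).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrcAutoStop:
    """Illustrative model of a controller that stops CRC checking by itself."""
    count_threshold: int          # the "second count threshold"
    duration_threshold_s: float   # maximum allowed checking duration
    errors: int = 0
    elapsed_s: float = 0.0
    checking: bool = True
    stop_reason: Optional[str] = None

    def record(self, new_errors: int, interval_s: float) -> None:
        """Account for one observation interval and apply the stop conditions."""
        if not self.checking:
            return
        self.errors += new_errors
        self.elapsed_s += interval_s
        if self.errors > self.count_threshold:
            self.checking = False
            self.stop_reason = "count_exceeded"
        elif self.elapsed_s >= self.duration_threshold_s:
            self.checking = False
            self.stop_reason = "duration_reached"
```

Keeping the stop reason separate from the `checking` flag is a design choice of this sketch; it lets a caller tell a link fault apart from a plain timeout.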

In the first case, the memory controller is used to perform CRC transmission checks on the memory channel link in the memory-write direction. The memory module can record the number of CRC transmission check errors that occur on the memory channel link in the memory-write direction and, when that number exceeds the second count threshold, stop checking the CRC transmission check codes received over the link. In addition, the memory module may notify the memory controller to stop computing and transmitting CRC transmission check codes.

In the second case, the memory module is used to perform CRC transmission checks on the memory channel link in the memory-read direction. The memory controller can record the number of CRC transmission check errors that occur on the memory channel link in the memory-read direction and, when that number exceeds the second count threshold, notify the memory module to stop computing and transmitting CRC transmission check codes and stop checking the CRC transmission check codes received over the link.

In the third case, the memory controller is used to perform CRC transmission checks on the memory channel link in the memory-write direction, and the memory module is used to perform CRC transmission checks on the link in the memory-read direction. The memory controller can record the total number of CRC transmission check errors that occur on the memory channel link in the memory-write and memory-read directions and, when that total exceeds the second count threshold, stop computing and transmitting CRC transmission check codes, stop checking the CRC transmission check codes received over the link, and notify the memory module to stop computing and transmitting CRC transmission check codes.

In the third case above, the total number of CRC transmission check errors that occur on the memory channel link in the memory-write and memory-read directions is compared with the second count threshold. Compared with the first and second cases, this allows a failure of the memory channel link to be determined sooner, so fault detection is more efficient.
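A small worked example makes the speed-up concrete. The per-interval error counts below are invented purely for illustration, and `first_exceed` is a hypothetical helper that returns the index of the first interval in which the running total exceeds the threshold.

```python
from itertools import accumulate
from typing import Optional, Sequence


def first_exceed(errors_per_interval: Sequence[int], threshold: int) -> Optional[int]:
    """Index of the first interval whose cumulative error count exceeds threshold."""
    for index, total in enumerate(accumulate(errors_per_interval)):
        if total > threshold:
            return index
    return None  # threshold never exceeded in the observed intervals


# Invented sample data: one error per interval in each direction.
write_errors = [1, 1, 1, 1, 1, 1]
read_errors = [1, 1, 1, 1, 1, 1]
combined = [w + r for w, r in zip(write_errors, read_errors)]
```

With a threshold of 4, the write direction alone crosses the threshold in interval index 4, whereas the combined count crosses it in interval index 2.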

Optionally, after determining that the memory channel link has failed, the control device may output a first fault detection result, which indicates that the memory channel link has failed. Alternatively, where the ECC error statistics include the memory addresses in the memory channel at which ECC errors occurred, the control device may, after determining that the memory channel link has not failed, output a second fault detection result, which indicates that the memory module corresponding to those memory addresses has failed.
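The choice between the two results can be sketched as follows. The function name, the dictionary shape of the result, and the `module_of` callback that maps a memory address to a module label are all assumptions made for illustration; the source does not specify how addresses are mapped to memory modules.

```python
from typing import Callable, Dict, Iterable


def fault_detection_result(link_failed: bool,
                           ecc_error_addresses: Iterable[int],
                           module_of: Callable[[int], str]) -> Dict[str, object]:
    """Build the first or second fault detection result (illustrative shape)."""
    if link_failed:
        # First result: the fault is on the memory channel link.
        return {"result": "first", "faulty": ["memory channel link"]}
    # Second result: the link is healthy, so blame the modules that own
    # the addresses at which ECC errors were observed.
    modules = sorted({module_of(addr) for addr in ecc_error_addresses})
    return {"result": "second", "faulty": modules}
```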

In the embodiments of this application, when a memory channel fails, the control device can output the specific location of the fault. Providing operations and maintenance personnel with more accurate alarm information makes it easier for them to repair the fault, which shortens fault recovery time and improves system availability.

The memory channel fault detection method provided by the embodiments of this application mainly comprises a fault detection trigger process, a fault detection process, and a fault detection end process. The following embodiments describe, by way of example, how each of these three processes is implemented, based on the interaction between the control device and the BIOS program. In the computer system shown in FIG. 1, the control device may be the control device 11, and the BIOS program may be the computer program stored in the BIOS ROM.

FIG. 3 is a schematic flowchart of the fault detection trigger process provided by an embodiment of this application. As shown in FIG. 3, the flow includes:

Step 301: The control device sends a setting command to the BIOS program. The setting command instructs the BIOS program to set the ECC error counting time window and the ECC count threshold for a memory channel.

Here, the ECC count threshold is the first count threshold described above. The ECC error counting time windows and ECC count thresholds of different memory channels may be the same or different.

Step 302: The BIOS program sets, in the memory controller, the ECC error counting time window and the ECC count threshold for the memory channel.

Step 303: The control device periodically sends a query command to the BIOS program. The query command instructs the BIOS program to query the ECC error statistics for the memory channel.

The ECC error statistics include the ECC error count (that is, the number of ECC errors that have occurred on the memory channel) or a status indication that the ECC error count exceeds the ECC count threshold.

Step 304: The BIOS program reads the ECC error statistics for the memory channel from the memory controller.

Here, the memory controller's ECC error statistics for the memory channel are the statistics that the memory controller collected for the memory channel within the most recent ECC error counting time window.

Step 305: The BIOS program sends the ECC error statistics for the memory channel to the control device.

Step 306: The control device determines whether the ECC error count of any memory channel exceeds the ECC count threshold. If so, the control device takes each memory channel whose ECC error count exceeds the ECC count threshold as a memory channel to be checked and performs step 307; if not, it returns to step 303.

Step 307: The control device sends the BIOS program a CRC transmission check command for the memory channel to be checked.

Step 308: The BIOS program detects whether the memory channel to be checked is idle. If the memory channel to be checked is idle, step 309 is performed; if it is busy, step 311 is performed.

After the BIOS program receives the CRC transmission check command, if the time the memory channel to be checked spends reading and writing memory exceeds a preset timeout, the BIOS program determines that the memory channel to be checked is busy.

Step 309: The BIOS program configures the memory controller to start performing CRC transmission checks on the memory channel link in the memory channel to be checked.

In step 309, the BIOS program may configure the memory controller to perform CRC transmission checks in the memory-write direction on the memory channel link in the memory channel to be checked. The BIOS program may also determine whether the memory module in the memory channel to be checked is capable of computing CRC transmission check codes. If it is, the BIOS program may further configure the memory controller to write a command to the memory module, so that the memory module performs CRC transmission checks in the memory-read direction on the memory channel link in the memory channel to be checked.

Optionally, if the memory controller supports automatic stopping of CRC transmission checking, the BIOS program may enable that function in the memory controller and set the CRC count threshold (that is, the second count threshold described above) in the memory controller. If the memory module supports automatic stopping of CRC transmission checking, the BIOS program may enable that function in the memory module and set the CRC count threshold in the memory module.

Step 310: The BIOS program sends the control device a notification that CRC transmission checking started successfully.

This notification indicates that CRC transmission checking for the memory channel to be checked was started successfully.

Step 311: The BIOS program sends the control device a notification that CRC transmission checking failed to start.

This notification indicates that CRC transmission checking for the memory channel to be checked failed to start.

Step 312: The control device determines whether CRC transmission checking for the memory channel to be checked was started successfully. If it was, the fault detection process is entered; if it was not, the flow returns to step 303.
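The decision logic of steps 306 to 312 can be summarized in a short sketch. It is a simplification under stated assumptions: the function names, the dictionary of per-channel ECC error counts, and the `channel_is_idle` callback standing in for the BIOS idle check are hypothetical, and the message exchange between the control device and the BIOS program is collapsed into direct function calls.

```python
from typing import Callable, Dict, List, Tuple


def select_channels_to_check(ecc_counts: Dict[int, int],
                             ecc_count_threshold: int) -> List[int]:
    """Step 306: channels whose ECC error count exceeds the ECC count threshold."""
    return [ch for ch, count in ecc_counts.items() if count > ecc_count_threshold]


def trigger_crc_check(ecc_counts: Dict[int, int],
                      ecc_count_threshold: int,
                      channel_is_idle: Callable[[int], bool]) -> Tuple[List[int], List[int]]:
    """Steps 307-312: try to start CRC transmission checking on each selected channel.

    Returns (started, failed). Channels in `failed` were busy, so the control
    device would go back to polling ECC statistics (step 303) for them.
    """
    started, failed = [], []
    for channel in select_channels_to_check(ecc_counts, ecc_count_threshold):
        if channel_is_idle(channel):   # step 308
            started.append(channel)    # steps 309-310
        else:
            failed.append(channel)     # step 311
    return started, failed
```

For example, with counts `{0: 2, 1: 9, 2: 12}`, a threshold of 8, and channel 2 busy, checking starts on channel 1 only and channel 2 is reported as a start failure.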

FIG. 4 is a schematic flowchart of the fault detection process provided by an embodiment of this application. The process is described using an example in which the memory controller performs CRC transmission checks in the memory-write direction on the memory channel link in the memory channel to be checked, the memory module performs CRC transmission checks in the memory-read direction on that link, and both the memory controller and the memory module support automatic stopping of CRC transmission checking. As shown in FIG. 4, the flow includes:

Step 401: The control device determines whether the duration of CRC transmission checking for the memory channel to be checked has reached the duration threshold. If it has not, step 402 is performed; if it has, the fault detection end process is entered.

Step 402: The control device sends a query command to the BIOS program. The query command instructs the BIOS program to query the fault detection result for the memory channel link in the memory channel to be checked.

Optionally, while CRC transmission checks are being performed on the memory channel link in the memory channel to be checked, the control device may periodically send query commands to the BIOS program to query the fault detection result.

Step 403: The BIOS program detects whether the memory channel to be checked is idle. If the memory channel to be checked is idle, step 404 is performed; if it is busy, step 408 is performed.

After the BIOS program receives the query command, if the time the memory channel to be checked spends reading and writing memory exceeds the preset timeout, the BIOS program determines that the memory channel to be checked is busy.

Step 404: The BIOS program detects whether the memory controller has stopped performing CRC transmission checks in the memory-write direction on the memory channel link in the memory channel to be checked. If the memory controller has not stopped, step 405 is performed; if it has stopped, step 407 is performed.

Step 405: The BIOS program detects whether the memory module has stopped performing CRC transmission checks in the memory-read direction on the memory channel link in the memory channel to be checked. If the memory module has not stopped, step 406 is performed; if it has stopped, step 407 is performed.

Optionally, steps 404 and 405 may be performed in the opposite order or at the same time; the embodiments of this application do not limit the order in which steps 404 and 405 are performed.

Step 406: The BIOS program sends a no-fault notification to the control device.

The no-fault notification indicates that the memory channel link in the memory channel to be checked has not failed.

Step 407: The BIOS program sends a fault notification to the control device.

The fault notification indicates that the memory channel link in the memory channel to be checked has failed.

Step 408: The BIOS program sends a query failure notification to the control device.

The query failure notification indicates that this query for the fault detection result failed.

Step 409: The control device determines whether it has obtained a fault detection result for the memory channel link in the memory channel to be checked. If no fault detection result was obtained, the flow returns to step 401; if a fault detection result was obtained, step 410 is performed.

Step 410: The control device determines whether the memory channel link in the memory channel to be checked has failed. If the memory channel link has failed, step 411 is performed; if it has not, the flow returns to step 401.

Step 411: The control device records the memory channel link failure event in a log.

Optionally, the control device may also output the memory channel link failure event for display on a user interface. When both the memory controller and the memory module support automatic stopping of CRC transmission checking, if the memory channel link fails, both the memory controller and the memory module stop CRC transmission checking automatically, so there is no need to enter the fault detection end process.
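The flow of FIG. 4 can be condensed into two small functions, one for the BIOS side and one for the control device side. This is a sketch only: the function names and result strings are hypothetical, and elapsed time is simulated by adding a fixed poll interval rather than reading a real clock.

```python
from typing import Callable


def bios_query_result(channel_idle: bool,
                      controller_stopped: bool,
                      module_stopped: bool) -> str:
    """BIOS side, steps 403-408."""
    if not channel_idle:
        return "query_failed"   # step 408: channel busy
    if controller_stopped or module_stopped:
        return "fault"          # step 407: a checker stopped automatically
    return "no_fault"           # step 406


def run_detection(poll: Callable[[], str],
                  duration_threshold_s: float,
                  poll_interval_s: float) -> str:
    """Control device side, steps 401-411, using simulated time."""
    elapsed_s = 0.0
    while elapsed_s < duration_threshold_s:   # step 401
        result = poll()                       # steps 402-408
        if result == "fault":                 # steps 409-410
            return "link_fault_logged"        # step 411
        # "no_fault" and "query_failed" both lead back to step 401.
        elapsed_s += poll_interval_s
    return "enter_end_process"                # duration threshold reached
```

Treating a stop by either the memory controller or the memory module as a fault mirrors steps 404 and 405, which both lead to step 407.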

FIG. 5 is a schematic flowchart of the fault detection end process provided by an embodiment of this application. As shown in FIG. 5, the flow includes:

Step 501: The control device sends the BIOS program a CRC transmission check end command for a memory channel.

Optionally, when the duration of CRC transmission checking for a memory channel reaches the duration threshold, or when it is determined that the memory channel link in a memory channel has failed, the control device may send the BIOS program a CRC transmission check end command for that memory channel.

Step 502: The BIOS program detects whether the memory channel is idle. If the memory channel is idle, step 503 is performed; if it is busy, step 505 is performed.

After the BIOS program receives the CRC transmission check end command, if the time the memory channel spends reading and writing memory exceeds the preset timeout, the BIOS program determines that the memory channel is busy.

Step 503: The BIOS program configures the memory controller to stop performing CRC transmission checks on the memory channel link in the memory channel.

In step 503, the BIOS program may configure the memory controller to stop performing CRC transmission checks in the memory-write direction on the memory channel link in the memory channel. The BIOS program may also determine whether the memory module in the memory channel is capable of computing CRC transmission check codes. If it is, the BIOS program may further configure the memory controller to write a command to the memory module, so that the memory module stops performing CRC transmission checks in the memory-read direction on the memory channel link in the memory channel.

Step 504: The BIOS program sends the control device a notification that CRC transmission checking was disabled successfully.

This notification indicates that CRC transmission checking for the memory channel was disabled successfully.

Step 505: The BIOS program sends the control device a notification that disabling CRC transmission checking failed.

This notification indicates that disabling CRC transmission checking for the memory channel failed.

Step 506: The control device determines whether CRC transmission checking for the memory channel was disabled successfully. If it was, the flow ends; if it was not, the flow returns to step 501.
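The retry loop formed by steps 501 to 506 can be sketched as below. The function name and callbacks are hypothetical, and the `max_attempts` bound is an assumption added so that the sketch always terminates; the flow as described simply returns to step 501 after each failure.

```python
from typing import Callable, Optional


def end_crc_check(channel_is_idle: Callable[[], bool],
                  stop_checking: Callable[[], None],
                  max_attempts: int = 5) -> Optional[int]:
    """Return the attempt number on which checking was disabled, or None.

    max_attempts is an illustrative safety bound, not part of the source flow.
    """
    for attempt in range(1, max_attempts + 1):   # step 501: send end command
        if channel_is_idle():                    # step 502
            stop_checking()                      # step 503
            return attempt                       # steps 504 and 506: success
        # Step 505: disabling failed because the channel is busy; retry.
    return None
```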

The order of the steps of the memory channel fault detection method provided by the embodiments of this application may be adjusted as appropriate, and steps may be added or removed as circumstances require. Any variation that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application.

In summary, in the memory channel fault detection method provided by the embodiments of this application, the control device can determine, from the memory controller's ECC error statistics for a memory channel, whether errors occur frequently on that memory channel. If errors occur frequently on a memory channel, the control device can trigger the CRC transmission check flow for the memory channel link in that memory channel to determine whether the link has failed, and can thereby establish whether the fault in the memory channel lies on the memory channel link or on a memory module. The fault location in the memory channel does not have to be tracked down manually, which improves fault localization efficiency. In addition, providing an end mechanism for the memory controller's CRC transmission checking allows the memory controller to stop CRC transmission checking promptly while the memory system is running, which avoids a sustained reduction in memory system performance. Memory channel faults are thus located with as little impact on memory system performance as possible, which improves the availability of the memory system. When a memory channel fails, the control device can output the specific location of the fault. Providing operations and maintenance personnel with more accurate alarm information makes it easier for them to repair the fault, which shortens fault recovery time and improves system availability.

FIG. 6 is a schematic structural diagram of a control device provided by an embodiment of this application. As shown in FIG. 6, the control device 600 includes:

An obtaining module 601, configured to obtain the memory controller's ECC error statistics for a memory channel, where the ECC error statistics reflect the number of ECC errors that have occurred on the memory channel.

A control module 602, configured to, when the number of ECC errors that have occurred on the memory channel exceeds the first count threshold, control the memory controller to perform cyclic redundancy check (CRC) transmission checks on the memory channel link in the memory channel, to determine whether the memory channel link has failed.

Optionally, the ECC error statistics include the number of ECC errors that have occurred on the memory channel, or a status indication that this number exceeds the first count threshold.

Optionally, the memory controller performing CRC transmission checks on the memory channel link in the memory channel includes: the memory controller performing CRC transmission checks on the memory channel link in the memory-write direction; and/or, where the memory module in the memory channel is capable of computing CRC transmission check codes, the memory controller controlling the memory module to perform CRC transmission checks on the memory channel link in the memory-read direction.

Optionally, the control module 602 is further configured to control the memory controller to stop performing CRC transmission checks on the memory channel link in response to the memory channel link failing, or to the duration for which the memory controller has performed CRC transmission checks on the memory channel link reaching the duration threshold.

Optionally, as shown in FIG. 7, the control device 600 further includes an output module 603.

Optionally, the output module 603 is configured to output a first fault detection result after it is determined that the memory channel link has failed, where the first fault detection result indicates that the memory channel link has failed.

Optionally, the ECC error statistics include the memory addresses in the memory channel at which ECC errors occurred. The output module 603 is configured to output a second fault detection result after it is determined that the memory channel link has not failed, where the second fault detection result indicates that the memory module corresponding to those memory addresses has failed.

Optionally, the output module 603 is configured to output a fault indication when the number of ECC errors on the memory channel exceeds the first number threshold, where the fault indication indicates that the memory channel is faulty. The control module 602 is configured to control the memory controller to perform a CRC transmission check on the memory channel link in response to a memory channel link diagnosis command for the memory channel.

Optionally, the memory channel link diagnosis command includes a second number threshold and/or a duration threshold. The second number threshold is used as follows: when the number of CRC transmission check errors on the memory channel link exceeds the second number threshold, the memory channel link is determined to be faulty. The duration threshold is the maximum allowed duration of the CRC transmission check on the memory channel link.
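A diagnosis command that carries either threshold, both, or neither could be modeled as below. This is a sketch under assumptions: the class and field names are hypothetical, and falling back to default values when a threshold is omitted is an assumption of this example, not something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical defaults, used only when the command omits a threshold.
DEFAULT_SECOND_NUMBER_THRESHOLD = 10
DEFAULT_DURATION_THRESHOLD_S = 60.0


@dataclass(frozen=True)
class LinkDiagnosisCommand:
    channel_id: int
    second_number_threshold: Optional[int] = None  # CRC error count limit
    duration_threshold_s: Optional[float] = None   # maximum check duration


def effective_thresholds(cmd: LinkDiagnosisCommand) -> Tuple[int, float]:
    """Resolve the thresholds that the CRC transmission check will use."""
    count = cmd.second_number_threshold
    if count is None:
        count = DEFAULT_SECOND_NUMBER_THRESHOLD
    duration = cmd.duration_threshold_s
    if duration is None:
        duration = DEFAULT_DURATION_THRESHOLD_S
    return count, duration


def link_is_faulty(crc_error_count: int, second_number_threshold: int) -> bool:
    """The link is judged faulty once the error count exceeds the threshold."""
    return crc_error_count > second_number_threshold
```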

As for the apparatus in the foregoing embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.

FIG. 8 is a block diagram of a control device provided by an embodiment of the present application. As shown in FIG. 8, the control device 800 includes a processor 801 and a memory 802.

The memory 802 is configured to store a computer program, the computer program including program instructions.

The processor 801 is configured to invoke the computer program to implement the steps performed by the control device in the foregoing method embodiments.

The processor 801 includes one or more processing cores and, by running the computer program, executes various functional applications and performs data processing.

The memory 802 may be used to store the computer program. Optionally, the memory may store an operating system and an application program unit required by at least one function. The operating system may be a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS, OS X, or a similar operating system.

There may be multiple communication interfaces 804, which are used to communicate with other storage devices or network devices. For example, in the embodiments of the present application, the communication interface 804 may be used to communicate with a substrate or a processor package.

The memory 802 and the communication interface 804 are each connected to the processor 801 through the communication bus 803.

An embodiment of the present application provides a memory system that includes a memory controller and a memory channel. The memory controller is configured to perform ECC checks on the memory channel and to record the number of ECC errors that occur on the memory channel. The memory controller is further configured to perform a CRC transmission check on the memory channel link in the memory channel after receiving a CRC transmission check command for the memory channel. The CRC transmission check command may be triggered by the control device instructing the BIOS program.

Optionally, the memory controller is further configured to record a status indication after the number of ECC errors on the memory channel exceeds the first number threshold, where the status indication indicates that the number of ECC errors on the memory channel exceeds the first number threshold. The memory controller may use a status register to record whether the number of ECC errors on the memory channel exceeds the first number threshold.
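The counter and status indication described here can be sketched in software as follows. The class `EccStatusRegister` and its attributes are hypothetical names; a real memory controller would expose this state through hardware registers whose layout the text does not specify.

```python
class EccStatusRegister:
    """Software model of an ECC error counter with a threshold status bit."""

    def __init__(self, first_number_threshold: int) -> None:
        self.first_number_threshold = first_number_threshold
        self.ecc_error_count = 0
        self.threshold_exceeded = False  # the status indication

    def record_ecc_error(self) -> None:
        """Count one ECC error and latch the status bit once the limit is passed."""
        self.ecc_error_count += 1
        if self.ecc_error_count > self.first_number_threshold:
            self.threshold_exceeded = True
```

With this model the control device can read either the raw count or only the status bit, which matches the two forms of ECC error statistics described earlier.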

Optionally, the memory controller performing a CRC transmission check on the memory channel link in the memory channel includes: the memory controller performs a CRC transmission check on the memory channel link in the memory-write direction; and/or, where the memory module in the memory channel is capable of calculating CRC transmission check codes, the memory controller controls the memory module to perform a CRC transmission check on the memory channel link in the memory-read direction.
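As an illustration of what a CRC transmission check involves, the sketch below computes an 8-bit check code over a payload and compares it with the code that accompanied the transfer. The polynomial 0x07 (x^8 + x^2 + x + 1) and the byte-wise, most-significant-bit-first processing are chosen for illustration; the bit ordering and burst mapping that the JEDEC specifications prescribe are not reproduced. The function names are hypothetical.

```python
def crc8(data: bytes, poly: int = 0x07, init: int = 0x00) -> int:
    """Bitwise CRC-8, most significant bit first, no reflection, no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def transfer_ok(payload: bytes, received_code: int) -> bool:
    """The receiver recomputes the code and compares it with the one it received."""
    return crc8(payload) == received_code
```

In the write direction the sender would be the memory controller and the receiver the memory module; in the read direction the roles are reversed, which is why the module must be able to calculate check codes.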

Optionally, the memory controller is further configured to record the number of CRC transmission check errors that occur on the memory channel link. The memory controller is further configured to stop performing the CRC transmission check on the memory channel link after the number of CRC transmission check errors on the memory channel link exceeds the second number threshold, or after the duration of the CRC transmission check on the memory channel link reaches the duration threshold.
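The two stop conditions can be sketched as a loop over checked transfers. This is a simplified model: `run_crc_check` and its return labels are hypothetical, and each event is assumed to carry the elapsed time since the check started together with the outcome of that transfer's CRC comparison.

```python
from typing import Iterable, Tuple


def run_crc_check(
    events: Iterable[Tuple[float, bool]],
    second_number_threshold: int,
    duration_threshold_s: float,
) -> Tuple[str, int]:
    """Consume (elapsed_seconds, crc_ok) events until a stop condition is met.

    Returns the stop reason and the number of CRC errors counted so far.
    """
    crc_errors = 0
    for elapsed_s, crc_ok in events:
        if elapsed_s >= duration_threshold_s:
            # Maximum allowed check duration reached: stop checking.
            return "duration_reached", crc_errors
        if not crc_ok:
            crc_errors += 1
            if crc_errors > second_number_threshold:
                # Error count exceeded the threshold: the link is judged faulty.
                return "link_faulty", crc_errors
    # No more transfers were observed before either threshold was hit.
    return "no_more_traffic", crc_errors
```

For example, three failed transfers against a threshold of 2 stop the check with `"link_faulty"`, while the same traffic with a shorter duration threshold stops it earlier with `"duration_reached"`.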

Optionally, the memory controller is configured to perform a CRC transmission check on the memory channel link in the memory-write direction. The memory module is configured to record the number of CRC transmission check errors that occur on the memory channel link in the memory-write direction and, when that number exceeds the second number threshold, to stop checking the CRC transmission check codes received over the memory channel link.

The memory system provided in the embodiments of the present application may use memory modules conforming to the DDR4 specification (the JEDEC JESD79-4 standard), hereinafter DDR4 memory modules, or memory modules conforming to the DDR5 specification (the JEDEC JESD79-5 standard), hereinafter DDR5 memory modules. A memory module may be, for example, a registered dual in-line memory module (registered DIMM, RDIMM). Memory modules conforming to either the DDR4 or the DDR5 specification support the technical solution of the present application. Note that the memory modules usable in the memory system provided by the embodiments of the present application include, but are not limited to, memory modules conforming to the DDR4 or DDR5 specification.

A DDR4 memory module can perform a CRC transmission check on memory write operations but cannot calculate CRC transmission check codes for memory read operations. When a CRC transmission check error occurs during a memory write, the DDR4 memory module can notify the memory controller of the error by pulling the ALERT_n pin low. According to the operating states of a DDR4 memory module, once the module has entered the idle state and is not performing read or write operations, it can accept mode register set commands at any time, so the CRC transmission check function can be enabled and disabled.

A DDR5 memory module can perform a CRC transmission check on memory write operations and can also calculate CRC transmission check codes for memory read operations. In a memory system that uses DDR5 memory modules, the CRC transmission check can therefore be performed in both the read and write directions. According to the operating states of a DDR5 memory module, once the module has entered the idle state and is not performing read or write operations, it can accept mode register set commands at any time, so the CRC transmission check function can be enabled and disabled.
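The capability difference between the two module types determines which directions a link diagnosis can cover. A minimal lookup, with the table and function names chosen for illustration only:

```python
from typing import Dict, List

# CRC transmission check capability per module type, as described in the text.
CRC_CAPABILITIES: Dict[str, Dict[str, bool]] = {
    "DDR4": {"write": True, "read": False},
    "DDR5": {"write": True, "read": True},
}


def checkable_directions(module_type: str) -> List[str]:
    """Return the transfer directions in which a CRC transmission check can run."""
    caps = CRC_CAPABILITIES[module_type]
    return [direction for direction in ("write", "read") if caps[direction]]
```

Other module types would need their own entries, since the text notes that usable modules are not limited to DDR4 and DDR5.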

An embodiment of the present application further provides a computer system that includes the control device shown in any one of FIG. 6 to FIG. 8 and the memory system described above. The basic architecture of the computer system may be as shown in FIG. 1.

An embodiment of the present application further provides a computer-readable storage medium storing instructions that, when executed by a processor, implement the steps performed by the control device in the foregoing method embodiments.

An embodiment of the present application further provides a computer program product that includes a computer program which, when executed by a processor, implements the steps performed by the control device in the foregoing method embodiments.

A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

In the embodiments of the present application, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.

The term "and/or" in this application merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" in this document generally indicates an "or" relationship between the objects before and after it.

The foregoing describes only optional embodiments of the present application and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the concept and principles of the present application shall fall within its scope of protection.

Claims (25)

1. A method for detecting a memory channel failure, the method being applied to a control device, the method comprising:
obtaining error checking and correction (ECC) error statistics collected by a memory controller for a memory channel, wherein the ECC error statistics reflect the number of ECC errors that have occurred on the memory channel;
and when the number of ECC errors on the memory channel exceeds a first number threshold, controlling the memory controller to perform a cyclic redundancy check (CRC) transmission check on a memory channel link in the memory channel, so as to determine whether the memory channel link is faulty.
2. The method of claim 1, wherein the ECC error statistics include the number of ECC errors that have occurred on the memory channel or a status indication that the number of ECC errors on the memory channel exceeds the first number threshold.
3. The method according to claim 1 or 2, wherein the memory controller performs cyclic redundancy check, CRC, transmission check on a memory channel link in the memory channel, comprising:
the memory controller performs CRC transmission check of the writing memory direction on the memory channel link;
and/or, the memory module in the memory channel has the capability of calculating CRC transmission check codes, and the memory controller controls the memory module to carry out CRC transmission check in the memory reading direction on the memory channel link.
4. A method according to any one of claims 1 to 3, wherein the method further comprises:
controlling the memory controller to stop performing the CRC transmission check on the memory channel link in response to a failure of the memory channel link, or when the duration of the CRC transmission check performed by the memory controller on the memory channel link reaches a duration threshold.
5. The method according to any one of claims 1 to 4, further comprising:
and after determining that the memory channel link fails, outputting a first failure detection result, wherein the first failure detection result indicates that the memory channel link fails.
6. The method of any of claims 1-4, wherein the ECC error statistics include a memory address in the memory channel where an ECC error occurred, the method further comprising:
and after determining that the memory channel link has not failed, outputting a second failure detection result, wherein the second failure detection result indicates that the memory module corresponding to the memory address has failed.
7. The method according to any one of claims 1 to 6, further comprising:
outputting a fault indication when the number of ECC errors on the memory channel exceeds the first number threshold, wherein the fault indication indicates that the memory channel is faulty;
wherein the controlling the memory controller to perform a cyclic redundancy check (CRC) transmission check on a memory channel link in the memory channel includes:
in response to a memory channel link diagnosis command for the memory channel, controlling the memory controller to perform the CRC transmission check on the memory channel link.
8. The method of claim 7, wherein the memory channel link diagnosis command includes a second number threshold and/or a duration threshold, wherein the second number threshold is used as follows: when the number of CRC transmission check errors on the memory channel link exceeds the second number threshold, the memory channel link is determined to have failed; and the duration threshold is the maximum allowed duration of the CRC transmission check on the memory channel link.
9. A control apparatus, characterized in that the control apparatus comprises:
an acquisition module, configured to obtain error checking and correction (ECC) error statistics collected by a memory controller for a memory channel, wherein the ECC error statistics reflect the number of ECC errors that have occurred on the memory channel;
and a control module, configured to control the memory controller to perform a cyclic redundancy check (CRC) transmission check on a memory channel link in the memory channel when the number of ECC errors on the memory channel exceeds a first number threshold, so as to determine whether the memory channel link is faulty.
10. The control device of claim 9, wherein the ECC error statistics include the number of ECC errors that have occurred on the memory channel or a status indication that the number of ECC errors on the memory channel exceeds the first number threshold.
11. The control device according to claim 9 or 10, wherein the memory controller performs Cyclic Redundancy Check (CRC) transmission check on a memory channel link in the memory channel, comprising:
the memory controller performs CRC transmission check of the writing memory direction on the memory channel link;
and/or, the memory module in the memory channel has the capability of calculating CRC transmission check codes, and the memory controller controls the memory module to carry out CRC transmission check in the memory reading direction on the memory channel link.
12. The control device according to any one of claims 9 to 11, characterized in that,
the control module is further configured to control the memory controller to stop performing the CRC transmission check on the memory channel link in response to a failure of the memory channel link, or when the duration of the CRC transmission check performed by the memory controller on the memory channel link reaches a duration threshold.
13. The control apparatus according to any one of claims 9 to 12, characterized in that the control apparatus further comprises:
and the output module is used for outputting a first fault detection result after determining that the memory channel link breaks down, wherein the first fault detection result indicates that the memory channel link breaks down.
14. The control apparatus according to any one of claims 9 to 12, wherein the ECC error statistic information includes a memory address in the memory channel where an ECC error occurred, the control apparatus further comprising:
and the output module is used for outputting a second fault detection result after determining that the memory channel link does not fail, wherein the second fault detection result indicates that the memory module corresponding to the memory address fails.
15. The control apparatus according to any one of claims 9 to 14, characterized in that the control apparatus further comprises:
an output module, configured to output a fault indication when the number of ECC errors on the memory channel exceeds the first number threshold, wherein the fault indication indicates that the memory channel is faulty;
the control module is configured to control the memory controller to perform CRC transmission check on the memory channel link in response to a memory channel link diagnosis command for the memory channel.
16. The control device of claim 15, wherein the memory channel link diagnosis command includes a second number threshold and/or a duration threshold, wherein the second number threshold is used as follows: when the number of CRC transmission check errors on the memory channel link exceeds the second number threshold, the memory channel link is determined to have failed; and the duration threshold is the maximum allowed duration of the CRC transmission check on the memory channel link.
17. A control apparatus, characterized by comprising: a processor and a memory;
the memory is used for storing a computer program, and the computer program comprises program instructions;
the processor is configured to invoke the computer program to implement the memory channel failure detection method according to any one of claims 1 to 8.
18. A memory system, comprising: a memory controller and a memory channel;
the memory controller is configured to perform error checking and correction (ECC) checks on the memory channel and to record the number of ECC errors that occur on the memory channel;
the memory controller is further configured to perform a CRC transmission check on a memory channel link in the memory channel after receiving a CRC transmission check command for the memory channel.
19. The memory system of claim 18, wherein,
the memory controller is further configured to record a status indication after the number of ECC errors on the memory channel exceeds a first number threshold, wherein the status indication indicates that the number of ECC errors on the memory channel exceeds the first number threshold.
20. The memory system according to claim 18 or 19, wherein the memory controller performs a cyclic redundancy check, CRC, transmission check on a memory channel link in the memory channel, comprising:
the memory controller performs CRC transmission check of the writing memory direction on the memory channel link;
and/or, the memory module in the memory channel has the capability of calculating CRC transmission check codes, and the memory controller controls the memory module to carry out CRC transmission check in the memory reading direction on the memory channel link.
21. The memory system according to any one of claims 18 to 20, wherein,
the memory controller is further configured to record the number of times that the CRC transmission check error occurs on the memory channel link;
the memory controller is further configured to stop performing the CRC transmission check on the memory channel link after the number of CRC transmission check errors on the memory channel link exceeds a second number threshold, or after the duration of the CRC transmission check on the memory channel link reaches a duration threshold.
22. The memory system of claim 20, wherein,
the memory controller is used for performing CRC transmission check of the writing memory direction on the memory channel link;
the memory module is configured to record the number of CRC transmission check errors that occur on the memory channel link in the memory-write direction, and to stop checking the CRC transmission check codes received over the memory channel link when that number exceeds a second number threshold.
23. A computer system, comprising: a control apparatus as claimed in any one of claims 9 to 17 and a memory system as claimed in any one of claims 18 to 22.
24. A computer readable storage medium having instructions stored thereon which, when executed by a processor, implement the memory channel failure detection method of any of claims 1 to 8.
25. A computer program product comprising a computer program which, when executed by a processor, implements the memory channel failure detection method of any of claims 1 to 8.
CN202111364493.3A 2021-11-17 2021-11-17 Memory channel fault detection method and device, memory system and computer system Pending CN116136805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111364493.3A CN116136805A (en) 2021-11-17 2021-11-17 Memory channel fault detection method and device, memory system and computer system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111364493.3A CN116136805A (en) 2021-11-17 2021-11-17 Memory channel fault detection method and device, memory system and computer system

Publications (1)

Publication Number Publication Date
CN116136805A true CN116136805A (en) 2023-05-19

Family

ID=86333093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111364493.3A Pending CN116136805A (en) 2021-11-17 2021-11-17 Memory channel fault detection method and device, memory system and computer system

Country Status (1)

Country Link
CN (1) CN116136805A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120723562A (en) * 2025-08-25 2025-09-30 苏州元脑智能科技有限公司 Test link anomaly location device, system, method, equipment, medium and program
CN120723562B (en) * 2025-08-25 2025-11-21 苏州元脑智能科技有限公司 Test link abnormality positioning device, system, method, apparatus, medium, and program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination