CN105893166A - Method and device for processing memory errors - Google Patents
- Publication number
- CN105893166A (application number CN201610286680.7A)
- Authority
- CN
- China
- Prior art keywords
- isolated
- error address
- memory
- error
- log file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0727—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0787—Storage of error reports, e.g. persistent data storage, storage using memory protection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1008—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
- G06F11/1012—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using codes or arrangements adapted for a specific type of error
- G06F11/1016—Error in accessing a memory location, i.e. addressing error
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
Technical Field
The invention relates to the field of computer technology, and in particular to a method and device for processing memory errors.
Background
Memory is one of the most important components of a computer; it is the bridge through which the rest of the system communicates with the CPU. Every program in a computer runs in memory, so memory performance has a large effect on the computer as a whole. How memory errors are handled when they occur is therefore very important.
In the prior art, memory is monitored to detect memory errors, the errors are displayed, and the user is notified so that the user can deal with them, for example by replacing the memory module.
As the above description shows, the prior art cannot handle memory errors automatically, so memory errors accumulate in the system and system stability is low.
Summary of the Invention
Embodiments of the present invention provide a method and device for processing memory errors that can improve system stability.
In one aspect, an embodiment of the present invention provides a method for processing memory errors, comprising:
setting an isolation condition in advance;
S1: obtaining a log file in which memory errors are recorded, and extracting from the log file the error addresses corresponding to the memory errors;
S2: determining the error addresses to be isolated, that is, those that satisfy the isolation condition;
S3: isolating the error addresses to be isolated.
Further, the log file includes one or more of: an ECC (Error Correcting Code) log file, an MCA (MicroChannel Architecture) log file, and an MCE (Machine Check Event) log file.
Further, the method also includes: setting in advance a threshold for the number of memory errors;
the isolation condition includes: the number of memory errors is greater than or equal to the threshold;
S2 includes:
determining, from the log file, the number of memory errors that have occurred at each error address;
judging whether the number of memory errors at the current error address is greater than or equal to the threshold; if so, determining that the current error address is an error address to be isolated; otherwise, determining that it is not.
Further, S3 includes:
in the operating system, saving the error addresses to be isolated;
in the BIOS stage, obtaining the saved error addresses to be isolated and isolating them.
Further, S3 includes:
in the operating system, determining the memory page corresponding to each error address to be isolated, and isolating that memory page;
in the BIOS stage, prohibiting the operating system from using the memory area corresponding to the isolated memory page.
In another aspect, an embodiment of the present invention provides a device for processing memory errors, comprising:
a first setting unit, configured to set an isolation condition;
an extraction unit, configured to obtain a log file in which memory errors are recorded and to extract from the log file the error addresses corresponding to the memory errors;
a determining unit, configured to determine the error addresses to be isolated, that is, those that satisfy the isolation condition;
an isolation unit, configured to isolate the error addresses to be isolated.
Further, the log file includes one or more of: an ECC log file, an MCA log file, and an MCE log file.
Further, the device also includes: a second setting unit, configured to set a threshold for the number of memory errors;
the isolation condition includes: the number of memory errors is greater than or equal to the threshold;
the determining unit is configured to determine, from the log file, the number of memory errors that have occurred at each error address, and to judge whether the number of memory errors at the current error address is greater than or equal to the threshold; if so, it determines that the current error address is an error address to be isolated; otherwise, it determines that it is not.
Further, the isolation unit is configured to save the error addresses to be isolated in the operating system and, in the BIOS stage, to obtain the saved error addresses and isolate them.
Further, the isolation unit is configured to determine, in the operating system, the memory page corresponding to each error address to be isolated and to isolate that memory page, and, in the BIOS stage, to prohibit the operating system from using the memory area corresponding to the isolated memory page.
In the embodiments of the present invention, the error addresses corresponding to memory errors are extracted from the log file, and the error addresses that satisfy the isolation condition are isolated. Isolating these addresses reduces the number of memory errors in the system and improves system stability.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for processing memory errors provided by an embodiment of the present invention;
FIG. 2 is a flowchart of another method for processing memory errors provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a device for processing memory errors provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of another device for processing memory errors provided by an embodiment of the present invention.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions are described below clearly and completely with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a method for processing memory errors. The method may include the following steps:
S0: setting an isolation condition in advance;
S1: obtaining a log file in which memory errors are recorded, and extracting from the log file the error addresses corresponding to the memory errors;
S2: determining the error addresses to be isolated, that is, those that satisfy the isolation condition;
S3: isolating the error addresses to be isolated.
In this embodiment, the error addresses corresponding to memory errors are extracted from the log file, and the error addresses that satisfy the isolation condition are isolated. Isolating these addresses reduces the number of memory errors in the system and improves system stability.
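Step S1 can be illustrated with a minimal sketch. The patent does not specify the layout of the log file, so the sketch assumes that each error record contains the token `ADDR` followed by a hexadecimal physical address; the sample log text, the pattern, and the function name `extract_error_addresses` are all hypothetical.

```python
import re

# Assumption: every error record carries the token "ADDR" followed by a
# hexadecimal address, with or without a "0x" prefix. Real logs may differ.
ADDR_PATTERN = re.compile(r"\bADDR\s+(?:0x)?([0-9a-fA-F]+)\b")


def extract_error_addresses(log_text):
    """Return every error address found in the log text, in order of appearance."""
    return [int(match.group(1), 16) for match in ADDR_PATTERN.finditer(log_text)]


# Hypothetical sample log used only for illustration.
SAMPLE_LOG = """\
MCE 0
CPU 0 BANK 8
MISC 640738dd0009159c ADDR 96fcc7c00
MCE 1
CPU 2 BANK 8
MISC 640738dd0009159c ADDR 0x1f400
"""

addresses = extract_error_addresses(SAMPLE_LOG)
```

Repeated addresses are kept rather than deduplicated, because a later step needs the number of errors per address.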
In an embodiment of the present invention, the log file includes one or more of: an ECC log file, an MCA log file, and an MCE log file.
In an embodiment of the present invention, the method further includes: setting in advance a threshold for the number of memory errors;
the isolation condition includes: the number of memory errors is greater than or equal to the threshold;
S2 includes:
determining, from the log file, the number of memory errors that have occurred at each error address;
judging whether the number of memory errors at the current error address is greater than or equal to the threshold; if so, determining that the current error address is an error address to be isolated; otherwise, determining that it is not.
When the number of memory errors at an error address is greater than or equal to the threshold, that address is error-prone and needs to be isolated. The threshold may be, for example, 1 or 2.
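The threshold test in S2 amounts to counting occurrences per address and keeping the addresses whose count reaches the threshold. The following is a minimal sketch of that logic; the function name `addresses_to_isolate` and the example addresses are illustrative and not taken from the patent.

```python
from collections import Counter


def addresses_to_isolate(error_addresses, threshold):
    """Return, in ascending order, the addresses with at least `threshold` errors."""
    counts = Counter(error_addresses)  # number of recorded errors per address
    return sorted(addr for addr, count in counts.items() if count >= threshold)


# Hypothetical error addresses as they might be extracted from a log file.
observed = [0x1000, 0x2000, 0x1000, 0x3000, 0x1000, 0x2000]
to_isolate = addresses_to_isolate(observed, threshold=2)
```

With a threshold of 1, every address that has ever produced an error is selected; a higher threshold tolerates isolated one-off errors.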
To keep the operating system running normally, in an embodiment of the present invention, S3 includes:
in the operating system, saving the error addresses to be isolated;
in the BIOS stage, obtaining the saved error addresses to be isolated and isolating them.
In this embodiment, the error addresses are not isolated while the operating system is running, which keeps the operating system running normally. At the next boot, the addresses are isolated in the BIOS stage, which ensures that the operating system does not use them after startup and improves system stability. For example, the operating system can save the error addresses to be isolated in a preset file, and the BIOS stage can read that file to obtain them.
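The hand-off through a preset file can be sketched as a simple round trip. The file format is an assumption (one hexadecimal address per line); the patent only says that a preset file is used. The reader function is a Python stand-in for whatever the BIOS-stage code does, since firmware would not run Python, and both function names are hypothetical.

```python
import os
import tempfile


def save_pending_addresses(path, addresses):
    """OS side: write each distinct address on its own line, in hexadecimal."""
    with open(path, "w", encoding="ascii") as f:
        for addr in sorted(set(addresses)):
            f.write(f"{addr:#x}\n")


def load_pending_addresses(path):
    """Stand-in for the BIOS-stage reader: parse the file back into integers."""
    with open(path, "r", encoding="ascii") as f:
        return [int(line, 16) for line in f if line.strip()]


# Round trip through a temporary file (the file name is illustrative).
with tempfile.TemporaryDirectory() as tmp_dir:
    preset_file = os.path.join(tmp_dir, "pending_isolation.txt")
    save_pending_addresses(preset_file, [0x2000, 0x1000, 0x2000])
    restored = load_pending_addresses(preset_file)
```

A plain text format keeps the file easy to inspect by hand; any format that both sides agree on would serve the same purpose.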
In an embodiment of the present invention, S3 includes:
in the operating system, determining the memory page corresponding to each error address to be isolated, and isolating that memory page;
in the BIOS stage, prohibiting the operating system from using the memory area corresponding to the isolated memory page.
In this embodiment, isolation is performed in units of memory pages. Within the operating system, page isolation can be used to isolate the memory pages. In the BIOS stage, the error addresses are isolated by prohibiting use of the memory area that corresponds to those pages.
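Mapping an error address to its memory page is a shift by the page size. The sketch below assumes 4 KiB pages (`PAGE_SHIFT = 12`), which the patent does not state; the function names are hypothetical.

```python
PAGE_SHIFT = 12            # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def page_frame_number(address):
    """Index of the page that contains the given physical address."""
    return address >> PAGE_SHIFT


def page_range(address):
    """Half-open byte range [start, end) of the page containing the address."""
    start = page_frame_number(address) << PAGE_SHIFT
    return start, start + PAGE_SIZE


def pages_to_isolate(addresses):
    """Distinct page frame numbers covering all given addresses, ascending."""
    return sorted({page_frame_number(addr) for addr in addresses})
```

Several error addresses inside the same page collapse into one page to isolate, so the number of isolated pages can be smaller than the number of error addresses.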
As shown in FIG. 2, an embodiment of the present invention provides a method for processing memory errors. The method may include the following steps:
Step 201: Set in advance a threshold for the number of memory errors.
Step 202: Obtain a log file in which memory errors are recorded, and extract from the log file the error addresses corresponding to the memory errors.
The log file may be one generated when the operating system detects system errors, for example an ECC log file, an MCA log file, or an MCE log file.
Step 203: From the log file, determine the number of memory errors that have occurred at each error address.
Specifically, the number of memory errors at each error address is counted from the memory error records in the log file.
Step 204: Judge whether the number of memory errors at the current error address is greater than or equal to the threshold. If so, go to step 205; otherwise, go to step 206.
Step 205: Determine that the current error address is an error address to be isolated.
Step 206: Determine that the current error address is not an error address to be isolated.
Step 207: In the operating system, save the error addresses to be isolated in a preset file.
The preset file may be a txt file.
Step 208: After a restart, in the BIOS stage, obtain the error addresses to be isolated from the preset file and isolate them.
Specifically, isolation can be achieved by prohibiting use of the memory area corresponding to each error address to be isolated.
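One way to picture "prohibiting use of the memory area" is as cutting the isolated regions out of the list of usable memory ranges before that list is handed to the operating system. The following is a minimal sketch under that assumption; the patent does not describe the firmware mechanism, and the function name and the example ranges are hypothetical.

```python
def exclude_ranges(usable, blocked):
    """Remove blocked ranges from usable ranges.

    Both arguments are lists of half-open (start, end) byte ranges.
    Returns the parts of the usable ranges not covered by any blocked range.
    """
    blocked = sorted(blocked)
    result = []
    for start, end in sorted(usable):
        cursor = start  # first byte of this usable range not yet handled
        for b_start, b_end in blocked:
            if b_end <= cursor or b_start >= end:
                continue  # blocked range does not touch the remaining part
            if b_start > cursor:
                result.append((cursor, b_start))  # keep the gap before it
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


# Hypothetical memory map: one usable region with two isolated 4 KiB pages.
remaining = exclude_ranges([(0x0, 0x10000)],
                           [(0x3000, 0x4000), (0x8000, 0x9000)])
```

Each isolated page splits a usable range into at most two smaller ranges, so the memory around a bad page stays available.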
An embodiment of the present invention can be implemented with IMS (Intelligent Memory Surveillance):
(1) In the /home/test/iMSLinux directory, enter the command: insmod iMSDrv.ko
This step installs the IMS driver in the operating system.
(2) Run the IMS application.
In the /home/test/iMSLinux directory, enter the following command:
./iMSApp -threshold n -print -mcelog file -interval n -nosign
The command-line parameters are defined as follows:
-threshold n    sets the threshold to n;
-print          outputs the error addresses corresponding to memory errors;
-mcelog file    sets the log file to an MCE log file; the default file is /var/log/mcelog;
-interval n     sets the interval, in seconds, at which the MCE log file is checked; the default is 30;
-nosign         enables iMS real-time monitoring and protection of OS applications; without this parameter iMSApp does not work.
With this command, IMS can set the threshold, select the log file to read, set the interval at which the log file is read, and output the error addresses.
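The option set described above can be mirrored in a small parser, which makes the defaults explicit. This is a hypothetical re-creation in Python, not the source of iMSApp: only the flag names and the two stated defaults come from the text, while making `-threshold` mandatory and the attribute name `print_addresses` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented iMSApp options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="iMSApp")
    # Assumption: the threshold has no documented default, so it is required here.
    parser.add_argument("-threshold", type=int, required=True, metavar="n",
                        help="number of errors at which an address is isolated")
    parser.add_argument("-print", dest="print_addresses", action="store_true",
                        help="output the error addresses")
    parser.add_argument("-mcelog", default="/var/log/mcelog", metavar="file",
                        help="MCE log file to read")
    parser.add_argument("-interval", type=int, default=30, metavar="n",
                        help="seconds between checks of the log file")
    parser.add_argument("-nosign", action="store_true",
                        help="enable real-time monitoring and protection")
    return parser


args = build_parser().parse_args(
    ["-threshold", "2", "-print", "-interval", "10", "-nosign"])
```

Here `-mcelog` was not given, so `args.mcelog` falls back to the documented default path.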
Specifically, iMS uses the log file to monitor error addresses more closely. Once it confirms that an error address is one to be isolated, it isolates the address temporarily right away. This gradually reduces the number of ECC errors reported by the operating system without affecting normal operation. To ensure that memory errors that have already occurred do not recur after a restart, the iMS in the operating system passes the error addresses to be isolated to the BIOS-level iMS. At the next restart, in the BIOS stage, the BIOS-level iMS isolates those addresses and permanently masks them out of the memory available to the system.
Once IMS is installed in the operating system and enabled, it runs entirely automatically in the background and requires no user intervention.
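Background monitoring of a growing log file is often done by remembering how far the file has been read and fetching only what was appended since the last check. The sketch below shows that idea; it is an assumption about how such a monitor could work, not a description of iMS, and the class name `LogFollower` is hypothetical.

```python
import os
import tempfile


class LogFollower:
    """Reads only the text appended to a log file since the previous check."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # number of bytes already consumed

    def read_new(self):
        with open(self.path, "rb") as f:
            f.seek(0, os.SEEK_END)
            if f.tell() < self.offset:
                self.offset = 0  # file shrank (truncated or rotated): start over
            f.seek(self.offset)
            data = f.read()
            self.offset = f.tell()
        return data.decode("utf-8", errors="replace")


# Demonstration with a temporary file standing in for the log.
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "errors.log")
    with open(log_path, "wb") as f:
        f.write(b"ADDR 1000\n")
    follower = LogFollower(log_path)
    first = follower.read_new()
    with open(log_path, "ab") as f:
        f.write(b"ADDR 2000\n")
    second = follower.read_new()
    third = follower.read_new()
```

A monitoring loop would call `read_new()` once per check interval and pass the returned text to the address extraction and threshold steps.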
The embodiments of the present invention can be implemented on a Linux system.
As shown in FIG. 3 and FIG. 4, an embodiment of the present invention provides a device for processing memory errors. The device embodiment may be implemented in software, in hardware, or in a combination of the two. At the hardware level, FIG. 3 is a hardware structure diagram of the equipment in which the device resides. Besides the processor, memory, network interface, and non-volatile storage shown in FIG. 3, that equipment may generally include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in FIG. 4, the device in the logical sense is formed when the CPU of the equipment reads the corresponding computer program instructions from non-volatile storage into memory and runs them. The device for processing memory errors provided by this embodiment includes:
a first setting unit 401, configured to set an isolation condition;
an extraction unit 402, configured to obtain a log file in which memory errors are recorded and to extract from the log file the error addresses corresponding to the memory errors;
a determining unit 403, configured to determine the error addresses to be isolated, that is, those that satisfy the isolation condition;
an isolation unit 404, configured to isolate the error addresses to be isolated.
In an embodiment of the present invention, the log file includes one or more of: an ECC (Error Correcting Code) log file, an MCA (MicroChannel Architecture) log file, and an MCE (Machine Check Event) log file.
In an embodiment of the present invention, the device further includes: a second setting unit, configured to set a threshold for the number of memory errors;
the isolation condition includes: the number of memory errors is greater than or equal to the threshold;
the determining unit 403 is configured to determine, from the log file, the number of memory errors that have occurred at each error address, and to judge whether the number of memory errors at the current error address is greater than or equal to the threshold; if so, it determines that the current error address is an error address to be isolated; otherwise, it determines that it is not.
In an embodiment of the present invention, the isolation unit 404 is configured to save the error addresses to be isolated in the operating system and, in the BIOS stage, to obtain the saved error addresses and isolate them.
In an embodiment of the present invention, the isolation unit 404 is configured to determine, in the operating system, the memory page corresponding to each error address to be isolated and to isolate that memory page, and, in the BIOS stage, to prohibit the operating system from using the memory area corresponding to the isolated memory page.
The information exchange between the units of the device and their execution process are based on the same concept as the method embodiments of the present invention. For details, refer to the description of the method embodiments, which is not repeated here.
The embodiments of the present invention have at least the following beneficial effects:
1. The error addresses corresponding to memory errors are extracted from the log file, and the error addresses that satisfy the isolation condition are isolated. Isolating these addresses reduces the number of memory errors in the system and improves system stability.
2. Isolating the memory area corresponding to each error address to be isolated reduces system crashes caused by memory errors that result from memory aging, decay, and similar causes.
3. The error addresses are not isolated while the operating system is running, which keeps the operating system running normally. At the next boot, they are isolated in the BIOS stage, which ensures that the operating system does not use them after startup and improves system stability.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include" and "comprise", and any variants of them, are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above are only preferred embodiments of the present invention, used to illustrate its technical solutions and not to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within its scope of protection.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610286680.7A CN105893166A (en) | 2016-04-29 | 2016-04-29 | Method and device for processing memory errors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610286680.7A CN105893166A (en) | 2016-04-29 | 2016-04-29 | Method and device for processing memory errors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105893166A true CN105893166A (en) | 2016-08-24 |
Family
ID=56703241
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610286680.7A Pending CN105893166A (en) | 2016-04-29 | 2016-04-29 | Method and device for processing memory errors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105893166A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109086151A (en) * | 2017-06-13 | 2018-12-25 | 中兴通讯股份有限公司 | The method and device of memory failure is isolated on a kind of server |
CN109753378A (en) * | 2019-01-02 | 2019-05-14 | 浪潮商用机器有限公司 | A memory fault isolation method, device, system and readable storage medium |
CN111143125A (en) * | 2019-12-20 | 2020-05-12 | 浪潮电子信息产业股份有限公司 | A kind of MCE error processing method, device, electronic device and storage medium |
WO2020177493A1 (en) * | 2019-03-01 | 2020-09-10 | 华为技术有限公司 | Memory error processing method and device |
CN112231128A (en) * | 2020-09-11 | 2021-01-15 | 中科可控信息产业有限公司 | Memory error processing method and device, computer equipment and storage medium |
CN112256465A (en) * | 2020-10-22 | 2021-01-22 | 皇虎测试科技(深圳)有限公司 | Method and device for repairing memory bank errors |
WO2021056912A1 (en) * | 2019-09-29 | 2021-04-01 | 苏州浪潮智能科技有限公司 | Method and device for detecting memory downgrade error |
CN113297046A (en) * | 2020-08-03 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Early warning method and device for memory fault |
WO2021185279A1 (en) * | 2020-03-20 | 2021-09-23 | 华为技术有限公司 | Memory failure processing method and related device |
CN113515405A (en) * | 2021-07-09 | 2021-10-19 | 维沃移动通信有限公司 | Address management method and device |
CN115269446A (en) * | 2022-06-22 | 2022-11-01 | 超聚变数字技术有限公司 | A method and apparatus for isolating memory |
CN115328684A (en) * | 2022-06-30 | 2022-11-11 | 超聚变数字技术有限公司 | Memory fault reporting method, BMC and electronic equipment |
CN115470061A (en) * | 2022-10-10 | 2022-12-13 | 中电云数智科技有限公司 | A distributed storage system I/O sub-health intelligent detection and recovery method |
CN115686901A (en) * | 2022-10-25 | 2023-02-03 | 超聚变数字技术有限公司 | Memory fault analysis method and computer equipment |
CN116302656A (en) * | 2023-03-13 | 2023-06-23 | 哈尔滨工业大学(深圳) | Intelligent memory isolation method and related equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2569714A1 (en) * | 2004-06-08 | 2005-12-22 | Dartdevices Corporation | Architecture, apparatus and method for device team recruitment and content renditioning for universal device interoperability platform |
CN101533370A (en) * | 2009-04-09 | 2009-09-16 | 成都市华为赛门铁克科技有限公司 | Memory abnormal access positioning method and device |
CN102222025A (en) * | 2011-06-17 | 2011-10-19 | 华为数字技术有限公司 | Method and device for eliminating memory failure |
CN102402472A (en) * | 2010-09-17 | 2012-04-04 | 鸿富锦精密工业(深圳)有限公司 | Memory detection system and detection method thereof |
CN102495770A (en) * | 2011-11-24 | 2012-06-13 | 曙光信息产业股份有限公司 | Method and system for computer memory error analysis |
CN103092709A (en) * | 2013-01-22 | 2013-05-08 | 浪潮电子信息产业股份有限公司 | Memory error processing method |
CN103279406A (en) * | 2013-05-31 | 2013-09-04 | 华为技术有限公司 | Method and device for isolating internal memories |
CN103631721A (en) * | 2012-08-23 | 2014-03-12 | 华为技术有限公司 | Method and system for isolating bad blocks in internal storage |
2016-04-29: CN application CN201610286680.7A (CN105893166A), status: Pending
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109086151A (en) * | 2017-06-13 | 2018-12-25 | 中兴通讯股份有限公司 | Method and device for isolating memory failures on a server |
CN109753378A (en) * | 2019-01-02 | 2019-05-14 | 浪潮商用机器有限公司 | A memory fault isolation method, device, system and readable storage medium |
WO2020177493A1 (en) * | 2019-03-01 | 2020-09-10 | 华为技术有限公司 | Memory error processing method and device |
WO2021056912A1 (en) * | 2019-09-29 | 2021-04-01 | 苏州浪潮智能科技有限公司 | Method and device for detecting memory downgrade error |
US11853150B2 (en) | 2019-09-29 | 2023-12-26 | Inspur Suzhou Intelligent Technology Co., Ltd. | Method and device for detecting memory downgrade error |
CN111143125A (en) * | 2019-12-20 | 2020-05-12 | 浪潮电子信息产业股份有限公司 | An MCE error processing method, device, electronic device and storage medium |
CN111143125B (en) * | 2019-12-20 | 2022-04-22 | 浪潮电子信息产业股份有限公司 | MCE error processing method and device, electronic equipment and storage medium |
WO2021185279A1 (en) * | 2020-03-20 | 2021-09-23 | 华为技术有限公司 | Memory failure processing method and related device |
CN113495799A (en) * | 2020-03-20 | 2021-10-12 | 华为技术有限公司 | Memory fault processing method and related equipment |
CN113495799B (en) * | 2020-03-20 | 2024-04-12 | 华为技术有限公司 | Memory fault processing method and related equipment |
CN113297046A (en) * | 2020-08-03 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Early warning method and device for memory fault |
CN113297046B (en) * | 2020-08-03 | 2025-02-14 | 阿里巴巴集团控股有限公司 | A memory failure early warning method and device |
CN112231128A (en) * | 2020-09-11 | 2021-01-15 | 中科可控信息产业有限公司 | Memory error processing method and device, computer equipment and storage medium |
CN112256465A (en) * | 2020-10-22 | 2021-01-22 | 皇虎测试科技(深圳)有限公司 | Method and device for repairing memory bank errors |
CN113515405A (en) * | 2021-07-09 | 2021-10-19 | 维沃移动通信有限公司 | Address management method and device |
CN115269446A (en) * | 2022-06-22 | 2022-11-01 | 超聚变数字技术有限公司 | A method and apparatus for isolating memory |
CN115328684A (en) * | 2022-06-30 | 2022-11-11 | 超聚变数字技术有限公司 | Memory fault reporting method, BMC and electronic equipment |
CN115470061A (en) * | 2022-10-10 | 2022-12-13 | 中电云数智科技有限公司 | A distributed storage system I/O sub-health intelligent detection and recovery method |
CN115686901A (en) * | 2022-10-25 | 2023-02-03 | 超聚变数字技术有限公司 | Memory fault analysis method and computer equipment |
CN115686901B (en) * | 2022-10-25 | 2023-08-04 | 超聚变数字技术有限公司 | Memory fault analysis method and computer equipment |
CN116302656A (en) * | 2023-03-13 | 2023-06-23 | 哈尔滨工业大学(深圳) | Intelligent memory isolation method and related equipment |
CN116302656B (en) * | 2023-03-13 | 2023-11-03 | 哈尔滨工业大学(深圳) | Intelligent memory isolation methods and related equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105893166A (en) | Method and device for processing memory errors | |
US7845006B2 (en) | Mitigating malicious exploitation of a vulnerability in a software application by selectively trapping execution along a code path | |
CN103019787B (en) | Method for determining function calling relationships, hot patch upgrade method and device | |
CN105975377B (en) | Method and device for monitoring memory | |
CN110727597B (en) | A method for troubleshooting invalid code completion use cases based on logs | |
US11422827B2 (en) | Method, device, apparatus for identifying graphics card of GPU server and medium | |
WO2021135272A1 (en) | Memory anomaly processing method and system, electronic device, and storage medium | |
CN107590016B (en) | Power-down restarting identification method and device | |
WO2016095672A1 (en) | Stack-based exception detection method and device | |
CN106326067A (en) | Method and device for monitoring CPU (central processing unit) performance under pressure test | |
CN106598796A (en) | Method for testing hardware information stability in reboot | |
JP6282217B2 (en) | Anti-malware system and anti-malware method | |
CN106021054A (en) | Method and apparatus for testing upgrading and downgrading stability of BMC | |
CN113127245B (en) | Method, system and device for processing system management interruption | |
CN115421960A (en) | UE memory fault recovery method, device, electronic equipment and medium | |
US9262274B2 (en) | Persistent data across reboots | |
CN102929733B (en) | Method and device for processing error files and client-side equipment | |
CN105404813B (en) | Log generation method, apparatus and system for a host-based intrusion defense system | |
CN106228065A (en) | Method and device for locating buffer-overflow vulnerabilities | |
CN114780276A (en) | Memory isolation method and device, electronic equipment and readable storage medium | |
CN110647463A (en) | Method and device for restoring test breakpoint and electronic equipment | |
CN108197041A (en) | Method, device and storage medium for determining the parent process of a subprocess | |
CN111858136A (en) | Solid-state drive abnormal data detection method, system, electronic device and storage medium | |
CN110647507A (en) | File system writing state determining method and device, electronic equipment and medium | |
CN110647455A (en) | Storage device restart recording method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | WD01 | Invention patent application deemed withdrawn after publication | |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160824 |