WO2026016559A1 - Diagnosis method and computer program product - Google Patents
- Publication number
- WO2026016559A1 (PCT/CN2025/089095)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- bmc
- npu
- test
- log
- fault
- Legal status
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Description
This application claims priority to Chinese patent application No. 202410984695.5, filed with the China National Intellectual Property Administration on July 19, 2024 and entitled "Diagnosis Method and Computer Program Product", the entire contents of which are incorporated herein by reference.
This application relates to the field of computer technology, and in particular to a diagnosis method and a computer program product.
A neural processing unit (NPU) server is a server designed for compute-intensive tasks such as deep learning and neural networks. In customer model-training scenarios these servers play a key role: they process large volumes of data efficiently and accelerate model training. In actual use, however, an NPU server may run into problems such as slowdowns and NPU cards dropping offline, which directly affect the efficiency and quality of model training.
In the related art, fault localization relies on collecting and analyzing in-band logs. In-band logs are logs generated inside the server, such as system logs, application logs, and hardware logs. They contain detailed information about the server at run time and are essential for analyzing and locating problems. However, because they may contain sensitive information (such as user data and model parameters), customers may be unwilling to provide them out of concern for data security, in which case the fault cannot be located.
Embodiments of this application provide a diagnosis method and a computer program product for diagnosing NPU faults without relying on the in-band logs of the customer's operating system (OS).
To solve the above problem, the technical solutions provided in the embodiments of this application are as follows:
A first aspect of the embodiments of this application provides a diagnosis method applied to a server. The server includes a central processing unit (CPU), a baseboard management controller (BMC), and a neural network processor (NPU), and the BMC is connected to both the CPU and the NPU. The method includes:
the CPU parses a BMC log to obtain a pre-diagnosis result;
when the pre-diagnosis result indicates an NPU fault, a test script is determined based on the pre-diagnosis result, the test script being used to test the NPU;
the NPU is tested based on the test script, and the BMC log generated during the test is read;
when the BMC log generated during the test indicates a BMC alarm, a diagnosis result is generated based on the in-band log generated during the test and the BMC log that triggered the BMC alarm.
With the method provided in the embodiments of this application, when the pre-diagnosis result indicates an NPU fault, the CPU calls the test script corresponding to the pre-diagnosis result and runs targeted tests on the NPU. During the test the CPU monitors the motherboard BMC log; if a BMC alarm occurs, the CPU derives the diagnosis result from the motherboard BMC log that triggered the alarm and the in-band log. The in-band log provides the NPU's running state and software-level error information during the test, while the BMC log reveals abnormal states at the hardware level. These two logs are enough to reach the final diagnosis result, with no need to access or rely on the in-band logs of the customer's OS. Even if the customer restricts access to its operating system for data security reasons, the embodiments can still diagnose NPU faults, which solves the problem of being unable to diagnose a fault because logs are withheld for data security reasons.
In one possible implementation, after the test script is determined based on the pre-diagnosis result, the method further includes: loading an NPU driver, the NPU driver being used to establish a communication connection between the CPU and the NPU;
testing the NPU based on the test script and reading the BMC log generated during the test includes:
when loading of the NPU driver is complete, testing the NPU based on the test script and reading the BMC log generated during the test.
In one possible implementation, the BMC log includes the system event log (SEL) of the BMC and the SEL of the NPU's BMC, and the pre-diagnosis result includes an NPU fault type. The CPU parsing the BMC log to obtain the pre-diagnosis result includes:
parsing the SEL of the BMC;
when the SEL of the BMC includes BMC alarm information of the NPU, parsing the SEL of the NPU's BMC based on that alarm information to obtain the NPU fault type corresponding to the alarm information.
In one possible implementation, the method further includes:
when the SEL of the BMC includes BMC alarm information of the NPU, obtaining the fault code in the SEL of the NPU's BMC;
comparing the obtained fault code with a preset fault code table to determine the NPU fault type corresponding to the fault code, the preset fault code table including the NPU fault type corresponding to each fault code.
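The comparison against the preset fault code table can be pictured as a plain dictionary lookup. The following is a minimal sketch under stated assumptions, not the applicant's implementation: the fault codes, the fault-type names, and the helper `lookup_fault_type` are all hypothetical, since the source does not list real codes.

```python
from typing import Optional

# Hypothetical preset fault code table: fault code -> NPU fault type.
# Real codes and types are vendor-defined and are not given in the source.
PRESET_FAULT_CODE_TABLE = {
    "0x80e01801": "HBM_MULTI_BIT_ECC",
    "0x80e18005": "OVER_TEMPERATURE",
    "0x81078603": "ROCE_LINK_DOWN",
}


def lookup_fault_type(fault_code: str) -> Optional[str]:
    """Return the NPU fault type for a fault code, or None if the code is unknown."""
    # Normalize whitespace and letter case so "0x80E01801" and "0x80e01801" match.
    return PRESET_FAULT_CODE_TABLE.get(fault_code.strip().lower())
```

Returning `None` for an unknown code lets the caller decide whether to stop or to fall back to a generic test set; the source does not say which behaviour is intended.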
In one possible implementation, determining the test script based on the pre-diagnosis result includes:
determining a test task list based on the pre-diagnosis result, and assigning values to the parameters of each test task in the list to obtain the test script corresponding to the pre-diagnosis result.
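One way to read this step is as a lookup of an ordered task list followed by parameter filling. The sketch below assumes a hypothetical mapping from fault type to tasks; the task names, default parameters, and the `device_id` parameter are illustrative and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NpuTask:
    """One test task of a test script, with its assigned parameters."""
    name: str
    params: Dict[str, int] = field(default_factory=dict)


# Hypothetical mapping: fault type -> ordered (task name, default parameters).
TASKS_BY_FAULT_TYPE: Dict[str, List[Tuple[str, Dict[str, int]]]] = {
    "HBM_MULTI_BIT_ECC": [("hbm_read", {"duration_s": 60}),
                          ("hbm_write", {"duration_s": 60})],
    "ROCE_LINK_DOWN": [("roce_port", {"duration_s": 30}),
                       ("bandwidth_d2d", {"size_mb": 256})],
}


def build_test_script(fault_type: str, device_id: int) -> List[NpuTask]:
    """Pick the task list for a fault type and assign each task's parameters."""
    script = []
    for name, defaults in TASKS_BY_FAULT_TYPE.get(fault_type, []):
        params = dict(defaults)          # copy so the table itself is not mutated
        params["device_id"] = device_id  # value assigned per diagnosis run
        script.append(NpuTask(name, params))
    return script
```

For example, `build_test_script("HBM_MULTI_BIT_ECC", 3)` yields the read task followed by the write task, both targeting device 3.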
In one possible implementation, testing the NPU based on the test script and reading the BMC log generated during the test includes:
testing the NPU task by task in the order in which the test tasks appear in the test script, and reading the BMC log generated during the test, until the BMC log generated during the test indicates a BMC alarm.
In one possible implementation, generating the diagnosis result based on the in-band log generated during the test and the BMC log that triggered the BMC alarm, when the BMC log generated during the test indicates a BMC alarm, includes:
when the BMC log generated during the test indicates a BMC alarm, terminating the test task currently being executed in the test script;
continuing to test the NPU in the order in which the test tasks appear in the test script until every test task in the test script has been executed;
generating the diagnosis result based on the in-band log generated during the test and the BMC log that triggered the BMC alarm.
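The task loop in this implementation (abort the running task on an alarm, then carry on with the remaining tasks) can be sketched as follows. The callables `run_step` and `read_new_bmc_alarms` are hypothetical stand-ins for the real test runner and BMC log reader, which the source does not specify.

```python
from typing import Callable, Iterable, List, Tuple


def run_test_script(
    tasks: Iterable[str],
    run_step: Callable[[str], bool],
    read_new_bmc_alarms: Callable[[], List[str]],
) -> List[Tuple[str, List[str]]]:
    """Run tasks in order and abort a task as soon as the BMC log shows an alarm.

    run_step(task) advances the task by one step and returns True when it is done.
    read_new_bmc_alarms() returns alarm entries logged since the previous call.
    Returns (task, alarm entries) pairs for every task that triggered an alarm.
    """
    triggered = []
    for task in tasks:
        finished = False
        while not finished:
            finished = run_step(task)
            alarms = read_new_bmc_alarms()
            if alarms:
                triggered.append((task, alarms))
                break  # terminate the current task, then move on to the next one
    return triggered
```

Polling the BMC log after every step is a simplification; a real monitor could equally run in a separate thread or react to events.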
In one possible implementation, before the CPU parses the BMC log to obtain the pre-diagnosis result, the method further includes:
the CPU sends the BMC a first preset instruction requesting the BMC log, so that the BMC, after receiving the first preset instruction, returns the BMC log to the CPU, the BMC log including the out-of-band information that the NPU's BMC wrote to the BMC log when the NPU failed.
In one possible implementation, before the CPU parses the BMC log to obtain the pre-diagnosis result, the method further includes:
the CPU sends the BMC a first preset instruction requesting the BMC log, so that the BMC, after receiving the first preset instruction, sends the NPU's BMC a request for the out-of-band information of the NPU's BMC, and the NPU's BMC sends that out-of-band information to the BMC;
after the BMC obtains the out-of-band information returned by the NPU's BMC, the CPU receives from the BMC the server BMC log containing the out-of-band information of the NPU's BMC.
In one possible implementation, the method further includes:
generating, based on a built-in case library, a handling strategy corresponding to the diagnosis result, the handling strategy including a solution for the diagnosis result, and the built-in case library including multiple historical diagnosis results and the handling strategy corresponding to each of them.
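A built-in case library of this kind can be as simple as a list of records searched by diagnosis result. The entries and the `find_strategy` helper below are hypothetical and only illustrate the lookup; the source does not describe the library's contents or matching rules.

```python
from typing import Optional

# Hypothetical case library: historical diagnosis results and their strategies.
CASE_LIBRARY = [
    {"diagnosis": "HBM_MULTI_BIT_ECC",
     "strategy": "Replace the NPU module, then rerun the HBM media test."},
    {"diagnosis": "ROCE_LINK_DOWN",
     "strategy": "Reseat or replace the optical module and cable, "
                 "then rerun the ROCE port test."},
]


def find_strategy(diagnosis: str) -> Optional[str]:
    """Return the handling strategy recorded for a diagnosis result, if any."""
    for case in CASE_LIBRARY:
        if case["diagnosis"] == diagnosis:
            return case["strategy"]
    return None  # no matching historical case
```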
A second aspect of the embodiments of this application provides a computer program product which, when run on a computer, causes the computer to execute the diagnosis method of the first aspect.
To explain the technical solutions of the embodiments more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of an application scenario of the diagnostic system provided in an embodiment of this application;
Figure 2 is a schematic diagram of the framework of the diagnostic system provided in an embodiment of this application;
Figure 3 is a flowchart of the first diagnosis method provided in an embodiment of this application;
Figure 4 is a flowchart of the second diagnosis method provided in an embodiment of this application;
Figure 5 is a schematic diagram of the operating flow of the fault pre-diagnosis module provided in an embodiment of this application;
Figure 6 is a schematic diagram of the classification of parsing methods provided in an embodiment of this application;
Figure 7 is a schematic diagram of the NPU log provided in an embodiment of this application;
Figure 8 is a schematic diagram of the contents of the SEL file provided in an embodiment of this application;
Figure 9 is a schematic diagram of the pre-diagnosis result provided in an embodiment of this application;
Figure 10 is a schematic diagram of the test script provided in an embodiment of this application;
Figure 11 is a schematic diagram of the operating flow of the fault pre-diagnosis module provided in an embodiment of this application.
To help those skilled in the art better understand the solutions of the embodiments of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the embodiments of this application.
To aid understanding of the technical solutions provided in the embodiments of this application, the terms used in the embodiments are explained first.
A neural processing unit (NPU) is a processor or chip designed specifically for artificial neural network computation. It is widely used to accelerate artificial intelligence tasks, especially deep learning and machine learning algorithms.
A BMC is a dedicated microcontroller that monitors and manages various aspects of server hardware, including but not limited to temperature, power state, fan speed, and the health of processors and memory. Each major hardware component, such as the motherboard and the NPU, can have its own BMC for finer-grained management and monitoring.
To aid understanding of the technical solutions provided in the embodiments of this application, the background of the embodiments is described first.
The embodiments of this application provide a diagnosis method in which the CPU reads and parses the motherboard BMC log to obtain a pre-diagnosis result. When the pre-diagnosis result indicates an NPU fault, the CPU calls the test script corresponding to the pre-diagnosis result to test the NPU. During the test, the CPU monitors the motherboard BMC log to catch any reproduction of the NPU fault caused by the test. When the motherboard BMC raises an alarm, the CPU obtains the motherboard BMC log that triggered the alarm and the in-band log generated by the NPU during the test. The in-band log reflects the NPU's running state and software-level error information during the test, and the motherboard BMC log that triggered the alarm reflects the hardware-level fault that occurred in the NPU during the test. Together they record in detail the hardware and software state during diagnosis, along with possible problems or anomalies. The embodiments can therefore diagnose faults without touching or entering the customer's operating system, and can reach a diagnosis result from the in-band and out-of-band logs generated during diagnosis alone, without relying on the in-band logs of the customer's OS. This solves the problem of being unable to diagnose a fault because logs are withheld for data security reasons.
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the embodiments of this application.
The diagnostic system provided in the embodiments of this application is described below by way of an embodiment.
Referring to Figure 1, which is a schematic diagram of an application scenario of the diagnostic system provided in an embodiment of this application, the server includes a CPU, a BMC, and an NPU, and the BMC is connected to both the CPU and the NPU. The embodiment provides an externally attachable diagnostic device that contains the diagnostic system; the device may be a hard disk or a USB storage device, and it has enough capacity to store the diagnostic system and all the resources it needs.
In practice, the diagnostic device containing the diagnostic system is plugged into a matching interface of the server, such as SATA, SAS, or USB. Once the diagnostic device is connected to the server, the CPU can load and run the diagnostic system stored on it.
Once the diagnostic system is running on the CPU, an operator can manually trigger the CPU to start the diagnosis method. On receiving the instruction that starts the diagnosis method, the CPU runs the fault diagnosis module of the diagnostic system, which begins acquiring the BMC's out-of-band information and performs pre-diagnosis based on it. If the pre-diagnosis result indicates an NPU fault, diagnosis continues: the NPU is tested so that the NPU fault is reproduced, and a handling strategy is generated.
For the functions of each part of the diagnostic system, see Figure 2, which is a schematic diagram of the framework of the diagnostic system provided in an embodiment of this application.
The fault pre-diagnosis module parses the server motherboard BMC log provided by the motherboard BMC, determines and outputs the fault location and fault type of the faulty device, and so gives the fault diagnosis module the basis for targeted testing. The parsing methods may include register parsing, SEL parsing, and FDM parsing.
The fault diagnosis module performs software and hardware performance tests on the NPU according to the fault location and fault type determined by the fault pre-diagnosis module, monitors the test process, and outputs test results. Its test types may include HBM media testing, TDP power consumption testing, ROCE network port testing, bandwidth testing, and EDP power consumption testing. HBM media testing includes: Read, a stress test of read operations; and Write, a stress test of write operations. Bandwidth testing includes: d2h, a test of data sent from the NPU to the CPU; h2d, a test of data sent from the CPU to the NPU; d2d, a test of data transfer within the NPU; and p2p, a test of data transfer within the CPU. Cache memory testing includes:
The log parsing module collects and parses the log information generated during the test and generates the diagnosis result and the handling strategy. The log information generated during the test includes the motherboard BMC out-of-band information that triggered the motherboard BMC alarm and the in-band log generated while the CPU runs the diagnostic system. The in-band log and the out-of-band information can be parsed separately, producing one diagnostic strategy for each, or combined to produce a single diagnostic strategy.
The diagnosis method provided in the embodiments of this application is described below. Referring to Figure 3, which is a flowchart of the first diagnosis method provided in an embodiment of this application, the method includes:
S110. The CPU parses the BMC log to obtain a pre-diagnosis result.
The CPU reads and analyzes the BMC log. The BMC is the component that manages server hardware status (such as temperature, voltage, and fan speed) and generates the corresponding logs. By parsing the BMC log, the CPU distinguishes common hardware faults (hard disk, memory, CPU, expansion cards, and so on) from NPU-related faults and so obtains a preliminary, pre-diagnosis result.
S111. When the pre-diagnosis result indicates an NPU fault, a test script is determined based on the pre-diagnosis result, the test script being used to test the NPU.
When the pre-diagnosis result indicates an NPU fault, the test script corresponding to the pre-diagnosis result is called to run targeted tests on the NPU so that the NPU fault is reproduced.
S112. The NPU is tested based on the test script, and the BMC log generated during the test is read.
Monitoring the BMC log generated during the test ensures the accuracy and validity of the test.
S113. When the BMC log generated during the test indicates a BMC alarm, a diagnosis result is generated based on the in-band log generated during the test and the BMC log that triggered the BMC alarm.
If the BMC log indicates a BMC alarm during the test, the CPU generates the diagnosis result by jointly analyzing the in-band log generated during the test and the BMC log that triggered the BMC alarm.
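Steps S110 to S113 can be tied together in a small orchestration function. This is a sketch under stated assumptions rather than the claimed implementation: every callable passed in (`read_bmc_log`, `pre_diagnose`, `select_script`, `run_and_monitor`, `read_inband_log`) is a hypothetical placeholder for a component the source describes only in prose.

```python
from typing import Callable, List, Optional


def diagnose(
    read_bmc_log: Callable[[], List[str]],
    pre_diagnose: Callable[[List[str]], Optional[str]],
    select_script: Callable[[str], List[str]],
    run_and_monitor: Callable[[List[str]], List[str]],
    read_inband_log: Callable[[], List[str]],
) -> Optional[dict]:
    """Return a diagnosis record, or None when no NPU fault is found or reproduced."""
    # S110: parse the BMC log into a pre-diagnosis result (an NPU fault type or None).
    fault_type = pre_diagnose(read_bmc_log())
    if fault_type is None:
        return None
    # S111: determine the test script that matches the pre-diagnosis result.
    script = select_script(fault_type)
    # S112: run the script while reading the BMC log; collect alarm entries.
    alarm_entries = run_and_monitor(script)
    if not alarm_entries:
        return None  # the test did not trigger a BMC alarm
    # S113: combine the triggering BMC entries with the in-band log of the test run.
    return {
        "fault_type": fault_type,
        "bmc_alarms": alarm_entries,
        "inband_log": read_inband_log(),
    }
```

Passing the components in as callables keeps the sketch independent of any particular BMC interface or test tool.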
Thus, in the embodiments of this application, when the pre-diagnosis result indicates an NPU fault, the CPU calls the test script corresponding to the pre-diagnosis result and runs targeted tests on the NPU. During the test the CPU monitors the motherboard BMC log; if a BMC alarm occurs, the CPU derives the diagnosis result from the motherboard BMC log that triggered the alarm and the in-band log. The in-band log provides the NPU's running state and software-level error information during the test, while the BMC log reveals abnormal states at the hardware level. These two logs are enough to reach the final diagnosis result, with no need to access or rely on the in-band logs of the customer's OS. Even if the customer restricts access to its operating system for data security reasons, the embodiments can still diagnose NPU faults, which solves the problem of being unable to diagnose a fault because logs are withheld for data security reasons.
The diagnosis method executed by the diagnostic system is described below. Referring to Figure 4, which is a schematic diagram of the second diagnosis method provided in an embodiment of this application, the method includes:
S101. The CPU reads the server motherboard BMC log through a first preset instruction.
The CPU running the diagnostic system sends the first preset instruction to the server motherboard BMC to request the server motherboard BMC log. The motherboard BMC can collect and consolidate information from the BMCs of different components.
The server motherboard BMC log stores important information about server hardware status, events, and potential faults. The server motherboard BMC can record NPU fault information in the server motherboard BMC log file and forward that file to the server's management system, such as the diagnostic system provided in the embodiments of this application. In other words, the server motherboard BMC log includes not only the out-of-band information of the motherboard BMC but also that of the NPU's BMC. The motherboard BMC and the NPU's BMC each record hardware status and anomalies; combined, they give a full picture of hardware health, so the diagnostic system gains a more complete view of how the NPU and the motherboard are running.
In the first case, after receiving the first preset instruction, the server motherboard BMC sends the NPU's BMC a request for the out-of-band information of the NPU's BMC, and the NPU's BMC sends that information to the server motherboard BMC. Once the server motherboard BMC has the out-of-band information returned by the NPU's BMC, it returns to the CPU the server motherboard BMC log containing that information.
In the other case, when the NPU's BMC detects a fault in the NPU, it passes the NPU's out-of-band information to the server motherboard BMC through a preset interface. The server motherboard BMC records the received information in the server motherboard BMC log, either directly or after processing, so that when it receives the first preset instruction from the CPU it returns to the CPU the server motherboard BMC log containing the out-of-band information of the NPU's BMC.
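The two cases differ only in when the NPU's out-of-band information reaches the motherboard BMC log: on request (pull) or as soon as a fault is detected (push). The toy model below illustrates that difference; the classes `NpuBmc` and `MainboardBmc` and their methods are hypothetical and do not correspond to any real BMC firmware interface.

```python
from typing import List


class NpuBmc:
    """Toy stand-in for the NPU's BMC, holding its out-of-band records."""

    def __init__(self) -> None:
        self.oob_records: List[str] = []

    def get_oob_info(self) -> List[str]:
        return list(self.oob_records)


class MainboardBmc:
    """Toy stand-in for the server motherboard BMC and its log."""

    def __init__(self, npu_bmc: NpuBmc) -> None:
        self.npu_bmc = npu_bmc
        self.log: List[str] = []

    def on_npu_report(self, record: str) -> None:
        # Second case (push): the NPU's BMC reports a fault when it detects one.
        self.log.append(record)

    def handle_log_request(self, pull: bool) -> List[str]:
        # First case (pull): query the NPU's BMC when the CPU asks for the log.
        if pull:
            for record in self.npu_bmc.get_oob_info():
                if record not in self.log:
                    self.log.append(record)
        return list(self.log)
```

In both cases the CPU ends up with one motherboard BMC log that already contains the NPU's out-of-band information.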
For example, when out-of-band information is transferred from the NPU or the motherboard, it can be packaged into data packets in a format such as XML or JSON and sent over a secure, encrypted protocol such as HTTPS, so that it reaches the server motherboard BMC safely. The preset interface may be Redfish. Redfish is a modern management interface that allows hardware status and configuration data to be accessed and exchanged remotely in a standardized way, efficiently and securely. Using this interface, the NPU's BMC can notify the server motherboard BMC in real time of out-of-band information containing NPU fault details. Because Redfish can communicate over HTTPS, the data is protected in transit.
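Packaging out-of-band information as JSON for transfer over HTTPS might look like the sketch below. It only builds the request and does not send it. The URL, the payload field names, and the helper `build_oob_request` are hypothetical; in particular, they are not Redfish resource paths or schema properties, which the source does not specify.

```python
import json
import urllib.request


def build_oob_request(url: str, npu_id: int, fault_code: str,
                      detail: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS POST carrying NPU out-of-band information."""
    if not url.lower().startswith("https://"):
        raise ValueError("out-of-band information must be sent over HTTPS")
    # Hypothetical payload layout; field names are illustrative only.
    payload = {"npu_id": npu_id, "fault_code": fault_code, "detail": detail}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller would pass the resulting request to `urllib.request.urlopen` (or an equivalent HTTP client) to perform the transfer.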
After receiving the out-of-band information of the NPU's BMC, the server motherboard BMC can parse it, convert it into a human-readable format, and display it on a web interface. Through the web interface, maintenance staff or IT administrators can see the NPU's running state at a glance, which improves transparency and speeds up response.
S102. The CPU parses the server motherboard BMC log to determine the pre-diagnosis result.
That is, the CPU runs the fault pre-diagnosis module of the diagnostic system, which analyzes the server motherboard BMC log provided by the motherboard BMC and distinguishes common hardware faults (hard disk, memory, CPU, expansion cards, and so on) from NPU-related faults. When an NPU fault is identified, the module parses the out-of-band information of the NPU's BMC in the server motherboard BMC log to determine the pre-diagnosis result. The pre-diagnosis result guides the subsequent performance testing of the NPU.
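The first decision in S102, separating NPU-related entries from common hardware faults, can be sketched as a simple partition. The entry layout, and in particular the `component` field, is an assumption made for illustration; the source does not define the structure of a BMC log entry.

```python
from typing import Dict, List


def split_bmc_entries(entries: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Split BMC log entries into NPU-related faults and common hardware faults."""
    result: Dict[str, List[Dict[str, str]]] = {"npu": [], "common": []}
    for entry in entries:
        # "component" is a hypothetical field naming the hardware that raised the entry.
        is_npu = entry.get("component", "").upper().startswith("NPU")
        result["npu" if is_npu else "common"].append(entry)
    return result
```

Only when the `npu` list is non-empty would the module go on to parse the NPU BMC's out-of-band information.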
一旦服务器主板BMC日志指示NPU故障,运行诊断系统的CPU将对服务器主板BMC日志中的NPU的BMC的带外信息进行分析,以获取更具体的问题细节和相应的诊断策略。其中,NPU下层日志指的是NPU在执行任务过程中,由底层硬件或固件生成的日志信息,这些信息可以包括NPU的运行状态、错误代码、性能数据等。Once the server motherboard BMC log indicates an NPU failure, the CPU running the diagnostic system will analyze the out-of-band information of the NPU's BMC in the server motherboard BMC log to obtain more specific problem details and corresponding diagnostic strategies. The NPU lower-level log refers to the log information generated by the underlying hardware or firmware during NPU task execution. This information can include the NPU's operating status, error codes, performance data, etc.
For the workflow of the fault pre-diagnosis module, see Figure 5, a schematic diagram of the operation flow of the fault pre-diagnosis module provided in an embodiment of this application. The workflow includes:
First, the fault pre-diagnosis module obtains the server motherboard BMC log. This log includes the fault information provided by the server motherboard BMC, which may be fault information caused by a server motherboard fault and/or fault information caused by an NPU fault.
For example, the fault pre-diagnosis module can use a first preset command to collect the server motherboard BMC log in one step. The collected log is then preprocessed, which may involve compression, formatting, or other preprocessing tasks, to facilitate subsequent parsing. The first preset command may be `ipmcget -d diaginfo`.
The preprocessed server motherboard BMC log is then parsed. As shown in Figure 2, the parsing methods can include register parsing, SEL (System Event Log) parsing, and FDM parsing (FDM is the fault management engine described below). Note that all three parsing methods are applied to the server motherboard BMC log during pre-diagnosis; they can run simultaneously or one after another.
The first method is register parsing. The health status and fault information of hardware devices (such as the NPU, memory, and hard disks) are often recorded in specific registers. Each register contains a series of binary bits, and each bit represents a status or fault flag. The server motherboard BMC log includes the register values; by reading these values and interpreting them against a predetermined bit mapping table, it can be determined whether a fault has occurred and what type of fault it is. During log parsing, the collected log can be parsed using this register information to produce a precise fault output for the power supply unit (PSU). If the pre-diagnosis result indicates only a PSU fault and the other parsing methods indicate that the NPU is not faulty, the current diagnosis is terminated.
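The bit-map lookup described above can be sketched as follows. This is a minimal illustration only: the bit positions and fault names in `PSU_STATUS_BITS` are hypothetical, since the actual register layout is device-specific and is not given in the text.

```python
# Hypothetical bit mapping table: bit position -> fault name.
# Real registers and bit meanings depend on the device and are assumptions here.
PSU_STATUS_BITS = {
    0: "input voltage lost",
    1: "output over-current",
    2: "over-temperature",
    3: "fan failure",
}


def decode_register(value: int, bit_map: dict[int, str]) -> list[str]:
    """Return the fault names whose flag bits are set in `value`."""
    return [name for bit, name in sorted(bit_map.items()) if value & (1 << bit)]


# Bits 0 and 2 are set in 0b0101.
faults = decode_register(0b0101, PSU_STATUS_BITS)
print(faults)
```

An empty list means that no flag in the mapping table is set, which would correspond to "no fault reported by this register".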
The second method is FDM parsing. FDM is an intelligent fault management engine in the server that monitors the server's components and promptly reports and analyzes possible hardware faults. Working in conjunction with other systems, FDM can parse the server motherboard BMC log to determine whether the server has a fault and of what type. Specifically, the FDM engine parses the server motherboard BMC log and looks for any anomalies or error codes that may indicate a hardware fault. The analysis covers, but is not limited to, CPU errors, memory errors, power supply problems, fan failures, and over-temperature conditions. Based on the parsed error codes and anomalies, FDM can match them against a preset fault pattern database to identify the specific fault type. These patterns can involve the failure of a single component, such as a CPU or memory module, or more complex system-level problems.
Therefore, by analyzing the FDM log, the specific source of a fault can be located quickly, for example by producing precise fault outputs for dual in-line memory modules (DIMM), disks (DISK), and redundant arrays of independent disks (RAID). If the pre-diagnosis result indicates only DIMM, DISK, or RAID faults and the other parsing methods indicate that the NPU is not faulty, the current diagnosis is terminated.
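The termination rule stated for the register and FDM parsers (stop when only non-NPU components are reported faulty) can be sketched like this. The `findings` structure and the parser names used as keys are assumptions made for illustration.

```python
NPU_COMPONENT = "NPU"


def should_continue_diagnosis(findings: dict[str, list[str]]) -> bool:
    """Decide whether NPU testing should proceed.

    `findings` maps a parser name to the faulty components it reported.
    Diagnosis continues only if at least one parser reported an NPU fault.
    """
    faulty = {component for reported in findings.values() for component in reported}
    return NPU_COMPONENT in faulty


# Only PSU and DIMM faults: the NPU diagnosis would be terminated.
only_other_hw = {"register": ["PSU"], "fdm": ["DIMM"], "sel": []}
# The SEL parser reports an NPU fault: the diagnosis continues.
npu_involved = {"register": [], "fdm": ["DISK"], "sel": ["NPU"]}

print(should_continue_diagnosis(only_other_hw), should_continue_diagnosis(npu_involved))
```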
The third method is SEL parsing. The SEL log is generated by the BMC and records important information about the server hardware and its health status. When events such as a hardware fault, abnormal temperature, power problem, or system restart are detected, the BMC records them in the SEL log, and each event has a unique fault code (SEL Record ID). The server motherboard BMC log includes the SEL file of the motherboard BMC and the SEL file of the NPU's BMC. In the embodiments of this application, the motherboard and the NPU in the server each have their own BMC, and each BMC has its own SEL file. The motherboard's SEL file contains fault information for memory and hard disks, while the SEL file of the NPU's BMC contains NPU-related fault information. While the NPU's BMC is running, SEL alarms of the NPU's BMC can be passed to the motherboard BMC's SEL file through communication between the two BMCs, so the alarm information passed through from the NPU can also be seen in the motherboard BMC's SEL file.
In practice, part of the NPU's alarm information is passed through to the motherboard BMC, but not all of it can be. To determine the NPU's alarm information more accurately and completely, the NPU's SEL file can therefore be analyzed further to determine the pre-diagnosis result. Accordingly, once passed-through NPU alarm information is found in the motherboard BMC's SEL file, the NPU's SEL file is parsed as well. When parsing the NPU's SEL file, pre-diagnosis can use at least one of two kinds of content in that file: the alarm information of the NPU's BMC, or the NPU fault ID.
See Figure 6, a schematic diagram of the classification of parsing methods provided in an embodiment of this application. The first parsing method in Figure 6 determines the NPU fault number or fault symptom from the alarm information of the NPU's BMC. It applies to cases such as card drop, bandwidth reduction, over-temperature, carrier board failure, and temperature acquisition failure. These faults can cause immediately visible, direct failures or performance degradation that affect the stability and availability of the server. Therefore, when the interface between the BMCs is limited, only this high-impact portion of the fault information from the NPU's BMC is allowed to pass through to the motherboard BMC's SEL file. This information can be recorded by both the motherboard BMC and the NPU's BMC as alarm information of the NPU's BMC, and it is recorded in the SEL file of the NPU's BMC so that faults can be queried quickly. Meanwhile, the NPU's BMC can keep recording sensor events during operation to its own SEL file, including the passed-through alarm information and more detailed event records. Consequently, when the SEL file of the NPU's BMC is parsed, the pre-diagnosis result can be read from the NPU alarm information recorded in it.
For example, when the CPU reads the motherboard BMC's SEL file, the file contains alarm information passed through by the NPU's BMC: NPU bandwidth reduction. To make the pre-diagnosis result accurate, the CPU then reads the SEL file of the NPU's BMC specifically for this alarm and finds that it also records NPU bandwidth reduction. If the current server contains only one NPU module, the pre-diagnosis result is that it includes NPU bandwidth reduction.
In one possible implementation, the current server has more than one NPU module, and it is necessary to determine more precisely which module should be tested next. Based on the alarm information read from the NPU's BMC (NPU bandwidth reduction), the CPU can further parse the NPU's systemcom.dat file in the motherboard BMC log; see Figure 7, a schematic diagram of the NPU log provided in an embodiment of this application. This file contains the NPU link establishment information, that is, how many NPU cards were recognized and at what bandwidth. During parsing, the pre-diagnosis module uses LINKSTS as the keyword and performs orthogonal matching on the log file. As shown in Figure 7, it finds that the link status following NPU 10 is 0x2045, and so concludes that the NPU 10 module has reduced bandwidth. This is the pre-diagnosis result, and the NPU 10 module is subsequently tested and diagnosed specifically.
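The keyword search on LINKSTS might look like the sketch below. Two things are assumptions and not stated in the text: the line layout of systemcom.dat (the `NPU <n> LINKSTS <hex>` form is invented for illustration), and the interpretation of the status word as a PCIe Link Status register, in which bits 9:4 hold the negotiated link width. Under that interpretation, 0x2045 decodes to a x4 link.

```python
import re

# Hypothetical log excerpt; the real systemcom.dat layout is not given in the text.
SAMPLE_LOG = """\
NPU 9 LINKSTS 0x2105
NPU 10 LINKSTS 0x2045
"""

LINE_RE = re.compile(r"NPU\s+(\d+)\s+LINKSTS\s+(0x[0-9A-Fa-f]+)")


def negotiated_width(linksts: int) -> int:
    """Extract bits 9:4, assuming the PCIe Link Status register layout."""
    return (linksts >> 4) & 0x3F


def find_degraded(log_text: str, expected_width: int = 16) -> list[int]:
    """Return the NPU numbers whose negotiated width is below `expected_width`."""
    degraded = []
    for npu_id, raw in LINE_RE.findall(log_text):
        if negotiated_width(int(raw, 16)) < expected_width:
            degraded.append(int(npu_id))
    return degraded


print(find_degraded(SAMPLE_LOG))
```

The expected width of 16 lanes is likewise an assumption; a real tool would take it from the platform's configuration.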
The second parsing method in Figure 6 determines the fault symptom from the NPU fault ID. Figure 6 also lists cases that can be pre-diagnosed in this way, that is, by querying the NPU fault IDs recorded in the SEL file. This method applies to cases such as intermittent network port disconnection, ECC isolation, and L2 Buff. These faults are harder to spot: they may not crash the system immediately, but they affect system reliability and data integrity. Because BMC interface resources are limited, this fault information is not passed through to the motherboard BMC's SEL file, so this parsing method covers the NPU fault types that are absent from the motherboard BMC's SEL file.
See Figure 8, a schematic diagram of the SEL file content provided in an embodiment of this application. The CPU can traverse the NPU's SEL file and try to obtain the fault IDs recorded in it; in Figure 8 the extracted fault ID is 0X81078603. A fault ID is usually associated with a specific fault type. The pre-diagnosis module compares the fault ID 0X81078603 with a predefined fault code table, which lists all possible fault IDs together with their fault descriptions and suggested handling strategies. By looking up the table, the pre-diagnosis module can determine which NPU fault type the ID represents, such as an NPU memory fault, a bandwidth anomaly, or a network port connection problem. Once the fault ID is identified, the diagnostic system can both locate the problem and retrieve the fault's specific information, such as the fault details shown in Figure 8: 1. the network port chip detected a fault in its own link status; 2. the network port chip detected a fault in the peer's link status. The fault type is: network port function unavailable. The handling measures already taken for the current fault are: 1. report the fault event to fault management; 2. record a log. Current fault level: major.
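The table lookup can be sketched as follows. Only the single entry for 0X81078603 is taken from the text; the table structure, the regular expression, and the sample SEL line are illustrative assumptions, and the real SEL file format may differ.

```python
import re
from typing import Optional

# Hypothetical fault code table. Only this entry comes from the text;
# a real table would list every known fault ID.
FAULT_TABLE = {
    0x81078603: {
        "type": "network port function unavailable",
        "details": [
            "network port chip detected a fault in its own link status",
            "network port chip detected a fault in the peer's link status",
        ],
    },
}

# Assumes fault IDs appear as 0x/0X followed by eight hex digits.
FAULT_ID_RE = re.compile(r"0[xX][0-9A-Fa-f]{8}")


def lookup_faults(sel_text: str) -> list[tuple[str, Optional[dict]]]:
    """Extract fault IDs from SEL text and look each one up (None if unknown)."""
    return [
        (token, FAULT_TABLE.get(int(token, 16)))
        for token in FAULT_ID_RE.findall(sel_text)
    ]


# Invented sample line, used only to exercise the lookup.
hits = lookup_faults("2024-07-17 12:30:01 fault id 0X81078603 asserted")
print(hits[0][1]["type"])
```

Returning `None` for unknown IDs keeps unrecognized faults visible to the caller instead of silently dropping them.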
The pre-diagnosis module can then output the specific location of the fault and the suggested handling strategy. See Figure 9, a schematic diagram of the pre-diagnosis result provided in an embodiment of this application, which includes the fault ID 0X81078603. The pre-diagnosis result guides maintenance personnel in taking the corresponding action, such as a hardware check, firmware upgrade, or parameter adjustment.
By parsing the server motherboard BMC log, the fault pre-diagnosis module can output the fault location and/or fault type of the problem device. Based on this pre-diagnosis result, the module outputs a diagnosis strategy so that the fault diagnosis module can run software and hardware performance tests on the NPU.
S103. Based on the pre-diagnosis result, the CPU invokes the test script corresponding to that result.
This step can be performed by the CPU running the fault diagnosis module of the diagnostic system. The fault diagnosis module is described below.
The main function of this module is to invoke the appropriate test tools and run targeted stress tests on the NPU, based on the problem location and problem type determined in the pre-diagnosis phase. It monitors the test process to ensure that the tests are accurate and effective, and finally outputs the test results to help maintenance personnel locate and resolve the fault quickly.
The test script sets the test parameters, such as the test type, duration, and load level. The purpose of stress testing is to simulate high-load conditions in order to check the NPU's behavior and stability at its limits. For the content and flow of the test script invoked by the fault diagnosis module, see Figure 10, a schematic diagram of the test script provided in an embodiment of this application. In this flow, when the fault diagnosis module starts running, `init_logger()` initializes the logger, a task list (TASK_LIST) is set up, and the test tasks to be executed and their test parameters are assigned. The test script obtained after initialization in Figure 10 contains five classes, meaning that this test run executes five tasks: bandwidth (bandwidth test), roce (RoCE test), hbm (high-bandwidth memory test), tdp_power (TDP power test), and edp_power (EDP power test). Each class corresponds to its own test task; for example, `bandwidth()` obtains the bandwidth value. Following the test script, the CPU communicates with the NPU so that the NPU runs each test task in the script in turn.
The test scripts can be automated scripts predefined for the different pre-diagnosis results and diagnosis strategies. This keeps the testing targeted: instead of running every test blindly, it focuses on the suspected problem area. For example, if the diagnosis strategy points to degraded NPU performance, testing concentrates on bandwidth; if it points to memory errors, testing concentrates on memory verification. Targeted testing cuts out unrelated tests, saves time, avoids a broad trial-and-error search, goes straight to the problem, and improves diagnostic efficiency.
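One way to picture this selection step is a lookup from pre-diagnosis symptoms to test tasks. The task names below are the five named for Figure 10; the symptom keys and the specific symptom-to-task mapping are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical mapping from a pre-diagnosis symptom to the test tasks to run.
# Task names follow Figure 10; the symptom keys and pairings are assumptions.
TASKS_BY_SYMPTOM = {
    "bandwidth_reduction": ["bandwidth"],
    "memory_error": ["hbm"],
    "network_port_fault": ["roce"],
    "over_temperature": ["tdp_power", "edp_power"],
}


def build_task_list(symptoms: list[str]) -> list[str]:
    """Build an ordered, duplicate-free task list for the given symptoms."""
    tasks: list[str] = []
    for symptom in symptoms:
        for task in TASKS_BY_SYMPTOM.get(symptom, []):
            if task not in tasks:
                tasks.append(task)
    return tasks


TASK_LIST = build_task_list(["bandwidth_reduction", "memory_error"])
print(TASK_LIST)
```

Unknown symptoms map to no tasks in this sketch; a real system might fall back to a broader default test set instead.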
S104. The CPU loads the NPU driver and runs a stress test on the NPU computing unit.
This step can be performed by the CPU running the fault diagnosis module of the diagnostic system. The goal of the stress test is to check the NPU's stability and performance under high load and thereby verify whether the NPU has the problem or performance bottleneck indicated by the pre-diagnosis result.
The NPU driver is the program through which the automatic diagnostic system communicates with the NPU; it establishes the communication connection between the CPU running the automatic diagnostic system and the NPU. Once the NPU driver has been loaded, the fault diagnosis module can control the NPU computing unit to carry out the stress test tasks.
After the driver is loaded, the fault diagnosis module can start the stress test. For example, the CPU sends computation instructions to the NPU computing unit so that it performs the computation, which yields the result of the current stress test.
S105. During NPU testing, the CPU reads the server motherboard BMC log through a second preset command and determines from that log whether the server motherboard BMC has raised an alarm.
A second preset command is embedded in the test script, for example an ipmitool command such as `ipmitool sel list`. It reads the server motherboard BMC log periodically (for example, once every few seconds) or in real time and checks for any abnormal conditions or alarm information, waiting for the NPU fault to reproduce while the test script runs.
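A polling step of this kind could be sketched as below. `ipmitool sel list` is the command named in the text; everything else (the helper names, the approach of treating previously unseen output lines as new events, and the sample SEL lines) is an assumption made for illustration. The reader is passed in as a function so the logic can be exercised without a BMC.

```python
import subprocess
from typing import Callable


def read_sel_via_ipmitool() -> list[str]:
    """Run `ipmitool sel list` and return its non-empty output lines."""
    out = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def new_entries(reader: Callable[[], list[str]], seen: set[str]) -> list[str]:
    """Return SEL lines not seen in earlier polls, and remember them."""
    fresh = [line for line in reader() if line not in seen]
    seen.update(fresh)
    return fresh


# Demonstration with a stand-in reader; the sample lines are invented.
def fake_reader() -> list[str]:
    return [
        "1 | 07/17/2024 | 12:36:00 | Temperature | Upper Critical going high | Asserted",
    ]


seen_lines: set[str] = set()
first_poll = new_entries(fake_reader, seen_lines)
second_poll = new_entries(fake_reader, seen_lines)
print(len(first_poll), len(second_poll))
```

On a real server, the test script would call `new_entries(read_sel_via_ipmitool, seen_lines)` in a loop with a short sleep between polls; a non-empty result signals that a new event was logged during the test.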
The server motherboard BMC log that is read includes, but is not limited to, key indicators such as temperature, power consumption, and error rate. Based on the BMC log read, it is determined whether the server motherboard BMC has triggered an alarm, which can include hardware events such as over-temperature, abnormal voltage, or fan failure.
S106. When the motherboard BMC raises an alarm, the test task currently being executed in the test script is stopped.
If the server motherboard BMC log read during the test shows that a motherboard BMC alarm has been triggered, the fault diagnosis module immediately stops the test task that is currently running. In other words, the CPU can obtain hardware status and alarm information in real time even while testing, and this immediate feedback allows a fast response to anomalies and prevents further damage to the hardware.
In one possible implementation, the task list contains more than one test. For example, the test tasks for the NPU are bandwidth (bandwidth test), hbm (high-bandwidth memory test), and tdp_power (TDP power test). If the bandwidth test does not trigger a motherboard BMC alarm, the next task, the HBM test, is executed according to the order of the test tasks.
If the motherboard BMC alarm is triggered while the second test task is running, that task is terminated and the log parsing module obtains the motherboard BMC out-of-band information that triggered the alarm. The third test task, the TDP power test, is then run on the NPU. After all test tasks have finished, the log parsing module analyzes the server motherboard BMC logs from the two alarm triggers and generates the diagnosis result.
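The control flow in this example (abort only the task that raised an alarm, keep its log, and carry on with the next task) can be sketched as follows. Modelling each task as a fixed number of steps and supplying alarms through a callback are simplifications assumed for illustration; they are not how the patent describes the tasks internally.

```python
from typing import Callable


def run_all(
    task_list: list[tuple[str, int]],
    alarm_log_for: Callable[[str, int], list[str]],
) -> dict[str, dict]:
    """Run tasks in order; abort a task as soon as an alarm appears.

    `task_list` holds (task name, number of steps) pairs.
    `alarm_log_for(name, step)` returns the alarm lines seen during that step;
    an empty list means no alarm.
    """
    report: dict[str, dict] = {}
    for name, steps in task_list:
        report[name] = {"status": "completed", "alarm_log": []}
        for step in range(steps):
            alarms = alarm_log_for(name, step)
            if alarms:
                report[name] = {"status": "aborted", "alarm_log": alarms}
                break  # stop only the current task, then continue with the next
    return report


# Stand-in alarm source: the hbm task triggers an alarm at its second step.
def fake_alarms(name: str, step: int) -> list[str]:
    return ["overheat alarm"] if (name, step) == ("hbm", 1) else []


report = run_all([("bandwidth", 3), ("hbm", 3), ("tdp_power", 3)], fake_alarms)
print({name: entry["status"] for name, entry in report.items()})
```

The per-task `alarm_log` entries are what a log parsing step would later analyze together.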
S107. The CPU obtains the in-band log of the automatic diagnostic system and generates the diagnosis result based on that in-band log and the server motherboard BMC log that triggered the motherboard BMC alarm.
The in-band log is the log generated when the CPU runs the automatic diagnostic system. Relative to the customer's OS, the automatic diagnostic system provided in the embodiments of this application is an independent OS, and it has permission to obtain this in-band log. The in-band log provides the NPU's running status during the test and software-level error information.
The server motherboard BMC log that triggered the motherboard BMC alarm contains the fault information detected by the BMC. The NPU may also raise in-band alarms while running: the log information generated inside the NPU reflects its status and the errors it encounters while executing tasks, including but not limited to NPU tool runtime exceptions, memory management problems, abnormal performance metrics, application crashes or hangs, and algorithm execution errors. It is therefore necessary to obtain both the in-band log produced while the CPU runs the automatic diagnostic system and the motherboard BMC out-of-band information that triggered the motherboard BMC alarm. Together they record in detail the hardware and software state throughout the diagnosis, along with any possible problems or anomalies.
The diagnosis result for this run can be generated from at least one of the in-band log of the automatic diagnostic system and the motherboard BMC out-of-band information that triggered the motherboard BMC alarm. The diagnosis result can include the NPU fault type and the cause of the fault.
Example of an in-band log associated with a motherboard BMC alarm:
Timestamp: 2024-07-17 12:35:45
Event: NPU performance degraded; processing latency increased.
Details: Over the past 30 minutes, the average latency of neural network inference tasks on the NPU rose from 10 ms to 20 ms. Over the same period, CPU utilization rose from 30% to 70%, indicating that the NPU may not be sharing the computational load effectively.
Example of a server motherboard BMC log that triggered a motherboard BMC alarm:
Timestamp: 2024-07-17 12:36:00
Event: Over-temperature alarm triggered.
Details: The NPU core temperature reached 85°C, exceeding the warning threshold of 80°C. The BMC has automatically raised the fan speed to the maximum, but the temperature has not dropped noticeably.
Combining the in-band log and the out-of-band information above gives the diagnosis result, which includes the fault type and the cause of the fault.
Fault type: NPU over-temperature leading to performance degradation.
Cause of the fault: Sustained high-intensity computing tasks raised the NPU core temperature. The cooling system may be insufficient for high-load conditions, or dust on the heat sink and fans may be reducing cooling efficiency.
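How the two logs might be brought together is not spelled out in the text. One simple possibility, shown here purely as an assumption, is to pair in-band and out-of-band events whose timestamps fall within a short window. The timestamps and events are the ones from the example above; the five-minute window and the function name `correlate` are hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Events taken from the example logs above.
in_band = [(datetime.strptime("2024-07-17 12:35:45", FMT), "NPU performance degraded")]
out_of_band = [(datetime.strptime("2024-07-17 12:36:00", FMT), "over-temperature alarm")]


def correlate(
    in_band_events: list[tuple[datetime, str]],
    out_of_band_events: list[tuple[datetime, str]],
    window: timedelta = timedelta(minutes=5),  # assumed correlation window
) -> list[tuple[str, str]]:
    """Pair in-band and out-of-band events that occur within `window` of each other."""
    pairs = []
    for t_in, ev_in in in_band_events:
        for t_out, ev_out in out_of_band_events:
            if abs(t_out - t_in) <= window:
                pairs.append((ev_in, ev_out))
    return pairs


pairs = correlate(in_band, out_of_band)
print(pairs)
```

A correlated pair such as this one is consistent with the stated diagnosis, in which the over-temperature condition explains the performance degradation.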
In one possible implementation, the handling strategy is retrieved from the built-in case library according to the diagnosis result derived from the in-band log of the automatic diagnostic system and the motherboard BMC out-of-band information that triggered the motherboard BMC alarm.
The diagnostic system also includes a built-in case library, which is used to generate the operations to be performed on the NPU after diagnosis and contains the solutions corresponding to a variety of problems.
After the test, the log parsing module need not simply output raw log data. Drawing on the large number of fault cases and solutions in the diagnostic system's built-in case library, it can generate a handling strategy for the diagnosis result, which provides a solution or recommendation for that specific result. Users then do not have to spend a great deal of time searching for and trying different solutions; they can act directly on the system's recommendations, which greatly speeds up problem resolution.
For example, for the diagnosis result in the example above, the handling strategy given by the built-in case library can include:
1. Immediately reduce the NPU's workload to avoid hardware damage from prolonged overheating.
2. Clean the dust inside the server, and check and possibly upgrade the cooling system.
3. Monitor the temperature; if the problem persists, the NPU or the related cooling components may need to be replaced.
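A case library of this kind can be pictured as a lookup keyed by fault type. The sketch below is illustrative only: the single entry reuses the example above, while the dictionary structure, the fallback message, and the function name `suggest` are assumptions.

```python
# Hypothetical case library keyed by fault type; the one entry mirrors the example above.
CASE_LIBRARY: dict[str, list[str]] = {
    "NPU over-temperature leading to performance degradation": [
        "Immediately reduce the NPU's workload",
        "Clean dust inside the server; check and possibly upgrade the cooling system",
        "Monitor the temperature; replace the NPU or cooling components if it persists",
    ],
}

# Assumed fallback for fault types with no stored case.
FALLBACK = ["No matching case; collect logs for manual analysis"]


def suggest(fault_type: str) -> list[str]:
    """Return the stored handling strategy for `fault_type`, or a fallback."""
    return CASE_LIBRARY.get(fault_type, FALLBACK)


steps = suggest("NPU over-temperature leading to performance degradation")
for number, step in enumerate(steps, start=1):
    print(f"{number}. {step}")
```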
After the diagnosis is complete, the diagnostic system can write the diagnosis result to a file on the server or send it over the network to a remote location. The result can be a detailed report listing all detected problems and possible solutions. Once the diagnosis is complete and the result has been obtained, the diagnostic device that stores the diagnostic system can be removed from the server so that the diagnostic system on that device can be reused on other servers.
For ease of use, the automatic diagnostic system can also provide a user-friendly interface. With a few simple operations, users can start fault diagnosis, view the diagnosis result, and obtain solutions. This intuitive interaction design lowers the barrier to using the diagnostic system, so that more users can benefit from it.
In practice, new problems keep appearing as systems and devices are updated and upgraded. The diagnostic system can store the diagnosis result, test scripts, and fault handling strategy from each diagnosis in its built-in case library, which gives it the ability to keep learning and to improve its diagnostic performance by learning from new fault cases. This self-optimization helps the system stay effective as conditions change.
For the application flow of the diagnostic system's modules in the diagnosis method, see Figure 11, a schematic diagram of the operation flow of the fault pre-diagnosis module provided in an embodiment of this application. The flow includes:
S21. Read the server motherboard BMC log.
After the diagnostic system starts running, the fault pre-diagnosis module reads the server motherboard BMC log, which includes the NPU's BMC out-of-band information passed through by the NPU's BMC.
S22. Generate the pre-diagnosis result based on the server motherboard BMC log.
The fault pre-diagnosis module analyzes the server motherboard BMC log it has read to determine whether the fault is an NPU fault. If it is, the module generates the pre-diagnosis result and sends it to the fault diagnosis module.
S23. Invoke the test script, load the NPU driver, and test the NPU computing unit.
Based on the received pre-diagnosis result, the fault diagnosis module invokes the corresponding test script and loads the NPU driver. After the driver is loaded, it tests the NPU computing unit according to the test script.
S24. During the test, read the server motherboard BMC log.
During the test, the fault diagnosis module reads the server motherboard BMC log and determines from it whether the server motherboard BMC has raised an alarm.
S25. When the server motherboard BMC raises an alarm, terminate the test script.
When the fault diagnosis module detects a server motherboard BMC alarm, it stops the test script that is currently executing.
S26. Generate the diagnosis result based on the server motherboard BMC log that triggered the motherboard BMC alarm.
After the server motherboard BMC raises an alarm, the log parsing module obtains the in-band log generated by the diagnostic system and the server motherboard BMC log that triggered the alarm, and analyzes them to obtain the diagnosis result.
In summary, the diagnostic system provided in the embodiments of this application has the following beneficial effects:
1. Rather than relying on in-band logs for diagnosis, the embodiments of this application synchronize out-of-band information to collect hardware alarm states, use intelligent analysis to predict the location and type of the fault, and selectively run performance tests. Fault diagnosis and performance testing are therefore carried out without directly touching or entering the customer's operating system and without depending on the in-band logs of the customer's OS. This solves the difficulty of locating faults when logs cannot be obtained because of customer data security considerations.
2. Rather than relying on inefficient manual execution of test commands, the diagnostic system can, after completing fault pre-diagnosis, call its built-in case library and automatically generate fault handling strategies based on the diagnosis result. These strategies may include targeted test commands, configuration adjustment suggestions, and the like. Generating the strategies automatically both improves efficiency and reduces the likelihood of human error.
3. Full testing takes a long time and cannot meet the requirements for rapid repair and service recovery. By contrast, the embodiments of this application can use the fault pre-diagnosis result to design targeted test plans that test only the areas likely to have problems instead of the entire system. This greatly shortens test time while keeping the tests targeted and effective. In addition, once a problem is found, the handling strategies in the built-in case library can be invoked immediately to repair it, enabling rapid recovery.
4. In the embodiments of this application, connecting a diagnostic device that stores the diagnostic system to the server under diagnosis is sufficient to run the diagnostic system on the CPU and diagnose the NPU. As an independent service, the diagnostic system can be deployed quickly in a variety of environments without repeating complex tool installation and configuration. This greatly simplifies deployment and configuration across different customer environments and reduces operational complexity and the probability of errors.
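The case-library lookup mentioned in effects 2 and 3 could be sketched as a simple table from fault type to handling strategy. The entries, field names, and the fallback to a full test below are assumptions made for illustration; the source does not describe the structure of the case library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    test_commands: tuple[str, ...] = ()
    config_suggestions: tuple[str, ...] = ()


# Hypothetical built-in case library keyed by pre-diagnosed fault type.
CASE_LIBRARY = {
    "npu_compute": Strategy(
        test_commands=("tests/npu_compute.sh",),
        config_suggestions=("placeholder: suggestion for compute faults",),
    ),
    "npu_memory": Strategy(test_commands=("tests/npu_memory.sh",)),
}

# Assumed fallback when no case matches: run the full test suite.
FULL_TEST = Strategy(test_commands=("tests/full_suite.sh",))


def strategy_for(fault_type: str, library=CASE_LIBRARY) -> Strategy:
    """Return the targeted strategy for a fault type, or the full-test fallback."""
    return library.get(fault_type, FULL_TEST)
```

A known fault type yields only the targeted commands, which is what shortens test time relative to always running the full suite.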
The above are some specific implementations of the diagnostic method provided in the embodiments of this application. On this basis, the embodiments of this application further provide a computer program product. When the computer program product runs on a computer, the computer implements the diagnostic method provided in the embodiments of this application.
The embodiments of this application further provide a corresponding device and a computer storage medium for implementing the diagnostic method provided in the embodiments of this application.
The device includes a memory and a processor. The memory is configured to store instructions or code, and the processor is configured to execute the instructions or code so that the device performs the diagnostic method described in any embodiment of this application.
The computer storage medium stores code. When the code is run, the device running the code implements the diagnostic method described in any embodiment of this application.
It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by cross-reference. Because the systems or apparatuses disclosed in the embodiments correspond to the methods disclosed in the embodiments, they are described relatively briefly; for the relevant details, refer to the description of the methods.
It should be understood that, in the embodiments of this application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent three cases: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of single items or plural items. For example, at least one (item) of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.
It should be understood that orientation or positional terms such as "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the accompanying drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. They therefore should not be construed as limiting the present invention.
It should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "joined", and "connected" should be interpreted broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the embodiments of this application. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the embodiments of this application. Accordingly, the embodiments of this application are not limited to the embodiments shown herein, but are to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410984695.5A CN119025309B (en) | 2024-07-19 | 2024-07-19 | Diagnostic method and computer program product |
| CN202410984695.5 | 2024-07-19 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2026016559A1 (en) | 2026-01-22 |
Family
ID=93526182
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/089095 Pending WO2026016559A1 (en) | 2024-07-19 | 2025-04-15 | Diagnosis method and computer program product |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN119025309B (en) |
| WO (1) | WO2026016559A1 (en) |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113849329B (en) * | 2021-08-26 | 2023-07-14 | 苏州浪潮智能科技有限公司 | An operating system log analysis integration method and system |
| CN113777476B (en) * | 2021-08-30 | 2024-02-23 | 苏州浪潮智能科技有限公司 | A GPU fault diagnosis system, diagnosis method, equipment and readable storage medium |
| CN114675991B (en) * | 2022-03-28 | 2025-04-25 | 苏州浪潮智能科技有限公司 | A method, system, device and storage medium for realizing effective log positioning |
| CN117370063A (en) * | 2023-10-24 | 2024-01-09 | 浪潮云信息技术股份公司 | A cloud server memory fault feature extraction method, system and related devices |
- 2024-07-19: CN application CN202410984695.5A, patent CN119025309B (Active)
- 2025-04-15: WO application PCT/CN2025/089095, publication WO2026016559A1 (Pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN119025309A (en) | 2024-11-26 |
| CN119025309B (en) | 2025-08-19 |