CN116560936A - Anomaly monitoring method, coprocessor and computing device - Google Patents
Anomaly monitoring method, coprocessor and computing device
- Publication number
- CN116560936A (application CN202210109088.5A)
- Authority
- CN
- China
- Prior art keywords
- monitoring
- coprocessor
- processor
- bus
- main processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
An anomaly monitoring method is applied to a mainboard that includes a main processor and a coprocessor, where the main processor and the coprocessor are coupled through a bus. During anomaly monitoring, the coprocessor performs chip-level anomaly monitoring on the mainboard, which includes monitoring at least one of the main processor on the mainboard or the bus that couples the main processor to the coprocessor. When a chip-level anomaly is detected on the mainboard, the coprocessor generates an indication that a chip-level anomaly has occurred on the mainboard. This not only implements anomaly monitoring; because the coprocessor performs the monitoring at chip level on the mainboard itself, the monitoring is generally unaffected by network transmission delay or by the operating load of the mainboard, which improves the accuracy and reliability of anomaly monitoring.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to an anomaly monitoring method, a coprocessor, and a computing device.
Background
After running for a long time, computing devices such as servers may start to operate abnormally. For example, some hardware components in a computing device may age or suffer electromagnetic interference after long-term operation, which can corrupt the operating data of the computing device and cause the computing device to crash.
At present, a computing device is usually paired with a dedicated monitoring device. The monitoring device periodically sends a connection request to the computing device over the Secure Shell (SSH) protocol. If the connection succeeds, the monitoring device determines that the computing device has not crashed; if the connection fails, it determines that the computing device has crashed. However, even when the computing device is running normally, the connection request may not be answered in time, either because of a large network transmission delay or because the computing device is heavily loaded. In that case the monitoring device may wrongly conclude that the computing device has crashed, which reduces monitoring accuracy.
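The misjudgment described above can be illustrated with a small simulation. This is only a sketch: the `Probe` record, the `external_verdict` function, and the 5-second timeout are hypothetical names and values chosen for illustration, and no real SSH connection is made.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One simulated connection attempt from the external monitoring device."""
    device_running: bool   # ground truth, not visible to the monitor
    reply_delay_s: float   # network delay plus time the loaded device needs to answer


def external_verdict(probe: Probe, timeout_s: float = 5.0) -> str:
    """Verdict of a monitor that only sees whether a reply arrived in time."""
    replied_in_time = probe.device_running and probe.reply_delay_s <= timeout_s
    return "running" if replied_in_time else "crashed"


# A healthy but heavily loaded device answers too late and is misjudged.
slow_but_healthy = Probe(device_running=True, reply_delay_s=8.0)
print(external_verdict(slow_but_healthy))  # prints "crashed"
```

The monitor cannot distinguish a late reply from a missing one, which is the source of the false alarm.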
Therefore, how to improve the accuracy and reliability of anomaly monitoring for computing devices has become an important problem that urgently needs to be solved.
Summary
The present application provides an anomaly monitoring method, a coprocessor, a computing device, an anomaly monitoring apparatus, a computer-readable storage medium, and a computer program product, which are used to improve the accuracy and reliability of anomaly monitoring for a computing device.
In a first aspect, the present application provides an anomaly monitoring method applied to a mainboard that includes a main processor and a coprocessor, the main processor and the coprocessor being coupled through a bus. During anomaly monitoring, the coprocessor performs chip-level anomaly monitoring on the mainboard. The chip-level anomaly monitoring includes monitoring the main processor on the mainboard, monitoring the bus that couples the main processor to the coprocessor, or monitoring both the main processor and the bus at the same time. When a chip-level anomaly is detected on the mainboard, the coprocessor generates an indication that a chip-level anomaly has occurred on the mainboard.
This not only implements anomaly monitoring; because the coprocessor performs chip-level anomaly monitoring on the mainboard, the coprocessor is generally unaffected by network transmission delay or by the operating load of the mainboard during monitoring, which improves the accuracy and reliability of anomaly monitoring. In addition, the coprocessor can monitor anomaly sources at chip level. When an anomaly occurs, it is therefore possible to quickly determine whether the main processor is abnormal, whether the bus between the main processor and the coprocessor is abnormal, or whether both are abnormal at the same time. The anomaly source can thus be identified accurately, the ability to locate the root cause of the anomaly is strengthened, and the fault recovery delay can be reduced.
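As a rough sketch of the three outcomes described above (main processor abnormal, bus abnormal, or both), the chip-level result could be modelled as a small record. The names `ChipLevelAnomaly` and `generate_anomaly` are illustrative assumptions, not terms from the application, and the two boolean inputs stand in for whatever checks the coprocessor actually performs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChipLevelAnomaly:
    """Hypothetical record of which monitored object is abnormal."""
    main_processor_abnormal: bool
    bus_abnormal: bool

    def source(self) -> str:
        parts = []
        if self.main_processor_abnormal:
            parts.append("main processor")
        if self.bus_abnormal:
            parts.append("bus")
        return " and ".join(parts)


def generate_anomaly(processor_ok: bool, bus_ok: bool) -> Optional[ChipLevelAnomaly]:
    """Return None when both monitored objects are healthy."""
    if processor_ok and bus_ok:
        return None
    return ChipLevelAnomaly(not processor_ok, not bus_ok)
```

Keeping the two flags separate is what lets a later consumer tell a processor fault from a bus fault without re-running any check.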
In a possible implementation, the coprocessor may perform the chip-level anomaly monitoring on the mainboard in real time. In this way, the coprocessor can detect in real time whether the main processor or the bus on the mainboard is abnormal, so that when at least one of them is abnormal it can be reported and an alarm raised promptly, reducing the fault recovery delay.
In a possible implementation, when the coprocessor monitors the mainboard in real time, it may specifically monitor in real time whether a no-instruction timeout interrupt is received from the main processor. The no-instruction timeout interrupt indicates that one or more processor cores in the main processor have not received an instruction within a preset duration. When the no-instruction timeout interrupt is received, the coprocessor can determine that a chip-level anomaly has occurred on the mainboard, specifically that the main processor is abnormal. Monitoring for this interrupt thus allows the coprocessor to judge in real time whether the main processor is abnormal and to locate the fault quickly when it is.
In a possible implementation, when the coprocessor monitors the mainboard in real time, it may specifically monitor in real time whether an interrupt signal is received from the internal bus of the main processor. This interrupt signal indicates an operating fault of the internal bus of the main processor, such as blocked data transmission on the internal bus or an interrupted data transmission link. When the interrupt signal of the internal bus is received, the coprocessor can determine that a chip-level anomaly has occurred on the mainboard, specifically that the internal bus of the main processor is abnormal. Monitoring for this interrupt signal thus allows the coprocessor to judge in real time whether the internal bus is abnormal and to locate the fault quickly when it is.
In a possible implementation, when the coprocessor monitors the mainboard in real time, it may specifically monitor in real time whether an interrupt signal is received from the bus that couples the coprocessor to the main processor. This interrupt signal indicates an operating fault of that bus, such as blocked data transmission on the bus or an interrupted data transmission link. When the interrupt signal of the bus is received, the coprocessor can determine that a chip-level anomaly has occurred on the mainboard, specifically that the bus coupling the coprocessor to the main processor is abnormal. Monitoring for this interrupt signal thus allows the coprocessor to judge in real time whether the bus is abnormal and to locate the fault quickly when it is.
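The three real-time cases above share one shape: each interrupt type maps to one anomaly source. A minimal sketch of that mapping follows. The enum members, the `ANOMALY_SOURCE` table, and the `on_interrupt` handler are hypothetical names; real interrupt numbers and handler registration depend on the hardware and are not specified in the application.

```python
from enum import Enum


class Interrupt(Enum):
    """Hypothetical identifiers for the three monitored interrupts."""
    NO_INSTRUCTION_TIMEOUT = "no-instruction timeout"
    INTERNAL_BUS_FAULT = "internal bus fault"
    COUPLING_BUS_FAULT = "coupling bus fault"


# Each interrupt points at exactly one anomaly source.
ANOMALY_SOURCE = {
    Interrupt.NO_INSTRUCTION_TIMEOUT: "main processor",
    Interrupt.INTERNAL_BUS_FAULT: "internal bus of the main processor",
    Interrupt.COUPLING_BUS_FAULT: "bus coupling the coprocessor and the main processor",
}


def on_interrupt(interrupt: Interrupt) -> str:
    """Handler run by the coprocessor when one of the monitored interrupts arrives."""
    return f"chip-level anomaly on mainboard: {ANOMALY_SOURCE[interrupt]}"


print(on_interrupt(Interrupt.NO_INSTRUCTION_TIMEOUT))
# prints "chip-level anomaly on mainboard: main processor"
```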
In a possible implementation, the coprocessor may perform the chip-level anomaly monitoring on the mainboard periodically. In this way, the coprocessor can not only detect whether the main processor or the bus on the mainboard is abnormal, but also reduce the cost of anomaly monitoring.
In a possible implementation, when the coprocessor monitors the mainboard periodically, it may specifically monitor, within a first monitoring period, whether a heartbeat message sent by a first processor core of the main processor is received. The heartbeat message indicates that the first processor core is running normally. When no heartbeat message is received within the first monitoring period, the coprocessor determines that a chip-level anomaly has occurred on the mainboard, specifically that the first processor core is abnormal. Periodic monitoring thus makes it possible to determine whether the first processor core in the main processor is abnormal and, when it is, to locate the anomaly source precisely.
In a possible implementation, after determining that the first processor core is abnormal, the coprocessor may monitor, within a second monitoring period, whether a heartbeat message sent by a second processor core is received. The heartbeat message sent by the second processor core indicates that the second processor core is running normally, and the second processor core is a processor core in the main processor other than the first processor core. When no heartbeat message is received from the second processor core, the coprocessor can determine that the second processor core is abnormal. In this way, after the first processor core becomes abnormal, the coprocessor can check whether the remaining processor cores of the main processor are also abnormal and use the heartbeat messages to pinpoint the other abnormal processor cores.
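The two-stage check above (first core in the first period, remaining cores in the second period) can be sketched as follows. This is an illustration under assumptions: cores are identified by integers, and each period is represented simply by the set of core IDs whose heartbeat arrived in it. The function name `find_abnormal_cores` is hypothetical.

```python
from typing import Iterable, List, Set


def find_abnormal_cores(first_core: int,
                        all_cores: Iterable[int],
                        period1_heartbeats: Set[int],
                        period2_heartbeats: Set[int]) -> List[int]:
    """Return the cores judged abnormal after the first and second monitoring periods."""
    if first_core in period1_heartbeats:
        return []                       # first core is alive, nothing to report
    abnormal = [first_core]
    # Second monitoring period: check every other core of the main processor.
    for core in all_cores:
        if core != first_core and core not in period2_heartbeats:
            abnormal.append(core)
    return abnormal


# Core 0 is silent in period 1; in period 2 only cores 1 and 3 send heartbeats.
print(find_abnormal_cores(0, range(4), {1, 2, 3}, {1, 3}))  # prints [0, 2]
```

The wider scan is only triggered once the first core has failed, which keeps the steady-state monitoring cost low.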
In a possible implementation, when the coprocessor monitors the mainboard periodically, it may specifically monitor all processor cores in the main processor periodically. When no heartbeat message is received from a third processor core among all the processor cores of the main processor, the coprocessor determines that the third processor core is abnormal, where the third processor core is any processor core in the main processor. In this way, when any processor core in the main processor is abnormal, the coprocessor can pinpoint the abnormal processor core through periodic monitoring.
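One way to picture the all-core periodic scan is a table of last-seen heartbeat times that the coprocessor checks once per period. The sketch below is an assumption-laden illustration: the `HeartbeatWatch` class, its methods, and the use of explicit timestamps in seconds are all hypothetical and chosen so that the example is deterministic.

```python
from typing import Dict, Iterable, List


class HeartbeatWatch:
    """Tracks the last heartbeat time of every core (hypothetical helper)."""

    def __init__(self, core_ids: Iterable[int], period_s: float, start_s: float = 0.0):
        self.period_s = period_s
        self.last_seen: Dict[int, float] = {core: start_s for core in core_ids}

    def record(self, core_id: int, now_s: float) -> None:
        """Called when a heartbeat message from core_id arrives."""
        self.last_seen[core_id] = now_s

    def abnormal(self, now_s: float) -> List[int]:
        """Cores whose last heartbeat is older than one monitoring period."""
        return sorted(core for core, seen in self.last_seen.items()
                      if now_s - seen > self.period_s)


watch = HeartbeatWatch(core_ids=[0, 1, 2], period_s=1.0)
watch.record(0, now_s=0.9)
watch.record(2, now_s=1.0)
print(watch.abnormal(now_s=1.5))  # prints [1]
```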
In a possible implementation, in addition to the main processor and the coprocessor, the mainboard may further include a baseboard management controller. After generating the indication that a chip-level anomaly has occurred on the mainboard, the coprocessor may report the chip-level anomaly to the baseboard management controller to trigger the baseboard management controller to raise an anomaly alarm. In this way, the baseboard management controller can use the anomaly alarm to prompt operation and maintenance personnel to recover the anomaly source promptly, reducing the anomaly recovery delay.
In a possible implementation, the bus coupling the coprocessor to the main processor may include at least one of an Advanced eXtensible Interface (AXI) bus, a HyperTransport (HT) bus, a QuickPath Interconnect (QPI) bus, a Unified Bus (UB), a Compute Express Link (CXL), or a Peripheral Component Interconnect Express (PCIe) bus, or may be another signal line that can be used for data communication. This embodiment does not limit this.
In a second aspect, the present application further provides an anomaly monitoring method applied to a baseboard management controller (BMC). The BMC is located on a mainboard which, in addition to the BMC, includes a main processor and a coprocessor coupled through a bus. The method includes: the BMC receives a report from the coprocessor that a chip-level anomaly has occurred on the mainboard, where the chip-level anomaly means that the main processor is abnormal or the bus is abnormal; the BMC then parses the chip-level anomaly, generates an anomaly monitoring result, and raises an anomaly alarm according to the anomaly monitoring result. In this way, the baseboard management controller can use the anomaly alarm to prompt operation and maintenance personnel to recover the anomaly source promptly, reducing the anomaly recovery delay.
In a possible implementation, the mainboard further includes a memory. The BMC may also generate a monitoring log that contains the anomaly monitoring result and then write the monitoring log to the memory. When operation and maintenance personnel perform anomaly recovery, they can view the monitoring log in the memory to obtain specific information about the anomaly source, which helps them locate the anomaly source precisely and recover quickly.
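The BMC-side flow (receive the report, parse it, generate a monitoring result, raise an alarm, write a log) might look roughly like the sketch below. The JSON report format, the field names `source` and `detail`, and the function `bmc_handle_report` are assumptions made for illustration; the application does not specify a message format. A plain Python list stands in for the memory.

```python
import json
from typing import List


def bmc_handle_report(raw_report: str, log_store: List[str]) -> str:
    """Parse a coprocessor report, log the monitoring result, and return the alarm text."""
    report = json.loads(raw_report)
    result = {
        "event": "chip-level anomaly",
        "source": report["source"],          # e.g. "main processor" or "bus"
        "detail": report.get("detail", ""),  # e.g. which core missed its heartbeat
    }
    log_store.append(json.dumps(result))     # stands in for writing to the memory
    return f"ALARM: {result['event']} ({result['source']})"


memory_log: List[str] = []
alarm = bmc_handle_report('{"source": "main processor", "detail": "core 2"}', memory_log)
print(alarm)  # prints "ALARM: chip-level anomaly (main processor)"
```

Storing the full result in the log, rather than only the alarm text, is what gives maintenance staff the detail needed to locate the anomaly source later.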
In a third aspect, the present application further provides an anomaly monitoring apparatus that includes modules for performing the anomaly monitoring method of the first aspect or of any possible implementation of the first aspect.
In a fourth aspect, the present application further provides an anomaly monitoring apparatus that includes modules for performing the anomaly monitoring method of the second aspect or of any possible implementation of the second aspect.
In a fifth aspect, the present application further provides a coprocessor. The coprocessor consists of logic circuits and is configured to perform the operation steps of the anomaly monitoring method of the first aspect or of any possible implementation of the first aspect.
In a sixth aspect, the present application further provides a computing device that includes a mainboard. The mainboard includes a main processor and a coprocessor, and the coprocessor is configured to perform, according to computer instructions, the operation steps of the anomaly monitoring method of the first aspect or of any possible implementation of the first aspect.
In a seventh aspect, the present application provides a computer-readable storage medium that stores instructions. When the instructions run on a coprocessor, they cause the coprocessor to perform the operation steps of the method of the first aspect or of any implementation of the first aspect.
In an eighth aspect, the present application provides a computer program product containing instructions. When the instructions run on a coprocessor, they cause the coprocessor to perform the operation steps of the method of the first aspect or of any implementation of the first aspect.
The implementations provided in the above aspects may be further combined to provide additional implementations.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an example computing device provided by the present application;
FIG. 2 is a schematic flowchart of an anomaly monitoring method provided by the present application;
FIG. 3 is a schematic diagram of an example display interface provided by the present application;
FIG. 4 is a schematic structural diagram of an anomaly monitoring apparatus provided by the present application;
FIG. 5 is a schematic structural diagram of another anomaly monitoring apparatus provided by the present application;
FIG. 6 is a schematic diagram of the hardware structure of a computing device provided by the present application.
Detailed Description
The technical solutions in the present application are described below with reference to the drawings in the embodiments of the present application.
FIG. 1 is a schematic diagram of the hardware structure of a computing device provided by the present application. As shown in FIG. 1, the computing device 100 includes a main processor 101 and a coprocessor 102. The main processor 101 and the coprocessor 102 may be coupled through a bus (not shown in FIG. 1), so that the coprocessor 102 can exchange data with the main processor 101 over the bus. The main processor 101 and the coprocessor 102 may be deployed on a mainboard 201 in the computing device 100.
The main processor 101 runs the operating system and, for services running on the computing device 100, may also perform the corresponding data operations and resource scheduling. The coprocessor 102 assists the main processor 101 with management and control, and some coprocessors 102 can also perform tasks that the main processor 101 cannot.
As some examples, the main processor 101 may be a central processing unit (CPU), including a processor based on the Advanced RISC Machine (ARM) architecture (such as ARMv8 or ARMv9) or an x86 processor based on a complex instruction set architecture (such as Skylake). Alternatively, the main processor 101 may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The coprocessor 102 may be, for example, any one of an auxiliary processor on a system on chip (SoC), a graphics processing unit (GPU), an intelligent management unit (IMU), a management engine (ME), or a microcontroller, or may be another coprocessor. This embodiment does not limit the specific implementation of the main processor 101 or the coprocessor 102.
The bus between the main processor 101 and the coprocessor 102 may be, for example, one of an Advanced eXtensible Interface (AXI) bus, a HyperTransport (HT) bus, a QuickPath Interconnect (QPI) bus, a Unified Bus (UB), a Compute Express Link (CXL), or a Peripheral Component Interconnect Express (PCIe) bus, or may be another signal line that can be used for information exchange, such as a shared address line or data line. This embodiment does not limit this.
实际应用场景中,由于硬件老化或者受到电磁干扰等原因,可能导致主处理器101与协处理器102之间的总线或者主处理器101发生异常,从而进一步导致计算设备100发生宕机等故障。为此,本申请实施例提供了一种异常监测方法,由协处理器102对主板进行芯片级异常监测,也即对主板中处理器101或总线中至少一种进行监控。并且,当监测到主处理器101以及总线中的至少一种发生异常时,协处理器101可以生成主板201出现芯片级异常。如此,不仅可以实现对于计算设备100的异常监测,而且,由于协处理器102是在计算设备100内部执行监测过程,这可以使得协处理器102在得到异常监测结果的过程中通常不会受到网络传输时延或者计算设备100的运行负载的影响,以此可以提高异常监测的准确性以及可靠性。In actual application scenarios, due to hardware aging or electromagnetic interference, etc., the bus between the main processor 101 and the coprocessor 102 or the main processor 101 may be abnormal, which further causes failures such as downtime of the computing device 100 . To this end, the embodiment of the present application provides an abnormality monitoring method, in which the coprocessor 102 performs chip-level abnormality monitoring on the motherboard, that is, monitors at least one of the processor 101 or the bus in the motherboard. Moreover, when detecting that at least one of the main processor 101 and the bus is abnormal, the coprocessor 101 may generate a chip-level abnormality on the motherboard 201 . In this way, not only can the abnormal monitoring of the computing device 100 be realized, but also, since the coprocessor 102 executes the monitoring process inside the computing device 100, this can make the coprocessor 102 usually not be affected by the network during the process of obtaining the abnormal monitoring result. The transmission delay or the influence of the operating load of the computing device 100 can improve the accuracy and reliability of anomaly monitoring.
另外,协处理器102可以实现芯片级别的异常源的监测,这使得在计算设备100发生宕机等故障后,可以快速定位出主处理器101以及主处理器101与协处理器102之间的总线中的至少一种发生异常,从而可以准确识别异常源,有效增强定位计算设备100异常根因的能力,有助于可以降低计算设备100的故障恢复时延。In addition, the coprocessor 102 can realize the monitoring of the abnormal source at the chip level, which makes it possible to quickly locate the main processor 101 and the connection between the main processor 101 and the coprocessor 102 after the computing device 100 has a fault such as a downtime. At least one of the buses is abnormal, so that the source of the abnormality can be accurately identified, the ability to locate the root cause of the abnormality of the computing device 100 is effectively enhanced, and the failure recovery delay of the computing device 100 can be reduced.
值得注意的是,图1中是以主处理器101以及协处理器102在一个芯片上进行集成部署为例进行说明(即协处理器102位于主处理器101所在的芯片),在其他可能的实现方式中,主处理器101以及协处理器102之间也可以是分别部署于不同的芯片。It should be noted that, in FIG. 1 , the integrated deployment of the main processor 101 and the coprocessor 102 on one chip is used as an example for illustration (that is, the coprocessor 102 is located on the chip where the main processor 101 is located), and in other possible In an implementation manner, the main processor 101 and the coprocessor 102 may also be respectively deployed on different chips.
进一步地,计算设备100不仅包括主处理器101、协处理器102,还可以包括基板管理控制器(baseboard management controller,BMC)103以及存储器104,并且,BMC103以及存储器104可以部署于主板201上。其中,BMC103用于接收协处理器102提供的信息,并负责对该信息进行解析以及信息存储;BMC103可以通过智能平台管理接口(intelligentplatform management interface,IPMI)接口或者其它硬件接口与协处理器102进行数据通信,如BMC103可以通过IPMI接口接收协处理器102上报的主板201出现芯片级异常等。存储器104用于负责数据存储,例如可以是嵌入式多媒体卡(embedded multi media card,EMMC)等;存储器104可以通过安全数字输入输出(secure digital input and output,SDIO)接口或者其它接口接收BMC103发送的数据,并进行数据存储,如存储器104通过SDIO接口接收监测日志并存储监测日志等。Further, the computing device 100 not only includes a main processor 101 and a coprocessor 102, but also includes a baseboard management controller (baseboard management controller, BMC) 103 and a memory 104, and the BMC 103 and the memory 104 can be deployed on the motherboard 201. Wherein, BMC103 is used for receiving the information that coprocessor 102 provides, and is responsible for analyzing this information and information storage; For data communication, for example, the BMC 103 can receive a chip-level abnormality reported by the coprocessor 102 through the IPMI interface. The memory 104 is used to be responsible for data storage, such as an embedded multi media card (embedded multi media card, EMMC) etc.; the memory 104 can receive the data sent by the BMC103 through a secure digital input and output (secure digital input and output, SDIO) interface or other interfaces data, and store the data, for example, the memory 104 receives the monitoring log through the SDIO interface and stores the monitoring log, etc.
另外,计算设备100内部还可以配置外设105(图1中以外设105配置于计算设备100内部为例),或者计算设备100可以外接外设105等,外设105可以用于为计算设备100对外呈现信息或者向计算设备100输入数据。并且,外设105可以通过视频图形阵列(videographics array,VGA)接口或者其它接口与BMC103进行数据交互。作为一些示例,外设105进行可以是显示器,或者可以是键盘显示器鼠标(keyboard video mouse,KVM)等。In addition, the computing device 100 can also be equipped with peripherals 105 (in FIG. Information is presented externally or data is entered into computing device 100 . Moreover, the peripheral device 105 can perform data interaction with the BMC 103 through a video graphics array (videographics array, VGA) interface or other interfaces. As some examples, the peripheral device 105 may be a monitor, or may be a keyboard video mouse (keyboard video mouse, KVM) or the like.
实际应用时,计算设备100具体可以是计算服务器,例如高性能计算(highperformance computing,HPC)集群中的服务器等。或者,计算设备100可以是具有数据存储能力的存储服务器,此时,计算设备100中还可以集成有固态硬盘(solid state disk,SSD)等存储元件。或者,计算设备100可以是终端设备,如手机、笔记本电脑等便携式终端设备等。本实施例中,对于计算设备100的具体应用场景并不进行限定。In practical applications, the computing device 100 may specifically be a computing server, such as a server in a high performance computing (highperformance computing, HPC) cluster. Alternatively, the computing device 100 may be a storage server capable of storing data. In this case, the computing device 100 may also be integrated with a storage element such as a solid state disk (solid state disk, SSD). Alternatively, the computing device 100 may be a terminal device, such as a portable terminal device such as a mobile phone or a notebook computer. In this embodiment, specific application scenarios of the computing device 100 are not limited.
It should be noted that the computing device 100 shown in FIG. 1 is only an example, and in practical applications the computing device 100 may adopt other applicable architectures. For example, in other possible computing devices 100, a power supply may also be integrated on the motherboard 201; the power supply is connected to each of the main processor 101, the coprocessor 102, the BMC 103 and the memory 104 and supplies power to them. Alternatively, a larger number of main processors or coprocessors may be configured in the computing device 100.
The anomaly monitoring process for the computing device 100 provided by this application is described further below with reference to the accompanying drawings.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of an anomaly monitoring method provided by an embodiment of this application. The method may be applied to the computing device 100 shown in FIG. 1, and may specifically be executed by the coprocessor 102 in FIG. 1.
As shown in FIG. 2, the method may specifically include:
S201: The coprocessor 102 performs chip-level anomaly monitoring on the motherboard 201. The chip-level anomaly monitoring includes anomaly monitoring of at least one of the main processor 101 in the motherboard 201 or the bus between the main processor 101 and the coprocessor 102.
Anomaly monitoring performed from outside the computing device 100 may be affected by network transmission delay and by the load of the computing device 100 itself. In this embodiment, therefore, the coprocessor 102 may monitor, from inside the computing device 100, the fault sources that cause failures such as a crash of the computing device 100. In a specific implementation, the coprocessor 102 may perform chip-level anomaly monitoring on the motherboard 201. As some examples, the chip-level anomaly monitoring performed by the coprocessor 102 on the motherboard 201 mainly includes any one of the following modes:
Mode 1: the coprocessor 102 may perform anomaly monitoring on the main processor 101.
In a first implementation example of monitoring the main processor 101, the coprocessor 102 may monitor in real time whether a no-instruction timeout interrupt is received from the main processor 101. The no-instruction timeout interrupt indicates that one or more processor cores in the main processor 101 have not received an instruction within a preset duration, and when the coprocessor 102 receives this interrupt, it may determine that the main processor 101 is abnormal. The main processor 101 may include one or more processor cores. When the main processor 101 determines that one or more processor cores have not received any instruction submitted by the operating system kernel within the preset duration, the likely cause is that the processor core has failed and has therefore remained in a no-commit state for a long time. In this case, the main processor 101 may generate the no-instruction timeout interrupt and send it to the coprocessor 102, thereby notifying the coprocessor 102 that an anomaly has occurred.
Further, after receiving the no-instruction timeout interrupt, the coprocessor 102 may access the main processor 101 and collect anomaly information. Multiple components may be integrated on the main processor 101. In addition to the processor cores described above, these may include a cache, a memory, a memory controller, an input/output (IO) controller, an IO expansion controller, and other logic circuits capable of performing computing functions. The IO controller may be, for example, a PCIe controller, a serial attached small computer system interface (SAS) controller, or a serial advanced technology attachment (SATA) controller; the IO expansion controller may be, for example, an inter-integrated circuit (I2C) controller. Therefore, after receiving the no-instruction timeout interrupt, the coprocessor 102 may access components such as the cache, the memory and the memory controller on the main processor 101 through the bus and determine whether each component is abnormal (for example, whether a storage error or a control error has occurred in the component). In this way, it can collect information about the abnormal components, such as the name of each abnormal component and the cause of the anomaly. Determining whether a component such as a cache or a memory is abnormal by accessing it is already done in practical application scenarios, and is not described in detail in this embodiment.
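The collection step described above can be sketched roughly as follows. This is an illustrative sketch only: the names `AnomalyRecord`, `Probe` and `on_no_commit_timeout`, and the idea of modelling each component check as a callable probe, are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class AnomalyRecord:
    component: str  # name of the abnormal component
    cause: str      # cause reported by the probe


# Hypothetical probe type: returns None when the component is healthy,
# otherwise a short description of the cause of the anomaly.
Probe = Callable[[], Optional[str]]


def on_no_commit_timeout(probes: Dict[str, Probe]) -> List[AnomalyRecord]:
    """Handle a no-instruction timeout interrupt by probing every component."""
    records = []
    for name, probe in probes.items():
        cause = probe()
        if cause is not None:
            records.append(AnomalyRecord(name, cause))
    return records


# Stand-in probes; a real coprocessor would read component state over the bus.
probes = {
    "cache": lambda: None,
    "memory": lambda: "storage error",
    "memory controller": lambda: "control error",
}
records = on_no_commit_timeout(probes)
```

Only the components whose probe reports a cause end up in `records`, which mirrors the text's "name of the abnormal component, cause of the anomaly" pairing.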
In a second implementation example of monitoring the main processor 101, the coprocessor 102 may monitor in real time whether an interrupt signal sent by the internal bus of the main processor 101 is received. The interrupt signal may indicate that the internal bus has failed during operation, for example through communication blockage or link disconnection. The internal bus of the main processor 101 is the bus used for data exchange among the components in the main processor 101, such as a ring bus or a mesh bus. In a specific implementation, a system isolation wall (SIW) may be deployed on the internal bus, and when two components on the main processor 101 communicate over the internal bus, the SIW on the internal bus may judge whether the data communication between the two components is legal. Correspondingly, the coprocessor 102 may access the components on the main processor 101 through the internal bus. For example, the coprocessor 102 may send an access instruction to the main processor 101, and the internal bus of the main processor 101 may carry the access instruction to a component in the main processor 101. When the internal bus is running normally, the coprocessor 102 can access the components in the main processor 101 through the access instruction. When the internal bus fails, the SIW on the internal bus may return an interrupt signal, which is transmitted to the coprocessor 102. After receiving the interrupt signal from the internal bus, the coprocessor 102 can therefore determine that the main processor 101 is abnormal, and specifically that the internal bus of the main processor 101 is abnormal.
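A toy model of this access path is given below, assuming the interrupt returned by the SIW can be represented as a Python exception. The class names `Siw` and `InternalBusInterrupt`, the `BusFault` values and the `access_component` helper are hypothetical and serve only to illustrate the control flow.

```python
from enum import Enum


class BusFault(Enum):
    BLOCKED = "communication blocked"
    LINK_DOWN = "link disconnected"


class InternalBusInterrupt(Exception):
    """Stands in for the interrupt signal returned by the SIW (an assumption)."""

    def __init__(self, fault: BusFault):
        super().__init__(fault.value)
        self.fault = fault


class Siw:
    """Toy system isolation wall sitting on the internal bus."""

    def __init__(self):
        self.fault = None  # set to a BusFault to simulate a bus failure

    def forward(self, target: str, components: dict):
        if self.fault is not None:
            raise InternalBusInterrupt(self.fault)
        return components[target]


def access_component(siw: Siw, target: str, components: dict):
    """Coprocessor-side access: returns the value read, or the detected anomaly."""
    try:
        return ("ok", siw.forward(target, components))
    except InternalBusInterrupt as irq:
        return ("internal bus anomaly", irq.fault.value)


components = {"cache": "cache state"}
siw = Siw()
healthy = access_component(siw, "cache", components)
siw.fault = BusFault.LINK_DOWN
faulty = access_component(siw, "cache", components)
```

The point of the sketch is that the coprocessor always gets an answer: either the data it asked for or a signal that identifies the internal bus as the fault source.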
In a third implementation example of monitoring the main processor 101, the coprocessor 102 may periodically determine, through heartbeat monitoring, whether a first processor core in the main processor 101 has failed. The first processor core is the processor core in the main processor 101 that uniformly schedules and manages on-chip resources; in some application scenarios it may also be called the primary core. In a specific implementation, the first processor core may periodically send a heartbeat message to the coprocessor 102 while it is running. The heartbeat message may be, for example, an interrupt response sent by the first processor core, and indicates that the first processor core is running normally. Correspondingly, the coprocessor 102 may monitor whether a heartbeat message from the first processor core is received within a first monitoring period corresponding to the first processor core. If the coprocessor 102 does not receive the heartbeat message within the first monitoring period, the likely reason is that the first processor core has failed. In this case, the coprocessor 102 may determine that the main processor 101 is abnormal, and specifically that the first processor core is abnormal.
The first processor core may actively and periodically send heartbeat messages to the coprocessor 102 while it is running, so that the coprocessor 102 can determine whether the first processor core is abnormal according to whether a heartbeat message is received in each monitoring period. Alternatively, the coprocessor 102 may actively send a heartbeat instruction to the first processor core in the main processor 101 in each monitoring period through a System Control and Management Interface (SCMI), so that the first processor core generates an interrupt response (that is, a heartbeat message) based on the heartbeat instruction and feeds the interrupt response back to the coprocessor 102. In this way, when the coprocessor 102 does not receive the interrupt response within the monitoring period, it can determine that the first processor core is abnormal.
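A minimal sketch of the heartbeat check follows. It assumes a sliding-window interpretation, in which a core is treated as abnormal once more than one monitoring period has elapsed since its last heartbeat; the class name `HeartbeatMonitor` and the use of explicit timestamps instead of a hardware timer are illustrative choices, not details from the application.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of one processor core."""

    def __init__(self, period_s: float, start: float = 0.0):
        self.period_s = period_s   # monitoring period, in seconds
        self.last_seen = start     # time of the most recent heartbeat

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat (for example, an interrupt response)."""
        self.last_seen = now

    def is_abnormal(self, now: float) -> bool:
        """True if no heartbeat arrived within the last monitoring period."""
        return now - self.last_seen > self.period_s


monitor = HeartbeatMonitor(period_s=1.0)
monitor.on_heartbeat(now=0.5)
within_period = monitor.is_abnormal(now=1.2)   # 0.7 s since last heartbeat
after_timeout = monitor.is_abnormal(now=1.6)   # 1.1 s since last heartbeat
```

The same check applies to both variants in the text: whether the core sends heartbeats on its own or only in response to a heartbeat instruction, the coprocessor only needs to know when the last one arrived.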
Further, after determining that the first processor core is abnormal, the coprocessor 102 may access multiple components on the main processor 101, such as the cache, the memory, the memory controller, the IO controller, the IO expansion controller and other logic circuits capable of performing computing functions, to determine whether any component on the main processor 101 is abnormal, and may collect anomaly information for the abnormal components.
In a fourth implementation example of monitoring the main processor 101, after determining that the first processor core is abnormal, the coprocessor 102 may start monitoring whether the other processor cores on the main processor 101 are abnormal. Take monitoring of a second processor core as an example, where the second processor core is a processor core in the main processor 101 other than the first processor core. The coprocessor 102 may monitor whether a heartbeat message sent by the second processor core is received within a second monitoring period corresponding to the second processor core; the heartbeat message indicates that the second processor core is running normally. When no heartbeat message from the second processor core is received within the second monitoring period, the coprocessor 102 may determine that the second processor core is abnormal; when the heartbeat message is received within the second monitoring period, the coprocessor 102 may determine that the second processor core is running normally. The first monitoring period corresponding to the first processor core and the second monitoring period corresponding to the second processor core may be the same or different.
In a fifth implementation example of monitoring the main processor 101, the coprocessor 102 may periodically monitor all the processor cores in the main processor 101. Take monitoring of a third processor core as an example, where the third processor core is any one of the processor cores in the main processor 101. The coprocessor 102 may monitor whether a heartbeat message sent by the third processor core is received within a third monitoring period corresponding to the third processor core; the heartbeat message indicates that the third processor core is running normally. When no heartbeat message from the third processor core is received within the third monitoring period, the coprocessor 102 may determine that the third processor core is abnormal; when the heartbeat message is received within the third monitoring period, the coprocessor 102 may determine that the third processor core is running normally. In this way, the coprocessor 102 can perform anomaly monitoring on all the processor cores in the main processor 101 and identify every abnormal processor core in the main processor 101.
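Monitoring every core with its own period can be sketched as a single pass over the last heartbeat times. The function name `abnormal_cores`, the integer core identifiers and the dictionary layout are assumptions chosen for the example.

```python
from typing import Dict, List


def abnormal_cores(last_heartbeat: Dict[int, float],
                   periods: Dict[int, float],
                   now: float) -> List[int]:
    """Return the ids of all cores whose heartbeat is overdue.

    last_heartbeat maps core id -> time of its most recent heartbeat;
    periods maps core id -> that core's monitoring period (may differ per core).
    """
    return sorted(
        core
        for core, seen in last_heartbeat.items()
        if now - seen > periods[core]
    )


last_heartbeat = {0: 9.5, 1: 7.0, 2: 9.0}
periods = {0: 1.0, 1: 2.0, 2: 0.5}
overdue = abnormal_cores(last_heartbeat, periods, now=10.0)
```

Here core 0 is within its 1.0 s period, while cores 1 and 2 have exceeded theirs, so the result lists both of them rather than stopping at the first abnormal core.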
In practical applications, the coprocessor 102 may execute any one or more of the above five implementation examples at the same time, for example by simultaneously monitoring whether the internal bus and the processor cores in the main processor 101 are abnormal. This embodiment does not limit this.
Mode 2: the coprocessor 102 may perform anomaly monitoring on the bus that couples the main processor 101 and the coprocessor 102.
In a first implementation example of monitoring the bus, the coprocessor 102 may monitor in real time whether an interrupt signal is received from the bus that couples the main processor 101 and the coprocessor 102. The interrupt signal indicates that the bus has an operating fault, and when the coprocessor 102 receives the interrupt signal sent by the bus, it may determine that the bus is abnormal. When the main processor 101 is an ARM processor, the bus that couples the main processor 101 and the coprocessor 102 may send the interrupt signal to the coprocessor 102 through an SIW. Specifically, an SIW may be configured in the bus that couples the main processor 101 and the coprocessor 102, and when the coprocessor 102 accesses the main processor 101 through the bus, the access instruction sent by the coprocessor 102 first passes through the SIW. Only after determining that the access instruction is legal does the SIW transmit it to the main processor 101, which completes the access of the coprocessor 102 to the main processor 101. When an access error occurs or the access receives no response, the SIW feeds back a corresponding message to the coprocessor 102, which prevents the coprocessor 102 from hanging because of the failed access. During operation, the SIW can sense whether the bus has failed, for example through communication blockage or link disconnection. When it determines that the bus has an operating fault, it generates an interrupt signal indicating the fault and sends the interrupt signal to the coprocessor 102.
In a second implementation example of monitoring the main processor 101, where the coprocessor 102 performs anomaly monitoring in the real-time monitoring manner and the motherboard 201 is deployed based on the X86 architecture, the coprocessor 102 may communicate with the main processor 101 through a host embedded controller interface (HECI). In this case, the coprocessor 102 may specifically be a management engine, so that the coprocessor 102 can determine whether the main processor 101 has failed according to the state observed when accessing the main processor 101 through the HECI interface. For example, when an access by the coprocessor 102 to the main processor 101 through the HECI interface fails, the coprocessor 102 may determine that the main processor 101 is abnormal.
It is worth noting that the above implementation examples are only illustrations of how the coprocessor 102 may perform anomaly monitoring. In other embodiments, the coprocessor 102 may perform anomaly monitoring on at least one of the main processor 101 or the bus in other ways. In practical applications, the coprocessor 102 may also execute several of the above implementation examples at the same time, for example by monitoring the main processor 101 and the bus simultaneously. This embodiment does not limit this. In addition, the anomaly monitoring function of the coprocessor 102 may be executed automatically once the coprocessor 102 has started and is running. For example, the program code that implements the anomaly monitoring function may be burned into the firmware of the coprocessor 102 in advance, so that the coprocessor 102 begins anomaly monitoring as soon as it starts running. Alternatively, the coprocessor 102 may enable the anomaly monitoring function at any time during operation, for example when triggered by a user or by operation and maintenance personnel.
It should be noted that real-time monitoring and periodic monitoring in this embodiment are relative concepts: the anomaly detection delay of the real-time monitoring manner is much smaller than that of the periodic monitoring manner. For example, with real-time monitoring, the coprocessor 102 can detect an anomaly in the main processor 101 or the bus within a nanosecond-level delay, achieving or approximating a "real-time" effect. With periodic monitoring, the coprocessor 102 detects an anomaly in the main processor 101 or the bus within a millisecond-level or second-level delay, so the detection delay is comparatively long.
S202: When detecting that the main processor 101 or the bus is abnormal, the coprocessor 102 generates an indication that a chip-level anomaly has occurred on the motherboard 201.
In this embodiment, when the coprocessor 102 determines that at least one of the main processor 101 or the bus is abnormal, it generates an indication that a chip-level anomaly has occurred on the motherboard 201. Further, the coprocessor 102 may report this indication to the BMC 103. In this way, not only can anomalies of the computing device 100 be monitored, but anomalies can also be located at chip level: when the computing device 100 fails, the fault can be traced to at least one of the main processor 101 or the bus on one or more chips of the motherboard 201.
For example, when reporting that a chip-level anomaly has occurred on the motherboard 201, the coprocessor 102 may report the collected anomaly information to the BMC 103 together with the information indicating the chip-level anomaly. The anomaly information may be collected by the coprocessor 102 from the components on the main processor 101 once it determines that an anomaly exists, and may include, for example, the name of the abnormal component on the main processor 101 and the cause of the anomaly.
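One way to picture the combined report is as a single serialized message that carries both the chip-level indication and the collected details. The JSON layout, the field names (`event`, `source`, `details`) and the function `build_report` are purely illustrative assumptions; the application does not specify a message format, and a real implementation would use whatever encoding its IPMI or other hardware interface requires.

```python
import json
from typing import Dict, List


def build_report(source: str, details: List[Dict[str, str]]) -> str:
    """Bundle the chip-level anomaly indication with the collected details."""
    report = {
        "event": "chip_level_anomaly",  # indication for the BMC
        "source": source,               # e.g. "main_processor" or "bus"
        "details": details,             # collected anomaly information
    }
    return json.dumps(report, sort_keys=True)


message = build_report(
    "main_processor",
    [{"component": "memory", "cause": "storage error"}],
)
decoded = json.loads(message)
```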
Further, after detecting that at least one of the main processor 101 or the bus is abnormal, the coprocessor 102 may also trigger the BMC 103 to raise an anomaly alarm. Specifically, this embodiment may further include:
S203: After receiving the report from the coprocessor 102 that a chip-level anomaly has occurred on the motherboard 201, the BMC 103 raises an anomaly alarm.
In a possible implementation, after receiving the information that a chip-level anomaly has occurred on the motherboard 201, the BMC 103 may parse the information and generate an anomaly monitoring result indicating that at least one of the main processor 101 or the bus is abnormal, and may then raise an anomaly alarm according to that result. For example, the BMC 103 may generate alarm information according to the anomaly monitoring result and send it to the peripheral 105, so that the peripheral 105 can give a corresponding alarm prompt. When the peripheral 105 is a display, for instance, it may, after receiving the alarm information sent by the BMC 103, present on its display interface an alarm prompt indicating that at least one of the main processor 101 or the bus is abnormal. For example, when an anomaly is detected in the main processor 101, the peripheral 105 may present the alarm prompt "The main processor is currently abnormal, please repair it in time" on the display interface shown in FIG. 3. After viewing the alarm content, a user or operation and maintenance personnel can not only determine that the computing device 100 is abnormal, but can also quickly locate the anomaly to at least one of the main processor 101 on the motherboard 201 or the bus. Corresponding maintenance operations can then be performed on the computing device 100, which improves the efficiency of recovering the computing device 100 from the anomaly and, in turn, the efficiency of restoring the services running on it. The peripheral 105 may also raise the anomaly alarm by other means, such as voice or indicator lights, which is not limited in this embodiment.
In addition, after determining that a chip-level anomaly exists, the BMC 103 may generate a monitoring log that includes the anomaly monitoring result, so that the anomaly in at least one of the main processor 101 or the bus is recorded in the monitoring log. For example, the BMC 103 may first create the monitoring log, arrange the anomaly information collected by the coprocessor 102 into log entries in chronological order according to its timestamps, and record the entries in the monitoring log. The BMC 103 may then send the generated monitoring log to the memory 104 for storage, or store it in another way, which is not limited in this embodiment. After determining that the computing device 100 is abnormal, a user or operation and maintenance personnel can read the monitoring log stored in the memory 104 and, based on the log entries recorded in it, analyze the anomaly in at least one of the main processor 101 or the bus, so that at least one of the main processor 101 or the bus can be maintained efficiently.
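The log-building step amounts to sorting the collected records by timestamp and formatting each one as an entry. In the sketch below, the record fields (`timestamp`, `component`, `cause`), the entry format and the function name `build_monitoring_log` are assumptions introduced for illustration.

```python
from typing import Dict, List


def build_monitoring_log(records: List[Dict]) -> List[str]:
    """Arrange anomaly records into log entries in chronological order."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    return [
        f'{r["timestamp"]:.3f} {r["component"]}: {r["cause"]}'
        for r in ordered
    ]


records = [
    {"timestamp": 12.5, "component": "memory controller", "cause": "control error"},
    {"timestamp": 3.25, "component": "cache", "cause": "storage error"},
]
log_entries = build_monitoring_log(records)
```

Because the entries are ordered by the time the anomaly information was collected rather than by arrival order at the BMC, the log reads as a timeline when it is later retrieved from storage.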
This embodiment not only implements anomaly monitoring. Because the coprocessor 102 performs chip-level anomaly monitoring on the motherboard 201, the monitoring is generally not affected by network transmission delay or by the operating load of the motherboard 201, which improves the accuracy and reliability of anomaly monitoring. In addition, the coprocessor 102 can monitor anomaly sources at chip level. When an anomaly occurs, it is therefore possible to determine quickly whether the main processor 101 is abnormal, whether the bus between the main processor 101 and the coprocessor 102 is abnormal, or whether both are abnormal at the same time. The anomaly source can thus be identified accurately, the ability to locate the root cause of the anomaly is strengthened, and the fault recovery delay can be reduced.
It is worth noting that other reasonable combinations of steps that a person skilled in the art can conceive of based on the above description also fall within the protection scope of this application. A person skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by this application.
The anomaly monitoring method provided by this application has been described in detail above with reference to FIG. 1 to FIG. 3. The anomaly monitoring apparatus and the computing device provided by this application are described below with reference to FIG. 4 to FIG. 6.
FIG. 4 is a schematic structural diagram of an anomaly monitoring apparatus provided by this application. The anomaly monitoring apparatus 400 is applied to a coprocessor in a motherboard, such as the coprocessor 102 described above. The motherboard further includes a main processor, and the main processor and the coprocessor are coupled through a bus. As shown in FIG. 4, the anomaly monitoring apparatus 400 may include:
an anomaly monitoring module 401, configured to perform chip-level anomaly monitoring on the motherboard, where the chip-level anomaly monitoring includes anomaly monitoring of at least one of the main processor in the motherboard or the bus; and
a generating module 402, configured to generate, when the chip-level anomaly is detected on the motherboard, an indication that a chip-level anomaly has occurred on the motherboard.
In a possible implementation, the anomaly monitoring module 401 is specifically configured to perform chip-level anomaly monitoring on the motherboard in a real-time monitoring manner.
In a possible implementation, the anomaly monitoring module 401 is specifically configured to:
determine, when a no-instruction timeout interrupt of the main processor is received, that the main processor is abnormal, where the no-instruction timeout interrupt indicates that one or more processor cores in the main processor have not received an instruction within a preset duration.
In a possible implementation, the anomaly monitoring module 401 is specifically configured to:
monitor whether an interrupt signal of the internal bus in the main processor is received, where the interrupt signal indicates that the internal bus has an operating fault; and
determine, when the interrupt signal is received, that the internal bus is abnormal.
In a possible implementation, the anomaly monitoring module 401 is specifically configured to:
monitor whether an interrupt signal from the bus is received, where the interrupt signal indicates an operating fault of the bus; and
determine, when the interrupt signal is received, that the bus is abnormal.
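The three real-time (interrupt-driven) cases above share one shape: a received interrupt identifies the component that is abnormal. A minimal sketch of that mapping follows. The enum values, the `Anomaly` record, and the function `on_interrupt` are hypothetical names introduced for illustration; the source does not define an interrupt encoding.

```python
from dataclasses import dataclass
from enum import Enum


class Interrupt(Enum):
    """Hypothetical interrupt kinds the coprocessor may receive."""
    NO_INSTRUCTION_TIMEOUT = "no_instruction_timeout"
    INTERNAL_BUS_FAULT = "internal_bus_fault"
    BUS_FAULT = "bus_fault"


@dataclass(frozen=True)
class Anomaly:
    source: str  # component judged to be abnormal
    reason: str  # interrupt that led to the judgment


# Which component each interrupt kind implicates, following the three cases above.
_SOURCE_BY_INTERRUPT = {
    Interrupt.NO_INSTRUCTION_TIMEOUT: "main processor",
    Interrupt.INTERNAL_BUS_FAULT: "internal bus of the main processor",
    Interrupt.BUS_FAULT: "bus between the main processor and the coprocessor",
}


def on_interrupt(irq: Interrupt) -> Anomaly:
    """Turn a received interrupt into a chip-level anomaly record."""
    return Anomaly(source=_SOURCE_BY_INTERRUPT[irq], reason=irq.value)
```

A table lookup keeps the handler free of branching, so supporting another interrupt kind only requires a new table entry.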
In a possible implementation, the anomaly monitoring module 401 is specifically configured to perform chip-level anomaly monitoring on the motherboard in a periodic monitoring manner.
In a possible implementation, the anomaly monitoring module 401 is specifically configured to:
monitor, within a first monitoring period, whether a heartbeat message sent by a first processor core of the main processor is received, where the heartbeat message indicates that the first processor core is running normally; and
determine, when the heartbeat message is not received within the first monitoring period, that the first processor core is abnormal.
In a possible implementation, the anomaly monitoring module 401 is specifically configured to:
monitor, within a second monitoring period, whether a heartbeat message sent by a second processor core is received, where the heartbeat message sent by the second processor core indicates that the second processor core is running normally, and the second processor core is a processor core in the main processor other than the first processor core; and
determine, when the heartbeat message sent by the second processor core is not received within the second monitoring period, that the second processor core is abnormal.
In a possible implementation, the anomaly monitoring module 401 is specifically configured to:
periodically monitor all processor cores in the main processor; and
determine, when a heartbeat message sent by a third processor core among all the processor cores in the main processor is not received, that the third processor core is abnormal, where the third processor core is any one of the processor cores in the main processor.
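The periodic cases above can be sketched as a heartbeat monitor in which each core has its own monitoring period. This is an illustrative sketch, not the patented implementation: the class `HeartbeatMonitor`, its methods, and the choice to pass timestamps in explicitly (rather than read a clock) are assumptions.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per core and flags cores whose period elapsed."""

    def __init__(self, periods: Dict[int, float]) -> None:
        # periods maps a core id to its monitoring period in seconds, so the
        # first and second cores may use different periods.
        self._periods = dict(periods)
        # Monitoring is assumed to start at time 0.0 for every core.
        self._last_seen = {core: 0.0 for core in periods}

    def record_heartbeat(self, core: int, now: float) -> None:
        """Note that a heartbeat message from the given core arrived at time now."""
        self._last_seen[core] = now

    def abnormal_cores(self, now: float) -> List[int]:
        """Return cores from which no heartbeat arrived within their period."""
        return sorted(
            core
            for core, period in self._periods.items()
            if now - self._last_seen[core] > period
        )
```

For example, with `HeartbeatMonitor({0: 1.0, 1: 2.0})`, core 0 is checked against a 1-second period and core 1 against a 2-second period; checking every configured core on each call corresponds to the case in which all processor cores are monitored periodically.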
In a possible implementation, the anomaly monitoring apparatus 400 further includes:
a reporting module 403, configured to report to the BMC that the chip-level anomaly has occurred on the motherboard, so as to trigger the BMC to raise an anomaly alarm.
In a possible implementation, the bus includes at least one of an Advanced eXtensible Interface (AXI) bus, a HyperTransport (HT) bus, a QuickPath Interconnect (QPI) bus, a unified bus (UB), a Compute Express Link (CXL), or a Peripheral Component Interconnect Express (PCIe) bus.
The anomaly monitoring apparatus 400 according to this embodiment may correspondingly perform the method performed by the coprocessor 102 described in the embodiments of this application, and the foregoing and other operations or functions of the modules of the anomaly monitoring apparatus 400 respectively implement the corresponding procedures of the methods in FIG. 2. For brevity, details are not repeated here.
In this embodiment, anomaly monitoring is implemented. Moreover, because the coprocessor performs chip-level anomaly monitoring on the motherboard, the coprocessor is generally not affected by network transmission delay or by the operating load of the motherboard during monitoring, which improves the accuracy and reliability of anomaly monitoring. In addition, the coprocessor can monitor anomaly sources at the chip level. After an anomaly occurs, it is therefore possible to quickly determine whether the main processor is abnormal, whether the bus between the main processor and the coprocessor is abnormal, or whether the main processor and the bus are abnormal at the same time. The anomaly source can thus be identified accurately, the ability to locate the root cause is strengthened, and the fault recovery delay can be reduced.
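The localization step described above distinguishes three outcomes: the main processor, the bus, or both. A minimal sketch follows; the function name `locate_anomaly_source` and the returned strings are hypothetical and serve only to make the three-way distinction concrete.

```python
def locate_anomaly_source(processor_abnormal: bool, bus_abnormal: bool) -> str:
    """Name the chip-level anomaly source from the two monitoring results."""
    if processor_abnormal and bus_abnormal:
        return "main processor and bus"
    if processor_abnormal:
        return "main processor"
    if bus_abnormal:
        return "bus"
    return "none"
```

Keeping the processor and bus results as separate inputs is what allows the simultaneous case to be reported as such rather than being attributed to a single component.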
FIG. 5 is a schematic structural diagram of another anomaly monitoring apparatus provided by this application. The anomaly monitoring apparatus 500 is applied to a BMC on a motherboard, such as the BMC 103 described above. The motherboard further includes a main processor and a coprocessor, and the main processor and the coprocessor are coupled through a bus. As shown in FIG. 5, the anomaly monitoring apparatus 500 may include:
a receiving module 501, configured to receive a report from the coprocessor that a chip-level anomaly has occurred on the motherboard, where the chip-level anomaly on the motherboard means that the main processor is abnormal or the bus is abnormal;
a parsing module 502, configured to parse the chip-level anomaly and generate an anomaly monitoring result; and
an alarm module 503, configured to raise an anomaly alarm according to the anomaly monitoring result.
In a possible implementation, the motherboard further includes a memory, and the anomaly monitoring apparatus 500 further includes:
a generating module 504, configured to generate a monitoring log, where the monitoring log includes the anomaly monitoring result; and
a writing module 505, configured to write the monitoring log into the memory.
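The BMC-side flow formed by modules 501 to 505 (receive, parse, alarm, log, write) can be sketched as follows. Several details are assumptions made for illustration: the report is taken to be a JSON string with `source` and `detail` fields, the memory is modeled as a Python list, and all function names are hypothetical. The source does not specify a report format or a log format.

```python
import json
from typing import Dict, List


def parse_report(raw: str) -> Dict[str, str]:
    """Parse a coprocessor report (assumed to be JSON) into a monitoring result."""
    report = json.loads(raw)
    return {"source": report["source"], "detail": report.get("detail", "")}


def make_alarm(result: Dict[str, str]) -> str:
    """Build the alarm text from the anomaly monitoring result."""
    return f"ALARM: chip-level anomaly in {result['source']}"


def handle_report(raw: str, memory: List[str]) -> str:
    """Receive a report, parse it, write a log entry, and return the alarm."""
    result = parse_report(raw)
    alarm = make_alarm(result)
    # The monitoring log entry contains the anomaly monitoring result.
    memory.append(json.dumps(result, sort_keys=True))
    return alarm
```

Calling `handle_report('{"source": "bus", "detail": "interrupt"}', memory)` returns the alarm text and appends one log entry to `memory`.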
The anomaly monitoring apparatus 500 according to this embodiment may correspondingly perform the method performed by the BMC 103 described in the embodiments of this application, and the foregoing and other operations or functions of the modules of the anomaly monitoring apparatus 500 respectively implement the corresponding procedures of the methods in FIG. 2. For brevity, details are not repeated here.
FIG. 6 is a schematic diagram of a computing device 600 provided by this application. As shown in the figure, the computing device 600 includes a motherboard 700. The motherboard 700 includes a main processor 601 and a coprocessor 602, which are coupled through a bus 603 and perform data communication over the bus 603. Optionally, the motherboard 700 may further include a memory 604, a baseboard management controller 605, and a communication interface 606; correspondingly, the main processor 601, the coprocessor 602, the memory 604, the baseboard management controller 605, and the communication interface 606 may communicate through the bus 603.
It should be noted that, in addition to a data bus, the bus 603 may also include a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are all labeled as the bus 603 in the figure.
In this embodiment, the main processor 601 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The coprocessor 602 may be, for example, any one of an auxiliary processor on an SOC, a GPU, an IMU, an ME, a microcontroller, or a logic circuit, or may be another coprocessor.
As a possible implementation, the main processor 601 and the coprocessor 602 may be packaged together as one chip, or may each be packaged as a separate chip.
The memory 604 may include a read-only memory and a random access memory, and provides instructions and data to the main processor 601 and the coprocessor 602. The memory 604 may also include a non-volatile random access memory.
The memory 604 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
The communication interface 606 is configured to communicate with other devices connected to the computing device 600.
It should be understood that the computing device 600 according to this embodiment may correspond to the corresponding procedures of the methods in FIG. 2 in the embodiments of this application. For brevity, details are not repeated here.
This application further provides a motherboard whose structure is shown as the motherboard 700 in FIG. 6. The motherboard 700 includes a main processor and a coprocessor, and the coprocessor is configured to implement the functions implemented by the coprocessor in the method shown in FIG. 2. For brevity, details are not repeated here.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the foregoing embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that includes one or more sets of available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
The foregoing is merely a description of specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Claims (13)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210109088.5A CN116560936A (en) | 2022-01-28 | 2022-01-28 | Anomaly monitoring method, coprocessor and computing device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116560936A (en) | 2023-08-08 |
Family
ID=87490335
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210109088.5A Pending CN116560936A (en) | 2022-01-28 | 2022-01-28 | Anomaly monitoring method, coprocessor and computing device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116560936A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||