CN114880266A - Fault processing method and device, computer equipment and storage medium

Info

Publication number: CN114880266A
Application number: CN202210766483.0A
Authority: CN
Inventors: 赵建平; 孙路遥
Original assignee: Shenzhen Xingyun Zhilian Technology Co ltd
Current assignee: Shenzhen Xingyun Zhilian Technology Co ltd
Priority date: 2022-07-01
Filing date: 2022-07-01
Publication date: 2022-08-09
Anticipated expiration: 2042-07-01
Also published as: CN115454705B; CN114880266B; CN115454705A

Abstract

The present application relates to the field of electronic digital data processing in the Internet industry and discloses a fault handling method and apparatus, a computer device, and a storage medium. The method includes: in response to detecting a failure of an embedded processor in a data processor, entering a pickup mode, wherein the pickup mode includes sending a hot-plug interrupt signal to a computer host to isolate the computer host from the failed embedded processor, the hot-plug interrupt signal being used to indicate that the embedded processor has performed a hot-plug operation; and in response to detecting that the embedded processor has been repaired, sending a hot-insertion signal to the computer host and exiting the pickup mode to complete fault recovery, the hot-insertion signal being used to indicate that the embedded processor has performed a hot-insertion operation. By implementing the embodiments of the present application, fault isolation can be achieved effectively, the computer host does not need to be restarted, the impact on the computer host is minimized, and normal operation of the computer host is ensured.

Description

Fault handling method, apparatus, computer device and storage medium

Technical Field

The present application relates to the field of electronic digital data processing in the Internet industry, and in particular to a fault handling method, apparatus, computer device, and storage medium.

Background

With the rapid development of data centers, communication capability and computing capability have become two important and complementary directions for data center infrastructure. If a data center focuses only on improving computing power while its communication infrastructure fails to keep pace, overall system performance remains limited and the data center cannot reach its full potential. The data processing unit (DPU) emerged to cope with increasingly large and complex volumes of data.

The data processor is positioned as a co-processing unit and is one implementation of the idea of separating the data plane from the control plane. It works together with the central processing unit (CPU): the CPU is responsible for general control, and the data processor focuses on data processing. In other words, the data processor can offload data processing and preprocessing from the CPU while placing computing power closer to where the data is generated, which reduces communication traffic. Because the data processor must move computation close to the data, higher demands are placed on its reliability and availability. To meet network requirements for transferring large amounts of data, the data processor uses a peripheral component interconnect express (PCIe) interface, a high-speed serial computer expansion bus standard, for data transmission. As a result, the data processor is commonly exposed to PCIe device failures, and the data processor system is needed to assist the PCIe device with fault recovery.

At present, the data processor emulates a PCIe device on the embedded processor (embedded central processing unit, ECPU) side, and the host's PCIe-related transaction layer packets (TLPs) are all forwarded to the embedded processor for processing. Consequently, when the PCIe emulation program of the embedded processor, or the system itself, fails, two situations may arise: first, the host's user services are affected, and the host may even hang; second, restoring the data processor requires restarting the host, which interrupts every running program. How to isolate the fault so that user service processing is not affected is therefore a problem to be solved by those skilled in the art.

Summary of the Invention

The embodiments of the present application provide a fault handling method, apparatus, computer device, and storage medium that can effectively achieve fault isolation without restarting the computer host, minimizing the impact on the computer host and thereby ensuring its normal operation.

In a first aspect, an embodiment of the present application provides a fault handling method, applied to a programmable logic device, wherein:

in response to detecting a failure of an embedded processor in a data processor, the programmable logic device enters a pickup mode, the pickup mode including sending a hot-plug interrupt signal to a computer host so that the computer host is fault-isolated from the embedded processor, the hot-plug interrupt signal being used to indicate that the embedded processor has performed a hot-plug operation; and

in response to detecting that repair of the embedded processor is complete, the programmable logic device sends a hot-insertion signal to the computer host and exits the pickup mode to complete fault recovery, the hot-insertion signal being used to indicate that the embedded processor has performed a hot-insertion operation.

In a second aspect, an embodiment of the present application provides a fault handling apparatus, applied to a programmable logic device, wherein the apparatus includes:

a fault isolation unit, configured such that, in response to detecting a failure of an embedded processor in a data processor, the programmable logic device enters a pickup mode, the pickup mode including sending a hot-plug interrupt signal to a computer host so that the computer host is fault-isolated from the embedded processor, the hot-plug interrupt signal being used to indicate that the embedded processor has performed a hot-plug operation; and

a fault recovery unit, configured such that, in response to detecting that repair of the embedded processor is complete, the programmable logic device sends a hot-insertion signal to the computer host and exits the pickup mode to complete fault recovery, the hot-insertion signal being used to indicate that the embedded processor has performed a hot-insertion operation.

In a third aspect, an embodiment of the present application provides a computer device including a processor, a memory, and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, and the computer program includes instructions for some or all of the steps described in the first aspect of the embodiments of the present application.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application.

Implementing the embodiments of the present application has the following beneficial effects:

With the above fault handling method, apparatus, computer device, and storage medium, after a failure of the embedded processor in the data processor is detected, the programmable logic device can enter the pickup mode and send a hot-plug interrupt signal to the computer host. The hot-plug interrupt signal indicates that the embedded processor has performed a hot-plug operation, so communication between the computer host and the embedded processor is disconnected and fault isolation is completed simply and efficiently. This avoids the problem in conventional solutions where, once the embedded processor fails, the computer host must be restarted and all running programs are interrupted; the impact on the computer host is minimized and its normal operation is ensured. After detecting that repair of the embedded processor is complete, the programmable logic device sends a hot-insertion signal to the computer host, the hot-insertion signal indicating that the embedded processor has performed a hot-insertion operation, and exits the pickup mode. The computer host then resumes communication with the embedded processor, so fault recovery completes quickly and fault handling becomes more efficient. In addition, the embodiments of the present application do not require external tools such as a BMC system or a management and control platform, which reduces dependencies and improves reliability.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a fault handling method provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a fault handling apparatus provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a computer device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.

The terms "first", "second", "preset", "fourth", and the like in the description, the claims, and the accompanying drawings of the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units; it optionally also includes steps or units that are not listed, or other steps or units inherent to the process, method, product, or device.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. Occurrences of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

It should also be understood that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean that A exists alone, that both A and B exist, or that B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

For ease of understanding, several basic concepts involved in the embodiments of the present application are introduced first.

The data processing unit (DPU) is a major class of recently developed special-purpose processors. After the central processing unit (CPU) and the graphics processing unit (GPU), it is the third important computing chip in data center scenarios, providing a compute engine for high-bandwidth, low-latency, data-intensive computing. A data processor has three main characteristics: offloading, acceleration, and isolation. Correspondingly, its three main application scenarios are networking, storage, and security. In terms of offloading, the data processor can act as an offload engine for the CPU, freeing CPU computing power for upper-layer applications; for example, it can offload data center network services (virtual switching, virtual routing, and so on), data center storage services, and data center security services (firewalls, encryption and decryption, and so on). In terms of acceleration, the data processor will become a sandbox for algorithm acceleration and the most flexible accelerator carrier. A data processor is not entirely a fixed-function application-specific integrated circuit (ASIC). With the groundwork laid by coherent data access protocols among CPUs, GPUs, and data processors, as advocated by standards organizations such as CXL (compute express link), obstacles to programming data processors will be further cleared. Combined with programmable devices such as field programmable gate arrays (FPGAs), customizable hardware will have more room to develop, implementing software in hardware will become the norm, and the potential of heterogeneous computing will be fully realized as data processors become widespread. In terms of isolation, the data processor will become a new data gateway, raising security and privacy to a new level. The asymmetric encryption algorithm SM2, the hash algorithm SM3, and the symmetric block cipher algorithm SM4 can all be implemented by hardening them in the data processor.

Peripheral component interconnect express (PCIe) is a high-speed serial computer expansion bus standard. It provides high-speed, serial, point-to-point, dual-channel, high-bandwidth transmission; each connected device is allocated its own dedicated channel bandwidth and does not share bus bandwidth. The standard defines slots and connectors of several widths: x1, x4, x8, x12, x16, and x32. Typically, low-speed peripherals (such as WiFi cards) use a single-lane (x1) link, while graphics adapters more often use a faster, wider x16 link.

The field programmable gate array (FPGA) was developed further on the basis of programmable devices such as programmable array logic (PAL), and it effectively addresses the limited gate count of those earlier devices. The basic structure of an FPGA includes programmable input/output units, configurable logic blocks, digital clock management modules, routing resources, embedded dedicated hard cores, and underlying embedded functional units. Because FPGAs offer abundant routing resources, can be reprogrammed, are highly integrated, and require relatively low investment, they are widely used in digital circuit design.

A complex programmable logic device (CPLD) uses programming technologies such as electrically erasable programmable read-only memory (EEPROM), flash memory, and static random-access memory (SRAM) to form a high-density, high-speed, low-power programmable logic device. A CPLD is a digital integrated circuit whose logic functions users construct according to their own needs. The basic design method is to use an integrated development software platform, with schematics, hardware description languages, and similar methods, to generate the corresponding target file, and then to transfer the code to the target chip over a download cable to implement the designed digital system.

Referring to FIG. 1, FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application. As shown in FIG. 1, the system architecture may include a computer host 100 and a data processor 200. The computer host 100 may include computer devices whose computing core is a processor of any of various architectures, such as an Intel x86 CPU, an ARM CPU, or a MIPS CPU, in application forms including but not limited to industrial computers, servers, vehicle-mounted computers, and mobile workstations. The embodiments of the present application place no limitation on the type of the computer host.

In the embodiments of the present application, the data processor 200 may include a programmable logic device 201 and an embedded processor 202. The data processor 200 may use the standard PCIe protocol for data transmission, so the data processor 200 may be a PCIe device. The programmable logic device 201 may be an FPGA, a system on chip (SOC), an ASIC, or the like, may be a multi-core processor or the like, or may be another programmable logic device; the embodiments of the present application do not limit this. The programmable logic device 201 may communicate with the computer host 100 through a PCIe interface, and may communicate with the embedded processor 202 through a first interface. The first interface may be a PCIe interface, a common flash interface (CFI), a serial peripheral interface (SPI), a peripheral component interconnect (PCI) interface, a local bus (LocalBus) interface, or the like; the embodiments of the present application do not limit this.

The programmable logic device 201 may include a status register that records the running state of the embedded processor 202 or of other devices in the data processor 200. By reading the status register, the programmable logic device 201 determines whether the embedded processor 202 has failed. Optionally, a first-in first-out (FIFO) channel may be established among the computer host 100, the programmable logic device 201, and the embedded processor 202, so that the three can exchange data through the FIFO channel and data transmission speed is improved.

The data processor 200 may include a complex programmable logic device not shown in FIG. 1. The complex programmable logic device may communicate with the programmable logic device 201 through the first interface, and may also communicate with the embedded processor 202 through the first interface; that is, it can communicate with the programmable logic device 201 and with the embedded processor 202 separately. The data processor 200 may also include a memory (for example, a Flash memory) not shown in FIG. 1; the memory may store the programs to be run and may communicate with the programmable logic device 201 through a CFI interface. In addition, the data processor 200 may further include a PCIe switch, a GPU, a digital signal processor (DSP), a redundant array of independent disks (RAID), and so on, none of which are shown in FIG. 1; the embodiments of the present application do not limit this.

To solve the above problems, an embodiment of the present application provides a fault handling method. Implementing the method effectively achieves fault isolation without restarting the computer host, minimizing the impact on the computer host and thereby ensuring its normal operation.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a fault handling method provided by an embodiment of the present application. The method can be used in the system architecture shown in FIG. 1 and, specifically, can be executed by the programmable logic device shown in FIG. 1. The method may include the following steps S201-S202:

Step S201: In response to detecting a failure of the embedded processor in the data processor, the programmable logic device enters a pickup mode, the pickup mode including sending a hot-plug interrupt signal to the computer host so that the computer host is fault-isolated from the embedded processor.

As the number of PCIe devices attached to a computer host increases, so does the probability that a PCIe device fails. A PCIe device failure may affect the normal operation of the computer host and, in severe cases, may even cause the host to hang. Handling failed PCIe devices is therefore an important part of keeping the computer host running normally. Because the data processor may use the standard PCIe protocol for data transmission, the data processor may be a PCIe device. As shown in FIG. 1, the data processor may include a programmable logic device and an embedded processor. The programmable logic device may be an FPGA, an SOC, an ASIC, or the like, may be a multi-core processor or the like, or may be another programmable logic device; the embodiments of the present application do not limit this. The data processor may emulate a PCIe device on the embedded processor side, and the embedded processor processes the computer host's PCIe-related TLPs that are forwarded to it. This means that when the PCIe emulation program of the embedded processor, or the system itself, fails, the failure may affect the normal operation of the computer host.

To prevent a failure of the embedded processor from affecting the normal operation of the computer host, the failed embedded processor must be fault-isolated from the computer host. In the embodiments of the present application, the programmable logic device can be used to perform this isolation. Specifically, after detecting that the embedded processor has failed, the programmable logic device enters the pickup mode, in which it answers the computer host on behalf of the embedded processor, to complete fault isolation. One possible implementation of entering the pickup mode is as follows: the programmable logic device sends a hot-plug interrupt signal to the computer host, the hot-plug interrupt signal indicating that the embedded processor has performed a hot-plug operation. On receiving the hot-plug interrupt signal, the computer host considers the embedded processor to have been hot-removed and no longer exchanges service traffic with it, so the communication connection between the computer host and the embedded processor is disconnected and fault isolation is complete.
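The isolation and recovery transitions described above can be pictured as a two-state machine. The following Python sketch is only an illustration under stated assumptions: the class name, the method names, the signal strings, and the `notify_host` callback (standing in for the PCIe interrupt path to the host) are all hypothetical and do not come from the application, which describes behavior of a hardware programmable logic device rather than software.

```python
from enum import Enum, auto


class Mode(Enum):
    NORMAL = auto()  # host traffic is forwarded to the embedded processor
    PICKUP = auto()  # the logic device answers in place of the embedded processor


class ProgrammableLogicDevice:
    """Toy model of the isolation/recovery flow; all names are hypothetical."""

    def __init__(self, notify_host):
        self.mode = Mode.NORMAL
        # Callable standing in for the interrupt path to the computer host.
        self._notify_host = notify_host

    def on_fault_detected(self):
        # Enter pickup mode once; repeated fault reports are ignored.
        if self.mode is Mode.NORMAL:
            self.mode = Mode.PICKUP
            self._notify_host("HOT_PLUG_INTERRUPT")

    def on_repair_detected(self):
        # Announce re-insertion and leave pickup mode to finish recovery.
        if self.mode is Mode.PICKUP:
            self._notify_host("HOT_INSERTION")
            self.mode = Mode.NORMAL


events = []
pld = ProgrammableLogicDevice(events.append)
pld.on_fault_detected()
pld.on_fault_detected()   # duplicate detection does not resend the interrupt
pld.on_repair_detected()
print(events)
```

Guarding each transition on the current mode is a design choice of this sketch, so that a fault reported twice produces a single interrupt; the application itself does not specify how duplicate detections are handled.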

As can be seen, after a failure of the embedded processor is detected, the programmable logic device, answering on its behalf, sends a hot-plug interrupt signal to the computer host, which isolates the computer host from the failed embedded processor. This form of fault isolation is thorough, so the fault is prevented from spreading to the whole computer host. It also requires no restart of the computer host and is simple and efficient, which minimizes the impact on the computer host and ensures that its other services keep running normally without interference.

In a possible implementation, a fault detection stage may precede step S201, and may specifically include the following step:

In response to reading a preset flag bit from the computer host, the programmable logic device determines that the embedded processor has failed, the preset flag bit being used to indicate that the embedded processor has failed.

In the embodiments of the present application, a FIFO channel may be established among the computer host, the programmable logic device, and the embedded processor, and the three can exchange data through the FIFO channel. Specifically, the computer host may send a heartbeat packet to the embedded processor through the FIFO channel at a predetermined period. After receiving the heartbeat packet, the embedded processor replies to the computer host with a heartbeat packet, which keeps the computer host and the embedded processor in communication with each other. If the computer host receives the heartbeat packet from the embedded processor within a preset duration, the embedded processor can be judged to be working normally, that is, it has not failed. If the computer host has not received the heartbeat packet from the embedded processor after the preset duration has elapsed, the embedded processor can be judged to have failed, that is, it may have hung.
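The timeout rule in the heartbeat exchange can be sketched in a few lines of Python. This is a minimal sketch, not the application's implementation: the class name, the method names, and the 3-second timeout are hypothetical, and timestamps are passed in explicitly (rather than read from a clock) so that the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Host-side liveness check; all names and values here are hypothetical."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # the "preset duration" from the text
        self.last_reply_at = None    # time of the most recent heartbeat reply

    def record_reply(self, now):
        # Called whenever a heartbeat reply arrives from the embedded processor.
        self.last_reply_at = now

    def is_faulty(self, now):
        # Faulty only once the preset duration has been exceeded without a reply.
        if self.last_reply_at is None:
            raise ValueError("no heartbeat reply recorded yet")
        return (now - self.last_reply_at) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record_reply(now=10.0)
print(monitor.is_faulty(now=12.0))  # reply is 2.0 s old, within the timeout
print(monitor.is_faulty(now=13.5))  # reply is 3.5 s old, timeout exceeded
```

In a real deployment the timestamps would typically come from a monotonic clock; the text does not specify the period or the preset duration, so the values above are placeholders.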

In a possible implementation, the computer host may be provided with a preset program. After detecting that the embedded processor is faulty, the computer host may set a preset flag bit in the preset program, where the preset flag bit is used to indicate that the embedded processor is faulty. After detecting the preset flag bit, the programmable logic device may determine that the embedded processor is faulty and enter the fault isolation stage, for example by performing step S201, to isolate the computer host from the embedded processor's fault.

In a possible implementation, the computer host may be provided with a first status register, which may be used to store the running state of the embedded processor. The state value of the first status register may be defined as follows: "00" indicates that the embedded processor is working normally, and "10" indicates that the embedded processor is faulty. The preset flag bit may be an identifier indicating that the embedded processor is faulty; in the embodiments of the present application, the preset flag bit may be understood as the state value "10" of the first status register. While the computer host detects that the embedded processor is working normally, the state value of the first status register stays at "00". After detecting that the embedded processor is faulty, the computer host may change the state value of the first status register from "00" to "10", which indicates that the embedded processor is faulty. The programmable logic device may poll the state value of the first status register; when it reads the state value "10", that is, when it reads the preset flag bit, it determines that the embedded processor is faulty. After detecting that the embedded processor is faulty, the programmable logic device enters the fault isolation stage, for example by performing step S201, to isolate the computer host from the embedded processor's fault.
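The polling of the first status register can be illustrated with a small model. The sketch below is hypothetical: `FirstStatusRegister` and `poll_for_fault` are names invented for the example, and the register is modelled as a plain Python object rather than real hardware; only the "00" and "10" state values come from the text.

```python
from typing import Optional

STATE_NORMAL = 0b00   # "00": embedded processor working normally
STATE_FAULT = 0b10    # "10": embedded processor faulty (the preset flag bit)


class FirstStatusRegister:
    """Stand-in for the host-side first status register (hypothetical)."""

    def __init__(self) -> None:
        self.value = STATE_NORMAL

    def write(self, value: int) -> None:
        self.value = value & 0b11   # keep only the two state bits

    def read(self) -> int:
        return self.value


def poll_for_fault(register: FirstStatusRegister, max_polls: int) -> Optional[int]:
    """Poll the register; return the poll index at which "10" is read, else None."""
    for i in range(max_polls):
        if register.read() == STATE_FAULT:
            return i
    return None


reg = FirstStatusRegister()
before = poll_for_fault(reg, max_polls=3)   # host has not reported a fault yet
reg.write(STATE_FAULT)                      # host detects the fault and writes "10"
after = poll_for_fault(reg, max_polls=3)    # the next poll sees the preset flag
```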

As can be seen, a FIFO channel is established among the computer host, the programmable logic device and the embedded processor, through which the three can exchange data. After detecting that the embedded processor is faulty, the computer host may set a preset flag bit indicating the fault; after reading the preset flag bit, the programmable logic device may determine that the embedded processor is faulty. This fault detection approach places little burden on the programmable logic device and is simple and efficient.

In a possible implementation, fault detection may alternatively be implemented through steps A1-A2:

Step A1: The programmable logic device acquires register information of a status register, where the register information is used to record the running state of the embedded processor.

Step A2: The programmable logic device determines, according to the register information, whether the embedded processor is faulty.

In the embodiments of the present application, the programmable logic device may be provided with a status register for storing the running state of the embedded processor. The register information stored in the status register records the running state of the embedded processor.

Taking a programmable logic device that includes one 32-bit status register as an example, the meaning of each bit of the status register may be as shown in Table 1:

Table 1

| Bit0 | Bit1 | Bit2-Bit3 | Bit4-Bit8 | Bit9-Bit31 |
| --- | --- | --- | --- | --- |
| First flag bit | Second flag bit | Reserved | Fault information code | Reserved |

The details are as follows:

Bit0: The first flag bit, which the programmable logic device uses to judge whether the embedded processor is working normally. When the embedded processor is working normally, the first flag bit in the status register is set within a preset time. Here, "set" may be understood as changing the corresponding flag bit to 0 or to 1.

Bit1: The second flag bit, which stores the running state of the embedded processor as fed back by the complex programmable logic device. The programmable logic device may store the information fed back by the complex programmable logic device in Bit1.

Bit2-Bit3: Reserved bits.

Bit4-Bit8: If the programmable logic device detects that the embedded processor is faulty, it may store in Bit4-Bit8 the fault information code corresponding to the fault source and fault type of the embedded processor.

Bit9-Bit31: Reserved bits.
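The layout in Table 1 can be expressed as a pair of pack and unpack helpers. This is a sketch under one stated assumption: Bit0 is taken to be the least significant bit of the 32-bit register, which the application does not specify. The names `StatusFields`, `encode_status` and `decode_status` are illustrative only.

```python
from typing import NamedTuple


class StatusFields(NamedTuple):
    first_flag: int    # Bit0
    second_flag: int   # Bit1
    fault_code: int    # Bit4-Bit8 (5 bits)


def encode_status(first_flag: int, second_flag: int, fault_code: int) -> int:
    """Pack the fields into a 32-bit value; reserved bits stay 0."""
    return (first_flag & 0x1) | ((second_flag & 0x1) << 1) | ((fault_code & 0x1F) << 4)


def decode_status(reg: int) -> StatusFields:
    """Unpack a 32-bit register value, ignoring Bit2-Bit3 and Bit9-Bit31."""
    reg &= 0xFFFFFFFF
    return StatusFields(
        first_flag=reg & 0x1,
        second_flag=(reg >> 1) & 0x1,
        fault_code=(reg >> 4) & 0x1F,
    )


raw = encode_status(first_flag=1, second_flag=0, fault_code=0b10101)
fields = decode_status(raw)
```

Masking on both the encode and decode paths means stray values in the reserved bits never leak into the decoded fields.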

After the register information of the status register is acquired, whether the embedded processor is faulty can be determined according to the register information. For the specific implementation, refer to the description below; details are not repeated here.

As can be seen, the register information of the status register is acquired, and whether the embedded processor is faulty is then determined according to that information. Implementing self-detection of the embedded processor through the status register provided by the programmable logic device is simple and efficient, involves few devices in the detection, and yields relatively high fault detection accuracy.

In a possible implementation, step A2 may specifically include the following steps:

The programmable logic device reads, at a preset period, the state value of the first flag bit in the status register from the register information; the programmable logic device judges, according to the state value read, whether the first flag bit has been set; and in response to the first flag bit not having been set when a preset time is reached, the programmable logic device determines that the embedded processor is faulty.

In the embodiments of the present application, self-detection of the embedded processor may be implemented through the status register provided by the programmable logic device. Specifically, the running state of the embedded processor may be associated with the first flag bit of the status register (for example, Bit0 in Table 1). The programmable logic device periodically (for example, every 1 minute) sends a discovery request to the embedded processor. If the embedded processor that receives the discovery request is working normally, it replies to the programmable logic device with a discovery response, that is, it answers the discovery request, and the first flag bit may then be set. Here, "set" may be understood as changing the corresponding flag bit to 0 or to 1. For example, when the embedded processor has answered the discovery request sent by the programmable logic device, if the state value of the first flag bit in the previous cycle was "0", it is changed to 1; if it was "1", it is changed to 0. In other words, if the embedded processor is working normally, the state value of the first flag bit of the status register in the programmable logic device changes periodically.

If the embedded processor is faulty, it cannot answer the discovery request sent by the programmable logic device, and the first flag bit cannot be set. For example, when the embedded processor fails to answer the discovery request, if the state value of the first flag bit in the previous cycle was "0", it remains "0" in the current cycle; if it was "1", it remains "1". In other words, if the embedded processor is faulty, the state value of the first flag bit of the status register in the programmable logic device no longer changes periodically. Therefore, in the embodiments of the present application, the programmable logic device may read, at a preset period, the state value of the first flag bit in the status register from the register information, and then judge from the state value read whether the first flag bit has been set. If the first flag bit has not been set when the preset time is reached, it may be determined that the embedded processor is faulty. The preset period and the preset time may be determined according to the actual application scenario, which is not limited in the embodiments of the present application.
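The toggle-based check can be sketched as follows: the first flag bit is sampled once per preset period, and a fault is declared when the bit has not changed by the time the preset time elapses. The class name `ToggleWatcher`, the 60-second sampling period and the 120-second preset time are assumptions chosen for the example.

```python
class ToggleWatcher:
    """Tracks whether the first flag bit keeps toggling (hypothetical helper)."""

    def __init__(self, initial_bit: int, deadline_s: float, now: float = 0.0) -> None:
        self.last_bit = initial_bit
        self.last_change = now          # time the bit was last seen to change
        self.deadline_s = deadline_s    # the "preset time" (assumed value)

    def sample(self, bit: int, now: float) -> bool:
        """Feed one periodic sample; return True if a fault is determined."""
        if bit != self.last_bit:
            # The bit toggled, so the embedded processor answered a discovery request.
            self.last_bit = bit
            self.last_change = now
            return False
        # No toggle: faulty only once the preset time has been reached.
        return (now - self.last_change) >= self.deadline_s


watcher = ToggleWatcher(initial_bit=0, deadline_s=120.0)
results = [
    watcher.sample(1, now=60.0),    # toggled 0 -> 1
    watcher.sample(0, now=120.0),   # toggled 1 -> 0
    watcher.sample(0, now=180.0),   # unchanged for 60 s, still within the preset time
    watcher.sample(0, now=240.0),   # unchanged for 120 s, preset time reached
]
```

Comparing against the previous sample, rather than against a fixed value, matches the text's point that either 0 or 1 can be the "healthy" value in a given cycle.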

As can be seen, the programmable logic device may read, at a preset period, the state value of the first flag bit in the status register from the register information, and then judge from the state value read whether the first flag bit has been set; if the first flag bit has not been set when the preset time is reached, the embedded processor is determined to be faulty. Implementing self-detection of the embedded processor through the status register provided by the programmable logic device is simple and efficient, involves few devices in the detection, and yields relatively high fault detection accuracy.

In a possible implementation, step A2 may further include the following steps:

The programmable logic device reads the state value of the second flag bit in the status register from the register information, where the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is used to detect the running state of the embedded processor and to feed back the first signal to the programmable logic device according to that running state; and in response to reading that the state value of the second flag bit is a preset value, the programmable logic device determines that the embedded processor is faulty.

In the embodiments of the present application, the programmable logic device may detect whether the embedded processor is faulty through a complex programmable logic device connected to the embedded processor. The complex programmable logic device may communicate with both the embedded processor and the programmable logic device, and may have a built-in watchdog module for monitoring the operation of the embedded processor. When the embedded processor is working normally, it sends a feedback signal to the timer of the watchdog module within a preset time (for example, 1 second or 0.5 seconds) to clear the timer, which implements the "feed the dog" function. If the embedded processor does not feed the timer within the preset time, the complex programmable logic device may feed back the first signal to the programmable logic device to indicate that the embedded processor did not respond within the preset time. On receiving the first signal, the programmable logic device may change the state value of the bit in the status register that corresponds to the complex programmable logic device (for example, Bit1 in Table 1) to the preset value, to indicate that the embedded processor is faulty.
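The watchdog path can be sketched in two parts: a timer that the embedded processor must feed, and the update of Bit1 when the first signal is raised. The sketch assumes a 1-second watchdog timeout and treats the preset value of the single Bit1 as 1; both are assumptions for illustration, as are the names `Watchdog` and `update_status`.

```python
SECOND_FLAG_MASK = 1 << 1   # Bit1 of the status register (Bit0 assumed to be the LSB)


class Watchdog:
    """Model of the watchdog module inside the CPLD (hypothetical helper)."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_feed = now

    def feed(self, now: float) -> None:
        """The embedded processor clears the timer ("feeds the dog")."""
        self.last_feed = now

    def first_signal(self, now: float) -> bool:
        """True when the timer was not fed within the preset time."""
        return (now - self.last_feed) > self.timeout_s


def update_status(reg: int, first_signal: bool) -> int:
    """Programmable logic device side: latch the first signal into Bit1."""
    return reg | SECOND_FLAG_MASK if first_signal else reg


wd = Watchdog(timeout_s=1.0)
wd.feed(now=0.5)
reg_ok = update_status(0, wd.first_signal(now=1.2))         # fed 0.7 s ago
reg_fault = update_status(reg_ok, wd.first_signal(now=2.0))  # fed 1.5 s ago
```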

For example, the state value of the second flag bit in the status register (see Bit1 in Table 1) may be defined as follows: "00" indicates that the embedded processor is working normally, and "10" indicates that the embedded processor is faulty. The preset value may be an identifier indicating that the embedded processor is faulty; in the embodiments of the present application, the preset value may be understood as the state value "10" of the second flag bit. As long as the programmable logic device has not received the first signal reporting a fault of the embedded processor, the state value of the second flag bit stays at "00". After receiving the first signal, the programmable logic device may change the state value of the second flag bit from "00" to "10", which indicates that the embedded processor is faulty. When the programmable logic device reads the state value "10" from the second flag bit, that is, when it reads the preset value, it determines that the embedded processor is faulty.

In a possible implementation, the second flag bit and the first flag bit may be the same flag bit, that is, a single bit implements the functions of both. For example, Bit1 may be used both to store the running state of the embedded processor as fed back by the complex programmable logic device and to store the programmable logic device's own judgment of whether the embedded processor is working normally. After detecting that the embedded processor is faulty, the programmable logic device enters the fault isolation stage, for example by performing step S201, to isolate the computer host from the embedded processor's fault.

As can be seen, fault detection of the embedded processor is performed by the complex programmable logic device, and the detection result is then associated with the status register of the programmable logic device. If the state value of the second flag bit read from the status register is the preset value, the embedded processor is determined to be faulty. Using the complex programmable logic device for fault detection in this way adds another means of detecting embedded processor faults, places a relatively small burden on the programmable logic device, and can improve the accuracy of fault detection.

In a possible implementation, the pickup mode further includes the programmable logic device sending a first processing layer data packet for error reporting to the computer host, so as to isolate the computer host from the embedded processor's fault.

In the embodiments of the present application, when the computer host performs service interaction with the embedded processor, the computer host may first send service data to the programmable logic device, which then forwards the service data to the embedded processor for processing. If the embedded processor is faulty, it cannot process this service data. In that case, after the computer host sends service data to the programmable logic device, the programmable logic device may enter the pickup mode to inform the computer host that the embedded processor is faulty, thereby isolating the computer host from the embedded processor's fault. Specifically, the programmable logic device may send a first processing layer data packet to the computer host; this packet reports an error to the computer host to indicate that the embedded processor is faulty.

As can be seen, after a fault in the embedded processor is detected, the programmable logic device answers on the embedded processor's behalf and sends the first processing layer data packet for error reporting to the computer host, which isolates the computer host from the embedded processor's fault. This form of fault isolation does not require restarting the computer host, reduces the impact on the computer host, and lets the host's other services keep running normally without interference.

In a possible implementation, the programmable logic device sending the first processing layer data packet for error reporting to the computer host may include the following steps:

Acquiring register information of the status register, where the register information is used to record the running state of the embedded processor; reading the fault source and fault type of the embedded processor from the register information; generating, according to the fault source and fault type, the first processing layer data packet for error reporting; and sending the first processing layer data packet to the computer host.

In the PCIe standard protocol, faults of PCIe devices are classified by severity into correctable errors (CE), non-fatal uncorrectable errors (NFE) and fatal uncorrectable errors (FE). Correctable errors can be automatically identified and corrected or recovered by hardware; non-fatal uncorrectable errors are generally handled directly by device driver software; fatal uncorrectable errors are usually handled by system software and generally require operations such as a reset. The fault source may be understood as the specific device in which the fault occurred.

In the embodiments of the present application, there may be multiple embedded processors. Taking embedded processor A and embedded processor B as an example, each of them may be put in one-to-one correspondence with a bit of the status register in the programmable logic device. As shown in Table 1, the running state of embedded processor A may be stored in Bit4, and the running state of embedded processor B may be stored in Bit5. Correspondingly, the state values of Bit4 and Bit5 in the status register may be defined as follows: "00" indicates normal operation, "01" indicates a correctable error, "10" indicates a non-fatal uncorrectable error, and "11" indicates a fatal uncorrectable error. For example, when the programmable logic device reads the state value "00" from Bit4, it may determine that embedded processor A is working normally. When it reads the state value "01" from Bit5, it may determine that the fault source is embedded processor B and the fault type is a correctable error; when it reads the state value "10" from Bit5, it may determine that the fault source is embedded processor B and the fault type is a non-fatal uncorrectable error.

The first processing layer data packet may then be generated according to the fault source and fault type in the register information and fed back to the computer host for error reporting, thereby completing fault isolation. The first processing layer data packet may further include the PCIe device identifier of the faulty embedded processor. The PCIe device identifier may be an identifier that identifies a PCIe device in the PCIe bus system; specifically, it may be the bus device function (BDF) of the PCIe device, the address information of the PCIe device, or the like, so that the faulty embedded processor can be located quickly.
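The mapping from state values to fault source and fault type, and the assembly of an error report, can be sketched as below. Several things here are assumptions for illustration only: the per-processor state codes are passed in as a dictionary rather than extracted from specific register bits, the report is a plain dictionary rather than a real PCIe packet format, and the BDF strings are made-up placeholders.

```python
from enum import IntEnum


class FaultType(IntEnum):
    NORMAL = 0b00        # "00": normal operation
    CORRECTABLE = 0b01   # "01": correctable error (CE)
    NON_FATAL = 0b10     # "10": non-fatal uncorrectable error (NFE)
    FATAL = 0b11         # "11": fatal uncorrectable error (FE)


def build_error_reports(states: dict[str, int], device_ids: dict[str, str]) -> list[dict]:
    """Return one report per faulty processor (fault source, type, device id)."""
    reports = []
    for source, code in states.items():
        fault = FaultType(code)
        if fault is FaultType.NORMAL:
            continue   # a healthy processor produces no error report
        reports.append({
            "source": source,
            "fault_type": fault.name,
            "device_id": device_ids[source],   # e.g. the BDF of the faulty device
        })
    return reports


reports = build_error_reports(
    states={"A": 0b00, "B": 0b01},                           # A normal, B correctable error
    device_ids={"A": "0000:3b:00.0", "B": "0000:3b:00.1"},   # placeholder BDFs
)
```

Carrying the device identifier in each report mirrors the text's point that the host should be able to locate the faulty embedded processor directly.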

As can be seen, the programmable logic device reads the fault source and fault type of the embedded processor from the register information, generates the first processing layer data packet accordingly, and feeds it back to the computer host. This helps locate the faulty embedded processor quickly and can improve the accuracy of fault location.

Step S202: In response to detecting that the embedded processor has been repaired, the programmable logic device sends a hot-plug signal to the computer host and exits the pickup mode, to complete fault recovery.

In the embodiments of the present application, the fault may be repaired by, for example, self-repair or a restart of the faulty embedded processor, which is not limited in the embodiments of the present application. Once the embedded processor has been repaired and can work normally, it may send a recovery request to the programmable logic device, where the recovery request indicates that the repair of the embedded processor is complete. On receiving the recovery request, the programmable logic device determines from it that the embedded processor has been repaired and may send a hot-plug signal to the computer host, where the hot-plug signal indicates that the embedded processor has performed a hot-insertion operation. After receiving the hot-plug signal, the computer host considers the embedded processor to have been hot-inserted and can resume normal service interaction with it. At this point, the programmable logic device exits the pickup mode, the computer host can communicate with the embedded processor again, and fault recovery is complete.
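Taken together, steps S201 and S202 behave like a two-state machine in the programmable logic device. The sketch below is a simplified model under stated assumptions: the class `FaultHandler`, its method names and the string labels for signals and replies are all hypothetical stand-ins for the real hardware signalling.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    PICKUP = "pickup"


class FaultHandler:
    """Simplified model of the programmable logic device (hypothetical)."""

    def __init__(self) -> None:
        self.mode = Mode.NORMAL
        self.sent_to_host: list[str] = []   # signals sent to the computer host

    def on_fault_detected(self) -> None:
        """S201: enter the pickup mode and report a hot-plug interrupt."""
        if self.mode is Mode.NORMAL:
            self.mode = Mode.PICKUP
            self.sent_to_host.append("hot_plug_interrupt")

    def on_recovery_request(self) -> None:
        """S202: the embedded processor is repaired; signal hot insertion and exit."""
        if self.mode is Mode.PICKUP:
            self.sent_to_host.append("hot_plug")
            self.mode = Mode.NORMAL

    def on_host_request(self, payload: bytes) -> str:
        """Forward service data normally, or answer with an error while in pickup mode."""
        return "error_packet" if self.mode is Mode.PICKUP else "forwarded"


handler = FaultHandler()
r1 = handler.on_host_request(b"data")   # normal operation
handler.on_fault_detected()
r2 = handler.on_host_request(b"data")   # answered on the processor's behalf
handler.on_recovery_request()
r3 = handler.on_host_request(b"data")   # back to normal forwarding
```

Guarding each transition on the current mode makes repeated fault or recovery notifications harmless in this model, since no duplicate signals are sent to the host.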

As can be seen from the method shown in FIG. 2, after a fault of the embedded processor in the data processor is detected, the programmable logic device may enter the pickup mode and send a hot-plug interrupt signal to the computer host. The hot-plug interrupt signal indicates that the embedded processor has performed a hot-plug operation, which disconnects communication between the computer host and the embedded processor and completes fault isolation simply and efficiently. This avoids the problem in conventional solutions where, once the embedded processor fails, the computer host must be restarted and all running programs are interrupted; the impact on the computer host is therefore minimized and its normal operation is ensured. After detecting that the embedded processor has been repaired, the programmable logic device sends a hot-plug signal to the computer host, indicating that the embedded processor has performed a hot-insertion operation, and exits the pickup mode, so that the computer host and the embedded processor communicate again. Fault recovery is thus completed quickly and the efficiency of fault handling is improved. In addition, the embodiments of the present application do not require external tools such as a BMC system or a management and control platform, which reduces dependencies and improves reliability.

The methods of the embodiments of the present application have been described in detail above; the apparatuses of the embodiments of the present application are provided below.

Refer to FIG. 3, which is a schematic structural diagram of a fault handling apparatus provided by an embodiment of the present application. The apparatus is applied to a programmable logic device. As shown in FIG. 3, the fault handling apparatus 300 includes a fault isolation unit 301 and a fault recovery unit 302, which are described in detail as follows:

The fault isolation unit 301 is configured to enter a pickup mode in response to detecting that an embedded processor in a data processor is faulty, where the pickup mode includes sending a hot-plug interrupt signal to a computer host so as to isolate the computer host from the fault of the embedded processor, and the hot-plug interrupt signal is used to indicate that the embedded processor has performed a hot-plug operation.

The fault recovery unit 302 is configured to, in response to detecting that the embedded processor has been repaired, send a hot-plug signal to the computer host and exit the pickup mode to complete fault recovery, where the hot-plug signal is used to indicate that the embedded processor has performed a hot-insertion operation.

In a possible implementation, the fault handling apparatus 300 may further include a fault detection unit not shown in FIG. 3. The fault detection unit may be specifically configured to acquire register information of the status register, where the register information is used to record the running state of the embedded processor, and to determine, according to the register information, whether the embedded processor is faulty.

In a possible implementation, the fault handling apparatus 300 may further include a fault detection unit not shown in FIG. 3. The fault detection unit may be specifically configured to read, at a preset period, the state value of the first flag bit in the status register from the register information; judge, according to the state value read, whether the first flag bit has been set; and, in response to the first flag bit not having been set when a preset time is reached, determine that the embedded processor is faulty.

In a possible implementation, the fault handling apparatus 300 may further include a fault detection unit not shown in FIG. 3. The fault detection unit may be specifically configured to read the state value of the second flag bit in the status register from the register information, where the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is used to detect the running state of the embedded processor and to feed back the first signal to the programmable logic device according to that running state; and, in response to reading that the state value of the second flag bit is a preset value, determine that the embedded processor is faulty.

In a possible implementation, the fault isolation unit 301 is further configured to send a first processing layer data packet for error reporting to the computer host, so as to isolate the computer host from the fault of the embedded processor.

In a possible implementation, the fault isolation unit 301 is specifically configured to: acquire register information of the status register, where the register information records the running state of the embedded processor; read the fault source and fault type of the embedded processor from the register information; generate the first processing layer data packet for error reporting according to the fault source and the fault type; and send the first processing layer data packet to the computer host.

In a possible implementation, a first-in-first-out queue channel exists between the computer host, the programmable logic device and the embedded processor. The fault handling apparatus 300 may further include a fault detection unit, not shown in FIG. 3, which may be configured to determine that the embedded processor is faulty in response to reading a preset flag bit from the computer host, where the preset flag bit indicates that the embedded processor is faulty.

It should be noted that, for the implementation of each unit, reference may also be made to the corresponding description of the method embodiment shown in FIG. 2.

Please refer to FIG. 4, which is a schematic structural diagram of a computer device provided by an embodiment of the present application. As shown in FIG. 4, the computer device 400 includes a processor 401, a memory 402 and a communication interface 403, where the memory 402 stores a computer program 404. The processor 401, the memory 402, the communication interface 403 and the computer program 404 may be connected by a bus 405.

When the computer device is a programmable logic device, the computer program 404 includes instructions for performing the following steps:

in response to detecting that the embedded processor in the data processor has failed, entering a pickup mode, where the pickup mode includes sending a hot-plug interrupt signal to the computer host to isolate the computer host from the fault of the embedded processor, and the hot-plug interrupt signal indicates that the embedded processor has performed a hot-plug operation;

in response to detecting that the embedded processor has been repaired, sending a hot-insertion signal to the computer host and exiting the pickup mode to complete fault recovery, where the hot-insertion signal indicates that the embedded processor has performed a hot-insertion operation.
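
The two steps above amount to a two-state machine. The following Python sketch is a behavioural model only, not the patented implementation: the class name, the `send_to_host` callback and the signal strings are hypothetical, and a real programmable logic device would realise this in hardware logic rather than software.

```python
from enum import Enum, auto


class Mode(Enum):
    NORMAL = auto()
    PICKUP = auto()


class PickupController:
    """Behavioural model of entering and leaving pickup mode."""

    def __init__(self, send_to_host):
        # send_to_host is a hypothetical callback that delivers a signal to the host.
        self.send_to_host = send_to_host
        self.mode = Mode.NORMAL

    def on_fault_detected(self):
        # Enter pickup mode only once, and notify the host with the interrupt signal.
        if self.mode is Mode.NORMAL:
            self.mode = Mode.PICKUP
            self.send_to_host("HOT_PLUG_INTERRUPT")

    def on_repair_completed(self):
        # Leave pickup mode only if it is active, after sending the recovery signal.
        if self.mode is Mode.PICKUP:
            self.send_to_host("HOT_INSERT_SIGNAL")
            self.mode = Mode.NORMAL


sent = []
controller = PickupController(sent.append)
controller.on_fault_detected()
controller.on_repair_completed()
```

Guarding each transition on the current mode keeps a repeated fault report from sending a second interrupt to the host; this guard is a design choice of the sketch, not something the text specifies.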

In a possible implementation, the programmable logic device includes a status register, and before the programmable logic device enters the pickup mode in response to detecting that the embedded processor in the data processor has failed, the computer program 404 further includes instructions for performing the following steps:

acquiring register information of the status register, where the register information records the running state of the embedded processor;

determining, according to the register information, whether the embedded processor is faulty.

In a possible implementation, in determining whether the embedded processor is faulty according to the register information, the computer program 404 specifically includes instructions for performing the following steps:

reading, at a preset period, the state value of a first flag bit in the status register from the register information;

determining, from the state value read, whether the first flag bit is set;

in response to the first flag bit not having been set when a preset time is reached, determining that the embedded processor is faulty.
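
The periodic check of the first flag bit behaves like a heartbeat watchdog. The sketch below assumes that the embedded processor sets the flag while healthy and that a single observation of the set flag within the preset time counts as alive; the function name, the `read_flag` callback and the injectable `sleep` are hypothetical and are not taken from the application.

```python
import time


def heartbeat_failed(read_flag, poll_period, timeout, sleep=time.sleep):
    """Poll the first flag bit every poll_period until timeout has elapsed.

    Returns False as soon as the flag is observed set, and True (fault)
    if the preset time is reached without the flag ever being set.
    """
    elapsed = 0
    while elapsed < timeout:
        if read_flag():
            return False
        sleep(poll_period)
        elapsed += poll_period
    return True


# Example with a stubbed register read and no real waiting:
samples = iter([0, 0, 1])
faulty = heartbeat_failed(lambda: next(samples), poll_period=1, timeout=5,
                          sleep=lambda seconds: None)
```

Passing `sleep` as a parameter lets the polling loop be exercised without real delays; in hardware the period and timeout would more likely be counters driven by a clock.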

In a possible implementation, in determining whether the embedded processor is faulty according to the register information, the computer program 404 specifically includes instructions for performing the following steps:

reading the state value of a second flag bit in the status register from the register information, where the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is used to detect the running state of the embedded processor and to feed the first signal back to the programmable logic device according to that running state;

in response to reading that the state value of the second flag bit is a preset value, determining that the embedded processor is faulty.
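
The second flag bit check reduces to extracting one bit from the status register and comparing it with the preset value. In this minimal sketch the bit position and the preset value are assumptions chosen for illustration; the application does not specify either.

```python
SECOND_FLAG_BIT = 1  # hypothetical bit position within the status register
FAULT_VALUE = 1      # hypothetical preset value meaning "fault reported"


def cpld_reports_fault(status_register: int) -> bool:
    """Return True if the second flag bit equals the preset fault value."""
    flag = (status_register >> SECOND_FLAG_BIT) & 1
    return flag == FAULT_VALUE
```

For example, under these assumptions `cpld_reports_fault(0b10)` is true and `cpld_reports_fault(0b01)` is false.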

In a possible implementation, the pickup mode further includes:

sending a first processing layer data packet for error reporting to the computer host, so as to isolate the computer host from the fault of the embedded processor.

In a possible implementation, where the programmable logic device includes a status register, in sending the first processing layer data packet for error reporting to the computer host, the computer program 404 specifically includes instructions for performing the following steps:

reading the fault source and fault type of the embedded processor from the register information;

generating the first processing layer data packet for error reporting according to the fault source and the fault type;

sending the first processing layer data packet to the computer host.
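
The read, generate and send steps above can be sketched as follows. The register field layout, the marker byte and the three-byte packet are all invented for illustration; they are not the packet format of any real bus and are not described in the application.

```python
import struct

# Hypothetical layout of the register information:
# bits 15..8 hold the fault source, bits 7..0 hold the fault type.
SOURCE_SHIFT, SOURCE_MASK = 8, 0xFF
TYPE_SHIFT, TYPE_MASK = 0, 0xFF
ERROR_MARKER = 0xE0  # hypothetical marker identifying an error-report packet


def build_error_packet(register_info: int) -> bytes:
    """Decode fault source and type, then pack them into a 3-byte report."""
    source = (register_info >> SOURCE_SHIFT) & SOURCE_MASK
    fault_type = (register_info >> TYPE_SHIFT) & TYPE_MASK
    return struct.pack(">BBB", ERROR_MARKER, source, fault_type)


packet = build_error_packet(0x0203)  # source 0x02, type 0x03
```

Sending the packet is left out because the transport between the programmable logic device and the host depends on the actual interconnect.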

In a possible implementation, a first-in-first-out queue channel exists between the computer host, the programmable logic device and the embedded processor, and before the programmable logic device enters the pickup mode in response to detecting that the embedded processor in the data processor has failed, the computer program 404 further includes instructions for performing the following step:

in response to reading a preset flag bit from the computer host, determining that the embedded processor is faulty, where the preset flag bit indicates that the embedded processor is faulty.
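
One way to picture this check is a scan of entries arriving over the first-in-first-out channel. The sketch below assumes that each entry is an integer whose lowest bit carries the preset flag; the entry format, the flag position and the function name are hypothetical, since the application does not say how the flag is encoded.

```python
from collections import deque

FAULT_FLAG = 0x1  # hypothetical bit marking "embedded processor faulty"


def fault_flag_seen(fifo: deque) -> bool:
    """Consume entries in arrival order and stop at the first flagged entry."""
    while fifo:
        entry = fifo.popleft()
        if entry & FAULT_FLAG:
            return True
    return False


channel = deque([0x0, 0x0, 0x1, 0x0])
flagged = fault_flag_seen(channel)
```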

Those skilled in the art will understand that, for ease of description, FIG. 4 shows only one memory and one processor. An actual terminal or server may have multiple processors and memories. The memory 402 may also be referred to as a storage medium, a storage device, or the like, which is not limited in the embodiments of the present application.

It should be understood that, in the embodiments of the present application, the processor 401 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.

It should also be understood that the memory 402 mentioned in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM) or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct Rambus RAM (DR RAM).

It should be noted that when the processor 401 is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) is integrated in the processor.

It should be noted that the memory 402 described herein is intended to include, without being limited to, these and any other suitable types of memory.

In addition to a data bus, the bus 405 may include a power bus, a control bus, a status signal bus, and the like. For clarity of description, however, the various buses are all labelled as the bus in the figure.

In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The steps of the method disclosed in the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well established in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, this is not described in detail here.

In the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application in any way.

Those of ordinary skill in the art will appreciate that the various illustrative logical blocks (ILBs) and steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (for example coaxial cable, optical fiber or digital subscriber line) or wireless means (for example infrared, radio or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example a floppy disk, hard disk or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid-state drive).

An embodiment of the present application further provides a computer-readable storage medium that stores a computer program, where the computer program is executed by a processor to implement some or all of the steps of any of the fault handling methods described in the above method embodiments.

An embodiment of the present application further provides a computer program product that includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps of any of the fault handling methods described in the above method embodiments.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A fault handling method, applied to a programmable logic device, characterized by comprising:

in response to detecting that an embedded processor in a data processor has failed, entering, by the programmable logic device, a pickup mode, wherein the pickup mode comprises sending a hot-plug interrupt signal to a computer host to isolate the computer host from the fault of the embedded processor, and the hot-plug interrupt signal indicates that the embedded processor has performed a hot-plug operation; and

in response to detecting that the embedded processor has been repaired, sending, by the programmable logic device, a hot-insertion signal to the computer host and exiting the pickup mode to complete fault recovery, wherein the hot-insertion signal indicates that the embedded processor has performed a hot-insertion operation.

2. The method of claim 1, wherein the programmable logic device comprises a status register, and wherein, before the programmable logic device enters the pickup mode in response to detecting that the embedded processor in the data processor has failed, the method further comprises:

acquiring, by the programmable logic device, register information of the status register, wherein the register information records the running state of the embedded processor; and

determining, by the programmable logic device, whether the embedded processor is faulty according to the register information.

3. The method of claim 2, wherein the determining, by the programmable logic device, whether the embedded processor is faulty according to the register information comprises:

reading, by the programmable logic device at a preset period, a state value of a first flag bit in the status register from the register information;

determining, by the programmable logic device, whether the first flag bit is set according to the state value read; and

in response to the first flag bit not having been set when a preset time is reached, determining, by the programmable logic device, that the embedded processor is faulty.

4. The method of claim 2, wherein the determining, by the programmable logic device, whether the embedded processor is faulty according to the register information comprises:

reading, by the programmable logic device, a state value of a second flag bit in the status register from the register information, wherein the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is used to detect the running state of the embedded processor and to feed the first signal back to the programmable logic device according to the running state of the embedded processor; and

in response to reading that the state value of the second flag bit is a preset value, determining, by the programmable logic device, that the embedded processor is faulty.

5. The method of claim 1, wherein the pickup mode further comprises:

sending, by the programmable logic device, a first processing layer data packet for error reporting to the computer host, so as to isolate the computer host from the fault of the embedded processor.

6. The method of claim 5, wherein the programmable logic device comprises a status register, and wherein the sending, by the programmable logic device, of the first processing layer data packet for error reporting to the computer host comprises:

reading, by the programmable logic device, a fault source and a fault type of the embedded processor from the register information;

generating, by the programmable logic device, the first processing layer data packet for error reporting according to the fault source and the fault type; and

sending, by the programmable logic device, the first processing layer data packet to the computer host.

7. The method of claim 1, wherein a first-in-first-out queue channel exists between the computer host, the programmable logic device and the embedded processor, and wherein, before the programmable logic device enters the pickup mode in response to detecting that the embedded processor in the data processor has failed, the method further comprises:

in response to reading a preset flag bit from the computer host, determining, by the programmable logic device, that the embedded processor is faulty, wherein the preset flag bit indicates that the embedded processor is faulty.

8. A fault handling apparatus, applied to a programmable logic device, characterized by comprising:

a fault isolation unit, configured to cause the programmable logic device to enter a pickup mode in response to detecting that an embedded processor in a data processor has failed, wherein the pickup mode comprises sending a hot-plug interrupt signal to a computer host to isolate the computer host from the fault of the embedded processor, and the hot-plug interrupt signal indicates that the embedded processor has performed a hot-plug operation; and

a fault recovery unit, configured to cause the programmable logic device, in response to detecting that the embedded processor has been repaired, to send a hot-insertion signal to the computer host and exit the pickup mode to complete fault recovery, wherein the hot-insertion signal indicates that the embedded processor has performed a hot-insertion operation.

9. A computer device, characterized in that it comprises a processor, a memory and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps in the method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program causes a computer to perform the method of any one of claims 1-7.