CN118819941A - Fault diagnosis method, device, equipment, storage medium and program product

Info

Publication number: CN118819941A
Application number: CN202411307399.8A
Authority: CN
Inventors: 肖如杏; 潘伟光; 欧阳晔
Original assignee: Hangzhou Yaxin Software Co ltd
Current assignee: Hangzhou Yaxin Software Co ltd
Priority date: 2024-09-19
Filing date: 2024-09-19
Publication date: 2024-10-22

Abstract

The application discloses a fault diagnosis method, apparatus, device, storage medium and program product in the technical field of artificial intelligence. The method comprises: receiving input fault description information and determining an operation and maintenance tool list related to the fault description information; and performing multi-step diagnosis, where each diagnosis step comprises: filling the fault description information, the operation and maintenance tool list and history information into a first prompt word template to obtain a first prompt word, the first prompt word indicating the rules and format for fault diagnosis, the rules including at least: calling an abnormal indicator detection tool to determine an abnormal indicator, diagnosing the cause of the abnormality based on the abnormal indicator and knowledge obtained through a knowledge recall tool, and using operation and maintenance tools to assist diagnosis; inputting the first prompt word into a large language model to generate a decision result; if the decision result indicates calling a tool, calling the tool to obtain a calling result and then entering the next diagnosis step; and if the decision result indicates outputting a diagnosis result, outputting the diagnosis result. The application improves the accuracy of fault diagnosis.

Description

Fault diagnosis method, device, equipment, storage medium and program product

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to a fault diagnosis method, apparatus, device, storage medium and program product.

Background Art

With the advancement of technology, the software architecture of business systems is becoming increasingly complex. To keep a business system running stably and reliably, its mean time between failures (MTBF) should be as long as possible and its mean time to repair (MTTR) as short as possible. This requires faults to be diagnosed and resolved quickly after they occur.

Current fault diagnosis is mainly manual. Typically, after the operation and maintenance department receives a customer complaint or an equipment alarm, office personnel report it to Site Reliability Engineering (SRE) engineers, who analyze and handle the problem based on experience. This approach is both inefficient and costly.

To improve the efficiency of fault diagnosis, some solutions use a large language model (LLM) to assist diagnosis. This improves efficiency, but the model exhibits an obvious hallucination problem during fault diagnosis, and its accuracy still needs to be improved.

Summary of the Invention

In view of the above problems, the present application provides a fault diagnosis method, apparatus, device, storage medium and program product to improve the accuracy of fault diagnosis. The specific scheme is as follows:

A first aspect of the present application provides a fault diagnosis method, the method comprising:

receiving input fault description information;

determining an operation and maintenance tool list related to the fault description information;

performing multi-step diagnosis using a large language model based on the fault description information and the operation and maintenance tool list, wherein each diagnosis step comprises:

adding the fault description information, the operation and maintenance tool list and historical information to a first prompt word template to obtain a first prompt word; the first prompt word indicates the rules and format to be followed by the large language model when performing fault diagnosis based on the fault description information and the historical information; wherein the rules at least include: calling an abnormal indicator detection tool during the diagnosis process to determine an abnormal indicator, diagnosing the cause of the abnormality based on the abnormal indicator and knowledge obtained through a knowledge recall tool, and using the operation and maintenance tools in the operation and maintenance tool list to assist diagnosis;

inputting the first prompt word into the large language model to obtain a decision result generated by the large language model;

if the decision result indicates calling the abnormal indicator detection tool, an operation and maintenance tool or the knowledge recall tool, calling the indicated tool and, after the calling result is obtained, entering the next diagnosis step; wherein in the first diagnosis step the historical information is empty, and in any subsequent diagnosis step the historical information includes the decision results and calling results obtained in previous diagnosis steps;

if the decision result indicates outputting a diagnosis result, outputting the diagnosis result based on the decision result.
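As a non-limiting illustration, the multi-step diagnosis of the first aspect may be sketched in Python as follows. The prompt builder, the model call and the tools are passed in as callables because the application does not prescribe their implementation; the function name `diagnose`, the `max_steps` safeguard and the dictionary layout of the history are assumptions made for this sketch, and the action name "Speak" is borrowed from the example prompt given in the detailed description.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation of a decision result: (action name, action input).
Decision = Tuple[str, dict]


def diagnose(
    fault_description: str,
    tool_list: List[str],
    build_prompt: Callable[[str, List[str], List[dict]], str],
    llm_decide: Callable[[str], Decision],
    tools: Dict[str, Callable[[dict], str]],
    max_steps: int = 10,  # assumption: a step limit is not stated in the source
) -> dict:
    """Repeat diagnosis steps until the model decides to output a diagnosis result."""
    history: List[dict] = []  # the historical information is empty in the first step
    for _ in range(max_steps):
        prompt = build_prompt(fault_description, tool_list, history)
        action, action_input = llm_decide(prompt)
        if action == "Speak":
            # The decision result indicates outputting the diagnosis result.
            return action_input
        if action in tools:
            result = tools[action](action_input)
        else:
            result = f"unknown tool: {action}"
        # Later steps see the earlier decision results and calling results.
        history.append({"decision": (action, action_input), "result": result})
    raise RuntimeError("no diagnosis result within max_steps")
```

Keeping the model call behind a plain callable lets the loop be exercised with a scripted stand-in before a real large language model is connected.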

In a possible implementation, determining the operation and maintenance tool list related to the fault description information includes:

obtaining a vector representation of the fault description information;

calculating, based on the vector representation of the fault description information and the vector representation of the description information of each operation and maintenance tool pre-stored in a database, the similarity between the description information of each operation and maintenance tool in the database and the fault description information;

determining the operation and maintenance tool list related to the fault description information, wherein the similarity between the fault description information and the description information of any operation and maintenance tool in the list is greater than the similarity between the fault description information and the description information of any operation and maintenance tool not in the list.
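A minimal sketch of this selection is given below. It assumes that the vectors have already been produced by some embedding model, that cosine similarity is used as the similarity measure, and that the list keeps a fixed number `top_k` of tools; the application names neither the measure nor a list size, so all three are assumptions, as are the function names.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tools(
    fault_vec: Sequence[float],
    tool_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,  # assumption: the size of the tool list is not stated
) -> List[str]:
    """Return the names of the top_k tools whose descriptions are most similar."""
    ranked = sorted(
        tool_vecs.items(),
        key=lambda item: cosine_similarity(fault_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]
```

Taking the highest-ranked tools ensures that every tool in the list is at least as similar to the fault description as any tool left out of it.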

In a possible implementation, the process of calling the knowledge recall tool to obtain a calling result includes:

inputting the abnormal indicator into the knowledge recall tool, so that the knowledge recall tool matches the abnormal indicator against each knowledge item in the knowledge base and outputs, as the calling result, the fault analysis steps in the knowledge items that match the abnormal indicator.
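The matching performed by the knowledge recall tool may, for example, be sketched as follows. The four fields of `KnowledgeItem` follow the knowledge item fields named in this application (fault name, fault description, fault-related indicators, fault analysis steps); the Python attribute names and the exact-name matching rule are assumptions, since the application does not specify how an indicator is matched to a knowledge item.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KnowledgeItem:
    fault_name: str
    fault_description: str
    related_metrics: List[str]  # fault-related indicators
    analysis_steps: str         # fault analysis steps


def match_diagnose_knowledge(
    metric_name: str, knowledge_base: List[KnowledgeItem]
) -> List[str]:
    """Return the analysis steps of every knowledge item related to the indicator."""
    # Assumption: an item matches when the indicator name appears in its indicator list.
    return [
        item.analysis_steps
        for item in knowledge_base
        if metric_name in item.related_metrics
    ]
```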

In a possible implementation, each knowledge item in the knowledge base includes the following fields: fault name, fault description, fault-related indicators, and fault analysis steps;

the knowledge items are extracted as follows:

constructing a document fragment tree from a knowledge document according to its chapter structure; the root node of the document fragment tree corresponds to the knowledge document, and each non-root node corresponds to a chapter of the knowledge document; the root node includes the title of the knowledge document; if a non-root node is a leaf node, it includes all the contents of the corresponding chapter; if a non-root node is not a leaf node, it includes the chapter title and the overview of the corresponding chapter;

calling the large language model to traverse the nodes in the document fragment tree and, each time a node is traversed, extracting knowledge items from that node; each knowledge item is extracted from a single node.
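For illustration only, the construction and traversal of the document fragment tree might be sketched as below. The `Section` input type, the `Node` type and the depth-first pre-order traversal are assumptions; the application specifies what each node contains but not the data structures or the traversal order, and the extraction step performed by the large language model is omitted here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Section:
    """Hypothetical parsed chapter of a knowledge document."""
    title: str
    overview: str = ""  # used when the chapter has sub-chapters
    body: str = ""      # full chapter contents, used when it has none
    children: List["Section"] = field(default_factory=list)


@dataclass
class Node:
    content: str
    children: List["Node"] = field(default_factory=list)


def build_node(section: Section) -> Node:
    if not section.children:
        # Leaf node: all the contents of the corresponding chapter.
        return Node(content=section.body)
    # Non-leaf node: chapter title plus chapter overview.
    return Node(
        content=f"{section.title}\n{section.overview}",
        children=[build_node(child) for child in section.children],
    )


def build_tree(document_title: str, chapters: List[Section]) -> Node:
    # Root node: corresponds to the document and holds its title.
    return Node(content=document_title, children=[build_node(c) for c in chapters])


def traverse(node: Node) -> Iterator[Node]:
    """Visit nodes depth-first, parents before children."""
    yield node
    for child in node.children:
        yield from traverse(child)
```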

In a possible implementation, the root node also includes a summary of the title of the knowledge document;

if a non-root node is a leaf node, it also includes a summary of all the contents of the corresponding chapter;

if a non-root node is not a leaf node, it also includes a summary of the chapter title and the overview of the corresponding chapter.

In a possible implementation, calling the large language model to traverse the nodes in the document fragment tree and extracting knowledge items from each traversed node includes:

adding the document fragment tree to a second prompt word template to obtain a second prompt word; the second prompt word instructs the large language model to traverse the nodes in the document fragment tree one by one and, for each traversed node, to write at least one piece of detailed diagnostic information in a preset format according to the content of that node, where the detailed diagnostic information extracted from different nodes is different;

each piece of detailed diagnostic information is one knowledge item.

A second aspect of the present application provides a fault diagnosis apparatus, comprising:

a receiving module, configured to receive input fault description information;

a tool determination module, configured to determine an operation and maintenance tool list related to the fault description information;

a diagnosis module, configured to perform multi-step diagnosis using a large language model based on the fault description information and the operation and maintenance tool list, wherein each diagnosis step includes: adding the fault description information, the operation and maintenance tool list and historical information to a first prompt word template to obtain a first prompt word; the first prompt word indicates the rules and format to be followed by the large language model when performing fault diagnosis based on the fault description information and the historical information; wherein the rules at least include: calling an abnormal indicator detection tool during the diagnosis process to determine an abnormal indicator, diagnosing the cause of the abnormality based on the abnormal indicator and knowledge obtained through a knowledge recall tool, and using the operation and maintenance tools in the operation and maintenance tool list to assist diagnosis; inputting the first prompt word into the large language model to obtain a decision result generated by the large language model; if the decision result indicates calling the abnormal indicator detection tool, an operation and maintenance tool or the knowledge recall tool, calling the indicated tool and, after the calling result is obtained, entering the next diagnosis step; wherein in the first diagnosis step the historical information is empty, and in any subsequent diagnosis step the historical information includes the decision results and calling results obtained in previous diagnosis steps;

an output module, configured to output a diagnosis result based on the decision result if the decision result indicates outputting a diagnosis result.

A third aspect of the present application provides a computer program product, comprising computer-readable instructions which, when run on an electronic device, cause the electronic device to implement the fault diagnosis method of the first aspect or of any implementation of the first aspect.

A fourth aspect of the present application provides an electronic device, comprising at least one processor and a memory connected to the processor, wherein:

the memory is configured to store a computer program;

the processor is configured to execute the computer program, so that the electronic device can implement the fault diagnosis method of the first aspect or of any implementation of the first aspect.

A fifth aspect of the present application provides a computer storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement the fault diagnosis method of the first aspect or of any implementation of the first aspect.

By means of the above technical scheme, the fault diagnosis method, apparatus, device, storage medium and program product provided by the present application, after receiving the input fault description information, determine the operation and maintenance tool list related to the fault description information and use the large language model to perform multi-step diagnosis based on the fault description information and the operation and maintenance tool list. Each diagnosis step includes: adding the fault description information, the operation and maintenance tool list and the historical information to the first prompt word template to obtain the first prompt word, which indicates the rules and format to be followed by the large language model when performing fault diagnosis based on the fault description information and the historical information; the rules at least include: calling the abnormal indicator detection tool during the diagnosis process to determine the abnormal indicator, diagnosing the cause of the abnormality based on the abnormal indicator and knowledge obtained through the knowledge recall tool, and optionally using the tools in the operation and maintenance tool list to assist diagnosis; inputting the first prompt word into the large language model to obtain the decision result generated by the large language model; if the decision result indicates calling the abnormal indicator detection tool, an operation and maintenance tool or the knowledge recall tool, calling the indicated tool and, after the calling result is obtained, entering the next diagnosis step, where in the first diagnosis step the historical information is empty and in any subsequent diagnosis step the historical information includes the decision results and calling results obtained in previous diagnosis steps; and if the decision result indicates outputting a diagnosis result, outputting the diagnosis result based on the decision result. In the process of fault diagnosis based on the large language model, the present application performs multi-step automatic fault diagnosis by calling tools and related knowledge, which overcomes the hallucination problem of the large language model in the fault diagnosis process and improves the accuracy of fault diagnosis.

Brief Description of the Drawings

The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

FIG. 1 is a flowchart of an implementation of the fault diagnosis method provided by the present application;

FIG. 2 is a schematic structural diagram of a document fragment tree provided by the present application;

FIG. 3 is another schematic structural diagram of the document fragment tree provided by the present application;

FIG. 4 is a schematic example of the principle of fault diagnosis based on a large language model provided by the present application;

FIG. 5 is a schematic structural diagram of the fault diagnosis apparatus provided by the present application;

FIG. 6 is a schematic structural diagram of the electronic device provided by the present application.

Detailed Description

The embodiments of the present application are described below in conjunction with the drawings of the embodiments. The terms used in the detailed description are only intended to explain specific embodiments of the present application and are not intended to limit the present application.

The embodiments of the present application are described below in conjunction with the accompanying drawings. Those of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.

The terms "first", "second" and the like in the specification, the claims and the above drawings of the present application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the way in which objects with the same attributes are distinguished in the description of the embodiments of the present application. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product or device comprising a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to the process, method, product or device.

As shown in FIG. 1, which is a flowchart of an implementation of the fault diagnosis method provided in an embodiment of the present application, the method may include:

Step S101: receiving input fault description information.

The fault description information may be written by an SRE engineer based on a device alarm or a customer complaint received by the operation and maintenance department. As an example, the fault description information may include, but is not limited to: occurrence time, abnormality description, severity level (e.g., warning or critical), and additional features (e.g., abnormal status). If the fault has ended, the fault description information may also include the end time.

Step S102: determining an operation and maintenance tool list related to the fault description information.

When diagnosing a problem, SRE engineers rely not only on experience but also make frequent use of operation and maintenance tools (e.g., metric monitoring systems, log monitoring systems, the operating system command line, SQL optimization tools). For a large language model (LLM) to diagnose faults, it likewise needs to know which operation and maintenance tools it can call.

Operation and maintenance tools may also be called optimization tools. To help the large language model distinguish between different operation and maintenance tools, the present application establishes a structured hierarchy, based on the characteristics of operation and maintenance, to classify and organize the tools. The first level has three categories: monitoring tools, anomaly detection tools, and optimization configuration tools. For monitoring tools and optimization configuration tools, the second level is divided according to the operation and maintenance object; for anomaly detection tools, the second level is divided according to the detection method. Table 1 shows an example of classifying operation and maintenance tools in a database scenario (i.e., a scenario in which a database is diagnosed for faults) provided in an embodiment of the present application. In other scenarios (e.g., an operating system scenario), the operation and maintenance tools and the way they are classified usually differ.

Table 1

The operation and maintenance tools in the operation and maintenance tool list related to the fault description information may be determined based on the similarity between the fault description information and the description information of each operation and maintenance tool.

Step S103: performing multi-step diagnosis using the large language model based on the above fault description information and the operation and maintenance tool list, wherein each diagnosis step includes:

adding the fault description information, the operation and maintenance tool list and the historical information to the first prompt word template to obtain the first prompt word; the first prompt word indicates the rules and format to be followed by the large language model when performing fault diagnosis based on the fault description information and the historical information; wherein the rules at least include: calling the abnormal indicator detection tool during the diagnosis process to determine the abnormal indicator, diagnosing the cause of the abnormality based on the abnormal indicator and knowledge obtained through the knowledge recall tool, and optionally using the operation and maintenance tools in the operation and maintenance tool list to assist diagnosis.

In addition to the above rules and format, the first prompt word may also include the following information:

1) A description of the current fault diagnosis expert, which describes the role and function of the current diagnosis expert. This information is part of the first prompt word template.

2) Alarm details, including but not limited to: occurrence time, abnormality description, severity level (e.g., warning or critical), and additional features (e.g., abnormal status). This information is the content added to the first prompt word template, that is, the fault description information entered by the user. If the alarm has been cleared, the alarm details may also include the end time.

3) A list of tools needed to diagnose the current fault. Some tools must be used in the fault diagnosis process of the present application, such as the knowledge recall tool and the abnormal indicator detection tool; information describing these mandatory tools can therefore be written into the first prompt word template in advance (that is, it is built into the template). Other tools are used only in certain situations; the list of these tools is recalled from the database, that is, it is the operation and maintenance tool list determined in step S102.

To allow the large language model to use the fault knowledge in the knowledge base when diagnosing faults, the present application permanently includes the knowledge recall tool "match_diagnose_knowledge" in the first prompt word template. As an example, the knowledge recall tool is defined as follows:

“{
  "name": "match_diagnose_knowledge",
  "description": "Search for relevant fault diagnosis knowledge in the diagnostic knowledge base.",
  "parameters": {
    "type": "object",
    "properties": {
      "metric_name": {
        "type": "string",
        "description": "Indicator name, used to search for indicator-related knowledge"
      }
    },
    "required": [
      "metric_name"
    ]
  }
}”

4) Historical messages, including the decision results previously generated by the large language model, tool calling results, and so on. This information is updated after each diagnosis step.

An example of the first prompt word provided in an embodiment of the present application is shown below:
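One way to assemble these four kinds of information into a first prompt word is sketched below using Python's standard `string.Template`, whose `${name}` placeholders have the same form as those in the example prompt of this application. The shortened template text, the function names and the rule of deduplicating tools by their "name" field are assumptions made for illustration.

```python
import json
from string import Template
from typing import Dict, List

# Hypothetical, shortened template; a real template would also carry the
# expert description and the response rules and format.
FIRST_PROMPT_TEMPLATE = Template(
    "Abnormal alarm information:\n${alert_info}\n"
    "Available tools:\n${tools}\n"
    "Conversation history:\n${chat_history}\n"
)


def merge_tools(builtin: List[Dict], retrieved: List[Dict]) -> List[Dict]:
    """Combine built-in and retrieved tool definitions, keeping one entry per name."""
    merged: Dict[str, Dict] = {}
    for tool in list(builtin) + list(retrieved):
        merged.setdefault(tool["name"], tool)  # the first definition wins
    return list(merged.values())


def build_first_prompt(
    alert_info: str,
    builtin: List[Dict],
    retrieved: List[Dict],
    history: List[str],
) -> str:
    tools = merge_tools(builtin, retrieved)
    return FIRST_PROMPT_TEMPLATE.substitute(
        alert_info=alert_info,
        tools=json.dumps(tools, ensure_ascii=False, indent=2),
        chat_history="\n".join(history),  # empty in the first diagnosis step
    )
```

`Template.substitute` raises `KeyError` when a placeholder is left unfilled, which catches an incomplete prompt before it reaches the model.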

“
You are an IT system fault expert, and your company's IT system has encountered an anomaly. The anomaly started at ${start_time} and ended at ${end_time}.

The alert information for the anomaly is as follows:
${alert_info}

# Response rules and format

- During the diagnosis process, you can use the following tools:
${tools}

==============================

- Return the following format to call a tool:
Thinking: (your thinking)
Action: (an action name, which can be one of: [metric_abnormal_detect, match_diagnose_knowledge, optimize_index_selection, Speak]; names are case-sensitive)
Action input: (the parameters of the action)

You can first call a tool to determine the abnormal indicators, in the following format:
Thinking: Now that I have the start and end time of the anomaly, check whether CPU usage was abnormal during this period.
Action: metric_abnormal_detect
Action input: {"start_time": ${start_time}, "end_time": ${end_time}, "metric_name": "cpu_usage"}

Next, you must use tools to diagnose the root cause. You must use the following format (no other options are allowed):
Thinking: Use the metrics and the knowledge obtained from match_diagnose_knowledge to diagnose the cause of the anomaly.
Action: match_diagnose_knowledge
Action input: {"metric_name": "cpu_usage"}

In addition, if you find that an index is missing, you need to call the optimize_index_selection API to get the recommended index, in the following format:
Thinking: Since an index is missing, I need to call the optimize_index_selection API to get the recommended index.
Action: optimize_index_selection
Action input: {"start_time": ${start_time}, "end_time": ${end_time}}

Once you have the observation from match_diagnose_knowledge, analyze the root cause and announce it to the other experts, in the following format:
Thinking: I now know the root cause of the anomaly.
Action: Speak
Action input: ({"diagnose": the root cause you found, "solution": the optimization solutions for the root cause, separated by '\n', "knowledge": the diagnostic knowledge you used})

==============================

The conversation history is as follows:
${chat_history}

The tool execution results are as follows:
${tool_observation}

Remember to pay attention to the response format instructions and strictly follow the rules specified above!

Based on the above history, what will you do next as ${agent_name}?
”

上述第一提示词的示例中，“alert_info”为用户输入的故障描述信息；In the example of the first prompt word, "alert_info" is the fault description information entered by the user;

“tools”中包括与故障描述信息相关的运维工具列表，还包括进行故障诊断所必须的运维工具；"tools" includes a list of operation and maintenance tools related to the fault description information, as well as the operation and maintenance tools required for fault diagnosis;

The tools in "[metric_abnormal_detect, match_diagnose_knowledge, optimize_index_selection, Speak]" are the tools in "tools"; they are listed again here to emphasize the operation and maintenance tools to the large language model.

Here, "optimize_index_selection" is an operation and maintenance tool related to the fault description information, and the other tools come with the first prompt word template: metric_abnormal_detect is the tool for abnormal metric detection, match_diagnose_knowledge is the knowledge recall tool, and Speak is the output tool. The list of operation and maintenance tools related to the fault description information may itself contain tools that come with the first prompt word template, in which case "tools" needs to be deduplicated.

“{chat_history}”即为大语言模型输出的历史决策结果；"{chat_history}" is the historical decision results output by the large language model;

“{tool_observation}”即为待更新的工具调用结果；"{tool_observation}" is the tool call result to be updated;

“agent_name”表示故障专家的名字。该信息是第一提示词模板中自带的信息。"agent_name" indicates the name of the fault expert. This information is included in the first prompt word template.
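The placeholder filling and tool deduplication described above can be sketched as follows. This is a minimal illustration, not the template of the application: `FIRST_PROMPT_TEMPLATE` is a heavily shortened, hypothetical stand-in for the first prompt word template, and `merge_tools`, `build_first_prompt` and the default agent name `fault_expert` are names assumed here. Python's `string.Template` is used because its `${name}` syntax matches the placeholders in the example.

```python
from string import Template

# Tools that come with the first prompt word template (names from the example above).
BUILTIN_TOOLS = ["metric_abnormal_detect", "match_diagnose_knowledge", "Speak"]

# Hypothetical, shortened stand-in for the first prompt word template.
FIRST_PROMPT_TEMPLATE = Template(
    "Alert: ${alert_info}\n"
    "Tools: ${tools}\n"
    "The conversation history is as follows\n${chat_history}\n"
    "The tool execution results are as follows\n${tool_observation}\n"
    "Based on the above history, what will you do next as ${agent_name}?"
)


def merge_tools(builtin, recalled):
    """Concatenate both tool lists and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys([*builtin, *recalled]))


def build_first_prompt(alert_info, recalled_tools, chat_history="",
                       tool_observation="", agent_name="fault_expert"):
    """Fill the template; the history fields are empty in the first diagnosis step."""
    tools = merge_tools(BUILTIN_TOOLS, recalled_tools)
    return FIRST_PROMPT_TEMPLATE.substitute(
        alert_info=alert_info,
        tools=", ".join(tools),
        chat_history=chat_history,
        tool_observation=tool_observation,
        agent_name=agent_name,
    )


# The recalled list repeats a built-in tool, so deduplication is exercised.
prompt = build_first_prompt(
    "CPU usage is high",
    ["optimize_index_selection", "metric_abnormal_detect"],
)
print(prompt)
```

In later diagnosis steps the same function would be called again with non-empty `chat_history` and `tool_observation` values.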

将第一提示词输入大语言模型，得到大语言模型生成的决策结果。The first prompt word is input into the large language model to obtain a decision result generated by the large language model.

基于上述第一提示词示例，大语言模型的生成的决策结果的格式为：Based on the first prompt word example above, the format of the decision result generated by the large language model is:

““

思考：xxxThinking: xxx

行动：match_diagnose_knowledgeAction: match_diagnose_knowledge

Action input: {"metric_name": "cpu"}

””

The above example decision result indicates calling the knowledge recall tool "match_diagnose_knowledge"; the parameter required for the call is the metric name, here cpu.
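A decision result in this format has to be split into the tool name and its parameters before a tool can be called. The sketch below shows one possible way to do that with a regular expression. The function name `parse_decision` and the returned dictionary layout are assumptions made for illustration; the application does not specify a parser. The sketch does not handle the parenthesized action input of the Speak format.

```python
import json
import re

# Matches the three labelled fields of a decision result, in order.
DECISION_PATTERN = re.compile(
    r"Thinking:\s*(?P<thought>.*?)\s*"
    r"Action:\s*(?P<action>\S+)\s*"
    r"Action input:\s*(?P<action_input>.*)",
    re.DOTALL,
)


def parse_decision(text):
    """Split a decision result into thought, tool name and JSON parameters."""
    match = DECISION_PATTERN.search(text)
    if match is None:
        raise ValueError("decision result does not follow the required format")
    raw_input = match.group("action_input").strip()
    # Normalize typographic quotes so the parameters parse as JSON.
    raw_input = raw_input.replace("“", '"').replace("”", '"')
    return {
        "thought": match.group("thought"),
        "action": match.group("action"),
        "action_input": json.loads(raw_input),
    }


decision = parse_decision(
    "Thinking: xxx\n"
    "Action: match_diagnose_knowledge\n"
    'Action input: {"metric_name": "cpu"}'
)
print(decision["action"], decision["action_input"])
```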

如果决策结果指示调用工具（异常指标检测工具或运维工具列表中的运维工具或知识召回工具），在调用工具，得到调用结果后，进入下一诊断步骤；其中，在第一个诊断步骤中，历史信息为空，在非第一个诊断步骤中，历史信息包括历史诊断步骤得到的决策结果和调用结果。If the decision result indicates to call a tool (an abnormal indicator detection tool or an operation and maintenance tool in the operation and maintenance tool list or a knowledge recall tool), after calling the tool and obtaining the calling result, proceed to the next diagnosis step; wherein, in the first diagnosis step, the historical information is empty, and in the non-first diagnosis step, the historical information includes the decision results and calling results obtained in the historical diagnosis step.

在决策结果指示调用异常指标检测工具或运维工具或知识召回工具的情况下，则根据决策结果的指示调用异常指标检测工具或运维工具或知识召回工具，在调用异常指标检测工具或运维工具或知识召回工具得到调用结果后，就可以进入下一步的诊断了。When the decision result indicates to call the abnormal indicator detection tool, operation and maintenance tool, or knowledge recall tool, the abnormal indicator detection tool, operation and maintenance tool, or knowledge recall tool is called according to the instruction of the decision result. After calling the abnormal indicator detection tool, operation and maintenance tool, or knowledge recall tool and obtaining the calling result, the next step of diagnosis can be carried out.

In the present application, every diagnosis step relies on the large language model, and every call to the large language model requires a prompt word. Each prompt word is obtained by adding the fault description information, the operation and maintenance tool list and the historical information to the first prompt word template. In other words, the prompt words used in different diagnosis steps differ only in the historical information (i.e., {chat_history} and {tool_observation}). Therefore, the calling result can be added to "{tool_observation}" in the first prompt word and the decision result can be added to "{chat_history}" in the first prompt word to obtain an updated first prompt word, and the large language model is called with the updated first prompt word to perform the next diagnosis step.

步骤S104：如果决策结果指示输出诊断结果，基于决策结果输出诊断结果。Step S104: If the decision result indicates to output a diagnosis result, output a diagnosis result based on the decision result.

本申请中，如果决策结果指示调用工具“Speak”，说明决策结果指示输出诊断结果，则基于决策结果输出诊断结果。In the present application, if the decision result indicates calling the tool "Speak", it means that the decision result indicates outputting a diagnosis result, and then the diagnosis result is output based on the decision result.
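The multi-step diagnosis described above (call a tool, record the decision and the calling result, ask the model again, stop on "Speak") can be sketched as a simple loop. Everything here is an assumption for illustration: `decide` stands in for "fill the first prompt word template and call the large language model", `scripted_decide` and `stub_tools` are stubs, and the `max_steps` safety limit is not part of the application.

```python
def run_diagnosis(alert_info, decide, tools, max_steps=10):
    """Repeat decision and tool call until the decision is to Speak."""
    chat_history = []      # decision results of earlier diagnosis steps
    tool_observation = []  # calling results of earlier diagnosis steps
    for _ in range(max_steps):
        decision = decide(alert_info, chat_history, tool_observation)
        if decision["action"] == "Speak":
            # The decision result indicates outputting the diagnosis result.
            return decision["action_input"]
        result = tools[decision["action"]](**decision["action_input"])
        chat_history.append(f"{decision['action']}({decision['action_input']})")
        tool_observation.append(result)
    raise RuntimeError("no diagnosis result within max_steps")


def scripted_decide(alert_info, chat_history, tool_observation):
    """Stub for the large language model: a fixed three-step script."""
    step = len(chat_history)
    if step == 0:
        return {"action": "metric_abnormal_detect",
                "action_input": {"metric_name": "cpu_usage"}}
    if step == 1:
        return {"action": "match_diagnose_knowledge",
                "action_input": {"metric_name": "cpu_usage"}}
    return {"action": "Speak",
            "action_input": {"diagnose": "placeholder root cause",
                             "solution": "placeholder solution",
                             "knowledge": tool_observation[-1]}}


# Stub tools that only echo their input.
stub_tools = {
    "metric_abnormal_detect": lambda metric_name: f"{metric_name} is abnormal",
    "match_diagnose_knowledge": lambda metric_name: f"analysis steps for {metric_name}",
}

result = run_diagnosis("CPU usage is high", scripted_decide, stub_tools)
print(result)
```

In the first iteration both history lists are empty, which corresponds to the empty historical information of the first diagnosis step.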

可选的，决策结果指示输出如下格式的信息：Optionally, the decision result indicates output of information in the following format:

“"diagnose": 你发现的根本原因；"diagnose": the root cause you discovered;

"solution": 针对根本原因的优化解决方案；"solution": optimized solution to the root cause;

"knowledge": 你使用的诊断知识。”"knowledge": The diagnostic knowledge you use.

In the fault diagnosis method provided in the embodiment of the present application, fault diagnosis based on the large language model not only refers to relevant knowledge but also calls relevant tools, so that the large language model can perceive the current system environment and perform step-by-step fault diagnosis based on system environment information and relevant knowledge. This overcomes the hallucination problem of the large language model in the fault diagnosis process and thereby improves the accuracy of fault diagnosis.

在一可选的实施例中，上述确定与故障描述信息相关的运维工具的一种实现方式可以为：In an optional embodiment, an implementation manner of determining the operation and maintenance tool related to the fault description information may be:

A vector representation of the fault description information is obtained. The fault description information is text information. The vector representation of each character in the fault description information may be obtained first, and the per-character vector representations may then be fused to obtain the vector representation of the fault description information. Alternatively, a BERT model may be used to process the fault description information to obtain its vector representation.

基于故障描述信息的向量表示与数据库中预先存储的每个运维工具的描述信息的向量表示，计算数据库中的每个运维工具的描述信息与故障描述信息的相似度。Based on the vector representation of the fault description information and the vector representation of the description information of each operation and maintenance tool pre-stored in the database, the similarity between the description information of each operation and maintenance tool in the database and the fault description information is calculated.

运维工具的描述信息（即定义）可以为如下格式的文本信息：The description information (i.e. definition) of the operation and maintenance tool can be text information in the following format:

“{

"name": "Tool name",

"description": "Detailed description of the tool",

"parameters": "Parameters used when calling the tool"

}”

例如，单指标异常检测工具的定义为：For example, the definition of a single-metric anomaly detection tool is:

“{

"name": "metric_abnormal_detect",

"description": "Abnormal metric inspection.",

"parameters": {"type": "object", "properties": {}}

}”

For the metric anomaly detection tool, the present application vectorizes the text of the tool definition in advance and stores the resulting vector in a vector database, so that the vector representation can be read directly during later operation and maintenance tool recall, which improves processing speed.

可选的，可以采用如下公式（1）计算故障描述信息和数据库中第j个运维工具的相似度：Optionally, the following formula (1) can be used to calculate the similarity between the fault description information and the jth operation and maintenance tool in the database:

（1） (1)

Where s represents the fault description information; t_j represents the description information of the j-th operation and maintenance tool in the database; sim(s, t_j) represents the similarity between the fault description information and the j-th operation and maintenance tool in the database; emb() represents vectorization.

确定与故障描述信息相关的运维工具列表；其中，运维工具列表中的运维工具的描述信息与故障描述信息的相似度，大于不在运维工具列表中的运维工具的描述信息与故障描述信息的相似度。Determine a list of operation and maintenance tools related to the fault description information; wherein the similarity between the description information of the operation and maintenance tools in the operation and maintenance tool list and the fault description information is greater than the similarity between the description information of the operation and maintenance tools not in the operation and maintenance tool list and the fault description information.

可以将数据库中与故障描述信息的相似度大于阈值的运维工具确定为与故障描述信息相关的运维工具。An operation and maintenance tool in the database whose similarity with the fault description information is greater than a threshold may be determined as an operation and maintenance tool related to the fault description information.

或者，可以将数据库中的各个运维工具按照与故障描述信息的相似度由大到小的顺序排序，将排序前N的运维工具确定为与故障描述信息相关的运维工具。Alternatively, the operation and maintenance tools in the database may be sorted in descending order according to the similarity with the fault description information, and the top N operation and maintenance tools may be determined as the operation and maintenance tools related to the fault description information.
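The two selection rules above (similarity threshold or top N) can be sketched as follows. Formula (1) is not legible in this text, so the sketch assumes cosine similarity between emb(s) and emb(t_j), which is a common choice but not confirmed by the source. The `emb` function here is a toy bag-of-words stand-in for a real embedding model such as BERT, and the tool `disk_usage_report` and all tool descriptions are hypothetical.

```python
import math
from collections import Counter


def emb(text):
    """Toy stand-in for emb(): bag-of-words counts over lowercase tokens."""
    return Counter(text.lower().split())


def sim(u, v):
    """Cosine similarity between two sparse vectors (assumed form of formula (1))."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recall_tools(fault_description, tool_vectors, threshold=None, top_n=None):
    """Return tool names ordered by similarity, filtered by threshold and/or top N."""
    query = emb(fault_description)
    scored = sorted(
        ((sim(query, vec), name) for name, vec in tool_vectors.items()),
        reverse=True,
    )
    if threshold is not None:
        scored = [(score, name) for score, name in scored if score > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [name for _, name in scored]


# Tool descriptions are vectorized once, ahead of time, and only read at recall time.
tool_vectors = {
    "optimize_index_selection": emb("recommend missing index for slow query"),
    "metric_abnormal_detect": emb("abnormal metric inspection"),
    "disk_usage_report": emb("report disk usage"),
}

print(recall_tools("slow query caused by missing index", tool_vectors, top_n=1))
```

A production system would replace the dictionary with a vector database lookup; the ranking logic stays the same.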

在一可选的实施例中，上述调用知识召回工具得到调用结果的一种实现方式可以为：In an optional embodiment, one implementation method of invoking the knowledge recall tool to obtain the invocation result may be:

将上述异常指标输入知识召回工具，以便知识召回工具将异常指标与知识库中各个知识项进行匹配，将与异常指标匹配的知识项中的故障分析步骤作为调用结果进行输出。The above abnormal indicators are input into the knowledge recall tool so that the knowledge recall tool matches the abnormal indicators with each knowledge item in the knowledge base and outputs the fault analysis steps in the knowledge item matching the abnormal indicators as the call result.

As an example, the BM25 algorithm can be used to calculate the correlation score between the abnormal metrics and each knowledge item in the knowledge base. Specifically, the following formulas (2) and (3) can be used to calculate the correlation score between the abnormal metrics and the fault-related metrics in each knowledge item in the knowledge base:

$$\mathrm{Score}(Q, D) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k + 1)}{f(q_i, D) + k \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)} \qquad (2)$$

$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \qquad (3)$$

Where:

Score(Q, D) is the correlation score between the abnormal metric set Q and the knowledge item D.

q_i is the i-th metric in the set Q.

f(q_i, D) is the frequency with which metric q_i occurs in knowledge item D.

|D| is the length of knowledge item D.

avgdl is the average length of all knowledge items in the knowledge base.

k and b are adjustable parameters. Typically, k is between 1.2 and 2, and b is usually set to 0.75.

IDF(q_i) is the inverse document frequency of metric q_i.

N is the total number of knowledge items in the knowledge base.

n(q_i) is the number of knowledge items in the knowledge base that contain metric q_i.
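The BM25 matching can be illustrated with a short sketch. It assumes the standard BM25 form, with IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5)), and treats each knowledge item as the list of its fault-related metrics, so that the item length is the number of metrics. The knowledge item names and metric names are hypothetical, and k = 1.5 is simply one value from the stated range. Note that this IDF variant becomes zero or negative for metrics that occur in half or more of the knowledge items.

```python
import math


def bm25_score(query_metrics, item_metrics, corpus, k=1.5, b=0.75):
    """Score one knowledge item (a list of metric names) against the abnormal metrics."""
    n_items = len(corpus)                                  # N
    avg_len = sum(len(doc) for doc in corpus) / n_items    # average item length
    score = 0.0
    for q in query_metrics:
        freq = item_metrics.count(q)                       # f(q, D)
        if freq == 0:
            continue
        n_q = sum(1 for doc in corpus if q in doc)         # n(q)
        idf = math.log((n_items - n_q + 0.5) / (n_q + 0.5))
        denom = freq + k * (1 - b + b * len(item_metrics) / avg_len)
        score += idf * freq * (k + 1) / denom
    return score


# Hypothetical knowledge base: item name -> fault-related metrics.
knowledge_base = {
    "high_cpu_usage": ["cpu_usage", "load_average"],
    "high_io_usage": ["io_wait", "disk_read"],
    "slow_query": ["query_latency", "lock_wait"],
}
corpus = list(knowledge_base.values())

abnormal_metrics = ["cpu_usage"]
scores = {
    name: bm25_score(abnormal_metrics, metrics, corpus)
    for name, metrics in knowledge_base.items()
}
best = max(scores, key=scores.get)
print(best, scores[best])
```

The fault analysis steps of the best-scoring item would then be returned as the calling result of the knowledge recall tool.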

在一可选的实施例中，知识库中的每个知识项包括如下几个字段的内容：故障名称、故障描述、故障相关的指标、故障分析步骤。In an optional embodiment, each knowledge item in the knowledge base includes the contents of the following fields: fault name, fault description, fault-related indicators, and fault analysis steps.

如表2所示，为本申请实施例提供的一个知识项各个字段的说明信息：As shown in Table 2, the description information of each field of a knowledge item provided in the embodiment of the present application is as follows:

表2Table 2

可选的，上述结构的知识项可以通过如下方式提取得到：Optionally, the knowledge items of the above structure can be extracted by:

将知识文档按照章节结构构建文档片段树；文档片段树的根节点对应知识文档；每个非根节点对应知识文档中的一个章节；根节点包括知识文档的标题，如果非根节点是叶子节点，则非根节点包括对应章节的所有内容，如果非根据节点不是叶子节点，则非根节点包括对应章节的章节标题和对应章节的概述。The knowledge document is constructed into a document fragment tree according to the chapter structure; the root node of the document fragment tree corresponds to the knowledge document; each non-root node corresponds to a chapter in the knowledge document; the root node includes the title of the knowledge document, if the non-root node is a leaf node, the non-root node includes all the contents of the corresponding chapter, if the non-root node is not a leaf node, the non-root node includes the chapter title of the corresponding chapter and an overview of the corresponding chapter.

本申请中，知识文档可以包括但不限于如下几种：历史的排障文档、运维文档、说明手册等。In this application, knowledge documents may include but are not limited to the following: historical troubleshooting documents, operation and maintenance documents, instruction manuals, etc.

本申请发明人研究发现，为了适应大语言模型对输入数据量的要求，需要将知识文档分成多个片段，而现有技术是按照字数或者段落将知识文档分成多个片段，这种分段方法存在召回的知识不准确，故障诊断的准确性低的问题。The inventors of the present application have discovered that in order to adapt to the requirements of large language models for the amount of input data, knowledge documents need to be divided into multiple segments. However, the prior art divides knowledge documents into multiple segments according to the number of words or paragraphs. This segmentation method has the problem of inaccurate knowledge recall and low accuracy of fault diagnosis.

为了使得大语言模型能够更加准确的应用知识，提高故障诊断的准确性，本申请在对知识文档进行分段时，按照知识文档的章节结构对知识文档进行划分，如果一个分片超过了大语言模型可处理的最大数据块大小（例如，8k个字），则递归地将该分片进一步细分。然后，根据章节之间的关系构建一个树状结构，作为示例，树状结构中，根节点是知识文档标题，其余节点表示分割后的文档区块。对于第i个节点（不是根节点），它表示第i个章节，如果存在子节点，它的子节点表示第i个章节的子章节，在第i个节点是叶子节点的情况下，第i个节点是第i个章节的内容，在第i个节点不是叶子节点的情况下，第i个节点是第i个章节的标题，或者，标题和概述。In order to enable the large language model to apply knowledge more accurately and improve the accuracy of fault diagnosis, the present application divides the knowledge document according to the chapter structure of the knowledge document when segmenting the knowledge document. If a segment exceeds the maximum data block size that the large language model can handle (for example, 8k words), the segment is further subdivided recursively. Then, a tree structure is constructed based on the relationship between the chapters. As an example, in the tree structure, the root node is the knowledge document title, and the remaining nodes represent the segmented document blocks. For the i-th node (not the root node), it represents the i-th chapter. If there are child nodes, its child nodes represent the child chapters of the i-th chapter. When the i-th node is a leaf node, the i-th node is the content of the i-th chapter. When the i-th node is not a leaf node, the i-th node is the title of the i-th chapter, or the title and overview.

如图2所示，为本申请实施例提供的文档片段树的一种结构示意图。该示例中，每个节点仅包含知识文档中的内容，比如，根节点（标号为1的节点）为文档标题；根节点的子节点（标号为1.1、1.2和1.3的节点，分别表示知识文档的第一章、第二章和第三章）是文档中最高级别的标题，或者，可以是标题和章节的概述；标号为1.1.1的叶子节点是第一章的一个子章节的内容，依此类推。As shown in Figure 2, it is a schematic diagram of the structure of the document fragment tree provided by the embodiment of the present application. In this example, each node contains only the content in the knowledge document, for example, the root node (the node labeled 1) is the document title; the child nodes of the root node (the nodes labeled 1.1, 1.2 and 1.3, representing the first chapter, the second chapter and the third chapter of the knowledge document respectively) are the highest-level titles in the document, or they can be overviews of titles and chapters; the leaf node labeled 1.1.1 is the content of a sub-chapter of the first chapter, and so on.

如图3所示，为本申请实施例提供的文档片段树的另一种结构示意图。该示例中，每个节点除了包含知识文档中的内容外，还包括内容的摘要。比如，根节点为文档标题及文档标题的摘要；根节点的子节点是文档中最高级别的标题及其摘要，或者，可以是标题和章节的概述，以及标题和章节的概述的摘要；叶子节点为对应章节的所有内容及内容摘要。As shown in Figure 3, another structural diagram of the document fragment tree provided in the embodiment of the present application is shown. In this example, each node contains not only the content in the knowledge document, but also a summary of the content. For example, the root node is the document title and the summary of the document title; the child nodes of the root node are the highest-level title in the document and its summary, or it can be an overview of the title and chapter, and a summary of the overview of the title and chapter; the leaf nodes are all the content and content summaries of the corresponding chapters.
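A document fragment tree of the kind shown in Figure 2 can be built recursively from the chapter structure. The sketch below assumes the knowledge document has already been parsed into nested dictionaries with the keys `title`, `overview`, `body` and `children`; this input layout, the `Node` class and the paragraph-based fallback for oversized leaves are assumptions, since the application does not state how an oversized fragment is subdivided. The summaries of Figure 3 are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str   # position in the tree, e.g. "1", "1.1", "1.1.1"
    text: str    # title (plus overview) for inner nodes, full content for leaves
    children: List["Node"] = field(default_factory=list)


def build_node(section, label, max_chars=8000):
    """Build the subtree for one chapter of the parsed knowledge document."""
    title = section["title"]
    subsections = section.get("children", [])
    if subsections:
        overview = section.get("overview", "")
        text = f"{title}\n{overview}" if overview else title
        node = Node(label, text)
        node.children = [
            build_node(sub, f"{label}.{i}", max_chars)
            for i, sub in enumerate(subsections, start=1)
        ]
        return node
    body = section.get("body", "")
    paragraphs = [p for p in body.split("\n\n") if p.strip()]
    if len(body) > max_chars and len(paragraphs) > 1:
        # Oversized leaf: assumed fallback, split into one child per paragraph.
        parts = [
            {"title": f"{title} (part {i})", "body": p}
            for i, p in enumerate(paragraphs, start=1)
        ]
        return build_node({"title": title, "children": parts}, label, max_chars)
    return Node(label, f"{title}\n{body}")


# Hypothetical parsed knowledge document.
document = {
    "title": "Database troubleshooting guide",
    "children": [
        {
            "title": "Chapter 1 CPU",
            "overview": "CPU-related faults.",
            "children": [{"title": "1.1 High CPU usage", "body": "Check slow queries."}],
        },
        {"title": "Chapter 2 IO", "body": "Check IO statistics."},
    ],
}

tree = build_node(document, "1")
print(tree.children[0].children[0].label)
```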

调用大语言模型对文档片段树中的节点进行遍历，每遍历到一个节点，对遍历到的节点进行知识项提取；每个知识项从一个节点中提取得到。The large language model is called to traverse the nodes in the document fragment tree. Each time a node is traversed, knowledge items are extracted from the traversed node; each knowledge item is extracted from a node.

可选的，可以将文档片段树填加到第二提示词模板，得到第二提示词；第二提示词指示大语言模型逐个遍历文档片段树中的节点，对于遍历到的每个节点，根据该节点的内容撰写至少一条符合预设格式的详细诊断信息，从不同节点中提取的详细诊断信息不同。Optionally, the document fragment tree can be added to the second prompt word template to obtain a second prompt word; the second prompt word instructs the large language model to traverse the nodes in the document fragment tree one by one, and for each traversed node, write at least one detailed diagnostic information that conforms to a preset format based on the content of the node, and the detailed diagnostic information extracted from different nodes is different.

本申请实施例提供的第二提示词的一种示例如下所示：An example of the second prompt word provided in the embodiment of the present application is as follows:

““

Your task is to write one piece of detailed diagnostic knowledge based on the content of a given document paragraph in the document fragment tree (e.g., about high IO usage, about slow queries).

文档片段树：xxxDocument fragment tree: xxx

尝试按照当前阅读的章节顺序逐个抽取知识块，每个知识块必须严格遵循JSON格式。Try to extract knowledge blocks one by one in the order of the chapters currently being read. Each knowledge block must strictly follow the JSON format.

以下是知识片段的JSON格式:The following is the JSON format of the knowledge fragment:

{

"name": "high_io_usage",

"desc": "High IO usage may cause performance degradation. It is important to identify the tables or queries that cause high IO usage and optimize them.",

"steps": "Step 1: Check IO statistics of user tables by querying 'pg_stat_user_tables' and 'pg_stat_user_indexes'.\nStep 2: Check IO statistics of user tables and indexes by querying 'pg_statio_user_tables' and 'pg_statio_user_indexes'.\nStep 3: Identify tables or indexes with high IO and analyze the queries that cause high IO.\nStep 4: Optimize queries or consider adding indexes to improve IO performance.",

"metrics": ["pg_stat_user_tables", "pg_stat_user_indexes", "pg_statio_user_tables", "pg_statio_user_indexes"]

}

注意：知识片段数据中"metrics"应该是具体的指标名称例如:"用户表状态", "用户索引状态", "用户表I/O状态", "用户索引I/O状态" 等，而不是"metric1","metric2"；"desc"属性应该是详细的描述；"steps"应该描述诊断当前问题的详细步骤。Note: "metrics" in the knowledge fragment data should be specific indicator names such as "user table status", "user index status", "user table I/O status", "user index I/O status", etc., instead of "metric1" and "metric2"; the "desc" attribute should be a detailed description; "steps" should describe the detailed steps to diagnose the current problem.

不要重复提取如下知识片段:Do not repeatedly extract the following knowledge fragments:

{已经存在的运维知识项}{Existing operation and maintenance knowledge items}

不要重复查找如下子章节：Do not search for the following subsections repeatedly:

{已经使用过的章节}{Already used chapters}

{Summary of Chapter i}

””

In the example of the second prompt word above, "Existing operation and maintenance knowledge items" refers to the knowledge items that have already been generated, which can be represented by their fault names. In other words, if a knowledge item with the same fault name has already been extracted, it is not extracted again. "Already used chapters" refers to the text fragments, i.e., nodes, that have already been traversed, so each node is traversed only once. "Summary of Chapter i" refers to the summary of the node currently being traversed, so the summary of the current node is also traversed only once.
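The bookkeeping described above (skip fault names already extracted, visit each node once) can be sketched as follows. This is a simplification under stated assumptions: the sketch drives the traversal in code and issues one model call per node, whereas the second prompt word in the example hands the whole tree to the model. The function `extract_knowledge_items`, the shortened prompt text and `stub_llm` are hypothetical.

```python
import json


def extract_knowledge_items(nodes, call_llm):
    """nodes: (label, text) pairs in traversal order; call_llm returns a JSON list."""
    existing_names = []  # fault names of knowledge items already extracted
    used_chapters = []   # labels of nodes already traversed
    items = []
    for label, text in nodes:
        prompt = (
            "Write detailed diagnostic knowledge for the following document fragment "
            "as a JSON list of objects with the keys name, desc, steps, metrics.\n"
            f"Fragment {label}: {text}\n"
            f"Do not repeatedly extract the following knowledge fragments: {existing_names}\n"
            f"Do not search for the following subsections repeatedly: {used_chapters}\n"
        )
        for item in json.loads(call_llm(prompt)):
            # Keep only the first knowledge item for each fault name.
            if item["name"] not in existing_names:
                existing_names.append(item["name"])
                items.append(item)
        used_chapters.append(label)
    return items


def stub_llm(prompt):
    """Stub model that reports the same fault for every fragment."""
    return json.dumps([{
        "name": "high_io_usage",
        "desc": "placeholder description",
        "steps": "placeholder steps",
        "metrics": ["pg_statio_user_tables"],
    }])


items = extract_knowledge_items(
    [("1.1", "IO statistics chapter"), ("1.2", "More on IO statistics")],
    stub_llm,
)
print([item["name"] for item in items])
```

Because the stub returns the same fault name for both fragments, only one knowledge item is kept.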

FIG. 4 is a schematic diagram illustrating the principle of fault diagnosis based on a large language model provided in an embodiment of the present application. In this diagram:

1.用户输入的故障描述信息给到提示词生成器，以生成提示词。1. The fault description information entered by the user is given to the prompt word generator to generate a prompt word.

2.提示词输入到大语言模型，得到决策结果。2. The prompt word is input into the large language model to obtain the decision result.

3. If the decision result indicates outputting a result, the intelligent executor returns the diagnosis result to the user.

4.如果决策结果指示调用工具，则智能执行体根据决策结果调用工具或从知识库中召回知识，将执行结果（工具调用结果或召回的知识）返回给提示词生成器，以生成新的提示词。4. If the decision result indicates to call a tool, the intelligent executor calls the tool or recalls knowledge from the knowledge base according to the decision result, and returns the execution result (tool call result or recalled knowledge) to the prompt word generator to generate a new prompt word.

5.新的提示词输入到大语言模型，得到新的决策结果，返回执行步骤3-4。5. The new prompt word is input into the large language model to obtain a new decision result, and then return to execute steps 3-4.

与方法实施例相对应，本申请还提供一种故障诊断装置，本申请实施例提供的故障诊断装置的一种结构示意图如图5所示，可以包括：Corresponding to the method embodiment, the present application further provides a fault diagnosis device. A structural schematic diagram of the fault diagnosis device provided in the embodiment of the present application is shown in FIG5 , and may include:

接收模块501，工具确定模块502，诊断模块503和输出模块504；Receiving module 501, tool determination module 502, diagnosis module 503 and output module 504;

其中，接收模块501用于接收输入的故障描述信息；The receiving module 501 is used to receive input fault description information;

工具确定模块502用于确定与所述故障描述信息相关的运维工具列表；The tool determination module 502 is used to determine a list of operation and maintenance tools related to the fault description information;

诊断模块503用于利用大语言模型基于所述故障描述信息和所述运维工具列表进行多步诊断，其中，每一诊断步骤包括：将所述故障描述信息、所述运维工具列表和历史信息填加到第一提示词模板，得到第一提示词；所述第一提示词指示大语言模型基于所述故障描述信息和所述历史信息进行故障诊断时的规则和格式；其中，所述规则至少包括：在诊断过程中调用异常指标检测工具来确定异常指标，基于所述异常指标以及知识召回工具获得知识来诊断异常原因，可以使用所述运维工具列表中的运维工具辅助诊断；将所述第一提示词输入所述大语言模型，得到所述大语言模型生成的决策结果；如果所述决策结果指示调用运维工具或知识召回工具，在调用所述运维工具或知识召回工具，得到调用结果后，进入下一诊断步骤；其中，在第一个诊断步骤中，所述历史信息为空，在非第一个诊断步骤中，所述历史信息包括历史诊断步骤得到的决策结果和调用结果；The diagnosis module 503 is used to use a large language model to perform multi-step diagnosis based on the fault description information and the operation and maintenance tool list, wherein each diagnosis step includes: adding the fault description information, the operation and maintenance tool list and the historical information to a first prompt word template to obtain a first prompt word; the first prompt word indicates the rules and format of the large language model when performing fault diagnosis based on the fault description information and the historical information; wherein the rules at least include: calling an abnormal indicator detection tool to determine the abnormal indicator during the diagnosis process, obtaining knowledge based on the abnormal indicator and the knowledge recall tool to diagnose the cause of the abnormality, and using the operation and maintenance tools in the operation and maintenance tool list to assist in diagnosis; inputting the first prompt word into the large language model to obtain a decision result generated by the large language model; if the decision result indicates calling an operation and maintenance tool or a knowledge recall tool, after calling the operation and maintenance tool or the knowledge recall tool and obtaining the calling result, entering the next diagnosis step; wherein, in the first diagnosis step, the historical information is empty, and in a non-first diagnosis step, the historical information includes the decision result and the calling result obtained by the historical diagnosis step;

输出模块504用于如果所述决策结果指示输出诊断结果，基于所述决策结果输出诊断结果。The output module 504 is configured to output a diagnosis result based on the decision result if the decision result indicates to output a diagnosis result.

In the fault diagnosis device provided in the embodiment of the present application, fault diagnosis based on the large language model not only refers to relevant knowledge but also calls relevant tools, so that the large language model can perceive the current system environment and perform step-by-step fault diagnosis based on system environment information and relevant knowledge. This overcomes the hallucination problem of the large language model in the fault diagnosis process and thereby improves the accuracy of fault diagnosis.

在一可选的实施例中，所述工具确定模块502确定与所述故障描述信息相关的运维工具时，用于：In an optional embodiment, when the tool determination module 502 determines the operation and maintenance tool related to the fault description information, it is used to:

在一可选的实施例中，所述诊断模块503调用所述知识召回工具得到调用结果的过程包括：In an optional embodiment, the process of the diagnosis module 503 calling the knowledge recall tool to obtain the calling result includes:

在一可选的实施例中，所述知识库中的每个知识项包括如下几个字段的内容：故障名称、故障描述、故障相关的指标、故障分析步骤；In an optional embodiment, each knowledge item in the knowledge base includes the contents of the following fields: fault name, fault description, fault-related indicators, and fault analysis steps;

所述装置还包括：知识提取模块，用于通过如下方式提取知识项：The device further comprises: a knowledge extraction module, configured to extract knowledge items in the following manner:

在一可选的实施例中，所述根节点还包括所述知识文档的标题的摘要；In an optional embodiment, the root node also includes a summary of the title of the knowledge document;

在一可选的实施例中，所述知识提取模块调用所述大语言模型对所述文档片段树中的节点进行遍历，每遍历到一个节点，对遍历到的节点进行知识项提取的过程，包括：In an optional embodiment, the knowledge extraction module calls the large language model to traverse the nodes in the document fragment tree, and extracts knowledge items from each traversed node, including:

本申请实施例中还提供一种电子设备。参考图6所示，其示出了适于用来实现本申请实施例中的电子设备的一种结构示意图。本申请实施例中的电子设备可以为终端设备，比如手机、平板电脑、笔记本电脑、台式计算机等；当然，终端设备可以独立完成上述故障诊断方法，也可以与服务端设备进行交互配合完成上述故障诊断方法。图6示出的电子设备仅仅是一个示例，不应对本申请实施例的功能和使用范围带来任何限制。An electronic device is also provided in an embodiment of the present application. Referring to FIG6, a structural diagram of an electronic device suitable for implementing an embodiment of the present application is shown. The electronic device in the embodiment of the present application may be a terminal device, such as a mobile phone, a tablet computer, a laptop computer, a desktop computer, etc.; of course, the terminal device may independently complete the above-mentioned fault diagnosis method, or may interact with the server device to complete the above-mentioned fault diagnosis method. The electronic device shown in FIG6 is only an example and should not bring any limitation to the functions and scope of use of the embodiment of the present application.

如图6所示，该电子设备可以包括处理装置（例如中央处理器、图形处理器等）601，其可以根据存储在只读存储器（ROM）602中的程序或者从存储装置608加载到随机存取存储器（RAM）603中的程序而执行各种适当的动作和处理。在电子设备通电的状态下，RAM603中还存储有电子设备操作所需的各种程序和数据。处理装置601、ROM602以及RAM603通过总线604彼此相连。输入/输出（I/O）接口605也连接至总线604。As shown in FIG6 , the electronic device may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. When the electronic device is powered on, various programs and data required for the operation of the electronic device are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Typically, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 608 including, for example, a memory card, a hard disk, etc.; and communication devices 609. The communication device 609 may allow the electronic device to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 6 shows an electronic device with various devices, it should be understood that it is not required to implement or provide all of the devices shown. Alternatively, more or fewer devices may be implemented or provided.

本申请实施例中还提供一种包括计算机程序产品，包括计算机可读指令，当计算机可读指令在电子设备上运行时，使得电子设备实现本申请实施例提供的任一种故障诊断方法。An embodiment of the present application also provides a computer program product including computer-readable instructions. When the computer-readable instructions are executed on an electronic device, the electronic device implements any fault diagnosis method provided in the embodiment of the present application.

本申请实施例中还提供一种计算机可读存储介质，该存储介质承载有一个或多个计算机程序，当一个或多个计算机程序被电子设备执行时，能够使电子设备实现本申请实施例提供的任一种故障诊断方法。A computer-readable storage medium is also provided in an embodiment of the present application. The storage medium carries one or more computer programs. When the one or more computer programs are executed by an electronic device, the electronic device can implement any fault diagnosis method provided in the embodiment of the present application.

It should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, a connection between modules indicates that they are communicatively connected, which may be specifically implemented as one or more communication buses or signal lines.

From the description of the above implementations, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and of course it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can readily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, for example analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is the preferred implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, and includes a number of instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to execute the methods described in the embodiments of the present application.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. Those skilled in the art may use different methods to implement the described functions for each specific solution, but such implementations should not be considered to be beyond the scope of the present application.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that the computer can use for storage, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A fault diagnosis method, the method comprising:

receiving input fault description information;

determining an operation and maintenance tool list related to the fault description information;

performing multi-step diagnosis based on the fault description information and the operation and maintenance tool list by using a large language model, wherein each diagnosis step comprises:

filling the fault description information, the operation and maintenance tool list and history information into a first prompt word template to obtain a first prompt word, wherein the first prompt word indicates rules and formats to be followed by the large language model when performing fault diagnosis based on the fault description information and the history information, and the rules at least comprise: invoking an abnormal index detection tool to determine an abnormal index in a diagnosis process, diagnosing an abnormal cause based on the abnormal index and knowledge acquired by a knowledge recall tool, and using an operation and maintenance tool in the operation and maintenance tool list to assist diagnosis;

inputting the first prompt word into the large language model to obtain a decision result generated by the large language model;

if the decision result indicates calling the abnormal index detection tool, the operation and maintenance tool, or the knowledge recall tool, calling the indicated tool to obtain a calling result and then entering the next diagnosis step; wherein in the first diagnosis step the history information is empty, and in each non-first diagnosis step the history information comprises the decision results and calling results obtained in previous diagnosis steps;

and outputting a diagnosis result based on the decision result if the decision result indicates to output the diagnosis result.
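By way of non-limiting illustration, the multi-step diagnosis recited above might be sketched in Python as follows. The template wording, the JSON decision format, the `max_steps` cap, and the names `diagnose`, `llm` and `tools` are assumptions made for this sketch and are not taken from the application; calling results are also assumed to be JSON-serializable.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical first prompt word template: it states the rules and the reply format.
FIRST_PROMPT_TEMPLATE = (
    "Rules: call the abnormal index detection tool to find abnormal indexes; "
    "diagnose the cause from those indexes and from recalled knowledge; "
    "use the listed operation and maintenance tools to assist.\n"
    "Format: reply with a JSON object whose 'action' is 'call_tool' or 'output'.\n"
    "Fault description: {fault}\n"
    "Tools: {tools}\n"
    "History: {history}\n"
)


def diagnose(
    fault: str,
    tool_list: List[str],
    tools: Dict[str, Callable[..., Any]],
    llm: Callable[[str], str],
    max_steps: int = 10,  # assumed safety cap, not part of the claims
) -> str:
    history: List[dict] = []  # empty in the first diagnosis step
    for _ in range(max_steps):
        # Fill the template to obtain the first prompt word for this step.
        prompt = FIRST_PROMPT_TEMPLATE.format(
            fault=fault, tools=", ".join(tool_list), history=json.dumps(history)
        )
        decision = json.loads(llm(prompt))
        if decision["action"] == "output":
            return decision["diagnosis"]
        # Otherwise the decision names a tool: call it and record the result.
        result = tools[decision["tool"]](**decision.get("args", {}))
        history.append({"decision": decision, "result": result})
    raise RuntimeError("no diagnosis produced within max_steps")
```

Here `llm` stands in for any callable that maps a prompt string to the model's text reply, and `tools` maps tool names (detection, recall, and operation and maintenance tools alike) to Python callables.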

2. The method of claim 1, wherein the determining an operation and maintenance tool list related to the fault description information comprises:

obtaining a vector representation of the fault description information;

calculating the similarity between the description information of each operation and maintenance tool in a database and the fault description information, based on the vector representation of the fault description information and a vector representation of the description information of each operation and maintenance tool stored in the database in advance;

determining the operation and maintenance tool list related to the fault description information, wherein the similarity between the fault description information and the description information of any operation and maintenance tool in the operation and maintenance tool list is greater than the similarity between the fault description information and the description information of any operation and maintenance tool not in the operation and maintenance tool list.
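As a non-limiting sketch of the similarity-based selection described in this claim, the following Python fragment ranks tools by similarity and keeps the top `k`. Cosine similarity and the cutoff `k` are assumptions of this sketch; the claim only requires that listed tools be more similar than unlisted ones. The vectors are assumed to be produced elsewhere by some embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tools(
    fault_vec: Sequence[float],
    tool_vecs: Dict[str, Sequence[float]],  # tool name -> precomputed description vector
    k: int = 3,  # assumed list size
) -> List[str]:
    # Sort tool names by similarity to the fault description, highest first.
    ranked = sorted(
        tool_vecs, key=lambda name: cosine(fault_vec, tool_vecs[name]), reverse=True
    )
    return ranked[:k]
```

With exact ties at the cutoff, a top-`k` selection only guarantees "greater than or equal"; a threshold on the similarity value would be one alternative.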

3. The method of claim 1, wherein invoking the knowledge recall tool to obtain a call result comprises:

inputting the abnormal index into the knowledge recall tool, so that the knowledge recall tool matches the abnormal index against each knowledge item in a knowledge base and outputs, as the calling result, the fault analysis steps in the knowledge item matched with the abnormal index.
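For illustration only, a knowledge recall tool of this kind could look like the Python sketch below. The matching rule (an item matches if it lists any of the abnormal indexes) and the names `KnowledgeItem` and `recall` are assumptions; the item fields follow those listed in claim 4.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class KnowledgeItem:
    fault_name: str
    fault_description: str
    related_indexes: List[str]  # the "fault related index" field
    fault_analysis: str         # the fault analysis steps


def recall(abnormal_indexes: Iterable[str], knowledge_base: List[KnowledgeItem]) -> List[str]:
    """Return the fault analysis of every item sharing an index with the input."""
    wanted = set(abnormal_indexes)
    return [
        item.fault_analysis
        for item in knowledge_base
        if wanted & set(item.related_indexes)
    ]
```

A real deployment might instead match by embedding similarity or keyword search; the claim does not fix the matching technique.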

4. The method of claim 3, wherein each knowledge item in the knowledge base comprises the contents of the following fields: fault name, fault description, fault related index and fault analysis;

the knowledge item is extracted by the following steps:

constructing a document fragment tree of a knowledge document according to its chapter structure, wherein the root node of the document fragment tree corresponds to the knowledge document, and each non-root node corresponds to a chapter in the knowledge document; the root node comprises the title of the knowledge document; if a non-root node is a leaf node, the non-root node comprises all contents of the corresponding chapter; and if a non-root node is not a leaf node, the non-root node comprises the chapter title of the corresponding chapter and a summary of the corresponding chapter;

invoking the large language model to traverse the nodes in the document fragment tree one node at a time and to extract knowledge items from each traversed node, wherein each knowledge item is extracted from one node.
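The document fragment tree and its traversal can be sketched in Python as follows, by way of example only. The `Node` class, the pre-order traversal, and the `extract` callable (a stand-in for the large language model call) are assumptions of this sketch; chapter summaries are assumed to have been produced beforehand.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class Node:
    title: str
    content: str = ""   # full chapter text, used by leaf nodes
    summary: str = ""   # chapter summary, used by non-leaf nodes
    children: List["Node"] = field(default_factory=list)

    def payload(self, is_root: bool = False) -> str:
        """Text carried by the node, following the three cases in the claim."""
        if is_root:
            return self.title                  # root: document title
        if not self.children:
            return self.content                # leaf: all chapter contents
        return f"{self.title}\n{self.summary}"  # inner: chapter title + summary


def walk(node: Node, is_root: bool = True) -> Iterator[Tuple[Node, bool]]:
    """Pre-order traversal, one node at a time (order is an assumption)."""
    yield node, is_root
    for child in node.children:
        yield from walk(child, False)


def extract_items(root: Node, extract: Callable[[str], List[str]]) -> List[str]:
    """Apply the extractor to each node; every item comes from exactly one node."""
    items: List[str] = []
    for node, is_root in walk(root):
        items.extend(extract(node.payload(is_root)))
    return items
```

Keeping only titles and summaries on inner nodes keeps each extraction input short while still giving the model the surrounding chapter context.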

5. The method of claim 4, wherein the root node further comprises a summary of a title of the knowledge document;

if the non-root node is a leaf node, the non-root node further comprises a summary of all content of the corresponding chapter;

if the non-root node is not a leaf node, the non-root node further comprises a chapter title of the corresponding chapter and a summary of the corresponding chapter.

6. The method of claim 4, wherein the invoking the large language model to traverse the nodes in the document fragment tree one node at a time and to extract knowledge items from each traversed node comprises:

filling the document fragment tree into a second prompt word template to obtain a second prompt word, wherein the second prompt word instructs the large language model to traverse the nodes in the document fragment tree one by one and, for each traversed node, to write at least one piece of detailed diagnosis information conforming to a preset format according to the content of the node, the detailed diagnosis information extracted from different nodes being different;

each piece of detailed diagnosis information is a knowledge item.

7. A fault diagnosis apparatus, comprising:

a receiving module for receiving input fault description information;

a tool determining module for determining an operation and maintenance tool list related to the fault description information;

a diagnosis module for performing multi-step diagnosis based on the fault description information and the operation and maintenance tool list by using a large language model, wherein each diagnosis step comprises: filling the fault description information, the operation and maintenance tool list and history information into a first prompt word template to obtain a first prompt word, wherein the first prompt word indicates rules and formats to be followed by the large language model when performing fault diagnosis based on the fault description information and the history information, and the rules at least comprise: invoking an abnormal index detection tool to determine an abnormal index in a diagnosis process, diagnosing an abnormal cause based on the abnormal index and knowledge acquired by a knowledge recall tool, and using an operation and maintenance tool in the operation and maintenance tool list to assist diagnosis; inputting the first prompt word into the large language model to obtain a decision result generated by the large language model; and if the decision result indicates calling the abnormal index detection tool, the operation and maintenance tool, or the knowledge recall tool, calling the indicated tool to obtain a calling result and then entering the next diagnosis step; wherein in the first diagnosis step the history information is empty, and in each non-first diagnosis step the history information comprises the decision results and calling results obtained in previous diagnosis steps;

and an output module for outputting a diagnosis result based on the decision result if the decision result indicates to output the diagnosis result.

8. A computer program product comprising computer readable instructions which, when run on an electronic device, cause the electronic device to implement the fault diagnosis method of any one of claims 1 to 6.

9. An electronic device comprising at least one processor and a memory coupled to the processor, wherein:

the memory is used for storing a computer program;

the processor is configured to execute the computer program so that the electronic device implements the fault diagnosis method of any one of claims 1 to 6.

10. A computer storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement the fault diagnosis method of any one of claims 1 to 6.