CN118802492A - Fault location method, device, equipment and storage medium of network file system

Info

Publication number: CN118802492A
Application number: CN202311221123.3A
Authority: CN
Inventors: 戴伟; 潘宇虹; 周勋; 吴天东; 陈琪
Original assignee: Zhejiang Mobile Information System Integration Co ltd; China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd; China Mobile Zhejiang Innovation Research Institute Co Ltd
Current assignee: Zhejiang Mobile Information System Integration Co ltd; China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd; China Mobile Zhejiang Innovation Research Institute Co Ltd
Priority date: 2023-09-20
Filing date: 2023-09-20
Publication date: 2024-10-18

Abstract

The application relates to the technical field of mobile communication and provides a fault location method, device, equipment and storage medium for a network file system. The method comprises: capturing NFS protocol data packets with tcpdump at a client and a server to obtain original data; matching the original data on the unique xid field to obtain complete transmission information of each IO atomic operation; obtaining average delay data for the corresponding time points based on the complete transmission information of all IO atomic operations; and comparing the overall average delay with an overall delay threshold, and, if a first difference obtained by subtracting the overall delay threshold from the overall average delay is larger than a first preset threshold, triggering one by one the threshold judgments of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time points, to obtain an analysis result that reports the fault location. The application generates fine-grained monitoring indicators, makes the time consumed along the complete IO path explicit, and achieves accurate location of the fault point.

Description

Fault location method, device, equipment and storage medium of network file system

Technical Field

The present application relates to the field of mobile communication technology, and in particular to a fault location method, device, equipment and storage medium for a network file system.

Background Art

Network File System (NFS) storage, which runs over TCP/IP, makes it quick and convenient to share data and keep it consistent across multiple servers. It is widely used in enterprise environments to share data files or directories between different hosts and even different systems.

Existing NFS storage is usually monitored with the detection commands that ship with the system. The client can only obtain data on read and write requests, while the server can obtain data such as the call counts of all request types and the overall latency. The two sets of data cannot be effectively correlated.

In the existing NFS storage architecture, the actual IO path has many links. Existing performance monitoring methods can roughly determine whether the storage IO path has a performance problem, but they still have the following shortcomings:

1. The existing monitoring indicators are not fine-grained, so the fault point is difficult to locate quickly;

2. Time consumption information for each link of the IO path is lacking, so the fault point is difficult to locate accurately.

Summary of the Invention

The embodiments of the present application provide a fault location method, device, equipment and storage medium for a network file system, so as to solve the technical problems that monitoring indicators are coarse and fault points are difficult to locate.

In a first aspect, an embodiment of the present application provides a fault location method for a network file system, comprising: collecting the real-time overall delay and the real-time server storage delay, and capturing NFS protocol data packets with tcpdump on the client and the server to obtain original data; matching the original data on the unique xid field to obtain complete transmission information of each IO atomic operation; obtaining average delay data at the corresponding time point based on the complete transmission information of all IO atomic operations, wherein the average delay data includes the overall average delay, the server storage average delay, the server average delay, the network average delay and the client average delay; determining threshold delay data, wherein the threshold delay data includes the overall delay threshold, the server storage delay threshold, the server delay threshold, the network delay threshold and the client delay threshold; and comparing the overall average delay with the overall delay threshold, and, if a first difference obtained by subtracting the overall delay threshold from the overall average delay is greater than a first preset threshold, triggering one by one the threshold judgments of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time point, to obtain an analysis result that reports the fault location.

In one embodiment, capturing NFS protocol data packets with tcpdump on the client and the server to obtain original data, and matching the original data on the unique xid field to obtain complete transmission information of each IO atomic operation, includes: capturing the RPC call request data packets of the NFS storage client and server with the tcpdump tool, and filtering out all IO data packets of NFS operations on the client and the server according to NFS protocol characteristics, wherein the information of each IO data packet includes the path, timestamp, source address, destination address, xid field, atomic operation type and data packet content; and matching all IO data packets on the xid field to obtain, for each IO atomic operation, the complete transmission information of the packets sharing the same xid, wherein the complete transmission information is the entire IO path and covers four stages of the IO data packets: leaving the client, entering the server, leaving the server, and entering the client.

In one embodiment, obtaining the average delay data at the corresponding time point based on the complete transmission information of all IO atomic operations includes: for the complete transmission information of each IO atomic operation, using the entry and exit timestamps of the IO data packets on the client and the server to calculate the server time, network time and client time within a unit time interval; taking the server time as the server average delay; taking the network time as the network average delay; and taking the client time as the client average delay; wherein the server time is the time a complete IO spends between entering the server and leaving the server; the network time is the time a complete IO request, having passed through the four stages of leaving the client, entering the server, leaving the server and entering the client, actually spends in transit on the network link for the round trip; and the client time is the overall time of a complete IO minus the server time and the network time.

In one embodiment, triggering one by one the threshold judgments of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time point to obtain an analysis result that reports the fault location includes: comparing the server average delay with the server delay threshold; if a second difference obtained by subtracting the server delay threshold from the server average delay is less than or equal to a second preset threshold, the server is normal; if the second difference is greater than the second preset threshold, comparing the server storage average delay with the server storage delay threshold.

In one embodiment, comparing the server storage average delay with the server storage delay threshold includes: if a third difference obtained by subtracting the server storage delay threshold from the server storage average delay is less than or equal to a third preset threshold, metadata access is abnormal; if the third difference is greater than the third preset threshold, data storage is abnormal.

In one embodiment, triggering one by one the threshold judgments of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time point to obtain an analysis result that reports the fault location includes: comparing the network average delay with the network delay threshold; if a fourth difference obtained by subtracting the network delay threshold from the network average delay is less than or equal to a fourth preset threshold, the network link is normal; if the fourth difference is greater than the fourth preset threshold, the network link is abnormal.

In one embodiment, triggering one by one the threshold judgments of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time point to obtain an analysis result that reports the fault location includes: comparing the client average delay with the client delay threshold; if a fifth difference obtained by subtracting the client delay threshold from the client average delay is less than or equal to a fifth preset threshold, the client link is normal; if the fifth difference is greater than the fifth preset threshold, the client link is abnormal.
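As a non-limiting illustration, the threshold judgments of the above embodiments can be sketched in Python as follows. The names `Delays`, `locate_fault` and `margin` are hypothetical and do not appear in the application; `margin` is assumed to hold the first to fifth preset thresholds, and the order in which the server, network and client checks run is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Delays:
    """One value per monitored segment (same unit for all fields)."""
    overall: float
    server_storage: float
    server: float
    network: float
    client: float


def locate_fault(avg: Delays, thr: Delays, margin: Delays) -> List[str]:
    """Return the abnormal segments for one time point (empty if none)."""
    findings: List[str] = []
    # First difference: only a slow overall path triggers the segment checks.
    if avg.overall - thr.overall <= margin.overall:
        return findings
    # Second difference: is the server slow?
    if avg.server - thr.server > margin.server:
        # Third difference: storage slow -> data storage, otherwise metadata.
        if avg.server_storage - thr.server_storage > margin.server_storage:
            findings.append("data storage abnormal")
        else:
            findings.append("metadata access abnormal")
    # Fourth difference: network link.
    if avg.network - thr.network > margin.network:
        findings.append("network link abnormal")
    # Fifth difference: client link.
    if avg.client - thr.client > margin.client:
        findings.append("client link abnormal")
    return findings
```

Returning a list rather than a single label allows more than one segment to be reported for the same time point, which is a design choice of this sketch and not a requirement of the embodiments.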

In a second aspect, an embodiment of the present application provides a fault location device for a network file system, including: a data acquisition module, used to collect the real-time overall delay and the real-time server storage delay, capture NFS protocol data packets with tcpdump on the client and the server to obtain original data, and match the original data on the unique xid field to obtain complete transmission information of each IO atomic operation; a data extraction module, used to obtain average delay data at the corresponding time point based on the complete transmission information of all IO atomic operations, wherein the average delay data includes the overall average delay, the server storage average delay, the server average delay, the network average delay and the client average delay; a threshold configuration module, used to determine threshold delay data, wherein the threshold delay data includes the overall delay threshold, the server storage delay threshold, the server delay threshold, the network delay threshold and the client delay threshold; and a performance analysis and location judgment module, used to compare the overall average delay with the overall delay threshold and, if a first difference obtained by subtracting the overall delay threshold from the overall average delay is greater than a first preset threshold, to trigger one by one the threshold judgments of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time point, to obtain an analysis result that reports the fault location.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory storing a computer program, wherein the processor, when executing the program, implements the fault location method for a network file system described in the first aspect.

In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the fault location method for a network file system described in the first aspect.

In the fault location method, device, equipment and storage medium for a network file system provided by the embodiments of the present application, the method includes: capturing NFS protocol data packets with tcpdump on the client and the server to obtain original data; matching the original data on the unique xid field to obtain complete transmission information of each IO atomic operation; obtaining average delay data at the corresponding time point based on the complete transmission information of all IO atomic operations; and comparing the overall average delay with the overall delay threshold, and, if a first difference obtained by subtracting the overall delay threshold from the overall average delay is greater than a first preset threshold, triggering one by one the threshold judgments of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time point, to obtain an analysis result that reports the fault location. In this way, the present application can generate fine-grained monitoring indicators, break NFS requests down into IO atomic operations, and obtain the complete IO path and the delay of each link in it, so that the time consumed along the complete IO path is explicit and the fault point can be located accurately.

Brief Description of the Drawings

In order to illustrate the technical solutions of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of the fault location method for a network file system provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of the deployment architecture of the network file system provided in an embodiment of the present application;

FIG. 3 is a schematic diagram detailing the actual IO path of FIG. 2, provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of atomic operation path tracking provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of the time consumed in each link of the IO path, provided in an embodiment of the present application;

FIG. 6 is a schematic flow chart of the performance analysis and location judgment process provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of the fault location device for a network file system provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of the workflow of the fault location device for a network file system provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of the physical structure of the electronic device provided in an embodiment of the present application.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.

The fault location method, device, equipment and storage medium for a network file system provided in the embodiments of the present application can effectively address two problems of existing NFS storage technology: monitoring indicators are not fine-grained, so fault points are difficult to locate quickly; and time consumption information for each link of the IO path is lacking, so fault points are difficult to locate accurately.

Please refer to FIG. 1, which is a schematic flow chart of the fault location method for a network file system provided by an embodiment of the present application. The method provided by this embodiment includes steps S110 to S140, which are as follows:

S110: Collect the real-time overall delay and the real-time server storage delay, and capture NFS protocol data packets with tcpdump on the client and the server to obtain original data; match the original data on the unique xid field to obtain complete transmission information of each IO atomic operation.

Please refer to FIGS. 2 and 3. FIG. 2 is a schematic diagram of the deployment architecture of the network file system provided in an embodiment of the present application. FIG. 3 is a schematic diagram detailing the actual IO path of FIG. 2.

As shown in FIG. 2, one NFS server storage serves multiple NFS clients over a TCP/IP network, from which a complete NFS storage IO path model can be abstracted, namely NFS client-->TCP/IP network-->NFS server file system-->data storage.

As shown in FIG. 3, the application completes a request and its response through the following five IO path segments. The client only issues NFS RPC calls, and the real IO takes place on the server: the client application has its IO request converted into RPC by the file system service process, the RPC request travels over the network to the server service process, which restores it to an IO request, and the data is finally read from or written to storage on the server. The details are as follows:

① The application exchanges IO requests with the NFS client: the application initiates an IO request to the NFS file system, and the NFS file system returns the result of the request to the application.

② The NFS client converts between IO requests and RPC requests: the NFS file system, through the client service process, converts the IO request into an RPC remote call request and, after receiving the server's RPC response, converts the corresponding RPC request back into an IO request of the NFS file system.

③ The NFS client and the server exchange RPC requests: the client and the server complete the RPC request transmission over the TCP/IP network.

④ The NFS server converts between RPC requests and IO requests: the server process converts the received RPC request into an NFS file system IO request and first requests access to the file system metadata.

⑤ The NFS server completes the data read or write of the IO request: once the file system metadata has been accessed successfully, the service process performs the actual read or write on the stored data and automatically returns a success result on completion.

Under the NFS storage architecture of the related art, the client can use tools such as nfsiostat to obtain indicators such as the remote RPC call volume and average latency of its read and write operations. On the server, the nfstat tool shows the number of NFS request operations of each type that the NFS storage has received from all NFS clients, and the sar and iostat tools report the IO usage from the NFS file system to the data storage, such as the IOPS and latency of read and write operations. However, the NFS client can only obtain performance data for read and write requests, while the NFS server can obtain data such as the counts of all NFS request types and the overall busyness, and the two cannot be effectively correlated:

1. Existing monitoring indicators are not fine-grained, and fault points are difficult to locate quickly

Existing command-line tools can report the overall usage status of a single NFS client and of the storage server separately, but they cannot query, at a fine granularity, how each IO atomic request from each client completed over its full life cycle. As a result, the data of a single client cannot be effectively matched against the aggregate data of the server, and when access performance becomes abnormal it is impossible to determine quickly which type of IO atomic request from which specific client is at fault.

2. Time consumption information for each link of the IO path is lacking, and fault points are difficult to locate accurately

The actual IO path of the client is long, and existing methods can only obtain the overall IO delay of the client, which is the sum of the client delay, the network transmission delay and the storage delay; the delay of each individual segment cannot be obtained. It is therefore impossible to link the global IO path end to end and to determine whether a performance problem lies in a specific IO link on the client, the network or the storage side, and hence impossible to locate the fault accurately.

To address the above problems, this embodiment improves the granularity of NFS monitoring indicators, breaks NFS requests down into IO atomic operations, and obtains the complete IO path and the delay of each link in it, so that problems can be located and resolved quickly.

Optionally, capturing NFS protocol data packets with tcpdump on the client and the server to obtain original data, and matching the original data on the unique xid field to obtain complete transmission information of each IO atomic operation, includes:

The tcpdump tool is used to capture the RPC call request data packets of the NFS storage client and server, and all IO data packets of NFS operations on the client and the server are filtered out according to NFS protocol characteristics. The information of each IO data packet includes the path, timestamp, source address, destination address, xid field, atomic operation type and data packet content. All IO data packets are then matched on the xid field to obtain, for each IO atomic operation, the complete transmission information of the packets sharing the same xid. The complete transmission information is the entire IO path and covers four stages of the IO data packets: leaving the client, entering the server, leaving the server, and entering the client.
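The xid matching described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `Packet` record and the `match_by_xid` function are hypothetical, the packets are assumed to have already been decoded from the capture files, and the stage labels Co, Si, So and Ci stand for leaving the client, entering the server, leaving the server and entering the client. Because an RPC xid is chosen by each client independently, the sketch keys on the pair (client address, xid) so that two clients using the same xid are not merged.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """One decoded NFS RPC packet (hypothetical record layout)."""
    capture_side: str   # "client" or "server": the host on which tcpdump ran
    timestamp: float    # capture time in seconds
    src: str            # source address
    dst: str            # destination address
    xid: int            # RPC transaction id
    op: str             # atomic operation type, e.g. "rename"
    is_reply: bool      # False for an RPC call, True for an RPC reply


def match_by_xid(
    packets: Iterable[Packet],
) -> Dict[Tuple[str, int], Dict[str, float]]:
    """Group packets per (client address, xid) and label the four stages."""
    paths: Dict[Tuple[str, int], Dict[str, float]] = defaultdict(dict)
    for p in packets:
        # The client is the sender of a call and the receiver of a reply.
        client_addr = p.dst if p.is_reply else p.src
        if p.capture_side == "client":
            stage = "Ci" if p.is_reply else "Co"
        else:
            stage = "So" if p.is_reply else "Si"
        # Keep the first timestamp seen per stage (ignores retransmissions).
        paths[(client_addr, p.xid)].setdefault(stage, p.timestamp)
    # Only operations observed at all four stages form a complete IO path.
    return {key: stages for key, stages in paths.items() if len(stages) == 4}
```

Operations that are missing a stage, for example because a capture started late, are dropped in this sketch rather than reported with partial data.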

S120: Based on the complete transmission information of all IO atomic operations, obtain the average delay data at the corresponding time point.

The average delay data includes the overall average delay, the server storage average delay, the server average delay, the network average delay and the client average delay.

Optionally, obtaining the average delay data at the corresponding time point based on the complete transmission information of all IO atomic operations includes:

For the complete transmission information of each IO atomic operation, the entry and exit timestamps of the IO data packets on the client and the server are used to calculate the server time, network time and client time within a unit time interval. The server time is taken as the server average delay, the network time as the network average delay, and the client time as the client average delay. Here, the server time is the time a complete IO spends between entering the server and leaving the server. The network time is the time a complete IO request, having passed through the four stages of leaving the client, entering the server, leaving the server and entering the client, actually spends in transit on the network link for the round trip. The client time is the overall time of a complete IO minus the server time and the network time.
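The calculation described above can be illustrated with the following Python sketch. It is an assumption-laden example rather than the method of the application: the function names are hypothetical, `co`, `si`, `so` and `ci` are the timestamps at which the packet leaves the client, enters the server, leaves the server and enters the client, and `overall` is assumed to be the real-time overall delay of the same operation collected in S110. The network time is computed as the client-side round trip minus the server time, which is one way to obtain the transit time without requiring the client and server clocks to be synchronized.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple


def segment_latencies(
    co: float, si: float, so: float, ci: float, overall: float
) -> Tuple[float, float, float]:
    """Return (server, network, client) time for one complete IO."""
    server = so - si                  # time between entering and leaving the server
    network = (ci - co) - server      # round trip on the wire, both directions
    client = overall - server - network
    return server, network, client


def interval_averages(
    samples: Iterable[Tuple[float, float, float, float, float]],
    interval: float = 60.0,
) -> Dict[float, Tuple[float, float, float]]:
    """Average (server, network, client) per unit time interval.

    Each sample is (co, si, so, ci, overall); an operation is assigned to
    the interval in which it left the client. The 60 s default is arbitrary.
    """
    buckets: Dict[float, list] = {}
    for co, si, so, ci, overall in samples:
        start = int(co // interval) * interval
        buckets.setdefault(start, []).append(
            segment_latencies(co, si, so, ci, overall)
        )
    return {
        start: tuple(mean(column) for column in zip(*rows))
        for start, rows in buckets.items()
    }
```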

S130: Determine the threshold delay data.

The threshold delay data includes the overall delay threshold, the server storage delay threshold, the server delay threshold, the network delay threshold and the client delay threshold.

Under the NFS storage architecture, clients sit at different network locations and NFS storage configurations differ, so the network delay and the server storage delay also vary. The thresholds can therefore be set from historical data and empirical data, providing an important reference for full-link performance analysis.

In some embodiments, the threshold delay data is determined as follows:

Overall delay threshold: the overall IO delay threshold is adjusted to suit the business environment.

Server storage delay threshold: a different storage delay threshold is set for each storage model.

Server delay threshold: baseline reference data is generated from historical data collected under normal conditions.

Network delay threshold: in general, the more network hops there are and the longer the link is, the higher the threshold.

Client delay threshold: baseline reference data is generated from historical data collected under normal conditions.
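The application does not specify how the baseline is derived from historical data. Purely as a hypothetical illustration, the sketch below sets a threshold at the historical mean plus `k` standard deviations; both the function name `baseline_threshold` and the choice of statistic are assumptions, and a percentile or a manually tuned value would serve equally well.

```python
from statistics import mean, pstdev
from typing import Sequence


def baseline_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Derive a delay threshold from delays observed under normal conditions.

    Assumed rule (not from the application): mean + k * population std dev.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    return mean(history) + k * pstdev(history)
```

For example, `baseline_threshold(server_delay_history)` could feed the server delay threshold and `baseline_threshold(client_delay_history)` the client delay threshold, where both history sequences are placeholders for data gathered by the deployment.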

S140：将整体平均时延与整体时延阈值作比较，若平均整体时延减去整体时延阈值的第一差值大于第一预设阈值，则逐一触发对应时点的服务端存储平均时延、服务端平均时延、网络平均时延和客户端平均时延的阈值判断，得到用于反馈故障位置的分析结果。S140: Compare the overall average delay with the overall delay threshold. If the first difference between the average overall delay and the overall delay threshold is greater than the first preset threshold, the threshold judgments of the server-side storage average delay, server-side average delay, network average delay and client-side average delay at the corresponding time points are triggered one by one to obtain analysis results for feedback of the fault location.

本申请实施例提供一种提供网络文件系统的故障定位方法，包括在客户端和服务端通过tcpdump进行NFS协议数据包采集得到原始数据；对原始数据进行唯一的xid字段匹配，得到每个IO原子操作的完整传递信息；基于所有IO原子操作的完整传递信息，得到对应时点的平均时延数据；将整体平均时延与整体时延阈值作比较，若平均整体时延减去整体时延阈值的第一差值大于第一预设阈值，则逐一触发对应时点的服务端存储平均时延、服务端平均时延、网络平均时延和客户端平均时延的阈值判断，得到用于反馈故障位置的分析结果。本实施例生成精细监控指标，可明确完整IO路径耗时，实现故障点精确定位。The embodiment of the present application provides a method for locating a fault of a network file system, including collecting NFS protocol data packets through tcpdump on the client and server to obtain raw data; matching the original data with a unique xid field to obtain complete transmission information of each IO atomic operation; obtaining the average delay data at the corresponding time point based on the complete transmission information of all IO atomic operations; comparing the overall average delay with the overall delay threshold, if the first difference between the average overall delay and the overall delay threshold is greater than the first preset threshold, then triggering the threshold judgment of the server storage average delay, server average delay, network average delay and client average delay at the corresponding time point one by one, and obtaining an analysis result for feedback of the fault location. This embodiment generates fine monitoring indicators, which can clarify the time consumption of the complete IO path and realize accurate positioning of the fault point.

Because the NFS protocol is built on RPC calls, this embodiment uses tcpdump to trace network packets on the NFS client and the NFS server (storage end). It uses the unique RPC xid to trace and filter the packets of each kind of IO atomic operation (such as read, write, rename, remove and create), and uses the timestamps of the whole IO atomic operation at the NFS server, the network and the client to obtain the delay consumed in each segment.

The core idea is to trace every packet along the entire IO path, use the unique identifier xid carried in the packets to compile detailed statistics for each IO atomic operation, calculate the IO time consumed in each link of the IO path, and compare it with set thresholds to identify the problem. This is explained below in three parts: IO atomic operation tracking, IO path time extraction, and time threshold comparison.

1. IO atomic operation tracking: The device first captures traffic with the tcpdump tool and, based on NFS protocol characteristics, filters out all IO data packets of NFS operations on the client and the server. The entire IO path comprises four stages: the packet leaves the client (Co), enters the server (Si), leaves the server (So), and enters the client (Ci). Refer to Figure 4, a schematic diagram of atomic operation path tracking provided by an embodiment of the present application.

Taking rename, one of the IO atomic operations in the NFS protocol, as an example, the captured packets have the format shown in Table 1. Each packet record contains the time point, the source and destination addresses, the operation type and other information. Matching on the unique xid field yields the complete transmission information of each IO atomic operation and links the complete IO path together.
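The xid matching described above can be sketched as follows. This is a minimal illustration that assumes the tcpdump output has already been parsed into records; the `Packet` fields and the `match_by_xid` helper are hypothetical names chosen for the example and do not come from the application.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical parsed record; the field names are illustrative assumptions.
    point: str   # capture point: "Co", "Si", "So" or "Ci"
    ts: float    # capture timestamp in seconds
    xid: int     # RPC transaction id shared by a call and its reply
    op: str      # NFS operation type, e.g. "rename"


def match_by_xid(packets):
    """Group packets by xid and keep only operations seen at all four points."""
    by_xid = defaultdict(dict)
    for p in packets:
        by_xid[p.xid][p.point] = p
    required = {"Co", "Si", "So", "Ci"}
    return {xid: pts for xid, pts in by_xid.items() if required.issubset(pts)}


packets = [
    Packet("Co", 0.000, 0x1A2B, "rename"),
    Packet("Si", 0.002, 0x1A2B, "rename"),
    Packet("So", 0.010, 0x1A2B, "rename"),
    Packet("Ci", 0.013, 0x1A2B, "rename"),
    Packet("Co", 0.020, 0x1A2C, "write"),  # reply not captured, so it is dropped
]
complete = match_by_xid(packets)
```

Operations with fewer than four capture points are left out here, on the assumption that only complete paths should feed the later delay calculations.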

Table 1: Data packet format

2. IO path time extraction: The previous step yields the complete path information of any IO atomic operation, as shown in Figure 5, a schematic diagram of the time consumed in each link of the IO path provided by an embodiment of the present application.

tcpdump captures the timestamps of a packet at all four points Co, Si, So and Ci. From St = So - Si, Nt = Si - Co + Ci - So and Ct = Tt - Nt - St, the time consumed in the three segments can be calculated: server time St, network time Nt and client time Ct. The formulas are explained in detail below.

As the above shows, the complete IO path consists of five links, which fall into three segments according to the device each link passes through: server, network and client. Specifically:

Server time: St = time of IO path ④ (the NFS server program accesses NFS file system metadata) + time of IO path ⑤ (the NFS server program accesses data storage). It is the time a complete IO spends between entering the server (Si) and leaving the server (So), and it equals the file system metadata time Sm plus the time Sb for the IO to be written to data storage. That is: St = So - Si = Sm + Sb.

Network time: Nt = time of IO path ③ (NFS network link transmission), which is the round-trip time on the network link. It is therefore the network time for the server to receive the request (Si - Co) plus the network time for the client to receive the reply (Ci - So). That is: Nt = Si - Co + Ci - So.

Client time: Ct = time of IO path ① (the application calls the NFS file system) + time of IO path ② (the NFS file system converts the call into a network RPC call). It is obtained by subtracting the network delay Nt and the server delay St from the overall delay Tt. That is: Ct = Tt - Nt - St.
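The three formulas can be checked with a short sketch for a single IO. The function name `segment_times` and the sample timestamps are illustrative assumptions; only the formulas themselves come from the text.

```python
def segment_times(co, si, so, ci, tt):
    """Split one IO's overall time tt into server, network and client parts.

    co, si, so, ci are the four capture timestamps and tt is the overall
    delay of the IO. All values share one time unit (milliseconds here).
    """
    st = so - si                # server time:  St = So - Si
    nt = (si - co) + (ci - so)  # network time: Nt = Si - Co + Ci - So
    ct = tt - nt - st           # client time:  Ct = Tt - Nt - St
    return st, nt, ct


st, nt, ct = segment_times(co=0, si=2, so=10, ci=13, tt=15)
```

Note that Nt can be regrouped as (Ci - Co) - (So - Si), so each difference is taken between two timestamps recorded on the same host.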

3. IO path time threshold comparison: When the NFS storage service shows a performance anomaly, the times St, Nt and Ct of the path segments above are compared with their set baseline thresholds to determine whether the problem lies in the server, the network or the client. This enables precise analysis and location of NFS storage IO link performance problems.

The key points of the present application are "the method of obtaining finest-granularity data for a single atomic IO by tracking the xid in NFS RPC calls" and "the method of obtaining the time consumed in the key links of the three-segment IO path and of discovering and locating problems through changes in those three times". Their specific functions are:

1) The unique identifier xid is innovatively used to match and filter, from tcpdump captures, the NFS protocol packets exchanged for each NFS storage operation request. This yields complete IO data over the full life cycle of a single RPC-based IO operation. Because the data is based on a single complete IO request, it has the finest granularity and can feed subsequent extraction, processing and analysis.

2) Given the data of single atomic IO operations, the timestamps at which packets leave and enter the client and the server are used to calculate, per unit time interval, the time consumed in the three key segments of the IO path: client delay (Ct), network delay (Nt) and server delay (St). Comparing these with the set thresholds produces the performance analysis result, so that NFS problems are discovered and located automatically and quickly.

Refer to Figure 6, a flowchart of the performance analysis and location judgment process provided by an embodiment of the present application. First, the overall average delay Tt is compared on a timer with its baseline threshold Tt_H (the overall delay threshold). If Tt is significantly greater than the baseline threshold, the server, network and client are judged one by one to locate the specific abnormal link. The judgment rules are as follows:

A: If Tt >> Tt_H, that is, the overall average delay Tt is much greater than the overall delay threshold Tt_H, the NFS storage performance service is judged abnormal and the three segments are judged separately; otherwise the entire NFS storage link is normal.

B: If St >> St_H and Sb >> Sb_H, that is, the server average delay St is much greater than the server delay threshold St_H and the server storage average delay Sb is much greater than the server storage delay threshold Sb_H, the server data storage is judged abnormal. If St is not much greater than St_H, the server is judged normal.

C: If St >> St_H and Sb ≈ Sb_H, that is, the server average delay St is much greater than the server delay threshold St_H but the server storage average delay Sb is normal, the cause is judged, based on operation and maintenance experience, to be abnormal file system metadata access.

D: If Nt >> Nt_H, that is, the network average delay Nt is much greater than the network delay threshold Nt_H, the network link is judged abnormal; otherwise the network link is normal.

E: If Ct >> Ct_H, that is, the client average delay Ct is much greater than the client delay threshold Ct_H, the client is judged abnormal; otherwise the client is normal.

Once the overall average delay Tt is abnormal, the time-based judgment above quickly and accurately yields one of five types of location results: client anomaly, network anomaly, server data storage anomaly, logical lock problem, and file system metadata access anomaly.
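Rules A to E can be sketched as a small decision function. This is only an illustration under stated assumptions: "much greater than" is modelled, as in step S140, as the difference between the average and its baseline exceeding a preset margin; the function name `locate_fault`, the dictionary layout and the sample numbers are hypothetical. The logical lock result is not covered because the rules above give no condition for it.

```python
def locate_fault(avg, base, margin):
    """Return the suspected fault locations for one time point.

    avg, base and margin are dicts keyed by "Tt", "St", "Sb", "Nt", "Ct".
    A metric counts as abnormal when avg - base exceeds its margin.
    """
    def exceeds(key):
        return avg[key] - base[key] > margin[key]

    if not exceeds("Tt"):
        return []  # rule A: the whole NFS storage link is normal
    findings = []
    if exceeds("St"):
        if exceeds("Sb"):
            findings.append("server data storage")          # rule B
        else:
            findings.append("file system metadata access")  # rule C
    if exceeds("Nt"):
        findings.append("network link")                     # rule D
    if exceeds("Ct"):
        findings.append("client")                           # rule E
    return findings


base = {"Tt": 5.0, "St": 2.0, "Sb": 1.0, "Nt": 1.0, "Ct": 2.0}
margin = dict.fromkeys(base, 1.0)  # assumed preset margins, one per metric
avg = {"Tt": 20.0, "St": 16.0, "Sb": 1.2, "Nt": 1.5, "Ct": 2.5}
result = locate_fault(avg, base, margin)
```

Returning a list rather than a single label lets more than one segment be reported for the same time point, which matches judging the segments one by one.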

The fault location device for a network file system provided by an embodiment of the present application is described below. The device described below and the fault location method described above may be read with reference to each other.

Refer to Figure 7, a schematic structural diagram of the fault location device for a network file system provided by an embodiment of the present application. In this embodiment, the fault location device includes:

A data collection module 710, configured to collect the real-time overall delay and the real-time server storage delay, capture NFS protocol data packets with tcpdump on the client and the server to obtain raw data, and match the raw data on the unique xid field to obtain the complete transmission information of each IO atomic operation.

A data extraction module 720, configured to obtain the average delay data at the corresponding time point from the complete transmission information of all IO atomic operations, where the average delay data includes the overall average delay, the server storage average delay, the server average delay, the network average delay and the client average delay.

A threshold configuration module 730, configured to determine the threshold delay data, where the threshold delay data includes the overall delay threshold, the server storage delay threshold, the server delay threshold, the network delay threshold and the client delay threshold.

A performance analysis and location judgment module 740, configured to compare the overall average delay with the overall delay threshold and, if the first difference obtained by subtracting the overall delay threshold from the overall average delay is greater than a first preset threshold, trigger one by one the threshold judgments of the server storage average delay, server average delay, network average delay and client average delay at the corresponding time point, to obtain an analysis result that reports the fault location.

The data collection module in this embodiment collects the overall IO delay Tt and the server data storage delay Sb through commands, and captures NFS protocol data packets with tcpdump on the client and the server, filtering and matching them by xid for subsequent analysis. The data extraction module processes the raw data obtained by the collection module: using the xid, it calculates for all NFS-related atomic operations the average delay of each path segment (server, network and client) for subsequent judgment. The threshold configuration module obtains, while the NFS service is operating normally, the normal overall average delay Tt, server storage average delay Sb, server average delay St, network average delay Nt and client average delay Ct as historical baseline data for subsequent comparison and judgment. The performance analysis and location judgment module compares the overall IO delay with the baseline data on a timer; if the delay is significantly greater than the baseline threshold, it obtains the server storage, server, network and client delays at the corresponding time point one by one and judges each of them to identify the specific abnormal link.
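The baseline step of the threshold configuration module might look like the sketch below. The text only says that baselines come from historical data recorded under normal conditions; using the arithmetic mean of those samples is an assumption made here for illustration, as are the name `build_baseline` and the sample values.

```python
from statistics import mean


def build_baseline(history):
    """Average each metric over samples recorded while NFS was healthy.

    The arithmetic mean is an assumed choice; the source does not say
    how the baseline is derived from the historical data.
    """
    metrics = ("Tt", "Sb", "St", "Nt", "Ct")
    return {m: mean(sample[m] for sample in history) for m in metrics}


history = [
    {"Tt": 4, "Sb": 1, "St": 2, "Nt": 1, "Ct": 1},
    {"Tt": 6, "Sb": 1, "St": 3, "Nt": 1, "Ct": 2},
]
baseline = build_baseline(history)
```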

In one embodiment, the data collection module 710 is configured to: capture the RPC call request packets of the NFS storage client and server with the tcpdump tool, and filter out all IO data packets of NFS operations on the client and the server based on NFS protocol characteristics, where the information of each IO data packet includes the path, timestamp, source address, destination address, xid field, atomic operation type and packet content; and match all IO data packets on the xid field to obtain the complete transmission information of each IO atomic operation sharing the same xid, where the complete transmission information is the entire IO path, comprising the four stages in which the IO data packet leaves the client, enters the server, leaves the server and enters the client.

In one embodiment, the data extraction module 720 is configured to: for the complete transmission information of each IO atomic operation, use the timestamps at which the IO data packets leave and enter the client and the server to calculate the server time, network time and client time within a unit time interval; take the server time as the server average delay; take the network time as the network average delay; and take the client time as the client average delay. Here, the server time is the time a complete IO spends between entering the server and leaving the server; the network time is the round-trip time a complete IO request actually spends on the network link across the four stages of leaving the client, entering the server, leaving the server and entering the client; and the client time is the overall time of a complete IO minus the server time and the network time.

In one embodiment, the performance analysis and location judgment module 740 is configured to compare the server average delay with the server delay threshold. If the second difference, obtained by subtracting the server delay threshold from the server average delay, is less than or equal to a second preset threshold, the server is normal; if the second difference is greater than the second preset threshold, the server storage average delay is compared with the server storage delay threshold.

In one embodiment, the performance analysis and location judgment module 740 is configured to: if the third difference, obtained by subtracting the server storage delay threshold from the server storage average delay, is less than or equal to a third preset threshold, judge that metadata access is abnormal; if the third difference is greater than the third preset threshold, judge that data storage is abnormal.

In one embodiment, the performance analysis and location judgment module 740 is configured to compare the network average delay with the network delay threshold. If the fourth difference, obtained by subtracting the network delay threshold from the network average delay, is less than or equal to a fourth preset threshold, the network link is normal; if the fourth difference is greater than the fourth preset threshold, the network link is abnormal.

In one embodiment, the performance analysis and location judgment module 740 is configured to compare the client average delay with the client delay threshold. If the fifth difference, obtained by subtracting the client delay threshold from the client average delay, is less than or equal to a fifth preset threshold, the client link is normal; if the fifth difference is greater than the fifth preset threshold, the client link is abnormal.

Refer to Figure 8, a schematic diagram of the workflow of the fault location device for a network file system provided by an embodiment of the present application.

Data collection module: responsible for the basic data collection of the device. It collects the overall average delay with nfsiostat and the server storage average delay Sb with sar/iostat. It then uses the tcpdump tool to capture the RPC call request packets of the NFS storage client and server (Table 1) and, by filtering and matching on the xid field, obtains the time points, operation type and other information for the complete path of each IO atomic operation. The delay-related basic data types that can be gathered are listed in Table 2. The data is saved for the subsequent extraction of average delays per unit time interval.

Table 2: Collected delay data types

Data extraction module: responsible for using the basic data from the collection module to calculate, over the complete IO path of NFS atomic operation requests, the overall average delay Tt, the server storage average delay Sb, and the time consumed in each of the three IO segments (server average delay St, network average delay Nt and client average delay Ct) for subsequent performance analysis and judgment.

Overall average delay calculation: The values the client collects with the nfsiostat command are already the average read and write delays of IO, Rtt_r and Rtt_w, and the number of read and write operations per second (ops) is available as well.

The read and write proportions r% and w% (r% + w% = 100%) are then calculated from the numbers of reads and writes per unit time, and the overall average delay Tt per unit time is obtained by weighting with these proportions: Tt = (Rtt_r * r%) + (Rtt_w * w%).
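As a worked example of this weighting, the sketch below derives r% and w% from read and write operation counts and combines the two average delays. The function name, the zero-traffic convention and the sample numbers are assumptions for illustration.

```python
def overall_avg_delay(rtt_r, rtt_w, read_ops, write_ops):
    """Tt = Rtt_r * r% + Rtt_w * w%, with r% and w% taken from op counts."""
    total = read_ops + write_ops
    if total == 0:
        return 0.0  # no IO in this interval (assumed convention)
    r = read_ops / total
    w = write_ops / total
    return rtt_r * r + rtt_w * w


# 75% reads at 2.0 ms and 25% writes at 6.0 ms
tt = overall_avg_delay(rtt_r=2.0, rtt_w=6.0, read_ops=75, write_ops=25)
```

With these inputs the read and write terms each contribute 1.5 ms, so Tt is 3.0 ms.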

Server storage average delay calculation: Before a server-side IO finally reaches disk it passes through two steps, accessing file system metadata (Sm) and writing to data storage (Sb). Because the NFS service process runs in kernel mode, the time for accessing file system metadata (Sm) is not discussed in this proposal. The time for writing to data storage (Sb) can be obtained on the server as the average await value reported by the sar and iostat command tools, which directly gives the average delay per unit time: Sb = await.

Server average delay calculation: Server time is the time a complete IO spends between entering the server (Si) and leaving the server (So), so it is the difference between the timestamp at which the server sends the reply (So) and the timestamp at which it receives the request (Si). With w denoting the number of complete IO requests counted from the packets within unit time n, the average server delay within unit time n is the sum of (So - Si) over those w requests, divided by w.

Network average delay calculation: Network time is the round-trip time a complete IO request actually spends on the network link across the four stages of leaving the client (Co), entering the server (Si), leaving the server (So) and entering the client (Ci), so it is the network time for the server to receive the request (Si - Co) plus the network time for the client to receive the reply (Ci - So). With w again denoting the number of complete IO requests within unit time n, the average network delay per unit time is the sum of (Si - Co + Ci - So) over those w requests, divided by w.

Client average delay calculation: The complete IO path of an IO request comprises the four stages in which the packet leaves the client (Co), enters the server (Si), leaves the server (So) and enters the client (Ci). The average client time within unit time n is therefore obtained by subtracting the average network delay Nt and the average server delay St from the overall delay Tt, that is, Ct = Tt - Nt - St using the averages for the same unit time n.

After processing by this module, the average overall IO delay Tt, the server storage delay Sb, and the corresponding average server delay, network delay and client delay within unit time n are available as core indicators for the next step of performance problem analysis and judgment. They are also accumulated as historical data that serves as threshold reference data.
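The per-interval averaging can be sketched as follows, assuming each complete IO has already been reduced to its four timestamps. The name `window_averages`, the tuple layout and the sample values are illustrative assumptions; the averaging follows the definitions of St, Nt and Ct given earlier.

```python
def window_averages(ios, tt):
    """Average St and Nt over the w complete IOs of one interval, then derive Ct.

    ios is a list of (co, si, so, ci) timestamp tuples and tt is the overall
    average delay Tt of the same interval, in the same time unit.
    """
    w = len(ios)
    if w == 0:
        raise ValueError("no complete IO requests in this interval")
    st = sum(so - si for _, si, so, _ in ios) / w
    nt = sum((si - co) + (ci - so) for co, si, so, ci in ios) / w
    ct = tt - nt - st
    return {"St": st, "Nt": nt, "Ct": ct}


ios = [(0, 2, 10, 13), (20, 21, 25, 28)]
avgs = window_averages(ios, tt=12)
```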

In summary, the fault location device for a network file system provided by the embodiments of the present application achieves the following beneficial effects:

1. Collecting IO atomic operation indicators and generating fine-grained monitoring indicators: The data collection module and data extraction module of the device use tcpdump to obtain the basic data packets and, through xid matching, filter out the data of all qualifying atomic IO operations that the NFS protocol initiates through RPC calls, as basic atomic data. The information attached to each basic atomic operation identifies the client that initiated the IO operation. Combined with the operation type of each IO atomic operation, extraction yields fine-grained monitoring indicators along multiple dimensions, including per client and per IO type.

2. Making the time of the complete IO path explicit and locating the fault point precisely: The device uses the xid to obtain the complete IO path and related data of each NFS IO request. From the client-side and server-side timestamps of a single complete IO, it generates the key time indicators of that atomic IO: St (server, that is, storage-end delay), Nt (network delay) and Ct (client delay). These are summed and averaged over a sampling interval to give the key time indicator data. The performance analysis and judgment module then compares each key indicator with its baseline threshold, which precisely locates client anomalies, network anomalies, data storage anomalies and file system metadata access anomalies.
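The multi-dimensional indicators mentioned in item 1 can be illustrated by grouping per-IO server time by client or by operation type. This is a minimal sketch; the record layout, the field names and the helper `avg_server_time_by` are hypothetical.

```python
from collections import defaultdict


def avg_server_time_by(records, key):
    """Average per-IO server time (So - Si) grouped by a record field."""
    sums = defaultdict(lambda: [0.0, 0])  # key value -> [total time, count]
    for r in records:
        acc = sums[r[key]]
        acc[0] += r["so"] - r["si"]
        acc[1] += 1
    return {k: total / n for k, (total, n) in sums.items()}


records = [
    {"client": "10.0.0.5", "op": "write", "si": 2, "so": 10},
    {"client": "10.0.0.5", "op": "rename", "si": 21, "so": 25},
    {"client": "10.0.0.6", "op": "write", "si": 30, "so": 34},
]
by_client = avg_server_time_by(records, "client")
by_op = avg_server_time_by(records, "op")
```

The same grouping could be applied to network or client time by swapping the per-record expression.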

In another aspect, an embodiment of the present application further provides an electronic device. Figure 9 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present application. As shown in Figure 9, the electronic device may include a memory 920, a processor 910, and a computer program stored in the memory 920 and executable on the processor 910. When the processor 910 executes the program, it implements the fault location method for a network file system provided by the methods above.

Optionally, the electronic device may further include a communication bus 930 and a communications interface 940, where the processor 910, the communications interface 940 and the memory 920 communicate with one another through the communication bus 930. The processor 910 may call the computer program in the memory 920 to execute the fault location method for a network file system, the method including:

collecting the real-time overall delay and the real-time server storage delay, and capturing NFS protocol data packets with tcpdump on the client and the server to obtain raw data; matching the raw data on the unique xid field to obtain the complete transmission information of each IO atomic operation; obtaining the average delay data at the corresponding time point from the complete transmission information of all IO atomic operations, where the average delay data includes the overall average delay, the server storage average delay, the server average delay, the network average delay and the client average delay; determining the threshold delay data, where the threshold delay data includes the overall delay threshold, the server storage delay threshold, the server delay threshold, the network delay threshold and the client delay threshold; and comparing the overall average delay with the overall delay threshold and, if the first difference obtained by subtracting the overall delay threshold from the overall average delay is greater than a first preset threshold, triggering one by one the threshold judgments of the server storage average delay, server average delay, network average delay and client average delay at the corresponding time point, to obtain an analysis result that reports the fault location.

In addition, the logic instructions in the memory 920 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present application further provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the fault location method for a network file system provided by the methods above. Its steps and principles have been described in detail above and are not repeated here.

The non-transitory computer-readable storage medium may be any available medium or data storage device that a processor can access, including but not limited to magnetic storage (such as floppy disks, hard disks, magnetic tapes and magneto-optical disks (MO)), optical storage (such as CD, DVD, BD and HVD), and semiconductor storage (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH) and solid-state drives (SSD)).

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solution, in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features therein. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A fault location method for a network file system, comprising:

collecting a real-time overall delay and a real-time server storage delay, and collecting NFS protocol data packets through tcpdump at a client and a server to obtain original data;

performing unique xid field matching on the original data to obtain complete transfer information of each IO atomic operation;

obtaining average delay data for corresponding time points based on the complete transfer information of all IO atomic operations, wherein the average delay data comprises an overall average delay, a server storage average delay, a server average delay, a network average delay and a client average delay;

determining threshold delay data, wherein the threshold delay data comprises an overall delay threshold, a server storage delay threshold, a server delay threshold, a network delay threshold and a client delay threshold;

and comparing the overall average delay with the overall delay threshold, and if a first difference obtained by subtracting the overall delay threshold from the overall average delay is greater than a first preset threshold, triggering, one by one, threshold judgment of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time points, to obtain an analysis result for feeding back the fault position.
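
As an illustration of the averaging step in claim 1, the following is a minimal sketch that groups per-IO delay samples into fixed-width time buckets and averages each metric per bucket. The function name `average_by_time_point`, the metric keys, the 60-second interval and the sample values are all assumptions made for this example; the claim does not fix a bucket width or a data layout.

```python
from collections import defaultdict

# Hypothetical metric names mirroring the five average delays in claim 1.
METRICS = ("overall", "server_storage", "server", "network", "client")


def average_by_time_point(samples, interval_s=60):
    """Average each delay metric over fixed-width time buckets.

    samples: iterable of (timestamp_s, {metric: delay_ms}) pairs.
    Returns {bucket_start_s: {metric: mean_delay_ms}}.
    """
    buckets = defaultdict(list)
    for timestamp_s, delays in samples:
        # Map the timestamp to the start of its bucket (the "time point").
        bucket_start = int(timestamp_s // interval_s) * interval_s
        buckets[bucket_start].append(delays)
    return {
        start: {m: sum(d[m] for d in rows) / len(rows) for m in METRICS}
        for start, rows in sorted(buckets.items())
    }


# Made-up sample data: two IOs in the first minute, one in the second.
samples = [
    (5, {"overall": 10, "server_storage": 2, "server": 4, "network": 3, "client": 3}),
    (30, {"overall": 20, "server_storage": 4, "server": 8, "network": 5, "client": 7}),
    (65, {"overall": 12, "server_storage": 2, "server": 6, "network": 4, "client": 2}),
]
averages = average_by_time_point(samples)
print(averages)
```

Each bucket start then serves as the "corresponding time point" against which the thresholds are compared.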

2. The fault location method for a network file system according to claim 1, wherein collecting the NFS protocol data packets through tcpdump at the client and the server to obtain the original data, and performing unique xid field matching on the original data to obtain the complete transfer information of each IO atomic operation, comprises:

intercepting RPC call request data packets of the NFS storage client and the server through the tcpdump tool, and screening out all IO data packets of NFS operations of the client and the server according to NFS protocol characteristics, wherein the information of each IO data packet comprises a path, a timestamp, a source address, a destination address, an xid field, an atomic operation type and data packet content;

and performing xid field matching on all the IO data packets to obtain the complete transfer information of each IO atomic operation having the same xid field, wherein the complete transfer information is an entire IO path comprising four processes: the IO data packet leaving the client, entering the server, leaving the server, and entering the client.
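
The xid matching of claim 2 can be sketched as follows, assuming the captured packets have already been parsed into records. The `Packet` type, the hop labels in `HOPS` and the function `match_by_xid` are hypothetical names for this illustration, and the packet values are made up. The sketch keys on the xid alone, as the claim does; with several clients, a composite key such as client address plus xid would probably be needed, since an RPC xid is chosen by each client.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for the four processes named in claim 2.
HOPS = ("client_out", "server_in", "server_out", "client_in")


@dataclass(frozen=True)
class Packet:
    xid: int          # RPC transaction id shared by a call and its reply
    hop: str          # one of HOPS (assumed derived from capture side and direction)
    timestamp: float  # capture time in seconds
    op: str           # atomic operation type, e.g. "READ"


def match_by_xid(packets):
    """Group packets by xid and keep only operations seen at all four hops."""
    by_xid = defaultdict(dict)
    for p in packets:
        # Keep the first packet seen per hop; retransmissions are ignored here.
        by_xid[p.xid].setdefault(p.hop, p)
    return {
        xid: {hop: hops[hop].timestamp for hop in HOPS}
        for xid, hops in by_xid.items()
        if all(hop in hops for hop in HOPS)
    }


packets = [
    Packet(0x1A, "client_out", 10.000, "READ"),
    Packet(0x1A, "server_in", 10.002, "READ"),
    Packet(0x1A, "server_out", 10.010, "READ"),
    Packet(0x1A, "client_in", 10.013, "READ"),
    Packet(0x2B, "client_out", 10.020, "WRITE"),  # reply never captured
]
complete = match_by_xid(packets)
print(complete)
```

Operations that were not observed at all four hops, such as xid `0x2B` above, are dropped because they do not form a complete IO path.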

3. The fault location method for a network file system according to claim 2, wherein obtaining the average delay data for the corresponding time points based on the complete transfer information of all IO atomic operations comprises:

for the complete transfer information of each IO atomic operation, calculating a server time consumption, a network time consumption and a client time consumption within a unit time interval by using the timestamps of the IO data packets leaving and entering the client and the server;

taking the server time consumption as the server average delay;

taking the network time consumption as the network average delay;

taking the client time consumption as the client average delay;

wherein the server time consumption is the time a complete IO takes between entering the server and leaving the server; the network time consumption is the time a complete IO request actually spends in transit on the network link for one round trip across the four processes of leaving the client, entering the server, leaving the server and entering the client; and the client time consumption is obtained by subtracting the server time consumption and the network time consumption from the overall time consumption of a complete IO.
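
The decomposition in claim 3 can be illustrated with a short sketch. It assumes that the overall time consumption is the separately collected end-to-end delay of the IO (claim 1 collects a real-time overall delay), because if it were taken from the same capture timestamps the client remainder would always be zero. It also assumes that the client and server clocks are synchronized, so that timestamps from both hosts can be subtracted. `IoTimestamps`, `split_delay` and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoTimestamps:
    """Capture times of one IO, in microseconds, at the four hops."""
    client_out: int
    server_in: int
    server_out: int
    client_in: int


def split_delay(ts, overall_us):
    """Split one IO's overall delay into server, network and client parts."""
    # Time between the request entering and the reply leaving the server.
    server = ts.server_out - ts.server_in
    # Request leg plus reply leg; assumes both hosts share a synchronized clock.
    network = (ts.server_in - ts.client_out) + (ts.client_in - ts.server_out)
    # Remainder of the separately measured overall delay.
    client = overall_us - server - network
    return {"server": server, "network": network, "client": client}


ts = IoTimestamps(client_out=0, server_in=2_000, server_out=10_000, client_in=13_000)
parts = split_delay(ts, overall_us=15_000)
print(parts)  # {'server': 8000, 'network': 5000, 'client': 2000}
```

Integer microseconds are used here only to keep the arithmetic exact in the example.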

4. The fault location method for a network file system according to claim 1, wherein triggering, one by one, the threshold judgment of the server average delay, the network average delay and the client average delay at the corresponding time points to obtain the analysis result for feeding back the fault position comprises:

comparing the server average delay with the server delay threshold, and if a second difference obtained by subtracting the server delay threshold from the server average delay is less than or equal to a second preset threshold, determining that the server is normal;

and if the second difference obtained by subtracting the server delay threshold from the server average delay is greater than the second preset threshold, comparing the server storage average delay with the server storage delay threshold.

5. The fault location method for a network file system according to claim 4, wherein comparing the server storage average delay with the server storage delay threshold comprises:

if a third difference obtained by subtracting the server storage delay threshold from the server storage average delay is less than or equal to a third preset threshold, determining that metadata access is abnormal;

if the third difference obtained by subtracting the server storage delay threshold from the server storage average delay is greater than the third preset threshold, determining that data storage is abnormal.

6. The fault location method for a network file system according to claim 1, wherein triggering, one by one, the threshold judgment of the server average delay, the network average delay and the client average delay at the corresponding time points to obtain the analysis result for feeding back the fault position comprises:

comparing the network average delay with the network delay threshold, and if a fourth difference obtained by subtracting the network delay threshold from the network average delay is less than or equal to a fourth preset threshold, determining that the network link is normal;

if the fourth difference obtained by subtracting the network delay threshold from the network average delay is greater than the fourth preset threshold, determining that the network link is abnormal.

7. The fault location method for a network file system according to claim 1, wherein triggering, one by one, the threshold judgment of the server average delay, the network average delay and the client average delay at the corresponding time points to obtain the analysis result for feeding back the fault position comprises:

comparing the client average delay with the client delay threshold, and if a fifth difference obtained by subtracting the client delay threshold from the client average delay is less than or equal to a fifth preset threshold, determining that the client link is normal;

if the fifth difference obtained by subtracting the client delay threshold from the client average delay is greater than the fifth preset threshold, determining that the client link is abnormal.
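
Taken together, the judgments of claims 1 and 4 to 7 form a small decision cascade, sketched below under stated assumptions. The function names `exceeds` and `locate_fault`, the dictionary keys, the result strings and all numeric values are hypothetical. The `margin` dictionary stands in for the first to fifth preset thresholds.

```python
def exceeds(average, threshold, margin):
    """True when (average - threshold) is larger than the preset margin."""
    return average - threshold > margin


def locate_fault(avg, thr, margin):
    """Return the abnormal components found for one time point.

    avg, thr and margin are dicts keyed by metric name; margin holds the
    first to fifth preset thresholds under the hypothetical keys used here.
    """
    if not exceeds(avg["overall"], thr["overall"], margin["overall"]):
        return []  # overall delay within bounds, no further checks triggered
    findings = []
    # Claims 4 and 5: server check, then storage check to tell the two causes apart.
    if exceeds(avg["server"], thr["server"], margin["server"]):
        if exceeds(avg["server_storage"], thr["server_storage"], margin["server_storage"]):
            findings.append("data storage abnormal")
        else:
            findings.append("metadata access abnormal")
    # Claim 6: network link check.
    if exceeds(avg["network"], thr["network"], margin["network"]):
        findings.append("network link abnormal")
    # Claim 7: client link check.
    if exceeds(avg["client"], thr["client"], margin["client"]):
        findings.append("client link abnormal")
    return findings


# Made-up values in milliseconds for a single time point.
avg = {"overall": 30, "server": 18, "server_storage": 3, "network": 4, "client": 2}
thr = {"overall": 20, "server": 8, "server_storage": 5, "network": 5, "client": 3}
margin = {"overall": 5, "server": 2, "server_storage": 1, "network": 1, "client": 1}
result = locate_fault(avg, thr, margin)
print(result)  # ['metadata access abnormal']
```

In this example the server delay is over its limit while the storage delay is not, so the sketch reports a metadata access abnormality, matching the branch described in claim 5.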

8. A fault location device for a network file system, comprising:

a data acquisition module, configured to collect a real-time overall delay and a real-time server storage delay, and to collect NFS protocol data packets through tcpdump at a client and a server to obtain original data; and to perform unique xid field matching on the original data to obtain complete transfer information of each IO atomic operation;

a data extraction module, configured to obtain average delay data for corresponding time points based on the complete transfer information of all IO atomic operations, wherein the average delay data comprises an overall average delay, a server storage average delay, a server average delay, a network average delay and a client average delay;

a threshold configuration module, configured to determine threshold delay data, wherein the threshold delay data comprises an overall delay threshold, a server storage delay threshold, a server delay threshold, a network delay threshold and a client delay threshold;

and a performance analysis and location judgment module, configured to compare the overall average delay with the overall delay threshold, and if a first difference obtained by subtracting the overall delay threshold from the overall average delay is greater than a first preset threshold, to trigger, one by one, threshold judgment of the server storage average delay, the server average delay, the network average delay and the client average delay at the corresponding time points, to obtain an analysis result for feeding back the fault position.

9. An electronic device comprising a processor and a memory storing a computer program, characterized in that the processor, when executing the computer program, implements the fault location method for a network file system according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the fault location method for a network file system according to any one of claims 1 to 7.