CN121209801A

CN121209801A - Storage system management method, apparatus, device, storage medium, and program product

Info

Publication number: CN121209801A
Application number: CN202511758948.8A
Authority: CN
Inventors: 王军; 杜思瑶
Original assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2025-11-27
Filing date: 2025-11-27
Publication date: 2025-12-26

Abstract

The application discloses a storage system management method, apparatus, device, storage medium, and program product, and relates to the technical field of storage. In the method, a monitoring program running in the host kernel is developed and associated with at least some protocol layers of the storage protocol stack, so that the monitoring program can collect context data of a target event when the target event occurs in an associated protocol layer. This effectively adds a collection channel for low-level operation data to a general-purpose storage system. Where the general-purpose storage system returns only high-dimensional (coarse-grained) data, the low-level running state of the storage system can be determined from the data collected by the monitoring program, and maintenance personnel no longer need to perform a second round of data collection, so the management efficiency of the storage system can be greatly improved. This addresses the problems, present in some technologies, that the running information returned by a general-purpose storage system is of high dimension and that storage system management efficiency is low.

Description

Storage system management method, apparatus, device, storage medium, and program product

Technical Field

The present application relates to the field of storage technologies, and in particular, to a storage system management method, apparatus, device, storage medium, and program product.

Background

Currently, storage systems are divided into general-purpose storage systems and custom storage systems. A general-purpose storage system is inexpensive and can meet storage requirements in common scenarios, but the operation information it returns is of high dimension (coarse-grained), so maintenance personnel cannot directly obtain low-level operation information of the storage system, and management efficiency is therefore low. A custom storage system is expensive, but its hardware firmware (such as the firmware in a hard disk controller or in an expander) and each protocol layer can be highly customized for a specific application scenario, so that the storage system returns information of a designated dimension and management efficiency can be greatly improved. Accordingly, obtaining low-dimensional running information from a general-purpose storage system is key to improving storage system management efficiency.

Disclosure of Invention

The application provides a storage system management method, a storage system management device, electronic equipment, a computer readable storage medium and a computer program product, which at least solve the problems of higher dimension of running information returned by a general storage system and low storage system management efficiency in the related technology.

The application provides a storage system management method, which comprises the following steps:

collecting operation data of a storage protocol stack through a monitoring program, wherein the monitoring program and the storage protocol stack are operated in a host kernel, the monitoring program is associated with at least part of protocol layers of the storage protocol stack, and when a target event occurs in the associated protocol layer, the context data of the target event is collected and used as the operation data of the storage protocol stack, and the target events of different protocol layers are different;

determining a system state of a storage system based on the operation data of the storage protocol stack;

and managing the storage system based on the system state.

The application also provides a storage system management device, comprising:

The data acquisition module is used for acquiring the operation data of a storage protocol stack through a monitoring program, wherein the monitoring program and the storage protocol stack are operated in a host kernel, the monitoring program is associated with at least part of protocol layers of the storage protocol stack, and when a target event occurs in the associated protocol layers, the context data of the target event is acquired as the operation data of the storage protocol stack, and the target events of different protocol layers are different;

the system state determining module is used for determining the system state of the storage system based on the operation data of the storage protocol stack;

and the system management module is used for managing the storage system based on the system state.

The application also provides an electronic device, which comprises a memory for storing a computer program and a processor for realizing the steps of the storage system management method when executing the computer program.

The present application also provides a computer readable storage medium having a computer program stored therein, wherein the computer program when executed by a processor implements the steps of the storage system management method described above.

The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the storage system management method described above.

In some embodiments of the present application, a monitoring program running in the host kernel is developed and associated with at least some protocol layers of a storage protocol stack, so that when a target event occurs in an associated protocol layer, the monitoring program can collect context data of the target event. The context data may itself be low-level operation data of the storage system, or may reflect the low-level running state of the storage system; in effect, a collection channel for low-level operation data is added to the general-purpose storage system. Where the general-purpose storage system returns only high-dimensional data, the low-level running state of the storage system can be determined from the data collected by the monitoring program, and maintenance personnel no longer need to perform a second round of data collection, so the management efficiency of the storage system can be greatly improved. This addresses the problems, present in some technologies, that the running information returned by a general-purpose storage system is of high dimension and that storage system management efficiency is low.

Drawings

For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

FIG. 1 is a schematic diagram of an eBPF architecture provided in some embodiments of the present application;

FIG. 2 is a flow chart of a storage system management method according to some embodiments of the present application;

FIG. 3 is a schematic diagram illustrating an architecture of a SAS storage system provided in accordance with some embodiments of the present application;

FIG. 4 is a schematic diagram illustrating an architecture of a SAS storage system provided in accordance with further embodiments of the present application;

FIG. 5 is a block diagram of a storage system management device according to some embodiments of the present application;

FIG. 6 is a schematic block diagram of an electronic device according to some embodiments of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.

It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.

Before the method of the present application is explained, the eBPF (extended Berkeley Packet Filter) architecture is described. FIG. 1 is a schematic diagram of an eBPF architecture provided according to some embodiments of the present application. In FIG. 1, the eBPF architecture includes an eBPF program, an eBPF map, and a user-mode management program. The eBPF program runs in the host kernel. The user-mode management program runs in the user space of the operating system.

Using the dynamic hooking techniques Kprobes/Kretprobes, an eBPF program can be attached to the entry or exit of a function of a kernel-mode program. Using the dynamic hooking techniques Uprobes/Uretprobes, an eBPF program can be attached to the entry or exit of a function of a user-mode application. A kernel-mode program is a program running in the host kernel, such as a storage protocol stack. A user-mode application is a program running in the user space of the operating system, such as a storage management program (e.g., a file system). Different functions implement different operations. For example, function F21 receives data to be written into the storage system, verifies the data, and outputs the verification result. Function F22 assembles command frames as specified by the storage protocol.

After an eBPF program is attached to a function, the eBPF program is triggered to collect data when the function is called or when the function outputs its result. The data acquisition logic, such as the target data to be collected and the acquisition frequency, may be specified in the eBPF program according to actual needs. eBPF programs may differ between application scenarios; that is, an eBPF program may be developed to suit the actual needs.

The eBPF map is a data storage structure that enables data exchange between the eBPF program and the user-mode management program. The eBPF program can write the collected data into the eBPF map, and the user-mode management program can read that data from the eBPF map and process it. For example, the user-mode management program aggregates the data collected by the eBPF program with one day as the statistical dimension and then displays the result. For another example, when the data collected by the eBPF program exceeds a data threshold, the user-mode management program controls the storage system according to preset control logic.
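The two processing examples above (per-day statistics and a threshold check) can be illustrated with a minimal Python sketch. This is not the implementation of the application: the eBPF map is stood in for by a plain list of records, and the record fields (`ts`, `event_id`), the function names, and the threshold value are all hypothetical.

```python
from collections import Counter

SECONDS_PER_DAY = 86400

def count_events_per_day(records):
    """Aggregate collected records with one day as the statistical dimension.

    Each record is a dict with a hypothetical "ts" field (seconds since the
    epoch). The key of the result is the day index (ts // 86400).
    """
    counts = Counter()
    for rec in records:
        counts[rec["ts"] // SECONDS_PER_DAY] += 1
    return dict(counts)

def days_over_threshold(counts, threshold):
    """Return the day indices whose event count exceeds the data threshold."""
    return [day for day, n in sorted(counts.items()) if n > threshold]

# Stand-in for data read from the eBPF map (hypothetical record layout).
sample_records = [
    {"ts": 10, "event_id": "cmd_timeout"},
    {"ts": 20, "event_id": "cmd_timeout"},
    {"ts": SECONDS_PER_DAY + 5, "event_id": "cmd_timeout"},
]
```

In a real deployment the records would be read from the eBPF map by the user-mode management program, and the control action taken when a threshold is exceeded would follow the preset control logic.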

Briefly, the eBPF program can be regarded as a sensor, used primarily for data acquisition. The user-mode management program can be regarded as the actual control program, which processes the data collected by the eBPF program.

The current state of storage systems is described below.

Currently, storage systems can be divided into general-purpose storage systems and custom-made storage systems. Storage system management may include, but is not limited to, failure localization, failure recovery, capacity monitoring and planning, data backup, performance index monitoring, and the like for the storage system.

At present, a general-purpose storage system has advantages such as low cost and wide applicability, but the running information it returns is of high dimension. When managing a general-purpose storage system, maintenance personnel generally take the high-dimensional operation information returned by the storage system as a reference, collect low-level operation information of the storage system through an information collection program, and manage the storage system based on the collected low-level operation information. Because maintenance personnel must perform this second round of information collection, managing a general-purpose storage system is less efficient. A custom storage system, by contrast, is highly customized relative to a general-purpose storage system. In a custom storage system, the hardware firmware (such as the firmware in a hard disk controller or in an expander) and each protocol layer of the storage protocol can be highly customized according to the requirements of the actual application scenario, so that the storage system returns running information of a specified dimension. Therefore, when managing a custom storage system, maintenance personnel can obtain operation information of the required dimension directly from the operation information returned by the storage system, without a second round of information collection, which greatly improves management efficiency. For ease of understanding, the difference between the operation information returned by a general-purpose storage system and by a custom storage system is described below, taking fault localization as an example.

Specifically, in a general-purpose storage system, multiple low-level error codes are typically normalized into one high-dimensional, ambiguous error code. When the system fails, only the high-dimensional error code is returned. Because one high-dimensional error code corresponds to multiple low-level error codes, maintenance personnel cannot directly locate the specific cause of the system fault. Instead, they take the high-dimensional error code as a reference, collect low-level operation information of the storage system through an information collection program, and locate the specific cause of the fault based on that information. For example, in a general-purpose storage system, multiple low-level errors such as hardware link disconnection, link negotiation failure, and device response timeout are normalized to "communication interruption". When maintenance personnel receive an error code representing communication interruption, they cannot directly determine whether the interruption was caused by hardware link disconnection, link negotiation failure, or device response timeout. They therefore collect low-level operation information, such as the hardware state of the storage system and the operation log of the hard disk controller, through the information collection program, and locate the specific cause of the communication interruption based on that information.
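The many-to-one normalization described above can be pictured with a short Python sketch. The error-code names are hypothetical placeholders chosen to match the three causes named in the text; they are not codes defined by the application or by any storage protocol.

```python
# Hypothetical low-level error codes and the single high-dimensional code
# that a general-purpose storage system reports for all of them.
NORMALIZED = {
    "LINK_DOWN": "COMM_INTERRUPTED",
    "LINK_NEGOTIATION_FAILED": "COMM_INTERRUPTED",
    "DEVICE_RESPONSE_TIMEOUT": "COMM_INTERRUPTED",
}

def candidate_causes(high_level_code):
    """Return every low-level code that maps to the reported high-level code."""
    return sorted(low for low, high in NORMALIZED.items() if high == high_level_code)
```

Because `candidate_causes("COMM_INTERRUPTED")` yields three candidates rather than one, the returned code alone cannot identify the cause, which is why additional low-level information is needed.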

In a custom storage system, the hardware firmware and each protocol layer can be highly customized according to the requirements of the actual application scenario, so that the storage system returns error codes of a specified dimension. Maintenance personnel can therefore locate the specific cause of a system fault directly from the error code returned by the storage system, without a second round of information collection, which greatly improves fault localization efficiency. For example, in a custom storage system in which the firmware of the hard disk controller has been customized, if communication is interrupted because of a device response timeout, the hard disk controller can return an error code representing device response timeout, so that maintenance personnel can locate the cause of the fault directly from the error code.

Based on the description, the application provides a storage system management method, which can directly obtain the low-dimensional operation information of a storage system in the operation process of the general storage system, and solves the problems of higher dimension of the operation information returned by the general storage system and low management efficiency of the storage system in the related technology. The storage system management method is applicable to hosts that need to perform data operations in a general-purpose storage system. The host may include, but is not limited to, a server, tablet, notebook, desktop, controller, etc. Specifically, in the host, a user mode management program may be included. The storage system management method of the present application can be implemented in the running process of the user mode management program.

Referring to fig. 2 in combination, a flowchart of a storage system management method according to some embodiments of the present application is shown. In fig. 2, the storage system management method includes the steps of:

Step S201, collecting operation data of a storage protocol stack through a monitoring program, wherein the monitoring program and the storage protocol stack are operated in a host kernel, the monitoring program is associated with at least part of protocol layers of the storage protocol stack, and when a target event occurs in the associated protocol layers, the context data of the target event is collected and used as the operation data of the storage protocol stack, and the target events of different protocol layers are different.

In this embodiment, the monitoring program is an eBPF (extended Berkeley Packet Filter) program. The storage protocol stack may include, but is not limited to, a NAS (Network Attached Storage) storage protocol stack, a SAS (Serial Attached SCSI) storage protocol stack, and the like.

Both the monitoring program and the storage protocol stack run in the host kernel. Each storage protocol stack may include at least one protocol layer. Each protocol layer may include one or more functions. Each function, when called, performs an operation that matches the logic of its protocol layer, and different functions perform different operations. Using the dynamic hooking techniques Kprobes/Kretprobes, the monitoring program can be attached to the entry or exit of a target function of a protocol layer. When the target function is called or outputs its result, a target event is considered to have occurred, and the monitoring program is triggered to collect the context data of the target event. Context data is data that characterizes attributes of the target event, such as the time at which the target event occurred, the ID of the object updated during the target event, the event ID, and the like. For ease of understanding, the SAS storage protocol stack is used as an example below.
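The entry/exit hooking idea can be illustrated with a user-space analogy in Python. This sketch is only an analogy: it uses a decorator rather than Kprobes/Kretprobes, it does not run in the kernel, and the names `probe`, `collected`, and `build_command_frame` are hypothetical.

```python
import functools
import time

collected = []  # stand-in for the eBPF map that receives context data

def probe(event_id):
    """Record context data at the entry and exit of the wrapped function."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            # Entry hook: the target function has been called.
            collected.append({"event_id": event_id, "phase": "entry",
                              "ts_ns": time.monotonic_ns()})
            result = func(*args, **kwargs)
            # Exit hook: the target function has produced its result.
            collected.append({"event_id": event_id, "phase": "exit",
                              "ts_ns": time.monotonic_ns()})
            return result
        return inner
    return wrap

@probe("build_command_frame")
def build_command_frame(payload):
    """Hypothetical target function; the b"HDR" prefix is a placeholder."""
    return b"HDR" + payload
```

Calling `build_command_frame(b"x")` leaves two records in `collected`, one for the entry and one for the exit, each carrying an event ID and a timestamp as context data. The hooked function itself is unchanged, which mirrors the point that the protocol stack does not need to be modified.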

FIG. 3 is a schematic diagram of the architecture of a SAS storage system according to some embodiments of the application. In FIG. 3, the storage system includes an application program and a user-mode management program running in user space, a SAS storage protocol stack and a monitoring program running in the kernel, and a hardware layer. For the monitoring program and the user-mode management program, refer to the related descriptions above, which are not repeated here.

The hardware layer may include, but is not limited to, the hardware and firmware of an HBA (Host Bus Adapter), the hardware and firmware of an expander, and the hardware and firmware of a hard disk. Specifically, the firmware of the HBA is the firmware running in the HBA controller. The firmware of the expander is the firmware running in the expander. The firmware of the hard disk is the firmware running in the hard disk controller.

Applications may include test software and conventional applications for reading and writing data in a storage system, which may include, but are not limited to, software such as a browser, database, etc. Test software refers to a program for testing a storage system.

The SAS storage protocol stack includes a kernel interface layer, a block device layer, a SCSI driver layer, a SCSI middle layer, a SAS transport layer, and an HBA driver layer. The functions of the respective protocol layers are described below.

Specifically, the kernel interface layer is configured to provide a unified storage system access interface for an application program, and specifically is configured to receive a storage system operation request issued by the application program. The storage system operation requests may include, but are not limited to, data read-write requests, data migration requests, hard disk management requests, and the like.

The block device layer is used for carrying out request scheduling, request merging and partition mapping. For example, the request order is optimized to reduce head seek time. For another example, multiple small requests are combined into one large request. For another example, logical addresses in the request are mapped to physical addresses.

The SCSI driver layer implements a general driver framework for the SCSI command set and is decoupled from specific hardware. For example, a request from the block device layer is converted into a SCSI command descriptor block (CDB). For another example, discovery and initialization of SCSI devices are performed.

The SCSI middle layer coordinates the SCSI higher-level drivers and the underlying transport protocol. For example, SCSI devices on the bus are scanned. For another example, request queue management and command concurrency control are performed.

The SAS transport layer handles frame transmission, address management, and link control specific to the SAS storage protocol. For example, SCSI target IDs are mapped to SAS addresses. For another example, storage system management is performed. For another example, the SCSI command descriptor block is encapsulated as a SAS protocol frame.

The HBA driver layer controls the hardware of the SAS host bus adapter and interacts directly with the hardware layer. For example, SAS frames are transmitted to the HBA hardware over a PCIe bus. For another example, HBA firmware is loaded and RAID functionality is configured. For another example, interrupt processing is performed in response to an interrupt signal from the HBA.

Each of the protocol layers described above may include one or more functions that implement the functionality of that layer. For example, the kernel interface layer may include functions C11 and C12. Function C11 receives a storage system operation request from an application program. Function C12 submits storage system operation requests to the block device layer. For another example, the block device layer may include functions C21 and C22. Function C21 maps logical addresses in a request to physical addresses. Function C22 merges multiple small requests into one large request. As another example, the SCSI driver layer may include functions C31, C32, and C33. Function C31 converts a request from the block device layer into a SCSI command descriptor block. Function C32 discovers SCSI devices. Function C33 initializes SCSI devices.

In practical applications, the target function to be monitored may be determined based on the management objective of the storage system, and the monitoring program may be attached to the entry or exit of the target function using the dynamic hooking techniques Kprobes/Kretprobes. When the target function is called or outputs its result, a target event is considered to have occurred, and the monitoring program can collect the context data of the target event. Examples follow.

For example, assume that performance management of a storage system is required (i.e., performance management is a management goal). After analyzing the functions of each protocol layer, the function characteristics of each protocol layer are found as follows:

Based on the function of the HBA driver layer, metrics such as transmission delay, queue depth, interrupt frequency, command completion time, hardware errors, etc. can be counted. These metrics may be used to assess performance bottlenecks of the storage system.

Based on the functions of the SAS transport layer, metrics such as error rate, signal strength, and expander routing delay can be collected. These metrics may be used to evaluate link layer performance issues of the storage system.

Based on the function of the block device layer, the throughput of the storage system and the request queue waiting time can be counted. These metrics may be used to evaluate the overall performance of the storage system and identify whether the request scheduling policy is reasonable.

The metrics obtainable from the functions of the kernel interface layer, the SCSI driver layer, and the SCSI middle layer are unrelated or only weakly related to the performance pressure of the storage system.

Based on the above analysis, at least some of the functions of the HBA driver layer, the SAS transport layer, and the block device layer may be taken as target functions, and the monitoring program may be attached to the entry or exit of each target function. In this way, when the functions of the HBA driver layer, the SAS transport layer, and the block device layer are called, the monitoring program can collect context data of the target events, such as interrupt times, the start time and end time of each command's execution, and signal strength at various points in time. After the monitoring program writes the collected data into the eBPF map, the user-mode management program analyzes and processes that data to determine the performance pressure of the storage system.
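As one illustration of how the user-mode management program might derive a metric from collected context data, the following Python sketch pairs entry and exit records to compute a per-command completion time. The record layout (`cmd_id`, `phase`, `ts_ns`) is an assumption made for this example and is not specified by the application.

```python
def completion_times(records):
    """Pair entry/exit records by command ID and return durations in ns.

    Records are dicts with hypothetical fields "cmd_id", "phase"
    ("entry" or "exit"), and "ts_ns". Exit records without a matching
    entry are ignored.
    """
    starts = {}
    durations = {}
    for rec in records:
        cmd = rec["cmd_id"]
        if rec["phase"] == "entry":
            starts[cmd] = rec["ts_ns"]
        elif rec["phase"] == "exit" and cmd in starts:
            durations[cmd] = rec["ts_ns"] - starts.pop(cmd)
    return durations

def mean_completion_time(durations):
    """Average completion time in ns, or None when nothing completed."""
    if not durations:
        return None
    return sum(durations.values()) / len(durations)
```

A rising mean completion time over successive windows would be one possible signal of increasing performance pressure; the actual evaluation criteria are left open by the text.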

For another example, assume that a storage system needs to be failure managed (i.e., failure management is a management goal). After analyzing the functions of each protocol layer, the function characteristics of each protocol layer are found as follows:

Based on the functions of the SCSI middle layer, SCSI command failure logs and device response timeout counts can be obtained. This information can be used to diagnose SCSI protocol parsing errors.

Topology change events for the expander may be obtained based on functions of the SAS transport layer. The information can be used for diagnosing errors such as link disconnection, expander configuration errors and the like.

Firmware crash logs may be obtained based on the function of the HBA driver layer. Such information may be used to locate the cause of the hardware failure.

The information obtainable from the functions of the kernel interface layer, the block device layer, and the SCSI driver layer is unrelated or only weakly related to fault localization of the storage system.

Based on the above analysis, at least some of the functions of the SCSI middle layer, the SAS transport layer, and the HBA driver layer may be taken as target functions, and the monitoring program may be attached to the entry or exit of each target function. Thus, when the functions of the SCSI middle layer, the SAS transport layer, and the HBA driver layer are called, the monitoring program can collect context data of the target events, such as firmware crash logs, expander configuration errors, and the like. After the monitoring program writes the collected data into the eBPF map, the user-mode management program analyzes and processes that data to obtain the cause of the storage system failure.

Step S202, based on the operation data of the storage protocol stack, determining the system state of the storage system.

Specifically, as described in step S201, after the user-mode management program analyzes and processes the data collected by the monitoring program, the system state of the storage system can be determined. The system state is related to the management objective of the storage system. For example, when failure management is the management objective, the system state may include whether the storage system has failed, the cause of the failure, the failure time, and the like. When performance management is the management objective, the system state may include the magnitude of the performance pressure of the storage system, a performance pressure trend chart, and the like.

Step S203, managing the storage system based on the system state.

For example, in the case of a failure of the storage system, the failure processing may be performed on the storage system after analyzing the cause of the failure based on the data collected by the monitoring program. For another example, when the performance pressure of the storage system is high, a system performance alarm may be generated to prompt maintenance personnel to process in time.
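Steps S202 and S203 can be pictured as a small dispatch from a determined system state to management actions. The following Python sketch is illustrative only: the state fields (`failed`, `cause`, `pressure`), the action strings, and the default pressure limit of 0.8 are all assumptions, not values given in the application.

```python
def manage(state, pressure_limit=0.8):
    """Map a system state (hypothetical dict layout) to management actions."""
    actions = []
    if state.get("failed"):
        # Failure management: handle the fault according to its cause.
        actions.append("handle_fault:" + state.get("cause", "unknown"))
    if state.get("pressure", 0.0) > pressure_limit:
        # Performance management: alert maintenance personnel.
        actions.append("raise_performance_alarm")
    return actions
```

For instance, a state of `{"failed": True, "cause": "device_response_timeout"}` yields a single fault-handling action, while `{"pressure": 0.9}` yields a performance alarm.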

In summary, in the technical solutions of some embodiments of the present application, a monitoring program running in the host kernel is developed and associated with at least some protocol layers of a storage protocol stack, so that when a target event occurs in an associated protocol layer, the monitoring program can collect context data of the target event. The context data may itself be low-level operation data of the storage system, or may reflect the low-level running state of the storage system; in effect, a collection channel for low-level operation data is added to the general-purpose storage system. Where the general-purpose storage system returns only high-dimensional data, the low-level running state of the storage system can be determined from the data collected by the monitoring program, and maintenance personnel no longer need to perform a second round of data collection, so the management efficiency of the storage system can be greatly improved. This addresses the problems, present in some technologies, that the running information returned by a general-purpose storage system is of high dimension and that storage system management efficiency is low.

Furthermore, the application collects the operation data of the storage protocol stack by associating the monitoring program with protocol layers, so the existing architecture of the general-purpose storage system does not need to be changed; for example, neither the architecture of the storage protocol stack nor the firmware logic of the hardware layer needs to be changed. Therefore, on the one hand, the general-purpose storage system retains its advantages of low cost and wide applicability; on the other hand, it can achieve management efficiency comparable to that of a custom storage system.

In some embodiments, the determining the system state of the storage system based on the operation data of the storage protocol stack may include:

acquiring a reference system state returned by the storage system;

determining the degree of association between different protocol layers and the reference system state based on the reference system state, and determining the operation data weights corresponding to the different protocol layers based on the degree of association, wherein the operation data weight represents the importance of the operation data collected from each protocol layer in the system state determination process;

and performing, based on the operation data weights, a weighted fusion calculation on the operation data collected from the different protocol layers to obtain the system state.

Specifically, the reference system state refers to the high-dimensional data originally returned by the storage system. Although this high-dimensional data does not reflect the underlying operating logic of the storage system, it can indicate the general direction of the system state. For example, when the storage system returns an error code indicating a communication interruption, it can be determined that the fault is communication-related, although the specific cause of the interruption cannot be determined. For another example, when the storage system returns an error code indicating excessive performance pressure, it can be determined that the fault is related to performance pressure, although the specific cause of the excessive pressure cannot be determined. That is, an initial range of system states can be determined based on the reference system state returned by the storage system.

Based on the reference system state, the degree of association between each protocol layer and the reference system state can be determined, and the weight of each protocol layer is determined according to that degree of association. Specifically, a protocol layer that is more strongly associated with the reference system state may have a higher operation data weight, and a protocol layer that is less strongly associated may have a lower operation data weight. For example, where the reference system state is a communication failure, the SCSI middle layer, the SAS transport layer, and the HBA driver layer may have larger operation data weights, and the kernel interface layer, the block device layer, and the SCSI driver layer may have smaller operation data weights. For another example, where the reference system state is excessive performance pressure, the HBA driver layer, the SAS transport layer, and the block device layer may have larger operation data weights, and the kernel interface layer, the SCSI driver layer, and the SCSI middle layer may have smaller operation data weights.

When the operation data of each protocol layer collected by the monitoring program is analyzed and processed, the analysis can focus on the operation data of protocol layers with larger operation data weights, while the operation data of protocol layers with smaller operation data weights serves as a reference.

The system state determined based on the operation data of the storage protocol stack is a sub-state of the reference system state. For example, when the reference system state is a communication failure, the determined sub-state may be a hardware link disconnection, a link negotiation failure, or the like. Maintenance personnel can thus readily determine the specific cause of the fault.

In the above embodiment, the degree of association between different protocol layers and the reference system state is determined based on the reference system state returned by the storage system, the operation data weights corresponding to the different protocol layers are determined based on the degree of association, and a weighted fusion calculation is performed on the operation data collected from the different protocol layers based on those weights to obtain the system state. In this way, key information is less likely to be overlooked, and the accuracy of the determined system state is improved.

In some embodiments, the performing weighted fusion calculation on the operation data collected from different protocol layers based on the operation data weight to obtain a system state may include:

and searching at least one protocol layer with the operation data weight higher than a first weight threshold based on the operation data weight, and carrying out weighted fusion calculation on the operation data acquired from the at least one protocol layer to obtain a system state.

Specifically, this embodiment amounts to discarding the operation data of protocol layers with lower operation data weights and determining the system state from the operation data of protocol layers with higher operation data weights. This can greatly reduce the amount of data to be processed, improve data processing efficiency, and further improve the management efficiency of the storage system.
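The layer-weighting embodiments above can be illustrated with a minimal Python sketch. It is only one possible reading: the description does not define the fusion formula, so a normalized weighted average is assumed here, and the weight values, the per-layer anomaly scores, and the names `LAYER_WEIGHTS` and `fuse_layer_scores` are hypothetical.

```python
from typing import Dict

# Hypothetical association table: reference system state -> per-layer weight.
# Layer names follow the description; the numeric weights are invented.
LAYER_WEIGHTS: Dict[str, Dict[str, float]] = {
    "communication_failure": {
        "scsi_middle_layer": 0.9,
        "sas_transport_layer": 0.9,
        "hba_driver_layer": 0.8,
        "kernel_interface_layer": 0.2,
        "block_device_layer": 0.2,
        "scsi_driver_layer": 0.1,
    },
}


def fuse_layer_scores(reference_state: str,
                      layer_scores: Dict[str, float],
                      weight_threshold: float = 0.0) -> float:
    """Weighted average of per-layer scores (assumed fusion rule).

    Layers whose weight is not above weight_threshold are ignored, which
    corresponds to the "first weight threshold" embodiment.
    """
    weights = LAYER_WEIGHTS[reference_state]
    kept = {layer: w for layer, w in weights.items()
            if w > weight_threshold and layer in layer_scores}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no protocol layer exceeds the weight threshold")
    return sum(layer_scores[layer] * w for layer, w in kept.items()) / total


# Hypothetical per-layer anomaly scores in [0, 1] derived from collected data.
scores = {
    "scsi_middle_layer": 1.0,
    "sas_transport_layer": 1.0,
    "hba_driver_layer": 1.0,
    "kernel_interface_layer": 0.0,
    "block_device_layer": 0.0,
    "scsi_driver_layer": 0.0,
}
all_layers = fuse_layer_scores("communication_failure", scores)
top_layers = fuse_layer_scores("communication_failure", scores,
                               weight_threshold=0.5)
```

With the threshold applied, only the three strongly associated layers contribute, so less data has to be processed to reach the fused result.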

In some embodiments, determining the system state of the storage system based on the operational data of the storage protocol stack may include:

acquiring a reference system state returned by a storage system, wherein the system state determined based on the operation data of a storage protocol stack is a sub-state of the reference system state;

Determining the association degree of each target event and the reference system state based on the reference system state, and determining the context data weight corresponding to different target events based on the association degree, wherein the context data weight is used for representing the importance degree of the context data of different target events in the system state determination process;

and carrying out weighted fusion calculation on the context data of different target events based on the context data weight to obtain a system state.

The method for obtaining the system state by carrying out weighted fusion calculation on the context data of different target events based on the context data weight may include:

And searching at least one target event with the context data weight higher than a second weight threshold based on the context data weight, and carrying out weighted fusion calculation on the context data of the at least one target event to obtain the system state.

Specifically, the context data weight of a target event is similar in principle to the operation data weight of a protocol layer; the main difference is that, in this embodiment, the operation data is screened according to the degree of association between each target event and the reference system state. In this way, the accuracy of the operation data used to determine the system state can be further improved. For example, where the reference system state is a communication failure, although the SCSI middle layer, the SAS transport layer, and the HBA driver layer are each strongly associated with the communication failure, some target events within these layers may be only weakly associated with it. If screening relies on the protocol-layer operation data weight alone, the context data of those target events can be neither discarded nor given a lower weight, which increases the amount of data to be processed and reduces processing efficiency, or causes key information to be overlooked amid the excess data, and thus affects the accuracy of the determined system state. For another example, where the reference system state is a communication failure, although the kernel interface layer, the block device layer, and the SCSI driver layer are only weakly associated with the communication failure, a small number of target events in these protocol layers may be strongly associated with it. If screening relies on the protocol-layer operation data weight alone, the context data of those target events may be omitted, or cannot be given a higher weight, which again affects the accuracy of the determined system state.

In view of this, in this embodiment, the context data weight of each target event is determined based on the degree of association between the target event and the reference system state, and a weighted fusion calculation is performed on the context data of the different target events based on the context data weights to obtain the system state. In this way, on the one hand, within the operation data of protocol layers strongly associated with the reference system state, the context data of weakly associated target events can be further discarded or given a lower weight; on the other hand, within the operation data of protocol layers weakly associated with the reference system state, the context data of strongly associated target events can be given a higher weight. This ensures the accuracy of the operation data used to determine the system state, and further improves both system management efficiency and the accuracy of the determined system state.
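The contrast between layer-level and event-level screening described above can be sketched as follows. This is an illustration under assumptions: the event names, weights, and scores are invented, and the fusion rule (a normalized weighted average) is not specified by the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventSample:
    layer: str           # protocol layer in which the target event occurred
    event: str           # hypothetical target event name
    score: float         # hypothetical anomaly score derived from context data
    layer_weight: float  # operation data weight of the layer (invented)
    event_weight: float  # context data weight of the event (invented)


def select_by_layer(samples: List[EventSample], threshold: float) -> List[EventSample]:
    """Keep every event of a layer whose layer weight exceeds the threshold."""
    return [s for s in samples if s.layer_weight > threshold]


def select_by_event(samples: List[EventSample], threshold: float) -> List[EventSample]:
    """Keep events whose own context data weight exceeds the threshold."""
    return [s for s in samples if s.event_weight > threshold]


def fuse(samples: List[EventSample]) -> float:
    """Weighted average of event scores using the context data weights."""
    total = sum(s.event_weight for s in samples)
    if total == 0:
        raise ValueError("no target event selected")
    return sum(s.score * s.event_weight for s in samples) / total


# Reference system state assumed to be a communication failure.
samples = [
    EventSample("sas_transport_layer", "link_reset", 0.9, 0.9, 0.95),
    EventSample("sas_transport_layer", "queue_depth_sample", 0.1, 0.9, 0.10),
    EventSample("scsi_driver_layer", "command_timeout", 0.8, 0.1, 0.85),
]

by_layer = [s.event for s in select_by_layer(samples, 0.5)]
by_event = [s.event for s in select_by_event(samples, 0.5)]
fused = fuse(select_by_event(samples, 0.5))
```

In this invented data set, layer-level screening keeps the weakly relevant `queue_depth_sample` and drops the relevant `command_timeout`, while event-level screening does the opposite, which is the situation the embodiment is meant to handle.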

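In some embodiments, determining the system state of the storage system based on the operation data of the storage protocol stack may include:

predicting the system state of the storage system in a future target period based on the operation data of the storage protocol stack in a target history period.

Correspondingly, managing the storage system based on the system state may include: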

If the system state of the storage system in the future target period is the first system state, determining a first management strategy aiming at the first system state, wherein the first management strategy is used for preventing the storage system from being converted into the first system state;

and managing the storage system based on the first management policy.

Specifically, the operation data of the storage protocol stack in the target history period can be input into a trained neural network model, which predicts the system state of the storage system in the future target period. Because a neural network model captures nonlinear relationships well, predicting the system state with such a model can improve the accuracy of the determined system state.

By predicting the system state of the storage system in a future target period, the storage system can be managed in advance. For example, the first system state may be a communication failure state. When a communication failure is predicted for the future target period, its cause can be eliminated in advance, for example by prompting maintenance personnel to replace the hardware link of the storage system ahead of time, so that the communication failure is avoided and the stability and reliability of the storage system are improved. For another example, the first system state may be excessive performance pressure. When excessive performance pressure is predicted for the future target period, the data operation requests dispatched to the storage system can be reduced before that period, so that the excessive pressure is avoided.

In the above embodiment, the system state of the storage system in the future target period is predicted based on the operation data of the storage protocol stack in the target history period, so that the system state of the storage system can be controlled in advance, the unexpected system state of the storage system is avoided, and the reliability and stability of the storage system are ensured.

In some embodiments, if the system state of the storage system in the future target period is the first system state, the duration for which the storage system will remain in the first system state may be predicted, and the storage system is managed based on the first management policy only if that duration exceeds a duration threshold. For example, where a transient communication interruption may occur in the future target period, because such an interruption is short, the storage system need not be managed in response to it; that is, the interruption is allowed to occur in the future target period. This avoids the instability that frequent management interventions would cause.

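In some embodiments, determining the system state of the storage system based on the operation data of the storage protocol stack may include:

determining the system state of the storage system in the current period based on the operation data of the storage protocol stack in the current period.

Correspondingly, managing the storage system based on the system state may include: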

If the system state of the storage system in the current period is the second system state, predicting the duration of the second system state based on the operation data of the storage protocol stack;

If the duration of the second system state is longer than the duration threshold, determining a second management policy for the second system state, the second management policy being used to transition the storage system from the second system state to a third system state;

and managing the storage system based on the second management policy.

Specifically, managing the storage system based on the second management policy when the duration of the second system state exceeds the duration threshold is similar to managing the storage system based on the first management policy when the duration of the first system state in the future target period exceeds the duration threshold, and is not described again here.
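The duration-gated decision shared by the first and second management policies can be sketched as below. The state names, policy names, and the function `choose_policy` are hypothetical; the predicted duration is assumed to come from whatever prediction model is used and is simply passed in.

```python
from typing import Dict, Optional

# Hypothetical policy tables. A first policy prevents a predicted state from
# occurring; a second policy moves the system out of a current state.
FIRST_POLICIES: Dict[str, str] = {
    "communication_failure": "prompt_link_replacement",
    "performance_pressure": "reduce_dispatched_requests",
}
SECOND_POLICIES: Dict[str, str] = {
    "communication_failure": "switch_to_standby_path",
}


def choose_policy(state: str,
                  predicted_duration_s: float,
                  duration_threshold_s: float,
                  policies: Dict[str, str]) -> Optional[str]:
    """Return a management policy only if the state is expected to last
    longer than the duration threshold; otherwise tolerate the state."""
    if predicted_duration_s <= duration_threshold_s:
        return None  # e.g. a transient interruption: no intervention
    return policies.get(state)


# A short predicted interruption is tolerated; a long one triggers a policy.
transient = choose_policy("communication_failure", 0.5, 5.0, FIRST_POLICIES)
sustained = choose_policy("communication_failure", 60.0, 5.0, FIRST_POLICIES)
current = choose_policy("communication_failure", 60.0, 5.0, SECOND_POLICIES)
```

Returning `None` for short durations reflects the stated goal of avoiding instability caused by managing the storage system too frequently.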

Referring to fig. 4, a schematic architecture diagram of a SAS storage system according to some embodiments of the application is provided. Fig. 4 is substantially similar to fig. 3; the main difference is that in fig. 4 the application programs may include a system management program of the storage system (also referred to herein as a hypervisor), test software, conventional application programs for reading and writing data in the storage system, an information collection program for collecting information in the storage system, and the like. For the conventional application programs and the test software, refer to the relevant description of fig. 3; they are not described again here. The system management program and the information collection program are described below.

The system management program may include, but is not limited to, file systems, multipath software, RAID (Redundant Array of Independent Disks) programs, and the like. It will be appreciated that the way these system management programs manage the storage system also affects the system state of the storage system. For example, when multipath software manages the physical paths of a storage system, an inaccurate management policy may cause problems such as communication failures in the storage system.

In view of this, in some embodiments, determining the system state of the storage system based on the operation data of the storage protocol stack may include:

Collecting operation data of at least part of the system management program through the monitoring program;

and carrying out fusion calculation on the operation data of the storage protocol stack and the operation data of the system management program to determine the system state of the storage system.

Specifically, the system management program may include a plurality of functions. Using the dynamic instrumentation mechanisms uprobes and uretprobes, the monitoring program can be attached to the entry or the return of at least some of the functions of the system management program.

In the above embodiment, a fusion calculation is performed on the operation data of the storage protocol stack and the operation data of the system management program, so that the policy by which the system management program manages the storage system is taken into account at the same time, and the accuracy of the determined system state can be further improved.

The information collection program may include, but is not limited to, the sg_utils program, the smp_utils program, the smartctl program, and the like. Different information collection programs may be used to collect different low-level operation information of the hardware layer. For example, the sg_utils program can be used to collect the capacity of a hard disk and to determine whether the storage system has been interrupted. The smp_utils program can be used to collect topology information of the storage system, basic device information of the expander, the error rate and signal quality of the storage system, and the like. The smartctl program can be used to collect firmware logs of the hardware layer and the like. The content of these hardware-layer firmware logs is reflected only indirectly in the operation data of the storage protocol stack; for example, based on the operation data of the storage protocol stack it may be inferred that a certain hard disk of the storage system is possibly damaged, but the specific operation information of that hard disk cannot be obtained. In view of this, managing the storage system based on the system state may include:

if the system state is the fourth system state, calling an information acquisition program to acquire at least part of equipment information of the storage system;

determining a third management policy for a fourth system state based on the device information and the operational data of the storage protocol stack;

And managing the storage system based on the third management policy.

Specifically, the fourth system state may refer to a state related to the underlying hardware of the storage system, for example a read-write failure of the storage system caused by a failure of the underlying hardware. Because the fourth system state is related to the underlying hardware, and the operation data of the storage protocol stack allows only a rough inference about possible hardware problems, the information acquisition program can be called in this case to acquire at least part of the device information of the storage system. The underlying hardware problem can then be evaluated accurately based on the acquired device information, and a more accurate third management policy for the fourth system state can be determined based on both the device information and the operation data of the storage protocol stack, thereby improving the accuracy of the management policy of the storage system.
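The on-demand use of the information acquisition program can be sketched as follows. The collector is passed in as a callable so that the sketch stays self-contained; in a real deployment it might wrap a tool such as smartctl, but no tool invocation or output format is assumed here. The state name, dictionary keys, and policy strings are hypothetical.

```python
from typing import Callable, Dict, Optional

DeviceInfo = Dict[str, Dict[str, object]]


def third_policy(system_state: str,
                 stack_data: Dict[str, str],
                 collect_device_info: Callable[[], DeviceInfo]) -> Optional[str]:
    """Combine protocol stack data with device information, but only when
    the system state is hardware-related (the "fourth system state")."""
    if system_state != "hardware_related_fault":
        return None  # device information is not collected for other states
    device_info = collect_device_info()
    suspect = stack_data["suspect_disk"]  # rough inference from stack data
    health = device_info.get(suspect, {})
    if health.get("healthy") is False:
        return f"replace_disk:{suspect}"
    # Disk looks healthy, so the fault is assumed to lie on the link instead.
    return f"inspect_link:{suspect}"


calls = []


def fake_collector() -> DeviceInfo:
    """Stand-in for an information acquisition program (invented data)."""
    calls.append("collected")
    return {"disk3": {"healthy": False}, "disk4": {"healthy": True}}


p_fault = third_policy("hardware_related_fault", {"suspect_disk": "disk3"}, fake_collector)
p_link = third_policy("hardware_related_fault", {"suspect_disk": "disk4"}, fake_collector)
p_other = third_policy("performance_pressure", {"suspect_disk": "disk3"}, fake_collector)
```

Calling the collector only in the hardware-related state keeps the extra collection cost out of the normal monitoring path.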

In some embodiments, the managing the storage system based on the management policy may include:

controlling at least part of the system management program based on the management policy, so as to manage the storage system through the system management program.

For example, when the hardware link allocation policy of the storage system needs to be adjusted based on the management policy, multipath software of the storage system can be controlled to adjust the hardware link allocation policy, so that the policy accuracy of the storage system is ensured.

Based on the above description, it can be understood that the user-mode management program of the present application can control the information acquisition program and the system management program in a coordinated manner based on the data collected by the monitoring program, so that the storage system can be managed more flexibly.

In some embodiments, the monitoring program comprises a plurality of sub-monitoring programs. The sub-monitoring programs are in one-to-one correspondence with the protocol layers, each sub-monitoring program being used to collect the operation data of its corresponding protocol layer; or the sub-monitoring programs are in one-to-one correspondence with the target events, each sub-monitoring program being used to collect the context data of its corresponding target event, where each protocol layer has one or more target events. In short, a sub-monitoring program can be developed separately for each protocol layer or each target event (i.e., function), and the sub-monitoring programs are decoupled from one another. Thus, when business requirements change, the data collection logic of one sub-monitoring program can be updated without affecting the others.

In some embodiments, collecting and storing, by the monitor, operational data of the protocol stack may include:

determining target operation data to be acquired based on a management target of a storage system;

determining a target sub-monitoring program to be operated based on the target operation data;

controlling the target sub-monitoring program to run, and controlling the sub-monitoring programs other than the target sub-monitoring program to stop.

Specifically, the user-mode management program can start a sub-monitoring program by loading it, and can stop a sub-monitoring program by unloading it.

In this embodiment, a plurality of sub-monitors may be developed for a plurality of management targets. For example, a first set of sub-monitors may be developed for fault management of the storage system, and a second set of sub-monitors may be developed for performance management of the storage system. When the management target is fault management, the user state management program can load a first group of sub-monitoring programs, so that the sub-monitoring programs collect protocol layer operation data related to fault management. When the management objective is performance management, the user state management program may load a second group of sub-monitoring programs, so that the sub-monitoring programs collect protocol layer operation data related to performance management.

In the above embodiment, the sub-monitoring programs are run selectively based on the management target of the storage system, which on the one hand reduces the amount of data collected and the pressure on the system, and on the other hand reduces the bandwidth consumed during data transmission.
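The selective loading of sub-monitoring programs by management target can be sketched as below. The group contents, sub-monitor names, and the `MonitorManager` class are hypothetical, and loading and unloading are represented by injected callables rather than by any real kernel interface.

```python
from typing import Callable, Dict, List, Set

# Hypothetical grouping of sub-monitoring programs by management target.
MONITOR_GROUPS: Dict[str, Set[str]] = {
    "fault_management": {"scsi_error_monitor", "sas_link_monitor"},
    "performance_management": {"block_latency_monitor", "sas_link_monitor"},
}


class MonitorManager:
    """User-mode manager that keeps only the target sub-monitors loaded."""

    def __init__(self, load: Callable[[str], None],
                 unload: Callable[[str], None]) -> None:
        self._load = load
        self._unload = unload
        self.loaded: Set[str] = set()

    def apply_target(self, target: str) -> None:
        wanted = MONITOR_GROUPS[target]
        # Stop sub-monitors that the new management target does not need.
        for name in sorted(self.loaded - wanted):
            self._unload(name)
        # Start sub-monitors that are needed but not yet running.
        for name in sorted(wanted - self.loaded):
            self._load(name)
        self.loaded = set(wanted)


log: List[str] = []
manager = MonitorManager(load=lambda n: log.append(f"load {n}"),
                         unload=lambda n: log.append(f"unload {n}"))
manager.apply_target("fault_management")
manager.apply_target("performance_management")
```

A sub-monitor shared by both groups stays loaded across the switch, so only the difference between the two sets is loaded or unloaded.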

In some embodiments, the methods of the present application further comprise:

After the operation data of the storage protocol stack are obtained, carrying out statistical processing on the operation data according to preset data processing dimensions, and displaying dimension data obtained by processing;

And/or, after obtaining the system state of the storage system, displaying the system state.

And/or after obtaining the operation data of the storage protocol stack, if the operation data is subjected to statistical processing according to a preset data processing dimension, and the obtained dimension data exceeds a dimension data threshold value, performing a first alarm;

And/or after obtaining the system state of the storage system, if the system state represents that the storage system has a fault, performing a second alarm.

And/or calling an information acquisition program to acquire at least part of equipment information of the storage system;

And displaying a topological graph of the storage system based on the equipment information, the operation data of the storage protocol stack and the system state.

Thus, maintenance personnel can conveniently check the running state of the system.
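The statistical processing by a preset dimension and the first alarm can be sketched as follows. The record fields, the choice of the mean as the statistic, and the threshold value are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List

Record = Dict[str, object]


def aggregate(records: List[Record], dimension: str, value_key: str) -> Dict[str, float]:
    """Group records by a preset dimension and average the chosen value."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        groups[str(record[dimension])].append(float(record[value_key]))
    return {key: sum(values) / len(values) for key, values in groups.items()}


def first_alarms(dimension_data: Dict[str, float], threshold: float) -> List[str]:
    """Return the dimension values whose statistic exceeds the threshold."""
    return sorted(key for key, value in dimension_data.items() if value > threshold)


# Hypothetical operation data collected from two protocol layers.
records = [
    {"layer": "block_device_layer", "latency_us": 100},
    {"layer": "block_device_layer", "latency_us": 300},
    {"layer": "hba_driver_layer", "latency_us": 900},
]
per_layer = aggregate(records, dimension="layer", value_key="latency_us")
alarms = first_alarms(per_layer, threshold=500.0)
```

The same `aggregate` function could be called with a different dimension, such as a device identifier, if the records carried that field.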

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of a program plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment.

Referring to fig. 5 in combination, a block diagram of a storage system management device according to some embodiments of the present application is provided. In fig. 5, the storage system management device includes:

The data acquisition module 501 is configured to acquire operation data of a storage protocol stack through a monitor program, where the monitor program and the storage protocol stack are operated in a host kernel, the monitor program is associated with at least a part of protocol layers of the storage protocol stack, and when a target event occurs in the associated protocol layer, acquire context data of the target event as the operation data of the storage protocol stack, where the target events of different protocol layers are different;

A system state determining module 502, configured to determine a system state of the storage system based on the operation data of the storage protocol stack;

The system management module 503 is configured to manage the storage system based on the system status.

In some embodiments, the system state determination module 502 is specifically configured to:

In some embodiments, the storage system has at least one hypervisor, and the system state determination module 502 is specifically configured to:

predicting the system state of the storage system in a future target period based on the operation data of the storage protocol stack in the target history period;

Or determining the system state of the storage system in the current period based on the operation data of the storage protocol stack in the current period.

In some embodiments, the system management module 503 is specifically configured to:

the storage system is managed based on the first management policy.

and managing the storage system based on the second management policy.

And managing the storage system based on the third management policy.

In some embodiments, the storage system has at least one hypervisor, and the system management module 503 is specifically configured to:

In some embodiments, the monitoring program comprises a plurality of sub-monitoring programs, wherein the sub-monitoring programs are in one-to-one correspondence with the protocol layers, each sub-monitoring program is used for collecting operation data of a corresponding protocol layer, or the sub-monitoring programs are in one-to-one correspondence with the target events, each sub-monitoring program is used for collecting context data of a corresponding target event, and each protocol layer is provided with one or more target events.

In some embodiments, the data acquisition module 501 is specifically configured to:

In some embodiments, the data acquisition module 501 is further configured to:

after operation data of a storage protocol stack are obtained, if the operation data are subjected to statistical processing according to preset data processing dimensions, and the obtained dimension data exceed a dimension data threshold value, performing first warning;

In some embodiments, the data acquisition module 501 is further configured to:

Calling an information acquisition program to acquire at least part of equipment information of a storage system;

In summary, in the storage system management device according to some embodiments of the present application, the monitoring program runs in the host kernel and is associated with at least some of the protocol layers of the storage protocol stack, so that when a target event occurs in an associated protocol layer, the monitoring program can collect context data of the target event. The context data may itself be low-level operation data of the storage system, or may reflect the low-level operating state of the storage system; in effect, a new collection channel for low-level operation data is added to the general-purpose storage system. Where the general-purpose storage system returns only high-dimensional data, the low-level operating state of the storage system can be determined from the data collected by the monitoring program, sparing maintenance personnel from collecting data a second time, so that the management efficiency of the storage system can be greatly improved. This addresses the problem in some technologies that the operation information returned by a general-purpose storage system is of high dimension and the management efficiency of the storage system is low.

Referring to fig. 6, an embodiment of the application provides an electronic device comprising a memory 10 and a processor 20. The memory 10 stores a computer program, and the processor 20 is configured to run the computer program to perform the steps of any of the storage system management method embodiments described above.

Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform, when run, the steps of any of the storage system management method embodiments described above.

In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.

Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the storage system management method embodiments described above.

Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the storage system management method embodiments described above.

Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above description describes in detail a storage system management method, apparatus, device, storage medium and program product provided by the present application. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims

1. A storage system management method, the method comprising:

and managing the storage system based on the system state.

2. The method of claim 1, wherein determining the system state of the storage system based on the operational data of the storage protocol stack comprises:

acquiring a reference system state returned by the storage system, wherein the system state determined based on the operation data of the storage protocol stack is a sub-state of the reference system state;

determining, based on the reference system state, the degree of association between each protocol layer and the reference system state, and determining an operation data weight for each protocol layer based on the degree of association, wherein the operation data weight represents the importance of the operation data collected from that protocol layer in determining the system state;

and performing, based on the operation data weights, weighted fusion calculation on the operation data collected from the different protocol layers to obtain the system state.

3. The method according to claim 2, wherein the performing weighted fusion calculation on the operation data collected from different protocol layers based on the operation data weight to obtain the system state includes:

identifying, based on the operation data weights, at least one protocol layer whose operation data weight is higher than a first weight threshold, and performing weighted fusion calculation on the operation data collected from the at least one protocol layer to obtain the system state.
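The weighted fusion recited in claims 2 and 3 can be illustrated with a minimal sketch. The claims do not specify the fusion formula, so a weight-normalized average is assumed here. The function name `fuse_system_state`, the layer names, the per-layer scores, and the threshold value are all hypothetical and serve only as an illustration.

```python
from typing import Dict


def fuse_system_state(layer_scores: Dict[str, float],
                      layer_weights: Dict[str, float],
                      weight_threshold: float = 0.0) -> float:
    """Fuse per-layer scores into one system-state score.

    Assumption: fusion is a weighted average over the protocol layers
    whose operation data weight is strictly above `weight_threshold`.
    """
    # Keep only the layers whose weight exceeds the (first) weight threshold.
    selected = {
        layer: weight
        for layer, weight in layer_weights.items()
        if weight > weight_threshold and layer in layer_scores
    }
    total_weight = sum(selected.values())
    if total_weight == 0:
        raise ValueError("no protocol layer exceeds the weight threshold")
    # Weighted sum, normalized by the total weight of the selected layers.
    weighted_sum = sum(layer_scores[layer] * w for layer, w in selected.items())
    return weighted_sum / total_weight


# Hypothetical layer names and values, for illustration only.
scores = {"block": 0.9, "scsi": 0.5, "transport": 0.1}
weights = {"block": 0.6, "scsi": 0.3, "transport": 0.1}
state_all = fuse_system_state(scores, weights)            # all three layers
state_top = fuse_system_state(scores, weights, 0.2)       # "transport" is dropped
```

With a threshold of 0.2, the low-weight layer is excluded before fusion, which corresponds to the selection step of claim 3.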

4. The method of claim 1, wherein determining the system state of the storage system based on the operational data of the storage protocol stack comprises:

and carrying out weighted fusion calculation on the context data of different target events based on the context data weight to obtain the system state.

5. The method of claim 4, wherein the performing weighted fusion calculation on the context data of different target events based on the context data weights to obtain the system state includes:

6. The method of claim 1, wherein the storage system has at least one hypervisor, wherein the determining the system state of the storage system based on the operational data of the storage protocol stack comprises:

7. The method of claim 1, wherein determining the system state of the storage system based on the operational data of the storage protocol stack comprises:

predicting a system state of the storage system in a future target period based on the operation data of the storage protocol stack in the target history period;

8. The method of claim 7, wherein managing the storage system based on the system state comprises:

if the system state of the storage system in the future target period is a first system state, determining a first management policy for the first system state, wherein the first management policy is used to prevent the storage system from entering the first system state;

and managing the storage system based on the first management policy.

9. The method of claim 7, wherein managing the storage system based on the system state comprises:

if the system state of the storage system in the current period is a second system state, predicting the duration of the second system state based on the operation data of the storage protocol stack;

if the duration of the second system state is longer than a duration threshold, determining a second management policy for the second system state, wherein the second management policy is used to cause the storage system to transition from the second system state to a third system state;

and managing the storage system based on the second management policy.
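The policy selection in claims 8 and 9 can be sketched as two small decision functions. The claims do not say how the future state or the duration is predicted, so both predictions are taken as inputs here. The class `ManagementPolicy`, the function names, the state labels, and the threshold value are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ManagementPolicy:
    name: str
    purpose: str


def policy_for_predicted_state(predicted_state: str,
                               first_state: str) -> Optional[ManagementPolicy]:
    """Claim 8 sketch: act in advance if the predicted future state is the first state."""
    if predicted_state == first_state:
        return ManagementPolicy("first", f"prevent transition into {first_state}")
    return None


def policy_for_current_state(current_state: str,
                             second_state: str,
                             third_state: str,
                             predicted_duration_s: float,
                             duration_threshold_s: float) -> Optional[ManagementPolicy]:
    """Claim 9 sketch: act only if the second state is predicted to last too long."""
    if current_state == second_state and predicted_duration_s > duration_threshold_s:
        return ManagementPolicy("second", f"move from {second_state} to {third_state}")
    return None


# Hypothetical state labels and durations, for illustration only.
preventive = policy_for_predicted_state("congested", first_state="congested")
corrective = policy_for_current_state("degraded", "degraded", "normal",
                                      predicted_duration_s=45.0,
                                      duration_threshold_s=30.0)
```

Returning `None` when no condition is met keeps the "no action needed" case explicit for the caller.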

10. The method of claim 1, wherein managing the storage system based on the system state comprises:

if the system state is a fourth system state, calling an information acquisition program to acquire at least part of the device information of the storage system;

determining a third management policy for the fourth system state based on the device information and the operation data of the storage protocol stack;

and managing the storage system based on the third management policy.

11. The method according to any of claims 8 to 10, wherein the storage system has at least one hypervisor;

wherein managing the storage system based on a management policy comprises:

and controlling at least part of the hypervisor based on the management policy to manage the storage system through the hypervisor.

12. The method of claim 1, wherein the monitoring program comprises a plurality of sub-monitoring programs, wherein:

the sub-monitoring programs are in one-to-one correspondence with the protocol layers, and each sub-monitoring program is used for collecting the operation data of the corresponding protocol layer;

or, the sub-monitoring programs are in one-to-one correspondence with the target events, and each sub-monitoring program is used for collecting context data of the corresponding target event, wherein each protocol layer is provided with one or more target events.

13. The method of claim 12, wherein collecting, by the monitoring program, the operation data of the storage protocol stack comprises:

determining target operation data to be acquired based on a management target of the storage system;

and controlling the target sub-monitoring program to run, and controlling the sub-monitoring programs other than the target sub-monitoring program to stop running.
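The selective activation in claim 13 can be sketched as follows. How a target sub-monitoring program is matched to the target operation data is not recited in the text above, so this sketch assumes a simple registry that maps each sub-monitoring program to the kinds of data it collects. The function name `plan_sub_monitors`, the registry contents, and the data labels are hypothetical.

```python
from typing import Dict, Set, Tuple


def plan_sub_monitors(registry: Dict[str, Set[str]],
                      target_data: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split sub-monitoring programs into those to run and those to stop.

    Assumption: a sub-monitoring program is a target if it collects at
    least one kind of the target operation data.
    """
    to_run = {name for name, provides in registry.items() if provides & target_data}
    to_stop = set(registry) - to_run
    return to_run, to_stop


# Hypothetical registry: one sub-monitoring program per protocol layer.
registry = {
    "block_layer_probe": {"io_latency", "queue_depth"},
    "scsi_layer_probe": {"command_timeout", "retry_count"},
    "transport_probe": {"link_reset", "frame_error"},
}

# Hypothetical management target: latency diagnosis needs these two kinds of data.
to_run, to_stop = plan_sub_monitors(registry, {"io_latency", "command_timeout"})
```

Stopping the sub-monitoring programs that are not needed limits collection overhead to the data relevant to the current management target.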

14. The method according to claim 1, wherein the method further comprises:

And/or after obtaining the system state of the storage system, displaying the system state.

15. The method according to claim 1, wherein the method further comprises:

after obtaining the operation data of the storage protocol stack, performing statistical processing on the operation data according to a preset data processing dimension, and issuing a first alarm if the resulting dimension data exceeds a dimension data threshold;
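The first alarm of claim 15 can be sketched as a group-then-compare step. The claim does not name the statistic or the dimension, so this sketch assumes a per-group mean. The function name `first_alarm_dimensions`, the record fields, and the threshold are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def first_alarm_dimensions(records: Iterable[Mapping],
                           dimension: str,
                           value_key: str,
                           threshold: float) -> Dict[str, float]:
    """Group records by `dimension`, average `value_key`, and return the
    groups whose mean exceeds `threshold` (these would trigger the first alarm)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[dimension]].append(record[value_key])
    stats = {key: mean(values) for key, values in groups.items()}
    return {key: value for key, value in stats.items() if value > threshold}


# Hypothetical operation data records, for illustration only.
records = [
    {"layer": "block", "latency_ms": 10},
    {"layer": "block", "latency_ms": 30},
    {"layer": "scsi", "latency_ms": 5},
    {"layer": "scsi", "latency_ms": 7},
]
alarms = first_alarm_dimensions(records, "layer", "latency_ms", threshold=15)
```

An empty result means no dimension exceeded the threshold and no alarm is raised.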

16. The method according to claim 1, wherein the method further comprises:

calling an information acquisition program to acquire at least part of the device information of the storage system;

17. A storage system management apparatus, the apparatus comprising:

18. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the storage system management method according to any one of claims 1 to 16 when executing the computer program.

19. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, wherein the computer program, when executed by a processor, implements the storage system management method of any of claims 1 to 16.

20. A computer program product comprising a computer program which, when executed by a processor, implements the storage system management method of any one of claims 1 to 16.