
CN117632030A - Method and device for cooperatively scheduling disk I/O in storage system - Google Patents


Info

Publication number: CN117632030A
Application number: CN202311662121.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: central processing unit, disk, storage system, scheduling
Inventors: 王策 (Wang Ce), 蒋方文 (Jiang Fangwen), 李超 (Li Chao)
Current and original assignee: Inspur Cloud Information Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Priority/filing date: 2023-12-06 (the priority date is an assumption, not a legal conclusion)
Publication date: 2024-03-01
Legal status: Pending (the legal status is an assumption, not a legal conclusion; Google has not performed a legal analysis)


Classifications

    • G06F 3/061: Improving I/O performance
    • G06F 1/324: Power saving characterised by the action undertaken by lowering clock frequency
    • G06F 1/3243: Power saving in microcontroller unit
    • G06F 3/0625: Power saving in storage systems
    • G06F 3/0631: Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F 3/0632: Configuration or reconfiguration of storage systems by initialisation or re-initialisation of storage systems
    • G06F 3/0655: Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F 3/068: Hybrid storage device (in-line storage system; single storage device)
    • G06F 9/505: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, considering the load
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a method and a device for cooperatively scheduling disk I/O in a storage system, belonging to the field of data storage. The technical scheme is as follows: 1. distribute and bind each request to a suitable central processing unit for processing, according to the characteristics of the physical disk and the I/O request; 2. dynamically adjust how the central processing unit is used according to the frequency and pressure of I/O requests. The invention optimizes disk I/O performance in a storage server by fully utilizing resources such as the central processing unit, the disk, and the memory.

Description

Method and device for cooperatively scheduling disk I/O in storage system
Technical Field
The invention relates to the technical field of data storage, in particular to a method and a device for cooperatively scheduling disk I/O in a storage system.
Background
In today's Linux servers with multi-core central processing units, the kernel assigns each central processing unit core its own software queue (Multi-Queue SSD Access on Multi-core Systems), so that I/O requests sent by remote or local clients can be processed in parallel. Meanwhile, an NVMe (Non-Volatile Memory Express, the non-volatile memory host controller interface specification) disk supports up to 64K queues writing data simultaneously, whereas a disk following the older AHCI specification defines only one hardware queue and therefore cannot exploit the advantages of a multi-core central processing unit. Combining a multi-core central processing unit with a multi-queue disk can thus greatly improve disk I/O performance.
However, storage systems typically do not actively choose which central processing unit performs an I/O operation and receives the interrupt after the I/O completes, so the central processing units' full performance is not realized. Moreover, different data transmission protocols and different services have different requirements for processing I/O, and customized processing modes for them are lacking; for example, I/O-intensive requests may all be processed on the same central processing unit while the other central processing units sit idle, causing high I/O latency. A scheme is therefore needed that fully utilizes central processing unit and disk resources and resolves unbalanced resource occupation, so as to improve storage efficiency.
In servers with multi-core processors, customized scheduling policies are lacking on the path that storage I/O requests from different clients take to each storage system and to the different types of disks. When facing different disk types, I/O types, transmission protocols, and business requirements, flexible scheduling is absent, and storage performance cannot be fully exploited.
Disclosure of Invention
The main purpose of the invention is to provide a method and a device for cooperatively scheduling disk I/O in a storage system. Based on the characteristics of the central processing units, the disks, the data communication protocols, and the actual service demands, the invention integrates all schedulable resources, allocates I/O requests in the storage system as reasonably as possible, and dynamically schedules and rebalances I/O requests when they become unbalanced, fully utilizing the capabilities of the central processing units and the disks, delivering better storage performance, and reducing the expenditure of manpower and material resources. It can effectively resolve the pain points described in the background.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A method for cooperatively scheduling disk I/O in a storage system adds an I/O scheduling system to the storage system. Based on the capabilities and characteristics of the central processing units and disks, combined with the service demands and importance of different cloud hard disks, the scheduling system dynamically allocates and schedules disk I/O to achieve higher disk I/O performance and stability while also accounting for central processing unit energy consumption, keeping the storage system in dynamic balance. The method specifically comprises the following steps:
Step (1): when the storage system starts, collect information about the central processing units into a linked-list data structure, which later serves as the basis for allocating central processing unit resources to disk I/O;
Step (2): after receiving a command to mount a cloud hard disk to the server, the storage system initializes a memory area for storing the cloud hard disk's relevant information and creates a coroutine, and the scheduling system selects a central processing unit to bind the coroutine to;
Step (3): the storage scheduling system automatically selects a scheduling strategy that maximizes the utilization of all central processing unit resources, based on the characteristics of the central processing units, the characteristics of the cloud hard disks, the service requirements, and the current central processing unit load information, thereby improving storage performance.
Further, in step (1), the storage system parses the central processing unit mask passed in during initialization, collects each central processing unit's frequency information, initializes a structure for it, and inserts the structure into a doubly linked list; when coroutine resources are later allocated to a cloud hard disk, the list is traversed to select one central processing unit dedicated to processing that cloud hard disk's I/O.
Further, in step (2), hard disk information is collected each time a cloud hard disk is mounted on a server; the collected information includes the data communication protocol, the QoS configuration, and the disk type. A poller function for the coroutine store_thread is registered, added to an executor's queue, and then run on that executor.
Further, in step (3), the scheduling system combines the central processing unit load with the actual I/O request situation and applies the following processing logic:
(1) When the central processing unit load is low during certain time periods, or few coroutines are running on an executor, the central processing unit's operating frequency is lowered or the executor is switched from polling mode to interrupt mode, freeing the central processing unit to schedule tasks of other processes;
(2) According to each cloud hard disk's weight and its I/O characteristics, important cloud hard disks with higher security requirements are allocated to central processing units with higher performance and lower load for disk I/O processing. A cloud hard disk's weight is calculated by a service-defined algorithm. The coroutine's information is passed between central processing units through a lock-free queue on the executor; the executor taking over the coroutine removes the information from the lock-free queue and runs the coroutine;
(3) When more than a certain proportion of the central processing units are under high load, an additional central processing unit is automatically brought up to run a new executor, and non-important coroutines are migrated to the new central processing unit's executor according to the weight algorithm.
An apparatus for cooperatively scheduling disk I/O in a storage system, comprising:
an I/O device for inputting and outputting information, used to perform the method of cooperatively scheduling disk I/O in a storage system of any one of claims 1-4;
a storage device for storing the characteristics of the central processing units, the characteristics of the cloud hard disks, the service requirements, the current central processing unit load information, and the scheduling policy;
a central processing unit for performing the method of cooperatively scheduling disk I/O in a storage system of any one of claims 1-4.
Compared with the prior art, the invention has the following beneficial effects:
1. During periods with few I/O requests, such as at night, the storage scheduling system automatically switches the executor's operating mode to interrupt mode or lowers the central processing unit's operating frequency, releasing its full occupation of the central processing unit, giving more time slices to other processes, and reducing energy consumption.
2. A specific central processing unit is allocated according to service importance and I/O characteristics. Frequent small-file read/write requests are sent to a central processing unit core with a higher clock frequency to improve their IOPS; for sequential large-block I/O read/write requests, the poller cycle interval is appropriately increased to raise the amount of I/O data transferred at one time and increase throughput.
3. When the storage scheduling system detects a large load difference among the central processing units, with some under high load, it can, according to the actual service's algorithm, schedule the store_thread of an important cloud hard disk to another idle central processing unit, or schedule unimportant coroutines to other idle central processing units, preventing the services' I/O from contending for resources.
4. Being based on a user-mode driver, the method has lower development cost and more flexibility than a kernel driver. The program can control coroutines in user mode, and the user-mode driver interacts directly with the hardware device, reducing the extra overhead of I/O data copying and the context switches caused by system calls.
Drawings
FIG. 1 is a diagram of a storage system resource architecture;
FIG. 2 is a storage system scheduling flow diagram;
FIG. 3 is a schematic diagram of the storage device.
Detailed Description
The invention is further described below in connection with specific embodiments, so that its technical means, creative features, objectives, and effects are easy to understand.
In the description of the present invention, it should be noted that orientation or positional terms such as "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", and "the other end" are based on the orientations or positional relationships shown in the drawings, are used merely for convenience and simplicity of description, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the invention. Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified and limited, the terms "mounted", "provided", "connected", and the like are to be construed broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or internal communication between two elements. The specific meaning of these terms in the present invention will be understood by those of ordinary skill in the art on a case-by-case basis.
Examples
Referring to FIGS. 1-3, the present invention provides the following technical solution:
A method for cooperatively scheduling disk I/O in a storage system adds an I/O scheduling system to the storage system. Based on the capabilities and characteristics of the central processing units and disks, combined with the service demands and importance of different cloud hard disks, the scheduling system dynamically allocates and schedules disk I/O to achieve higher disk I/O performance and stability while also accounting for central processing unit energy consumption, keeping the storage system in dynamic balance. The method specifically comprises the following steps:
Step (1): when the storage system starts, collect information about the central processing units into a linked-list data structure, which later serves as the basis for allocating central processing unit resources to disk I/O;
Step (2): after receiving a command to mount a cloud hard disk to the server, the storage system initializes a memory area for storing the cloud hard disk's relevant information and creates a coroutine, and the scheduling system selects a central processing unit to bind the coroutine to;
Step (3): the storage scheduling system automatically selects a scheduling strategy that maximizes the utilization of all central processing unit resources, based on the characteristics of the central processing units, the characteristics of the cloud hard disks, the service requirements, and the current central processing unit load information, thereby improving storage performance.
Further, in step (1), the storage system parses the central processing unit mask passed in during initialization, collects each central processing unit's frequency information, initializes a structure for it, and inserts the structure into a doubly linked list; when coroutine resources are later allocated to a cloud hard disk, the list is traversed to select one central processing unit dedicated to processing that cloud hard disk's I/O.
Further, in step (2), hard disk information is collected each time a cloud hard disk is mounted on a server; the collected information includes the data communication protocol, the QoS configuration, and the disk type. A poller function for the coroutine store_thread is registered, added to an executor's queue, and then run on that executor.
Further, in step (3), the scheduling system combines the central processing unit load with the actual I/O request situation and applies the following processing logic:
(1) When the central processing unit load is low during certain time periods, or few coroutines are running on an executor, the central processing unit's operating frequency is lowered or the executor is switched from polling mode to interrupt mode, freeing the central processing unit to schedule tasks of other processes;
(2) According to each cloud hard disk's weight and its I/O characteristics, important cloud hard disks with higher security requirements are allocated to central processing units with higher performance and lower load for disk I/O processing. A cloud hard disk's weight is calculated by a service-defined algorithm. The coroutine's information is passed between central processing units through a lock-free queue on the executor; the executor taking over the coroutine removes the information from the lock-free queue and runs the coroutine;
(3) When more than a certain proportion of the central processing units are under high load, an additional central processing unit is automatically brought up to run a new executor, and non-important coroutines are migrated to the new central processing unit's executor according to the weight algorithm.
In this embodiment, the detailed technical solution may be divided into three parts:
the central processing unit initializes:
When the storage system is initialized from the command line, a mask parameter -m [CPU number 1, CPU number 2, ...] designates several central processing units that are dedicated to processing I/O-related work in the storage system.
The storage system allocates space for and initializes a struct cpu_info structure, fills in the CPU number, CPU frequency information, and NUMA topology information, and inserts it into a doubly linked list that is later used to decide which central processing unit an I/O execution thread should run on.
The main thread calls pthread_create to create one thread for each central processing unit and binds each thread to its selected logical core via pthread_setaffinity_np; each thread running on a central processing unit core is called an executor. After the main thread sends a start message to the child threads through a Linux pipe, all executors default to polling mode, and central processing unit utilization is 100%.
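For illustration, a minimal C sketch of this initialization step follows. The names (struct cpu_info, executor_main) and the hardcoded CPU list standing in for the -m mask are assumptions made for the example, not the patent's actual code; it assumes a Linux/glibc environment, where pthread_setaffinity_np is a GNU extension.

```c
/* Sketch: build the cpu_info list nodes and pin one executor per core. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

struct cpu_info {
    int              cpu_id;      /* logical core number from the -m mask */
    long             freq_khz;    /* core frequency, e.g. read from sysfs */
    int              numa_node;   /* NUMA node the core belongs to */
    struct cpu_info *prev, *next; /* doubly linked list pointers */
};

static void *executor_main(void *arg)
{
    struct cpu_info *ci = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(ci->cpu_id, &set);
    /* Bind this executor to its core so it is never migrated. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... wait for the start message on the pipe, then enter the
     * polling loop (sketched further below) ... */
    return NULL;
}

int main(void)
{
    int cpus[] = {2, 3};          /* stands in for "-m [2,3]" */
    pthread_t tids[2];
    for (int i = 0; i < 2; i++) {
        struct cpu_info *ci = calloc(1, sizeof(*ci));
        ci->cpu_id = cpus[i];     /* freq_khz/numa_node filled elsewhere */
        pthread_create(&tids[i], NULL, executor_main, ci);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```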
Initializing a storage device:
When the storage system detects a newly added local hard disk, or a user applies for a new virtual hard disk and mounts it to a user cloud server/cloud physical host, the hard disk's information (data communication protocol, QoS configuration, disk type) is collected, a poller function for the coroutine store_thread is registered, and the store_thread is added to an executor and then run on it.
The storage device driver runs in user mode and initializes the corresponding hardware queues in the storage system. Unlike kernel threads, in user mode the program itself controls the coroutine store_thread, and only one coroutine accesses a given queue at any point in time, which prevents different threads from contending for the same queue's lock and degrading performance.
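The relationship between store_threads, pollers, and an executor might be modeled as below. The types and the registration helper are hypothetical, sketched only to make the single-owner design (one executor per queue, so no locks) concrete.

```c
/* Sketch: one executor owns a list of store_threads; each store_thread
 * exposes a poller that drains its cloud disk's queue. Because only the
 * owning executor ever calls these pollers, the queues need no locking. */
#include <stddef.h>

typedef int (*poller_fn)(void *ctx);   /* returns #I/Os it processed */

struct store_thread {
    poller_fn            poll;         /* drives this disk's hardware queue */
    void                *ctx;          /* protocol, QoS, disk-type info */
    struct store_thread *next;
};

struct executor {
    struct store_thread *store_thread_list;  /* pollers this core runs */
};

/* Register a store_thread's poller with an executor. In the real system
 * this request would be posted through the executor's message ring so
 * that only the owning thread ever mutates the list. */
static void register_poller(struct executor *ex, struct store_thread *st)
{
    st->next = ex->store_thread_list;
    ex->store_thread_list = st;
}
```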
Scheduling policy:
Each block device mounted to the cloud server/cloud physical host is assigned a computed weight value. The influencing factors, in order of importance, are service importance, I/O type (small-block random I/O, large-block I/O, etc.), QoS configuration, hard disk type (NVMe SSD, SAS HDD), and data transmission protocol (RDMA, PCIe, iSCSI, FC); the specific weights are configured per scenario, for example weight = 0.5 × service importance + 0.3 × QoS + 0.2 × hard disk type. The weight value is a key parameter of the cloud hard disk scheduling policy described below.
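A sketch of that example formula follows, under the assumption that each factor has already been normalized to [0, 1]; the patent leaves the normalization and the exact coefficients to the specific scenario.

```c
/* Sketch: the example weight formula from the text. Factor values are
 * assumed normalized to [0,1]; the 0.5/0.3/0.2 split is the example
 * configuration, not a fixed rule. */
struct disk_attrs {
    double service_importance;  /* business importance */
    double qos;                 /* QoS configuration */
    double disk_type;           /* e.g. NVMe SSD ranked above SAS HDD */
};

static double cloud_disk_weight(const struct disk_attrs *a)
{
    return 0.5 * a->service_importance
         + 0.3 * a->qos
         + 0.2 * a->disk_type;
}
```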
The executor thread runs a while loop that executes the poller function of each entry on its store_thread_list linked list; these pollers mainly perform I/O operations. Each loop iteration records the pollers' execution time as active_time, and active_ratio = sum(active_time) / total_time indicates how busy the executor is.
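A sketch of that loop is below, reusing the store_thread/executor types from the earlier sketch (repeated here so the block stands alone); the stop flag and the timing method are assumptions.

```c
/* Sketch: the executor's polling loop with active_ratio accounting. */
#include <stdbool.h>
#include <time.h>

typedef int (*poller_fn)(void *ctx);
struct store_thread { poller_fn poll; void *ctx; struct store_thread *next; };
struct executor {
    struct store_thread *store_thread_list;
    volatile bool        stop;
    double               active_ratio;   /* busyness, updated every pass */
};

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void executor_loop(struct executor *ex)
{
    double start = now_sec(), active_time = 0.0;
    while (!ex->stop) {
        for (struct store_thread *st = ex->store_thread_list; st; st = st->next) {
            double t0 = now_sec();
            st->poll(st->ctx);               /* perform pending I/O */
            active_time += now_sec() - t0;   /* time spent doing real work */
        }
        double total = now_sec() - start;
        if (total > 0)
            ex->active_ratio = active_time / total;
    }
}
```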
The storage scheduling system traverses the linked list of available central processing units according to the weight value and binds a cloud hard disk newly mounted to the server to one central processing unit, taking into account the central processing units' frequencies, the number of pollers running on each, and the active_ratio value. The complete I/O link, including the processing after I/O completion, is handled on a fixed central processing unit, which reduces the probability of central processing unit cache misses.
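One way this selection could look in code is sketched below; the scoring expression combining frequency, poller count, and active_ratio is an illustrative assumption, since the patent names the inputs but not the exact formula.

```c
/* Sketch: pick a central processing unit for a newly mounted cloud disk.
 * Illustrative scoring: heavier-weight disks favor faster cores, and
 * busy or crowded cores are penalized. Assumes a non-empty list. */
struct cpu_stat {
    int    cpu_id;
    long   freq_khz;         /* core frequency */
    int    n_pollers;        /* pollers already running on this core */
    double active_ratio;     /* executor busyness */
    struct cpu_stat *next;   /* linked list of available CPUs */
};

static int pick_cpu(const struct cpu_stat *head, double disk_weight)
{
    const struct cpu_stat *best = head;
    double best_score = -1e9;
    for (const struct cpu_stat *c = head; c; c = c->next) {
        double score = disk_weight * (c->freq_khz / 1e6)
                     - c->active_ratio
                     - 0.01 * c->n_pollers;
        if (score > best_score) { best_score = score; best = c; }
    }
    return best->cpu_id;    /* the disk's coroutine is bound to this core */
}
```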
Normally, the load on each central processing unit is roughly balanced. However, when the I/O requests of different clients change greatly, leaving some central processing units heavily loaded and others lightly loaded, the scheduling algorithm decides to schedule the coroutines of higher-weight cloud hard disks away from the heavily loaded central processing units to the lightly loaded ones; otherwise the same central processing unit continues to be used. A message is sent to the target's lock-free ring buffer, and the store_thread is migrated to the corresponding central processing unit's executor the next time that executor loops around to the poller function responsible for processing such messages.
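The migration handoff could be built on a ring like the one below; a minimal single-producer/single-consumer ring with C11 atomics is one way to realize the "lock-free ring buffer" the text mentions, whose internals the patent does not specify.

```c
/* Sketch: migrating a store_thread by posting its pointer to the target
 * executor's ring. Single-producer/single-consumer, C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 64u                 /* must be a power of two */

struct msg_ring {
    void *slots[RING_SIZE];
    _Atomic unsigned head;            /* next slot the consumer reads  */
    _Atomic unsigned tail;            /* next slot the producer writes */
};

/* Called by the source executor: hand the store_thread over. */
static bool ring_push(struct msg_ring *r, void *store_thread)
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    if (t - atomic_load_explicit(&r->head, memory_order_acquire) == RING_SIZE)
        return false;                 /* ring full; retry on a later pass */
    r->slots[t % RING_SIZE] = store_thread;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

/* Called from the target executor's message poller: adopt migrants. */
static void *ring_pop(struct msg_ring *r)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&r->tail, memory_order_acquire))
        return NULL;                  /* nothing waiting */
    void *st = r->slots[h % RING_SIZE];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return st;
}
```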
When the active_ratio measured on more than 80% of the executors exceeds 90%, a certain number of additional central processing units are brought up to run new executors, and the store_threads are rebalanced as described in points 1-4 above, avoiding overload of individual central processing units.
A red-black tree structure, timed_pollers_tree, is maintained on each executor; it contains a poller function that executes every 10 minutes to measure the current executor's busyness. By default, when the executor's busyness stays below 20%, or there is no active store_thread_list, for longer than time1, a central processing unit down-clocking operation is executed (if the central processing unit supports it); when time2 is exceeded, the executor is switched from polling mode to interrupt mode, and the central processing unit can then execute tasks of other processes; when time3 is exceeded, the executor is destroyed.
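The threshold logic could look roughly like this; the concrete time1/time2/time3 values, the helper functions, and the executor accessors are all assumptions, since the patent names the thresholds but not their values.

```c
/* Sketch: the periodic busyness check. Helpers and thresholds are
 * placeholders; struct executor matches the earlier sketches. */
struct executor;                                        /* as sketched above */
extern double executor_busyness(struct executor *ex);   /* its active_ratio */
extern int    executor_has_active_list(struct executor *ex);
extern void   lower_cpu_frequency(struct executor *ex); /* if supported */
extern void   switch_to_interrupt_mode(struct executor *ex);
extern void   destroy_executor(struct executor *ex);

#define TIME1 (10.0 * 60)   /* illustrative values, in seconds */
#define TIME2 (30.0 * 60)
#define TIME3 (60.0 * 60)

/* Runs every 10 minutes from the executor's timed_pollers_tree. */
static void timed_busyness_poller(struct executor *ex,
                                  double *idle_since, double now)
{
    if (executor_busyness(ex) >= 0.20 && executor_has_active_list(ex)) {
        *idle_since = now;            /* busy again: reset the idle clock */
        return;
    }
    double idle = now - *idle_since;
    if (idle > TIME3)
        destroy_executor(ex);                 /* give the core back */
    else if (idle > TIME2)
        switch_to_interrupt_mode(ex);         /* stop spinning on the CPU */
    else if (idle > TIME1)
        lower_cpu_frequency(ex);              /* down-clock if supported */
}
```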
The invention is based on the following features:
In a server with a multi-core processor, each central processing unit core has its own first-level and second-level caches. When a process/thread is switched to a different central processing unit for execution, cache misses may occur, reducing execution efficiency. CPU affinity is the tendency of a process/thread to run on a given central processing unit for as long as possible without being migrated to another processor, which improves thread execution efficiency.
Information exchange between threads uses a lock-free queue. The lock-free strategy uses a technique called compare-and-swap (CAS) to detect thread conflicts; once a conflict is detected, the current operation is retried until no conflict occurs.
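The retry pattern can be shown with a minimal lock-free push using C11 atomics; a Treiber-stack push is used here purely to illustrate CAS-with-retry, since the patent does not disclose the queue's internal structure.

```c
/* Sketch: the compare-and-swap (CAS) retry pattern behind lock-free
 * structures, illustrated as a Treiber-stack push. */
#include <stdatomic.h>

struct node {
    void        *payload;
    struct node *next;
};

static _Atomic(struct node *) top;    /* shared head of the structure */

static void lockfree_push(struct node *n)
{
    struct node *old = atomic_load_explicit(&top, memory_order_relaxed);
    do {
        n->next = old;                /* link to the head we last saw */
        /* CAS succeeds only if no other thread changed `top` since we
         * read it; on conflict, `old` is refreshed and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(
                 &top, &old, n,
                 memory_order_release, memory_order_relaxed));
}
```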
An apparatus for cooperatively scheduling disk I/O in a storage system, comprising:
an I/O device for inputting and outputting information, used to perform the method of cooperatively scheduling disk I/O in a storage system of any one of claims 1-4;
a storage device for storing the characteristics of the central processing units, the characteristics of the cloud hard disks, the service requirements, the current central processing unit load information, and the scheduling policy;
a central processing unit for performing the method of cooperatively scheduling disk I/O in a storage system of any one of claims 1-4.
The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate the principles of the invention, and various changes and improvements may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A method for cooperatively scheduling disk I/O in a storage system, characterized in that: an I/O scheduling system is added to the storage system; based on the capabilities and characteristics of the central processing units and disks, combined with the service demands and importance of different cloud hard disks, disk I/O is dynamically allocated and scheduled to achieve higher disk I/O performance and stability while also accounting for central processing unit energy consumption, keeping the storage system in dynamic balance; the method specifically comprises the following steps:
Step (1): when the storage system starts, collect information about the central processing units into a linked-list data structure, which later serves as the basis for allocating central processing unit resources to disk I/O;
Step (2): after receiving a command to mount a cloud hard disk to the server, the storage system initializes a memory area for storing the cloud hard disk's relevant information and creates a coroutine, and the scheduling system selects a central processing unit to bind the coroutine to;
Step (3): the storage scheduling system automatically selects a scheduling strategy that maximizes the utilization of all central processing unit resources, based on the characteristics of the central processing units, the characteristics of the cloud hard disks, the service requirements, and the current central processing unit load information, thereby improving storage performance.
2. The method for cooperatively scheduling disk I/O in a storage system according to claim 1, characterized in that: in step (1), the storage system parses the central processing unit mask passed in during initialization, collects each central processing unit's frequency information, initializes a structure for it, and inserts the structure into a doubly linked list; when coroutine resources are later allocated to a cloud hard disk, the list is traversed to select one central processing unit dedicated to processing that cloud hard disk's I/O.
3. The method for cooperatively scheduling disk I/O in a storage system according to claim 1, characterized in that: in step (2), hard disk information is collected each time a cloud hard disk is mounted on a server; the collected information includes the data communication protocol, the QoS configuration, and the disk type; a poller function for the coroutine store_thread is registered, added to an executor's queue, and then run on that executor.
4. The method for cooperatively scheduling disk I/O in a storage system according to claim 1, characterized in that: in step (3), the scheduling system combines the central processing unit load with the actual I/O request situation and applies the following processing logic:
(1) When the central processing unit load is low during certain time periods, or few coroutines are running on an executor, the central processing unit's operating frequency is lowered or the executor is switched from polling mode to interrupt mode, freeing the central processing unit to schedule tasks of other processes;
(2) According to each cloud hard disk's weight and its I/O characteristics, important cloud hard disks with higher security requirements are allocated to central processing units with higher performance and lower load for disk I/O processing; a cloud hard disk's weight is calculated by a service-defined algorithm; the coroutine's information is passed between central processing units through a lock-free queue on the executor, and the executor taking over the coroutine removes the information from the lock-free queue and runs the coroutine;
(3) When more than a certain proportion of the central processing units are under high load, an additional central processing unit is automatically brought up to run a new executor, and non-important coroutines are migrated to the new central processing unit's executor according to the weight algorithm.
5. An apparatus for cooperatively scheduling disk I/O in a storage system, comprising:
an I/O device for inputting and outputting information, used to perform the method of cooperatively scheduling disk I/O in a storage system of any one of claims 1-4;
a storage device for storing the characteristics of the central processing units, the characteristics of the cloud hard disks, the service requirements, the current central processing unit load information, and the scheduling policy;
a central processing unit for performing the method of cooperatively scheduling disk I/O in a storage system of any one of claims 1-4.
CN202311662121.8A, priority date 2023-12-06, filed 2023-12-06: Method and device for cooperatively scheduling disk I/O in a storage system; pending; published as CN117632030A (en).

Priority Applications (1)

CN202311662121.8A (priority date 2023-12-06, filing date 2023-12-06): Method and device for cooperatively scheduling disk I/O in a storage system

Publications (1)

CN117632030A, published 2024-03-01

Family

ID=90019689

Country Status (1)

CN: CN117632030A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination