The io.weight I/O-bandwidth controller
There are a number of challenges facing an I/O-bandwidth controller. Some processes may need a guarantee that they will get at least a minimum amount of the available bandwidth to a given device. More commonly in recent times, though, the focus has shifted to latency: a process should be able to count on completing an I/O request within a bounded period of time. The controller should be able to provide those guarantees while still driving the underlying device at something close to its maximum rate. And, of course, hardware varies widely, so the controller must be able to adapt its operation to each specific device.
The earliest I/O-bandwidth controller allows the administrator to set maximum bandwidth limits for each control group. That controller, though, will throttle I/O even if the device is otherwise idle, causing the loss of I/O bandwidth. The more recent io.latency controller is focused on I/O latency, but as Tejun Heo, the author of the new controller, notes in the patch series, this controller really only protects the lowest-latency group, penalizing all others if need be to meet that group's requirements. He set out to create a mechanism that would allow more control over how I/O bandwidth is allocated to groups.
io.weight
The new controller works by assigning a "weight" value to each control
group. Consider, for example, the simple hierarchy shown to the right.
If group A is given a weight of 100 for a specific block device and
group B has a weight of 300, then B will be allowed to use 75% of the
available bandwidth. Absolute weights do not matter, each group's actual
portion of the available bandwidth will be determined by its weight
relative to the sum of all weights at that level in the hierarchy.
That leaves open the question of just how the controller determines how much of the device's capacity each group is using. Simply counting I/O operations or total bandwidth turns out to be inadequate, since some requests can be quite a bit more expensive than others. So the new controller uses a "cost model" that tries to better estimate how much of a device's time will be required to satisfy any given request. This model is relatively simple. First, it determines whether a request is sequential or random; in the former case, the operation will complete much more quickly (especially on rotating drives) than in the latter. The operation is given a fixed cost based on this determination, plus an incremental cost for each page to be transferred. The resulting total cost is an estimate of how long it will take for the request to be executed.
By default, the controller will observe the actual behavior of each device to work out what the cost parameters should be. The administrator can override this behavior by writing some commands to the io.weight.cost_model file in the root-level control group. For each drive, the maximum throughput, along with the maximum number of sequential and random operations that can be performed per second, can be specified. Different costs can be used for read and write operations if appropriate.
The default cost model apparently works pretty well. But, should somebody encounter a situation where that model falls apart, there is, inevitably, a hook to run a BPF program that can calculate the cost in whatever way makes sense.
vtime
The controller works by establishing a virtual clock (called vtime) for each device; that clock normally advances at the usual rate of one second per second. Each control group also has a vtime clock that determines when it can submit another I/O operation. Once the cost of an operation has been determined, it is added to the group's vtime; the operation can only be sent to the device once that device's vtime is ahead of the group's vtime. The weights assigned to each group are implemented by scaling the cost of each operation proportionally to that group's share of the total bandwidth. If group A, above, has 25% of the available bandwidth, the cost of its operations will be multiplied by four. In a sense, control groups live in a relativistic universe where lower-weight groups have slower-moving clocks.
To avoid situations where a device sits idle when there are operations pending, the controller will take note when any given group is not using the full bandwidth available to it and temporarily lower that group's weight to match its actual usage, in effect "lending" the unused bandwidth to other groups that are performing I/O. There is a mechanism to allow a group to quickly grab back its lent bandwidth should it start to need it.
There is one little remaining problem: the vtime mechanism is designed to issue requests at the speed that the device can handle them. But the cost model is unlikely to be perfect, and the performance of any given device can vary over time. If the cost model is off, the controller could dispatch too many requests (increasing latencies) or not enough requests (leaving some bandwidth unused). That, naturally, is a situation worth avoiding.
Should the controller notice that request-completion times are increasing, it takes that as a signal that too many requests are being sent. That situation is addressed by slowing down the vtime associated with the overloaded device, so that requests will be dispatched at a slower rate. Similarly, if the device is not completely busy, its vtime will be advanced more quickly so that more requests will go out.
The controller will try to tune this scaling automatically, but that may not be adequate for some situations. Write operations, in particular, can be queued within the device itself and completed in an order chosen by the device, meaning that the controller loses some control over the latency of any given request. In cases where that is a problem, it may be desirable to slow down request dispatch more aggressively to reduce the latency of request completion, even at the possible cost of leaving some bandwidth unused. There is another control knob in the root group, called, io.weight.qos, that can be used to specify what the desired latency ranges are and how much the device's vtime can be adjusted to achieve those ranges.
See the comments at the top of this patch for more details on the various control knobs and how they work.
Heo notes that the controller does a reasonable job of enforcing each
group's weight using the default parameters — for read requests, at least.
When
there are a lot of writes involved, some playing with the parameters may be
needed to get the best results. Tools and documentation to help
administrators working to tune this controller are promised. Meanwhile,
there has not been a huge amount of feedback on this controller since it
was posted on June 13. Expecting it for 5.3 seems optimistic, but it
may well be ready for a merge window shortly thereafter.
| Index entries for this article | |
|---|---|
| Kernel | Control groups/I/O bandwidth controllers |