The trouble with iowait
In theory, a task that is waiting for short-term I/O (a state referred to in the kernel as "iowait") will need to execute soon. That means that there can be some advantages to treating the task as if it were executing now. The kernel maintains a one-bit field (called in_iowait) in the task_struct structure to mark such tasks. This bit is set prior to waiting for an I/O operation that is expected to be fast (typically a block I/O operation) and cleared once the operation completes; a sketch of this pattern appears after the list below. The kernel then uses this information in a couple of ways:
- When an iowait task wakes on completion of the I/O, the scheduler will inform the CPU-frequency governor. The governor, in turn, may choose to run the target CPU at a higher frequency than it otherwise would. Normally, the CPU-frequency decision is driven by the level of utilization of the processor, but tasks performing a lot of I/O may not run up a lot of CPU time. That can lead the CPU-frequency governor to choose a slower frequency than is optimal, with the result that the next I/O operation is not launched as quickly and throughput suffers. Raising the frequency for iowait tasks is meant to help them keep the I/O pipeline full.
- If a CPU goes idle, the system will normally try to put it into a lower-power state to save energy. The deeper the sleep state, though, the longer it takes for the CPU to get back to work when a runnable task is placed on it. The number of iowait tasks queued for a CPU is used as a signal indicating upcoming CPU demand; the presence of those tasks can cause the governor to choose a shallower sleep state than it would otherwise.
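The marking itself is simple; the kernel's io_schedule() helpers follow roughly the pattern below. This is a simplified, kernel-style sketch, not buildable on its own; the real code, in kernel/sched/core.c, also flushes plugged block requests and handles other details omitted here:

```c
/* Simplified sketch of the kernel's io_schedule() pattern; not the
 * actual source. "current" is the running task. */
int io_schedule_prepare(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;		/* count this task as waiting on fast I/O */
	return old_iowait;
}

void io_schedule_finish(int token)
{
	current->in_iowait = token;	/* restore the previous state */
}

void io_schedule(void)
{
	int token = io_schedule_prepare();

	schedule();			/* sleep until the I/O completes */
	io_schedule_finish(token);
}
```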
In theory, judicious use of the in_iowait flag can lead to significantly improved throughput for I/O-intensive tasks, and there are cases where that is demonstrably true. But the iowait handling can bring other problems, and its effectiveness is not always clear.
Iowait and io_uring
Back in July 2023, Andres Freund encountered a performance problem in the kernel. It was not quite as sensational as certain other problems he has run across, but still seemed worth fixing. He noticed that PostgreSQL processes using io_uring ran considerably slower (as in, 20-40% slower) than those using normal, synchronous I/O. In the synchronous case, the in_iowait flag was set, keeping the CPU out of deeper sleep states; that was not happening in the io_uring case. Freund's proposed fix was to set the in_iowait flag for tasks waiting on the io_uring completion queue; that recovered the lost performance and more. Io_uring maintainer Jens Axboe was quickly convinced; he merged the patch for the 6.5 kernel, and marked it for inclusion into the stable updates as well.
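In essence (a simplified before-and-after view, with the surrounding io_uring wait loop omitted), the fix changed the completion-queue wait to use the io_schedule() pattern shown above:

```c
/* Before: a plain sleep, invisible to the iowait machinery. */
ret = schedule_timeout(timeout);

/* After: io_schedule_timeout() sets in_iowait for the duration of
 * the wait, then clears it again. */
ret = io_schedule_timeout(timeout);
```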
Once that patch was distributed in stable kernels, though, problem reports like this one from Phil Elwell began to appear. Suddenly, tasks on the system were showing 100% iowait time, which looked like a confusing change of behavior: "I can believe that this change hasn't negatively affected performance, but the result is misleading," Elwell commented.
The root of the problem is the treatment of the iowait state as being something close to actually running. User-space tools (like top or mpstat) display it separately and subtract it from the idle time; the result is the appearance of a CPU that is running constantly, even though the CPU is actually idle almost all of the time. That can result in the creation of confused humans, but it can seemingly confuse various system-management tools as well, leading them to think that a task with a lot of iowait time has gone off the rails.
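Those tools read the iowait figure from /proc/stat, where it is the fifth field on each "cpu" line (after user, nice, system, and idle time). Here is a minimal sketch of the kind of computation they perform; it samples once since boot, where real tools sample twice and report the difference over the interval:

```c
/* Minimal sketch: read the aggregate "cpu" line from /proc/stat and
 * print the iowait share of all CPU time since boot. */
#include <stdio.h>

int main(void)
{
	unsigned long long v[8] = {0};
	unsigned long long total = 0;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;
	/* Fields: user nice system idle iowait irq softirq steal */
	if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3],
		   &v[4], &v[5], &v[6], &v[7]) != 8) {
		fclose(f);
		return 1;
	}
	fclose(f);

	for (int i = 0; i < 8; i++)
		total += v[i];
	printf("iowait: %.1f%% of CPU time since boot\n",
	       total ? 100.0 * v[4] / total : 0.0);
	return 0;
}
```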
Axboe responded with a change causing in_iowait to only be set in cases where there were active operations outstanding; it was merged later in the 6.5 cycle. That addressed the immediate reports, but has not put an end to the complaints overall. For example, in February, David Wei pointed out that tools can still be confused by high iowait times; he included a patch to allow users to configure whether the in_iowait flag would be set or not. That patch went through a few variants, but was never merged.
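In outline, the change checks whether the waiting task actually has requests in flight before marking it. The sketch below is based on the description of the fix; the names follow the upstream code, but the logic is simplified and io_uring's bookkeeping details are elided:

```c
/* Sketch: only count the waiter as being in iowait if it has io_uring
 * requests in flight; a task waiting on an otherwise-empty ring is
 * just idle. Simplified from the upstream logic. */
static bool current_pending_io(void)
{
	struct io_uring_task *tctx = current->io_uring;

	return tctx && percpu_counter_read_positive(&tctx->inflight);
}

static long io_cqring_wait_schedule(long timeout)
{
	bool iowait = current_pending_io();
	long ret;

	if (iowait)
		current->in_iowait = 1;
	ret = schedule_timeout(timeout);	/* wait for completions */
	current->in_iowait = 0;
	return ret;
}
```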
Pavel Begunkov had objected to an early version of Wei's patch, saying that exposing more knobs to user space was not the right approach. Instead, he said, it would be better to separate the concepts of reporting iowait time to user space and influencing CPU-frequency selection.
It took a while, but Axboe eventually went with that approach. His patch series, now in its sixth version, splits the in_iowait bit into two. One of those (still called in_iowait) is used in CPU-frequency decisions, while the other (in_iowait_acct) controls whether the process appears to be in the iowait state to user space. Most existing code in the kernel sets both bits, yielding the same user-space-visible behavior as before, but io_uring sets only in_iowait. That, Axboe hopes, will bring an end to complaints about excessive iowait time.
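In rough terms, the split looks like the sketch below; the field names come from the description above, while the helper is hypothetical and only illustrates the intended division of labor:

```c
struct task_struct {
	/* ... */
	unsigned int in_iowait:1;	/* feeds cpufreq/cpuidle heuristics */
	unsigned int in_iowait_acct:1;	/* reported to user space as iowait */
	/* ... */
};

/* Hypothetical helper: most kernel code sets both bits, preserving the
 * old user-visible accounting; io_uring would pass acct = false. */
static inline void task_enter_iowait(struct task_struct *p, bool acct)
{
	p->in_iowait = 1;
	p->in_iowait_acct = acct;
}
```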
This change is not universally popular; Peter Zijlstra expressed some frustration over the seeming papering-over of the problem: "are we really going to make the whole kernel situation worse just because there's a bunch of broken userspace?" User space is what it is, though, and Axboe's patch set can address some of the complaints coming from that direction, in the short term at least.
Eliminating iowait
The discussion on the visibility of the iowait state has brought to the fore a related topic: does the iowait mechanism make any sense at all? Or might iowait be a heuristic that has outlived its time? Christian Loehle thinks that may be the case, and is working to remove the iowait behavior entirely.
There are a number of problems with how iowait works now, he said. A CPU-idle governor might keep a CPU in a higher-power state in anticipation that some iowait tasks will soon become runnable, but there is no guarantee that any of those tasks will actually wake in a short period of time. "Fast I/O" is not defined anywhere, and the kernel has no sense of how long an I/O operation will actually take. So the CPU could be wasting power with nothing to do. When a task does wake, the scheduler will pick what appears to be the best CPU to run it on; that may not be the CPU that was kept hot for it.
Boosting a CPU's frequency after a task wakes may appear to avoid these problems, but it brings difficulties of its own. A task can migrate at any time, leaving its boosted CPU behind. The targeted tasks run for short periods of time; the fact that they do not use a lot of CPU time is why the separate boosting mechanism was seen as necessary in the first place. But changing a CPU's frequency is not an instant operation; the iowait task is likely to have gone back to sleep before the CPU ramps up to the new speed. That means that the CPU must be kept at the higher speed while the task sleeps, so that the boost can at least help it the next time it wakes. But, again, nobody knows when that will be or whether the task will wake on the same CPU.
On top of all this, Loehle asserted that CPU-frequency boosting is often not helpful to I/O-intensive tasks in any case. All of this reasoning (and more) can be found in the above-linked patch series, which removes the use of iowait in CPU-idle and CPU-frequency management entirely. On the idle side, Loehle noted that the timer events oriented (TEO) governor gets good results despite having never used iowait, showing that the iowait heuristics are not performance-critical. So, along with removing the use of iowait, the patch series makes TEO into the default CPU-idle governor, in place of the menu governor that is the default in current kernels.
Loehle insisted that the iowait heuristics are only useful for "synthetic benchmarks". For the io_uring case described above, he said, the real problem was the CPU-idle governor using iowait (or the lack thereof) to put the CPU into a deeper sleep state. His patch series removes that behavior, so there is no longer any need for io_uring to set the in_iowait flag, or for changes to how iowait tasks are reported to user space.
He clearly thinks that this is the proper way to solve the problem; he described Axboe's patch series as "a step in the wrong direction". Axboe, though, does not want to wait for the iowait removal to run its course; his series solves the problem he is facing, he said, and it can always be removed later if iowait goes away.
Chances are that things will play out more-or-less that way. Axboe's patches could land as soon as 6.12, bringing an end (hopefully) to complaints about how io_uring tasks appear to be using a lot of CPU time. Heuristics, though, have been built up over a long time and can be harder to get rid of; there will be a need for a lot of testing and benchmarking to build confidence that changing the iowait heuristics will not cause some workloads to slow down. So Loehle's patch series can be expected to take rather longer to get to a point where it can be merged.