The trouble with iowait
In theory, a task that is waiting for short-term I/O (a state referred to in the kernel as "iowait") will need to execute soon. That means that there can be some advantages to treating the task as if it were executing now. The kernel maintains a one-bit field (called in_iowait) in the task_struct structure to mark such tasks. This bit is set prior to waiting for an I/O operation that is expected to be fast (typically a block I/O operation) and cleared once the operation completes; a sketch of this pattern appears after the list below. The kernel then uses this information in a couple of ways:
- When an iowait task wakes on completion of the I/O, the scheduler will inform the CPU-frequency governor. The governor, in turn, may choose to run the target CPU at a higher frequency than it otherwise would. Normally, the CPU-frequency decision is driven by the level of utilization of the processor, but tasks performing a lot of I/O may not run up a lot of CPU time. That can lead the CPU-frequency governor to choose a slower frequency than is optimal, with the result that the next I/O operation is not launched as quickly and throughput suffers. Raising the frequency for iowait tasks is meant to help them keep the I/O pipeline full.
- If a CPU goes idle, the system will normally try to put it into a lower-power state to save energy. The deeper the sleep state, though, the longer it takes for the CPU to get back to work when a runnable task is placed on it. The number of iowait tasks queued for a CPU is used as a signal indicating upcoming CPU demand; the presence of those tasks can cause the governor to choose a shallower sleep state than it would otherwise.
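The marking itself is simple; the kernel's io_schedule() helpers follow roughly the pattern below. This is a simplified, kernel-style sketch, not buildable on its own; the real code, in kernel/sched/core.c, also flushes plugged block requests and handles other details omitted here:

```c
/* Simplified sketch of the kernel's io_schedule() pattern; not the
 * actual source. "current" is the running task. */
int io_schedule_prepare(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;		/* count this task as waiting on fast I/O */
	return old_iowait;
}

void io_schedule_finish(int token)
{
	current->in_iowait = token;	/* restore the previous state */
}

void io_schedule(void)
{
	int token = io_schedule_prepare();

	schedule();			/* sleep until the I/O completes */
	io_schedule_finish(token);
}
```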
In theory, judicious use of the in_iowait flag can lead to significantly improved throughput for I/O-intensive tasks, and there are cases where that is demonstrably true. But the iowait handling can bring other problems, and its effectiveness is not always clear.
Iowait and io_uring
Back in July 2023, Andres Freund encountered a performance problem in the kernel. It was not quite as sensational as certain other problems he has run across, but still seemed worth fixing. He noticed that PostgreSQL processes using io_uring ran considerably slower (as in, 20-40% slower) than those using normal, synchronous I/O. In the synchronous case, the in_iowait flag was set, keeping the CPU out of deeper sleep states; that was not happening in the io_uring case. Freund's proposed fix was to set the in_iowait flag for tasks waiting on the io_uring completion queue; that recovered the lost performance and more. Io_uring maintainer Jens Axboe was quickly convinced; he merged the patch for the 6.5 kernel, and marked it for inclusion into the stable updates as well.
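In essence (a simplified before-and-after view, with the surrounding io_uring wait loop omitted), the fix changed the completion-queue wait to use the io_schedule() pattern shown above:

```c
/* Before: a plain sleep, invisible to the iowait machinery. */
ret = schedule_timeout(timeout);

/* After: io_schedule_timeout() sets in_iowait for the duration of
 * the wait, then clears it again. */
ret = io_schedule_timeout(timeout);
```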
Once that patch was distributed in stable kernels, though, problem reports like this one from Phil Elwell began to appear. Suddenly, tasks on the system were showing 100% iowait time, which looked like a confusing change of behavior: "I can believe that this change hasn't negatively affected performance, but the result is misleading," Elwell commented.
The root of the problem is the treatment of the iowait state as being something close to actually running. User-space tools (like top or mpstat) display it separately and subtract it from the idle time; the result is the appearance of a CPU that is running constantly, even though the CPU is actually idle almost all of the time. That can result in the creation of confused humans, but it can seemingly confuse various system-management tools as well, leading them to think that a task with a lot of iowait time has gone off the rails.
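Those tools read the iowait figure from /proc/stat, where it is the fifth field on each "cpu" line (after user, nice, system, and idle time). Here is a minimal sketch of the kind of computation they perform; it samples once since boot, where real tools sample twice and report the difference over the interval:

```c
/* Minimal sketch: read the aggregate "cpu" line from /proc/stat and
 * print the iowait share of all CPU time since boot. */
#include <stdio.h>

int main(void)
{
	unsigned long long v[8] = {0};
	unsigned long long total = 0;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;
	/* Fields: user nice system idle iowait irq softirq steal */
	if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3],
		   &v[4], &v[5], &v[6], &v[7]) != 8) {
		fclose(f);
		return 1;
	}
	fclose(f);

	for (int i = 0; i < 8; i++)
		total += v[i];
	printf("iowait: %.1f%% of CPU time since boot\n",
	       total ? 100.0 * v[4] / total : 0.0);
	return 0;
}
```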
Axboe responded with a change causing in_iowait to only be set in cases where there were active operations outstanding; it was merged later in the 6.5 cycle. That addressed the immediate reports, but has not put an end to the complaints overall. For example, in February, David Wei pointed out that tools can still be confused by high iowait times; he included a patch to allow users to configure whether the in_iowait flag would be set or not. That patch went through a few variants, but was never merged.
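In outline, the change checks whether the waiting task actually has requests in flight before marking it. The sketch below is based on the description of the fix; the names follow the upstream code, but the logic is simplified and io_uring's bookkeeping details are elided:

```c
/* Sketch: only count the waiter as being in iowait if it has io_uring
 * requests in flight; a task waiting on an otherwise-empty ring is
 * just idle. Simplified from the upstream logic. */
static bool current_pending_io(void)
{
	struct io_uring_task *tctx = current->io_uring;

	return tctx && percpu_counter_read_positive(&tctx->inflight);
}

static long io_cqring_wait_schedule(long timeout)
{
	bool iowait = current_pending_io();
	long ret;

	if (iowait)
		current->in_iowait = 1;
	ret = schedule_timeout(timeout);	/* wait for completions */
	current->in_iowait = 0;
	return ret;
}
```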
Pavel Begunkov had objected to an early version of Wei's patch, saying that exposing more knobs to user space was not the right approach. Instead, he said, it would be better to separate the concepts of reporting iowait time to user space and influencing CPU-frequency selection.
It took a while, but Axboe eventually went with that approach. His patch series, now in its sixth version, splits the in_iowait bit into two. One of those (still called in_iowait) is used in CPU-frequency decisions, while the other (in_iowait_acct) controls whether the process appears to be in the iowait state to user space. Most existing code in the kernel sets both bits, yielding the same user-space-visible behavior as before, but io_uring sets only in_iowait. That, Axboe hopes, will bring an end to complaints about excessive iowait time.
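In rough terms, the split looks like the sketch below; the field names come from the description above, while the helper is hypothetical and only illustrates the intended division of labor:

```c
struct task_struct {
	/* ... */
	unsigned int in_iowait:1;	/* feeds cpufreq/cpuidle heuristics */
	unsigned int in_iowait_acct:1;	/* reported to user space as iowait */
	/* ... */
};

/* Hypothetical helper: most kernel code sets both bits, preserving the
 * old user-visible accounting; io_uring would pass acct = false. */
static inline void task_enter_iowait(struct task_struct *p, bool acct)
{
	p->in_iowait = 1;
	p->in_iowait_acct = acct;
}
```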
This change is not universally popular; Peter Zijlstra expressed some frustration over the seeming papering-over of the problem: "are we really going to make the whole kernel situation worse just because there's a bunch of broken userspace?" User space is what it is, though, and Axboe's patch set can address some of the complaints coming from that direction, in the short term at least.
Eliminating iowait
The discussion on the visibility of the iowait state has brought to the fore a related topic: does the iowait mechanism make any sense at all? Or might iowait be a heuristic that has outlived its time? Christian Loehle thinks that may be the case, and is working to remove the iowait behavior entirely.
There are a number of problems with how iowait works now, he said. A CPU-idle governor might keep a CPU in a higher-power state in anticipation that some iowait tasks will soon become runnable, but there is no guarantee that any of those tasks will actually wake in a short period of time. "Fast I/O" is not defined anywhere, and the kernel has no sense of how long an I/O operation will actually take. So the CPU could be wasting power with nothing to do. When a task does wake, the scheduler will pick what appears to be the best CPU to run it on; that may not be the CPU that was kept hot for it.
Boosting a CPU's frequency after a task wakes may appear to avoid these problems, but it brings difficulties of its own. A task can migrate at any time, leaving its boosted CPU behind. The targeted tasks run for short periods of time; the fact that they do not use a lot of CPU time is why the separate boosting mechanism was seen as necessary in the first place. But changing a CPU's frequency is not an instant operation; the iowait task is likely to have gone back to sleep before the CPU ramps up to the new speed. That means that the CPU must be kept at the higher speed while the task sleeps, so that the boost can at least help it the next time it wakes. But, again, nobody knows when that will be or whether the task will wake on the same CPU.
On top of all this, Loehle asserted that CPU-frequency boosting is often not helpful to I/O-intensive tasks in any case. All of this reasoning (and more) can be found in the above-linked patch series, which removes the use of iowait in CPU-idle and CPU-frequency management entirely. On the idle side, Loehle noted that the timer events oriented (TEO) governor gets good results despite having never used iowait, showing that the iowait heuristics are not performance-critical. So, along with removing the use of iowait, the patch series makes TEO into the default CPU-idle governor, in place of the menu governor that is the default in current kernels.
Loehle insisted that the iowait heuristics are only useful for "synthetic benchmarks". For the io_uring case described above, he said, the real problem was the CPU-idle governor using iowait (or the lack thereof) to put the CPU into a deeper sleep state. His patch series removes that behavior, so there is no longer any need for io_uring to set the in_iowait flag, or for changes to how iowait tasks are reported to user space.
He clearly thinks that this is the proper way to solve the problem; he described Axboe's patch series as "a step in the wrong direction". Axboe, though, does not want to wait for the iowait removal to run its course; his series solves the problem he is facing, he said, and it can always be removed later if iowait goes away.
Chances are that things will play out more-or-less that way. Axboe's patches could land as soon as 6.12, bringing an end (hopefully) to complaints about how io_uring tasks appear to be using a lot of CPU time. Heuristics, though, have been built up over a long time and can be harder to get rid of; there will be a need for a lot of testing and benchmarking to build confidence that changing the iowait heuristics will not cause some workloads to slow down. So Loehle's patch series can be expected to take rather longer to get to a point where it can be merged.