Utilization inversion and proxy execution

By Jonathan Corbet
May 15, 2020

Over the years, the kernel's CPU scheduler has become increasingly aware of how much load every task is putting on the system; this information is used to make smarter task placement decisions. Sometimes, though, this logic can go wrong, leading to a situation that Valentin Schneider describes as "utilization inversion". At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), he described the problem and some approaches that are being considered to address it.

Utilization tracking, initially conceived as per-entity load-tracking or PELT, gives the scheduler information about what the impact of running a task on a given CPU will be. It is used in the schedutil CPU-frequency governor to select a frequency appropriate to the current workload. On Arm big.LITTLE systems, where some processors are faster and more power-hungry than others, the utilization-tracking signal is also used to decide which type of CPU a task should run on. The situation is complicated a bit by utilization clamping, which allows an administrator to make a task's utilization appear larger or smaller than it really is. Clamping is used to bias the placement of specific tasks and influences CPU-frequency selection as well.

Imagine, Schneider said, a large task (one with a high utilization) and a small task, both of which are contending for the same lock. The large task may have a high minimum clamp, so it looks like an even bigger load even when it is not doing much; the small task, instead, may have a low maximum, ensuring that it always looks small. One would expect the large task to run on a big CPU at a high frequency while the small task is consigned to a small CPU at a low frequency. If the small task grabs the lock, the large task's progress suddenly depends on how quickly the small task can progress.

This situation is similar to priority inversion, though the problem is not as severe. Even so, it would be better if the small task could inherit some of the large task's resources while it holds the lock.

The kernel's realtime mutexes can handle priority inheritance now; if a high-priority task contends for a lock held by a low-priority task, the latter will have its priority boosted until it drops the lock. Priority inheritance can help, but it only affects process priority; it can force preemption, but it does not really change task placement or CPU frequency. Perhaps the kernel could gain a similar mechanism for utilization that would help for placement, at least, if not CPU frequency. Schneider expressed skepticism that such an approach could work well, though.

An alternative he has been working on is proxy execution: giving the lock-holding task the waiting task's execution parameters until it lets go of the lock. This is a work in progress, he said, that doesn't survive for more than 20 seconds on real hardware, and it has no provision for futexes (user-space locks), but it still has some interesting properties, he said.

With proxy execution, a task that blocks on a mutex is not removed from the run queue as it would be in a mainline kernel. It can thus be picked to run by the scheduler in the usual way if it's the highest-priority task in the queue. When that happens, though, the lock-holding task inherits the blocked task's scheduling context. The blocked task is also migrated to the run queue of the lock holder, which brings its utilization information over; that will cause the CPU frequency to be increased, helping the lock holder to get its work done and release the lock.

That solves the problem reasonably well on symmetric multiprocessor systems, but it still falls short on asymmetric systems like big.LITTLE. To address such systems, Schneider would like to put the utilization-tracking information into the scheduling context, where it can be passed more directly to a lock holder. This has to be done carefully, though, or it could create priority inversions of its own; if a low-utilization task is picked to run, it could end up slowing a high-utilization task. Making a smart choice is hard, though, since the utilization signals are highly variable and hard to track in the proxy-execution code. The solution might be to ignore the utilization values and just look at the clamps.

Juri Lelli asked why this mattered, since the clamp values are already aggregated on each run queue. That works for frequency selection, Schneider answered, but it has no influence on task selection, so it doesn't help to ensure that the lock-holding task actually runs.

Then, there is the perennial problem of load balancing. Utilization signals are highly useful here, since they let the scheduler ensure that the load on each CPU is about the same. But what should be done in the proxy execution case? Currently, load-balancing decisions will use the scheduling context of the donor task (the one waiting for the lock), which could lead to interesting decisions. Since contending tasks remain on the run queue, the apparent load on the CPU increases, which can throw things off as well. Peter Zijlstra said that this isn't necessarily a big problem; one does not expect locks to be held for long periods, so things should straighten themselves out relatively quickly.

Patrick Bellasi asked whether just relying on clamp values is sufficient, or whether the load-tracking signal should be used too. Schneider responded that using the clamps really is the best that can be done; there is no choice. Utilization values simply change too quickly to be useful.

Heading toward a conclusion, Schneider said that getting proxy execution working right is his first priority; presumably rebooting after 20 seconds of uptime is getting a little tiresome. He asked whether other developers were interested in proxy execution as well. Zijlstra said that he has been trying to get it to the top of his list for a long time, but has been "failing miserably".

Qais Youssef asked how quickly this work might be done. The next Android release will not be happening for some time, so it would be nice if there were some way to fix this problem in the short term. Could the realtime mutex code help? Zijlstra responded that realtime mutexes are really for realtime processes and won't help with tasks in the completely fair scheduling class, as most Android tasks are. We will get the problem solved when we do, he said.

The session concluded with numerous developers saying that they would like to have a working proxy execution mechanism in the kernel, but nobody has found the time to work on it.

Index entries for this article
Kernel	Scheduler/Load tracking
Kernel	Scheduler/Proxy execution
Conference	OS-Directed Power-Management Summit/2020

Utilization inversion and proxy execution

Posted May 18, 2020 16:40 UTC (Mon) by Sesse (subscriber, #53779) [Link] (4 responses)

Wow, so with proxy execution, if A sleeps holding a lock and B needs it, A can be scheduled on B's timeslice? (Can it just context-switch directly to B?) That sounds really neat; so neat that I can't immediately understand why it hasn't been done before. Maybe it's because you just shouldn't be sleeping while holding a lock in the first place :-) I wonder if you one could do similar things for condition variables or the likes.

Utilization inversion and proxy execution

Posted May 18, 2020 17:11 UTC (Mon) by dgm (subscriber, #49227) [Link] (1 responses)

This all seems to suggest that priority should often be a property of the data, not the process.
I don't know if it has been ever tested (a quick googling didn't turn up anything), but it would be interesting to assign priorities to sockets, files, queues or semaphores in addition to/instead of processses.

Utilization inversion and proxy execution

Posted May 19, 2020 18:18 UTC (Tue) by glenn (subscriber, #102223) [Link]

Assigning priority to data would be equivalent to assigning priority to a critical section. Data and critical section can both be viewed as a resource with an assigned priority. The priority ceiling locking protocol (PCP) assigns priorities to mutexes, and hence critical sections. I believe this was the inspiration for the pthread mutex attr PTHREAD_PRIO_PROTECT.

I think a looser coupling of priority from thread of execution makes more sense. Think of each thread of execution being put in a bucket*, and the CPU scheduler schedules buckets instead of threads. By default, a new thread/process is put in a new bucket. When the CPU schedules a bucket, it delegates to the bucket which thread to schedule. I think this scheme can simplify priority inheritance implementation: a thread inherits a priority by being moved to the bucket of the blocked higher-priority task from which it inherits. The CPU scheduler doesn't need to know about inheritance, because it only schedules buckets.**

I believe both MPI/UNC's LitmusRT and CMU's Linux/RK kernels supports these concepts.

* Literature will call these buckets "servers," but I feel like this term has gotten too overloaded over the years. Speaking of renaming classic terms, I think proxy execution is simply "bandwidth inheritance." Is this right?

** Things can get really crazy when you consider deadline-scheduling. A bucket that looses CPU time due to donating its CPU time to a thread of another bucket is reciprocated with lower-priority budget of the inheriting thread in exchange. Because we're talking about deadline-scheduling, this lower-priority budget _may_ become valuable in the future. The availability of this budget may reduce the need for inheritance actions in the future. I'm not aware of any implementations of this budget-exchange system. I think maintaining a ledger of past budget exchanges could be quite complicated. (I can't find a reference to the academic paper that proposed this scheme.)

Utilization inversion and proxy execution

Posted May 18, 2020 17:27 UTC (Mon) by Wol (subscriber, #4433) [Link]

> Maybe it's because you just shouldn't be sleeping while holding a lock in the first place :-)

Sounds great, but I guess in practice that's impossible. Especially if it's user space locking resources it needs. Locks are just a nightmare, full stop, but you need them to stop processes trampling over each other.

That's why they have things like the deadlock killer - if I'm holding a lock you want, and then I try and claim a lock you're holding, one of us is going to get killed ... there's just no easy way to manage locks.

Cheers,
Wol

Utilization inversion and proxy execution

Posted May 22, 2020 16:55 UTC (Fri) by raistlin (guest, #37586) [Link]

It is, IMO, very neat. The problem is that it's not trivial to decide what to do, on an SMP systems... And, believe me, we thought about just turning CONFIG_SMP off, but we've been told that wouldn't end up well. :-)

So, imagine that A owns a lock, and that both B and C want it, so they both block on A.

At this point, A can run:
1) in its own timeslice, and
2) in B's timeslice, and
3) in C's timeslice.

Which is cool, because being able to run for sooo long, guarantees that A will be quick with the critical section on the lock! So far so good then, we just have to leave both B and C in the runqueue, and when the scheduler picks A, we ran A (of course) but also when the scheduler pick B, we run A and when it picks C, we run A.

Now, what about we are on a 4 CPUs system and, the scheduler, all at the same time:
- picks A on CPU 0
- picks B on CPU 2
- picks C on CPU 3

There. We obviously can run A on 3 CPUs at the same time (that would be _really_funny_!! :-D )... And hence you have to deal with a bunch of special cases and the "neatness" rapidly sinks. :-(

There are solutions, from a theoretical (as in, academic papers) point of view, but they're not the most beautiful to implement in real-world Linux kernel.

Smart people are looking into solutions, and I'm confident they'll find a way to get around the problems. But yeah, there it is why we're not there yet :-)