Coscheduling: simultaneous scheduling in control groups
The core idea behind coscheduling is the marking of one or more control groups as containing processes that should be run together. If one process in a coscheduled group is running on a specific set of CPUs (more on that below), only processes from that group will be allowed to run on those CPUs. This rule holds even to the point of forcing some of the CPUs to go idle if the given control group lacks runnable processes, regardless of whether processes outside the group are runnable.
Why might one want to do such a thing? Schönherr lists four motivations for this work, the first of which is virtualization. That may indeed be the primary motivation, given that Schönherr is posting from an Amazon address, and Amazon is rumored to be running a virtualized workload or two. A virtual machine usually contains multiple processes that interact with each other; these machines will run more efficiently (and with lower latencies) if those processes can run simultaneously. Coscheduling would ensure that all of a virtual machine's processes are run together, maximizing locality and minimizing the latencies of the interactions between them.
Coscheduling can also be good for multiprocess applications that benefit from sharing the processor cache. If those processes are trying to access the same data, having them running together on nearby CPUs will improve their cache locality, and thus their performance. Interestingly, coscheduling might also help when used with independent processes that do not share or, more to the point, contend for any resources. Such processes should be able to run simultaneously without interfering (much) with each other. Coscheduling could ensure that such processes run at the same time, while avoiding mixes of workloads that are more likely to contend with each other.
Finally, coscheduling can be used to prevent certain processes from running simultaneously. For example, the L1 terminal fault hardware vulnerability can be used by a hostile process to attack others running on a sibling processor (on systems where hyperthreading is in use) — but not if those processes are in separate coscheduled groups that cannot run together. Temporally isolating applications in this manner has the potential to close off a number of side-channel attacks; it can be used as a form of mutual exclusion to prevent some types of race conditions as well.
So there is potential value in coscheduling, but realizing that value is not a straightforward proposition. The coscheduling patch set contains a full 60 patches and is clearly not yet ready to be considered for upstream inclusion. Even in its current form, it is sufficiently intimidating that getting serious review is going to be a challenge. But one has to start somewhere; this patch set shows how coscheduling can work and will allow developers to consider the impact on the scheduler as a whole.
Scheduling domains and run queues
Computer systems are organized in a hierarchical manner. Even a relatively
simple machine may have two physical sockets for processor chips. Each
socket will provide a number of CPUs, and each CPU may run multiple threads
(if hyperthreading is enabled). Different resources (caches, for example)
are shared at each level of the hierarchy. Scheduling decisions must take
this hierarchy into account to make optimal use of the hardware. To do
that, the scheduler creates and uses a set of data structures referred to as scheduling domains. The diagram on the right
shows how the scheduling domain layout might look for a simple desktop system
containing a single socket and two cores; each core implements what looks
like two CPUs, but which are actually hyperthreaded siblings sharing the
same underlying hardware. The scheduler uses this hierarchy to know, for
example, that moving a process from CPU 0 to CPU 1 would be less
costly than moving it to CPU 2.
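For readers who prefer code to diagrams, the following toy C program models the two-core, hyperthreaded example system described above. It is only a sketch; the kernel's real scheduling-domain machinery (struct sched_domain and friends) is far more involved, and the arrays and helper here are hypothetical. It simply shows why a move from CPU 0 to CPU 1 crosses a lower level of the hierarchy than a move to CPU 2.

```c
/*
 * Toy model of the scheduling-domain hierarchy described above.  The
 * real kernel uses struct sched_domain; these simplified tables are
 * hypothetical and exist only to illustrate the idea.
 */
#include <stdio.h>

#define NR_CPUS 4

/* CPUs 0/1 are siblings in core 0; CPUs 2/3 are siblings in core 1. */
static const int core_of[NR_CPUS]   = { 0, 0, 1, 1 };
static const int socket_of[NR_CPUS] = { 0, 0, 0, 0 };

/* Lowest hierarchy level shared by two CPUs: 0 = core, 1 = socket. */
static int shared_level(int a, int b)
{
	if (core_of[a] == core_of[b])
		return 0;
	if (socket_of[a] == socket_of[b])
		return 1;
	return 2;	/* different sockets (not present in this example) */
}

int main(void)
{
	/* Migrating between SMT siblings is cheaper than crossing cores. */
	printf("CPU 0 -> CPU 1 shares level %d\n", shared_level(0, 1)); /* 0 */
	printf("CPU 0 -> CPU 2 shares level %d\n", shared_level(0, 2)); /* 1 */
	return 0;
}
```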
In current kernels, the scheduler creates a run queue for each CPU in the system; each queue holds all of the runnable processes waiting for its associated CPU. The scheduling decision for any given CPU involves only that CPU's run queue, keeping the decision process local; processes are occasionally moved between run queues as needed to keep the system as a whole balanced. Run queues are, thus, a feature found at the bottom level of the scheduling domain hierarchy. That changes with coscheduling, which attaches new run queues at the higher levels of the hierarchy. As a simple example, it would add two queues to the system diagrammed above: one each for Core 0 and Core 1; it could also add a queue at the socket level.
Whenever a coscheduled process is scheduled to run on a given core, the core-level run queue will (simplifying a bit) be filled with the other processes in the same control group. At that point, all of the CPUs from that level of the hierarchy on down (two of them, in this case) will be constrained to select processes only from that coscheduled group; any other process that is in the run queue for those processors will have to wait. As noted above, this policy extends to forcing one or more CPUs to go idle if there are not enough runnable coscheduled processes to keep them busy, even if other processes would like to run.
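The effect of that rule can be sketched in a few lines of C. This is not code from the patch set; it is a hypothetical illustration of how a per-CPU pick might be filtered by whichever coscheduled group currently owns the core-level queue, forcing the CPU idle when nothing from the owning group is runnable.

```c
/*
 * Hypothetical sketch of the selection rule described above: if the
 * core-level queue is owned by a coscheduled group, the CPUs below it
 * may only pick tasks from that group, idling otherwise.
 */
#include <stdio.h>

struct task { const char *name; int group; };

/* Which coscheduled group owns the core right now; -1 would mean "none". */
static int core_owner_group = 1;

/* A trivially small per-CPU run queue for the example. */
static struct task cpu0_queue[] = {
	{ "vcpu-a",    1 },	/* belongs to the coscheduled group */
	{ "batch-job", 0 },	/* unrelated task, must wait */
};

static const struct task *pick_next(const struct task *q, int n)
{
	for (int i = 0; i < n; i++)
		if (core_owner_group < 0 || q[i].group == core_owner_group)
			return &q[i];
	return NULL;	/* forced idle, even though other tasks are queued */
}

int main(void)
{
	const struct task *t = pick_next(cpu0_queue, 2);

	printf("CPU 0 runs: %s\n", t ? t->name : "(idle)");
	return 0;
}
```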
At some point, the scheduler will stop running the coscheduled group, at which point the processors it was using will become available for other processes — or for another coscheduled control group. That could happen if all processes in the group block, or if the scheduler decides to preempt the group as a whole. The latter decision involves some complexity, requiring the designation of a "leader" CPU for each level of the hierarchy where coscheduling is enabled.
Administration and rough edges
The control of coscheduling happens at two points, the first of which is the cosched_max_level kernel command-line option; its value is the highest level above the bottom of the scheduling-domain hierarchy at which coscheduling can be performed. Setting it to zero (the default) disables coscheduling entirely. Setting it to one allows coscheduling at the core level, while setting it to two would allow coscheduling at the socket level. Enabling coscheduling for any particular control group is a matter of setting the new cpu.scheduled knob to the level at which coscheduling should happen; setting it to one causes that group to be coscheduled at the core level, for example. This value can not be set above the system-wide maximum set at boot time.
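As a concrete (and hedged) illustration, the sketch below writes to the cpu.scheduled knob of a hypothetical control group. The cgroup mount point and the group name ("vm1") are assumptions made for the example; only the cpu.scheduled knob and the cosched_max_level boot parameter come from the patch set, and the kernel would have to have been booted with something like cosched_max_level=1 for the write to be accepted.

```c
/*
 * Hedged sketch of enabling core-level coscheduling for one control
 * group.  The path below (mount point and group name) is an assumption
 * for illustration; only the cpu.scheduled knob itself is described in
 * the patch set.
 */
#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/fs/cgroup/cpu/vm1/cpu.scheduled";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* 1 = coschedule this group at the core level. */
	fprintf(f, "1\n");
	fclose(f);
	return 0;
}
```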
That is the core of how the feature works. The details of actually making it work are rather more involved, naturally. Locking gets complicated when dealing with run queues at multiple levels of the hierarchy, leading to the introduction of two new locking protocols. Features like dynamic tick have to be disabled "for now". Preemption decisions become quite a bit more complicated, since a newly runnable process must preempt an entire coscheduled group (or not preempt at all). The complexities of adding this feature into the scheduler seem likely to inspire discussion for some time.
There are some practical difficulties as well. Scheduling domains are a convenient mechanism for the creation of what might be called "coscheduling domains", but users may well want to divide their systems differently. Some high-end processors can hold 20 or more cores in a socket; that is a lot of cores to hand over to a single coscheduled group, which may not need anywhere near that many. If the cosched_split_domains boot-time option is used, the coscheduling mechanism will reorganize the system's scheduling domains at boot time, adding a virtual layer that splits large groups of processors into smaller ones. That helps the coscheduling use case, but it is also a sign that scheduling domains are being asked to fill a role they were not intended for.
With the current patch set, administrators must also use CPU affinities to force the processes running in a coscheduled group to run within a specific set of processors; the scheduler cannot, by itself, perform load balancing on such groups. So, for example, on the simple system described above, all of the processes in a coscheduled control group would have to be constrained to run only on CPUs 0 and 1, while another group would be bound to CPUs 2 and 3. The resulting administrative busy-work required to use this feature on a large system seems likely to lose its charm relatively quickly. There is also the little problem that transitions into and out of the coscheduling mode are not entirely atomic, so any use case that depends on 100% isolation of the coscheduled processes may not be served as desired.
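The affinity side of that setup can be handled with the ordinary sched_setaffinity() interface; nothing in the sketch below is specific to the coscheduling patches. It pins the calling process to CPUs 0 and 1 of the example system, and every process in the coscheduled group would need similar treatment.

```c
/*
 * Minimal sketch of the affinity busy-work described above: pin the
 * calling process to CPUs 0 and 1 so that its coscheduled group stays
 * within a single core of the example system.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	CPU_SET(1, &mask);

	/* pid 0 means "the calling process". */
	if (sched_setaffinity(0, sizeof(mask), &mask)) {
		perror("sched_setaffinity");
		return 1;
	}
	puts("pinned to CPUs 0 and 1");
	return 0;
}
```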
The patch set is also lacking benchmarks, which will be necessary before it
can be seriously considered. Developers will want to know what the
performance impact of coscheduling is, especially with regard to processes
that are not using the feature. But, in any case, merging of the
coscheduling feature is a distant prospect at this point. It is a huge
patch set, applied to one of the most sensitive parts of the kernel, that is
making its first public appearance; there will undoubtedly be a lot of
changes required before it can be considered ready. This posting is just
the beginning of a long road; the end will only be reached if the
destination is deemed worth the trip.