A full task-isolation mode for the kernel
Some applications require guaranteed access to the CPU without even brief interruptions; realtime systems and high-bandwidth networking applications with user-space drivers can fall into that category. While Linux currently provides some support for CPU isolation (moving everything but the critical task off of one or more CPUs), it is an imperfect solution that is still subject to some interruptions. Work has been continuing in the community to improve the kernel's CPU-isolation capabilities, notably with improvements in the nohz (tickless) mode, but it is not finished yet. Recently, Alex Belits submitted a patch set (based on work by Chris Metcalf in 2015) that introduces a completely predictable environment for Linux applications — as long as they do not need any kernel services.
Nohz and task isolation
Currently, the nohz mode in Linux allows partial task isolation. It decreases the number of interrupts that the CPU receives; for example, the clock-tick interrupt is disabled for nearly all CPUs. However, nohz does not guarantee that there will be no interruptions; the running task can still be interrupted by page faults (though careful design of an application can avoid those) or delayed workqueues. The advantage of this mode is that tasks can run regular code, including system calls; the overhead is limited to the system-call entry and exit paths.
For some applications, the lack of absolute guarantees from nohz may cause problems. Consider, for example, high-performance, user-space network drivers that have a small number of CPU cycles in which to handle each packet; for those, interrupts and interrupt handling may cause a significant delay in their response, or even use up the entire time available. Realtime operating systems (RTOSes) can provide the needed guarantees, but they have limited hardware support; the authors of the patch feel that it is less work to develop and maintain interrupt-free applications than to support an RTOS next to Linux, as Belits explained in the patch posting.
These days, even embedded systems often contain a number of cores, and system designers are adding more of them for tasks requiring predictability.
The kernel currently has a couple of features meant to make it possible to run applications without interruptions: nohz (described above) and CPU isolation (or "isolcpus"). The latter feature isolates one or more CPUs — making them unavailable to the scheduler and only accessible to a process via an explicitly set affinity — so that any processes running there need not compete with the rest of the workload for CPU time. These features reduce interruptions on the isolated CPUs, but do not fully eliminate them; task isolation is an attempt to finish the job by removing all interruptions. A process that enters the isolation mode will be able to run in user space with no interference from the kernel or other processes.
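Since an isolated CPU is reachable only through an explicitly set affinity, the first step for an application is typically to pin itself to one of the CPUs in the isolcpus list. A minimal sketch (the helper name and CPU number are hypothetical; the CPU must match the boot-time configuration) could look like:

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to one isolated CPU; returns zero on success. */
    static int pin_to_isolated_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);       /* e.g. a CPU from the isolcpus list */
        return sched_setaffinity(0, sizeof(set), &set);
    }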
Configuring and activating task isolation
The authors assume that isolation is not needed in kernel space or during the task's initialization phase. A task enters the isolation mode at some point in time and stays in that mode until it leaves isolation on its own, performs some action that causes the isolation to be broken, or receives a signal directed at it.
The kernel needs to be compiled with the CONFIG_TASK_ISOLATION flag and then booted with the same options as for nohz mode with CPU isolation:
isolcpus=nohz,domain,CPULIST
where nohz disables the timer tick on the specified CPUs, domain removes the CPUs from the scheduling algorithms, and CPULIST is the list of CPUs where the isolation options are applied. Optionally, the task_isolation_debug kernel command-line option causes a stack backtrace when a task loses isolation.
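As a concrete example, a system dedicating CPUs 2 and 3 to isolated tasks, with the debugging option enabled, could boot with a command line like this (the CPU list is, of course, system-specific):

    isolcpus=nohz,domain,2-3 task_isolation_debug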
When a task has finished its initialization, it can activate isolation using the PR_SET_TASK_ISOLATION operation provided by the prctl() system call. This operation may fail for either permanent or temporary reasons. An example of a permanent error is when the task is running on a CPU that is not configured for isolation; in this case, entering isolation mode is not possible at all. Temporary errors are indicated by the EAGAIN error code; one example is a situation where the delayed workqueues could not be stopped. In such cases, a task that still wants to enter isolation may retry the operation, as it may succeed the next time.
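Entering isolation thus naturally takes the form of a retry loop. Here is a minimal sketch; the error handling is illustrative only, and the PR_* constants come from the patch set's UAPI headers rather than from a stock kernel:

    #include <sys/prctl.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void enter_isolation(void)
    {
        while (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
                     0, 0, 0) != 0) {
            if (errno != EAGAIN) {
                /* Permanent error, e.g. not running on an isolated CPU. */
                perror("prctl(PR_SET_TASK_ISOLATION)");
                exit(EXIT_FAILURE);
            }
            /* EAGAIN: a transient condition, such as delayed workqueues
               that could not yet be stopped; simply try again. */
        }
    }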
In the prctl() call, the developer may also configure the signal to be sent to the task when it loses isolation, using the PR_TASK_ISOLATION_SET_SIG() macro and passing it the signal to send. The call then looks like this:
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE
| PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
Here, the process has requested the receipt of a SIGUSR1 signal, rather than the default SIGKILL, should it lose isolation.
Losing isolation
The task will lose isolation if it enters kernel space as the result of a system call, a page fault, an exception, or an interrupt. When that happens, the (fatal by default) signal will be sent, with a couple of exceptions: a prctl() call turning off isolation, and exit() and exit_group(); those calls cause the task to exit, so the isolation mode simply ends at that point.
When the task loses isolation by any means other than the system calls above, it will receive a signal, SIGKILL by default, which terminates the task. The signal can be changed if the application prefers to catch it instead; this can be used, for example, if an application wants to log information about the lost isolation before exiting, or to attempt to rerun the code without isolation guarantees.
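As an illustrative sketch of that approach (the handler and flag names here are hypothetical, not from the patch set), the application installs a handler before requesting isolation:

    #include <signal.h>

    static volatile sig_atomic_t isolation_lost;

    static void on_isolation_lost(int sig)
    {
        (void)sig;
        isolation_lost = 1;     /* async-signal-safe; log it later */
    }

    /* ... during initialization ... */
    signal(SIGUSR1, on_isolation_lost);
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE
          | PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);

The main loop can then check isolation_lost, log the event, and decide whether to try re-entering isolation or to continue without its guarantees.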
The task can enter and leave isolation as it desires. To leave isolation without receiving a signal, it should call:
prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
The internals
When a process calls prctl() to enable task isolation, it is marked with the TIF_TASK_ISOLATION flag in the kernel. The main part of the job of setting up task isolation, though, is done when returning from that prctl() call. When the kernel returns to user space and sees the TIF_TASK_ISOLATION flag set, it arranges for the task not to be interrupted in the future: interrupts are disabled, and the kernel shuts down any events that might interrupt the isolated CPU. In the current patches, it disables the scheduler's clock tick and vmstat delayed work, and drains pages out of the per-CPU pagevec to avoid inter-processor interrupts (IPIs) for cache flushes. More isolation actions may be added in the future.
This isolation work is more straightforward in the current version than it was in the 2015 patch set. Since then, Linux has gained the ability to offload timer ticks from isolated CPUs to so-called "housekeeping" CPUs (all those not in the CPU list of the isolcpus option). That removes the need for special handling of pending timers on CPUs before they can be isolated.
The patch set also adds diagnostics on the non-isolated CPUs. If the kernel finds itself about to interrupt an isolated CPU, by sending an IPI or a TLB flush, for example, it will generate diagnostics (a warning in the kernel log by default, but a stack dump is also possible) on the interrupting CPU. An interrupt that is not handled by Linux, for example a hypervisor interrupt, can still end up causing a reschedule IPI to be sent to an isolated CPU, resulting in the signal that notifies the isolated task. With regard to that problem, Frédéric Weisbecker wondered whether support for hypervisors is even necessary, but no conclusion has been reached on this topic.
The task-isolation mode requires changes in the architecture code; the patch set includes implementations for x86, arm, and arm64. An architecture needs to define HAVE_ARCH_TASK_ISOLATION and the new TIF_TASK_ISOLATION task flag. It needs to change its interrupt and page-fault entry routines to add a call to task_isolation_interrupt() so that any isolated task will exit isolation. The reschedule-IPI handler should call task_isolation_remote() for the same purpose. Finally, the system-call code should invoke task_isolation_syscall() to check whether the call is allowed and, when exiting to user space, call task_isolation_check_run_cleanup() to run pending cleanup, followed by task_isolation_start() if the isolation flag is set for the current task.
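Schematically, the placement of those hooks looks as follows; this is a simplified sketch rather than code from the patch set, and the real functions take arguments that are elided here:

    arch_interrupt_entry():                 /* also page-fault entry */
        task_isolation_interrupt()          /* break isolation, send the signal */
    arch_reschedule_ipi():
        task_isolation_remote()             /* same, for remote interruptions */
    arch_syscall_entry():
        task_isolation_syscall()            /* is this system call allowed? */
    arch_exit_to_user_mode():
        task_isolation_check_run_cleanup()  /* run any pending cleanup */
        task_isolation_start()              /* only if TIF_TASK_ISOLATION is set */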
Apart from the changes in the architecture-specific code, adding the isolation feature caused several changes in other kernel subsystems. For example, in the network code, flush_all_backlogs() will enqueue work only on non-isolated CPUs. The trace ring buffer behaves on isolated CPUs in a similar way to offline ones — any updates will be done when the task exits isolation. Another change in the isolation mode is that kernel jobs are scheduled on housekeeping CPUs only. This includes tasks like probing for PCIe devices. Finally, kick_all_cpus_sync() has been modified to avoid scheduling interrupts on CPUs with isolated tasks. Weisbecker did not agree with this approach and listed a number of race conditions that may happen between this function and the task entering isolation. He suggested fixing the callers instead.
Summary
The patch set has received favorable initial reviews, and it seems that this feature is of interest to developers. There are still some unresolved comments to be addressed, and some patches have not yet received any reviews. The patch set changes some basic kernel functionality in subtle ways, so there will surely be questions about how the feature has been tested and, of course, about possible regressions. When those issues are resolved, it will likely be included in an upcoming kernel release.
| Index entries for this article | |
|---|---|
| Kernel | Scheduler/CPU isolation |
| GuestArticles | Rybczynska, Marta |