RCU update 2024 background material
Much of what follows refers to specific rows of the big RCU API table, which can be viewed separately with a suitably wide monitor.
RCU API purpose
This section reviews the "purpose" and "read-side constraints" rows of the big API table. Much more detail may be found in the Linux kernel's Documentation/RCU/Design/Requirements/Requirements.rst file.
The most common use of RCU is as a high-performance and scalable replacement for reader-writer locks that protect linked data structures. Examples of such data structures include those describing hardware configuration (for example, USB memory sticks might be inserted or removed at any time), in-memory caches (for example, the dentry cache that reflects portions of the file-system tree), and security policies. Note that these use cases feature frequent reads and rare updates.
The original RCU flavor was intended to replace reader-writer spinlocks, and as such, RCU inherited the prohibition against blocking within read-side critical sections. This prohibition enables extremely low-overhead RCU readers in CONFIG_PREEMPTION=n kernels, though this overhead is expected to increase slightly with the advent of lazy preemption.
Occasionally, someone would wish to block while in an RCU read-side critical section. However, throughout the 1990s and the first six years of the 2000s, such a wish always indicated that this someone did not understand RCU. But 2006 featured a notifier use case requiring rare sleeps within RCU read-side critical sections. This situation is in some ways similar to that of reader-writer mutexes, and the resulting SRCU (“sleepable RCU”) can be thought of as the RCU counterpart to the reader-writer mutexes. SRCU and Tasks Trace RCU are also the only RCU flavors whose read-side primitives may be freely invoked from deep within the idle loop and from offline CPUs.
Tracing posed interesting challenges to RCU. For example, some types of tracepoints make use of “trampolines”. When such a tracepoint is enabled, a trampoline is allocated and filled in with the necessary sequence of instructions, ending with a jump back to the location immediately following the original code. Then the original code is replaced with a jump to the trampoline. In some cases, this process is simplified by arranging for the original code to be a no-operation instruction. In many cases, the process for replacing the original code is quite elaborate due to complications such as variable-length instructions and instruction caches. But conceptually, life is simple: The trampoline is allocated, it is filled in with the necessary instructions, and the binary is rewritten to redirect control to the trampoline.
But how to disable a tracepoint?
In non-preemptible kernels, the fact that the code in the trampoline never sleeps enables the following sequence of steps, at least in code that RCU is watching:
- Revert the code that now jumps to the trampoline back to its original state, keeping in mind the complications noted above.
- Invoke synchronize_rcu().
- Because synchronize_rcu() waits for all preemption-disabled regions of code to complete, it is now guaranteed that nothing is executing or will ever again execute any of the code in the trampoline.
- The trampoline may now be freed.
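In rough C, with unpatch_tracepoint() and the ->trampoline field serving as hypothetical stand-ins for the architecture-specific code-patching machinery, this sequence boils down to the following sketch:
unpatch_tracepoint(tp);    /* Restore the original instructions. */
synchronize_rcu();         /* Wait for all preemption-disabled regions of code. */
kfree(tp->trampoline);     /* Nothing can still be executing the trampoline. */
tp->trampoline = NULL;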
Unfortunately, this simple process fails completely in preemptible kernels because a given task might be preempted within a trampoline, and RCU will not wait on this task. One could imagine using explicit calls to rcu_read_lock() and rcu_read_unlock() to cause RCU to wait for the task to exit from the trampoline, but the problem is that these calls must be placed somewhere. They cannot be placed in the trampoline because the task might be preempted just before the rcu_read_lock() or just after the rcu_read_unlock(). And they generate too much code to be placed at the location of the tracepoint.
Instead, preemptible kernels use Tasks RCU and Tasks Rude RCU. For Tasks RCU, synchronize_rcu_tasks() waits for each non-idle task to be blocked, for example, by undergoing a voluntary context switch. Thus, when synchronize_rcu_tasks() returns, all non-idle tasks that were executing within a trampoline will have exited that trampoline.
Click for answer
The point is that one of the synchronize_rcu_tasks_rude()
function's workqueue handlers might preempt a given task within a
trampoline.
If this happens to be the trampoline that is to be freed, this is
a very bad state of affairs.
Thus, the call to synchronize_rcu_tasks() is absolutely
required to wait for all non-idle tasks to exit from their current
trampolines.
Quick Quiz 6:
So preemptible kernels must call both synchronize_rcu_tasks()
and synchronize_rcu_tasks_rude(), but non-preemptible
kernels need only invoke synchronize_rcu()?
Click for answer
Not quite.
Because the deep idle loop is a quiescent state for synchronize_rcu(),
non-preemptible kernels must call both synchronize_rcu() and
synchronize_rcu_tasks_rude().
Fortunately, it is possible to wait for both grace periods concurrently
using synchronize_rcu_mult().
At least until such time as call_rcu_tasks_rude()
is removed due to lack of use.
Idle tasks are handled by Tasks Rude RCU. The synchronize_rcu_tasks_rude() function schedules a workqueue handler on each CPU, thus forcing any idle CPUs completely out of their idle loops and thus out of any idle-loop trampoline.
Finally, BPF permits entire BPF programs to be executed at tracepoints. And there are some special BPF programs that need to block in order to handle a page fault when accessing userspace memory. One approach would be for those programs to use SRCU; however, the high overhead of srcu_read_lock() and srcu_read_unlock(), courtesy of each function's use of array accesses and smp_mb(), makes this problematic for BPF programs on hot code paths.
Tasks Trace RCU was created for this use case, with rcu_read_lock_trace() and rcu_read_unlock_trace() to mark the extent of such BPF programs, and with synchronize_rcu_tasks_trace() and call_rcu_tasks_trace() to wait synchronously and asynchronously, respectively, for all pre-existing readers to complete.
Because of the special-purpose nature of Tasks Trace RCU, please consult with the RCU maintainers and its current users before making use of it.
Click for answer
Back in the 1990s, Paul would have been surprised by there being more
than even one flavor, so who knows?
But perhaps the next edition of this article will include hazard pointers.
Quick Quiz 8: Forget about more flavors of RCU!!! How about fewer of them???
Click for answer
Again, who knows?
But if appropriate inlining and noinstr marking of functions
are carried out, perhaps Tasks RCU will be able to wait on idle threads
and Tasks Rude RCU could then be removed.
On the other hand, improperly marked functions could ruin your kernel's
whole day.
The different flavors of RCU impose different constraints on their read-side critical sections:
- RCU readers are not allowed to block. In preemptible kernels, they may be preempted, and in PREEMPT_RT kernels, they may block during spinlock acquisition.
- SRCU readers may block in any way that does not result in deadlock.
- Tasks RCU readers are defined by code regions outside of the idle loop that do not voluntarily context switch.
- Tasks Rude RCU readers are defined as any region of code that has preemption disabled.
- Tasks Trace RCU readers are permitted to block, but by convention such blocking must be limited to the duration of a major page fault.
With this background in place, on to the RCU API.
RCU API Overview
This section goes through the big API table, describing the purposes and uses of the API members. These descriptions are telegraphic, so please see these functions' kernel-doc headers and the other documentation in the Linux kernel's Documentation/RCU directory for more information. And of course for those wishing to really get to the bottom of it all, there is always the source code.
Click for answer
The API members in bold
were added since 2019, those with overstrike were
deprecated or removed since 2019, and those in italics were
present (though with different names) in the early 1990s RCU implementation
in Sequent's DYNIX/ptx proprietary UNIX kernel.
RCU API
Traditionally, RCU's readers have been marked with rcu_read_lock() and rcu_read_unlock(). More recently, the RCU-bh and RCU-sched flavors were consolidated into RCU, so that code regions that disable bottom halves (also known as softirq handlers) act as RCU readers. For documentation purposes, where feasible, please use rcu_read_lock_bh() and rcu_read_unlock_bh(). Similarly, anything that disables preemption up to and including an NMI handler acts as an RCU reader. And again similarly, for documentation purposes, where feasible, please use rcu_read_lock_sched() and rcu_read_unlock_sched(). Most of these RCU reader markers may be nested.
Click for answer
One example of non-nestability is the pair
local_irq_disable() and local_irq_enable().
Because local_irq_enable() unconditionally enables
interrupts, anyone trying to nest this pair of primitives will get
what they deserve, good and hard.
If you need to nest interrupt disabling, you should instead use
local_irq_save() and local_irq_restore().
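Returning to the reader markers themselves, a typical reader might look like the following sketch, in which gp is a hypothetical RCU-protected pointer to a struct foo having an integer field ->a, and do_something_with() is a placeholder for the reader's real work:
struct foo *p;

rcu_read_lock();
p = rcu_dereference(gp);         /* Fetch the RCU-protected pointer. */
if (p)
        do_something_with(p->a); /* p must not be used after rcu_read_unlock(). */
rcu_read_unlock();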
Some portions of the tracing code need to use RCU readers, but use of the normal API members would result in recursive tracing. Such code may use the not-traced rcu_read_lock_sched_notrace() and rcu_read_unlock_sched_notrace() API members.
The lockdep facility tracks the extent of RCU readers, and the rcu_read_lock_held(), rcu_read_lock_bh_held(), and rcu_read_lock_sched_held() functions may be used in lockdep assertions to check for being in the corresponding flavor of RCU read-side critical section. If it does not matter which flavor of RCU reader is in effect, rcu_read_lock_any_held() may be used.
Click for answer
No, only those for RCU itself.
Later, we will see other functions that check for SRCU and Tasks Trace RCU
readers.
RCU's synchronous grace-period wait primitives are synchronize_rcu(), synchronize_net(), and synchronize_rcu_expedited(). All three block until all pre-existing RCU readers have completed, but with different CPU/latency tradeoffs. The synchronize_rcu() function minimizes CPU overhead, but typically sleeps for milliseconds or tens of milliseconds. In contrast, the synchronize_rcu_expedited() function typically sleeps for only tens or hundreds of microseconds, about two orders of magnitude less than synchronize_rcu(). However, it consumes much more CPU, including sending inter-processor interrupts (IPIs) to all non-idle CPUs. In between is synchronize_net(), which will act as either synchronize_rcu() or synchronize_rcu_expedited(), depending on the networking locking state.
RCU's asynchronous grace-period wait primitives are call_rcu() and call_rcu_hurry(), which pass the specified rcu_head pointer to the specified function once all pre-existing readers have completed. These functions may be invoked with preemption disabled, including from interrupt handlers, but not from NMI handlers. In kernels built with lazy RCU disabled, these act identically. Otherwise, call_rcu_hurry() skips the laziness, consuming extra power, while call_rcu() can delay starting an RCU grace period for some seconds, which reduces power consumption and increases battery lifetime on nearly idle systems, for example, smartphones and laptops.
The rcu_barrier() function waits for all pre-existing callbacks to be invoked, hurrying any lazy callbacks that might have been queued by call_rcu(). This is useful when unloading a module that uses call_rcu(), where a call to rcu_barrier() in the module-unload code path is used to prevent a lingering call_rcu()-instigated invocation of a function in the module that was just unloaded. Such lingering invocations can severely degrade the actuarial statistics of your kernel.
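As a rough sketch, extending the hypothetical struct foo from earlier with a list_head named list and an rcu_head named rh, asynchronous removal and module unload might look like this:
static void free_foo_cb(struct rcu_head *rhp)
{
        kfree(container_of(rhp, struct foo, rh));
}

/* Updater, holding the appropriate update-side lock: */
list_del_rcu(&fp->list);
call_rcu(&fp->rh, free_foo_cb);

/* Module-unload path: wait for all outstanding callbacks to be invoked. */
rcu_barrier();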
RCU has a much-expanded list of polling grace-period primitives. These operate by handing the user a “cookie” that is passed to other API members in this set.
The functions whose names end in _full take a pointer to a rcu_gp_oldstate cookie structure, which is the size of a pair of unsigned longs. Their non-_full counterparts return a single unsigned long. The rcu_gp_oldstate structure accounts for each and every grace period, whether normal or expedited, while the older-style cookies comprising a single unsigned long can miss one of a pair of overlapping normal and expedited grace periods. Note that both styles of cookies can miss grace periods due to counter wrap on 32-bit systems or on 64-bit systems with exceedingly high clock rates or admirably long uptimes.
The functions containing _expedited start expedited grace periods, while their non-_expedited counterparts start normal grace periods.
The get_completed_synchronize_rcu() and get_completed_synchronize_rcu_full() functions return a cookie corresponding to a mythical grace period that is, always has been, and always will remain already completed. This is useful for marking objects that need not wait for a grace period or, for that matter, objects that already have waited for one, as happens in caches that age objects out. These functions may be invoked from NMI handlers.
The get_state_synchronize_rcu() and get_state_synchronize_rcu_full() functions return a cookie corresponding to the next not-yet-started grace period. These functions do not guarantee that such a grace period will ever start, but on the other hand, they may be invoked from NMI handlers.
The start_poll_synchronize_rcu(), start_poll_synchronize_rcu_full(), start_poll_synchronize_rcu_expedited() and start_poll_synchronize_rcu_expedited_full() functions also return a cookie corresponding to the next not-yet-started grace period, but they in addition guarantee that the specified type of grace period will eventually start. The corresponding downside is that they may not be invoked from NMI handlers.
The poll_state_synchronize_rcu() and poll_state_synchronize_rcu_full() functions take a cookie and return true if the corresponding grace period has completed. These functions may be invoked from NMI handlers.
The cond_synchronize_rcu(), cond_synchronize_rcu_full(), cond_synchronize_rcu_expedited() and cond_synchronize_rcu_expedited_full() functions take a cookie and return immediately if the corresponding grace period has completed. Otherwise, they wait for that grace period (or some later grace period) to complete.
Click for answer
Using an array with a cookie stored in each element can reduce memory
footprint because then the actual objects themselves do not need to
allow space for the cookie.
Quick Quiz 13: But I tried the following on a kernel where NUM_ACTIVE_RCU_POLL_OLDSTATE==2:
 1 c1 = start_poll_synchronize_rcu();
 2 d1 = poll_state_synchronize_rcu(c1);
 3 c2 = start_poll_synchronize_rcu();
 4 d2 = poll_state_synchronize_rcu(c1);
 5 c3 = start_poll_synchronize_rcu();
 6 d3 = poll_state_synchronize_rcu(c1);
 7 ss1 = same_state_synchronize_rcu(c1, c2)
 8 ss2 = same_state_synchronize_rcu(c2, c3)
 9 WARN_ON_ONCE(!ss1 && !ss2 &&
10              !d1 && !d2 && !d3);
And that WARN_ON_ONCE() triggered! How can that possibly happen if there are only two distinct cookie values corresponding to uncompleted grace periods?
Click for answer
Easily.
Although the grace period corresponding to c1 had not
expired at line 2, it had done so prior to line 5.
To see how this could happen, imagine that there was a one-second
delay (perhaps due to preemption) between lines 4 and 5.
There are only a small number of cookie values that can possibly correspond to a not-yet-completed grace period, which means that someone using these polled functions might want to maintain an array, with each element of that array corresponding to one of the not-yet-completed grace periods. The NUM_ACTIVE_RCU_POLL_OLDSTATE and NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE macros specify the number of array elements needed for the non-_full and _full functions, respectively. The same_state_synchronize_rcu() and same_state_synchronize_rcu_full() functions perform equality comparisons on their respective types of cookies. These may be called from NMI handlers.
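Pulling the basic pieces together, a minimal sketch of the polling pattern, with do_cleanup() as a hypothetical stand-in for whatever reclamation is needed:
unsigned long cookie;

cookie = start_poll_synchronize_rcu();  /* Guarantees that a grace period will start. */
/* ... do other useful work while the grace period proceeds ... */
if (!poll_state_synchronize_rcu(cookie))
        cond_synchronize_rcu(cookie);   /* Not yet done, so block until it is. */
do_cleanup();                           /* All pre-existing readers have completed. */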
If you only want to free memory, then kfree_rcu(), kfree_rcu_mightsleep(), kvfree_rcu(), and kvfree_rcu_mightsleep() allow you to defer the freeing of kmalloc() or vmalloc() memory, respectively, in a fire-and-forget manner, so that the actual freeing happens only after all pre-existing RCU readers have completed. The functions whose names end in _mightsleep, as the suffix suggests, might sleep. This happens in OOM conditions, where the function invokes synchronize_rcu() as an alternative to allocating the memory needed to track the object through the course of its grace period. The non-_mightsleep functions never sleep (and may be invoked from interrupt handlers), but require that the structure include an rcu_head structure.
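For example, given the hypothetical struct foo from earlier, whose rcu_head field is named rh:
kfree_rcu(fp, rh);              /* Never sleeps; uses fp->rh to track the grace period. */

/* Or, if struct foo lacked an rcu_head, at the cost of possibly sleeping: */
kfree_rcu_mightsleep(fp);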
Due to a welcome change in the slab allocator, you can now pass memory obtained from kmem_cache_alloc() to kfree_rcu() and kfree_rcu_mightsleep(). Unfortunately, rcu_barrier() does not wait on memory that has been passed to kfree_rcu(), so modules that destroy a kmem_cache in the process of unloading must continue to use call_rcu() for memory allocated from that kmem_cache. As noted in the parent article, a fix is in the works.
RCU has several updater-validation API members. The rcu_cpu_stall_reset() function disables RCU CPU stall warnings for the remainder of the current grace period, or for ULONG_MAX/2 jiffies, whichever comes first. This can be useful to kernel or firmware-based debuggers. After all, some might consider it unfriendly for each debugger breakpoint to result in an RCU CPU stall warning. The rcu_is_watching() function does what it says, and is useful for verifying that kernel-entry code has properly handled RCU state transitions and also for complaining about RCU read-side critical sections being placed where RCU cannot see them. The rcu_sleep_check() macro emits a lockdep splat if used where RCU readers rule out sleeping.
Click for answer
In practice, compilers normally cannot break dependencies carried by
pointers, however, integers are another matter altogether.
To see this, consider an array whose size depends on a Kconfig option.
If that array ends up having only one element, the compiler knows
that the index must be zero, and an aggressively optimizing compiler
just might decide to skip the computation entirely and just use the
constant zero, which would break any dependency.
Of course, pointers can also be zero, but this also means that non-buggy
code won't be dereferencing them, thus defining this particular cause
of dependency breaking out of existence.
However, if a dependency carrying pointer is compared to the address
of a statically allocated variable, compilers are quite happy to react
to an equality result by replacing further use of that pointer with the
address of that variable, in turn breaking that dependency.
Issues with dependency breaking are discussed at length
here,
and Linux-kernel coding restrictions required to avoid such breakage
are provided by
Documentation/RCU/rcu_dereference.rst.
One could argue that compilers in fact do respect dependencies via
the C and C++ memory_order_consume facility.
However, this is only ever implemented by promoting it to
memory_order_acquire, which on weakly ordered systems
can result in unnecessary memory-barrier instructions on your fastpaths,
which might not be what you want.
The reason for the promotion to memory_order_acquire is
that compiler writers positively despise having to trace dependencies.
In fact, there was a proposal to
deprecate
memory_order_consume, and it would not be surprising if it
was removed entirely sooner rather than later.
So what is to be done? One proposal here restricts dependency chains to cases where it is difficult for the compiler to break them, and further requires that pointer variables carrying dependencies be marked. Such marking might not go down well with the Linux kernel community, which has been carrying dependencies in unmarked variables for more than 20 years, so there is a further informal proposal asking C and C++ implementations to provide a command-line option forcing the compiler to treat any pointer variable as if it had been marked. (Why informal? Because command-line options are outside of the scope of the standard.)
There is a prototype implementation that obtains the functionality of
memory_order_consume without actually using
memory_order_consume, which is briefly described
here.
However, the committee was not all that happy with this approach, preferring
marking of a single pointer variable to maintaining a separate variable
to carry the dependency.
Some on the committee argue that the cost of acquire loads has been
decreasing over time, so that there is no reason to use anything else,
but the fact remains that acquire loads remain sufficiently expensive
that they are often avoided, especially on GPGPUs.
(One can reasonably argue that GPGPU workloads have little need for RCU,
but one might also note that there are GPGPU vendors who are not without
ambition.)
Another approach is to leave the compiler alone, but to construct a tool to detect broken dependency chains, such as the one prototyped by Paul Heidekrüger.
So there is some progress and some hope, but not yet a complete and
agreed-upon solution.
In the meantime, there are the volatile
loads provided by rcu_dereference() and
friends and also the coding restrictions called out in
Documentation/RCU/rcu_dereference.rst.
RCU provides a dazzling array of rcu_dereference() API members, which are used to fetch RCU-protected pointers for later use, including being dereferenced. These API members are conceptually identity functions, but prevent memory misordering on DEC Alpha and provide valuable debugging functionality on all systems. Note well the coding restrictions called out in Documentation/RCU/rcu_dereference.rst, which avoid use cases in which compilers might otherwise be happy to break the address dependencies on which RCU readers depend. Those whose names end in _check take a lockdep expression, which can be useful in code invoked both by RCU readers (protected by rcu_read_lock() and friends) and by RCU updaters (perhaps protected by some lock). This lockdep expression need not cover the possibility of RCU readers because rcu_dereference() and rcu_dereference_check() include lockdep expressions for RCU readers, rcu_dereference_bh() and rcu_dereference_bh_check() include lockdep expressions for RCU-bh readers (including rcu_read_lock_bh()), and rcu_dereference_sched() and rcu_dereference_sched_check() include lockdep expressions for RCU-sched readers (including rcu_read_lock_sched()).
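For example, a one-line sketch for a pointer that is accessed both by RCU readers and by updaters holding a hypothetical lock named my_lock:
p = rcu_dereference_check(gp, lockdep_is_held(&my_lock));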
SRCU API
Each sleepable RCU (SRCU) instance (or “domain”) is represented by an srcu_struct structure, which must be initialized, and, if initialized at runtime, cleaned up. Build-time initialization is provided by DEFINE_SRCU() for srcu_struct structures used in other translation units and DEFINE_STATIC_SRCU() for static structures. Runtime initialization is carried out by init_srcu_struct() and cleanup by cleanup_srcu_struct().
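A minimal sketch, assuming a file-private SRCU domain named my_srcu (a hypothetical name):
DEFINE_STATIC_SRCU(my_srcu);

int idx;

idx = srcu_read_lock(&my_srcu);
/* SRCU read-side critical section: blocking is permitted here. */
srcu_read_unlock(&my_srcu, idx);

/* Updater: waits only for pre-existing readers of this particular domain. */
synchronize_srcu(&my_srcu);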
The reason that this code fragment can result in deadlock is that we
have a cycle.
The ssa read-side critical sections can wait on an
ssb grace period, which waits on ssb read-side
critical sections, which contains a synchronize_srcu(), which
in turn waits on ssa read-side critical sections.
So if you do include synchronize_srcu() in SRCU
read-side critical sections, make sure to avoid cycles.
Of course, the simplest way to avoid cycles is to avoid using
synchronize_srcu() in SRCU read-side critical sections
in the first place.
However, one nice change over the past few years is that lockdep now
complains about these cycles.
Click for answer
In theory, you can use
synchronize_srcu() with a given srcu_struct
within an SRCU read-side critical section that uses some other
srcu_struct.
In practice, however, such use is almost certainly a bad idea,
as it means that the SRCU readers take a long time to complete.
Worse yet, the following could still result in deadlock:
idx = srcu_read_lock(&ssa);
synchronize_srcu(&ssb);
srcu_read_unlock(&ssa, idx);
/* . . . */
idx = srcu_read_lock(&ssb);
synchronize_srcu(&ssa);
srcu_read_unlock(&ssb, idx);
Quick Quiz 16: Does an SRCU read-side critical section in an NMI handler permit general blocking?
Click for answer
Although SRCU would be fine with this, the NMI handler would not be.
The traditional SRCU read-side markers are srcu_read_lock() and srcu_read_unlock(). However, as discussed in New SRCU Read-Side Critical Sections, if any SRCU reader resides in an NMI handler, then all readers for that srcu_struct structure (whether in an NMI handler or not) must instead use srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe(). As with the classic RCU API, those portions of the tracing code that are susceptible to recursion should use srcu_read_lock_notrace() and srcu_read_unlock_notrace(). Also as discussed in that section, in semaphore-like situations where one task enters a given SRCU read-side critical section and some other task exits it, srcu_down_read() and srcu_up_read() should be used. All of these forms of SRCU reader permit general blocking within their read-side critical sections.
In current implementations, all SRCU read-side markers imply a full barrier, as in smp_mb(). However, this is just an accident of the current implementation, and has not always been the case. Those needing to rely on the full ordering provided by a particular srcu_read_unlock() instance should follow that instance with smp_mb__after_srcu_read_unlock().
Click for answer
If they make their needs known to us, perhaps a
smp_mb__after_srcu_read_lock() can be provided.
The lockdep facility also tracks the extent of SRCU readers, and srcu_read_lock_held() may be used in lockdep assertions to check for the specified SRCU instance's read-side critical sections.
SRCU's synchronous grace-period wait primitives are synchronize_srcu() and synchronize_srcu_expedited(). Both block until all pre-existing SRCU readers associated with the specified srcu_struct structure have completed, but with different CPU/latency tradeoffs. The synchronize_srcu() function minimizes CPU overhead, but typically incurs a few milliseconds worth of latency. In contrast, the synchronize_srcu_expedited() function typically incurs only tens or hundreds of microseconds of latency, about two orders of magnitude less than synchronize_srcu(). However, it consumes much more CPU, although it does avoid sending IPIs to CPUs in the common case.
Due to an accident of implementation in an earlier version of SRCU, a call to synchronize_srcu() after an extended idle period acts as if it was instead a call to synchronize_srcu_expedited(). And yes, this is an example of Hyrum's Law.
SRCU's asynchronous grace-period wait primitive is call_srcu(), which passes the specified pointer to the specified function after the completion of all pre-existing SRCU readers using that same srcu_struct. The srcu_barrier() function waits for the invocation of all callbacks that have already been queued by call_srcu() on the same srcu_struct structure. Care is required when using these primitives: because SRCU readers are allowed to block indefinitely, these two primitives might take a long time to return, if they return at all. So if you use either call_srcu() or srcu_barrier(), it is your responsibility to make sure that readers complete in a timely fashion, lest large numbers of call_srcu() calls OOM the kernel or your srcu_barrier() refuse to ever return. Or, alternatively, it is your responsibility to make sure that your code does not care that call_srcu() and srcu_barrier() never return.
Click for answer
There have not yet been requests for these things, though there is
very likely a NUM_ACTIVE_SRCU_POLL_OLDSTATE arriving
soon.
SRCU's polling grace-period API is similar to that of RCU, but is much simpler. The get_state_synchronize_srcu() function returns a cookie corresponding to the next not-yet-started grace period, but does nothing to ensure that this grace period actually starts. The start_poll_synchronize_srcu() function also returns a cookie corresponding to the next not-yet-started grace period, and also ensures that this grace period does start. Finally, the poll_state_synchronize_srcu() function returns true if the specified cookie corresponds to an already-completed grace period.
SRCU has a more modest set of srcu_dereference() API members, which are used to fetch SRCU-protected pointers for later use, including being dereferenced. The srcu_dereference() macro contains lockdep checks that the specified SRCU reader is in effect. The srcu_dereference_check() macro also contains lockdep checks that the specified SRCU reader is in effect, but further allows the user to pass in additional lockdep expressions. For example, passing in lockdep_is_held(&mylock) would result in lockdep checking that either the specified SRCU reader is in effect or that mylock is held. The srcu_dereference_notrace() macro does no lockdep checking and no tracing, allowing it to be invoked from tracing code.
Tasks RCU API
Tasks RCU's synchronous grace-period wait primitive is synchronize_rcu_tasks(), which can sleep for seconds or even minutes due to its need to wait for every non-idle task to undergo a voluntary context switch. This situation provides ample motivation for the asynchronous call_rcu_tasks() function, along with the rcu_barrier_tasks() callback-invocation wait function.
As noted in the parent article, a Tasks RCU read-side critical section is any region of code that does not contain a voluntary context switch.
Tasks Rude RCU API
Tasks Rude RCU's synchronous grace-period wait primitive is synchronize_rcu_tasks_rude(), and its asynchronous primitive is the call_rcu_tasks_rude() function, along with the rcu_barrier_tasks_rude() callback-invocation wait function. The latter two are not currently used, and are therefore likely to go away.
A Tasks Rude RCU read-side critical section is any region of code with preemption disabled.
It is likely that Tasks RCU will start paying attention to idle tasks. If so, then on architectures in which all low-level entry/exit, idle, and CPU-hotplug code is marked inline or noinstr, Tasks Rude RCU will no longer be needed. On such architectures, synchronize_rcu_tasks_rude() will likely become a no-op, except during rcutorture testing.
Tasks Trace RCU API
Click for answer
Some care is required.
However, where SRCU permits general blocking in readers, the only
read-side blocking that Tasks Trace RCU permits is that due to page
faults.
We can therefore hope that less care is
required for call_rcu_tasks_trace() and
rcu_barrier_tasks_trace() than for call_srcu()
and srcu_barrier().
Tasks Trace RCU readers use rcu_read_lock_trace() and rcu_read_unlock_trace(). Tasks Trace RCU lockdep assertions can use rcu_read_lock_trace_held() to check for read-side critical sections. Tasks Trace RCU's synchronous grace-period wait primitive is synchronize_rcu_tasks_trace(), and its asynchronous primitive is the call_rcu_tasks_trace() function, along with the rcu_barrier_tasks_trace() callback-invocation wait function.
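A minimal sketch of the reader/updater pairing:
rcu_read_lock_trace();
/* Reader: by convention, blocks only for major page faults. */
rcu_read_unlock_trace();

/* Updater: wait for all pre-existing Tasks Trace RCU readers. */
synchronize_rcu_tasks_trace();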
As an accident of implementation, a Tasks Trace RCU grace period implies a vanilla RCU grace period. Users wishing to rely on this should only do so when the rcu_trace_implies_rcu_gp() function returns true.
Generic RCU APIs
RCU pointers are normally updated using rcu_assign_pointer(); however, its ordering properties can be expensive and are sometimes unnecessary, for example, at build time, before any readers are present, or when assigning a NULL pointer. In these cases, the RCU_INIT_POINTER() and RCU_POINTER_INITIALIZER() API members can be used instead. However, those using these macros where rcu_assign_pointer() was needed will be punished by obscure and hard-to-locate memory-ordering bugs. In contrast, using rcu_assign_pointer() where one of these other two API members could have been used results only in a small performance degradation.
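For example, continuing with the hypothetical gp pointer and struct foo (with its integer field ->a) from earlier:
newp = kzalloc(sizeof(*newp), GFP_KERNEL);
if (newp) {
        newp->a = 42;
        rcu_assign_pointer(gp, newp);   /* Orders initialization before publication. */
}

RCU_INIT_POINTER(gp, NULL);             /* Assigning NULL needs no ordering. */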
As described in the Read-Side Guards section, RAII-style RCU and SRCU readers are provided by guard()(), in which the resulting reader is confined to the enclosing scope, and also scoped_guard(), in which the resulting reader is confined to the following C statement.
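A minimal sketch, assuming the rcu guard class provided by recent kernels and the hypothetical gp pointer and do_something_with() from earlier:
/* Reader extends from here to the end of the enclosing scope: */
guard(rcu)();
p = rcu_dereference(gp);

/* Alternatively, the reader is confined to the following statement or block: */
scoped_guard(rcu) {
        p = rcu_dereference(gp);
        if (p)
                do_something_with(p->a);
}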
Variables, function parameters, and structure fields can be tagged with the __rcu type qualifier in order to tell the sparse static analyzer that the corresponding object is an RCU-protected pointer. This will cause sparse to complain if plain C-language accesses are used on these objects.
RCU_LOCKDEP_WARN() emits a lockdep splat with the specified message if the specified condition is met. For example, to add lockdep-based teeth to comments stating that the code must be invoked within an RCU read-side critical section, use something like this:
RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "Need rcu_read_lock()!");
It is sometimes useful to wait on multiple different types of grace periods concurrently. For example, the following would sleep on the four specified types of RCU grace period, one after the other:
synchronize_rcu();
synchronize_rcu_tasks();
synchronize_rcu_tasks_rude();
synchronize_rcu_tasks_trace();
In contrast, the following would sleep on these same four types of RCU grace period, but concurrently, so that the time slept would be the maximum of the grace-period latencies rather than their sum:
synchronize_rcu_mult(call_rcu,
                     call_rcu_tasks,
                     call_rcu_tasks_rude,
                     call_rcu_tasks_trace);
Click for answer
Easy!
Just create a wrapper function for each such instance, where
that wrapper function supplies a pointer to the corresponding
srcu_struct structure.
Then pass each such wrapper function to
synchronize_rcu_mult().
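For example, a hedged sketch of such a wrapper for the hypothetical my_srcu domain used earlier:
static void call_my_srcu(struct rcu_head *head, rcu_callback_t func)
{
        call_srcu(&my_srcu, head, func);
}

/* Wait concurrently for an RCU grace period and a my_srcu grace period: */
synchronize_rcu_mult(call_rcu, call_my_srcu);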
The init_rcu_head() and destroy_rcu_head() functions initialize rcu_head structures for debug-objects use and do cleanup, respectively. These functions are only needed in special situations where the normal debug-objects initialization and cleanup does not apply, for example, when the RCU callback makes the enclosing object available for reuse via some means other than a memory allocator. For another example, init_rcu_head() may be used to prevent deadlocks that could otherwise occur when call_rcu() is invoked from the debug-objects portions of a memory allocator. Another special case is where the rcu_head structure is on the stack, in which case you must use init_rcu_head_on_stack() and destroy_rcu_head_on_stack() to do the initialization and cleanup, respectively.
Debugging sometimes goes more smoothly if it is possible to determine whether or not a given rcu_head structure has been passed to call_rcu(). To this end, use rcu_head_init() to mark a structure at initialization time, then use rcu_head_after_call_rcu() to test whether it has since been passed to call_rcu().
The rcu_access_pointer() macro fetches an RCU-protected pointer that will not be dereferenced, for example, when that pointer will only be compared to NULL. The penalty for improperly using rcu_access_pointer() in cases where the pointer is dereferenced and thus rcu_dereference() should instead have been used, is subtle and irreproducible memory-ordering bugs. Choose wisely.
The rcu_dereference_protected() macro fetches an RCU-protected pointer, but under the protection of the update-side lock specified in the lockdep expression. The rcu_dereference_raw() macro also fetches an RCU-protected pointer, but omits lockdep checks. This should be used sparingly, for example, when the corresponding lockdep condition is extremely complex or unknowable. The rcu_dereference_raw_notrace() macro has been renamed to rcu_dereference_raw_check(), and is to be used from tracing code where endless recursion is a risk.
The rcu_assign_pointer() macro safely updates an RCU-protected pointer, preventing reordering by both the compiler and the CPU. It is the caller's responsibility to provide whatever synchronization might be needed for concurrent updates. The rcu_replace_pointer() macro safely updates an RCU-protected pointer just as rcu_assign_pointer() does, but also returns the previous value. Please note that the caller of rcu_replace_pointer() is responsible for synchronizing with other updaters, which means that this macro is not a replacement for an atomic exchange operation. The rcu_replace_pointer() macro was once named rcu_swap_protected(), which had serious API issues that were fixed as part of the renaming process.
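For example, a hedged sketch that replaces the hypothetical gp pointer while holding my_lock, then frees the old version:
lockdep_assert_held(&my_lock);
oldp = rcu_replace_pointer(gp, newp, lockdep_is_held(&my_lock));
if (oldp)
        kfree_rcu(oldp, rh);    /* Free the old version after a grace period. */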
The rcu_pointer_handoff() macro documents where a given RCU-protected pointer transitions from being protected by RCU to some other synchronization mechanism, for example, reference counting or locking. The unrcu_pointer() macro tells the sparse static-analysis tool that the pointer should no longer be considered to be an RCU-protected pointer, which is useful in conjunction with atomic read-modify-write operations.
The next section describes RCU's list-based APIs, which have seen some change since 2019.
RCU List-Based APIs
Fortunately, most of RCU's list-based primitives shown in the RCU List APIs table apply to all of the variants of RCU discussed above. This commonality can in some cases allow more code to be shared, which certainly reduces the API proliferation that would otherwise occur. However, it is quite likely that software-engineering considerations will eventually result in variants of these list-handling primitives that are specialized for each given flavor of RCU, as has in fact happened with hlist_for_each_entry_rcu_bh() and hlist_for_each_entry_continue_rcu_bh(). Alternatively, perhaps it will prove useful to consolidate the read-side RCU flavors.
The following subsections each cover their respective column of the table.
RCU list APIs
The APIs in the first column of the RCU List APIs table operate on the Linux kernel's struct list_head lists, which are circular, doubly-linked lists. These primitives permit lists to be modified in the face of concurrent traversals by readers. The list-traversal primitives are implemented with simple instructions, so are extremely lightweight, though they also execute a memory barrier on DEC Alpha (as does READ_ONCE() in recent kernels). The list-update primitives that add elements to a list incur memory-barrier overhead, while those that only remove elements from a list are implemented using simple instructions. The list_splice_init_rcu() and list_splice_tail_init_rcu() primitives incur not only memory-barrier overhead, but also grace-period latency, and are therefore the only blocking primitives shown in the table.
The INIT_LIST_HEAD_RCU() API member allows a normal list to be initialized even when there are concurrent readers. This is useful for constructing list-splicing functions.
The list-traversal macros are based on C-language for loops, and are normally used within some sort of RCU read-side critical section. Some of these macros have an optional lockdep-expression parameter, and given a suitable lockdep expression, these can be used by updaters.
The list_for_each_entry_rcu() macro iterates over an RCU-protected list from the beginning, and takes an optional lockdep-expression argument. The list_for_each_entry_lockless() macro iterates over an insertion-only list (or similar) from the beginning, and is unusual in that it need not be within any kind of RCU read-side critical section. The list_for_each_entry_srcu() macro iterates over an SRCU-protected list from the beginning, and takes a mandatory lockdep expression that normally indicates which srcu_struct structure's SRCU read-side critical section needs to be in effect during the traversal, but which can also specify update-side protection.
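For example, a hedged lookup sketch, reusing the hypothetical struct foo, now with an integer field ->key and a list_head named list, strung onto an RCU-protected list headed by my_head:
rcu_read_lock();
list_for_each_entry_rcu(p, &my_head, list) {
        if (p->key == key) {
                do_something_with(p);
                break;
        }
}
rcu_read_unlock();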
The following API members continue a list traversal. Please note that it is the caller's responsibility to ensure that this traversal will be safe, for example, by remaining within the same RCU read-side critical section in which the original traversal was started.
The list_for_each_entry_continue_rcu() macro iterates over an RCU-protected list, starting after the specified element. The list_for_each_entry_from_rcu() macro iterates over an RCU-protected list, starting from the specified element.
Click for answer
These two macros use READ_ONCE() internally to account for
the fact that the pointer that they are starting from could change at
any time.
In contrast, list_entry() assumes that its pointer
is unchanging, and therefore does not protect against various
"interesting" compiler optimizations.
The list_entry_rcu() macro, when given a pointer to a list_head structure, returns a pointer to the enclosing element. The list_entry_lockless() macro, when given a pointer to a list_head structure, returns a pointer to the enclosing element, but does not require that the caller be within an RCU read-side critical section. However, list_entry_lockless() is intended only for insertion-only linked data structures.
The list_first_or_null_rcu() macro returns a pointer to the first element on an RCU-protected list, but returns NULL if the list is empty. The list_next_rcu() macro returns a pointer to the next element on the list, which must be handed to one of the rcu_dereference() family of primitives if used by an RCU reader. The list_next_or_null_rcu() macro is similar to list_next_rcu(), except that it returns NULL when invoked on the last element of the list. These macros are normally used to build list iterators. The list_tail_rcu() macro is used on the list header, and returns a pointer to the list_head of the last element of the list. Care is required when using this macro, for example, list_del_rcu() and friends must not be concurrently used on the same list header.
The remaining API members in this section update RCU-protected lists. These updates are atomic from the viewpoint of RCU readers, but not from the viewpoint of concurrent list updates. The caller is therefore required to synchronize with any concurrent updates.
The list_add_rcu() function adds an element at the head of the specified RCU-protected list, and the list_add_tail_rcu() function adds an element at the tail of the specified RCU-protected list. The list_del_rcu() function deletes the specified element from its RCU-protected list. The list_replace_rcu() function replaces an element in an RCU-protected list.
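For example, a hedged sketch that replaces one element of the list headed by my_head, with updates serialized by a hypothetical spinlock my_lock:
spin_lock(&my_lock);
list_replace_rcu(&oldp->list, &newp->list);     /* Atomic from the readers' viewpoint. */
spin_unlock(&my_lock);
kfree_rcu(oldp, rh);    /* Free the old element only after a grace period. */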
Click for answer
Because both these primitives wait for a grace period, which is
not something you should be doing while holding a spinlock.
The list-splice APIs join a pair of lists into a single combined list. This also requires some form of mutual exclusion, for example, using a mutex or a single designated updater task. The list_splice_init_rcu() function splices a pair of lists together, initializing the source list to empty. Note that this primitive waits for a grace period to elapse. You determine the RCU flavor by passing in the corresponding grace-period-wait primitive, for example, synchronize_rcu(). The list_splice_tail_init_rcu() function is similar to list_splice_init_rcu(), but splices to the tail of the destination list.
RCU hlist APIs
The APIs in the second column of the RCU List APIs table operate on the Linux kernel's struct hlist_head, which is a linear doubly linked list. One advantage of struct hlist_head over struct list_head is that the former requires only a single-pointer list header, which can save significant memory in large hash tables. The struct hlist_head primitives in the table relate to their non-RCU counterparts in much the same way as do the struct list_head primitives. Their overheads are similar to that of their list counterparts in the first category in the table.
The hlist_for_each_entry_rcu() macro iterates over an RCU-protected hlist from the beginning, but also provides an optional lockdep parameter to allow other forms of protection. The hlist_for_each_entry_rcu_bh() macro is similar, but instead iterates over an RCU-bh-protected hlist from the beginning, and does not provide a lockdep parameter. The hlist_for_each_entry_rcu_notrace() macro iterates over an RCU-protected hlist from the beginning, but omits tracing. This macro can thus be safely used within the tracing code. Finally, the hlist_for_each_entry_srcu() macro iterates over an SRCU-protected hlist from the beginning.
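For example, a hedged sketch of an RCU-protected hash-table lookup, in which hashtab, hash(), and the ->hnode field are all hypothetical:
rcu_read_lock();
hlist_for_each_entry_rcu(p, &hashtab[hash(key)], hnode) {
        if (p->key == key) {
                do_something_with(p);
                break;
        }
}
rcu_read_unlock();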
The following API members continue an hlist traversal. As with the list macros, it is the caller's responsibility to ensure that this traversal will be safe, for example, by remaining within the same RCU read-side critical section in which the original traversal was started.
The hlist_for_each_entry_continue_rcu() macro iterates over an RCU-protected hlist starting after the specified element. The hlist_for_each_entry_continue_rcu_bh() macro similarly iterates over an RCU-bh-protected hlist starting after the specified element. Finally, the hlist_for_each_entry_from_rcu() macro iterates over an RCU-protected hlist from the specified element (as opposed to after it).
The hlist_first_rcu() macro returns a pointer to the first element on an RCU-protected hlist, or NULL if the hlist is empty. If non-NULL, it must be handed to one of the rcu_dereference() family of primitives if it is to be dereferenced by an RCU reader. The hlist_next_rcu() macro returns a pointer to the next element on an RCU-protected hlist, or NULL if at the end of the hlist. As before, if the result is non-NULL, it must be handed to one of the rcu_dereference() family of primitives if it is to be dereferenced by an RCU reader. The hlist_pprev_rcu() macro returns a pointer to the previous element's pointer to the current element. This must not be used by RCU readers because the ->pprev pointer is poisoned at deletion time.
The remaining API members in this section update RCU-protected hlists. These updates are atomic from the viewpoint of RCU readers, but not from the viewpoint of concurrent list updates. The caller is therefore required to synchronize with these other updates.
The hlist_add_before_rcu() function adds an element before the specified element in the RCU-protected hlist. The hlist_add_behind_rcu() function adds an element after the specified element in the RCU-protected hlist. The hlist_add_head_rcu() function adds an element to the beginning of the RCU-protected hlist. Finally, the hlist_add_tail_rcu() function adds an element to the end of the RCU-protected hlist. Note that, because an hlist is not a circular list, this function must traverse the entire list in order to find the end.
The hlist_del_rcu() function deletes the specified element from its RCU-protected hlist and the hlist_del_init_rcu() function also deletes the specified element from its RCU-protected hlist, but further initializes its pointer to form an empty list.
The hlist_replace_rcu() function replaces an element in an RCU-protected hlist. Finally, the hlists_swap_heads_rcu() function swaps a pair of lists between a pair of hlist_head structures.
RCU hlist_nulls APIs
The APIs in the third column of the RCU List APIs table operate on Linux-kernel hlist-nulls lists, which are made up of hlist_nulls_head and hlist_nulls_node structures. These lists have special multi-valued NULL pointers, which have the low-order bit set to 1 with the upper bits available to the programmer to distinguish different lists. There are hlist-nulls interfaces for non-RCU-protected lists as well. To see why this sort of list is useful, suppose that CPU 0 is traversing such a list within an RCU read-side critical section, where the elements are allocated from a SLAB_TYPESAFE_BY_RCU slab cache. The elements could therefore be freed and reallocated at any time. If CPU 0 is referencing an element while CPU 1 is freeing that element, and if CPU 1 then quickly reallocates that same element and adds it to some other list, then CPU 0 will be transported to that new list along with the element. In this case, distinguishing the source and destination lists would clearly be quite helpful, and hlist_nulls readers must typically check to make sure that they reached the end of the expected list.
Click for answer
The updater can make this sequence of events harmless by always moving
elements to the heads of their destination lists.
This ensures that if CPU 0 finds the end of a list, it will have
traversed that entire list.
Possibly more than once.
To make matters worse, suppose that CPU 0 searches a list and fails to find the element that it was looking for. Was that because the element did not exist? Or because CPU 0 got transported to some other list in the meantime? Readers traversing SLAB_TYPESAFE_BY_RCU lists must carefully validate each element and check for being moved to another list. One way to check for being moved to another list is for each list to have its own value for the NULL pointer. These checks are subtle and easy to get wrong, so please be careful!
The hlist_nulls_for_each_entry_rcu() macro iterates over an RCU-protected hlist-nulls from the beginning. The hlist_nulls_for_each_entry_safe() macro iterates over an RCU-protected hlist-nulls from the beginning, but also allows the current element to be deleted.
The following API members do a stepwise hlist_nulls traversal. It is the caller's responsibility to ensure that this traversal will be safe, for example, by remaining within the same RCU read-side critical section in which the traversal was started. The hlist_nulls_first_rcu() macro returns a pointer to the first element on an RCU-protected hlist-nulls, or NULL if the hlist is empty. The hlist_nulls_next_rcu() macro returns a pointer to the next element on an RCU-protected hlist-nulls, or NULL if at the end of the hlist. In both cases, if the return value is non-NULL, it must be handed to one of the rcu_dereference() family of primitives if it is to be used by an RCU reader.
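For example, a hedged sketch loosely following Documentation/RCU/rculist_nulls.rst, in which each bucket's nulls value encodes its slot number, obj and node are the usual cursor variables, hashtab is a hypothetical array of hlist_nulls_head structures, and the reference-count revalidation that real SLAB_TYPESAFE_BY_RCU users also need is omitted:
rcu_read_lock();
begin:
hlist_nulls_for_each_entry_rcu(obj, node, &hashtab[slot], hash_node) {
        if (obj->key == key)
                goto found;
}
if (get_nulls_value(node) != slot)      /* Did we end up on some other chain? */
        goto begin;                     /* Yes, so retry the search. */
obj = NULL;                             /* Genuinely not present. */
found:
rcu_read_unlock();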
The remaining API members in this section update an RCU-protected hlist_nulls. These updates are atomic from the viewpoint of RCU readers, but not from the viewpoint of concurrent list updates. The caller is therefore required to synchronize with any concurrent updates.
The hlist_nulls_add_head_rcu() function adds an element to the beginning of the RCU-protected hlist-nulls. The hlist_nulls_add_tail_rcu() function adds an element to the end of the RCU-protected hlist-nulls. Note that because this is not a circular list, this function must traverse the entire list in order to find the end. The hlist_nulls_add_fake() function marks the current node as the end of the list, while causing the ->pprev pointer to reference this same element. Given that the resulting list is not in hlist_nulls format, please use extreme caution with this function.
The hlist_nulls_del_rcu() function deletes the specified element from its RCU-protected hlist-nulls. The hlist_nulls_del_init_rcu() function deletes the specified element from its RCU-protected hlist-nulls, and also initializes its ->pprev pointer to form an empty list.
RCU hlist_bl APIs
The APIs in the fourth and final column of the RCU List APIs table operate on Linux-kernel hlist-bitlocked lists, which are made up of hlist_bl_head and hlist_bl_node structures. These lists use the low-order bit of the pointer to the first element as a lock, which allows per-bucket locks on large hash tables while still maintaining a reasonable memory footprint.
The hlist_bl_for_each_entry_rcu() macro iterates over an RCU-protected bitlocked hlist from the beginning.
The hlist_bl_first_rcu() function returns a pointer to the first element on the bitlocked hlist, applying rcu_dereference(). The caller must either be in an RCU read-side critical section or have locked the hlist. Note that this cannot be an RCU-bh, RCU-sched, or SRCU read-side critical section; only an RCU read-side critical section will do.
The remaining API members in this section update RCU-protected hlist_bls. These updates are atomic from the viewpoint of RCU readers, but not from the viewpoint of concurrent list updates. The caller is therefore required to synchronize with these other updates.
The hlist_bl_add_head_rcu() function adds an element to the beginning of the RCU-protected bitlocked hlist. The hlist_bl_set_first_rcu() function adds an element to the beginning of the RCU-protected bitlocked hlist, which must initially be empty. The hlist_bl_del_rcu() function deletes the specified element from its RCU-protected bitlocked hlist.
Kernel Implementations and Configuration
The last three rows of the Big API Table show how different combinations of Kconfig options select different APIs for the five flavors of RCU. Each of the resulting eight implementations is described in one of the sections below, with a brief design overview, microbenchmark performance data collected on a 16-hardware-thread laptop running Intel(R) Core(TM) i9-10885H CPUs at 2.40GHz (different hardware and kernel configurations can give different results, so your mileage may vary), and a brief description of the relevant configuration.
- Tiny RCU
- Non-Preemptible Tree RCU
- Preemptible Tree RCU
- Tiny SRCU
- Tree SRCU
- Tasks RCU
- Tasks Rude RCU
- Tasks Trace RCU
Of course, none of this is a substitute for reading the documentation and code, including the Kconfig help text, kernel-parameters.txt, kernel-doc header comments, and so forth. However, the following sections should provide a helpful overview.
Tiny RCU
This RCU implementation is used in kernels built for single-CPU systems (CONFIG_SMP=n) in which preemption is disabled at build time (CONFIG_PREEMPT_DYNAMIC=n along with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y).
Tiny RCU Design
Tiny RCU is trivial because there is only one CPU and the kernel is non-preemptible. Because the kernel is non-preemptible, any voluntary context switch is a quiescent state. And because there is only one CPU, that one quiescent state is a grace period. Thus, Tiny RCU's readers and updaters are almost no-ops.
For readers, the “almost” refers to the barrier() directive in rcu_read_lock() and rcu_read_unlock() that prevents the compiler from reordering code (potentially containing page faults) either into or out of the read-side critical section. For updaters, the “almost” refers to the non-atomic increment in synchronize_rcu() that is required to make polled grace-period APIs function correctly.
Callbacks queued via call_rcu() require special handling because such callbacks cannot be invoked within an RCU read-side critical section. Instead, they are invoked from an RCU_SOFTIRQ handler, which is raised when needed during context switches.
Tiny RCU Performance
Click for answer
Because the slowdown is in the noise, given that non-atomic bit
manipulation of a per-CPU variable is extremely fast.
In addition, lazy preemption will have the great benefit of reducing
the number of cond_resched() calls required, perhaps
even reducing this number to zero.
Furthermore, unconditional access to preemption-disabled state will
simplify quite a bit of code, some of it in RCU.
So it is likely to be a good overall tradeoff.
Quick Quiz 25: Given that a Tiny RCU reader could still take an arbitrary length of time, how can synchronize_rcu() possibly be an O(1) operation?
Click for answer
Because Tiny RCU readers run with preemption disabled, which means that
this synchronize_rcu() invocation won't get a chance to
start until after that reader ends.
This situation provides yet another demonstration that it is necessary
to take great care when using asymptotics.
Tiny RCU readers might well contain the same sequence of machine instructions that would be used for a single-threaded implementation of that same algorithm. CONFIG_PREEMPT_COUNT=y will be required when lazy preemption is accepted into the mainline, which will result in rcu_read_lock() and rcu_read_unlock() explicitly disabling and enabling preemption, respectively. So enjoy this sub-nanosecond reader overhead while it lasts!
Synchronous grace-period latency is around 100 nanoseconds for both normal and expedited grace periods, requiring only a single non-atomic volatile read-modify-write operation. That is to say, on Tiny RCU, synchronize_rcu() is an O(1) operation. Queuing a callback for an asynchronous grace period requires only interrupt disabling and non-atomic addition to a linked-list in the common case, requiring roughly 150 nanoseconds. In the uncommon case, idle threads must invoke the scheduler, which can consume upwards of a handful of microseconds.
Tiny Configuration
Configuration? Tiny RCU don't have no configuration! Tiny RCU don't need no stinking configuration!!!
(With apologies to fans of “The Treasure of the Sierra Madre”.)
Non-preemptible Tree RCU
This RCU implementation is used in kernels built for multi-CPU systems (CONFIG_SMP=y) in which preemption is disabled at build time (CONFIG_PREEMPT_DYNAMIC=n along with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y).
Non-Preemptible Tree Design
Non-Preemptible Tree RCU is more complex than Tiny RCU. Grace-period requests (from call_rcu(), synchronize_rcu(), and friends) are batched into grace-period computations, and RCU's grace-period kthread runs a series of such computations, one at a time. Any grace-period requests arriving while one computation is in flight will be satisfied by the next grace period.
If RCU's grace-period kthread is idle, then any RCU grace-period request will bring it out of idle. If there are many concurrent requests while RCU is idle, then the rcu_node combining tree will select which request starts the grace period while avoiding excessive levels of lock or memory contention. This same combining tree gathers quiescent states from CPUs, and once each CPU has been observed in a quiescent state, reports the completion of a grace period.
Expedited grace periods are handled in a similar fashion, but are run separately by a set of kthread_worker instances, one per leaf rcu_node structure.
Answer:
For one thing, blocking is incompatible with configurations that invoke
callbacks from softirq handlers.
For another thing, if you really want to block, why not spawn a workqueue
handler?
For example, instead of call_rcu(), you can use
queue_rcu_work(), which will spawn the specified workqueue
handler after an RCU grace period ends.
The rcu_head structures queued by call_rcu() as callbacks are maintained on per-CPU lists, but due to things like callback offloading and CPU-hotplug operations, a given RCU callback might not be invoked on the CPU that queued it. Depending on configuration, these callbacks might be invoked in a softirq handler, a per-CPU kthread bound to the CPU in question, or an unbound per-CPU kthread. But in any case, the callbacks will be invoked under local_bh_disable().
Memory to be freed via the kfree_rcu() and kfree_rcu_mightsleep() functions is tracked using pages of pointers, which greatly reduces the cache-miss rate compared to traversing lists of rcu_head structures. Of course, this means allocating memory on the free path, which is safe only because under OOM conditions kfree_rcu() falls back to using that linked list of rcu_head structures and kfree_rcu_mightsleep() falls back to directly invoking synchronize_rcu() (hence the _mightsleep()).
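For example, assuming an illustrative struct item with an embedded rcu_head named rh:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct item {
            struct rcu_head rh;
            int data;
    };

    /* Double-argument form: needs the rcu_head field, but never sleeps. */
    static void item_defer_free(struct item *ip)
    {
            kfree_rcu(ip, rh);
    }

    /* Single-argument form: no rcu_head needed, but might sleep. */
    static void item_defer_free_mightsleep(struct item *ip)
    {
            kfree_rcu_mightsleep(ip);
    }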
Non-Preemptible Tree Performance
As with Tiny RCU, in kernels built with CONFIG_PREEMPT_COUNT=n, non-preemptible RCU readers might well contain the same sequence of machine instructions that would be used for a single-threaded implementation of that same algorithm. The result is again that rcu_read_lock() and rcu_read_unlock() are sub-nanosecond O(1) operations. And again as with Tiny RCU, this will end once CONFIG_PREEMPT_COUNT=y is required for lazy preemption, though most likely with minimal (if any) performance degradation. Those interested in testing this theory are referred to the v2 patch series.
In sharp contrast with Tiny RCU, synchronous grace-period latency is measured in milliseconds, and on the laptop Paul is typing this on, seven of them. However, expedited grace periods complete in a few tens of microseconds, with each additional CPU adding a few microseconds, for on the order of 60 or 70 microseconds on Paul's laptop. Queuing a callback for an asynchronous grace period still only requires non-atomic addition to a linked-list with interrupts disabled in the common case, albeit to a more elaborate linked list with more statistics in play. A typical call_rcu() invocation weighs in at about 350 nanoseconds, but with a long tail extending up to tens of milliseconds in order to handle a number of obtuse corner cases that multiple-CPU systems can give rise to. Unlike Tiny RCU, idle threads need not be treated specially. Also unlike Tiny RCU, callbacks may be offloaded, in which case callback queueing is more complex, involving lock acquisitions along with bypass lists to reduce lock contention.
Non-Preemptible Tree RCU Configuration
RCU's Kconfig options and kernel boot parameters can be considered to be part of the RCU API, most especially from the viewpoint of someone building a kernel intended for a specialized device or workload. This section summarizes the RCU-related Kconfig options and the more commonly used kernel boot parameters, but please note that many of the Kconfig options require that the CONFIG_RCU_EXPERT Kconfig option be set.
Combining-Tree Layout
This section also applies to Tree SRCU.
CONFIG_RCU_FANOUT controls the fanout of non-leaf nodes of the rcu_node combining tree. Lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default values of 32 on 32-bit systems and 64 on 64-bit systems almost always suffice, with one exception being rcutorture stress testing.
Answer:
What gives is that we have not yet come across a 32-bit system
with thousands of CPUs.
But to your point, if such systems ever appear, they might well need
kernels built with CONFIG_RCU_FANOUT_LEAF=32.
The CONFIG_RCU_FANOUT_LEAF Kconfig option controls the fanout of leaf nodes of the tree. As with CONFIG_RCU_FANOUT, lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default value of 16 is sufficient in most cases, but very large systems (thousands of CPUs) may want to set this to the largest possible value, namely 64. Such systems will also need to boot with skew_tick=1 to avoid massive lock contention on the leaf rcu_node ->lock fields due to synchronized scheduling-clock interrupts. On the other hand, synchronized scheduling-clock interrupts are useful on smaller systems to reduce the number of socket-level CPU wakeup events, which in turn greatly reduces power consumption on systems that are often fully idle.
The leaf-level fanout can also be set at boot time. The rcutree.rcu_fanout_leaf= kernel-boot parameter sets the number of CPUs to assign to each leaf-level rcu_node structure, overriding the CONFIG_RCU_FANOUT_LEAF Kconfig option. The rcutree.rcu_fanout_exact= kernel boot parameter disables autobalancing of the rcu_node combining tree. For example, with autobalancing, an 18-CPU system would assign 9 CPUs to each leaf rcu_node structure. Without autobalancing, the first rcu_node structure would get 16 CPUs and the second only 2 of them. To the best of our knowledge, autobalancing has always worked well, but this parameter allows for the possibility that systems with odd cache layouts might need it.
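For example, a large system of the sort described above might (hypothetically) boot with the following command-line fragment:

    rcutree.rcu_fanout_leaf=64 skew_tick=1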
Expedited Grace-Period Configuration
This section also applies to Tree SRCU.
Answer:
Because CONFIG_TINY_RCU's normal grace-period primitives
are (almost) no-ops, they are already maximally expedited.
Almost nothing is faster than doing (almost) nothing!
Normal grace periods are relatively slow. Expedited grace periods are orders of magnitude faster, but disrupt idle CPUs to the detriment of energy efficiency and real-time response. The Linux-kernel source code makes careful choices between the two, but some workloads need a more extreme all-or-nothing policy. Workloads valuing fast grace periods above all else may use the rcupdate.rcu_expedited= kernel boot parameter, which causes normal grace-period primitives to act like their expedited counterparts. For example, invoking synchronize_rcu() will act as if synchronize_rcu_expedited() had been invoked.
On the other hand, workloads valuing reduced OS noise above all else may use the rcupdate.rcu_normal= kernel boot parameter, which causes expedited grace-period primitives to act like their normal counterparts. This kernel boot parameter overrides rcupdate.rcu_expedited= except during very early boot (as in before the scheduler has been initialized).
Finally, workloads valuing reduced OS noise at runtime and fast grace periods during boot may use the rcupdate.rcu_normal_after_boot= kernel boot parameter, which causes expedited grace-period primitives to act like their normal counterparts once init has spawned. For example, real-time systems desiring fast boot but wishing to avoid run-time IPIs from expedited grace periods could set both rcupdate.rcu_expedited= and rcupdate.rcu_normal_after_boot=.
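For example, such a real-time system might (hypothetically) add the following to its kernel command line:

    rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1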
Grace-Period Configuration
In the worst case, RCU's grace-period kthread must check for quiescent states from each CPU. Assuming that checking each CPU consumes a given amount of time, and further assuming constant grace-period latency, it is easy to solve for the number of CPUs at which this checking would cause the grace-period kthread to consume an entire CPU. The RCU design breaks this last assumption by slowing down the grace-period state machine on larger systems. This slowing down also proved to be important for certain large systems of years past, on which RCU's default actions would result in hardware lockups. RCU therefore autotunes the force-quiescent-state delays in its state machine, increasing these delays with increasing numbers of CPUs. This has worked quite well in practice, but there have been a few exceptions. The following kernel boot parameters allow systems administrators to address these exceptions.
The rcutree.jiffies_till_first_fqs= kernel boot parameter sets the number of jiffies to wait between grace-period initialization and the first force-quiescent-state scan. The default depends on the value of HZ and the number of CPUs on the system. By default, all systems wait for at least one jiffy, with one additional jiffy for HZ greater than 250, an additional jiffy for HZ greater than 500, and one additional jiffy for each 256 CPUs on the system. This value may be manually set to zero, which can be useful for specialty systems that tend to have idle CPUs, that need fast grace periods, and that don't mind burning a little extra CPU during grace-period initialization.
The rcutree.jiffies_till_next_fqs= kernel boot parameter sets the number of jiffies to wait between successive force-quiescent-state scans. The default is the same as for rcutree.jiffies_till_first_fqs=. Even those specialty systems that set rcutree.jiffies_till_first_fqs= to zero would be well-advised to set rcutree.jiffies_till_next_fqs= to a non-zero value.
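A specialty system of the type described above might therefore (hypothetically) boot with something like the following, where the specific values are purely illustrative:

    rcutree.jiffies_till_first_fqs=0 rcutree.jiffies_till_next_fqs=1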
If the grace period takes too long, RCU will solicit help from rcu_note_context_switch() and cond_resched(). The rcutree.jiffies_till_sched_qs= kernel boot parameter sets the number of jiffies that a grace period will wait before soliciting such help. An additional rcutree.qovld kernel boot parameter sets the number of RCU callbacks on a given CPU above which RCU will request even more aggressive help.
The rcutree.rcu_kick_kthreads kernel parameter causes the grace-period kthread to get an extra wake-up if it sleeps more than three times longer than specified. This can be useful when tracking down issues in the scheduler and in timers, though it is of course more frequently used to track down issues in RCU.
Callback Configuration
Traditionally, RCU callbacks have been invoked in a softirq handler. In this case, RCU must take care to limit the number of callbacks invoked in a given call to that handler, lest other system processing be unduly delayed. By default, this involves accurate but relatively slow time checks based on the rcutree.rcu_resched_ns kernel boot parameter. This slowness causes time to be checked only every 32 callbacks. To handle the case of a single long-running callback, the CONFIG_RCU_DOUBLE_CHECK_CB_TIME=y Kconfig parameter may be specified. This will use the low-overhead but low-accuracy jiffies counter on every callback invocation, but must of course also round the limit up to the next jiffy.
The rcutree.blimit= kernel boot parameter sets the maximum number of RCU callbacks to process in one batch, which defaults to ten callbacks, but increasing with increasing numbers of callbacks queued, controlled by the rcutree.rcu_divisor kernel boot parameter. This limit does not apply to offloaded CPUs. The rcutree.qhimark= kernel boot parameter sets the threshold of queued RCU callbacks beyond which rcutree.blimit= will be ignored. This defaults to 10,000 callbacks. The rcutree.qlowmark= kernel boot parameter sets the threshold of queued RCU callbacks below which rcutree.blimit= will once again have effect. This defaults to 100 callbacks.
The CONFIG_RCU_NOCB_CPU=y Kconfig option allows callback processing to be offloaded from selected CPUs, with the “NOCB” standing for “no callbacks”. Instead of being invoked in a softirq handler, offloaded callbacks are invoked using special rcuog and rcuoc kthreads, which are controlled by the scheduler and can be constrained by the system administrator. This option was intended to reduce operating-system noise for HPC and real-time workloads, but also fortuitously improves RCU's energy efficiency by reducing the number of RCU's callback-processing wakeups.
By default, even in kernels built with CONFIG_RCU_NOCB_CPU=y, no CPUs will actually be offloaded. Therefore, the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y Kconfig option causes all CPUs to be offloaded. For those for whom this all-or-nothing choice is unsuitable, offloading can also be specified at boot time using the rcu_nocbs= and nohz_full= kernel boot parameters. For example, rcu_nocbs=1-3,7 would cause CPUs 1, 2, 3, and 7 to have their callbacks offloaded to rcuo kthreads. For another example, either rcu_nocbs=0-N or rcu_nocbs=all would cause all CPUs to have their callbacks offloaded. Users cannot yet change the set of offloaded CPUs at runtime, but the infrastructure for doing so is in place.
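For example, a hypothetical 8-CPU system reserving CPU 0 for housekeeping might combine these parameters as follows:

    nohz_full=1-7 rcu_nocbs=1-7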
The rcu_nocb_poll kernel boot parameter offloads not just callback invocation, but also wakeup operations. On the other hand, this means that all of the rcuog kthreads must poll, which probably is not what you want on a battery-powered system.
Callback offloading involves per-CPU rcuoc kthreads, which invoke the callbacks, along with a smaller number of rcuog kthreads that handle the needed grace periods. By default, the number of rcuog kthreads is roughly the square root of the number of CPUs, but it can be controlled by the rcutree.rcu_nocb_gp_stride kernel boot parameter. Larger numbers for this parameter reduce wakeup overhead on the global grace-period kthread, but do so by shuffling that overhead off to each group's rcuog kthread. Should this tradeoff prove problematic, it may be necessary to have two levels of rcuog kthreads, with the number of upper-level kthreads being the cube root of the number of CPUs and the number of lower-level kthreads being the square of this cube root. But let's actually see a real problem before “solving” it.
The rcutree.nocb_nobypass_lim_per_jiffy kernel boot parameter specifies how many offloaded callbacks may be queued during any given jiffy before a special “bypass” is pressed into service to reduce lock contention. To the best of our knowledge, the default value of this parameter has always worked well.
The CONFIG_RCU_LAZY=y Kconfig option is intended to reduce power consumption on nearly idle systems, over and above that provided by RCU callback offloading. This is accomplished by allowing callbacks to queue up for some time before starting the needed grace period. This time will be abbreviated if the system runs short of memory or if the number of callbacks queued rises too high.
In some cases, those building a kernel might want their users to be able to enable callback laziness, but do not want all of their users to be forced to run with lazy callbacks. The CONFIG_RCU_LAZY_DEFAULT_OFF=y Kconfig option is intended for this case, so that kernels with both options set have the ability to support lazy callbacks but do not do so by default. Then users desiring laziness can specify the rcutree.enable_rcu_lazy kernel boot parameter.
Debugging Configuration
The CONFIG_RCU_TRACE Kconfig option enables RCU event tracing.
The CONFIG_DEBUG_OBJECTS_RCU_HEAD Kconfig option enables debug-objects checking for multiple invocations of call_rcu() (and friends) on the same structure.
The CONFIG_PROVE_RCU Kconfig option enables lockdep-RCU checking. The CONFIG_PROVE_RCU_LIST Kconfig option enables lockdep-RCU checking for RCU-protected lists. In both cases, only a single lockdep-RCU splat will be emitted per boot. The rcupdate.rcu_self_test kernel boot parameter, as you might expect from the name, runs RCU's boot-time self-tests. The rcutree.dump_tree kernel boot parameter dumps out a representation of RCU's combining tree, which can be useful for verifying that the Kconfig options and kernel boot parameters used to control the combining-tree shape actually did what was expected.
The CONFIG_RCU_TORTURE_TEST Kconfig option enables RCU torture testing, which is also known as rcutorture. This is a tri-state parameter, permitting rcutorture.c to be compiled into the kernel, built as a module, or omitted entirely. When rcutorture.c is built into the kernel (CONFIG_RCU_TORTURE_TEST=y), then RCU torture testing starts during boot. Please don't try this on production systems! The CONFIG_RCU_SCALE_TEST Kconfig option enables RCU update-side performance testing, which operates in a manner similar to rcutorture. The CONFIG_RCU_REF_SCALE_TEST Kconfig option enables RCU read-side performance testing, which also operates in a manner similar to rcutorture. The rcutree.do_rcu_barrier kernel boot parameter causes the kernel to invoke the rcu_barrier() function, which is useful to clean up between each of a long series of RCU test module runs. This parameter is also useful to clean up after running user-space testing that heavily exercises call_rcu(); one example might be benchmarks that make frequent use of the close() system call in multi-threaded processes. Without this, deferred overhead initiated by the previous run might be incorrectly attributed to the next run.
The CONFIG_RCU_CPU_STALL_TIMEOUT Kconfig option specifies the maximum normal grace-period duration, in seconds, that RCU will tolerate without complaint. Excessively long grace periods are usually caused by CPUs or tasks failing to find their way out of an RCU read-side critical section in a timely manner. CPU stalls can be caused by a number of bugs, as described in Documentation/RCU/stallwarn.txt. This Kconfig variable defaults to 21 seconds. Note that if a grace period persists for more than half of this RCU CPU stall warning timeout, holdout CPUs will start receiving vivacious interprocessor interrupts in an attempt to get them to report quiescent states. The rcupdate.rcu_cpu_stall_timeout= kernel boot parameter overrides the build-time CONFIG_RCU_CPU_STALL_TIMEOUT setting. The rcupdate.rcu_cpu_stall_suppress= kernel boot parameter suppresses RCU CPU stall warning messages. The rcupdate.rcu_cpu_stall_suppress_at_boot kernel boot parameter also suppresses RCU CPU stall warning messages, but only during boot. The rcupdate.rcu_cpu_stall_ftrace_dump kernel boot parameter causes the ftrace buffer to be dumped to the console log after an RCU CPU stall warning is reported.
The CONFIG_RCU_CPU_STALL_CPUTIME Kconfig option causes the RCU CPU stall warning messages to provide additional statistics on CPU consumption, interrupts, and tasks during the sampling period, which starts halfway to the first stall warning in a series. The rcupdate.rcu_cpu_stall_cputime kernel boot parameter overrides the CONFIG_RCU_CPU_STALL_CPUTIME Kconfig option.
The CONFIG_RCU_CPU_STALL_NOTIFIER Kconfig option enables CPU-stall notifiers, which are invoked just before printing the RCU CPU stall warning. You almost certainly do not want this. The rcupdate.rcu_cpu_stall_notifiers kernel boot parameter activates these notifiers in kernels built with this Kconfig option. One use for CPU-stall notifiers is to dump the trace buffer before the stall warning is printed.
The CONFIG_RCU_EXP_CPU_STALL_TIMEOUT Kconfig option specifies the maximum expedited grace-period duration, this time in milliseconds, that RCU will tolerate without complaint. Some embedded platforms set this to 20 milliseconds. The rcupdate.rcu_exp_cpu_stall_timeout kernel boot parameter overrides the build-time CONFIG_RCU_EXP_CPU_STALL_TIMEOUT setting. The rcupdate.rcu_exp_stall_task_details kernel boot parameter provides stack dumps of any tasks blocking the current expedited RCU grace period.
The CONFIG_RCU_EQS_DEBUG Kconfig option causes RCU to check idle-state entry and exit. Please enable this when adding new interrupt paths to your architecture, including when adding new architectures. This Kconfig option can help you find things like calls to rcu_irq_enter() that lack a matching rcu_irq_exit().
The CONFIG_RCU_STRICT_GRACE_PERIOD Kconfig option provides a debugging implementation that forces fast grace periods, but at the expense of performance and scalability. In fact, this option is available only in kernels built for four or fewer CPUs. This option is useful in combination with tools like KASAN to detect pointers leaking from RCU read-side critical sections. The rcutree.rcu_unlock_delay kernel boot parameter specifies an additional delay, in microseconds, following each call to rcu_read_unlock().
The rcutree.sysrq_rcu kernel boot parameter commandeers the sysrq-y key to dump out RCU state.
If you are working with code that uses RCU, please do us all a favor and test that code with CONFIG_PROVE_RCU and CONFIG_DEBUG_OBJECTS_RCU_HEAD enabled. If you are modifying the RCU implementation itself, you will need to run rcutorture, with multiple runs covering the relevant kernel configuration parameters. A one-hour rcutorture run on an 8-CPU machine qualifies as light rcutorture testing. The automated scripts invoked by tools/testing/selftests/rcutorture/bin/kvm.sh and tools/testing/selftests/rcutorture/bin/torture.sh can be quite helpful.
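For example, one plausible light-testing invocation on such an 8-CPU machine is shown below, though the flags and scenario name should be checked against the version of the scripts in your source tree:

    tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 60 --configs "TREE01"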
Yes, running extra tests can be a hassle, but we are here to tell you that extra testing is much easier than trying to track down bugs in your RCU code!!!
kfree_rcu() Configuration
The kfree_rcu() implementation maintains a list of pages, which are filled in with the pointers that have been passed to kfree_rcu(). These pages of pointers increase performance by reducing cache-miss overhead compared to use of a linked list; however, the implementation will fall back to a linked list of rcu_head structures under low-memory conditions. The rcutree.rcu_delay_page_cache_fill_msec kernel parameter sets the page-cache refill delay (in milliseconds) in response to low-memory conditions. During this time, the kfree_rcu() implementation refrains from allocating additional pages of pointers. The default is five seconds and values can range from zero to 100 seconds.
The rcutree.rcu_min_cached_objs kernel parameter specifies the minimum number of page-sized objects that are cached and maintained per CPU. This cache reduces pressure on the page allocator, and also provides better behavior under low-memory conditions.
Preemptible Tree RCU
This RCU implementation is used in kernels built for multi-CPU systems (CONFIG_SMP=y) in which preemption is enabled at build time (CONFIG_PREEMPT_DYNAMIC=y or either CONFIG_PREEMPT=y or CONFIG_PREEMPT_RT=y).
Preemptible Tree RCU Design
Preemptible Tree RCU uses the same overall design as its non-preemptible counterpart, but must look for quiescent states from tasks preempted within RCU read-side critical sections as well as quiescent states from CPUs. Because scanning the full list of tasks can be quite slow, RCU instead keeps lists of tasks preempted within critical sections, with one such list per leaf rcu_node structure.
Preemptible Tree RCU also incurs additional complexity compared to its non-preemptible counterpart due to the possibility of preemptible RCU read-side critical sections (those marked by rcu_read_lock() and rcu_read_unlock(), for example) partially overlapping with non-preemptible critical sections (those marked by local_irq_disable() and local_irq_enable(), for example). Chains of up to eight partially overlapping readers are tested by rcutorture.
Preemptible Tree RCU Performance
In contrast to both Tiny and non-preemptible RCU, preemptible RCU readers must increment and decrement a per-task counter, and in some (usually rare) circumstances rcu_read_unlock() must remove the current task from the list of tasks that blocked within an RCU reader, urgently report a quiescent state, or cancel RCU priority boosting. An rcu_read_lock() and rcu_read_unlock() pair typically incurs about 6 nanoseconds on Paul's laptop. The fact that preemptible RCU has acceptable performance in most workloads is one reason to be confident that non-preemptible RCU and Tiny RCU will continue to perform well with the advent of lazy preemption.
As with non-preemptible RCU, synchronous grace-period latency is measured in milliseconds, typically about 12 of them. Also as with non-preemptible RCU, expedited grace periods complete in a few tens of microseconds, with each additional CPU adding a few microseconds. Queuing a callback is the same as for non-preemptible RCU, corner cases, offloading, and all.
Preemptible Tree RCU Configuration
Preemptible Tree RCU uses the same configuration setup as does Non-Preemptible Tree RCU, but has additional configuration specific to real-time systems.
The CONFIG_RCU_BOOST=y Kconfig parameter enables RCU priority boosting. This could be considered a debugging option, but it is increasingly used in production. This Kconfig parameter causes blocked RCU readers to be priority-boosted in order to prevent them from indefinitely prolonging the current RCU grace period. The CONFIG_RCU_BOOST_DELAY Kconfig parameter specifies how long RCU will allow a grace period to be delayed before starting RCU priority boosting. The default is 300 milliseconds, which seems to work quite well for many, but not all, workloads. The rcutree.kthread_prio kernel boot parameter specifies the real-time priority to boost to, and defaults to priority one, the least-important real-time priority level. You should set this priority level to be greater than that of the highest-priority real-time CPU-bound thread. The default priority is appropriate for the common case where there are no CPU-bound threads running at real-time priority.
The CONFIG_RCU_EXP_KTHREAD=y Kconfig parameter specifies that RCU's expedited grace-period processing should be done in a real-time kthread, which can further reduce the latencies of expedited grace periods at the expense of being more disruptive. Dislike of such disruption leads aggressive real-time kernels to disable this real-time expedited processing as a side effect of booting with rcupdate.rcu_normal_after_boot=1, thus avoiding expediting entirely.
The CONFIG_RCU_NOCB_CPU_CB_BOOST=y Kconfig parameter specifies that the rcuoc kthreads that invoke offloaded RCU callbacks also run at real-time priority.
The rcutree.use_softirq=0 kernel boot parameter offloads RCU grace-period processing from its normal softirq handler to a set of per-CPU rcuc kthreads, thus enabling the scheduler to arrange for RCU grace period processing to run at a time more consistent with the system's real-time response requirements.
Tiny SRCU
Answer:
First, the current implementation assumes a non-preemptible kernel,
which means that explicit preempt_disable() invocations
must be added.
Second, there is some reason to believe that Tiny SRCU is more sensitive
to indefinite reader preemption than is Tree SRCU, and this requires
testing and possibly further adjustments.
Third, this being computer software, there will always be surprises!
This SRCU implementation is used in kernels built for single-CPU systems (CONFIG_SMP=n) in which preemption is disabled at build time (CONFIG_PREEMPT_DYNAMIC=n along with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y). In theory, it could also be used in preemptible kernels, and testing is ongoing to determine whether or not doing so is wise.
Tiny SRCU Design
Tiny SRCU uses a pair of counters and an index. The index is incremented at the beginning and end of each Tiny SRCU grace period. The second-to-lowest bit of this index is used by srcu_read_lock() to select which of the two counters should be incremented. This same value is passed to the matching srcu_read_unlock(), which then decrements whichever counter that srcu_read_lock() incremented. If srcu_read_unlock() finds that the value of that counter is zero, it wakes up the grace-period workqueue handler.
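A minimal usage sketch, applicable to both Tiny and Tree SRCU, shows the index flowing from srcu_read_lock() to srcu_read_unlock() (my_srcu and the two functions are illustrative names):

    #include <linux/srcu.h>

    DEFINE_STATIC_SRCU(my_srcu);

    static void my_reader(void)
    {
            int idx;

            idx = srcu_read_lock(&my_srcu);
            /* Access my_srcu-protected data; sleeping is permitted here. */
            srcu_read_unlock(&my_srcu, idx);
    }

    static void my_updater(void)
    {
            /* Unlink the old data from all SRCU-protected structures... */
            synchronize_srcu(&my_srcu);
            /* ...after which it is safe to free that data. */
    }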
Tiny SRCU Performance
Tiny SRCU readers must explicitly manipulate counters, so an srcu_read_lock() and srcu_read_unlock() pair typically incurs six or seven nanoseconds on Paul's laptop.
Unlike Tiny RCU, Tiny SRCU synchronous grace periods cannot be no-ops, but they typically require less than 2µs. Asynchronous grace periods via call_srcu() incur the usual sub-microsecond latencies.
Tiny SRCU Configuration
Like Tiny RCU, Tiny SRCU does not need any configuration, stinking or otherwise.
Tree SRCU
This SRCU implementation is used in kernels built for multi-CPU systems (CONFIG_SMP=y), and also for single-CPU systems in which preemption is enabled at build time (CONFIG_PREEMPT_DYNAMIC=y or either of CONFIG_PREEMPT=y and CONFIG_PREEMPT_RT=y).
Tree SRCU Design
Tree SRCU uses an index-and-counter scheme similar to that of Tiny SRCU, but uses per-CPU counters, and two of them per CPU, one for use by srcu_read_lock() and the other for use by srcu_read_unlock(), both of which increment their respective counters. The grace-period computation then compares the sums of the srcu_read_lock() and srcu_read_unlock() counters, declaring the grace period done when they compare equal. Careful memory ordering is of course required.
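The following conceptual sketch illustrates the counter-summation idea using invented type and field names; the in-kernel implementation additionally handles memory ordering, index wrapping, and readers that migrate between CPUs:

    #include <linux/compiler.h>
    #include <linux/cpumask.h>
    #include <linux/percpu.h>

    /* Invented per-CPU bookkeeping, for illustration only. */
    struct demo_srcu_cpu {
            unsigned long lock_count[2];    /* Incremented by srcu_read_lock().   */
            unsigned long unlock_count[2];  /* Incremented by srcu_read_unlock(). */
    };

    /* Have all readers that started in phase "idx" also finished? */
    static bool demo_srcu_readers_done(struct demo_srcu_cpu __percpu *pcp, int idx)
    {
            unsigned long locks = 0, unlocks = 0;
            int cpu;

            /* Sum the unlocks first; the real code interposes a full memory barrier. */
            for_each_possible_cpu(cpu)
                    unlocks += READ_ONCE(per_cpu_ptr(pcp, cpu)->unlock_count[idx]);
            for_each_possible_cpu(cpu)
                    locks += READ_ONCE(per_cpu_ptr(pcp, cpu)->lock_count[idx]);
            return locks == unlocks;
    }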
Tree SRCU uses an srcu_node combining tree similar to RCU's rcu_node combining tree, for many of the same purposes. However, most srcu_struct instances never have heavy update-side use, and for those instances, a combining tree would be a waste of both CPU and memory. Therefore, by default, on smaller systems (where “smaller” is defined by the srcutree.big_cpu_lim kernel boot parameter, which defaults to 128 CPUs), the combining tree is allocated for a given srcu_struct only after contention is encountered. In cases where the default does not work well, the srcutree.convert_to_big kernel boot parameter may be used to change the allocation strategy, as discussed below and in Paul's 2022 LSFMM+BPF slideset.
Tree SRCU Performance
Tree SRCU readers must manipulate per-CPU counters and also execute full memory barrier instructions, so that an srcu_read_lock() and srcu_read_unlock() pair consumes about 20 or 30 nanoseconds on Paul's laptop.
The auto-expediting of the first SRCU grace period results in wide variance in synchronize_srcu() latency, which can be less than four microseconds, is typically about two milliseconds, and can range upwards of 20 milliseconds. The latency of synchronize_srcu_expedited() can also be a handful of microseconds, but tends to be a few tens of microseconds plus a few microseconds per CPU. The latency of calls to the call_srcu() function tends to be in the 100-200 nanosecond range, but can take several milliseconds, for example, in corner cases when an SRCU grace period must be started or when the corresponding srcu_struct structure's combining tree has not yet been allocated.
Tree SRCU Configuration
Answer:
It is not clear that Hyrum's Law can ever be completely neutralized,
but debugging environments and stress tests can fend it off in at
least some situations.
Tree SRCU automatically expedites the first grace period following a sufficiently long idle phase. This admittedly odd semantic is due to the fact that Tree SRCU's non-tree predecessor did this as an accident of implementation, and its users came to rely on it, which is an example of Hyrum's Law in action. Failure to automatically expedite was therefore one of the first bugs filed against Tree SRCU. The number of nanoseconds that SRCU must have been idle in order to auto-expedite the next grace period is specified by the srcutree.exp_holdoff kernel boot parameter, which defaults to 25,000 of them, or 25 microseconds. As you might guess, a value of zero disables auto-expediting.
However, too much auto-expediting could result in excessive CPU consumption, and so there are a number of kernel boot parameters that further limit auto-expediting. The srcutree.srcu_max_nodelay kernel boot parameter specifies the number of no-delay instances per jiffy for the SRCU grace-period worker thread. Beyond this limit, the worker thread will be rescheduled with a sleep delay of one jiffy. The srcutree.srcu_max_nodelay_phase kernel boot parameter specifies the number of non-sleeping polls of readers per grace-period phase. Beyond this limit, the grace-period worker thread will be rescheduled with a sleep delay of one jiffy between each rescan of the readers. The srcutree.srcu_retry_check_delay kernel boot parameter specifies the number of microseconds of non-sleeping delay between each non-sleeping poll of readers.
The initial implementation of Tree SRCU placed the srcu_node combining tree directly into the srcu_struct structure. This was a problem for Linux distributions that build their kernels setting the CONFIG_NR_CPUS Kconfig option to thousands of CPUs, where the resulting srcu_struct structures weighed in at tens of kilobytes each. Although the huge memory sizes of today's systems might make this seem to be an emphatic non-problem, it can still be problematic when a structure appears within another structure used in hot code paths, in terms of both cache locality and instruction addressing modes.
Worse yet, most SRCU use cases feature extremely infrequent update rates, so that this huge combining tree was doing nothing but consuming memory and forcing compilers to use slow instruction sequences for fields beyond the srcu_struct structure. Therefore, newer kernels move the srcu_node combining tree and much else besides out of the srcu_struct structure. By default, on small systems, the combining tree is not allocated until lock contention is encountered. On larger systems, where “larger” by default means 128 CPUs, the combining tree is allocated at init_srcu_struct() time because the default single globally locked callback queue might otherwise result in congestive collapse before the tree could be allocated and initialized. In either case, srcu_struct structures that are statically allocated (via the DEFINE_SRCU() or DEFINE_STATIC_SRCU() macros) preallocate everything at build time.
The default settings seem to be working well, but it is not necessarily wise to rely too much on that, given that a new use case could arise at any time anywhere in the world. Therefore, the srcutree.big_cpu_lim kernel boot parameter specifies the number of CPUs constituting a large system, in case 128 ever proves to be a poor choice.
The srcutree.convert_to_big kernel boot parameter specifies under what conditions an SRCU tree srcu_struct structure will be converted to big form, that is, with an rcu_node combining tree:
Code   Description
0      Never.
1      At init_srcu_struct() time.
2      When rcutorture decides to.
3      Decide at boot time (default).
0x1X   Above plus if high contention.
Either way, the srcu_node tree will be sized based on the actual runtime number of CPUs (nr_cpu_ids) instead of the compile-time CONFIG_NR_CPUS. The srcutree.small_contention_lim kernel boot parameter specifies the number of update-side contention events per jiffy that will be tolerated before initiating a conversion of an srcu_struct structure to big form. Note that the value of srcutree.convert_to_big must have the 0x10 bit set for contention-based conversions to occur, with popular choices including 0x10 (don't allocate until contention is encountered) and 0x13 (allocate at initialization time on large systems and when contention is encountered on small systems).
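For example, the following (hypothetical) kernel command-line fragment selects the second of those popular choices while keeping the default contention limit:

    srcutree.convert_to_big=0x13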
And finally, from the “computer arithmetic is finite” file, the srcutree.counter_wrap_check kernel boot parameter specifies how frequently to check for grace-period sequence counter wrap for the srcu_data structure's ->srcu_gp_seq_needed field. The greater the number of bits set in this kernel parameter, the less frequently counter wrap will be checked for. Note that the bottom two bits are ignored.
Tasks RCU
Tasks RCU is built into kernels that require it for protection of trampolines used in tracing and by BPF (via select TASKS_RCU if PREEMPTION) and for testing (via manual enabling of FORCE_TASKS_RCU). Tasks RCU does not wait for idle tasks, a job for which Tasks Rude RCU is instead used.
Tasks RCU Design
One constraint on the trampolines is that code within them (or that might be called by them) cannot voluntarily block, although in preemptible kernels, it can be preempted. Thus, Tasks RCU uses voluntary context switch, usermode execution, and idle-task execution (including idle injection) as quiescent states. As such, it lacks read-side markers, so that there is no rcu_read_lock_tasks().
Another constraint on the code in (or called by) trampolines is that it cannot invoke cond_resched(), so that in non-preemptible kernels the only context switches that matter are voluntary context switches. This matches the semantics of non-preemptible RCU, so in non-preemptible kernels, the Tasks-RCU API is mapped to RCU's API, hence the if PREEMPTION in the select clause called out above. This will need to change with the advent of lazy preemption (v2 patch series).
Tasks RCU is unique among the current RCU implementations in scanning the full tasks list. Each step of this scan is quite lightweight, consisting of simple loads and comparisons. If the task must be waited for, it is placed on a private list, and the heavier-weight get_task_struct() is used to ensure that the task does not disappear before it is removed from this list. This private list is repeatedly scanned (with increasingly long periods of time between scans) until all the tasks have been removed from it, at which point the RCU Tasks grace period has ended.
Tasks RCU grace periods are normally extremely infrequent, so not only can this scanning be tolerated, but a single globally locked callback queue suffices. However, there have been a few RCU Tasks callback floods in the wild, and so Tasks RCU will adaptively switch to per-CPU callbacks when heavy callback load is detected and, unlike Tree SRCU, will also switch back to a single queue when the callback flood subsides. In addition, multiple concurrent grace-period requests (whether from synchronize_rcu_tasks() or call_rcu_tasks()) will be batched into a smaller number of grace-period computations. This callback-flooding and grace-period-batching code is common to all of the Tasks RCU implementations.
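For illustration, here is a sketch of how a trampoline might be retired using this API (struct tramp and the two helper functions are hypothetical, standing in for the tracing infrastructure's actual code):

    #include <linux/rcupdate.h>

    struct tramp {
            struct rcu_head rh;
            void *insns;
    };

    void redirect_callers_away_from(struct tramp *tp);  /* Hypothetical helper. */
    void free_tramp(struct tramp *tp);                  /* Hypothetical helper. */

    /* Synchronous form: simple, but the caller must be willing to sleep a while. */
    static void tramp_retire_sync(struct tramp *tp)
    {
            redirect_callers_away_from(tp);
            synchronize_rcu_tasks();  /* Wait for all tasks to leave the trampoline. */
            free_tramp(tp);
    }

    /* Asynchronous form: the callback runs after a later Tasks RCU grace period. */
    static void tramp_free_cb(struct rcu_head *rhp)
    {
            free_tramp(container_of(rhp, struct tramp, rh));
    }

    static void tramp_retire_async(struct tramp *tp)
    {
            redirect_callers_away_from(tp);
            call_rcu_tasks(&tp->rh, tramp_free_cb);
    }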
Tasks RCU Performance
The synchronize_rcu_tasks() function has a typical latency of more than 100 milliseconds, courtesy partly of that full scan of the task list, but mostly because the grace-period kthread normally sleeps for 100 milliseconds on each pass through its loop. The call_rcu_tasks() function has a typical latency of less than a microsecond, though the callback function will not be invoked until around 100 milliseconds later.
Tasks RCU Specific Configuration
There are a number of configuration parameters that are common across Tasks RCU, Tasks Rude RCU, and Tasks Trace RCU. However, those are covered in the next section. This section instead covers a kernel parameter that is specific to Tasks RCU.
In order to reduce CPU consumption and improve battery lifetime, Tasks RCU does not necessarily start a new grace period immediately upon queuing a call_rcu_tasks() callback. Instead, it waits for up to 250 milliseconds before starting a new grace period, or until 32 callbacks have been queued (but see the rcupdate.rcu_task_lazy_lim kernel parameter described below). Although the 250-millisecond default seems to be working thus far, it might prove to be inappropriate any time anywhere. Therefore, the rcupdate.rcu_tasks_lazy_ms kernel parameter sets this timeout in milliseconds. The default negative value selects the 250-millisecond delay. A value of zero disables delays.
Tasks RCU Common Configuration
This section covers a number of configuration parameters that are common across Tasks RCU, Tasks Rude RCU, and Tasks Trace RCU.
The rcupdate.rcu_task_lazy_lim kernel boot parameter specifies the number of callbacks on a given CPU that will cancel laziness on that CPU, overriding the default of 32. A value of -1 disables cancellation of laziness, but please be advised that this increases the danger of OOM due to callback flooding.
Like Tree SRCU, the various flavors of Tasks RCU start with a single globally locked callback queue, which is appropriate for the low update rates typical for tracing trampolines. However, there have been callback floods, which cause current kernels to automatically switch to per-CPU callback queues. Unlike Tree SRCU, when the flood subsides, the Tasks RCU flavors will automatically switch back to a single callback queue. There are use cases for which the default does not work well, most notably, certain rcutorture testing scenarios, and the following kernel parameters allow the default to be overridden.
The rcupdate.rcu_task_enqueue_lim kernel boot parameter sets the number of callback queues to use for the RCU Tasks family of RCU flavors. The default value of -1 allows this number to be automatically (and dynamically) adjusted, and positive values disable adjustment. This parameter is intended for use in testing. The rcupdate.rcu_task_collapse_lim kernel boot parameter sets the collapse limit, so that if no more than this number of callbacks is present at the beginning of a grace period, the implementation will automatically collapse back to a single callback queue. This switching only occurs when rcupdate.rcu_task_enqueue_lim is set to the default value of -1. The rcupdate.rcu_task_contend_lim kernel boot parameter sets the minimum number of callback-queuing-time lock-contention events per jiffy required to cause the RCU Tasks flavors to switch to per-CPU callback queuing. Again, this switching only occurs when rcupdate.rcu_task_enqueue_lim is set to the default value of -1.
Unlike Tree SRCU but like Tree RCU, the Tasks RCU flavors provide RCU CPU stall warnings. The rcupdate.rcu_task_stall_timeout kernel boot parameter sets the timeout in jiffies for RCU task stall warning messages, which defaults to ten minutes. A value less than or equal to zero disables stall warnings. A run-time change in value does not take effect until the next grace period.
The rcupdate.rcu_task_stall_info kernel boot parameter sets the initial timeout in jiffies for RCU task stall informational messages, which defaults to ten seconds. These messages give some indication of a problem for those not patient enough to wait for ten minutes. Informational messages are only printed prior to the stall-warning message for a given grace period. A value less than or equal to zero disables these informational messages. A run-time change in value does not take effect until the next grace period. The rcupdate.rcu_task_stall_info_mult kernel boot parameter specifies a multiplier for time interval between successive RCU task stall informational messages for a given RCU tasks grace period. This value is clamped to one through ten, inclusive. It defaults to the value three, so that the first informational message is printed 10 seconds into the grace period, the second at 40 seconds, the third at 160 seconds, and then the stall warning at 600 seconds would prevent a fourth at 640 seconds.
Tasks Rude RCU
Tasks Rude RCU is built into kernels that require it for protection of idle-task trampolines used in tracing and by BPF (via select TASKS_RUDE_RCU) and for testing (via manual enabling of FORCE_TASKS_RUDE_RCU). Tasks Rude RCU waits for idle tasks, and in addition for any non-preemptible region of code. Once all architectures properly place inline and noinstr directives on kernel entry/exit and deep-idle code, and once Tasks RCU pays attention to idle tasks and additional portions of the CPU-hotplug code path, it will be possible to phase out Tasks Rude RCU.
Tasks Rude RCU Design
The Tasks Rude RCU grace-period computation is quite straightforward, using schedule_on_each_cpu() to force a schedule on each CPU. As a result, any preemptible region of code and every call to schedule() acts as a quiescent state. This in turn means that there are no Tasks Rude RCU read-side markers. In addition, this use of schedule_on_each_cpu() awakens each and every CPU, even those that are sleeping in deep idle states and those that are executing nohz_full userspace code. This can be quite rude, from battery-lifetime and real-time-response viewpoints, respectively, hence the name.
Tasks Rude RCU reacts to callback flooding and batches grace period computations using code that is shared with Tasks RCU.
Tasks Rude RCU Performance
The synchronize_rcu_tasks_rude() function has a typical latency of around 100 milliseconds, because the grace-period kthread normally sleeps for 100 milliseconds on each pass through its loop. The call_rcu_tasks_rude() function has a typical latency of less than a microsecond, though the callback function will not be invoked until around 100 milliseconds later.
Tasks Rude RCU Configuration
In addition to the aforementioned common configuration, Tasks Rude RCU provides the rcupdate.rcu_tasks_rude_lazy_ms kernel boot parameter, which is similar to the rcupdate.rcu_tasks_lazy_ms kernel parameter.
Tasks Trace RCU
Tasks Trace RCU is built into kernels that require it to protect sleepable BPF programs (via select TASKS_TRACE_RCU) and for testing (via manual enabling of FORCE_TASKS_TRACE_RCU).
Tasks Trace RCU Design
Unlike Tasks RCU and Tasks Rude RCU, Tasks Trace RCU has explicit read-side markers, rcu_read_lock_trace() and rcu_read_unlock_trace(). These functions manipulate counters and flags in the running task's task_struct structure.
Answer:
Because where Tasks RCU need use only a few lightweight loads from
each task_struct structure, Tasks Trace RCU must acquire
each task's runqueue and/or priority-inheritance locks.
This higher overhead, in combination with the very large numbers of
tasks that run on each system in certain datacenters, rules out
task-list scans for Tasks Trace RCU.
Bitter experience forbade scanning the full task list, so Tasks Trace RCU's grace-period computation uses a hook in the scheduler that queues tasks that block within their read-side critical sections. This queueing operates in a manner similar to that of preemptible RCU, but using per-CPU lists rather than preemptible RCU's per-leaf rcu_node lists. The grace-period computation must scan these lists and also interact with each online CPU, albeit in a way that does not disturb idle and nohz_full CPUs.
As with Tasks Rude RCU, Tasks Trace RCU reacts to callback flooding and batches grace period computations using code that is shared with Tasks RCU.
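Putting the read-side markers and grace-period primitives together, a minimal usage sketch might look as follows (the two functions are hypothetical, standing in for the BPF infrastructure's actual code):

    #include <linux/rcupdate_trace.h>

    /* Reader side: surrounds execution of a sleepable BPF program. */
    static void demo_run_sleepable_prog(void)
    {
            rcu_read_lock_trace();
            /* Run the program; it may sleep, for example to fault in user pages. */
            rcu_read_unlock_trace();
    }

    /* Update side: unload the program only after all readers are done with it. */
    static void demo_unload_sleepable_prog(void)
    {
            /* Unpublish the program so that no new readers can find it... */
            synchronize_rcu_tasks_trace();
            /* ...then free its text and associated data. */
    }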
Tasks Trace RCU Performance
Like preemptible RCU, Tasks Trace RCU readers must increment and decrement a per-task counter, and in some (usually rare) circumstances rcu_read_unlock_trace() must remove the current task from the list of tasks that blocked while in a reader or report a quiescent state. An rcu_read_lock_trace() and rcu_read_unlock_trace() pair typically incurs about 3 nanoseconds on Paul's laptop.
The synchronous grace-period latency of synchronize_rcu_tasks_trace() is measured in milliseconds, typically about 30 of them. Queuing a callback using call_rcu_tasks_trace() is typically a sub-microsecond operation.
Tasks Trace RCU Configuration
In addition to the aforementioned common configuration, Tasks Trace RCU provides the rcupdate.rcu_tasks_trace_lazy_ms kernel boot parameter, which is similar to the rcupdate.rcu_tasks_lazy_ms kernel parameter.
In addition, the rcupdate.rcu_task_ipi_delay kernel boot parameter sets the time in jiffies during which RCU tasks will avoid sending IPIs, starting with the beginning of a given grace period. This defaults to one-half second in kernels built with CONFIG_TASKS_TRACE_RCU_READ_MB and to zero otherwise. Setting a large number avoids disturbing real-time workloads, but lengthens grace periods.