Kernel development
Brief items
Kernel release status
The 3.15 kernel is out, having been released on June 8. Headline features in 3.15 include some significant memory management improvements, the renameat2() system call, file-private POSIX locks, a new device mapper target called dm-era, faster resume from suspend, and more.
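As a quick illustration of the new system call: renameat2() extends renameat() with a flags argument, allowing, among other things, two paths to be exchanged atomically. There was no glibc wrapper at release time, so callers go through syscall(); the file names below are made up and error handling is minimal.

    /*
     * Atomically exchange two files with renameat2() (new in 3.15).
     * No glibc wrapper existed at release time, so the raw system
     * call is used; the file names here are hypothetical.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>          /* AT_FDCWD */
    #include <stdio.h>
    #include <sys/syscall.h>    /* __NR_renameat2 (with new headers) */
    #include <unistd.h>

    #ifndef RENAME_EXCHANGE
    #define RENAME_EXCHANGE (1 << 1)    /* atomically swap the two paths */
    #endif

    int main(void)
    {
        if (syscall(__NR_renameat2, AT_FDCWD, "config.new",
                    AT_FDCWD, "config", RENAME_EXCHANGE) == -1) {
            perror("renameat2");
            return 1;
        }
        return 0;
    }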
The 3.16 merge window remains open as of this writing; see the separate summary below for details of what has been merged.
Linus noted that, while overlapping the 3.16 merge window with the final 3.15 stabilization worked well enough, he is not necessarily inclined to do it every time: "I also don't think it was such a wonderful experience that I'd want to necessarily do the overlap every time, without a good specific reason for doing so. It was kind of nice being productive during the last week of rc (which is usually quite boring and dead), but I think it might be a distraction when people should be worrying about the stability of the rc."
Stable updates: 3.14.6, 3.10.42, and 3.4.92 were released on June 7, followed by 3.14.7, 3.10.43, and 3.4.93 on June 11.
Quotes of the week
So perhaps we should be using robust software engineering processes rather than academic peer review as the model for our code review process?
If you (vendors [...]) do not want to play (and be explicit and expose how your hardware functions) then you simply will not get power efficient scheduling full stop.
There's no rocks to hide under, no magic veils to hide behind. You tell _in_public_ or you get nothing.
Kernel development news
3.16 merge window, part 2
This is the second installment of our coverage of the 3.16 merge window. See last week's article for a rundown of what happened in the first few days of the window. Since then, Linus Torvalds has returned to the master branch of his repository after merging back 6800 or so non-merge commits from his next branch. At this point, he has merged 8179 patches for 3.16, 2831 of them since last week's article.
Here are some of the larger changes visible to users:
- Hugepage migration has been turned off for all architectures except x86_64 since it is only tested on that architecture and there are bugs for some of the others. It can be enabled for other architectures when they are ready to support it.
- Automatic NUMA balancing has been turned off for 32-bit x86. Existing 32-bit NUMA systems are not well supported by the code and the developers did not think the effort to support them would be worthwhile.
- The kernel memory control group (kmemcg) has been marked in the documentation and Kconfig as "unusable in real life so DO NOT SELECT IT unless for development purposes".
- 16-bit stack segments will be allowed on 64-bit x86 kernels again. That feature was disabled due to a kernel information leak (the top 16 bits of the stack pointer), which has now been fixed. Users will regain the ability to run 16-bit Windows programs in Wine on 64-bit kernels.
- The kernel EFI code will now handle Unicode characters. It has also been changed to save and restore the floating point registers around EFI calls since EFI firmware may use the FPU.
- EFI stub support for ARM64 (aarch64) has been added.
- The ARM architecture has gained hibernation support. It has also made Ftrace work with read-only text in modules. In addition, the architecture improved its stack trace support by excluding the stack-trace functions from the trace and by allowing kprobes to record stack traces.
- The remap_file_pages() system call has been marked as deprecated (a sketch of the interface being deprecated appears after this list). A replacement that emulates its semantics, though more slowly, has not yet been submitted for merging.
- The control group (cgroup) hierarchy handling has been reworked to provide a single unified hierarchy. Its use is governed by the __DEVEL__sane_behavior mount option. See our article further down and the new Documentation/cgroups/unified-hierarchy.txt for more information.
- Neil Brown's patches to make loopback NFS mounts work reliably have been merged through the NFS tree. The other parts of his fixes are coming via other trees.
- The external data representation (XDR) handling in NFS has been reworked to support access control lists (ACLs) larger than 4KB. It also returns readdir() results in chunks larger than 4KB, giving better performance on large directories.
- The PowerPC 64-bit little-endian kernel now supports the ELFv2 ABI. There is also a new 64-bit little-endian boot wrapper for PowerPC.
- New hardware support:
- Clocks: APM X-Gene real-time clocks (RTCs); MicroCrystal RV4162 RTCs; Dallas/Maxim DS1343 and DS1344 RTCs; Microchip MCP795 RTCs; Dialog Semiconductor DA9063 RTCs; Orion5x SoC clocks; and S2MPS11/S2MPS14/S5M8767 crystal oscillator clocks.
- Miscellaneous: Renesas VMSA-compatible IPMMUs (IOMMUs); Realtek RTS5129/39 series USB SD/MMC card readers; Memstick card interface for Realtek RTS5129/39 series USB card readers; X-Powers AXP202 and AXP209 power management units (PMUs); PRCM (Power/Reset/Clock Management) units for Allwinner A31 SoCs; Atmel Microcontrollers found on the iPAQ h3xxx series to handle some keys, the touchscreen, and battery monitoring; ChromeOS EC (embedded controller) i2c command tunneling; Marvell EBU SoC onboard AHCI SATA controllers; MOXART SD/MMC host controllers; Allwinner sunxi SD/MMC host controllers; Renesas USDHI6ROL0 SD/SDIO host controllers; Dell SMO8800/SMO8810 freefall sensors; IBM Akebono (476gtr) evaluation boards; Keymile's kmcoge4 boards; OCA4080 boards; T1040 and T1042 QDS boards; Freescale BSC9132 QDS boards; and Intel MID platform watchdog timers.
- Video: DTCS033 (Scopium) USB Astro-Cameras; Silicon Labs Si2157 tuners; Silicon Labs Si2168 DVB-T/T2/C demods; Broadcom Set Top Box Level 2 interrupt controllers; and Xilinx AXI Video DMA (VDMA) engines.
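As promised above, here is a rough sketch of the sort of nonlinear-mapping call that the remap_file_pages() deprecation targets: rearranging pages of a file within a single mmap() region. The file name is hypothetical, error handling is minimal, and new code should simply call mmap() once per region instead.

    /*
     * A (now-deprecated) nonlinear mapping: make the first page of
     * the mapping show page 2 of the (hypothetical) file.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        int fd = open("data.bin", O_RDONLY);
        char *p = mmap(NULL, 4 * psz, PROT_READ, MAP_SHARED, fd, 0);

        if (fd < 0 || p == MAP_FAILED) {
            perror("setup");
            return 1;
        }
        /* prot must be zero; the fourth argument is a page offset. */
        if (remap_file_pages(p, psz, 0, 2, 0) == -1)
            perror("remap_file_pages");    /* deprecated as of 3.16 */
        return 0;
    }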
Kernel developers will see the following changes:
- Flattened device tree (FDT) parsing has been converted to use libfdt. Knowledge of FDT internals has also been removed from most architectures except PowerPC.
- Videobuf2 now supports the Digital Video Broadcasting (DVB) standard.
- The 32-bit-only setup_sched_clock() has been removed. Calls to it have been converted to sched_clock_register() (see the sketch after this list).
- The create_irq() and destroy_irq() interface (and its variants) for handling sparse IRQ allocation has been removed. As Thomas Gleixner put it: "get rid of the horrible create_irq interface along with its even more horrible variants".
- The ARM level 2 cache support has been cleaned up, which results in a "much nicer structure" and some performance improvements, according to Russell King in his pull request.
- ARM64 (aarch64) has added some optimized assembly for string and memory routines as well as for cryptography algorithms (SHA family, AES, GHASH). It has also added Ftrace support.
- A tracepoint benchmarking facility has been added to the kernel tracing subsystem.
- Some tracers (latency, wakeup, wakeup_rt, irqsoff, preemptoff, preemptirqsoff) can now be used in separate tracing instances, though only one instance can use each tracer at any given time. Also, the function and function graph tracers can be used together.
- As part of the fix for CVE-2014-4014, the inode_capable() function has been renamed to capable_wrt_inode_uidgid() to better reflect what it does.
- A decode_stacktrace.sh script has been added; it turns offsets from symbols into filenames and line numbers, making stack traces easier to read.
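For the curious, a minimal sketch of the sched_clock_register() interface mentioned above might look like the following. The device, register offset, counter width, and clock rate are all invented for illustration; only the registration call itself reflects the real API.

    /*
     * Sketch of a timer driver using the new interface.  The register
     * layout and 24MHz rate are hypothetical.
     */
    #include <linux/init.h>
    #include <linux/io.h>
    #include <linux/sched_clock.h>

    static void __iomem *timer_base;    /* hypothetical timer registers */

    static u64 notrace my_timer_read(void)
    {
        /* Hypothetical free-running 32-bit up-counter. */
        return readl_relaxed(timer_base + 0x08);
    }

    static void __init my_timer_init(void)
    {
        /* 32 valid bits, counting at an assumed 24 MHz. */
        sched_clock_register(my_timer_read, 32, 24000000);
    }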
We should be most of the way through the merge window at this point, but there may still be merges of interest in the next few days. Stay tuned for next week's thrilling conclusion ...
The BFQ I/O scheduler
A block-layer I/O scheduler is charged with dispatching I/O requests to storage devices in a way that maximizes throughput while minimizing latencies. The Linux kernel currently includes a few different I/O schedulers, but, for the most part, development in this area has been slow in recent years, with no new schedulers (or major changes to existing schedulers) proposed for some time. That situation has changed with the posting of the "budget fair queuing" (BFQ) I/O scheduler, which brings some interesting new ideas to this part of the kernel.
Basic BFQ
BFQ, which has been developed and used out of tree for some years, is, in many ways, modeled after the "completely fair queuing" (CFQ) I/O scheduler currently found in the mainline kernel. CFQ separates each process's I/O requests into a separate queue, then rotates through the queues trying to divide the available bandwidth as fairly as it can. CFQ does a reasonably good job and is normally the I/O scheduler of choice for rotating drives, but it is not without its problems. The code has gotten more complex over the years as attempts have been made to improve its performance, but, despite the added heuristics, it can still create I/O latencies that are longer than desired.
The BFQ I/O scheduler also maintains per-process queues of I/O requests, but it does away with the round-robin approach used by CFQ. Instead, it assigns an "I/O budget" to each process. This budget is expressed as the number of sectors that the process is allowed to transfer when it is next scheduled for access to the drive. The calculation of the budget is complicated (more on this below), but, in the end, it is based on each process's "I/O weight" and observations of the process's past behavior. The I/O weight functions like a priority value; it is set by the administrator (or by default) and is normally constant. Processes with the same weight should all get the same allocation of I/O bandwidth. Different processes may get different budgets, but BFQ tries to preserve fairness overall, so a process getting a smaller budget now will get another turn at the drive sooner than a process that was given a large budget.
When it comes time to figure out whose requests should be serviced, BFQ examines the assigned budgets and, to simplify a bit, it chooses the process whose I/O budget would, on an otherwise idle disk, be exhausted first. So processes with small I/O budgets tend not to wait as long as those with large budgets. Once a process is selected, it has exclusive access to the storage device until it has transferred its budgeted number of sectors, with a couple of exceptions. Those are:
- Normally, a process's access to the device ends if it has no more requests to be serviced. If, however, the last request was synchronous (a read request, essentially), BFQ will let the drive sit idle for a short period to give the process a chance to generate another I/O request. Since the process was probably waiting for the read to complete before generating more I/O traffic, that request tends to arrive quickly, and it tends to be contiguous with the last request (or close to it) and, thus, fast to service. It may be counter-intuitive, but idling a drive briefly after synchronous requests tends to increase throughput overall.
- There is a maximum period of time allowed for a process to complete its requests. If its I/O load is slow to complete (most likely because it consists of random I/O patterns requiring a lot of seeking by the drive), it will lose access to the drive before it has transferred its full budget. In this case, the process will be charged for the full budget anyway to reflect its overall effect on the drive's I/O throughput.
There is still the question of how each process's budget is assigned. In its simplest form, the algorithm is this: each process's budget is set to the number of sectors it transferred the last time it was scheduled, subject to a systemwide maximum. So processes that tend to do small transfers then stop for a while will get small budgets, while I/O-intensive processes will get larger budgets. The processes with the smaller budgets, which often tend to be more sensitive to latency, will be scheduled more often, leading to a more responsive system. The processes doing a lot of I/O may wait a bit longer, but they will get an extended time slice with the storage device, allowing the transfer of a large amount of data and, hopefully, good throughput.
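As a concrete (and deliberately simplified) illustration of that rule, consider the sketch below; the names are invented and this is not BFQ's actual code:

    /*
     * Illustrative only: the basic budget rule is just "what you
     * used last time, capped at a systemwide maximum."
     */
    struct queue_state {
        unsigned long budget;           /* sectors allowed next round */
        unsigned long sectors_served;   /* sectors moved this round */
    };

    #define MAX_BUDGET 16384            /* assumed systemwide cap */

    static void set_next_budget(struct queue_state *q)
    {
        unsigned long next = q->sectors_served;

        if (next > MAX_BUDGET)
            next = MAX_BUDGET;
        q->budget = next;       /* small transfers => small budget,
                                   so the process is scheduled sooner */
        q->sectors_served = 0;
    }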
Bring on the heuristics
Some experience with BFQ has evidently shown that the above-described algorithm can yield good results, but that there is room for improvement in a number of areas. The current posting of the code has, in response, added a set of heuristics intended to push the behavior of the system in the desired direction. These include:
- Newly started processes get a medium-sized budget and an increased
weight; this allows them to do a fair amount of I/O with minimal delay
from the outset. The idea here is to improve application startup time
by giving new processes some extra I/O bandwidth to fault their code
into memory. The increased weight decays linearly as the process
runs.
- BFQ's budget calculations, including the maximum allowed budget, are
dependent on an estimate of the underlying device's peak I/O rate.
The peak I/O rate can vary considerably, though, depending on where
the data
is located on the disk and how much caching is going on inside the
drive. A number of tweaks to the peak-rate calculator try to account
for these effects. For example, the observed I/O rate for
processes that run out of time without exhausting their budgets is now
factored in, even though a timeout of this nature usually indicates
that the I/O pattern is random and the drive is not operating at its
peak rate. The reasoning is that a timeout can also indicate that the
maximum budget value is too large. There is also a low-pass filter
used to exclude especially high rate calculations, since those are
more likely to be measuring drive caching than actual I/O rates.
- The budget calculations themselves have been tweaked (see the sketch after this list). If a process runs out of requests before exhausting its budget, the old response was to lower the budget to the number of requests issued. In the current code, instead, the scheduler will look to see if any of the process's I/O requests are still outstanding; if so, the budget will be doubled on the theory that more requests will be forthcoming when those outstanding requests complete. In the case of a timeout, the budget is, once again, doubled; this tweak is meant to help processes working from slow parts of a drive and to cause processes with truly random I/O patterns to be serviced less frequently. Finally, if a process still has requests outstanding after using its entire budget, it's likely to be an I/O-intensive process; in this case the budget is quadrupled.
- Write operations are more costly than reads because disk drives
tend to cache the data and signal completion immediately; the actual
write to media is done at some later time. That can cause starvation
of read requests later on. BFQ tries to account for this cost by
counting writes more heavily against the budget; indeed, one write is
charged like ten reads.
- On drives that can queue multiple commands internally, idling (as
described above) can cause the internal queue to empty out, reducing
throughput. So BFQ will disable idling on solid-state devices with
command queuing. Idling will also be disabled on rotational devices,
but only when servicing processes with random I/O patterns.
- When multiple processes are operating on the same portion of the disk,
it can be better to keep their queues together rather than servicing
them separately. Evidently QEMU, which divides I/O among a number of
worker threads, is a good example of this type of workload. BFQ
includes an algorithm called "early queue merge" that attempts to
detect such processes and join their queues together.
- BFQ attempts to detect "soft realtime" applications — media players, for example — and boost their weight to help ensure that they experience low latencies. This detection works by looking for a pattern of issuing a set of I/O requests, then going idle (from a disk I/O point of view) for a period of time. Processes that exhibit that pattern will have their weight increased.
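As promised in the budget-calculation item above, here is how those adjustments might look if bolted onto the earlier sketch. Again, this is invented illustration, not BFQ source:

    /*
     * Continuing the sketch above: adjust the next budget depending
     * on why service ended and whether requests are still in flight.
     */
    enum stop_reason { OUT_OF_REQUESTS, BUDGET_EXHAUSTED, TIMED_OUT };

    static void adjust_budget(struct queue_state *q, enum stop_reason why,
                              int requests_outstanding)
    {
        unsigned long next = q->budget;

        switch (why) {
        case OUT_OF_REQUESTS:
            if (requests_outstanding)
                next *= 2;    /* more I/O likely once these complete */
            else
                next = q->sectors_served;   /* the old, simple rule */
            break;
        case TIMED_OUT:
            next *= 2;        /* perhaps just a slow region of the disk */
            break;
        case BUDGET_EXHAUSTED:
            if (requests_outstanding)
                next *= 4;    /* I/O-bound: longer slices, less often */
            break;
        }
        q->budget = next > MAX_BUDGET ? MAX_BUDGET : next;
    }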
The list of heuristics is longer than this, but one should get the idea: tuning the I/O patterns of a system to optimize for a wide range of workloads is a complex task. From the results posted by BFQ developer Paolo Valente, it seems that a fair amount of success has been achieved. The task of getting this code into the mainline may be just a little bit harder, though.
The path toward merging
If BFQ does have a slow path into the mainline, it will not be because the kernel developers dislike it; indeed, almost all of the comments have been quite positive. The results speak for themselves, but there was also a lot of happiness about how the scheduler has been studied and all of the heuristics have been extensively described and tested. The CFQ I/O scheduler also contains a lot of heuristics, but few people understand what they are or how they work. BFQ appears to be a cleaner and much better documented alternative.
What the kernel developers do not want to see, though, is the merging of another complex I/O scheduler that tries to fill the same niche as CFQ. Instead, as Tejun Heo suggested, they would like to see a set of patches that evolves CFQ into BFQ, leaving the kernel with a single, improved I/O scheduler.
Changing CFQ in an evolutionary way would also help when the inevitable performance regressions turn up. Finding the source of regressions in BFQ could be challenging; bisecting a series of changes to CFQ would, instead, point directly to the offending change.
The BFQ scheduler has been around for a while, and has seen a fair amount of use. Distributions like Sabayon and OpenMandriva ship it, as does CyanogenMod. It seems to be a well-proven technology. All that's needed is some time put into packaging it properly for inclusion into the mainline; once that work is complete, more extensive performance testing can follow. After any issues found there are resolved, this scheduler could replace CFQ (or, more properly, become the future CFQ) in the kernel relatively quickly.
(See this paper [PDF] for a lot more information on how BFQ works.)
The unified control group hierarchy in 3.16
The idea of reworking the kernel's control group implementation is not exactly new; see this article from early 2012, for example. However, that talk has not yet translated into much in the way of user-visible changes to the kernel. That situation will change in the 3.16 release, which will include the new unified control group hierarchy code. This article will be an overview of how the unified hierarchy will work at the user level.

At its core, the control group subsystem is simply a way of organizing processes into hierarchies; controllers can then be applied to the hierarchies to enforce policies on the processes contained therein. From the beginning, control groups have allowed the creation of multiple hierarchies, each of which can contain a different mix of processes. So one could, for example, create one hierarchy and attach the CPU scheduler controller to it. Another hierarchy could be created for the memory controller; it could contain the same processes, but with a different organization. That would allow memory usage policy to be applied to different groupings of the same processes.
This flexibility has a certain appeal, but it has its costs. It can be expensive for the kernel to keep track of all the controllers that apply to a given process. Controllers also cannot effectively cooperate with each other, since they may be operating on entirely different hierarchies. In some cases (memory and block I/O bandwidth control, for example), better cooperation is needed to effectively control resource use. And, in the end, there has been little real-world use of this feature. So the plan has long been to get rid of the multiple-hierarchy feature, though it has always been known that this change would take a long time to effect fully.
Work on the unified control group hierarchy has been underway for some time, with much of the preparatory work being merged into the 3.14 and 3.15 kernels. In 3.16, this feature will be available, but only to users who ask for it explicitly. To use the unified hierarchy, the new control group virtual filesystem should be mounted with a command like:
mount -t cgroup -o __DEVEL__sane_behavior cgroup <mount-point>
Obviously, the __DEVEL__sane_behavior option is not intended to be a permanent fixture. It may still be some time, though, before the unified hierarchy becomes available as a default feature.
It is worth noting that the older, multiple-hierarchy mode continues to work even if the unified hierarchy mode is used; it will be kept around for as long as it seems to be needed. The unified hierarchy can be instantiated alongside older hierarchies, but controllers cannot be shared between the unified hierarchy and any others. The care that has been taken in this area should allow users to experiment with the unified mode while avoiding changes that would break existing systems.
In current kernels, controllers are attached to control groups by specifying options to the mount command that creates the hierarchy. In the unified hierarchy world, instead, all controllers are attached to the root of the hierarchy. (Strictly speaking that's not quite true; controllers attached to old-style hierarchies will not be available in the unified hierarchy, but that's a detail that can be ignored for now.) Controllers can be enabled for specific subtrees of the hierarchy, subject to a small set of rules. For the purposes of illustrating these rules, imagine a control group hierarchy in which groups A and B live directly under the root control group, while C and D are children of B.
Each control group in the hierarchy has (in its associated control directory) a file called cgroup.controllers that lists the controllers that can be enabled for children of that group. Another file, cgroup.subtree_control, lists the controllers that are actually enabled; writing to that file can turn controllers on or off. It is worth repeating that these files manage the controllers attached to the children of the group; in the unified hierarchy, a control group is thought of as delegating its resources to subgroups for management. There are some interesting implications resulting from this design.
One of those is that a control group must apply a controller to all of its children or none. If the memory controller is enabled in B's cgroup.subtree_control file, it will apply to both C and D; there is no way (from B's point of view) to apply the controller to only one of those subgroups. Further, a controller can only be enabled in a specific control group if it is enabled in that group's parent; a controller cannot be enabled in group C unless it is already enabled in group B. That suggests that all controllers that are actually meant to be used must be enabled in the root control group, at which point they will apply to the entire hierarchy. It is, however, possible to disable a controller at a lower level. So, if the CPU controller is enabled in the root, it can be disabled in group A, exempting all of A's descendant groups from CPU control.
Another new rule is that the cgroup.subtree_control file can only be used to change the set of active controllers if the associated group contains no processes. So, for example, if group B has controllers enabled in its cgroup.subtree_control file, it cannot contain any processes; those processes must all be placed into group C or D. This rule prevents situations where processes in the parent control group are competing with those in the child groups — situations that current controllers handle inconsistently and, often, badly. The one exception to the "no processes" rule is the root control group.
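To make these rules concrete, a minimal sketch in C appears below: a controller is enabled top-down through the cgroup.subtree_control files, and the process ends up in a leaf group. It assumes a unified hierarchy mounted at /sys/fs/cgroup as shown earlier, uses the memory controller as a stand-in for whichever controllers are actually available in 3.16, and keeps error handling minimal.

    /*
     * Sketch: enable a controller from the root downward, create a
     * leaf group, and move this process into it.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *s)
    {
        int ret = 0;
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
            perror(path);
            return -1;
        }
        if (write(fd, s, strlen(s)) < 0) {
            perror(path);
            ret = -1;
        }
        close(fd);
        return ret;
    }

    int main(void)
    {
        /* The root must delegate a controller before B can. */
        write_str("/sys/fs/cgroup/cgroup.subtree_control", "+memory");

        mkdir("/sys/fs/cgroup/B", 0755);
        write_str("/sys/fs/cgroup/B/cgroup.subtree_control", "+memory");

        /* B now delegates to its children, so processes must live
           in a leaf group such as C. */
        mkdir("/sys/fs/cgroup/B/C", 0755);

        char pid[16];
        snprintf(pid, sizeof(pid), "%d", getpid());
        return write_str("/sys/fs/cgroup/B/C/cgroup.procs", pid) ? 1 : 0;
    }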
One other control file found in the unified hierarchy is called cgroup.populated; reading it will return a nonzero value if there are any processes in the group (or its descendants). By using poll() on this file, a process can be notified if a control group becomes completely empty; the process would presumably respond by cleaning up and removing the group. Current kernels, instead, create a helper process to provide the notification; this technique has been frowned on for years.
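A process wanting that notification might do something like the following sketch. Note that treating the wakeup as an exceptional (POLLPRI) condition, as with other kernel virtual files, is an assumption here, as is the group path.

    /*
     * Sketch: wait until group C becomes empty, then remove it.
     */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8];
        ssize_t n;
        int fd = open("/sys/fs/cgroup/B/C/cgroup.populated", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (;;) {
            /* A fresh read from offset 0 after every wakeup. */
            lseek(fd, 0, SEEK_SET);
            n = read(fd, buf, sizeof(buf) - 1);
            if (n <= 0)
                break;
            buf[n] = '\0';
            if (atoi(buf) == 0) {
                rmdir("/sys/fs/cgroup/B/C");    /* group is now empty */
                break;
            }
            struct pollfd pfd = { .fd = fd, .events = POLLPRI };
            if (poll(&pfd, 1, -1) < 0)
                break;
        }
        return 0;
    }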
The unified hierarchy will allow a privileged process to delegate access to control group functionality by changing the owner of the associated control files. But this delegation only works to an extent: an unprivileged process with access to the control files can create child control groups and move processes between groups, but it cannot change any controller settings. This policy is there partly to keep unprivileged processes from disrupting the system, but the intent is also to restrict access to the more advanced control knobs. These knobs are currently deemed to expose too much information about the kernel's internals, so there is a desire to avoid having applications depend on them.
All of this work has been extensively discussed for years, with most of the major users of control groups having had their say. So it should be suitable for most of the known uses today, but that is no substitute for actually seeing things work. The 3.16 kernel will provide an opportunity for interested users to try out the new mode and find out which problems remain; actual migration by users to the new scheme cannot be expected to happen for a few more development cycles at the earliest, though. But, at some point, the control group rework will cease being something that's mostly talked about and become just another big job that eventually got done.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous