Kernel development
Brief items
Kernel release status
The current development kernel is 3.11-rc3, released on July 28. "Anyway, remember how I asked people to test the backlight changes in rc2 because things like that have really bad track records? Yup. That all got reverted. It fixed things for some people, but regressed for others, and we don't do that 'one step forward, two steps back' thing. But never fear, we have top people looking at it."
Stable updates: 3.10.3 was released on July 25; it was followed by 3.10.4, 3.4.55, and 3.0.88 on July 28. The 3.2.49 release came out on July 27.
3.2.50 is in the review process as of this writing; it can be expected sometime after August 2.
Multipath TCP 0.87
The 0.87 release of the multipath TCP patch set is available. Improvements include better hardware offload support, zero-copy sendfile/splice support, working NFS support, better middlebox handling, and more. See this article for an overview of multipath TCP.
Kernel development news
I/O Hook
Writing device drivers can be a painful process; hardware has a tendency to behave in ways other than those described in the documentation. The job can be even harder, though, in the absence of the hardware itself. Developing a complete driver without the hardware can require a simulator built into a tool like QEMU — a development project in its own right. For simpler situations, though, it may be enough to fool the driver about the contents of a few device registers. Rui Wang's recently posted I/O hook patch set aims to make that functionality available.

The I/O hook module works by overriding the normal functions used to access I/O memory, I/O ports, and the PCI configuration space. When kernel code calls one of those functions, the new version will check to see whether an override has been configured for the address/port of interest; if so, the operation will be redirected. In the absence of an explicit override, the I/O operation will proceed as usual. Needless to say, adding this kind of overhead to every I/O operation executed by the kernel could slow things down significantly. In an attempt to minimize the impact, the static key mechanism is used to patch the kernel at run time; the I/O hooks will not run at all unless they are in active use.
There is an in-kernel interface that can be used to set up register overrides; it is a simple matter of calling:
    void hook_add_ovrd(int spaceid, u64 address, u64 value, u64 mask,
                       u32 length, u8 attrib);
Here, spaceid is one of OVRD_SPACE_MEM for regular I/O memory, OVRD_SPACE_IO for an I/O port, or OVRD_SPACE_PCICONF for the PCI configuration space. The combination of address, mask, and length describes the range of addresses to be overridden, while value is the initial value to be set in the overridden space. By using the mask value it is possible to override a space as narrow as a single bit. The attrib parameter describes how the space is to behave: OVRD_RW for a normal read/write register, OVRD_RO for read-only, OVRD_RC for a register whose bits are cleared on being read, or OVRD_WC to clear bits on a write.
There are two functions, hook_start_ovrd() and hook_stop_ovrd(), that are used to turn the mechanism on and off. Any number of overrides can be set up prior to turning the whole thing on, so a complicated set of virtual registers can be configured. It's worth noting, though, that the overrides are stored internally in a simple linked list, suggesting that the number of overrides is expected to be relatively small.
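As an illustration, here is a minimal sketch, not taken from the patch posting, of how a test might use this interface to fake a single read-only status register; the header name, the register address, and the values are assumptions made for the example:

    #include <linux/io_hook.h>  /* assumed header name for the I/O hook module */

    static void fake_status_register(void)
    {
        /*
         * Override one 32-bit register at an invented physical address:
         * reads return 0x1 ("device present") and, since the register
         * is marked read-only, writes are ignored. The all-ones mask
         * covers the full width of the register.
         */
        hook_add_ovrd(OVRD_SPACE_MEM, 0xfed00040, 0x1, 0xffffffff,
                      4, OVRD_RO);

        /* Assuming the start/stop functions take no arguments. */
        hook_start_ovrd();

        /* ... exercise the driver's probe or hotplug path here ... */

        hook_stop_ovrd();
    }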
While the in-kernel interface may be useful, it will probably be more common to control this facility through the debugfs interface. The module provides a set of files through which overrides can be set up; see the documentation file for details on the syntax. The debugfs interface also provides a mechanism by which a simulated interrupt can be delivered to the driver; if an interrupt number is given to the system (by writing it to the appropriate debugfs file), that interrupt will be triggered once the overrides are enabled.
A system like this clearly cannot be used to emulate anything other than the simplest of devices. A real device has a long list of registers and, importantly, the contents of those registers will change as the device performs the operations requested of it. One could imagine enhancing this module with an interface by which a user-space process could supply register values on demand, but there comes a point where it is probably better just to add a virtual device to an emulator like QEMU.
So where, then, does a tool like this fit in? The use cases provided with the patch posting mostly have to do with the testing of hotplug operations on hardware without hotplug support. A hotplug event typically involves an interrupt and a relatively small number of registers; by overriding just those registers, the I/O hook mechanism can convince a driver that its hardware just went away (or came back). That allows testing the hotplug paths without needing to have suitably capable hardware.
Similarly, overrides can be used to test error paths by injecting various types of errors into the system. Error paths are often difficult to exercise; there are almost certainly large numbers of error paths in the kernel that have never been executed. Code that has never run has a higher-than-average chance of containing bugs. The fault injection framework can be used to test a wide range of error paths, but it is not comprehensive; the I/O hook module could be useful to fill in the gaps.
Anecdotal evidence suggests, though, that relatively few developers even use the existing fault injection facilities, so uptake of a more complex mechanism may be limited. For those who do use it, the I/O hook subsystem might well prove to be a useful addition to the debugging toolbox.
Transparent decompression for ext4
Transparent compression is often found on the desired feature list for new filesystems; compressing data on the fly allows the system to make better use of both storage space and I/O bandwidth, at the cost of some extra CPU time. The "transparent" in the name indicates that user space need not be aware that the data is compressed, making the feature easy to use. Thus, filesystems like btrfs support transparent compression, while Tux3 has a draft design toward that end. A recent proposal to add compression support to ext4, however, takes a bit of a different approach. The idea may run into trouble on its way into a mainline kernel, but it is indicative of how some developers are trying to get better performance out of the system.

Dhaval Giani's patch does not implement transparent compression; instead, the feature is transparent decompression. With this feature, the kernel will allow an application to read a file that has been compressed without needing to know about that compression; the kernel will handle the process of decompressing the data in the background. The creation of the compressed file is not transparent, though; that must be done in user space. Once the file has been created and marked as compressed (using chattr), it cannot be changed, only deleted and replaced. So this feature enables the transparent use of read-only compressed files, but only after somebody has taken the time to set those files up specially.
This feature is aimed at a rather narrow use case: enabling Firefox to launch more quickly. Desktop users will (as Taras Glek notes) benefit from this feature, but the target users are on Android. Such systems tend to have relatively slow storage devices — slow enough that compressing the various shared objects that make up the Firefox executable and spending the CPU time to decompress them is a net win. Decompression at startup time slows things down, but it is still faster than reading the uncompressed data from a slow drive. Firefox currently uses its own custom dynamic linker to load compressed libraries (such as libxul.so) during startup. Moving the decompression code into the filesystem would allow the Firefox developers to dispense with their custom linker.
Dhaval's implementation has a few little problems that could get in the way of merging. Decompression must happen in a single step into a single buffer, so the application must read the entire file in a single read() call; that makes the feature a bit less than fully transparent. Mapping compressed files into memory with mmap() is not supported. The "szip" compression format is hardwired into the implementation. A new member is added to the file_operations structure to read compressed files. And so on. These shortcomings are understood and acknowledged from the outset; Dhaval's main purpose in posting the code at this time was to get feedback on the general design. He plans to fix these issues in subsequent versions of the patch.
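To make the single-read() restriction concrete, here is a minimal user-space sketch of what consuming such a file looks like; it assumes (the patch posting does not say) that stat() reports the uncompressed size:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        char *buf;
        int fd;

        if (argc != 2)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;

        /* Assumption: st_size is the uncompressed size, so one buffer
           can hold the whole decompressed file. */
        buf = malloc(st.st_size);
        if (!buf)
            return 1;

        /* The entire file must be consumed in a single read() call;
           partial reads are not supported by the current patch. */
        if (read(fd, buf, st.st_size) != st.st_size)
            fprintf(stderr, "whole-file read failed\n");

        free(buf);
        close(fd);
        return 0;
    }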
But fixing all of those problems will not help if the core filesystem maintainers (who have, thus far, remained silent) object to the intent of the patch. A normal expectation when dealing with filesystems is that data written with write() will look the same when retrieved by a subsequent read() call. The transparent decompression patch violates that assumption by having the kernel interpret and modify the data written to disk — something the kernel normally tries hard not to do.
Having the kernel interpret the data stream could perhaps be countenanced if there were a compelling reason to add this functionality to the kernel. But, if such a reason exists, it was not presented with the patch set. Firefox has already solved this problem with its own dynamic linker; that solution lives entirely in user space. A fundamental rule of kernel design is that work should not be done in the kernel if it can be done equally well in user space; that suggests that an in-kernel implementation of file decompression would have to be somehow better than what Firefox is using now. Perhaps an in-kernel implementation is better, but that case has not yet been made.
The end result is that Dhaval's patch is unlikely to receive serious consideration at this point. Before kernel developers look at the details of a patch, they usually want to know why the patch exists in the first place — how does that patch make the system better than before? That "why" is not yet clear, so the contents of the patch itself are not entirely relevant. That may be part of why this particular patch set has not received much in the way of feedback in the first week after it was posted. Transparent decompression is an interesting idea for speeding application startup with a relatively easy kernel hack; hopefully the next iteration will contain a stronger justification for why it has to be a kernel hack in the first place.
Device trees as ABI
Last week's device tree article introduced the ongoing discussion on the status of device tree maintainership in the kernel and how things needed to change. Since then, the discussion has only intensified as more developers consider the issues, especially with regard to the stability of the device tree interface. While it seems clear that most (but not all) participants believe that device tree bindings should be treated like any other user-space ABI exported by the kernel, it is also clear that they are not treated that way currently. Those seeking to change this situation will have a number of obstacles to overcome.

Device tree bindings are a specification of how the hardware is described to the kernel in the device tree data structure. If they change in incompatible ways, users may find that newer kernels fail to boot on older systems (or vice versa). The device tree itself may be buried deeply within a system's firmware, making it hard to update, so incompatible binding changes may be more than slightly inconvenient for users. The normal kernel rule is that systems that work with a given kernel should work with all releases thereafter; no explicit exception exists for device tree bindings. So, many feel, bindings should be treated like a stable kernel ABI.
Perhaps the strongest advocate of the position that device tree bindings should be treated as any other ABI right now (rather than sometime in the future) is ARM maintainer Russell King:
If that is followed, then there is absolutely no reason why a "Stable DT" is not possible - one which it's possible to write a DT file today, and it should still work in 20 years time with updated kernels. That's what a stable interface _should_ allow, and this is what DT _should_ be.
As is often the case, though, there is a disconnect between what should be and what really is. The current state of device tree stability was perhaps best summarized by Olof Johansson:
Other developers agreed with this view of the situation: for the first few years of the ARM migration from board files to device trees, few developers (if any) had a firm grasp of the applicable best practices. It was a learning experience for everybody involved, with the inevitable result that a lot of mistakes were made. Being able to correct those mistakes in subsequent kernel releases has allowed the quick application of lessons learned and the creation of better bindings in current kernels. But Olof went on to say that the learning period is coming to a close: "That obviously has to change, but doing so needs to be done carefully." This transition will need to be done carefully indeed, as can be seen from the issues raised in the discussion.
Toward stable bindings
For example: what should be done about "broken" bindings that exist in the kernel currently? Would they immediately come under a guarantee of stability, or can they be fixed one last time? There is a fair amount of pressure to stop making incompatible changes to bindings immediately, but to do so would leave kernel developers supporting bindings that do not adequately describe the hardware, are not extensible to newer versions of the hardware, and are inconsistent with other bindings. Thus, Tomasz Figa argued, current device tree bindings should be viewed as a replacement for board files, which were very much tied to a specific kernel version:
Others contend that, by releasing those bindings in a stable kernel, the community already committed itself to supporting them. Jon Smirl has advocated for a solution that might satisfy both groups: add a low-level "quirks" layer that would reformat old device trees to contemporary standards before passing them to the kernel. That would allow the definitive bindings to change while avoiding breaking older device trees.
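What such a quirk might look like in practice is sketched below, operating on the flattened tree with libfdt before the kernel parses it; the compatible strings, and the idea that a simple rename suffices, are invented for illustration:

    #include <libfdt.h>

    /*
     * Hypothetical quirk: the (invented) "vendor,old-uart" binding was
     * later redone as "vendor,new-uart"; rewrite old trees on the fly.
     * A same-length replacement string keeps the tree layout stable.
     */
    static void apply_dt_quirks(void *fdt)
    {
        int node = fdt_node_offset_by_compatible(fdt, -1, "vendor,old-uart");

        while (node >= 0) {
            fdt_setprop_string(fdt, node, "compatible", "vendor,new-uart");
            node = fdt_node_offset_by_compatible(fdt, node, "vendor,old-uart");
        }
    }

A real quirks layer would, of course, have to handle more than compatible-string renames; restructured nodes and changed property formats would need deeper surgery.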
Another open question is: what is the process by which a particular set of bindings achieves stable status, and when does that happen? Going back to Olof's original message:
Following this kind of policy almost certainly implies releasing drivers in stable kernels with unstable device tree bindings. That runs afoul of the "once it's shipped, it's an ABI" point of view, so it will not be popular with all developers. Even so, a number of developers seem to think that, given the current state of the art, it is not yet possible to create bindings that are supportable over the long term from the beginning. Whether bindings truly differ from system calls and other kernel ABIs in this manner is a topic of ongoing debate.
Regardless of when a binding is recognized as stable, there is also the question of who does this recognition. Currently, bindings are added to the kernel by driver developers and subsystem maintainers; thus, in some eyes, we have a situation where the community is being committed to support an ABI by people who do not fully understand what they are doing. For this reason, Russell argued that no device tree binding should be merged until it has had an in-depth review by somebody who not only understands device tree bindings, but who also understands the hardware in question. That bar is high enough to make the merging of new bindings difficult indeed.
Olof's message, instead, proposed the creation of a "standards committee" that would review bindings for stable status. These bindings might already be in the kernel but not yet blessed as "locked" bindings. As Mark Rutland (one of the new bindings maintainers) pointed out, this committee would need members from beyond the Linux community; device tree bindings are supposed to be independent of any specific operating system, and users may well want to install a different system without having to replace the device tree. Stephen Warren (another new bindings maintainer) added that bootloaders, too, make use of device trees, both to understand the hardware and to tweak the tree before passing it to the kernel. So there are a lot of constituents who would have to be satisfied by a given set of bindings.
Tied to this whole discussion is the idea of moving device tree bindings out of the kernel entirely and into a repository of their own. Such a move would have the effect of decoupling bindings from specific kernel releases; it would also provide a natural checkpoint where bindings could be carefully reviewed prior to merging. Such a move does not appear to be planned for the immediate future, but it seems likely to happen eventually.
There are also some participants who questioned the value of stable bindings in the first place. In particular, Jason Gunthorpe described the challenges faced by companies shipping embedded hardware with Linux:
So embedded people are going to ship with unfinished DT and upgrade later. They have to. There is no choice. Stable DT doesn't change anything unless you can create perfect stable bindings for a new SOC instantaneously.
In Jason's world, there is no alternative to being able to deal with device trees and kernels that are strongly tied together and, as he sees it, no effort to stabilize device tree bindings is going to help. That led him to ask: "So who is getting the benefit of this work, and is it worth the cost?" That particular question went unanswered in the discussion.
Finally, in a world where device tree bindings have been stabilized, there is still the question of how to ensure that drivers adhere to those bindings and add no novelties of their own. The plan here appears to be the creation of a schema to provide a formal description for bindings, then to augment the dtc device tree compiler to verify device trees against the schema. Any strange driver-specific bindings would fail to compile, drawing attention to the problem.
The conversation quickly acquired a number of interesting side discussions on how the schema itself should be designed. A suggestion that XML could be used evoked far less violence than one might imagine; kernel developers are still trying hard to be nice, it seems. But David Gibson's suggestion that a more C-like language be used seems more likely to prevail. The process of coming up with a comprehensive schema definition and checking that it works with all device tree bindings is likely to take a while.
Reaching a consensus on when device tree bindings should be stabilized, what to do about substandard existing bindings, and how to manage the whole process will also probably take a while. The topic has already been penciled in for an entire afternoon during the ARM Kernel Summit, to be held in Edinburgh this October. In the meantime, expect a lot of discussion without necessarily binding the community to more stable device trees.