Kernel development
Brief items
Kernel release status
The current development kernel remains 3.2-rc2; no 3.2 prepatches have been released in the last week.

Stable updates: the 3.0.10 and 3.1.2 stable kernel updates were released on November 21. Both include another set of important fixes, though these updates are smaller than many.
The 2.6.32.49, 3.0.11, and 3.1.3 stable updates are in the review process as of this writing; they can be expected on or after November 25.
Quotes of the week
Note my wording: 99% excludes kernel hackers. Not sure if we should say something about that explicitly or not.
Kernel development news
Drivers as documentation
As a community, we are highly concerned with the quality of our code. Kernel code is reviewed for functionality, long-term maintainability, documentation, and more. Driver code is not always reviewed to the same degree, but it can be just as important - if our drivers do not work, our kernel does not work. There is an aspect to the long-term maintainability of drivers that could use more attention: the degree to which a driver documents how its hardware works.

One might argue that the job of documenting the hardware falls on whoever writes the associated datasheet. There is some truth to that claim, but, in many cases, only the original author of the driver has access to that datasheet. Those who come after can try to extract documentation from the vendor or to search for clandestine copies hosted on the net. But often the only option is to figure out the hardware from the one source of information that is actually available: the existing driver. If the driver source does not help that new developer, one can argue that the original author has fallen down on the job.
So, if a driver contains code like:
    writel(0xf4ee0815, devp->regs[42]);
it is missing something important. In the absence of the datasheet, there is no way for any other developer to have any clue of what that operation is actually doing.
The problem is worse than that, though; datasheets often omit useful information, obscure the truth, and lie through their teeth. The hardest part of getting a driver to work is often the process of figuring out what the hardware's features and special needs really are. It often seems, for example, that the datasheet is written before the process of designing the hardware begins. As time passes, the understanding of the problem grows and deadlines loom; hardware engineers then start to jettison features that cannot be made to work in time or that, in their sole and not-subject-to-appeal opinion, can be painlessly fixed in software. Updating the datasheet to match the actual hardware never happens.
Thoughtful driver developers will, on discovery of the imaginary nature of a specific hardware feature, add a comment to the driver; that way, no future maintainer has to figure out (the hard way, involving keyboard imprints on the forehead) why the driver does not use a specific, helpful-looking hardware capability.
Then there is the matter of "reserved" bits. There has not yet been a datasheet written that did not contain entries like:

    Weird tangential functions register (offset 0xc8)

	Bits	Function
	17	Reserved: do not touch this bit or the terrorists will win
Somewhere, deep within the company, there will be at most two engineers who know that the document is incomplete but that nobody has ever gotten around to updating it. If you can corner one of those people, you can usually get them to admit that this bit should be documented as:
    Weird tangential functions register (offset 0xc8)

	Bits	Function
	17	0 = DMA engine randomly locks up
		1 = DMA engine functions as expected
		Default value = 0
A developer who cannot get within range of the neck of at least one of those hardware engineers will likely spend a lot of time figuring out the need to set the "make it work" bit. This effort can involve reverse-engineering proprietary drivers or, in cases of pure desperation, playing with random bits to see what changes. Once that bit has been located, it is natural for the tired and frustrated developer to quietly set the bit before heading off in a determined effort to eliminate the memory of the entire process through the application of large amounts of beer. A particularly forward-thinking developer might make a note on a printed version of the datasheet for future reference.
But handwritten notes are not usually helpful to the next developer who has to work on that driver. A moment spent documenting that bit:
#define WTF_PRETTY_PLEASE 0x00020000 /* Always set this or it locks up */
may save somebody else hours of unnecessary pain.
It is tempting to think of a completed driver as being done. But driver code, like other kernel code, is subject to ongoing change. Kernel API changes must be dealt with, problems need to be fixed, and newer versions of the hardware must be supported. Depending on how much beer was involved, the original author may remember that device's peculiarities, but those who follow will not. Everybody would be better served if the driver not only made the hardware work, but also helped the reader understand how the hardware works.
Doing so is not usually hard. Define descriptive names for registers, bits, and fields rather than putting in hard-coded constants. Note features that are incompletely described, incorrectly described, or entirely science-fictional. Comment operations that have non-obvious ordering requirements or that do not play well together. And, in general, code with a great deal of sympathy for the people who will have to make changes to your work in the future. Some hardware can never be properly documented because the relevant information is simply not available; see this 2006 article for an example. But what information is available should be made available to others.
Core kernel hackers are occasionally heard to make dismissive remarks about driver developers and the work they do. But driver writers are often given a difficult task involving a fair amount of detective work; they get this task done and make our hardware work for us. Writing drivers that adequately document the hardware is not an unreasonable thing to ask of these developers; they have the hardware knowledge and the skills to do it. The harder problem may be asking driver reviewers to insist that this extra effort be made. Without pressure from reviewers, many drivers will never enable readers to really understand what is going on.
The pin control subsystem
Classic x86-style processors are designed to fit into a mostly standardized system architecture, so they all tend, in a general sense, to look alike. One of the reasons why it is hard to make a general-purpose kernel for embedded processors is the absence of this standardized architecture. Embedded processors must be extensively configured, at boot time, to be able to run the system they are connected to at all. The 3.1 kernel saw the addition of the "pin controller" subsystem which is intended to help with that task; enhancements are on the way for (presumably) 3.2 as well. This article will provide a superficial overview of how the pin controller works.

A typical system-on-chip (SOC) will have hundreds of pins (electrical connectors) on it. Many of those pins have a well-defined purpose: supplying power or clocks to the processor, video output, memory control, and so on. But many of these pins - again, possibly hundreds of them - will have no single defined purpose. Most of them can be used as general-purpose I/O (GPIO) pins that can drive an LED, read the state of a pushbutton, perform serial input or output, or activate an integrated pepper spray dispenser. Some subsets of those pins can be organized into groups to serve as an I2C port, an I2S port, or to perform any of a number of other types of multi-signal communications. Many of the pins can be configured with a number of different electrical characteristics.
Without a proper configuration of its pins, an SOC will not function properly - if at all. But the right pin configuration is entirely dependent on the board the SOC is a part of; a processor running in one vendor's handset will be wired quite differently than the same processor in another vendor's cow-milking machine. Pin configuration is typically done as part of the board-specific startup code; the system-specific nature of that code prevents a kernel built for one device from running on another even if the same processor is in use. Pin configuration also tends to involve a lot of cut-and-pasted, duplicated code; that, of course, is the type of code that the embedded developers (and the ARM developers in particular) are trying to get rid of.
The idea behind the pin control subsystem is to create a centralized mechanism for the management and configuration of multi-function pins, replacing a lot of board-specific code. This subsystem is quite thoroughly documented in Documentation/pinctrl.txt. A core developer would use the pin control code to describe a processor's multi-function pins and the uses to which each can be put. Developers enabling a specific board can then use that configuration to set up the pins as needed for their deployment.
The first step is to tell the subsystem which pins the processor provides; that is a simple matter of enumerating their names and associating each with an integer pin number. A call to pinctrl_register() will make those pins known to the system as a whole. The mapping of numbers to pins is up to the developer, but it makes sense to, for example, keep a bank of GPIO pins together to simplify coding later on.
One of the interesting things about multi-function pins is that many of them can be assigned as a group to an internal functional unit. As a simple example, one could imagine that pins 122 and 123 can be routed to an internal I2C controller. Other types of ports may take more pins; an I2S port to talk to a codec needs at least three, while SPI ports need four. It is not generally possible to connect an arbitrary set of pins to any controller; usually an internal controller has a very small number of possible routings. These routings can also conflict with each other; pin 77, say, could be either an I2C SCL line or an SPI SCLK line, but it cannot serve both purposes at the same time.
The pin controller allows the developer to define "pin groups," essentially named arrays of pins that can be assigned as a group to a controller. Groups can (and often will) overlap each other; the pin controller will ensure that overlapping groups cannot be selected at the same time. Groups can be associated with "functions" describing the controllers to which they can be attached. Some functions may have a single pin group that can be used; others will have multiple groups.
There are some other bits and pieces (some glue to make the pin controller work easily with the GPIO subsystem, for example), but the above describes most of the functionality found in the 3.1 version of the pin controller. Using this structure, board developers can register one or more pinmux_map structures describing how the pins are actually wired on the target system. That work can be done in a board file, or, presumably, be generated from a device tree file. The pin controller will use the mapping to ensure that no pins have been assigned to more than one function; it will then instruct the low-level pinmux driver to configure the pins as described. All of that work is now done in common code.
The pin multiplexer on a typical SOC can do a lot more than just assign a pin to a specific function, though. There is typically a wealth of options for each pin. Different pins can be driven to different voltages, for example; they can also be connected to pull-up or pull-down resistors to bias a line to a specific value. Some pins can be configured to detect input signal changes and generate an interrupt or a wakeup event. Others may be able to perform debouncing. It adds up to a fair amount of complexity which is often reflected in the board-specific setup code.
The generic pin configuration interface, currently in its third revision, attempts to bring the details of pin configuration into the pin controller core. To that end, it defines 17 (at last count) parameters that might be settable on a given pin; they vary from the value of the pullup resistor to be used through slew rates for rising or falling signals and whether the pin can be a source of wakeup events. With this code in place, it should become possible to describe the complete configuration of complex pin multiplexers entirely within the pin controller.
The number of pin controller users in the 3.1 kernel is relatively small, but there are a number of patches circulating to expand its usage. With the addition of the configuration interface (in the 3.2 kernel, probably), there will be even more reason to make use of it. One of the more complicated bits of board-level configuration will be supported almost entirely in common code, with all of the usual code quality and maintainability benefits. It is hard to stick a pin into an improvement like that.
POSIX_FADV_VOLATILE
Caching plays an important role at almost all levels of a contemporary operating system. Without the ability to cache frequently-used objects in faster memory, performance suffers; the same idea holds whether one is talking about individual cache lines in the processor's memory cache or image data cached by a web browser. But caching requires resources; those needs must be balanced with other demands on the same resources. In other words, sometimes cached data must be dropped; often, overall performance can be improved if the program doing the caching has a say in what gets removed from the cache. A recent patch from John Stultz attempts to make it easier for applications to offer up caches for reclamation when memory gets tight.

John's patch takes a lot of inspiration from the ashmem device implemented for Android by Robert Love. But ashmem functions like a device and performs its own memory management, which makes it hard to merge upstream. John's patch, instead, tries to integrate things more deeply into the kernel's own memory management subsystem. So it takes the form of a new set of options to the posix_fadvise() system call. In particular, an application can mark a range of pages in an open file as "volatile" with the POSIX_FADV_VOLATILE operation. Pages that are so marked can be discarded by the kernel if memory gets tight. Crucially, even dirty pages can be discarded - without writeback - if they have been marked volatile. This operation differs from POSIX_FADV_DONTNEED in that the given pages will not (normally) be discarded right away - the application might want the contents of volatile pages in the future, but it will be able to recover if they disappear.
If a particular range of pages becomes useful later on, the application should use the POSIX_FADV_NONVOLATILE operation to remove the "volatile" marking. The return value from this operation is important: a non-zero return from posix_fadvise() indicates that the kernel has removed one or more pages from the indicated range while it was marked volatile. That is the only indication the application will get that the kernel has accepted its offer and cleaned out some volatile pages. If those pages have not been removed, posix_fadvise() will return zero and the cached data will be available to the application.
There is also a POSIX_FADV_ISVOLATILE operation to query whether a given range has been marked volatile or not.
Rik van Riel raised a couple of questions about this functionality. He expressed concern that the kernel might remove a single page of a multi-page cached object, thus wrecking the caching while failing to reclaim all of the memory used to cache that object. Ashmem apparently does its own memory management partially to avoid this very situation; when an object's memory is reclaimed, all of it will be taken back. John would apparently rather avoid adding another least-recently-used list to the kernel, but he did respond that it might be possible to add logic to reclaim an entire volatile range once a single page is taken from that range.
Rik also worried about the overhead of this mechanism and proposed an alternative that he has apparently been thinking about for a while. In this scheme, applications would be able to open (and pass to poll()) a special file descriptor that would receive a message whenever the kernel finds itself short of memory. Applications would be expected to respond by freeing whatever memory they can do without. The mechanism has a certain kind of simplicity, but could also prove difficult in real-world use. When an application gets a "free up some memory" message, the first thing it will probably need to do is to fault in its code for handling that message - an action which will require the allocation of more memory. Marking the memory ahead of time and freeing it directly from the kernel may turn out to be a more reliable approach.
After the recent frontswap discussions, it is perhaps unsurprising that nobody has dared to observe that volatile memory ranges bear a more than passing resemblance to transcendent memory. In particular, it looks a lot like "cleancache," which was merged in the 3.0 development cycle. There are differences: putting a page into cleancache removes it from normal memory while volatile memory can remain in place, and cleancache lacks a user-space interface. But the core idea is the same: asking the system to hold some memory, but allowing that memory to be dropped if the need arises. It could be that the two mechanisms could be made to work together.
But, as noted above, nobody has mentioned this idea, and your editor would certainly not be so daring.
One other question that has not been discussed is whether this code could eventually replace ashmem, reducing the differences between the mainline and the Android kernel. Any such replacement would not happen anytime soon; ashmem has its own ABI that will need to be supported by Android kernels for a long time. Over years, a transition to posix_fadvise() could possibly be made if the Android developers were willing to do so. But first the posix_fadvise() patch will need to get into the mainline. It is a very new patch, so it is hard to say if or when that might happen. Its relatively non-intrusive nature and the clear need for this capability would tend to argue in its favor, though.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Page editor: Jonathan Corbet