
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel is 2.6.16.13, released on May 2. This release contains a single patch for a denial of service problem in the SCTP code. 2.6.16.12 had been released the day before with a couple dozen important fixes.

The current 2.6 prepatch is 2.6.17-rc3, released by Linus on April 26, several milliseconds after the LWN Weekly Edition was published. As expected, the changes were mostly fixes, but this prepatch also adds support for version 1.2 trusted platform modules, multiple page size support for the PA-RISC architecture, and the new vmsplice() system call (see below). See the long-format changelog for the details.

The current -mm tree is 2.6.17-rc3-mm1. Recent changes to -mm include some red-black tree optimizations, a new set of page migration patches, some RAID (MD) improvements, the likely() macro profiler (see below), the long-delayed removal of devfs, and some memory hotplug work.

For 2.4 users, 2.4.33-pre3 is out; it was announced by Marcelo on May 1. It contains a small number of fixes, a number of which are security-related.


Kernel development news

Implementing network channels

Last January, Van Jacobson presented his network channel concept at the 2006 linux.conf.au gathering. Channels, by concentrating network processing in ways which are most friendly to SMP systems, look like a promising way to improve high-speed networking performance. There was a fair amount of excitement about the idea. Unfortunately, Mr. Jacobson appears to have since become busy with other projects, so no contributions of actual code have resulted from his work. So not much has happened on this front in the last few months - or so it seemed.

David Miller recently let slip that he was working on his own channel implementation. It was not something he expected to see functioning anytime soon, however:

[D]on't expect major progress and don't expect anything beyond a simple channel to softint packet processing on receive any time soon.

Going all the way to the socket is a large endeavor and will require a lot of restructuring to do it right, so expect this to take on the order of months.

It turns out, however, that David was not the only person working on this idea; Kelly Daly and Rusty Russell have also put together a rudimentary channel implementation; in response to David's note, they posted their code for review. Since this version is more advanced, it has been the center of most of the discussion.

The Daly/Russell patch creates a data structure called struct channel_ring. It consists of 256 pages of memory, mapped contiguously into the receiving process's address space - though the pages will not be contiguous in kernel space. As Van Jacobson described, the variables used by the producer side are located at the beginning of the ring, while variables used by the consumer are at the end; this separation helps to ensure that the cache lines representing those variables do not bounce between processors. These variables include the circular buffer indexes indicating which buffer each side will use next. There are also flags allowing the consumer to request a wakeup when buffers are added to the ring.
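
A rough sketch of that layout follows; the structure and field names are illustrative only, not the actual definitions from the Daly/Russell patch. The point is the placement: producer-side state at the start of the 256-page mapping and consumer-side state at the far end, so each side's cache lines stay local to the CPU using them.

    /*
     * Hedged sketch only: names and padding are illustrative, not the
     * real struct channel_ring from the patch.
     */
    #define VJ_RING_PAGES   256

    struct channel_ring_sketch {
        /* Producer (kernel) side: written by the driver. */
        unsigned int producer_idx;   /* next buffer the kernel will fill */

        /* (In the real ring, many pages of buffer space separate the
         * producer and consumer variables.) */

        /* Consumer (user space) side: written by the application. */
        unsigned int consumer_idx;   /* next buffer the app will consume */
        unsigned int wake_flags;     /* request a wakeup when data arrives */
    };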

User-space starts by creating a socket with the new PF_VJCHAN protocol type, then using mmap() to map the ring buffer. Thereafter, it can use buffers as they become available (using poll() or select(), if need be, to wait for more data). When a buffer is no longer needed, incrementing the appropriate index will free it up for new data.
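
In code, the user-space side might look roughly like the sketch below. PF_VJCHAN's numeric value, the SOCK_RAW socket type, and the ring's field layout never appeared in a mainline header, so everything here beyond the calls named above is an assumption made for illustration.

    #include <poll.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Placeholder values: the real numbers are assigned by the patch. */
    #define PF_VJCHAN      27
    #define VJ_RING_PAGES  256

    struct channel_ring_sketch;          /* as sketched above */

    static void consume_packets(void)
    {
        size_t ring_size = VJ_RING_PAGES * (size_t)sysconf(_SC_PAGESIZE);
        int fd = socket(PF_VJCHAN, SOCK_RAW, 0);
        struct channel_ring_sketch *ring =
            mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        (void)ring;   /* the actual buffer walk is left as a comment below */

        for (;;) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };

            poll(&pfd, 1, -1);           /* sleep until buffers arrive */
            /* Walk the ring from consumer_idx to producer_idx, processing
             * each buffer; incrementing consumer_idx hands a buffer back
             * to the kernel for reuse. */
        }
    }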

The driver-side interface is, so far, quite simple. A buffer can be allocated from a given ring with a call to vj_get_buffer(); once the data has been placed there by the network interface, vj_netif_rx() sends that buffer up into the protocol code. The tricky part is getting each packet into the correct buffer in the first place. Copying packets inside the kernel would defeat the purpose of this whole exercise; it is important that the network interface choose the correct buffer before DMAing the packet data into memory. As it happens, contemporary network cards can be smart enough to make that decision, if programmed properly by the driver.
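
A correspondingly hedged sketch of the driver side: vj_get_buffer() and vj_netif_rx() are the interfaces named above, but their signatures, and everything else shown here, are assumptions for illustration only.

    struct vj_ring;                      /* per-destination channel ring */

    void *vj_get_buffer(struct vj_ring *ring, unsigned int size);
    void vj_netif_rx(struct vj_ring *ring, void *buf, unsigned int len);

    /* RX setup: take a buffer from the ring belonging to the packet's
     * destination and program the NIC to DMA the next matching packet
     * straight into it; the flow classification happens in hardware. */
    static void rx_refill(struct vj_ring *ring)
    {
        void *buf = vj_get_buffer(ring, 1500);
        /* ... hand buf to the NIC's receive descriptor ... */
        (void)buf;
    }

    /* RX completion: the data already sits in the right ring, so the
     * driver only pushes the filled buffer up to the protocol code. */
    static void rx_complete(struct vj_ring *ring, void *buf, unsigned int len)
    {
        vj_netif_rx(ring, buf, len);
    }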

There are vast numbers of issues to be worked out still. David Miller takes exception to the preallocated buffers, seeing them as inflexible and hard to change; he would rather see a pointer-oriented data structure. But it is hard to see how that might work while still avoiding the overhead of mapping buffers into user space with every packet.

A more difficult issue, perhaps, is netfilter. The zero-copy approach can be quite fast, but it also naturally shorts out the packet filtering done by the netfilter code. It has been suggested that, for established connections, that is an acceptable tradeoff. But Rusty has pointed out that people do use filtering on established connections, for packet counting if nothing else. As he put it: "Basically I don't think we can 'relax' our firewall implementation and retain trust". So some other sort of solution will have to be found here.

Another open issue has to do with whether the channel should go all the way through to user space or not. Van Jacobson's linux.conf.au presentation included discussion of a user-space TCP implementation, taking the end-to-end principle to its logical conclusion. The reasoning behind this move is that, since the data will be processed by the application, putting the protocol code in the same place will be the fastest, most cache-friendly way to do it. But moving protocol code to user space also means duplicating much of the networking stack and adding to the complexity of the system as a whole. Leaving the protocol code in the kernel simplifies the situation, and, it is believed, can be made to yield almost all of the same performance benefits. In particular, protocol processing can happen on the same processor as the destination application (a fair amount of it is done that way now), and zero-copy networking will still be possible.

It has also been pointed out that, since most of the system calls involved with network data reception (read() or recv(), for example) already imply copying the data, that copy might as well be done in kernel space. But implicit in that statement is another conclusion: if channels are to be used to their fullest potential for high-performance networking, a new set of user-space interfaces will have to be developed. The venerable socket interface was never designed for a channel-oriented environment. How such an interface might look is not entirely clear; it could be based on the current asynchronous I/O API, on kevents, or on something completely new.

In summary, the networking developers are working on some major changes to how networking will be done in Linux, and there are a lot of issues which are not yet understood. The developers are groping around for ideas. So the channel implementations which are being posted now are unlikely to resemble the code which will, someday, be merged into the mainline; they are, instead, exercises intended mainly to obtain a better understanding of the real nature of the problem. But they are still a promising start to what looks to be an interesting development effort.


The Linux power management summit

April 28, 2006

This article was contributed by Patrick Mochel.

On 11 April 2006, 42 attendees from 17 different companies (and 3 universities) arrived in Santa Clara, California for the 2006 Linux Power Management Summit. The Summit was organized by your author, in conjunction with the Consumer Electronics Linux Forum (CELF), which held its Embedded Linux Conference the same week, and with the OSDL Desktop Linux Working Group. Along with CELF, summit sponsors included Intel, Nokia, Google, AMD, Freescale, and Texas Instruments. The attendees represented over a dozen open source projects, from the low-level embedded (DPM/PowerOp) to the high-level (freedesktop.org) to the broadest (Fedora, SUSE, and Ubuntu distributions). With such a diverse crowd of people, if nothing else, it promised to be an interesting week of discussions.

The Summit spanned 3 days, starting with a welcome reception on Tuesday evening, 11 April and going until mid-day on Friday, 14 April. Wednesday and Thursday were filled with hour-long sessions led by an individual from a project or a company. The sessions were designed to foster discussion, though the format was left entirely up to the presenter. Most had a backing presentation of talking points, and each one succeeded in keeping the discussions flowing.

Wednesday's presentations were centered around various Open Source Power Management projects. First Pavel Machek talked about Linux Suspend [PDF] (Suspend-to-Disk and Suspend-to-RAM), giving an overview of its history, its implementation, and the issues that continue to inhibit the suspend operations from "just working" in the way that people want them to. He spoke about uSwsusp, which moves the suspend functionality to userspace, allowing for less in-kernel complexity and an easier implementation of the user-friendly features found in Nigel Cunningham's Suspend2 patches; and he spoke about the main problem with getting Suspend-to-RAM to work: video drivers.

Len Brown next talked about ACPI [PDF], and what it means for power management. Len gave an overview of the generic ACPI components (the tables, the ASL compiler, the AML interpreter, and ACPICA, the ACPI Component Architecture) and of the Linux implementation (code organization, ACPI device drivers, acpid). He then dove into ACPI power states, specifically how ACPI represents and implements CPU C states (idle states that vary in wakeup latency) and P states (performance states that vary in CPU speed).

Len's session provided a good lead-in to Dominik Brodowski's session about cpufreq [PDF], which does dynamic CPU frequency scaling based on policy, together with intelligence for measuring and predicting the load. Dominik described the architecture of the subsystem, how decisions are made, and how they are put into effect via the CPU drivers. He then spoke about the desire to extend cpufreq beyond frequency scaling alone (to include voltages and clocks), beyond single CPUs (to be smarter about managing multiple cores and threads), and beyond CPUs in general (to include policy and drivers for other devices with similar functionality).
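
To make the policy side concrete, here is a minimal user-space sketch of selecting a cpufreq governor through sysfs; the path is the standard one for 2.6-era kernels, though whether the "ondemand" governor is available depends on the kernel configuration, and the function name here is just an example.

    #include <stdio.h>

    /* Select a cpufreq governor for CPU 0 via the sysfs policy interface;
     * the governor then drives the CPU-specific frequency driver. */
    static int set_governor(const char *governor)
    {
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "w");

        if (!f)
            return -1;
        fprintf(f, "%s\n", governor);
        return fclose(f);
    }

    /* e.g. set_governor("ondemand"); selects dynamic scaling based on load */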

Todd Poynor and Matthew Locke's session about DPM and PowerOp [PDF] followed, providing a perspective on the same topic from the other end of the spectrum. DPM (Dynamic Power Management) is infrastructure to manage the "Operating Points" of a system, which are states consisting of pre-defined tuples of voltages and clocks (and therefore frequencies). To coordinate and set the voltages and clocks (which usually must be done for several devices in unison), DPM uses a low-level interface called PowerOp. DPM is practically ubiquitous in embedded Linux implementations, though it lives as an out-of-tree patch.

The next hour was split between Holger Macht, who talked about SUSE power management; Dave Jones, who spoke about Fedora power management; and a guest speaker who covered Ubuntu power management. SUSE provides an application called powersave which offers a command-line interface (which can then be wrapped by a GUI) for managing suspend states, CPU power management, and, recently, some device states. The Fedora and Ubuntu power management concerns have both centered around getting suspend/resume to work reliably for their users. Both Fedora and Ubuntu appear to use gnome-power-manager as the primary interface for managing power; this tool does not expose as many knobs and levers (literally and figuratively) as the powersave family of utilities does.

All of the distributions now support power management (especially suspend/resume) on quite a large list of laptop models.

To finish off the day, Jim Gettys and Mark Foster from the One Laptop Per Child project spoke about the design and challenges of the $100 Laptop [PDF], especially around power management. Specifically, they are looking for very efficient hardware and software solutions so that charging the battery requires minimal energy and so that the battery lasts an exceptionally long time (by today's standards). Mark presented a proposal [PDF] for a mechanism to achieve resume-from-RAM in under 300ms.

Sampsa Fabritius from Nokia started off the Thursday sessions [PDF] with a presentation of the power management framework used on the Maemo platform (which is used in the Nokia 770). Maemo is based on GNOME, but it uses a custom power configuration and management scheme, rather than one based on Utopia/HAL/DBUS. At a lower level, Nokia has also written a "clocks" framework for articulating and controlling clock domains (of which the OMAP platform has many). Based on the previous day's discussions, Sampsa raised the question of whether it would be possible (and prudent) to define a common solution (or common set of solutions) for power configuration and management, since many platforms and interfaces are trying to accomplish similar things, sometimes with a set of similar components.

A set of people from the Texas Instruments OMAP division -- Eric Thomas, Shiv Ramamurthi, and Richard Woodruff -- spoke about the OMAP platform, its goals, and the challenges faced with leveraging its power management potential. OMAP has a rich set of power management techniques, and unlike most desktop platforms, it exposes all of the low-level components (clocks, clock domains, power domains, and voltage domains) to the kernel and requires it to coordinate the scaling of each. This is currently done with a modified version of DPM, along with a custom set of scripts and control framework to set and manage the operating points of the system.

Quinn Jensen from Freescale used the next hour to speak about the MX31 platform [PDF], an ARM11-based system-on-a-chip that is similar in nature to the OMAP. It has many power management features centered around dynamic voltage and frequency scaling (DVFS). Not surprisingly, Freescale is also using a custom version of DPM and associated control infrastructure to control the hardware. Like the others, they are running into limitations of the framework, since it only deals with the lowest-level components and does not provide a richer policy framework (like cpufreq does).

Mark Gross, representing CELF, presented a summary of the CELF power management requirements [PDF], as expressed by the CELF member companies. The most important items seemed to be the refinement and inclusion of a dynamic-tick / tickless idle solution (which underscored the use of such solutions by previous presenters), and a mainstream solution for DVFS (a la DPM) that provides robust policy management (a la cpufreq). Much of the discussion that followed was about the details of a common interface for these solutions.

Jacob Shin from AMD presented next about the low-level details of AMD CPU PM [PDF], specifically how PowerNow works on multi-core K8 processors, and the changes that were necessary to the CPU hotplug and cpufreq bodies of code to support it.

Thursday ended with a birthday celebration for Adam Belay, then an open discussion about the topics covered so far and the issues that were on peoples' minds.

Friday began with another open discussion about the overall architecture and framework needed for power management on any system. After several diagrams, doodlings, and lists went up on the wall on gigantic Post-It notes, the group broke into three smaller groups to talk about the three primary layers of power management and how functionality or features might be shared between different platforms and solutions.

  • Low-level hardware configuration and control. This discussion centered around how to describe different levels of "on-ness" and "off-ness" to higher levels in a manner that makes the most sense (to both the device drivers and the consumers of such an interface).

  • The kernel-user space interface. This discussion was based on the assumption that the gap between DPM's low-level management framework and cpufreq's policy framework can and should be bridged in some manner. From there, this group discussed how to design a common interface (via sysfs) which could be used by a user-space policy mechanism to control CPU operating points.

  • The user-space framework needed to provide good power management. There are a number of existing solutions for monitoring various types of hardware, monitoring and predicting system load, handling PM-related events, and managing policy. But they are all disjoint, overlapping only occasionally, and most do not do as good a job as anyone would like.

It was a long three days, filled with many discussions about system control and management throughout the software stack, and about the many interdependencies and special cases that exist on the many platforms that Linux supports. Such is the nature of power management. The introductions to new topics and people, as well as the brainstorming about better and more common solutions, were top-notch, and bode well for the future of efficiency in Linux.

However, in the meantime, we still have a lot of work to do in the fixing category. Besides the fact that the primary embedded solution (DPM) and its variants do not exist in the mainline kernel, there is also this quote to consider about what we are working with today. As Andrew Morton expressed it (via email):

My main concern is stability of the existing stuff, rather than any need for new features. Firstly machines which won't boot, especially ones which _newly_ won't boot. Secondly machines which won't suspend/resume properly, especially ones which used to do this. Huge number of ACPI bug reports, and rather a lot of cpufreq ones too.

My second concern would be with overall stability and maturity and simplicity of the existing kernel APIs - it seems that lots of driver developers get it wrong in subtle ways. (Why am I still staring at those "pm_register is deprecated" warnings??)

Fortunately, we now have a lot more people familiar with the types of Power Management problems, and many more upcoming events to discuss the progress as we move forward.

[Author's Note: This article was written with the help of the extensive notes taken by Jeffery Osier-Mixon, a technical writer from PalmSource who we borrowed for the Summit. Thanks, Jefro.]


Briefly: patch quality, CKRM, likely(), and vmsplice()

A number of issues have been discussed in recent times which, while not each meriting a full article, are nonetheless worthy of mention. Here are a few of them.

Development process

The 2.6.17-rc2-mm1 release included, along with the usual huge pile of patches, a complaint from Andrew Morton:

It took six hours work to get this release building and linking in just a basic fashion on eight-odd architectures. It's getting out of control....

Could patch submitters _please_ be a lot more careful about getting the Kconfig correct, testing various Kconfig combinations (yes sometimes people will want to disable your lovely new feature) and just generally think about these things a bit harder? It isn't rocket science.

Andrew, it seems, is getting too many submissions which lack basic testing. Occasionally things simply don't compile. More often, patches create problems when their particular configuration options are disabled, or for architectures not tested by the original developer. Andrew ends up fixing those problems, and that takes a fair amount of his time. The bigger issue is elsewhere, however:

My main reason for the big whine is that this defect rate indicates that people just aren't being sufficiently careful in their work. If so many silly trivial things are slipping through, then what does this tell us about the big things, ie: runtime bugs?

There has been some discussion of how the situation could be improved. Ideas include better automated kernel build farms, which would allow any developer to get wider build testing, and a checklist to go through before patches are sent for review. But what is really needed is for developers to simply take a little more care in the preparation of their patches.

CKRM rebranded

The CKRM resource management patches have been received unenthusiastically by the development community in the past. To many, CKRM looks like a large body of complex code, with hooks distributed throughout the kernel, providing functionality which is of interest to relatively few users. So the CKRM proposals have not gotten very far, and the development team has been quiet recently.

What the developers have been doing, however, is reworking the CKRM patches in an attempt to make them more palatable. The result is now known as Resource Groups, and it is, once again, being pushed for inclusion into the kernel. The Resource Group code has been put on a diet, with many features removed and others shoved out to user space. Duplicated code has been taken out, and a major effort has been made to use kernel library primitives wherever possible.

Andrew Morton had a reasonably positive reaction to the new code submission, saying "...the overall code quality is probably the best I've seen for an initial submission of this magnitude". He was more worried about a proposed memory controller, however, which looks to duplicate much of the memory management subsystem. Beyond that, there have not been a whole lot of comments from elsewhere in the community.

Not so unlikely after all

The kernel provides a pair of macros, called likely() and unlikely(), which give the compiler hints about which way a test in an if statement is expected to go. The compiler can then use that hint to lay out the generated code so that the processor's branch prediction and speculative execution favor the expected path at run time. These macros are used fairly heavily throughout the kernel to reflect what the programmer thinks will happen.
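
For reference, the hints boil down to GCC's __builtin_expect(); the definitions below match the ones in <linux/compiler.h>.

    #define likely(x)      __builtin_expect(!!(x), 1)
    #define unlikely(x)    __builtin_expect(!!(x), 0)

    /* Typical use, telling the compiler the error path is the rare one:
     *
     *     if (unlikely(!skb))
     *             return -ENOMEM;
     */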

A well-known fact of life is that programmers can have a very hard time guessing which parts of their code will actually consume the most processor time. It turns out that they aren't always very good at choosing the likely branches in their code either. To drive this point home, Daniel Walker has put together a patch which does a run-time profile of likely() and unlikely() declarations. With the resulting output, it is possible to see which of those declarations are, in reality, incorrect and slowing down the kernel.

Using this output, Hua Zhong and others have been writing patches to fix the worst offenders; some of them have already found their way into the mainline. In at least one case, the results have made it clear to the developers that things are not working as they were expected to, and other fixes are in the works.

One unlikely() which remains unfixed, however, is in kfree(). Passing a NULL pointer to kfree() is entirely legal, and there has been a long series of janitorial patches removing tests which checked pointers for NULL before freeing them. kfree() itself is coded with a hint that a NULL pointer is unlikely, but it turns out that, in real life, over half of the calls to kfree() pass NULL pointers. There is resistance to changing the hint, however; the preference seems to be to fix the (assumed) small number of high-bandwidth callers which are at the root of the problem.
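
The janitorial pattern in question looks like the fragment below; since kfree() accepts a NULL pointer and simply returns, the explicit test buys nothing but an extra branch ("dev->buf" is, of course, just an illustrative field).

    /* Redundant: kfree(NULL) is defined to be a no-op. */
    if (dev->buf)
            kfree(dev->buf);

    /* The janitorial patches reduce it to: */
    kfree(dev->buf);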

vmsplice()

Last week, your editor astutely caught the last-minute merging of the vmsplice() system call into 2.6.17-rc3. Rather less astutely, however, your editor missed the fact that the prototype for vmsplice() had changed since it was posted on the linux-kernel mailing list. The current prototype for vmsplice() is:

    long vmsplice(int fd, const struct iovec *iov, 
                  unsigned long nr_segs, unsigned int flags);

The use of the iovec structure allows vmsplice() to be used for scatter/gather operations.

Since then, vmsplice() has picked up a new flag: SPLICE_F_GIFT. If that flag is set, the calling process is offering the pages to the kernel as a "gift." If conditions allow, the kernel can simply remove the page from the process's address space and dump it into, for example, the page cache. With this flag, an application can generate data in memory, then send it on to its destination without copying in the kernel.
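
As a hedged sketch of that usage (the function name, buffer handling, and error handling below are illustrative; it also assumes a C library providing the vmsplice() wrapper and SPLICE_F_GIFT definition, otherwise syscall() would be needed), gifting a page of generated data into a pipe might look like:

    #define _GNU_SOURCE
    #include <fcntl.h>             /* vmsplice(), SPLICE_F_GIFT */
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Generate one page of data and gift it to the kernel through a pipe.
     * The gift optimization only applies to page-aligned, page-sized
     * segments. */
    static ssize_t gift_page(int pipe_fd)
    {
        long page = sysconf(_SC_PAGESIZE);
        struct iovec iov;
        void *buf;

        if (posix_memalign(&buf, page, page))
            return -1;
        /* ... fill buf with generated data ... */

        iov.iov_base = buf;
        iov.iov_len  = page;

        /* With SPLICE_F_GIFT, the kernel may steal the page outright
         * (into the page cache, for example) rather than copying it. */
        return vmsplice(pipe_fd, &iov, 1, SPLICE_F_GIFT);
    }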


Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.17-rc3
Andrew Morton: 2.6.17-rc3-mm1
Andrew Morton: 2.6.17-rc2-mm1
Greg KH: Linux 2.6.16.12
Greg KH: Linux 2.6.16.13
Con Kolivas: 2.6.16-ck8
Con Kolivas: 2.6.16-ck9
Marcelo Tosatti: Linux 2.4.33-pre3

Architecture-specific

Michael Holzheu: s390: Hypervisor File System

Core kernel code

Development tools

Marco Costalba: qgit-1.2
Matt Mackall: Ketchup 0.9.8 released

Device drivers

Documentation

Filesystems and block I/O

Janitorial

David Woodhouse: Simple header cleanups

Memory management

Networking

Security-related

Jan Engelhardt: MultiAdmin LSM

Virtualization and containers

Serge E. Hallyn: uts namespaces: Introduction

Miscellaneous

Stephane Eranian: beta of pfmon-3.2 available

Page editor: Jonathan Corbet

