Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch remains 2.6.16-rc1. A handful of patches has appeared in the mainline git repository: mostly fixes, along with a few new features (see below).

The current -mm release is 2.6.16-rc1-mm3. Recent changes to -mm include more semaphore-to-mutex conversions, two-column stack backtraces on i386 (to make oops traces fit on one screen), various memory management tweaks, the SMP alternatives patch, and lots of fixes.


Quotes of the week

The Linux kernel is under the GPL version 2. Not anything else. Some individual files are licenceable under v3, but not the kernel in general. And quite frankly, I don't see that changing. I think it's insane to require people to make their private signing keys available, for example. I wouldn't do it. So I don't think the GPL v3 conversion is going to happen for the kernel, since I personally don't want to convert any of my code.

-- Linus Torvalds

I am against personal attacks and this is the first time where it tooks more than a day before LKML people started with personal attacks against me. So in principle this is some sort of progress compared to former times.

-- Joerg Schilling


The 2.6.16 straggler list

The release of 2.6.16-rc1 was supposed to signal the closing of the window for new features. For the most part, things have happened that way. A few additional features did find their way in after 2.6.16-rc1 came out, though. Here is a quick list.

The work of making the slab allocator smarter on NUMA machines continues. In previous versions of the kernel, slab allocations made during the bootstrap process would all end up on the boot node, causing an imbalance across the NUMA system. It was also possible for processes with non-default memory allocation policies to "contaminate" allocations for other processes. The 2.6.16 slab allocator will make more explicit decisions about just how allocations should be performed to spread out boot-time allocations and to ensure that each process gets the allocation policy it asked for.
NUMA systems can also perform memory reclamation on individual memory zones, on the theory that forcing out pages can be cheaper than allocating non-local pages.
A number of new system calls, including openat() and friends, ppoll(), and pselect(), have been merged. These calls were discussed here last December.
Perhaps the biggest late addition is the EDAC ("error detection and correction") subsystem. The purpose of the EDAC code is to watch for errors in the operation of the system and to scream when they are detected. EDAC, as merged, is oriented mainly toward memory errors. It will poll the memory controllers (drivers for a few families of controllers have been merged) on a regular basis for both correctable and uncorrectable errors. Log messages can be generated for both types of errors, and there is a sysfs interface as well. Optionally, the EDAC code can be told to immediately panic the system on an uncorrectable error; in this way, it is hoped, uncorrectable errors will not lead to data corruption elsewhere in the system.
One assumes that uncorrectable errors will be rare, however. The real intent is to allow administrators to see when significant numbers of correctable errors are being detected. Since those errors will often degrade, over time, into uncorrectable problems, the presence of correctable errors is a strong indication that the affected memory bank should be replaced.
The EDAC code can also watch for parity errors on the system's PCI buses. Getting good information from the PCI subsystem can be harder, however, since, apparently, some vendors do not follow the specs when it comes to the generation of parity information.
For more information on EDAC, including details on the sysfs interface, see drivers/edac/edac.txt in the current mainline documentation directory.
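Of the late additions listed above, openat() is the easiest to try out from user space. What follows is only a minimal sketch, not a description of the kernel patch itself: it relies on Python's os.open(), whose dir_fd argument resolves a relative path against an already-open directory descriptor (on Linux this is carried out with openat()). The helper names read_relative() and demo(), and the file name hello.txt, are made up for illustration.

```python
import os
import tempfile


def read_relative(dir_fd, name):
    """Open `name` relative to the directory behind `dir_fd` and return its contents."""
    # With dir_fd set, the lookup starts at that directory rather than at the
    # process's current working directory; that is the openat() behavior.
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)


def demo():
    """Create a scratch directory, drop a file in it, and read it back via dir_fd."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "hello.txt"), "wb") as f:
            f.write(b"hello\n")
        dir_fd = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
        try:
            return read_relative(dir_fd, "hello.txt")
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    print(demo())
```

Because the lookup is anchored to a descriptor instead of a path string, the result does not change if another thread calls chdir() or if a parent directory is renamed between steps.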

At this point, the 2.6.16 merge window can truly be considered closed; the feature set for this release is probably complete.


Review: Understanding Linux Network Internals

The net/ directory tree in the Linux kernel source is an intimidating place. We all use the kernel's networking features, but even experienced kernel hackers often hesitate to wander into the code which implements those features. To many, the networking stack is a black box, maintained by a distinct set of developers who keep many of their secrets to themselves. There is little documentation on how Linux networking is implemented, adding to the challenge of understanding how it all works.

Your editor had been told that O'Reilly had a book on the networking stack - a sort of companion to Understanding The Linux Kernel - in the works. But it was still a nice surprise to see the end result - a book by Christian Benvenuti entitled Understanding Linux Network Internals - show up on the doorstep. A couple of weeks later, after having read much of the book, your editor is ready to share some comments. The short version would be: this book is a welcome addition to the (short) list of books about the kernel. It is not as good a book as it could have been, however, and leaves some significant gaps.

Let's get one pet peeve out of the way immediately: any kernel book should disclose, on the cover, which version of the kernel is covered. As LWN readers know well, things change quickly in the kernel. A book which covers one version will likely be obsolete in many places a few versions later. If a kernel book does not include version information, there is no way to know which reality it matches or whether it will be even remotely relevant to current kernels.

In the case of this book, there is no word anywhere regarding which version is covered. It is clearly a 2.6 book, but that is all we know. Your editor has come to the conclusion from his reading that the book was a long time in the writing (not surprising: the subject matter is complex, and the book is over 1,000 pages long), and that, if an effort was made to make it consistently current for a specific kernel version, that effort was incomplete. The section on interrupts, for example, presents the old prototype for interrupt handlers last seen in the 2.5.68 kernel. Other parts are much more current. The book is a bit of a patchwork in that regard.

And in other regards as well. Some parts of the book seem to want to be a programming manual - to the point that the slab cache functions (kmem_cache_create() and friends) are presented on page 4. Page 13 talks about the likely() and unlikely() constructs. Yet, in other areas, detail is much more scarce, and there is no complete discussion of how to write code for the kernel. And (another pet peeve of your editor's) the issues of concurrency and race conditions are passed over almost completely.

Similarly, the section on network device drivers offers a great deal of information on device registration, queueing discipline bits, notifiers, power management, ethtool, dealing with the PCI bus, module initialization, and more. There is even a section on how bottom halves worked in the 2.2 kernel. But there is almost no information on how to write transmit and receive functions. At one point the author writes "This chapter does not strive to be a guide on how to write NIC device drivers." No problem, there are (ahem) other books which cover that ground. But then why bother with things like PCI device registration?

This book does contain a great deal of information. It may pass over driver transmit and receive functions, but it does cover packet transmission and reception in the higher levels of the networking stack in some detail - and that is just what one would want. There is a long section on IPv4 and ICMP, and quite a bit of information on the complicated "neighbor" code (the ARP protocol and such). The last major section is on routing. Stuffed into the middle is a 110-page section on the bridging subsystem.

Networking is a large area, and a large part of the kernel, so it is hard to cover everything even in a 1,000-page book. So some important things were left out of Understanding Linux Network Internals. These include TCP, IPv6, IPsec, netfilter, traffic control, and several other topics. And that leads to your editor's last, and perhaps biggest, complaint. The inconsistent focus and somewhat irregular choice of topics seen at the lower levels is also present in the large scale. Your editor would have happily traded the four chapters on bridging for a solid overview of how the TCP protocol works in Linux, and your editor suspects that he is not alone. Netfilter and traffic control, perhaps, merit a book of their own, but maybe some of the other chapters could have been tightened up enough to make room for an introduction to IPv6 or IPsec.

So it is hard to recommend this book in an unreserved fashion. That said, there is a great deal of useful information to be found in Understanding Linux Network Internals, and your editor is glad to have it on his bookshelf. It has already come in useful a couple of times while trying to figure out how parts of networking-related patches work. So this book is a welcome addition to the body of kernel-related documentation, even if it is not everything one might wish it would be.


MD / DM

The Linux software RAID code (often called "MD" for "multi-device") is a longstanding feature of the kernel. RAID users appreciate its robustness, configurability, and the fact that it performs well; better performance than that achieved with hardware RAID controllers is not unheard of. In recent years, little has been heard about the MD code, however. Its feature set has changed slowly, and developments with the device mapper code have taken a higher profile. That, perhaps, is as it should be; a storage subsystem which attracts attention is rarely a good thing.

That said, MD hacker Neil Brown has been busy. His latest patch set implements RAID5 reshaping: the ability to add devices to a RAID5 array without going through a backup and restore cycle - or even shutting the array down. This is a nontrivial task; adding a drive to a RAID5 array requires redistributing data and parity blocks across the entire array. With this version of the patch, Linux MD can not only perform this task, but it can do it while still handling normal I/O to the array. The new patch also checkpoints the process, so that it can be restarted if interrupted in the middle; this corrects a minor defect in the previous version, wherein interrupting the reshaping task would cause all data in the array to be lost.
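To see why adding a drive forces data to be redistributed, consider a toy model of a rotating-parity array. This is only a sketch under stated assumptions: the layout rule in locate() is a simplified stand-in chosen for illustration, not the layout the MD driver actually uses, and all function names here are hypothetical.

```python
from functools import reduce


def locate(block, disks):
    """Map a logical data block to (disk, stripe) in a toy rotating-parity layout."""
    data_per_stripe = disks - 1                  # one block per stripe holds parity
    stripe, index = divmod(block, data_per_stripe)
    parity_disk = disks - 1 - (stripe % disks)   # parity moves to a new disk each stripe
    data_disks = [d for d in range(disks) if d != parity_disk]
    return data_disks[index], stripe


def count_moved(blocks, old_disks, new_disks):
    """Count logical blocks whose (disk, stripe) position differs between two array sizes."""
    return sum(locate(b, old_disks) != locate(b, new_disks) for b in range(blocks))


def xor_parity(chunks):
    """Return the byte-wise XOR of equally sized chunks (the RAID5 parity rule)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


if __name__ == "__main__":
    # Growing the toy array from three disks to four
    print(count_moved(12, 3, 4), "of the first 12 blocks change position")
    # Parity lets any single missing chunk be rebuilt from the others
    data = [b"\x01\x02", b"\x10\x20"]
    parity = xor_parity(data)
    print(xor_parity([parity, data[1]]) == data[0])
```

In this toy layout, growing from three disks to four relocates 10 of the first 12 logical blocks, and every stripe now covers three data blocks instead of two, so its parity must be recomputed as well. Any real reshape has to do this work across the whole array while it stays online.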

Neil notes that things could still go wrong:

There is still a small window ( < 1 second) at the start of the reshape during which a crash will cause unrecoverable corruption. My plan is to resolve this in mdadm rather than md. The critical data will be copied into the new drive(s) prior to commencing the reshape. If there is a crash the kernel will refuse the reassemble the array. mdadm will be able to re-assemble it by first restoring the critical data and then letting the remainder of the reshape run it's course.

Neil has various other enhancements in mind, including the ability to upgrade a RAID5 array to RAID6 (which increases fault tolerance by adding another set of parity blocks). Quite a bit, clearly, is happening in the MD world.

All this activity drew queries from a couple of observers who had, it seems, assumed that the addition of the device mapper to the kernel meant that the MD code would eventually wither away. The device mapper can handle some of the lower RAID levels (mirroring and striping) now, and there is work in progress to add RAID5 support. Since the device mapper is a general framework for mixing and matching drives, it makes sense to some that the RAID functionality should move there too.

Unsurprisingly, Neil disagrees. His suggestion is that "anything with redundancy," including RAID5 and RAID6, is best handled in the MD code. The device mapper, instead, is good for fancier arrangements like multipath, encryption, volume management, snapshots, etc. Certainly, those who are placing trust in RAID for redundancy should be comforted by the rather longer track record built up by the MD code. MD is also said to be faster than the device mapper at this time.

As others have pointed out, however, there is a cost to carrying multiple RAID implementations in the kernel. Each must be maintained, and each will have its own unique bugs to contribute to the whole. So, as the device mapper develops higher-level RAID capabilities, it would be nice if some of the core code could be shared between MD and DM. Making that happen, however, will require developer effort - and it's not clear that any hackers are interested in doing that work at this time.


Patches and updates

Kernel trees

Andrew Morton: 2.6.16-rc1-mm2

Andrew Morton: 2.6.16-rc1-mm3

Architecture-specific

Gerd Hoffmann: SMP alternatives

Core kernel code

david singleton: futex: robust futex support

George Anzinger: Clean up of hrtimer code.

Thomas Gleixner: hrtimers updates

john stultz: Time: Generic Timeofday Subsystem (v B17)

Peter Williams: PlugSched-6.2 for 2.6.16-rc1 and 2.6.16-rc1-mm1

Rafael J. Wysocki: swsusp: userland interface (rev 2)

Nigel Cunningham: Suspend2 2.2 for 2.6.15.1

Development tools

Junio C Hamano: GIT 1.1.4

Catalin Marinas: Stacked GIT 0.8.1

Device drivers

Greg KH: PCI patches for 2.6.16-rc1

David Vrabel: Controller Area Network (CAN) infrastructure

David Vrabel: MC251x CAN controller driver example

Tejun Heo: libata: new reset mechanism, take#2

Mike Christie: SCSI Userspace Target: scsi-ml changes

Mike Christie: SCSI Userspace Target: example target driver

Mike Christie: SCSI Userspace Target: scsi tgt core functions

Filesystems and block I/O

Daniel_Aragon: minix filesystem: Update to V3

Mark Fasheh: ocfs2 updates

Shaun Savage: CBD Compressed Block Device, New embedded block device

NeilBrown: md: Introduction - raid5 reshape mark-2

Janitorial

Adrian Bunk: the scheduled removal of the obsolete raw driver

Memory management

Mel Gorman: Reducing fragmentation using lists (sub-zones) v22