Kernel development [LWN.net]

Kernel release status

The current 2.6 kernel is 2.6.7, which was announced by Linus on June 15. Changes since the last release candidate include a fix for the latest denial of service vulnerability (see below), an NTFS update, some more CPU frequency controller work, and lots of fixes. The biggest changes since 2.6.6 include scheduling domains, a big rework of the reverse-mapping VM code, filtered waitqueues, the removal of the InterMezzo filesystem, quota and extended attribute support in reiserfs, a new API for NUMA systems, the removal of IDE tagged command queueing support, and the usual pile of fixes. See the long-format changelog for the details.

Linus's BitKeeper repository contains no patches beyond 2.6.7 as of this writing.

The current tree from Andrew Morton is 2.6.7-rc3-mm2. Recent additions to -mm include ext3 resizing support (see below), an O_NOATIME option to open(), and various fixes.

The current 2.4 prepatch is 2.4.27-pre6, which was released on June 15. It includes the FPU denial of service fix, of course, along with some architecture updates, DVD-RW write support, and a fair number of fixes.

Quote of the week

This is all part of what responsible release management is about. I was the junior whiz kid in professional release management teams before starting Namesys. I listened to my elders and learned from them. My standards for professional conduct in this arena are higher than yours as a result of that. You are a bunch of young kids who lack professional experience in release management. That is ok, but don't get aggressive about it.

-- Hans Reiser

A nasty FPU bug

The problem was initially reported as a gcc bug. If you execute this code:

    static void Handler(int ignore)
    {
        char fpubuf[108];
        __asm__ __volatile__ ("fsave %0\n" : : "m"(fpubuf));
        __asm__ __volatile__ ("frstor %0\n" : : "m"(fpubuf));
    }

in a signal handler, the system (or, at least, the CPU that was running the code) will freeze up hard. Ways of locking up the system from an unprivileged user-space program are generally considered to be bad news; they also, in general, are not seen as compiler bugs. A bit of digging turned up the real problem, and the latest kernel denial of service vulnerability was found.

In theory, the fsave instruction above saves the floating-point unit (FPU) status into the fpubuf array; the subsequent frstor should simply restore the same state back into the FPU. Unfortunately, the above code is incorrect; the assembly instructions should read "m"(*fpubuf) to actually store the state into the fpubuf array. The code, as written, restores from the wrong address, corrupting the state of the FPU and, in particular, setting some exception flags.
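For comparison, here is a minimal sketch of a save/restore pair in which the memory operand is the buffer itself, in the spirit of the "m"(*fpubuf) correction described above. The struct wrapper fsave_area and the function name fpu_save_restore are illustrative assumptions, not part of the original report; the wrapper is only there so that the compiler knows that all 108 bytes are touched. The instructions are compiled only on x86 with a GCC-compatible compiler.

```c
#include <string.h>

/* 108 bytes: size of the x87 state image written by fsave (32-bit format). */
struct fsave_area {
    unsigned char bytes[108];
};

/*
 * Save and immediately restore the x87 state through a single buffer.
 * Returns 1 if the instructions were executed, 0 on other architectures.
 */
static int fpu_save_restore(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    struct fsave_area area;

    memset(&area, 0, sizeof(area));
    /* "=m"(area): the memory operand is the buffer, which fsave writes. */
    __asm__ __volatile__ ("fsave %0" : "=m"(area));
    /* "m"(area): frstor reads the same 108 bytes back. */
    __asm__ __volatile__ ("frstor %0" : : "m"(area));
    return 1;
#else
    return 0;
#endif
}
```

Because both instructions name the same object, the restore reads back exactly what the save wrote.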

FPU exceptions do not result in immediate kernel traps; instead, the trap happens when the next floating-point instruction is executed. As it happens, when a signal handler returns, the kernel checks whether that handler has used any floating-point instructions; if so, it performs an fwait instruction to ensure that the last operation is complete. That fwait causes the floating-point exception created by the corrupt restore to be delivered as a kernel trap.

The kernel has a way of dealing with floating-point traps; it saves the FPU state and queues up a floating-point exception signal for the current process. It also sets the TS ("task switched") processor flag to indicate that the FPU state may be other than expected. At that point, it returns to the place where the exception occurred.

Normally, as part of returning from the trap, the kernel would simply deliver the floating-point exception signal to user space and get on with life. But, in this case, the kernel is returning to kernel space, and to the very fwait instruction that caused the problem in the first place. That instruction sees the TS flag and generates another trap. The handler for this trap knows just what to do in response to a TS flag; it restores the saved FPU state and returns. The saved FPU state is, however, the corrupted state which was in effect before the first attempt to execute fwait. So, at this point, the loop is closed, and the fwait instruction generates a new floating-point trap. This will go on for a while.

The fix is relatively straightforward, once the problem is understood. The kernel simply clears any pending exceptions before executing fwait, and the problem goes away. All that is left is the updating and rebooting of large numbers of vulnerable systems.
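The loop and its fix can be sketched with a toy model. This is not kernel code: the two-flag struct toy_fpu, the function toy_fwait() and the trap limit are all assumptions made for illustration, and the model ignores everything about the FPU except a pending-exception bit and the TS flag.

```c
#include <stdbool.h>

/* Toy model only: two bits of FPU-related state, not real hardware. */
struct toy_fpu {
    bool exception_pending; /* exception left behind by the bad frstor */
    bool ts;                /* "task switched" flag */
};

/*
 * Model the kernel's fwait after the signal handler returns.
 * Returns the number of traps taken before fwait completes, or -1
 * if it has still not completed after max_traps traps.
 */
static int toy_fwait(bool clear_first, int max_traps)
{
    struct toy_fpu fpu = { .exception_pending = true, .ts = false };
    bool saved_pending = false; /* exception bit in the saved FPU state */
    int traps = 0;

    if (clear_first)
        fpu.exception_pending = false; /* the fix: clear before fwait */

    while (traps < max_traps) {
        if (fpu.ts) {
            /* TS trap: restore the saved (corrupt) state, clear TS. */
            fpu.exception_pending = saved_pending;
            fpu.ts = false;
        } else if (fpu.exception_pending) {
            /* FP trap: save the state, queue the signal, set TS. */
            saved_pending = true;
            fpu.exception_pending = false;
            fpu.ts = true;
        } else {
            return traps; /* fwait completes normally */
        }
        traps++;
    }
    return -1; /* still looping when the limit was reached */
}
```

In this model, toy_fwait(false, 1000) alternates between the two traps until the limit is reached, while toy_fwait(true, 1000) completes without taking any trap at all.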

(Thanks to Sergey Vlasov, whose analysis of the problem made this article much easier to write.)

Online resizing of ext3 filesystems

One of the patches which slipped into 2.6.7-rc3-mm2, written by Andreas Dilger and others, makes it possible to resize a running ext3 filesystem on the fly. This patch has been shipped with Fedora kernels for a little while, but has not seen a lot of wider use. That could change, of course, if the resize patch finds its way into the mainline.

The resize patch is conceptually quite simple. It adds one or more block groups which make use of extra space which, one hopes, is sitting idle at the end of the existing filesystem. Once the block groups are hooked into the filesystem data structures, an ioctl() call or remount will make the space available. Behind this apparent simplicity, of course, is a significant amount of code which makes the resize operation happen on a modern, complex filesystem in a robust manner.
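To get a feel for the granularity involved, here is a rough sketch of how many whole block groups fit into a given amount of new space. It assumes the default ext2/ext3 layout, in which a single block-sized bitmap tracks each group, so a group holds 8 × (block size) blocks. The function names are hypothetical, and the calculation ignores details such as a partial final group or space needed for new group descriptors.

```c
#include <stdint.h>

/* Blocks per group in the default layout: one bit per block in one bitmap block. */
static uint64_t blocks_per_group(uint32_t block_size)
{
    return (uint64_t)block_size * 8;
}

/* Whole block groups that fit into extra_bytes of newly available space. */
static uint64_t whole_groups_in(uint64_t extra_bytes, uint32_t block_size)
{
    uint64_t group_bytes;

    if (block_size == 0)
        return 0; /* guard against division by zero */
    group_bytes = blocks_per_group(block_size) * block_size;
    return extra_bytes / group_bytes;
}
```

With 4096-byte blocks, each group covers 32768 blocks (128MB), so growing the underlying device by 1GB leaves room for eight new groups under these assumptions.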

People wanting to try out resizing will need a few things:

A kernel (such as 2.6.7-rc3-mm2) with the online resize patch included.
A patch to e2fsprogs to make use of the resize capability; it is available from the ext2resize SourceForge download area.
Free disk space into which the filesystem can expand. Usually this means that the filesystem should live in a device mapper partition which can be expanded as well.
A very good backup of your filesystem.

This patch and its associated documentation (or lack thereof) still require some work before being ready for widespread deployment. Once they get there, however, life should get easier for system administrators who, throughout history, have routinely found out that all that "extra space" they figured into their filesystems is never enough.

On the alignment of IP packets

Device drivers for network interfaces must allocate a "socket buffer" ("skb") for each incoming packet. A standard idiom in the skb allocation code is a line like this:

    skb_reserve(skb, 2);

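This call tells the socket buffer code to set aside the first two bytes of the data buffer. The reason for this becomes clear from the resulting layout of an IP packet in the buffer: an ethernet header is 14 bytes long, so the two bytes of padding cause the IP header which follows it to begin at offset 16.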

The network stack makes frequent use of the IP addresses stored in the packet. By padding the beginning of an ethernet-style packet by two bytes, a network driver can cause those addresses to be aligned on a four-byte boundary. On some architectures, at least, that alignment will speed access to the addresses and make the networking system faster.
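The arithmetic behind the two-byte pad can be sketched as follows. The macro and function names are made up for this example; the header sizes are the standard ethernet and IPv4 ones, and the buffer itself is assumed to start on a four-byte boundary.

```c
#include <stddef.h>

#define ETH_HEADER_LEN   14  /* destination MAC (6) + source MAC (6) + type (2) */
#define IP_SADDR_OFFSET  12  /* source address within the IPv4 header */
#define IP_DADDR_OFFSET  16  /* destination address within the IPv4 header */

/* Offset of the IPv4 source address from the start of the data buffer. */
static size_t saddr_offset(size_t pad)
{
    return pad + ETH_HEADER_LEN + IP_SADDR_OFFSET;
}

/* Offset of the IPv4 destination address from the start of the data buffer. */
static size_t daddr_offset(size_t pad)
{
    return pad + ETH_HEADER_LEN + IP_DADDR_OFFSET;
}

/* Nonzero if the offset falls on a four-byte boundary. */
static int is_aligned4(size_t offset)
{
    return offset % 4 == 0;
}
```

With a pad of two bytes, the addresses land at offsets 28 and 32, both multiples of four; with no pad they land at 26 and 30, which are not.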

Or so it might seem. As Anton Blanchard recently figured out, this padding is not always helpful. A number of modern architectures (Anton works with PPC64, but Intel-style architectures qualify too) have no real problem with unaligned memory accesses, so the two-byte offset on IP packets does not necessarily help things. Unfortunately, the DMA engines in a number of systems do have trouble working with unaligned addresses. A padded packet buffer does not start on an aligned address, with the result that DMA operations to that buffer can be slower than they should be. As network adapters get faster, the DMA performance penalty becomes increasingly significant.

Anton's proposal was to change the skb_reserve() calls into calls to a new skb_align() function, which could, depending on the architecture, decide whether to insert the padding or not. David Miller pointed out, however, that the magic constant "2" appears in quite a few places, and simply removing the padding could create bugs elsewhere in the driver code.

The real solution is likely to be the addition of a defined constant called something like NET_IP_ALIGN; this constant would be the amount of padding needed for packet alignment on the current architecture. Yes, things probably should have been done that way from the beginning, but life is like that. In any case, once the constant is in, each individual driver can be looked over and fixed up as need be. And one small obstacle to top performance on high-end network adapters will have been removed.
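A sketch of how such a constant might be used, written as stand-alone user-space C rather than driver code. The default value of 2, the idea of overriding it per architecture at build time, and the toy_skb structure with its helpers are all assumptions for illustration; they are not the kernel's struct sk_buff or its API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: an architecture could override this at build time.
 * 2 keeps the traditional padding; 0 would favor DMA alignment instead.
 */
#ifndef NET_IP_ALIGN
#define NET_IP_ALIGN 2
#endif

/* Toy stand-in for a socket buffer; not the kernel's struct sk_buff. */
struct toy_skb {
    unsigned char buf[2048];
    size_t data; /* offset where packet data begins */
    size_t tail; /* offset just past the end of the data */
};

static void toy_skb_init(struct toy_skb *skb)
{
    skb->data = 0;
    skb->tail = 0;
}

/* Set aside len bytes of headroom in an empty buffer. */
static void toy_skb_reserve(struct toy_skb *skb, size_t len)
{
    assert(skb->data == skb->tail);              /* buffer must be empty */
    assert(skb->tail + len <= sizeof(skb->buf)); /* and large enough */
    skb->data += len;
    skb->tail += len;
}

/* What a receive path might do instead of hard-coding the constant 2. */
static size_t toy_rx_prepare(struct toy_skb *skb)
{
    toy_skb_init(skb);
    toy_skb_reserve(skb, NET_IP_ALIGN);
    return skb->data;
}
```

The point of the design is that the driver no longer encodes a decision about alignment; changing the padding for an architecture becomes a one-line change to the constant, provided that every other use of the magic "2" in the driver has been converted as well.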

Patches and updates

Linus Torvalds: Linux 2.6.7
Con Kolivas: 2.6.7-ck1
Andrew Morton: 2.6.7-rc3-mm2
Andrew Rodland: -ar patchset
Marcelo Tosatti: Linux 2.4.27-pre6
Andi Kleen: x86_64-2.6.7-1 released
Sam Ravnborg: kbuild
Sam Ravnborg: kbuild: default kernel image
Sam Ravnborg: kbuild: move rpm to scripts/package
Sam Ravnborg: kbuild: add deb-pkg target
Sam Ravnborg: kbuild: make clean improved
Sam Ravnborg: kbuild: external module build doc
Geoff Levand: high-res-timers patches for 2.6.6
Peter Williams: Single Priority Array CPU Scheduler
Con Kolivas: Staircase scheduler v6.E for 2.6.7-rc3
Takao Indoh: Diskdump Update
David Howells: R/W semaphore tester module
Randy.Dunlap: kmsgdump updated for 2.6.6
Davide Libenzi: qlnx-psets configuration tool
Vladislav Bolkhovitin: [ANNOUNCE] Generic SCSI Target Middle Level for Linux (SCST) with target drivers
James Bottomley: SCSI updates for 2.6.7
Manish Kumar Bhojasia: iscsi driver 4.0.1.7
Pat LaVarre: Linux fs/ udf/ walk thru posted
David Howells: Permit inode & dentry hash tables to be allocated > MAX_ORDER size
Stas Sergeev: expandable anonymous shared mappings
Arthur Kepner: lockless loopback patch for 2.6 (version 2)
Netfilter Core Team: Release of iptables-1.2.10
Huagang Xie: LIDS 2.2.0rc1 for kernel 2.6.6 is out
Atul Sabharwal: Non Invasive Kernel Monitor for threads/processes
James Morris: Fine-grained Netlink support
James Morris: [SELINUX][PATCH 1/4] Fine-grained Netlink support - SELinux headers update
James Morris: [SELINUX][PATCH 2/4] Fine-grained Netlink support - move security_netlink_send() hook.
James Morris: [SELINUX][PATCH 3/4] Fine-grained Netlink support - add sk to netlink_send hook
James Morris: Fine-grained Netlink support - SELinux changes
christophe.varoqui@free.fr: multipath-tools-0.2.3
