

Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.26-rc3, released by Linus on May 18. Lots of fixes, of course, and things are stabilizing (though the list of regressions remains long). Linus also notes that the kernel developers have now been using git for as long as they used BitKeeper - but there are a lot more developers now. As always, the long-format changelog has the details.

As of this writing, almost 300 changesets have been merged into the mainline git repository since 2.6.26-rc3. They include a new test driver for MMC memory cards, a new device_create_drvdata() function (intended to fix a race condition caused by the previous separation of device_create() and dev_set_drvdata()), a USB wireless device management driver, and a lot of fixes.

The current stable 2.6 kernel is 2.6.25.4, released on May 15. It contains a fairly long list of fixes, a couple of which are security-related.

Comments (none posted)

Kernel development news

Quotes of the week

So you can either try to drink from the firehose and inevitably be bitched about because you're holding something up or not giving something the attention it deserves, or you can try to make sure that you can let others help you. And you'd better select the "let other people help you", because otherwise you _will_ burn out. It's not a matter of "if", but of "when".
-- Linus Torvalds on git workflows (worth reading in its entirety)

I have spoken with engineers both individual and within companies who have developed and who plan to develop substantial kernel features. I'm forever explaining to people why they should work to get that code merged up. One reason for their not yet having done so which comes up again and again is apprehension at the reception they will receive. In public. This problem appears to be especially strong in Asian countries. You have just made the situation worse.

But it's not just a self-interest thing. It is inevitably and unavoidably the case that when one senior kernel developer acts like an arrogant hostile dickhead, we will all be increasingly regarded as arrogant hostile dickheads.

-- Andrew Morton

I suppose alternately I could send another patch to remove "remember that ext3/4 by default offers higher data integrity guarantees than most." from Documentation/filesystems/ext4.txt ;)
-- Eric Sandeen

Comments (2 posted)

Appropriate sources of entropy

By Jake Edge
May 21, 2008

A steady stream of random events allows the kernel to keep its entropy pool stocked up, which in turn allows processes to use the strongest random numbers that Linux can provide. Exactly which events qualify as random—and just how much randomness they provide—is sometimes difficult to decide. A recent move to eliminate a source of contributions to the entropy pool has worried some, especially in the embedded community.

The kernel samples unpredictable events for use in generating random numbers, storing that data in the entropy pool. Entropy is a measure of the unpredictability or randomness of a data set, so the kernel estimates the amount of entropy each of those events contributes to the pool. Many kernels run on hardware that is lacking some of the traditional sources of entropy. In those cases, the timing of interrupts from network devices has been used as a source of entropy, but it has always been controversial, so it was recently proposed for removal.

Two of the best sources of random data for the entropy pool—user interaction via a keyboard or mouse and disk interrupts—are often not present in embedded devices. In addition, some disk interfaces, notably ATA, do not add entropy, which extends the problem to many "headless" servers. But network interrupts are seen as a dubious source of entropy because they may be observed, or even manipulated, by an attacker. In addition, as network traffic rises, many network drivers turn off receive interrupts from the hardware, letting the kernel poll periodically for incoming packets instead. That reduces entropy collection just when it might be most needed for encrypting the traffic.

This is not the first time eliminating the IRQF_SAMPLE_RANDOM flag from network drivers has come up; we looked at the issue two years ago (though the flag was called SA_SAMPLE_RANDOM at that time). It has come up again, starting with a query on linux-kernel from Chris Peterson: "Should network devices be allowed to contribute entropy to /dev/random?" Jeff Garzik, kernel network device driver maintainer, answered: "I tend to push people to /not/ add IRQF_SAMPLE_RANDOM to new drivers, but I'm not interested in going on a pogrom with existing code."

For anyone who is interested in such a pogrom, Peterson proposed a patch to eliminate the flag from the twelve network drivers that still use it. This sparked a long discussion on how to provide entropy for those devices that have nothing else to use. While the actual contribution of entropy from network devices is questionable, mixing that data into the pool does not harm it, as long as no entropy credit—the current estimate of entropy in the pool—is awarded. Alan Cox proposed a new flag to mark such sources:

A more interesting alternative might be to mark things like network drivers with a new flag say IRQF_SAMPLE_DUBIOUS so that users can be given a switch to enable/disable their use depending upon the environment.

Some were in favor of an approach like this, but Adrian Bunk notes that:

If he can live with dubious data he can simply use /dev/urandom .

If a customer wants to use /dev/random and demands to get dubious data there if nothing better is available fulfilling his wish only moves the security bug from his crappy application to the Linux kernel.

Part of the problem stems from a misconception about random numbers obtained from /dev/random versus those read from /dev/urandom, which we described in a Security page article last December. In general, applications should read from /dev/urandom. Only the most sensitive uses of random numbers—GPG keys, for example—need the entropy guarantee that /dev/random provides. In a system that is getting regular entropy updates, the quality of the random numbers from both sources is the same.

There is still an initialization problem for some systems, though, as Ted Ts'o points out:

Hence, if you don't think the system hasn't run long enough to collect significant entropy, you need to distinguish between "has run long enough to collect entropy which is causes the entropy credits using a somewhat estimation system where we try to be conservative such that /dev/random will let you extract the number of bits you need", and "has run long enough to collect entropy which is unpredictable by an outside attacker such that host keys generated by /dev/urandom really are secure".

A potential entropy source, even for embedded systems, is to sample other kernel and system parameters that are not predictable externally. Garzik suggests:

EGD demonstrates this, for example: http://egd.sourceforge.net/ It looks at snmp, w, last, uptime, iostats, vmstats, etc.

And there are plenty of untapped entropy sources even so, such as reading temperature sensors, fan speed sensors on variable-speed fans, etc.

Heck, "smartctl -d ata -a /dev/FOO" produces output that could be hashed and added as entropy.

Another source is hardware random number generators. The kernel already supports some, including the VIA PadLock, which seems to be well regarded. Not all processors have such support, however. The Trusted Platform Module (TPM) does include random number generation and is becoming more widespread, especially in laptops, but there is no kernel hw_random driver for TPM.

Garzik advocates adding a kernel driver for what he calls the "Treacherous Platform Module", but as others pointed out, it can all be done in user space using the TrouSerS library. Even for the hardware random number generators that are supported in the kernel there is no automatic entropy collection, as it is left up to user space to decide whether to do that. This is done to try and keep policy decisions about the quality of the random data out of kernel code.

Systems that wish to sample that data should use rngd to feed the kernel entropy pool. rngd will apply FIPS 140-2 tests to verify the randomness of the data before passing it to the kernel. Andi Kleen is not in favor of that approach:

Just think a little bit: system has no randomness source except the hardware RNG. you do your strange randomness verification. if it fails what do you do? You don't feed anything into your entropy pool and all your random output is predictable (just boot time) If you add anything predictable from another source it's still predictable, no difference.

There is concern that some of the hardware random number generators are poorly implemented or could malfunction, so it would be dangerous to automatically add that data into the pool. Doing the FIPS testing in the kernel is not an option, leaving it up to user space applications to make the decision. There is nothing stopping any superuser process from adding bits to the entropy pool—no matter how weak—but the consensus is that the kernel itself must use sources it knows it can trust.

Another instance of this problem—in a different guise—appears in a discussion about random numbers for virtualized I/O, with Garzik asking: "Has anyone yet written a "hw" RNG module for virt, that reads the host's random number pool?" Rusty Russell responded with a patch for a virtio "hardware" random number generator as well as one that adds it into his lguest hypervisor. The lguest patch reads data from the host's /dev/urandom, which is not where H. Peter Anvin thinks it should come from:

There is no point in feeding the host /dev/urandom to the guest (except for seeding, which can be handled through other means); it will do its own mixing anyway. The reason to provide anything at all from the host is to give it "golden" entropy bits.

The virtio implementation only provides the hw_random implementation, thus it requires user space help to get entropy data into the kernel. Much like any process that can read /dev/random, lguest could exhaust the host entropy pool, so there was some discussion of limiting how much random data guests can request from the device. A guest implementation could then use a small pool of entropy read from the host to seed its own random number generator for the simulated hardware device.

Removing the last remaining uses of IRQF_SAMPLE_RANDOM in network drivers seems likely, though some way to mix that data into the entropy pool without giving it any credit is still a possibility. With luck, that will encourage more effort into incorporating new sources of entropy using tools like EGD or, for systems that have it available, random number hardware. For systems that lack the traditional entropy sources, this should lead to a better initialized and fuller pool, while eliminating a potential attack by way of network packet manipulation.

Comments (33 posted)

Kill BKL Vol. 2

By Jonathan Corbet
May 21, 2008
Last week's big kernel lock article discussed a BKL-related performance regression and concluded that we would likely see a new interest in its elimination. In the intervening week, that interest has indeed come to the fore. There are now a couple of different efforts afoot to get rid of this long-lasting lock.

One might well wonder why the BKL is so persistent. Over the last (approximately) fifteen years, thousands of locks have been added to the kernel, pushing the BKL into increasingly obscure corners. But there are a lot of those corners, including a great many explicit lock_kernel() calls, the open() method for every char device, most ioctl() implementations, all fasync() implementations, and more. The BKL can be found throughout the kernel, and doesn't appear ready to go without a fight.

Part of the problem is simply that locking is hard. So going in and changing the locking of some crufty, old driver is not at the top of the list for a lot of developers, who would generally rather be creating crufty new drivers. Beyond that, though, the BKL is special. It was originally created to be more than just a locking primitive; its purpose is to allow BKL-covered code to pretend that it is still running on an old, uniprocessor system. So its semantics are very different from any other lock in the Linux kernel.

For example, the BKL nests, so programmers can add lock_kernel() calls anywhere without worrying about whether the BKL might already have been acquired elsewhere. As with a mutex, code holding the BKL can sleep; however, the scheduler will magically release the BKL until the holding thread wakes up again. So there can be various threads in kernel space, all of which think they hold the BKL, but only one of them will actually be running at any given time. The end result is that it is hard to get a handle on what is happening with the BKL at any given time; code can depend on it without ever really being aware of its existence.

As Ingo Molnar put it in his kill the BKL tree announcement:

Furthermore, the BKL is not covered by lockdep, so its dependencies are largely unknown and invisible, and it is all lost in the haze of the past ~15 years of code changes. All this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong.

That doesn't mean that people aren't willing to try; Ingo's tree - to which we will return shortly - is a major effort in that direction. But first, consider another initiative which, somewhat accidentally, turned up an example of just how subtle BKL-related issues can be. As was mentioned above, the kernel grabs the BKL whenever a process opens a char device; the BKL is held while the associated driver's open() function runs. To eliminate the BKL, one must remove this particular use of it; it cannot simply be taken out, however, without breaking every driver which lacks proper internal locking. So, in fact, this lock_kernel() call cannot be removed until every driver's open() function has been audited and, if necessary, fixed. That's a big flag day.

An alternative, which your editor rashly jumped into doing, is to push the acquisition of the BKL down one level. Every open() function is forced to be correct through the addition of explicit lock_kernel() and unlock_kernel() calls; once all of the in-tree drivers have been fixed in this way, the higher-level call in chrdev_open() can be removed. This work may seem like a step backward, in that it replaces a single lock_kernel() call with approximately 100 others. But it's actually a big step forward, in that each driver can now be audited and fixed independently. This work has now been done, the resulting tree is in linux-next, and, if all goes well, it should be ready for 2.6.27.

While doing this work, though, your editor noticed quite a few drivers with open() functions that were either completely empty (all they do is "return 0") or did something relatively trivial. These functions, one would think, do not need to acquire the BKL; they touch no global resources and cannot possibly race with any other part of the kernel. In fact, as was suggested by others, the empty open() functions could just be removed altogether.

It was Alan Cox who pointed out that life is not quite so simple. Under the current regime, an open function which looks like this:

    static int empty_open(struct inode *inode, struct file *filp)
    {
            return 0;
    }

is really better modeled as this:

    static int empty_open(struct inode *inode, struct file *filp)
    {
            lock_kernel();
            unlock_kernel();
            return 0;
    }

These two may seem the same, but there is a crucial difference: in the second form, empty_open() will not return until it can acquire the BKL. In other words, after empty_open() runs, one knows that the BKL became available at least once. And this matters: a classic device driver error is to (1) register a device with the kernel, then (2) initialize all of the internal data structures needed to manage that device. Should some other process attempt to open and use the device between those two steps, unpleasant things can happen. The lock_kernel() call in the open() function, despite protecting no critical section directly, serializes the opening of the device with the driver's initialization, and thus prevents mayhem. So, says Alan,

I think it would be best to make them lock/unlock kernel in the first pass and then work through them. The BKL can be subtle and evil, but as I brought it into the world I guess I must banish it ;)

Alan will not be alone in that effort, though, and Ingo Molnar's "kill the BKL" tree is likely to help this work considerably. Ingo's approach is to get rid of most of the features which make the BKL special. So, with his patches, the BKL becomes just another mutex which, crucially, can be tracked with the lock validator. It is no longer released when a thread calls schedule(), a change which forced the addition of a few explicit "release, schedule, and reacquire" sequences in code which would otherwise deadlock. A number of warnings have been added to point out calls made while holding the BKL which should not be. And so on.

This patch set, in essence, removes the BKL entirely, replacing it with just another big lock which happens to do nesting. And the nesting might go too at some point. So the BKL becomes more visible and easier to understand. And, presumably, easier to eliminate.

Linus likes this approach, though he would like to see it reworked to the point that it can be merged into the mainline relatively soon. Doing that would require putting most of the changes behind a configuration option decorated with a sufficient number of scary warnings; then people who wanted to test this code could turn it on and see what explodes. The number of explosions would probably be relatively small - but probably not zero.

This set of changes, along with the other work being done, suggests that significant progress toward the elimination of the BKL can be expected over the next few kernel development cycles. Once it's gone, we'll have a kernel without legacy locking issues, and without the unpleasant performance issues that the BKL can bring. That will still take a while, though; there is simply no substitute for actually looking at all the BKL-covered code and ensuring that it will run safely in the absence of that protection. It's a painstaking job requiring moderate skills which can only be rushed so much.

Comments (2 posted)

Barriers and journaling filesystems

By Jonathan Corbet
May 21, 2008
Journaling filesystems come with a big promise: they free system administrators from the need to worry about disk corruption resulting from system crashes. It is, in fact, not even necessary to run a filesystem integrity checker in such situations. The real world, of course, is a little messier than that. As a recent discussion shows, it may be even messier than many of us thought, with the integrity promises of journaling filesystems being traded off against performance.

A filesystem like ext3 works by maintaining a journal on a dedicated portion of the disk. Whenever a set of filesystem metadata changes are to be made, they are first written to the journal - without changing the rest of the filesystem. Once all of those changes have been journaled, a "commit record" is added to the journal to indicate that everything else there is valid. Only after the journal transaction has been committed in this fashion can the kernel do the real metadata writes at its leisure; should the system crash in the middle, the information needed to safely finish the job can be found in the journal. There will be no filesystem corruption caused by a partial metadata update.

There is a hitch, though: the filesystem code must, before writing the commit record, be absolutely sure that all of the transaction's information has made it to the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to get all of the journal data onto the media before writing the commit record; if the commit record gets written first, the journal may be corrupted. The kernel's block I/O subsystem makes this capability available through the use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk structures remain consistent at all times.

There is another hitch: the ext3 and ext4 filesystems, by default, do not use barriers. The option is there, but, unless the administrator has explicitly requested the use of barriers, these filesystems operate without them - though some distributions (notably SUSE) change that default. Eric Sandeen recently decided that this was not the best situation, so he submitted a patch changing the default for ext3 and ext4. That's when the discussion started.

Andrew Morton's response tells a lot about why this default is set the way it is:

Last time this came up lots of workloads slowed down by 30% so I dropped the patches in horror. I just don't think we can quietly go and slow everyone's machines down by this much...

There are no happy solutions here, and I'm inclined to let this dog remain asleep and continue to leave it up to distributors to decide what their default should be.

So barriers are disabled by default because they have a serious impact on performance. And, beyond that, the fact is that people get away with running their filesystems without using barriers. Reports of ext3 filesystem corruption are few and far between.

It turns out that the "getting away with it" factor is not just luck. Ted Ts'o explains what's going on: the journal on ext3/ext4 filesystems is normally contiguous on the physical media. The filesystem code tries to create it that way, and, since the journal is normally created at the same time as the filesystem itself, contiguous space is easy to come by. Keeping the journal together will be good for performance, but it also helps to prevent reordering. In normal usage, the commit record will land on the block just after the rest of the journal data, so there is no reason for the drive to reorder things. The commit record will naturally be written just after all of the other journal log data has made it to the media.

That said, nobody is foolish enough to claim that things will always happen that way. Disk drives have a certain well-documented tendency to stop cooperating at inopportune times. Beyond that, the journal is essentially a circular buffer; when a transaction wraps off the end, the commit record may be on an earlier block than some of the journal data. And so on. So the potential for corruption is always there; in fact, Chris Mason has a torture-test program which can make it happen fairly reliably. There can be no doubt that running without barriers is less safe than using them.

Anybody can turn on barriers if they are willing to take the performance hit. Unless, of course, their filesystem sits on an LVM volume (as certain distributions set things up by default); it turns out that the device mapper code does not pass through or honor barriers. But, for everybody else, it would be nice if that performance cost could be reduced somewhat. And it seems that might be possible.

The current ext3 code - when barriers are enabled - performs a sequence of operations like this for each transaction:

  1. The log blocks are written to the journal.
  2. A barrier operation is performed.
  3. The commit record is written.
  4. Another barrier is executed.
  5. Metadata writes begin at some later point.

On ext4, the first barrier (step 2) can be omitted because the ext4 filesystem supports checksums on the journal. If the journal log data and the commit record are reordered, and if the operation is interrupted by a crash, the journal's checksum will not match the one stored in the commit record and the transaction will be discarded. Chris Mason suggests that it would be "mostly safe" to omit that barrier with ext3 as well, with a possible exception when the journal wraps around.

Another idea for making things faster is to defer barrier operations when possible. If there is no pressing need to flush things out, a few transactions can be built up in the journal and all shoved out with a single barrier. There is also some potential for improvement by carefully ordering operations so that barriers (which are normally implemented as "flush all outstanding operations to media" requests) do not force the writing of blocks which do not have specific ordering requirements.

In summary: it looks like the time has come to figure out how to make the cost of barriers palatable. Ted Ts'o seems to feel that way:

I think we have to enable barriers for ext3/4, and then work to improve the overhead in ext4/jbd2. It's probably true that the vast majority of systems don't run under conditions similar to what Chris used to demonstrate the problem, but the default has to be filesystem safety.

Your editor's sense is that this particular dog is now wide awake and is likely to bark for some time. That may disturb some of the neighbors, but it's better than letting somebody get bitten later on.

Comments (72 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.26-rc3
Greg Kroah-Hartman Linux 2.6.25.4
Steven Rostedt 2.6.25.4-rt1 (FINALLY!!!!)
Steven Rostedt 2.6.25.4-rt2
Steven Rostedt 2.6.24.7-rt5
Steven Rostedt 2.6.24.7-rt8

Architecture-specific

Core kernel code

Development tools

Jesse Barnes handling panic under X
Peter Oberparleiter gcov kernel support

Device drivers

Filesystems and block I/O

Janitorial

Memory management

Networking

Henrique de Moraes Holschuh rfkill class rework

Security-related

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds