

Kernel development

Brief items

Kernel release status

The 3.0 kernel is not yet released as of this writing. Subtle core kernel bugs, first with the VFS (see below), then with RCU, have delayed the otherwise-ready release; history suggests it will come out immediately after the LWN Weekly Edition is published.

Stable updates: no stable updates have been released in the last week, and none are in the review process as of this writing.

Comments (1 posted)

Quotes of the week

This patch set contains fixes for a trainwreck involving RCU, the scheduler, and threaded interrupts. This trainwreck involved RCU failing to properly protect one of its bit fields, use of RCU by the scheduler from portions of irq_exit() where in_irq() returns false, uses of the scheduler by RCU colliding with uses of RCU by the scheduler, threaded interrupts exercising the problematic portions of irq_exit() more heavily, and so on.
-- Paul McKenney on why we can't have nice 3.0 things (yet)

That said, have fun and make sure that you have the fire extinguisher ready when you start using this!
-- Thomas Gleixner

Comments (none posted)

Garrett: Booting with EFI

Matthew Garrett investigates the subtleties of booting Linux with EFI. Once again, hardware vendors are myopically focusing on Windows. "As we've seen many times in the past, the only thing many hardware vendors do is check that Windows boots correctly. Which means that it's utterly unsurprising to discover that there are some systems that appear to ignore EFI boot variables and just look for the fallback bootloader instead. The fallback bootloader that has no namespacing, guaranteeing collisions if multiple operating systems are installed on the same system. [...] It could be worse. If there's already a bootloader there, Windows won't overwrite it. So things are marginally better than in the MBR [Master Boot Record] days. But the Windows bootloader won't boot Linux, so if Windows gets there first we still have problems."

Comments (25 posted)

The return of the realtime tree

Those who follow the realtime preemption patch set know that it has been stuck on 2.6.33 for some time. With the release of a new patch based on 3.0-rc7, Thomas Gleixner tells us why: the entire series has been reworked and cleaned up, a new solution to the per-CPU variable problem has been implemented, and a nasty bug held up what would otherwise have been a release based on 2.6.38. "The beast insisted on destroying filesystems with reproduction times measured in days and the total refusal to reveal at least a minimalistic hint to debug the root cause. Staring into completely useless traces for months is not a very pleasant pastime." The 3.0-rc7 version of the patch, happily, shows no such behavior.

Comments (6 posted)

IPv6 NAT

By Jonathan Corbet
July 20, 2011
One of the nice things that the IPv6 protocol was supposed to do for us was to eliminate the need for network address translation (NAT). The address space is large enough that many of the motivations for the use of NAT (lack of addresses, having to renumber networks when changing providers) are no longer present. NAT is often seen as a hack which breaks the architecture of the Internet, so there has been no shortage of people who would be happy to see it go; the IPv6 switch has often looked like the opportunity to make it happen.

So it is not surprising that, when Terry Moës posted an IPv6 NAT implementation for Linux, the first response was less than favorable. Anybody wanting to see the end of NAT is unlikely to welcome an implementation which can only serve to perpetuate its use after the IPv6 transition. The sad fact, though, is that NAT appears to be here to stay. David Miller expressed it in a typically direct manner:

People want to hide the details of the topology of their internal networks, therefore we will have NAT with ipv6 no matter what we think or feel.

Everyone needs to stop being in denial, now.

Like it or not, we will be dealing with NAT indefinitely. For those who are curious about how it might work in Linux, Terry's implementation can be found on SourceForge along with a paper describing the design of the code. Both stateless (RFC 6296) and stateful NAT are supported.

Comments (45 posted)

Kernel development news

How to ruin Linus's vacation

By Jonathan Corbet
July 19, 2011
It's all Hugh's fault.

Linus was all set to release the final 3.0 kernel when Hugh Dickins showed up on the list with a little problem: occasionally, making a full copy of the kernel source tree fails because one of the files found therein vanishes temporarily. What followed was a determined bug-chasing exercise which demonstrates how subtle and tricky some of our core code has become. The problem has been found and squashed, but there may be more.

A bit of background might help in understanding what was happening here. The 2.6.38 release included the dcache scalability patches; this code uses a number of tricks to avoid taking locks during the process of looking up file names. For the right kind of workload, the "RCU walk" method yields impressive performance improvements. But that only works if all of the relevant directory entry ("dentry") structures are in the kernel's dentry cache and the lookup process does not race with other CPUs which may be making changes on the same path. Whenever such a situation is encountered, the lookup process will fall back to the older, slower algorithm which requires locking each dentry.

The dentry cache (dcache) is a highly dynamic data structure, with dentries coming and going at all times. So one CPU might be removing a dentry at the same time that another is using it to look up a name. Chaos is avoided through the use of read-copy-update (RCU) to manage the removal of dentries; a dentry may be removed from the cache, but, if the thread using that dentry for lookup got a reference to it before its removal, the structure itself will continue to exist for as long as that thread needs it. The same should be true of the inode structure associated with that dentry.
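
A minimal, generic sketch of that removal pattern (not the dcache code itself; the names here are illustrative) looks like the usual RCU idiom:

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	/* Simplified illustration of the lifetime rule described above:
	 * an object removed from the lookup structure remains valid for
	 * readers that found it earlier, and is freed only after an RCU
	 * grace period has elapsed. */
	struct entry {
		struct rcu_head rcu;
		/* ... payload ... */
	};

	static void entry_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct entry, rcu));
	}

	static void entry_remove(struct entry *e)
	{
		/* unlink e from the lookup structure here, then: */
		call_rcu(&e->rcu, entry_free_rcu);	/* defer the free */
	}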

Hugh tracked the problem down to a bit of code in walk_component():

	err = do_lookup(nd, name, path, &inode);
	/* ... */
	if (!inode) {
		path_to_nameidata(path, nd);
		terminate_walk(nd);
		return -ENOENT;
	}

If do_lookup() returns a null inode pointer, walk_component() assumes that a "negative dentry" has been encountered. Negative dentries are kept in the dentry cache to record the fact that a specific name does not exist; they are an important performance-enhancing feature in the Linux virtual filesystem layer. To see an example, run any simple program under strace and watch how many system calls return with ENOENT; lookups on nonexistent files happen frequently. What Hugh determined was that this inode pointer was coming back null even though the file exists, leading the code to believe that a negative dentry had been found and causing the "briefly vanishing file" problem.
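
The cost being optimized away is easy to see from user space; a trivial example (plain C, with a made-up path) shows the kind of failed lookup that a negative dentry turns into a quick cache hit:

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct stat st;

		/* A lookup of a nonexistent path fails with ENOENT;
		 * repeated lookups like this are what negative dentries
		 * accelerate. */
		if (stat("/no/such/file", &st) < 0)
			printf("stat: %s\n", strerror(errno));
		return 0;
	}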

Hugh must have looked at this code for some time before concluding that the kernel must be removing the dentry from the dcache at just the wrong time during the lookup process. As described above, the dentry itself continues to exist after its removal from the cache, but that does not mean that it is unchanged: the removal process sets its d_inode pointer to NULL. (It's worth noting that this behavior goes against normal RCU practice, which calls for the structure to be preserved unmodified until the last reference is known to be gone). Hugh concluded that this null pointer was being picked up later by the lookup process, causing walk_component() to conclude that the file does not exist when all that had happened was the removal of a dentry from the cache. His problem report included a patch causing the lookup code to check much more carefully when the inode pointer comes up null.

Linus acknowledged the problem but didn't like the fix, which, he thought, was too specific to one particular situation. He proposed an alternative: just don't set d_inode to NULL; that would keep the inode pointer from picking up that value later. Al Viro posted a fix of his own which changed dcache behavior in less subtle ways, and worried about the possibility of introducing other weird bugs:

I'm not entirely convinced that it's a valid optimization in the first place (probably is, but I'm seriously scared by the complexity we already have there), and I'm really not fond of the idea of dealing with whatever subtle crap we might discover with Linus' patch. Again, dcache is not in a healthy shape right now; at this point dumb and straightforward is, IMO, better than subtle and risking to step on toes of very odd code out there...

Once we are done with code audit, sure, I'm fine with ->d_inode being kept until dentry is actually freed. Any code that relies on that thing being cleared is asking for trouble and should be rewritten anyway. The only thing is, it needs to be found before we rewrite it...

Linus didn't like Al's fix either; it threatened to force slow lookups when negative dentries are involved. The discussion of the patches went on at some length; in the process of trying to find the safest way to fix this subtle bug the participants slowly came to the realization that they did not actually know what was happening. After looking at things closely, Linus threw up his hands and admitted he didn't understand it:

So how could Hugh's NULL inode ever happen in the first place? Even with the current sources? It all looks solid to me now that I look at all the details.

As it happens, Linus's exposition was enough to point Hugh at the real problem. Just as the process of transiting through a specific dentry is almost complete, do_lookup() makes a call to __follow_mount_rcu(), whose job is to redirect the lookup process if it is passing through a mount point. The inode pointer is passed to __follow_mount_rcu() separately; Hugh noticed that this function was doing the following:

	*inode = path->dentry->d_inode;

In other words, the inode pointer is being re-fetched from the dentry structure; this assignment happens regardless of whether the dentry represents a mount point. That is the true source of the problem: if the dentry has been removed from the dcache after the lookup process gained a reference, d_inode will be NULL. So __follow_mount_rcu() will zero a pointer which had pointed to a valid inode, causing later code to think that the file does not exist at all.
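
Conceptually, the fix is simply to stop doing that assignment unconditionally; a hedged sketch of the idea (crossed_mountpoint is a made-up name, and this is not the literal patch that was applied) would be:

	/* Refresh the caller's inode pointer only if the walk really
	 * stepped onto a mounted filesystem's root, so the NULL d_inode
	 * of a concurrently-removed dentry can never overwrite the valid
	 * inode that do_lookup() already returned. */
	if (crossed_mountpoint)
		*inode = path->dentry->d_inode;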

Linus posted a fix for the real problem along with his now-famous Google+ posting saying that he was delaying the 3.0 release for a day just in case:

We have a patch, we understand the problem, and it looks ObviouslyCorrect(tm), but I don't think I want to release 3.0 just a couple of hours after applying it.

Linus delayed the release despite the inconvenient fact that it will push the 3.1 merge window into his planned vacation. That was a well-placed bit of caution on his part: the ObviouslyCorrect(tm) patch had YetAnotherSubtleBug(tm) in it. A fixed version of the patch exists, and this particular bug should, at this point, be history.

There is a sobering conclusion to be drawn from this episode, though. The behavior of the dentry cache is, at this point, so subtle that even the combined brainpower of developers like Linus, Al, and Hugh has a hard time figuring out what is going on. These same developers are visibly nervous about making changes in that part of the kernel. Our once approachable and hackable kernel has, over time, become more complex and difficult to understand. Much of that is unavoidable; the environment the kernel runs in has, itself, become much more complex over the last 20 years. But if we reach a point where almost nobody can understand, review, or fix some of our core code, we may be headed for long-term trouble.

Meanwhile, we should be able to enjoy a 3.0 release (and a 2.6.39 update) without mysteriously vanishing files. One potential short-term problem remains, though: given that the next merge window will push into Linus's vacation, there is a distinct chance that he might be more than usually grumpy with maintainers who get their pull requests in late. Wise subsystem maintainers may want to be ready to go when the merge window opens.

Comments (27 posted)

RLIMIT_NPROC and setuid()

By Jake Edge
July 20, 2011

The setuid() system call has always been something of a security problem for Linux (and other Unix systems). It interacts oddly with security and other kernel features (e.g. the unfortunately named "sendmail-capabilities bug") and is often used incorrectly in programs. But, it is part of the Unix legacy, and one that will be with us at least until the 2038 bug puts Unix systems out of their misery. A recent patch from Vasiliy Kulikov arguably shows these kinds of problems in action: weird interactions with resource limits coupled with misuse of the setuid() call.

There is a fair amount of history behind the problem that Kulikov is trying to solve. Back in 2003, programs that used setuid() to switch to a non-root user could be used to evade the limit on the number of processes that an administrator had established for that user (i.e. RLIMIT_NPROC). But that was fixed with a patch from Neil Brown that would cause the setuid() call to fail if the new user was at or above their process limit.

Unfortunately, many programs do not check the return value from calls to setuid() that are meant to reduce their privileges. That, in fact, was exactly the hole that sendmail fell into when Linux capabilities were introduced, as it did not check to see that the change to a new UID actually succeeded. Buggy programs that don't check that return value can cause fairly serious security problems because they assume their actions are limited by the reduced privileges of the switched-to user, but are actually still operating with the increased privileges (often root) that they started with. In effect, the 2003 change made it easier for attackers to cause setuid() to fail when RLIMIT_NPROC was being used.
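
The pattern those programs omit is not complicated; a minimal sketch of careful privilege dropping (the helper name and error handling are illustrative only) looks like:

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>
	#include <unistd.h>

	static void drop_privileges(uid_t uid)
	{
		/* setuid() can fail -- with EAGAIN, for example, when the
		 * target user is already at RLIMIT_NPROC -- so the return
		 * value must be checked before doing anything else. */
		if (setuid(uid) != 0) {
			perror("setuid");
			exit(EXIT_FAILURE);	/* never carry on with the old privileges */
		}
	}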

Kulikov described the problem back in June, noting that it was not a bug in Linux, but allowed buggy privileged programs to wreak havoc:

I don't consider checking RLIMIT_NPROC in setuid() as a bug (a lack of syscalls return code checking is a real bug), but as a pouring oil on the flames of programs doing poorly written privilege dropping. I believe the situation may be improved by relatively small ABI changes that shouldn't be visible to normal programs.

In the posting, he suggested two possible solutions to the problem. The first is to move the check against RLIMIT_NPROC from set_user() (a setuid() helper function) to execve() as most programs will check the status of that call (and can't really cause any harm if they don't). The other suggestion is one that was proposed by Alexander Peslyak (aka Solar Designer) in 2006 to cause a failed setuid() call to send a SIGSEGV to the process, which would presumably terminate those misbehaving programs.

The first solution is not complete because it would still allow users to violate their process limit by using programs that do a setuid() that is not followed by an execve(), but that is a sufficiently rare case that it isn't considered to be a serious problem. Peslyak's solution was seen as too big of a hammer when it was proposed, especially for programs that do check the status of setuid(), and might have proper error handling for that case.

There were no responses to his initial posting, but when he brought it back up on July 6, he was pleasantly surprised to get a positive response from Linus Torvalds:

My reaction is: "let's just remove the crazy check from set_user() entirely". If somebody has credentials to change users, they damn well have credentials to override the RLIMIT_NPROC too, and as you say, failure is likely a bigger security threat than success.

The whole point of RLIMIT_NPROC is to avoid fork-bombs. If we go over the limit for some other reason that is controlled by the super-user, who cares?

That led to the patch, which changed do_execve_common() to return an error (EAGAIN) if the user was over their process limit and removed the check from set_user(). The patch was generally well-received, though several commenters were not convinced that it should go into the -rc for 3.0 as Torvalds had suggested. In fact, as Brown dug into the patch, he saw a problem that might need addressing:

Note that there is room for a race that could have unintended consequences.

Between the 'setuid(ordinary-user)' and a subsequent 'exit()' after execve() has failed, any other process owned by the same user (and we know there are quite a few) would fail an execve() where it really should not.

Basically, the problem is that switching the process to a new user could now exceed the process limit, but that limit wouldn't actually be enforced until an execve() was done (the failure of which would presumably cause the process to exit). In the interim, any execve() from another of the user's processes would fail. It's not clear how big of a problem that is, though it could certainly lead to unexpected behavior. Brown offered up a patch that would address the problem by adding a process flag (PF_NPROC_EXCEEDED) that would be set if a setuid() caused the process to exceed RLIMIT_NPROC and would then be checked in do_execve_common(). Thus, only the execve() in the offending process would fail.

Kulikov and Peslyak liked the approach, though Peslyak was not convinced it added any real advantages over Kulikov's original patch. He also pointed out that there could be an indeterminate amount of time between the setuid() and execve(), so the RLIMIT_NPROC test should be repeated when execve() is called: "It would be surprising to see a process fail on execve() because of RLIMIT_NPROC when that limit had been reached, say, days ago and is no longer reached at the time of execve()."
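
To make the shape of the proposal concrete, here is a hedged sketch that combines Brown's flag with the re-check Peslyak asked for; it is an approximation of the approach under discussion, not the patch as actually posted:

	/* In set_user(): note that the target user is over the limit
	 * instead of failing the call outright. */
	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
	    new_user != INIT_USER)
		current->flags |= PF_NPROC_EXCEEDED;
	else
		current->flags &= ~PF_NPROC_EXCEEDED;

	/* In do_execve_common(): fail only the offending process's
	 * execve(), and only if the limit is still exceeded now. */
	if ((current->flags & PF_NPROC_EXCEEDED) &&
	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC))
		return -EAGAIN;
	current->flags &= ~PF_NPROC_EXCEEDED;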

So far, Brown has not respun the patch to add that test. There is also the question of whether the problem that Brown is concerned about needs to be addressed at all, and whether it is worth using up another process flag bit (there are currently only three left) to do so. In the end, some kind of fix is likely to go in for 3.1 given Torvalds's interest in seeing this problem with buggy programs disarmed. It's unclear which approach will win out, but either way, setuid() will no longer fail simply because the allowable number of processes has been exceeded.

As Kulikov and others noted, it is definitely not a bug in the kernel that is being fixed here. But, it is a common enough error in user-space programs—often with dire consequences—which makes it worthwhile to fix as a pro-active security measure. Peslyak listed several recent security problems that arose from programs that do not check the return value from setuid(). He also noted that the problem is not limited to setuid-root programs, as other programs that try to switch to a less privileged (or simply different) user can also cause problems when using setuid() incorrectly.

The impact of this fix is quite small, and badly written user-space programs—even those meant to run with privileges—abound, which makes this change more palatable than some other pro-active fixes. As we have seen before, setuid() is subtle and quick to anger; it can have surprising interactions with other seemingly straightforward security measures. Closing a hole with setuid(), even if the problem lives in user space, will definitely improve overall Linux security.

Comments (4 posted)

Checkpoint/restart (mostly) in user space

By Jonathan Corbet
July 19, 2011
There are numerous use cases for a checkpoint/restart capability in the kernel, but the highest level of interest continues to come from the containers area. There is clear value in being able to save the complete state of a container to a disk file and restart that container's execution at some future time, possibly on a different machine. The kernel-based checkpoint/restart patch has been discussed here a number of times, including a report from last year's Kernel Summit and a followup published shortly thereafter. In the end, the developers of this patch do not seem to have been able to convince the kernel community that the complexity of the patch is manageable and that the feature is worth merging.

As a result, there has been relatively little news from the checkpoint/restart community in recent months. That has changed, though, with the posting of a new patch by Pavel Emelyanov. Previous patches have implemented the entire checkpoint/restart process in the kernel, with the result that the patches added a lot of seemingly fragile (though the developers dispute that assessment) code into the kernel. Pavel's approach, instead, is focused on simplicity and doing as much as possible in user space.

Pavel notes in the patch introduction that almost all of the information needed to checkpoint a simple process tree can already be found in /proc; he just needs to augment that information a bit. So his patch set adds some relevant information there:

  • There is a new /proc/pid/mfd directory containing information about files mapped into the process's address space. Each virtual memory area is represented by a symbolic link whose name is the area's starting virtual address and whose target is the mapped file. The bulk of this information already exists in /proc/pid/maps, but the mfd directory collects it in a useful format and makes it possible for a checkpoint program to be sure it can open the exact same file that the process has mapped (a sketch of reading this directory appears after this list).

  • /proc/pid/status is enhanced with a line listing all of the process's children. Again, that is information which could be obtained in other ways, but having it in one spot makes life easier.

  • The big change is the addition of a /proc/pid/dump file. A process reading this file will obtain the information about the process which is not otherwise available: primarily the contents of the CPU registers and its anonymous memory.
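
As a rough illustration of how a user-space checkpointer might consume the proposed mfd directory (an interface from this patch set, not something found in mainline kernels), a sketch:

	#include <dirent.h>
	#include <limits.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* Walk the proposed /proc/<pid>/mfd directory: each entry is
	 * named after a VMA's starting address and is a symlink to the
	 * mapped file.  This assumes the interface described above. */
	static void list_mapped_files(pid_t pid)
	{
		char dir_path[PATH_MAX], link_path[PATH_MAX], target[PATH_MAX];
		struct dirent *de;
		DIR *dir;

		snprintf(dir_path, sizeof(dir_path), "/proc/%d/mfd", (int)pid);
		dir = opendir(dir_path);
		if (!dir)
			return;
		while ((de = readdir(dir)) != NULL) {
			ssize_t len;

			if (de->d_name[0] == '.')
				continue;
			snprintf(link_path, sizeof(link_path), "%s/%s",
				 dir_path, de->d_name);
			len = readlink(link_path, target, sizeof(target) - 1);
			if (len < 0)
				continue;
			target[len] = '\0';
			printf("VMA at %s maps %s\n", de->d_name, target);
		}
		closedir(dir);
	}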

The dump file has an interesting format: it looks like a new binary executable format to the kernel. Another patch in Pavel's series implements the necessary logic to execute a "program" represented in that format; it restores the register and memory contents, then resumes executing where the process was before it was checkpointed. This approach eliminates the need to add any sort of special system call to restart a process.

There is need for one other bit of support, though: checkpointed processes may become very confused if they are restarted with a different process ID than they had before. Various enhancements to (or replacements for) the clone() system call have been proposed to deal with this problem in the past. Pavel's answer is a new flag to clone(), called CLONE_CHILD_USEPID, which allows the parent process to request that a specific PID be used.

With this much support, Pavel is able to create a set of tools which can checkpoint and restart simple trees of processes. There are numerous things which are not handled; the list would include network connections, SYSV IPC, security contexts, and more. Presumably, if this patch set looks like it can be merged into the mainline, support for other types of objects can be added. Whether adding that support would cause the size and complexity of the patch to grow to the point where it rivals its predecessors remains to be seen.

Thus far, there has been little discussion of this patch set. The fact that it was posted to the containers list - not the largest or most active list in our community - will have something to do with that. The few comments which have been posted have been positive, though. If this patch is to go forward, it will need to be sent to a larger list where a wider group of developers will have the opportunity to review it. Then we'll be able to restart the whole discussion for real - and maybe actually get a solution into the kernel this time.

Comments (21 posted)

Patches and updates

Kernel trees

Thomas Gleixner: 3.0-rc7-rt0

Memory management

Marek Szyprowski: Contiguous Memory Allocator

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds