Kernel development
Brief items
Kernel release status
The 2.6.31 kernel is out, released by Linus on September 9. A few of the major features in 2.6.31 include performance counter support, the "fsnotify" notification infrastructure, kernel mode setting for ATI Radeon chipsets, the kmemleak tool, support for char drivers in user space, USB 3 support, and much more. As always, see the KernelNewbies 2.6.31 page for a much more exhaustive list. The last prepatch, 2.6.31-rc9, was released on September 5.
The current stable kernel is 2.6.30.6, released (along with 2.6.27.32 and 2.6.27.33) on September 8.
Both contain a long list of fixes, many of which are in the KVM subsystem.
Kernel development news
Quotes of the week
In brief
reflink() for 2.6.32. Joel Becker's announcement of his 2.6.32 ocfs2 merge plans included a mention that the reflink() system call would be merged alongside the ocfs2 changes. A call to reflink() creates a lightweight copy, wherein both files share the same blocks in a copy-on-write mode. The final reflink() API looks like this:
int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
              int newdirfd, const char *newpath,
              int preserve, int flags);
A call to reflink() causes newpath to look like a copy of oldpath. If preserve is REFLINK_ATTR_PRESERVE, then the entire security state of oldpath will be replicated for the new file; this is a privileged operation. Otherwise (if preserve is REFLINK_ATTR_NONE), newpath will get a new security state as if it were an entirely new file. The reflinkat() form adds the ability to supply the starting directories for relative paths and flags like the other *at() system calls. For more information, see the documentation file at the top of the reflink() patch.
Joel's patch adds reflink() support for the ocfs2 filesystem; it's not clear whether other filesystems will get reflink() support in 2.6.32 or not.
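For illustration, here is a minimal sketch of how an application might use the proposed call, following the prototypes above. Only the prototype and the REFLINK_ATTR_* names come from the patch posting; the constant values, the declaration, and the file names below are placeholders, since the interface has not yet appeared in a released kernel.

/*
 * Sketch of using the proposed reflink() call.  The declaration and the
 * constant values are placeholders standing in for whatever the patched
 * kernel headers will eventually provide.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>

#define REFLINK_ATTR_NONE     0   /* placeholder value */
#define REFLINK_ATTR_PRESERVE 1   /* placeholder value */

extern int reflink(const char *oldpath, const char *newpath, int preserve);

int main(void)
{
        /*
         * Make a lightweight, copy-on-write "copy" of a large image:
         * both names share the same blocks until one of them is
         * written, and the new file gets a fresh security state.
         */
        if (reflink("golden.img", "clone.img", REFLINK_ATTR_NONE) < 0) {
                fprintf(stderr, "reflink: %s\n", strerror(errno));
                return 1;
        }
        return 0;
}

Passing REFLINK_ATTR_PRESERVE instead would replicate the security state of the original file which, as noted above, is a privileged operation.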
A stable debugfs?. Recurring linux-kernel arguments tend to focus on vitally important issues - like where debugfs should be mounted. The official word is that it belongs on /sys/kernel/debug, but there have been ongoing problems with rogue developers mounting it in unofficial locations like /debug instead. Greg Kroah-Hartman defends /sys/kernel/debug by noting that debugfs is for kernel developers only; there's no reason for users to be interested in it.
Except, of course, that there is. The increasing utility of the ftrace framework is making it more interesting beyond kernel development circles. That led Steven Rostedt to make a suggestion:
Steven would like a new virtual filesystem for stable kernel ABIs which is easier to work with than sysfs and which can be mounted in a more typing-friendly location. Responses to the suggestion have been scarce so far; somebody will probably need to post a patch to get a real discussion going.
data=guarded. Chris Mason has posted a new version of the ext3 data=guarded mode patch. The guarded mode works to ensure that data blocks arrive on disk before any metadata changes which reference those blocks. The goal is to provide the performance benefits of the data=writeback mode while avoiding the potential information disclosure (after a crash) problems with that mode. Chris had mentioned in the past that he would like to merge this code for 2.6.32; the latest posting, though, suggests that some work still needs to be done, so it might not be ready in time.
Some notes from the BFS discussion
As was recently reported here, Con Kolivas has resurfaced with a new CPU scheduler called "BFS". This scheduler, he said, addresses the problems which ail the mainline CFS scheduler; the biggest of these, it seems, is the prioritization of "scalability" over use on normal desktop systems. BFS was meant to put the focus back on user-level systems and, perhaps, make the case for supporting multiple schedulers in the kernel.
Since then, CFS creator Ingo Molnar has responded with a series of benchmark results comparing the two schedulers. Tests included kernel build times, pipe performance, messaging performance, and an online transaction processing test; graphs were posted showing how each scheduler performed on each test. Ingo's conclusion: "Alas, as it can be seen in the graphs, i can not see any BFS performance improvements, on this box." In fact, the opposite was true: BFS generally performed worse than the mainline scheduler.
Con's answer was best described as "dismissive":
[snip lots of bullshit meaningless benchmarks showing how great cfs is and/or how bad bfs is, along with telling people they should use these artificial benchmarks to determine how good it is, demonstrating yet again why benchmarks fail the desktop]
As far as your editor can tell, Con's objections to the results mirror those heard elsewhere: Ingo chose an atypical machine for his tests, and those tests, in any case, do not really measure the performance of a scheduler in a desktop situation. The more cynical observers seem to believe that Ingo is more interested in defending the current scheduler than improving the desktop experience for "normal" users.
The machine chosen was certainly at the high end of the "desktop" scale.
A number of people thought that this box is not a typical desktop Linux system. That may indeed be true - today. But, as Ingo (among others) has pointed out, it's important to be a little ahead of the curve when designing kernel subsystems:
Btw., that's why the Linux scheduler performs so well on quad core systems today - the groundwork for that was laid two years ago when scheduler developers were testing on quads. If we discovered fundamental problems on quads _today_ it would be way too late to help Linux users.
Partly in response to the criticisms, though, Ingo reran his tests on a single quad-core system, the same type of system as Con's box. The end results were just about the same.
The hardware used is irrelevant, though, if the benchmarks are not testing performance characteristics that desktop users care about. The concern here is latency: how long it takes before a runnable process can get its work done. If latencies are too high, audio or video streams will skip, the pointer will lag the mouse, scrolling will be jerky, and Maelstrom players will lose their ships. A number of Ingo's original tests were latency-related, and he added a couple more in the second round. So it looks like the benchmarks at least tried to measure the relevant quantity.
Benchmark results are not the same as a better desktop experience, though, and a number of users are reporting a "smoother" desktop when running with BFS. On the other hand, making significant scheduler changes in response to reports of subjective "feel" is a sure recipe for trouble: if one cannot measure improvement, one not only risks failing to fix any problems, one is also at significant risk of introducing performance regressions for other users. There has to be some sort of relatively objective way to judge scheduler improvements.
The way preferred by the current scheduler maintainers is to identify causes of latencies and fix them. The kernel's infrastructure for the identification of latency problems has improved considerably over the last year or two. One useful tool is latencytop, which collects data on what is delaying applications and presents the results to the user. The ftrace tracing framework is also able to create data on the delay between when a process is awakened and when it actually gets into the CPU; see this post from Frederic Weisbecker for an overview of how these measurements can be taken.
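As a rough illustration of the quantity in question, a small userspace program can approximate wakeup latency on its own: one process blocks on a pipe, another records a timestamp and wakes it, and the difference shows how long the newly runnable process waited before it actually ran. This is only a sketch - it lumps pipe overhead in with scheduling delay and is no substitute for the ftrace and latencytop measurements described above.

/*
 * Crude userspace approximation of wakeup latency.  The child blocks in
 * read(); the parent records a timestamp and writes it into the pipe;
 * the child reports how long it took to get back onto a CPU.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static long long nsec_now(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        int p[2];
        long long sent;

        if (pipe(p))
                return 1;

        if (fork() == 0) {
                /* Child: sleep in read(), then time the wakeup. */
                if (read(p[0], &sent, sizeof(sent)) != sizeof(sent))
                        return 1;
                printf("approximate wakeup latency: %lld ns\n",
                       nsec_now() - sent);
                return 0;
        }

        usleep(100000);         /* let the child block in read() */
        sent = nsec_now();
        if (write(p[1], &sent, sizeof(sent)) != sizeof(sent))
                return 1;
        wait(NULL);
        return 0;
}

On an idle system the reported number is dominated by scheduling and pipe overhead; under load it grows with whatever latency the scheduler imposes on the newly woken process.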
If there are real latency problems remaining in the Linux scheduler - and there are enough "BFS is better" reports to suggest that there are - then using the available tools to describe them seems like the right direction to take. Once the problem is better understood, it will be possible to consider possible remedies. It may well be that the mainline scheduler can be adjusted to make those problems go away. Or, possibly, a more radical sort of approach is necessary. But, without some understanding of the problem - and associated ability to measure it - attempted fixes seem a bit like a risky shot in the dark.
Ingo welcomed Con back to the development community and invited him to help improve the Linux scheduler. This seems unlikely to happen, though. Con's way of working has never meshed well with the kernel development community, and he is showing little sign of wanting to change that situation. That is unfortunate; he is a talented developer who could do a lot to improve Linux for an important user community. The adoption of the current CFS scheduler is a direct result of his earlier work, even if he did not write the code which was actually merged. In general, though, improving Linux requires working with the Linux development community; in the absence of a desire to do that effectively, there will be severe limits on what a developer will be able to accomplish.
(See also: Frans Pop's benchmark tests, which show decidedly mixed results.)
News from the staging tree
The staging tree has made a lot of progress since it appeared in June 2008. To start with, the tree itself quickly moved into the mainline in October 2008; it also has accumulated more than 40 drivers of various sorts. Staging is an outgrowth of the Linux Driver Project that is meant to collect drivers, and other "standalone" code such as filesystems, that are not yet ready for the mainline. But it was never meant to be a "dumping ground for dead code", as staging maintainer Greg Kroah-Hartman put it in a recent status update. Code that is not being improved, so that it can move into the mainline, will be removed from the tree.
Some of the code that is, at least currently, slated for removal includes some fairly high-profile drivers, including one from Microsoft that was released with great fanfare in July. After a massive cleanup that resulted in more than 200 patches to get the code "into a semi-sane kernel coding style", Kroah-Hartman said that it may have to be removed in six months or so.
Microsoft is certainly not alone in Kroah-Hartman's report, which details the status of the tree for the upcoming 2.6.32 merge window, as several other large companies' drivers are in roughly the same boat. Drivers for Android hardware (staging/android) and Intel's Management Engine Interface (MEI) hardware (staging/heci), among others, were called out in the report. Both are slated for removal, android in 2.6.32 and heci in 2.6.33 (presumably). The latter provides an excellent example of how not to do Linux driver development.
Kroah-Hartman's lengthy report covers more than just drivers that may be removed; it also looks at those that have made progress, including some that should be moving to the mainline, as well as new drivers that are being added to staging. But the list of drivers that aren't being actively worked on is roughly as long as the other two lists combined, which is clearly suboptimal.
Presumably to see if folks read all the way through, Kroah-Hartman sprinkles a few laughs in an otherwise dry summary. For the me4000 and meilhaus drivers, he notes that there is no reason to continue those drivers "except to watch the RT guys squirm as they try to figure out the byzantine locking and build logic here (which certainly does count for something, cheap entertainment is always good.)"
He also notes several drivers that are in the inactive category, but are quite close to being merge-worthy. He suggests that developers looking for a way to contribute consider drivers such as asus_oled (Asus OLED display), frontier (Frontier digital audio workstation controller), line6 (PODxt Pro audio effects modeler), mimio (Mimio Xi interactive whiteboard), and panel (parallel port LCD/keypad). Each of those should be relatively easy to get into shape for inclusion in the mainline.
There are a fair number of new drivers being added for 2.6.32, including the Microsoft Hyper-V drivers (staging/hv) mentioned earlier, as well as VME bus drivers (staging/vme), the industrial I/O subsystem (staging/iio), and several wireless drivers (VIA vt6655 and vt6656, Realtek rtl8192e, and Ralink 3090). Also, "another COW driver" is being added: the Cowloop copy-on-write pseudo block driver (staging/cowloop).
Two of Evgeniy Polyakov's projects, mistakenly listed in the "new driver" section though they were added in 2.6.30, were also mentioned. The distributed storage (DST) network block device (staging/dst), which Kroah-Hartman notes may be "dead", is a candidate for removal, while the distributed filesystem POHMELFS (staging/pohmelfs) is mostly being worked on out-of-tree. Polyakov agrees that DST is not needed in the mainline, but is wondering about moving POHMELFS out of staging and into fs/. Since there are extensive changes on the way for POHMELFS, it is unlikely to move out of staging for another few kernel releases at least.
There was also praise for various drivers which have been actively worked on over the last few months. Bartlomiej Zolnierkiewicz was singled out for his work on the rt* and rtl* wireless drivers (which put him atop the list of most active 2.6.31 developers), along with Alan Cox for his work on the et131x driver for the Agere gigabit Ethernet adapter. Johannes Berg noted that much of Zolnierkiewicz's work on the rt* drivers "will have been in vain" because of the progress being made by the rt2x00 project. But that doesn't faze Zolnierkiewicz:
In the meantime (before clean and proper support becomes useful) Linux users are provided with the possibility to use their hardware before it becomes obsolete.
At least one developer stepped up to work on one of the inactive drivers (asus_oled) in the thread. In addition, Willy Tarreau mentioned that he had heard from another who was working on panel, telling Kroah-Hartman: "This proves that the principle of the staging tree seems to work".
Overall, the staging tree seems to be doing exactly what Kroah-Hartman and others envisioned. Adding staging into the mainline, which raised the profile and availability of those drivers, has led to a fair amount of cleanup work, some of which has resulted in the drivers themselves moving out of staging and into the mainline. Some drivers seem to be falling by the wayside, but one would guess that Kroah-Hartman would welcome them back into the tree should anyone show up to work on them. In the meantime, the code certainly hasn't suffered from whatever fixes various kernel hackers found time to do. Those changes will be waiting for anyone who wants to pick that code back up, even if it is no longer part of staging.
POSIX v. reality: A position on O_PONIES
Sure, programmers (especially operating systems programmers) love their specifications. Clean, well-defined interfaces are a key element of scalable software development. But what is it about file systems, POSIX, and when file data is guaranteed to hit permanent storage that brings out the POSIX fundamentalist in all of us? The recent fsync()/rename()/O_PONIES
controversy was the most heated in recent memory but not out of
character for fsync()-related discussions. In this
article, we'll explore the relationship between file systems
developers, the POSIX file I/O standard, and people who just want to
store their data.
In the beginning, there was creat()
Like many practical interfaces (including HTML and TCP/IP), the POSIX file system
interface was implemented first and specified second. UNIX was
written beginning in 1969; the first release of the POSIX
specification for the UNIX file I/O interface (IEEE Standard 1003.1)
was released in 1988. Before UNIX, application access to non-volatile
storage (e.g., a spinning drum) was a decidedly application- and
hardware-specific affair. Record-based file I/O was a common paradigm,
growing naturally out of punch cards, and each kind of file was treated
differently. The new interface was designed by a few guys (Ken
Thompson, Dennis Ritchie, et alia) screwing around with their new
machine, writing an operating system that would make it easier
to, well, write more operating systems.
As we know now, the new I/O interface was a hit. It turned out to be a
portable, versatile, simple paradigm that made modular software
development much easier. It was by no means perfect, of course: a
number of warts revealed themselves over time, not all of which were
removed before the interface was codified into the POSIX
specification. One example is directory hard links, which permit the
creation of a directory cycle - a directory that is a descendant of
itself - and its subsequent detachment from the file system hierarchy,
resulting in allocated but inaccessible directories and files.
Recording the time of the last access - atime - turns every read
into a tiny write. And don't forget the apocryphal quote from Ken
Thompson when asked if he'd do anything differently if he were
designing UNIX today: "If I had to do it over again? Hmm... I guess
I'd spell 'creat' with an 'e'". (That's the creat()
system call to create a new file.) But overall, the UNIX file system
interface is a huge success.
POSIX file I/O today: Ponies and fsync()
Over time, various more-or-less portable additions have accreted around the standard set of POSIX file I/O interfaces; they have been occasionally standardized and added to the canon - revelations from latter-day prophets. Some examples off the top of my head include pread()/pwrite(), direct I/O, file preallocation, extended attributes, access control lists (ACLs) of every stripe and color, and a vast array of mount-time options. While these additions are often debated and implemented in incompatible forms, in most cases no one is trying to oppose them purely on the basis of not being present in a standard written in 1988. Similarly, there is relatively little debate about refusing to conform to some of the more brain-dead POSIX details, such as the aforementioned directory hard link feature.
Why, then, does the topic of when file system data is guaranteed to be
"on disk" suddenly turn file systems developers into pedantic
POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes
down to this: Waiting for data to actually hit disk before returning
from a system call is a losing game for file system performance. As
the most extreme example, the original synchronous version of the UNIX
file system frequently used only 3-5% of the disk throughput. Nearly
every file system performance improvement since then has been
primarily the result of saving up writes so that we can allocate and
write them out as a group. As file systems developers, we are going
to look for every loophole in fsync() and squirm our way
through it.
Fortunately for the file systems developers, the POSIX specification
is so very minimal that it doesn't even mention the topic of file
system behavior after a system crash. After all, the original
FFS-style file systems (e.g., ext2) can theoretically lose your entire
file system after a crash, and are still POSIX-compliant. Ironically,
as file systems developers, we spend 90% of our brain power coming up
with ways to quickly recover file system consistency after system
crash! No wonder file systems users are irked when we define file
system metadata as important enough to keep consistent, but not file
data - we take care of our own so well. File systems developers have
magnanimously conceded, though, that on return
from fsync(), and only from fsync(), and
only on a file system with the right mount options, the changes to
that file will be available if the system crashes after that point.
At the same time, fsync() is often more expensive than it
absolutely needs to be. The easiest way to
implement fsync() is to force out every outstanding write
to the file system, regardless of whether it is a journaling file
system, a COW file system, or a file system with no crash recovery
mechanism whatsoever. This is because it is very difficult to map
backward from a given file to the dirty file system blocks needing to
be written to disk in order to create a consistent file system
containing those changes. For example, the block containing the
bitmap for newly allocated file data blocks may also have been changed
by a later allocation for a different file, which then requires that
we also write out the indirect blocks pointing to the data for that
second file, which changes another bitmap block... When you solve the
problem of tracing specific dependencies of any particular write, you
end up with the complexity
of soft updates. No
surprise then, that most file systems take the brute force approach,
with the result that fsync() commonly takes time
proportional to all outstanding writes to the file system.
So, now we have the following situation: fsync() is
required to guarantee that file data is on stable storage, but it may
perform arbitrarily poorly, depending on what other activity is going
on in the file system. Given this situation, application developers
came to rely on what is, on the face of it, a completely reasonable
assumption: rename() of one file over another will either
result in the contents of the old file, or the contents of the new
file as of the time of the rename(). This is a subtle
and interesting optimization: rather than asking the file system to
synchronously write the data, it is instead a request to order the
writes to the file system. Ordering writes is far easier for the file
system to do efficiently than synchronous writes.
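To make that concrete, here is a minimal sketch of the atomic-replace idiom at issue; the file names are illustrative. The fsync() call is the step that file system developers ask applications to add, and the step that an ordering guarantee from rename() would make unnecessary for the old-contents-or-new-contents result.

/*
 * The atomic-replace idiom under discussion (file names illustrative).
 * Intent: after a crash, "config" contains either the complete old
 * contents or the complete new contents - never an empty or partial file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_config(const char *data, size_t len)
{
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, data, len) != (ssize_t)len) {
                close(fd);
                return -1;
        }
        /*
         * The step file system developers ask for: push the new file's
         * data to disk before the rename makes it visible.  If rename()
         * implied ordering, this synchronous wait would not be needed.
         */
        if (fsync(fd)) {
                close(fd);
                return -1;
        }
        close(fd);
        /* Atomically replace the old file with the new one. */
        return rename("config.tmp", "config");
}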
However, the ordering effect of rename() turns out to be
a file system specific implementation side effect. It only works when
changes to the file data in the file system are ordered with respect
to changes in the file system metadata. In ext3/4, this is only true
when the file system is mounted with the data=ordered
mount option - a name which hopefully makes more sense now! Up until
recently, data=ordered was the default journal mode for
ext3, which, in turn, was the default file system for Linux; as a result,
ext3 data=ordered was all that
many Linux application developers had any experience with. During the
Great File System Upheaval of 2.6.30, the default journal mode for
ext3 changed to data=writeback, which means that file
data will get written to disk when the file system feels like it, very
likely after the file's metadata specifying where its contents are
located has been written to disk. This not only breaks
the rename() ordering assumption, but also means that the
newly renamed file may contain arbitrary garbage - or a copy
of /etc/shadow, making this a security hole as well as a
data corruption problem.
Which brings us to the present
day fsync/rename/O_PONIES
controversy, in which many file systems developers argue that
applications should explicitly call fsync() before
renaming a file if they want the file's data to be on disk before the
rename takes effect - a position which seems bizarre and random until
you understand the individual decisions, each perfectly reasonable,
that piled up to create the current situation. Personally, as a file
systems developer, I think it is counterproductive to replace a
performance-friendly implicit ordering request in the form of
a rename() with an impossible to
optimize fsync(). It may not be POSIX, but the
programmer's intent is clear - no one ever, ever wrote
"creat(); write(); close(); rename();" and hoped they
would get an empty file if the system crashed during the next 5
minutes. That's what truncate() is for. A generalized
"O_PONIES do-what-I-want" flag is indeed not possible,
but in this case, it is to the file systems developers' benefit to
extend the semantics of rename() to imply ordering so
that we reduce the number of fsync() calls we have to cope
with. (And, I have to note, I did have a real, live pony when I was a
kid, so I tend to be on the side of giving programmers ponies when
they ask for them.)
My opinion is that POSIX and most other useful standards are helpful clarifications of existing practice, but are not sufficient when we encounter surprising new circumstances. We criticize applications developers for using folk-programming practices ("It seems to work!") and coming to rely on file system-specific side effects, but the bare POSIX specification is clearly insufficient to define useful system behavior. In cases where programmer intent is unambiguous, we should do the right thing, and put the new behavior on the list for the next standards session.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet