

Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.39-rc6, released on May 3. Linus said:

We're still chasing some stuff down, but I think we're ok for -rc6. This isn't going to be the final -rc, but it doesn't seem to be in bad shape, and people who have posted regression reports please check them again, and people who haven't, please do give it a test.

See the full changelog for all the details.

Stable updates: 2.6.38.5 (May 2), 2.6.35.13 (April 28), and 2.6.27.59 (April 30). Each contains a long list of important changes.

Comments (none posted)

Quotes of the week

Let ARM rot in mainline. I really don't care anymore.
-- Russell King

Nah, I'll leave it in so that they can have a blast from the past and can boast on the slashdot of the future or on ((LWN++)..)++ that they've fixed the oldest bug in Linux when booting it on quantum x86 computers and tried running their RAS daemon.
-- Borislav Petkov

Yes, open source programming is a team sport, but finding the right people is really the killer feature (the same is obviously true in the kernel too - I really do think we have a great set of maintainers. I may complain about them and I'm somewhat infamous for my flames when things don't work, but at the same time I'm convinced there's some of the best people out there working on maintaining the kernel).
-- Linus Torvalds

Comments (none posted)

The Open Wireless Movement kernel wiki

In response to the EFF's call for owners of wireless networks to allow open access, Luis Rodriguez has put up a wiki page to gather information on how providing open access can be made safer. Contributions of ideas and comments are welcome.

Comments (none posted)

Interview with Linus Torvalds (LinuxFR)

LinuxFR has an interview with Linus Torvalds on a wide range of subjects. "LinuxFR : What is your opinion about Android ? Are you mostly happy they made cellphones very usable or sad because it's really a kernel fork ? Linus Torvalds : I think forks are good things, they don't make me sad. A lot of Linux development has come out of forks, and it's the only way to keep developers honest - the threat that somebody else can do a better job and satisfy the market better by being different. The whole point of open source to me is really the very real ability to fork (but also the ability for all sides to then merge the forked content back, if it turns out that the fork was doing the right things!)"

Comments (65 posted)

Mob rule for dentries

By Jonathan Corbet
May 4, 2011
As was discussed at the 2011 filesystem, storage, and memory management summit, there is an increasing level of interest in restricting the amount of kernel memory which can be used by groups of processes. One area of special interest is the directory entry (dentry) cache; a malicious program can, by creating a deep enough directory hierarchy, run the kernel out of memory with an explosion of the size of the dentry cache. So limiting dentry use has some real appeal, especially for those working to ensure that containers running on a Linux system cannot interfere with each other.

Pavel Emelyanov's per-container dcache management patches are a first attempt at limiting dentry use. The patch set works by organizing dentries into "mobs": groups of dentries, all of which represent names in a specific subtree of the filesystem. If the root of a mob were the root of a container's filesystem namespace, all dentries created by that container would be contained within that mob. At that point, a simple sort of resource control can be applied: adding a dentry to a mob which has hit its maximum size requires the removal of another dentry to compensate. If no dentries can be removed, attempts to add new ones will fail.

The patch set adds three new ioctl() calls: FIMOBROOT to create a new mob at a given point in the filesystem, FIMOBSIZE to set the maximum size of a mob, and FIMOBSTAT to query the current usage of a mob. Pavel is somewhat apologetic about this interface; he seems to think it will have to change before the work can be considered for merging upstream. But the first step is to get some discussion of the concept; so far, there have been no responses to Pavel's patches.

Comments (none posted)

Kernel development news

Raw events and the perf ABI

By Jonathan Corbet
May 3, 2011
The perf events subsystem often looks like it's on the path to take over the kernel; there is a great deal of development activity there, and it has become a sort of generalized event reporting mechanism. But the original purpose of perf events was to provide access to the performance monitoring counters made available by the hardware, and it is still used to that end. The merging of perf was a bit of a hard pill for users of alternative performance monitoring tools to swallow, but they have mostly done so. The recent discussion on "offcore" events shows that there are still some things to argue about in this area, even if everybody seems likely to get what they want in the end.

The performance monitoring unit (PMU) is normally associated with the CPU; each processor has its own PMU for monitoring its own specific events. Some newer processors (such as Intel's Nehalem series) also provide a PMU which is not tied to any CPU; in the Nehalem case it's part of the "uncore" which handles memory access at the package level. The off-core PMU has a viewpoint which allows it to provide a better picture of the overall memory behavior of the system, so there is interest in gaining access to events from that PMU. Current kernels, though, do not provide access to these offcore events.

For a while, the 2.6.39-rc kernel did provide access to these events, following the merging of a patch by Andi Kleen in March. One piece that was missing, though, was a patch to the user-space perf tool to provide access to this functionality. There was an attempt to merge that piece toward the end of April, but it did not yield the desired results; rather than merge the additional change, perf maintainer Ingo Molnar removed the ability to access offcore events entirely.

Needless to say, that action has led to some unhappiness in the perf user community; there are developers who had already been making use of those events. Normally, breaking things in this way would be considered a regression, and the patch would be backed out again. But, since this functionality never appeared in a released kernel, it cannot really be called a regression. That, of course, is part of the point of removing the feature now.

Ingo's complaint is straightforward: the interface to these events was too low-level and too difficult to use. The rejected perf patch had an example command which looked like:

    perf stat -e r1b7:20ff -a sleep 1

Non-expert readers may, indeed, be forgiven for not immediately understanding that this command would monitor access to remote DRAM - memory which is hosted on a different socket. Ingo asserted that the feature should be more easily used, perhaps with a command like (from the patch removing the feature):

    perf record -e dram-remote ./myapp

He also said:

But this kind of usability is absolutely unacceptable - users should not be expected to type in magic, CPU and model specific incantations to get access to useful hardware functionality.

The proper solution is to expose useful offcore functionality via generalized events - that way users do not have to care which specific CPU model they are using, they can use the conceptual event and not some model specific quirky hexa number.

The key is the call for "generalized events" which are mapped, within the kernel, onto whatever counters the currently-running hardware uses to obtain that information. Users need not worry about the exact type of processor they are running on, and they need not dig out the data sheet to figure out what numbers will yield the results they want.

Criticism of this move has taken a few forms. Generalized events, it is said, are a fine thing to have, but they can never reflect all of the weird, hardware-specific counters that each processor may provide. These events should also be managed in user space where there is more flexibility and no need to bloat the kernel. There were some complaints about how some of the existing generalized events have not always been implemented correctly on all architectures. And, they say, there will always be people who want to know what's in a specific hardware counter without having the kernel trying to generalize it away from them. As Vince Weaver put it:

Blocking access to raw events is the wrong idea. If anything, the whole "generic events" thing in the kernel should be ditched. Wrong events are used at times (see AMD branch events a few releases back, now Nehalem cache events). This all belongs in userspace, as was pointed out at the start. The kernel has no business telling users which perf events are interesting, or limiting them!

Ingo's response is that the knowledge and techniques around performance monitoring should be concentrated in one place:

Well, the thing is, i think users are helped most if we add useful, highlevel PMU features added and not just an opaque raw event pass-through engine. The problem with lowlevel raw ABIs is that the tool space fragments into a zillion small hacks and there's no good concentration of know-how. I'd like the art of performance measurement to be generalized out, as well as it can be.

Vince, meanwhile, went on to claim that perf was a reinvention of the wheel which has ignored a lot of the experience built into its predecessors. There are, it seems, still some scars from that series of events. Thomas Gleixner disagreed with the claim that perf is an exercise in wheel reinvention, but he did say that he thought the raw events should be made available:

The problem at hand which ignited this flame war is definitely borderline and I don't agree with Ingo that it should not made be available right now in the raw form. That's an hardware enablement feature which can be useful even if tools/perf has not support for it and we have no generalized event for it. That's two different stories. perf has always allowed to use raw events and I don't see a reason why we should not do that in this case if it enables a subset of the perf userbase to make use of it.

It turns out that Ingo is fine with raw events too. His stated concern is that access to raw events should not be the primary means by which most users gain access to those performance counters. So he is blocking the availability of those events for now for two reasons. One of those is that he wants the generalized mode of access to be available first so that users will see it as the normal way to access offcore events. If there is never any need to figure out hexadecimal incantations, many user-space developers will never bother; as a result, their commands and code should eventually work on other processors as well.

The other reason for blocking raw events now is that, as the interface to these events is thought through, the ABI by which they are exposed to user space may need to change. Releasing the initial ABI in a stable kernel seems almost certain to cement it in place, given that people were already using it. By deferring these events for one cycle (somebody will certainly come up with a way to export them in 2.6.40), he hopes to avoid being stuck with a second-rate interface which has to be supported forever.

There can be no doubt that managing this feature in this way makes life harder for some developers. The kernel process can be obnoxious to deal with at times. But the hope is that doing things this way will lead to a kernel that everybody is happier with five years from now. If things work out that way, most of us can deal with a bit of frustration and a one-cycle delay now.

Comments (5 posted)

Expanding seccomp

By Jake Edge
May 4, 2011

Sandboxing processes such that they cannot make "dangerous" system calls is an attractive feature that has already been implemented in a limited way for Linux with seccomp. Two years ago, we looked at a proposal to expand seccomp to allow more fine-grained control over which system calls would be allowed. That proposal has been mostly dormant since then, but was recently resurrected after incorporating some of the suggestions made at that time. The reaction to the current proposal so far seems positive, and it might just be gaining some traction that the previous patchset lacked.

Seccomp (from "secure computing") is enabled via a prctl() call and, once enabled, restricts the process from making any further system calls beyond read(), write(), exit(), and sigreturn()—any other system call will abort the process. That creates a pretty secure sandbox, but it is also extremely limited as there are other things that developers might want to do from within such a sandbox. In fact, the Chromium browser has gone to great lengths to implement its own sandbox that uses seccomp, but expands the range of legal system calls through some contortions.

That led Adam Langley of the Chromium team to propose adding a bitmask of allowable system calls for a new seccomp mode. That would have allowed processes to make a binary choice (allowed or disallowed) for each system call. At the time, Ingo Molnar suggested using the Ftrace filter code to make the interface even more flexible by allowing filters to be applied to the system call arguments. Essentially, that would make for three choices for each system call: enabled, disabled, or filtered.

Fast-forward to today, and that is what a patchset from Will Drewry implements. It should come as no surprise that Molnar was pleased to see his idea result in working code: "Ok, color me thoroughly impressed - AFAICS you implemented my suggestions [...] and you made it work in practice!". Eric Paris was likewise impressed, noting that an expanded seccomp could be used for QEMU. Molnar and Paris did not agree about replacing the LSM approach using filters, but that was something of an aside. Serge E. Hallyn also pointed out that the new feature would be useful for containers: "to try and provide some bit of mitigation for the fact that they are sharing a kernel with the host".

The proposed interface, which is likely to change based on comments in the thread, looks like:

    const char *filters =
      "sys_read: (fd == 1) || (fd == 2)\n"
      "sys_write: (fd == 0)\n"
      "sys_exit: 1\n"
      "sys_exit_group: 1\n"
      "on_next_syscall: 1";
    prctl(PR_SET_SECCOMP, 2, filters);

That example is taken from Drewry's documentation file that accompanies the patches; the adjacent string literals concatenate into a single filter string. It would allow reading from two file descriptors (1 and 2) and writing to one (0), while allowing any calls to the two other system calls listed. The on_next_syscall means that the rules would not be enforced until after one more system call is made. That would allow a parent to fork(), set up the seccomp sandbox in the child process, then exec a new program which would be governed by the new rules.

That on_next_syscall piece drew a few comments. As it turns out, there are really only two cases that need to be handled, either the rules should go into effect immediately (for a process that wants to restrict itself before handling untrusted input for example), or they should go into effect after an exec (for a parent that is spawning an untrusted child). Making the "after exec" case the default, while still allowing a process to request immediate application, seems to be the way things are headed.

There were also questions about using kernel-internal symbol names like sys_read. Exporting those as a kernel ABI is not likely to pass muster, as it might restrict the option of changing those function names down the road—or require a messy compatibility layer if they did change. Drewry wanted to avoid using the system call numbers as Langley's original patch did, but as Frederic Weisbecker pointed out, those numbers are already part of the kernel ABI. Drewry is planning to make that switch and users of the interface will need to use the unistd.h header file or a library to map system call names to numbers.

The patches also modify the /proc/PID/status file to output any existing filters that are applied to the process. Given that most applications that read that file don't need the extra information, though, Motohiro Kosaki suggested that seccomp get its own file. Drewry's plan is to provide that information in the /proc/PID/seccomp_filter file instead, and remove it from status.

Since it uses the Ftrace infrastructure and hooks, the new seccomp mode only works for those system calls that have Ftrace events associated with them. Using one of those non-instrumented system calls in the filters will result in an EINVAL from the prctl() call. Enabling CONFIG_SECCOMP_FILTER (which depends on CONFIG_FTRACE_SYSCALLS) will allow the use of the new mode.

Overall, Drewry has been very receptive to suggestions for changes, and the feedback to the concept has been pretty uniformly positive. Molnar suggested breaking out the Ftrace filter engine further—beyond the minimal changes that Drewry's patches make—so that it would be available for more widespread use in the kernel. Molnar does wonder whether Linus Torvalds or Andrew Morton might object to more use of the filter mechanism, however: "are you guys opposed to such flexible, dynamic filters conceptually? I think we should really think hard about the actual ABI as this could easily spread to more applications than Chrome/Chromium." So far, neither has spoken up one way or the other.

Currently it would seem that Drewry is off working on the next revision of the patchset, and it certainly doesn't seem like anything that would be merged in the upcoming 2.6.40 cycle. As Molnar notes, the ABI needs to be carefully thought-out, there are still some RCU issues that are being discussed, and it probably needs some soaking time in the -next tree, but barring some major complaint cropping up, it's a feature that will likely make its way into the mainline relatively soon. While that won't allow Chromium to immediately ditch its complicated sandboxing arrangement, it may well be able to do so a few years down the road. Other applications will benefit from an expanded seccomp as well.

Comments (10 posted)

LFCS: Building the kernel with Clang

By Jake Edge
May 4, 2011

Back in October, Bryce Lelbach announced that he (and others) had built a working Linux kernel using (mostly) Clang, the LLVM-based C compiler. At the Linux Foundation Collaboration Summit (LFCS) back in April, Lelbach gave a talk about the progress that had been made, and the work still to be done, for the LLVM Linux (LLL) project. That talk, along with the rest of the LLVM track, was quite interesting, and once again showed that having two (or more) "competing" projects is generally beneficial to both.

Why build Linux with Clang?

Lelbach started off describing the reasons behind the decision to try to build Linux with Clang, most of which centered around the diagnostics that the compiler produces. The Clang static analyzer has the ability to show "what the compiler sees when it's looking at your code", he said. He thought that a huge codebase like Linux could benefit from that kind of analysis.

In fact, the Clang diagnostics were quite useful when he was building the Broadcom wireless driver for his MacBook, he said. Clang doesn't forget things, so it can show macros before their expansion, typedefs, and so on. It also shows the line in the source code with a caret pointing to the offending code, along with "fixit hints". Those hints can be automatically applied to the source code to fix the problem in question.

The project got a 2.6.36-based kernel running back in October, and now has working kernels based on .37 and .38. Neither Xen nor KVM worked at the time of the talk (Xen won't even compile), though KVM is said to work now. More than 90% of the drivers in the kernel will at least compile, and many will work. Some out-of-tree binary drivers (Broadcom, NVIDIA) will work as well. SMP versions of the kernel for both 32-bit and 64-bit x86 platforms are now working, though some of the code needs to be patched in order to build correctly.

Things that don't work

The integrated assembler (IA) for Clang does not have support for generating "real mode" code using .code16gcc directives, so the Linux boot code cannot be built using IA. There is a "nasty pile" of real mode code required to boot on x86, Lelbach said. IA is the default for recent versions of Clang, but using the GNU Assembler (gas) was required for the boot code. Adding support for an LLVM x86-16 backend is the right approach, he said, and LLVM project members in the audience agreed that it was something that could be added to IA.

The "vast majority of GCC extensions are supported" by Clang, even those which are not documented, which makes compiling the kernel much easier. Things like inline assembly, the __attribute__ and __builtin_* syntax, and so on, all just work. He expected that there might be problems with inline assembly, but that has not proven to be the case. Clang defaults to the C99 standard, though, so the gnu89 standard needs to be specified to build the kernel.

There are some GCC extensions that aren't implemented, however, including explicit register variables. That lack blocks Xen and some user-space libraries (like glibc) from compiling. There are also some "intentionally unsupported extensions", including local/nested functions, a feature used in only a single Thinkpad driver. A bigger problem is that Clang lacks support for variable-length arrays in structures (VLAIS). A declaration like:

    void f (int i) {
        struct foo_t {
            char a[i];
        } foo;
    }

cannot be compiled in Clang, though declarations like:
    void f (int i) {
        char foo[i];
    }

are perfectly acceptable. Code like the former is used in the iptables code, the kernel hashing (HMAC) routines, and some drivers. Those parts have to be patched in order to be built, he said. Once again, someone from the audience piped up to say that support for VLAIS could be added as long as the patches were not "wildly invasive". The LLL folks "prefer adding things to Clang rather than patching the kernel", Lelbach said.

That led to a question about whether the project was pushing any of its patches upstream to the kernel. Lelbach said that the PaX team (which also contributes to LLL) had submitted a few, but those were rejected; "after three, we stopped" submitting them. Part of the problem is that the patches are not ready for inclusion because there is a lack of developer time to get them into shape. As an audience member noted, though, the kernel folks are quick to take any patches that fix bugs found by Clang.

Code generation and optimization problems

There are several code generation and optimization options for GCC that aren't supported by Clang. One of those is -mregparm that governs the number of registers used to pass integer arguments. That means calls to functions like memcpy() are generated that ignore the custom calling conventions.

Also, -fcall-saved-reg is not supported by Clang and that affects the uses of the ALTERNATIVE() macro in the kernel, which chooses between assembly instructions depending on the processor model. For some of the __arch_hweight*() implementations ALTERNATIVE() buries the actual function call inside assembly code, so Clang doesn't know about it. That means that the generated code is not saving all of the registers that it needs to, so uses of ALTERNATIVE() are commented out and a normal call to the function is used instead.

Another problem is with -pg, which enables instrumentation code for function calls in GCC, and is used when building Ftrace. For inline functions, the calls to mcount() get added multiple times, both when the code is generated and when it is expanded inline. The no_instrument_function attribute is not properly propagated to inline functions, he said.

The final problem that Lelbach mentioned is the -fno-optimize-sibling-calls flag that is not supported by Clang. The flag disables tail call elimination, and the kernel introspection code (like Ftrace) assumes specific stack depths in various places. Because Clang doesn't support the flag, code which walks the call stack can end up dereferencing user-space pointers, which leads to runtime crashes. This was worked around by defining HAVE_ARCH_CALLER_ADDR for x86 and defining CALLER_ADDR[1-6] as dummy values, effectively disabling the stack backtracing.

It is not just Lelbach who is working on LLL; he noted that the PaX team, Alp Toker, and Török Edwin have all contributed, along with various Clang/LLVM and Linux kernel hackers. There are plans to create a mailing list for the project and the beginnings of a wiki are taking shape. Overall, it's an interesting project that will likely end up helping to find bugs in the kernel while discovering features that could or should be supported by LLVM/Clang.

[ Thanks to Bryce Lelbach, PaX team, and Török Edwin for filling in holes in my notes. ]

Comments (5 posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Security-related

Benchmarks and bugs

Miscellaneous

Karel Zak: util-linux v2.19.1

Page editor: Jonathan Corbet
Next page: Distributions>>


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds