
Kernel development

Brief items

Kernel release status

The current development kernel is 3.15-rc6, released on May 22. "With rc5 being a couple of days early, and rc6 being several days late, we had almost two weeks in between them. The size of the result is not twice as large, though, hopefully partially because it's getting late in the rc series and things are supposed to be calming down, but presumably also because some submaintainers just didn't send their pull requests because they knew I was off-line. Whatever the reason, things don't look bad." Linus plans to return to the normal Sunday schedule for rc7, presumably on June 1, which might be the last rc for 3.15.

Stable updates: 3.4.91 was released on May 18 with a handful of important fixes, including one (known) security fix. For those who like their kernels especially stable, 2.6.32.62 was released on May 21; it includes fixes for 39 CVE-numbered bugs.


Quotes of the week (code simplicity edition)

I've already been eyed by vulturous frozen sharks flying in circles above me lately after a few overengineering visions.
Frederic Weisbecker

Given that I haven't heard too many people complaining that RCU is too simple, I would like to opt out of runtime changes to the nohz_full mask.
Paul McKenney. Proponents of more RCU complexity need to start making themselves heard.

Putting MM stuff into core waitqueue code is rather bad. I really don't know how I'm going to explain this to my family.
Andrew Morton (Thanks to Vlastimil Babka).


Tux3 posted for review

By Jonathan Corbet
May 21, 2014
After years of development and some seeming false starts, the Tux3 filesystem has been posted for review with the hope of getting it into the mainline in the near future. Tux3, covered here in 2008, promises a number of interesting next-generation filesystem features combined with a high level of reliability. This posting is a step forward for Tux3, but it will still probably be some time before it finds its way into the mainline.

The only developer to review the code so far is Dave Chinner, and he was not entirely impressed. There is a lot of stuff to clean up, but Dave is most concerned about various core memory management and filesystem changes that, he says, need to be separated out for review on their own merits. One of the core Tux3 mechanisms, called "page forking," was not well received at the 2013 Storage, Filesystem and Memory Management Summit, and Tux3 developer Daniel Phillips has done little since then to address the criticisms heard there.

Dave is also worried about the "work in progress" nature of a number of promised Tux3 features. Years ago, Btrfs was merged while in an incomplete state in the hope of accelerating development; Dave now says that was a mistake he does not want to repeat:

The development of btrfs has shown that moving prototype filesystems into the main kernel tree does not lead to stability, performance or production readiness any faster than if they stayed as an out-of-tree module until most of the development was complete. If anything, merging into mainline reduces the speed at which a filesystem can be brought to being feature complete and production ready.

All told, it adds up to a chilly reception for this new filesystem. Daniel appears to be up to the challenge of getting this code into shape for merging, though. If he follows through, we should start seeing smaller patch sets that will facilitate the review of specific Tux3-related changes. Only after that process completes will it be time to look at getting the filesystem itself into the mainline.


Kernel development news

2038 is closer than it seems

By Jonathan Corbet
May 21, 2014
Most LWN readers are likely aware of the doom impending upon us in January 2038, when the time_t type used to store time values (in the form of seconds since January 1, 1970) runs out of bits on 32-bit systems. It may be surprising that developers are increasingly worried about this deadline, which is still nearly 24 years in the future, but there are good reasons to be concerned now. Whether those worries will lead to a solution in the near term remains to be seen; not much has happened since this topic came up last August. But recent discussions have at least shed a bit of light on the forms such a solution might take.

At times, developers have hoped that this problem might solve itself. On 64-bit systems, the time_t type has always been defined as a 64-bit quantity and will not run out of space anytime soon. Given that 64-bit systems appear to be taking over the world — even phone handsets seem likely to make the switch in the next few years — might the best solution be to just wait for 32-bit systems to die out and take the problem with them? A "no action required" solution has an obvious appeal.

There are two problems with that reasoning: (1) 32-bit systems are likely to continue to be made for far longer than most people might expect, and (2) there are 32-bit systems being deployed now that can be expected to have lifetimes of 24 years or longer. 32-bit systems will be useful as cheap microcontrollers for a long time, and, once deployed, they will often be expected to work for many years while being difficult or impossible to update. There are almost certainly systems already deployed that are going to provide unpleasant surprises in 2038.
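
The arithmetic behind the deadline is easy to demonstrate. Here is a minimal user-space sketch (run on a 64-bit system, so that the wrapped value can still be passed to ctime()) showing the dates on either side of the 32-bit boundary:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        /* The largest second count a signed 32-bit time_t can hold;
         * it corresponds to 03:14:07 UTC on January 19, 2038. */
        time_t last = (time_t)INT32_MAX;
        printf("last 32-bit second: %s", ctime(&last));

        /* One tick later, a 32-bit counter wraps to INT32_MIN, which
         * decodes as 20:45:52 UTC on December 13, 1901. */
        time_t wrapped = (time_t)INT32_MIN;
        printf("after the wrap:     %s", ctime(&wrapped));
        return 0;
    }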

Kernel-based solutions

So it would appear to make sense to solve the problem soon, rather than in, say, 2036 or so. There is only one snag: the problem is not all that easy to solve. At least, it is not easy if one is concerned about little details like not breaking existing programs. Since Linux developers at most levels are quite concerned about compatibility, the simplest solutions (such as a BSD-style ABI break) are not seen as being workable. In a recent discussion, John Stultz outlined a couple of alternative approaches, neither of which is without its difficulties.

The first approach would be to change the 32-bit ABI to use a 64-bit version of time_t (related data structures like struct timespec and struct timeval would also change). Old binaries could be supported through a compatibility interface, but newly compiled code would normally use the new ABI. There are some advantages to this approach, starting with the fact that lots of applications could be updated simply by rebuilding them. Since a couple of BSD variants have already taken this path, a number of the worst application problems have already been fixed. Embedded microcontrollers typically run custom distributions built entirely from source; changing the ABI in this way would make it possible to build 2038-capable systems in the near future with a minimum of pain.

On the other hand, the kernel would have to maintain a significant compatibility layer for a long time. Developers are also worried that there will be many applications that store 32-bit time_t values in their own data structures, in on-disk formats, and more. Many of these applications could break in surprising ways, and they could prove to be difficult to fix. There are also some concerns about the runtime cost of using 64-bit time_t values on 32-bit systems. Much of this cost could be mitigated within the kernel by using a different format internally, but applications could slow down as well.

The alternative approach is to simply define a new set of system calls, all of which are defined to use better time formats from the beginning. The new formats could address other irritations at the same time; not everybody likes the separate seconds and nanoseconds fields used in struct timespec, for example. All system calls defined to use the old time_t values would be deprecated, with the idea of removing them, if possible, before 2038.
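
As a sketch of what such an interface might look like (the names and layout here are illustrative assumptions, not something settled in the discussion):

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical 2038-safe time structure: a signed 64-bit seconds
     * field is good for roughly 292 billion years in either direction. */
    struct timespec64 {
        int64_t tv_sec;   /* seconds since January 1, 1970 */
        int32_t tv_nsec;  /* nanoseconds, 0..999999999 */
    };

    /* Hypothetical replacement system calls; the existing
     * clock_gettime() and friends would be deprecated alongside. */
    int clock_gettime64(clockid_t clock_id, struct timespec64 *tp);
    int clock_settime64(clockid_t clock_id, const struct timespec64 *tp);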

With this approach, there would be no hard ABI break anytime soon and applications could be migrated gradually. Once again, embedded systems could be built using the new system calls in the relatively near future, while desktop systems could be left alone for another decade or so. And it would be a chance to start over and redesign some longstanding system calls with 21st-century needs in mind.

Defining new system calls has its downsides as well, though. It would push Linux further away from being a POSIX system, and would take us down a path different from the one chosen by the BSD world. There are a lot of system calls to replace, and time_t values show up in other places as well, most notably in a long list of ioctl() calls. Applications would have to be updated, including those running only on 64-bit systems, which would not see much of a benefit from the new system calls. And, undoubtedly, there would be lots of applications using the older system calls that would surface in 2037. So this approach is not an easy solution either.

Including glibc

Discussions of these alternatives went on for a surprisingly long time before Christoph Hellwig made an (in retrospect) obvious suggestion: the C library developers are going to have to be involved in the implementation of any real solution to the year-2038 problem, so perhaps they should be part of the discussion now. For years, communications between the kernel community and the developers of C libraries (including the GNU C library — glibc) have been sporadic at best. The changing of the guard at glibc has made productive conversations easier to have, but changing old habits has proved hard. In any case, it is true that the glibc developers will have to be involved in the design of the solution to this problem; the good news is that such involvement appears likely to happen.

Glibc developers are not known for their love of ABI breaks — or of non-POSIX interfaces for that matter. So, once glibc developer Joseph Myers joined the conversation, the tone shifted a bit toward a solution that would allow a smooth transition while retaining existing POSIX system calls and application compatibility. The plan (which was discussed only in rough form and would need a lot of work yet) looks something like this:

  • Create new, 64-bit versions of the affected system calls. So, for example, there would be a gettimeofday64() that returns the time in a struct timeval64. The existing versions of these system calls would be unchanged.

  • Glibc would gain a new feature test macro with a name like _TIME_BITS. If _TIME_BITS=64 on a 32-bit system, a call to gettimeofday() would be remapped to gettimeofday64() within the library. So applications could opt into the new world by building with an appropriate value of _TIME_BITS defined (see the sketch after this list).

  • Eventually, _TIME_BITS=64 would become the default, probably after distributions had been shipping in that mode for a while. Even in the 64-bit configuration, compatibility symbols would remain so that older binaries would still work against newer versions of the C library.
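
The idea mirrors glibc's earlier large-file transition, where building with _FILE_OFFSET_BITS=64 transparently remaps off_t and the related calls to their 64-bit variants. Under the proposed scheme (hypothetical here, since none of it is implemented yet), a trivial test program would show the difference:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        /* Under the proposal, compiling this on a 32-bit system with
         *     cc -D_TIME_BITS=64 demo.c
         * would print 8 instead of 4, and time() would be remapped to
         * its 64-bit variant inside the C library. */
        printf("sizeof(time_t) = %zu\n", sizeof(time_t));
        printf("seconds since the epoch: %lld\n", (long long)time(NULL));
        return 0;
    }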

Such an approach could allow for a relatively smooth transition to a system that will work in 2038, though, naturally, a number of troublesome details remain. There was talk of remapping ioctl() calls in a similar way, but that looks like a recipe for trouble given just how many of those calls there are and how hard it would be to even find them all. Developers in other C library projects, who often don't wish to maintain the sort of extensive compatibility infrastructure found in glibc, may wish to take a different approach. And so on.

But, even with its challenges, the existence of a vague plan hashed out with participation from kernel and glibc developers is reason for hope. Maybe, just maybe, some sort of reasonably robust solution to the 2038 problem will be found before it becomes absolutely urgent, and, with luck, before lots of systems that will need to function properly in 2038 are deployed. We have the opportunity to avoid a year-2038 panic at a relatively low cost; if we make use of that opportunity, our future selves will thank us.


Taking the Eudyptula Challenge

By Jake Edge
May 21, 2014

Linux kernel development is tricky to learn. Though there are lots of resources available covering many of the procedural and technical facets of kernel development as well as a mailing list for kernel newbies, it can still be difficult to figure out how to get started. So some kind of "challenge" may be just what the penguin ordered: a series of increasingly difficult, focused tasks targeted at kernel development. As it turns out, the Eudyptula Challenge—which has been attracting potential kernel hackers since February—provides exactly that.

As the challenge web page indicates, it was inspired by the Matasano Crypto Challenge (which, sadly, appears to be severely backlogged and not responding to requests to join). The name of the challenge comes from the genus of the Little Penguin (Eudyptula minor) and the pseudonymous person (people?) behind the challenge goes by "Little Penguin" (or "Little" for short).

Getting started

Signing up for the challenge is easy; just send an email to little at eudyptula-challenge.org. In fact, it is so easy that more than 4700 people have done so, according to Little, which is many more than were expected. The first message one receives establishes the "rules" of the challenge as well as assigning an ID used in all of the subsequent email correspondence with Little. After that, the first task of the challenge is sent.

All of the interaction takes place via email, just like Linux kernel development. For the most part, the tasks just require locally building, changing, and running recent kernels. There are a few tasks where participants will need to interact with the wider kernel community, however. A recent rash of cleanup patches in the staging tree may quite possibly be related to the challenge.

I started the challenge on March 3 after seeing a Google+ post about it. I have done some kernel programming along the way. I did a couple of minor patches a few years back and some more extensive hacking many moons ago—back in the 2.4 days—and some before that as well. So I was by no means a complete novice. I have a good background in C programming and try to keep up with the Kernel page of a certain weekly publication. While all of that was helpful, it still took a fair amount of effort to complete the twenty tasks that make up the challenge—but it was also a lot of fun, and I learned a ton in the process.

While I won't be revealing the details on any of the tasks (and Little tells me that some have been tweaked over time and may be different than the ones I did), I can describe the general nature and subject areas for the tasks. It should be no surprise that the first task is a kernel-style "hello world" of sorts. From there, things get progressively harder, in general, and the final task was fairly seriously ... well ... challenging.
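
While the task texts stay secret, the kernel-style "hello world" itself is no secret; whatever the exact wording of the first task, the canonical starting point is a loadable module along these lines:

    #include <linux/init.h>
    #include <linux/module.h>

    /* Log a message when the module is loaded, and another
     * when it is removed. */
    static int __init hello_init(void)
    {
        pr_info("Hello, world!\n");
        return 0;
    }

    static void __exit hello_exit(void)
    {
        pr_info("Goodbye, world!\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal hello-world module");

A two-line makefile (obj-m := hello.o, plus an invocation of the kernel's build system via make -C /lib/modules/$(uname -r)/build M=$PWD modules) builds it; insmod loads it, and dmesg shows the output.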

Building both mainline and linux-next kernels figures into most of the tasks. Many of the tasks also require using the assigned ID in some way. Not surprisingly, kernel modules figure prominently in many of the tasks. As you progress through the early tasks, you will learn about creating devices, debugfs, and sysfs. In addition, kernel coding style, checkpatch.pl, sparse, and how to submit patches are all featured.

In the later tasks, things like locking, kernel data structures, different kinds of memory allocation, kernel threads and wait queues, as well as a few big subsystems, such as networking and filesystems, are encountered. It is, in some sense, a grand tour of the kernel and how to write code for it. Which is not to say that it covers everything—that might require a tad more than twenty tasks—but it does get you into lots of different kernel areas. There are even plans to add more tasks once there is a good chunk of people (say 40–50) who have completed the challenge, Little said. At the current rate, that will be several months out.

Participant progress

The challenge is not a race of any kind—unless, perhaps, you are trying to complete it so you can write an article about it—but at least twenty people have completed it in the three months since it was announced. Given the nature of the challenge, it's a bit difficult to pin down how many are still actively working on it, but the statistics on how many have completed certain tasks can give a bit of a picture.

Roughly three-quarters of the 4700 who have signed up have never completed the first task. Presumably many of those were just curious about the challenge without any real intention to participate. But it is entirely possible that some of those folks are still working on that first task and may eventually work their way through the whole thing. On the other hand, that means that around 1200 folks have gotten a module built and loaded into their kernel, which is quite an accomplishment for those participants—and for the challenge.

The bulk of the participants are still working on tasks up through number six and there are fewer than 100 working on tasks ten and higher. Some of that may be due to some tasks' queues being slower than others. There is manual review that needs to be done for each task, and some tasks take longer than others to review. It might not have been designed that way, but that sort of simulates the review process for kernel patches, which features unknown wait times. Longer queues may also be caused by participants having to try several times on certain tasks, while other tasks are commonly "solved" the first time.

Often the tasks are described loosely enough that there are many different ways to complete them. Some solutions are better than others, of course, and Little will patiently point participants in the right direction when they choose incorrectly. I found that the more open-ended the task, the more difficult it was for me—not to complete it so much as to choose what to work on. That may just be a character flaw on my part, however. The final task was also a significant jump in difficulty, I thought.

Documentation

One thing that participants are likely to notice is how scattered kernel documentation is. There is, of course, the Documentation directory, but it doesn't cover everything and was actually of fairly limited use to me during the challenge. Google is generally helpful, but there is a huge amount of information out there in various forums, mailing lists, blog posts, weekly news magazines, and so on, some of which is good and useful, some of which is old and possibly semi-useful, and some of which is just plain misleading or wrong. It's not clear what to do about that problem, but it is something that participants (and others learning about the kernel) will encounter.

Based on an April 24 status report that Little sent out, there have been some growing pains. It's clear that the challenge is far more popular than was expected. That has led to longer wait times and some misbehavior from the "convoluted shell scripts that are slow to anger and impossible to debug" (as they are described on the web page). There have also been problems with folks expecting a tutorial, rather than a challenge that will take some real work to complete. Lastly, Little mentioned some people trying to crash or crack the scripts, which is kind of sad, but probably not unexpected.

In any case, the Eudyptula Challenge is a lot of fun, which makes it a great way to learn about kernel development. While it is targeted at relative newbies to the kernel, it wouldn't shock me if even seasoned kernel hackers learned a thing or two along the way. It is self-paced, so you can put in as much or as little time as you wish. There is no pressure (unless self-imposed) to complete anything by any particular deadline—the kernel will still be there in a week or a month or a year. Give the challenge a whirl ... it will be time well spent.


BPF: the universal in-kernel virtual machine

By Jonathan Corbet
May 21, 2014
Much of the recent discussion regarding the Ktap dynamic tracing system was focused on the addition of a Lua interpreter and virtual machine to the kernel. Virtual machines seem like an inappropriate component to be running in kernel space. But, in truth, the kernel already contains more than one virtual machine. One of those, the BPF interpreter, has been growing in features and performance; it now looks to be taking on roles beyond its original purpose. In the process, it may result in a net reduction in interpreter code in the kernel.

"BPF" originally stood for "Berkeley packet filter"; it got its start as a simple language for writing packet-filtering code for utilities like tcpdump. Support for BPF in Linux was added by Jay Schulist for the 2.5 development kernel; for most of the time since then, the BPF interpreter has been relatively static, seeing only a few performance tweaks and the addition of a few instructions for access to packet data. Things started to change in the 3.0 release, when Eric Dumazet added a just-in-time compiler to the BPF interpreter. In the 3.4 kernel, the "secure computing" (seccomp) facility was enhanced to support a user-supplied filter for system calls; that filter, too, is written in the BPF language.

The 3.15 kernel sees another significant change in BPF. The language has now been split into two variants, "classic BPF" and "internal BPF". The latter expands the set of available registers from two to ten, adds a number of instructions that closely match real hardware instructions, implements 64-bit registers, makes it possible for BPF programs to call a (rigidly controlled) set of kernel functions, and more. Internal BPF is more readily compiled into fast machine code and makes it easier to hook BPF into other subsystems.

For now, at least, internal BPF is entirely hidden from user space. The packet filtering and secure computing interfaces still accept programs in the classic BPF language; these programs are translated into internal BPF before their first execution. The idea seems to be that internal BPF is a kernel-specific implementation detail that might change over time, so chances are it will not be exposed to user space anytime soon. That said, the documentation for internal BPF indicates that one of the goals of the project is to be easier for compilers like GCC and LLVM to generate. Given that any developer attempting to embed LLVM into the kernel has a rather small chance of success, that suggests that there may eventually be a way to load internal BPF directly from user space.

This latter-day work has been done by Alexei Starovoitov, who looks set to continue improving BPF going forward. In 3.15, the just-in-time compiler only understands the classic BPF instruction set; in 3.16, it will be ported over to the internal format instead. Also, for the first time, the secure computing subsystem will be able to take advantage of the just-in-time compiler, speeding the execution of sandboxed programs considerably.

Sometime after 3.16, use of BPF may be extended further beyond the networking subsystem. Alexei recently posted a patch that uses BPF for tracing filters. This is an interesting change that deletes almost as much code as it adds while improving performance considerably.

The kernel's tracepoint mechanism allows a suitably privileged user to receive detailed tracing information every time execution hits a specific tracepoint in the kernel. As one might imagine, the amount of data that results from some tracepoints can be quite large. The NSA might be able to process such fire-hose-like streams at its new data center (once it's running), but many of the rest of us are likely to want to thin that stream down to something a bit more manageable. That is where the filtering mechanism comes in.

Filters allow the association of a boolean expression with any given tracepoint; the tracepoint only fires if the expression evaluates to true at execution time. An example given in Documentation/trace/events.txt reads like this:

    # cd /sys/kernel/debug/tracing/events/signal/signal_generate
    # echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter

With this filter in place, the signal_generate tracepoint will only fire if the specific signal being generated is within the given range and the process generating the signal is not running bash.

Within the tracing subsystem, an expression like the above is parsed and represented as a simple tree with each internal node representing one of the operators. Every time that the tracepoint is encountered, that tree will be walked to evaluate each operation with the specific data values present at the time; should the result be true at the top of the tree, the tracepoint fires and the relevant information is emitted. In other words, the tracing subsystem contains a small parser and interpreter of its own, used for this one specific purpose.
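
Conceptually, that interpreter resembles the following sketch (a simplification for illustration; the kernel's actual predicate structures and field handling are more involved):

    /* Simplified illustration of a predicate-tree interpreter;
     * not the kernel's actual code. */
    enum op { OP_AND, OP_OR, OP_EQ, OP_NE, OP_GE, OP_LT };

    struct pred {
        enum op op;
        struct pred *left, *right;   /* for AND/OR nodes */
        int field_offset;            /* for leaf comparisons */
        long value;
    };

    static long get_field(const void *rec, int offset)
    {
        return *(const long *)((const char *)rec + offset);
    }

    static int eval(const struct pred *p, const void *rec)
    {
        switch (p->op) {
        case OP_AND: return eval(p->left, rec) && eval(p->right, rec);
        case OP_OR:  return eval(p->left, rec) || eval(p->right, rec);
        case OP_EQ:  return get_field(rec, p->field_offset) == p->value;
        case OP_NE:  return get_field(rec, p->field_offset) != p->value;
        case OP_GE:  return get_field(rec, p->field_offset) >= p->value;
        case OP_LT:  return get_field(rec, p->field_offset) <  p->value;
        }
        return 0;
    }

Every tracepoint hit pays for a full recursive eval() walk; compiling the same tree to native code via BPF turns that per-event interpretation into a handful of straight-line instructions.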

Alexei's patch leaves the parser in place, but removes the interpreter. Instead, the predicate tree produced by the parser is translated into an internal BPF program, then discarded. The BPF is translated to machine code by the just-in-time compiler; the result is then run whenever the tracepoint is encountered. From the benchmarks posted by Alexei with the patch, the result is worth the effort: the execution time for most filters is reduced by a factor of approximately twenty — and sometimes quite a bit more. Given that the overhead of tracing can often hide the very problems that tracing is being used to find, a huge reduction in that overhead can only be welcome.

The patch set was indeed welcomed, but it is unlikely to find its way into the 3.16 kernel. It currently depends on the other BPF changes bound for 3.16, which are merged into the net-next tree; that tree is not normally used as a dependency for changes elsewhere in the kernel. As a result, merging Alexei's changes into the tracing tree creates compilation failures — an unwelcome result.

The root problem here is that the BPF code, showing its origins, is buried deeply within the networking subsystem. But usage of BPF is no longer limited to networking code; it is being employed in core kernel subsystems like secure computing and tracing as well. So the time has come for BPF to move into a more central location where it can be maintained independently of the networking code. This change is likely to involve more than just a simple file move; there is still a lot of networking-specific code in the BPF interpreter that probably needs to be factored out. It will be a bit of work, but that is normal for a subsystem that is being evolved into a more generally useful facility.

Until that work is done, BPF-related changes to non-networking code are going to be difficult to merge. So that is the logical next step if BPF is to become the primary virtual machine for interpreted code loaded into the kernel. It makes sense to have only one such machine that, presumably, is well debugged and maintained. There are no other credible contenders for that role, so BPF is almost certainly it, once it has been repackaged as a utility for the whole kernel to use. After that happens, it will be interesting to see what other users for BPF come out of the woodwork.


Patches and updates

Kernel trees

Linus Torvalds Linux 3.15-rc6
Jiri Slaby Linux 3.12.20
Greg KH Linux 3.4.91
Ben Hutchings Linux 3.2.59
Willy Tarreau Linux 2.6.32.62

Architecture-specific

Core kernel code

Device drivers

Documentation

Michael Kerrisk (man-pages) man-pages-3.66 is released

Filesystems and block I/O

Daniel Phillips Tux3 for review

Memory management

Networking

Security-related

Miscellaneous

David Heidelberger iputils-s20140519

Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds