Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.37-rc2, released on November 15. "And it all looks the way I like to see my -rc2's: nothing really interesting there." It's mostly fixes, but there's also some residual big kernel lock removal work, the final removal of hard barrier support from the block layer, and a couple of new LED drivers. See the full changelog for the details.

A significant driver API change was merged after the -rc2 release: the SCSI midlayer queuecommand() function is now invoked without the host lock; the function's prototype has changed as well.

Stable updates: there have been no stable updates released since October 29.


Quotes of the week

We *all* want to build infrastructure; when other coders are forced to use it we rise up the kernel dominance hierarchy. Ook ook! (Every Unix app has its own config language for the same reason: the author distills the mental sweat of the users into some kind of Elixir of Coder Hubris).

Yet abstractions obfuscate: let's resist our primal urges to add another speed hump on the lengthening road to kernel expertise.

-- Rusty Russell

Finally, the whole "user space is more flexible" is just a lie. It simply doesn't end up being true. It will be _harder_ to configure some user-space daemon than it is to just set a flag in /sys or whatever. The "flexibility" tends to be more a flexibility to get things wrong than any actual advantage.

-- Linus Torvalds

Our real problem with tracing is lack of relevance, lack of utility, lack of punch-through analytical power.

-- Ingo Molnar


Coccinelle workshop: January 26, 2011

Julia Lawall has announced that a Coccinelle workshop will be held in Copenhagen on January 26, 2011. "I expect that the program will consist of some presentations about Coccinelle and associated tools, as well as some time for discussions and practical experiments." Anybody who is interested in attending should drop her a note.


Announcing a new utility: 'trace'

A group of Linux tracing developers has announced the creation of a new top-level command, called simply "trace." "After years of efforts we have not succeeded in meeting (let alone exceeding) the utility of decades-old user-space tracing tools such as strace - except for a few new good tools such as PowerTop and LatencyTop. 'trace' is our shot at improving the situation: it aims at providing a simple to use and straightforward tracing tool based on the perf infrastructure and on the well-known perf profiling workflow." Obtaining the tool requires fetching a git tree for now.


Simple user-space tracing

One gets the sense that an extended tracing hacking session has been going on. Ingo Molnar has posted a simple patch to support user-space tracing. It is currently implemented as an extension to the prctl() system call which allows an application to inject tracing data into the kernel, where it will be properly mixed with kernel events. With some suitable user-space work (making DTrace tracepoints use this facility, for example), Linux may finally be on a path toward having proper integrated user- and kernel-space tracing.


Punching holes in files

By Jonathan Corbet
November 17, 2010

The XFS and OCFS2 filesystems currently have the ability to "punch a hole" in a file - a portion of the file can be marked as unwanted and the associated storage released. Josef Bacik, noting that this capability may be added to other filesystems in the near future, came to the conclusion that the kernel should offer a standard interface for hole punching. The result is an extension to the fallocate() system call adding that ability.

In particular, this patch adds a new flag (FALLOC_FL_PUNCH_HOLE) which is recognized by the system call. If the underlying filesystem is able to perform the operation, the indicated range of data will be removed from the file; otherwise ENOTSUPP will be returned. The current implementation will not change the size of the file; if the final blocks of the file are "punched" out, the file will retain the same length. There has been some discussion of whether changing the size of the file should be supported, but the consensus seems to be that, for now, changing the file size would create more problems than it would solve.
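As a rough illustration of what a caller might look like, here is a minimal sketch that invokes fallocate() through ctypes. The flag values come from <linux/falloc.h>; the helper name punch_hole() is made up for this example, and the code assumes a Linux C library with a 64-bit off_t. Mainline kernels that support hole punching require FALLOC_FL_PUNCH_HOLE to be combined with FALLOC_FL_KEEP_SIZE, which matches the "file size does not change" behavior described above.

```python
import ctypes
import errno
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Ask the filesystem to release the storage behind
    [offset, offset + length) without changing the file size.

    Hypothetical helper; raises OSError if the C library, kernel,
    or filesystem cannot perform the operation.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fallocate = getattr(libc, "fallocate", None)
    if fallocate is None:
        raise OSError(errno.ENOSYS, "fallocate() not available")
    # Assumption: off_t is 64 bits wide on this platform.
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A caller should be prepared for the OSError case, since not every filesystem implements the operation.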


Kernel development news

TTY-based group scheduling

By Jonathan Corbet
November 17, 2010

As long as we have desktop systems, there will almost certainly be concerns about desktop interactivity. Many complex schemes for improving interactivity have come and gone over the years; most of them seem to leave at least a subset of users unsatisfied. Miracle cures are hard to come by, but it seems that a recent patch has come close, at least for some users. Interestingly, it is a conceptually simple solution that may not need to be in the kernel at all.

The core idea behind the completely fair scheduler is its complete fairness: if there are N processes competing for the CPU, each with equal priority, then each will get 1/N of the available CPU time. This policy replaced the rather complicated "interactivity" heuristics found in the O(1) scheduler; it yields better desktop response in most situations. There are places where this approach falls down, though. If a user is running ten instances of the compiler with make -j 10 along with one video playback application, each process will get a "fair" 9% of the CPU. That 9% may not be enough to provide the video experience that the user was hoping for. So it is not surprising that many users see "fairness" differently; wouldn't it be nice if the compilation job as a whole got 50%, while the video application got the other half?

The kernel has been able to implement that kind of fairness for years through a feature known as group scheduling. A set of processes placed within a group will each get a fair share of the CPU time allocated to the group as a whole, but groups will, themselves, compete for a fair share of the CPU. So, if the video player were to be placed in one group and the compilation in another, each group would get half of the available processor time. The various processes doing the compilation would then get a fair share of their group's half; they will compete with each other, but not with the video player. This arrangement will ensure that the video player gets enough CPU time to keep up with the stream and any interactivity requirements.
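The arithmetic behind the two notions of fairness can be worked through in a few lines. This is only a sketch of the idealized shares for equal-priority, always-runnable tasks; the function and task names are invented for the example and have nothing to do with the scheduler's actual code.

```python
from fractions import Fraction


def flat_shares(tasks):
    """Every task competes directly: each gets 1/N of the CPU."""
    share = Fraction(1, len(tasks))
    return {task: share for task in tasks}


def grouped_shares(groups):
    """Groups split the CPU evenly; members split their group's share."""
    group_share = Fraction(1, len(groups))
    shares = {}
    for members in groups.values():
        for task in members:
            shares[task] = group_share / len(members)
    return shares


compilers = [f"cc{i}" for i in range(10)]
flat = flat_shares(compilers + ["video"])
grouped = grouped_shares({"build": compilers, "player": ["video"]})

print(flat["video"])     # 1/11, roughly 9%
print(grouped["video"])  # 1/2
print(grouped["cc0"])    # 1/20, each compiler's slice of the build group
```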

Groups are thus a nice feature, but they have not seen heavy use since they were merged for the 2.6.24 release. The reasons for that are clear: groups require administrative work and root privileges to set up; most users do not know how to tweak the knobs and would really rather not learn. What has been missing all these years is a way to make group scheduling "just work" for ordinary users. That is the goal of Mike Galbraith's per-TTY task groups patch.

In short, this patch automatically creates a group attached to each TTY in the system. All processes with a given TTY as their controlling terminal will be placed in the appropriate group; the group scheduling code can then share time between groups of processes as determined by their controlling terminals. A compilation job is typically started by typing "make" in a terminal emulator window; that job will have a different controlling TTY than the video player, which may not have a controlling terminal at all. So the end result is that per-TTY grouping automatically separates tasks run in terminals from those run via the window system.
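The grouping criterion can be approximated from user space, since each process's controlling terminal is visible as the tty_nr field of /proc/<pid>/stat. The sketch below only illustrates the partitioning idea; it is not how the patch is implemented, and the helper names are made up.

```python
from collections import defaultdict


def tty_nr_from_stat(stat_line):
    """Extract tty_nr (field 7) from a /proc/<pid>/stat line.

    The command name is parenthesized and may itself contain spaces
    or ')', so split after the last ')'.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest holds: state, ppid, pgrp, session, tty_nr, ...
    return int(rest[4])


def group_by_tty(stat_lines):
    """Map each controlling-terminal number to the PIDs attached to it.

    A tty_nr of 0 means the process has no controlling terminal.
    """
    groups = defaultdict(list)
    for line in stat_lines:
        pid = int(line.split(" ", 1)[0])
        groups[tty_nr_from_stat(line)].append(pid)
    return dict(groups)
```

Fed with the stat lines of a make job started in a terminal window and a video player launched from the desktop, the two would end up under different keys, which is the separation the patch relies on.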

This behavior makes Linus happy; Linus, after all, is just the sort of person who might try to sneak in a quick video while waiting for a highly-parallel kernel compilation. He said:

So I think this is firmly one of those "real improvement" patches. Good job. Group scheduling goes from "useful for some specific server loads" to "that's a killer feature".

Others have also reported significant improvements in desktop response, so this feature looks like one which has a better-than-average chance of getting into the mainline in the next merge window. There are, however, a few voices of dissent, most of whom think that the TTY is the wrong marker to use when placing processes in groups.

Most outspoken - as he often is - is Lennart Poettering, who asserted that "Binding something like this to TTYs is just backwards"; he would rather see something which is based on sessions. And, he said, all of this could better be done in user space. Linus was, to put it politely, unimpressed, but Lennart came back with a few lines of bash scripting which achieves the same result as Mike's patch - with no kernel patching required at all. It turns out that working with control groups is not necessarily that hard.
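Lennart's script is not reproduced here, but the general shape of a user-space approach can be sketched. The example assumes a version-1 control group hierarchy with the cpu controller mounted somewhere writable by the user (the mount point and directory layout are assumptions, and the function name is hypothetical): a new group directory is created and the process's ID is written to that group's "tasks" file.

```python
import os


def move_to_own_group(cgroup_root, pid):
    """Create a per-process group under cgroup_root and move pid into it.

    cgroup_root is assumed to be a directory in a mounted cpu control
    group hierarchy; on a real cgroup filesystem the kernel provides
    the "tasks" file once the directory exists.
    """
    group_dir = os.path.join(cgroup_root, str(pid))
    os.makedirs(group_dir, mode=0o700, exist_ok=True)
    with open(os.path.join(group_dir, "tasks"), "w") as tasks:
        tasks.write(f"{pid}\n")
    return group_dir
```

Run from a shell startup file with the shell's own PID, something along these lines would put each interactive shell, and everything started from it, into a separate scheduling group.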

Linus, however, still likes the kernel version, mainly because it can be made to "just work" with no user intervention required at all:

Put another way: if we find a better way to do something, we should _not_ say "well, if users want it, they can do this <technical thing here>". If it really is a better way to do something, we should just do it. Requiring user setup is _not_ a feature.

In other words, an improvement that just comes with a new kernel is likely to be available to more users than something which requires each user to make a (one-time) manual change.

Lennart isn't buying it. A real user-space solution, he says, would not come in the form of a requirement that users edit their .bashrc files; it, too, would be in a form that "just works." It should come as little surprise that the form he envisions is systemd; it seems that future plans involve systemd taking over session management, at which time per-session group scheduling will be easy to achieve. He believes that this solution will be more flexible; it will be able to group processes in ways which make more sense for "normal desktop users" than TTY-based grouping. It also will not require a kernel upgrade to take effect.

Another idea which has been raised is to add a "run in separate group" option to desktop application launchers, giving users an easy way to control how the partitioning is done.

Linus seems to be holding his line on the kernel version of the patch:

Anyway, I find it depressing that now that this is solved, people come out of the woodwork and say "hey you could do this". Where were you guys a year ago or more?

Tough. I found out that I can solve it using cgroups, I asked people to comment and help, and I think the kernel approach is wonderful and _way_ simpler than the scripts I've seen. Yes, I'm biased ("kernels are easy - user space maintenance is a big pain").

The next merge window is not due until January, though; that is a fair amount of time for people to demonstrate other approaches. If a solution based in user space turns out to be more flexible and effective in the long run, it may yet prevail. That is especially true because merging Mike's patch does not in any way inhibit user-space solutions; if a systemd-based approach shows better results, that may be what the distributors decide to enable. One way or the other, it seems like better interactive response is coming in the near future.


The media controller subsystem

By Jonathan Corbet
November 16, 2010

Over the course of the last decade, video acquisition hardware has evolved from relatively rare, bulky, external devices to being a standard feature in a large variety of gadgets. Increasingly, chipsets intended for embedded use have video support as a standard feature. This support is becoming more complex; contemporary video devices are not just frame grabbers anymore. That complexity is revealing limitations in the kernel's device model, prompting the proposal of a new "media controller" abstraction. This article will provide an overview and mild critical review of this new subsystem.

Video acquisition devices have never been entirely simple. Even a minimal camera device will usually be a composite of at least three distinct devices: a sensor, a DMA bridge to move frames between the sensor and main memory, and an I2C bus dedicated to controlling the sensor. Most devices coming onto the market now are more sophisticated than that. For example, the integrated controller in current VIA chipsets (still a very simple device) adds a "high-quality video" (HQV) unit which can perform image rotation and format conversions; that unit can be configured into or out of the processing pipeline depending on the application's needs. For a more complex example, consider the OMAP 3430, which is found in N900 phones; it has multiple video inputs, a white balance processor, a lens shading compensation processor, a resizer, and more.

Each of these components can be thought of as a separate device which can be powered up or down independently, and which, in some cases, can be configured in or out at any given time. The current V4L2 system wasn't designed with this kind of device structure in mind, and neither was the current Linux device model. An added problem is that these devices can be tied with devices managed by other subsystems - audio devices in particular - making it hard for applications to grasp the whole picture. The media controller is an attempt to rectify that situation.

The most recent version of the media controller patch was posted by Laurent Pinchart back in September; if all goes according to plan, it will be merged for 2.6.38. The patch creates a new media_device type which has the responsibility of managing the various components which make up a media-related device. These components are called "entities," and they can take many forms. Sensors, DMA engines, video processing units, focus controllers, audio devices, and more are all considered to be "entities" in this scheme.

Most entities will have at least one "pad," being a logical connection point where data can flow into or out of the device. "Data" in this sense can be multimedia data, but it might also be a control stream. Pads are exclusively input ("sink") or output ("source") ports, and an entity can have an arbitrary number of each. The final piece is called a "link"; it is a directional connection from a source pad to a sink. Links are created by the media device driver, but they can, in some cases, be enabled or disabled from user space.

Using this scheme, the simple VIA device described above could be represented with three entities and three links:

The "sensor" entity has a single source pad which can be connected, via links, to the HQV unit or directly to the DMA controller. Only one of those paths can be active at once. The HQV unit has two pads - one sink, one source - allowing it to be slotted into the video pipeline if need be. The DMA controller has a single sink pad.
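The three-entity, three-link arrangement can be modeled with a couple of small data structures. This is purely an illustrative sketch; the class and function names are invented and do not correspond to the structures in the patch.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    sinks: int = 0      # number of input pads
    sources: int = 0    # number of output pads


@dataclass
class Link:
    source: str         # entity owning the source pad
    sink: str           # entity owning the sink pad
    enabled: bool = False


sensor = Entity("sensor", sources=1)
hqv = Entity("hqv", sinks=1, sources=1)
dma = Entity("dma", sinks=1)

links = [Link("sensor", "hqv"), Link("sensor", "dma"), Link("hqv", "dma")]


def select_path(links, use_hqv):
    """Enable exactly one of the two paths from the sensor to the DMA engine."""
    for link in links:
        if link.source == "sensor":
            link.enabled = (link.sink == "hqv") == use_hqv
        else:
            link.enabled = use_hqv
    return [(l.source, l.sink) for l in links if l.enabled]


print(select_path(links, use_hqv=True))   # sensor -> hqv -> dma
print(select_path(links, use_hqv=False))  # sensor -> dma
```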

As an aside: entities also have a "group" number assigned to them; groups are intended to indicate hardware which is meant to function together. All of the units described above would probably be placed into the same group by the driver. If there were a microphone attached to the camera, then the associated audio entity would also be placed in the same group. This mechanism is intended to make it easier for applications to associate related devices with each other.

On the application side, there is a device (probably /dev/media0 or some such) which can be opened to gain access to this device. From there, the interface looks very much like the rest of V4L2 - lots of ioctl() calls to discover what is available and configure it. These calls include:

MEDIA_IOC_DEVICE_INFO returns overall information about the device: driver name, device model, etc.
MEDIA_IOC_ENUM_ENTITIES is used to iterate through all of the entities contained within the device. Information returned includes an ID number, a coarse entity type (e.g. V4L or ALSA), a subtype (few of these are defined in the patch; "sensor" is one of them), the group ID, the device number, and the numbers of pads and links.
MEDIA_IOC_ENUM_LINKS iterates through all of the links attached to source pads on a given entity. Thus, it is only possible to discover the outbound links from any entity; obtaining the whole graph requires iterating through all entities.
MEDIA_IOC_SETUP_LINK changes the properties of a specific link; in particular, it can enable or disable the link (though links can be marked "immutable" by the driver). Enabling a link will have the side effect of powering up all components reachable via that link, while disabling the last link to an entity will cause that entity to be powered down. Thus, changing the status of a link affects both the data path and the power configuration of a device.
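Because link enumeration only reports outbound links, an application that wants the whole graph has to walk every entity and invert the result itself. The sketch below shows that bookkeeping with a callable standing in for the MEDIA_IOC_ENUM_LINKS step; no real ioctl() calls are made, and all names are hypothetical.

```python
def collect_graph(entity_ids, outbound_links):
    """Build outbound and inbound link tables for a set of entities.

    outbound_links(entity_id) stands in for enumerating the links
    attached to the source pads of one entity; it returns the IDs of
    the entities on the receiving end.
    """
    outbound = {eid: list(outbound_links(eid)) for eid in entity_ids}
    inbound = {eid: [] for eid in entity_ids}
    for source, sinks in outbound.items():
        for sink in sinks:
            inbound[sink].append(source)
    return outbound, inbound


# Stand-in data for the simple VIA device: 1 = sensor, 2 = HQV, 3 = DMA.
fake_links = {1: [2, 3], 2: [3], 3: []}
outbound, inbound = collect_graph([1, 2, 3], fake_links.__getitem__)
print(inbound[3])  # the DMA engine is fed by entities 1 and 2
```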

Thus far, there have been no applications posted which actually use this framework (though a gstreamer source element is in the works). One can certainly see the utility of being able to discover and modify the configuration of a complex media device in this manner. But, at the Linux Plumbers Conference, your editor heard some concerns that the complexity of this interface could prove daunting to application developers. An application which is intended to work with a specific device (the camera application on a mobile handset, say) can be written with a relatively high level of awareness of that device and make good use of this interface. Writing an application which can make full use of any device - without requiring the developer to know about specific hardware - could be more challenging.

One other concern raised at LPC was that this functionality should really be exported via sysfs rather than through an ioctl()-based API. The information contained here would fit well within a sysfs hierarchy, with links represented by symbolic links in the filesystem. Given that the configuration interface (in its current form) changes a single bit at a time, there is no need for the sort of transactional functionality that can make ioctl() preferable to sysfs. On the other hand, V4L2 applications are already a big mass of ioctl() calls; the media controller API will be a natural fit while rooting through sysfs would be a new experience for V4L2 developers.

Something else is worth thinking about here: the problem may be bigger than just media devices. More complex devices are the norm, and it is becoming clear that the kernel's hierarchical device model is not up to the task of representing the structure of our systems. Back in 2009, Rafael Wysocki proposed a mechanism for representing power-management dependencies with explicit links. The media controller mechanism looks quite similar; it is even being used for power management purposes. That suggests that we should be looking for a data structure which can represent device connections and dependencies across the kernel, not just in one subsystem. Otherwise we run the risk of creating duplicated structures and multiple user-space ABIs, all of which must be supported indefinitely.

The media controller subsystem is aimed at solving a real problem, and it is certainly a credible solution. It is also a significant new user-space ABI, one which does not necessarily conform to current ideas of how interfaces should be done. The work done here may also be applicable well beyond the V4L2 and ALSA subsystems, but any attempt at a bigger-picture solution should probably be made before the code is merged and the ABI is set in stone. All of this suggests that the media controller code could benefit from review outside of the V4L mailing list, which tends to be inhabited by relatively focused developers.

(Thanks to Andy Walls, Hans Verkuil, and Laurent Pinchart for their comments on this article).


Making attacks a little harder

By Jonathan Corbet
November 17, 2010

Regardless of whether one believes that the security of the Linux kernel is as good as it should be, it is hard to disagree with the idea that it could be made more secure. For some years, it has seemed like much of the security-related work on the kernel has been directed toward the creation of new access control mechanisms. But access control is only so helpful if the kernel itself is vulnerable, allowing any access control system to be bypassed. Recently we have begun to see more work aimed at making small improvements to the security of the kernel itself; this article will survey some of that work.

One key to hardening a system against attackers is to make it harder for them to obtain information which could be used to compromise the kernel. So it is not surprising to see an increase in patches which lock down access to information. It turns out, though, that there is not universal agreement on the value of restricting any kind of information about the running system.

Marcus Meissner started things off with a simple patch removing world-read access from /proc/kallsyms. It is difficult to subvert the kernel without knowledge of how the kernel's memory is laid out, so, Marcus thought, there is no point in providing that information to anybody who asks. The problem with this change, as Ingo Molnar pointed out, is that there are many sources of that information. For example, the System.map file shipped by most distributors also has the locations of all symbols built into the kernel.

Now, one can certainly read-protect System.map as well, but that may not be particularly helpful. Most systems out there are running distributor-supplied kernels, and the packages for those kernels are widely available. So an attacker does not need to read /proc/kallsyms or System.map if the target system is running a stock kernel; they need only dig up a package file containing the needed information. For this reason, Ingo suggested that a complete solution would require restricting access to the running kernel version as well. Removing all of the globally-readable kernel version information from a system would be hard, but, if it could be done, attackers would no longer have easy access to the locations of functions and data structures within the kernel.

Suffice it to say that this idea was not received with universal acclaim. Critics claim that there are plenty of ways to determine which kernel version is running; hiding version information would just make life harder for legitimate applications (which may need that information to know which features are available) without appreciably slowing attackers. Ingo talked some about instrumenting the kernel to detect an attacker's attempts to determine the running kernel version, thus giving an early alarm, but this idea did not seem to gain a great deal of traction. So, chances are, kernel versions will not be hidden in any near-future release (the /proc/kallsyms patch has been merged for 2.6.37, though).
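Removing world-read access is, in the end, a change to a file's permission bits, and it is easy to check from user space whether a given file is still readable by everybody. A minimal sketch (the helper name is made up):

```python
import os
import stat


def world_readable(path):
    """Return True if the "other" read permission bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)
```

Calling world_readable("/proc/kallsyms") or pointing it at a System.map file under /boot shows whether symbol addresses are available to any local user by that route. As noted above, a False result says nothing about the other ways of obtaining the same information.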

Dan Rosenberg has a similar concern: when the kernel exposes pointer values to user space, it gives information to potential attackers. These values can be found in a number of places, including the system log and numerous places in /proc. Keeping pointer values out of the system log seems like a hopeless task, but it is possible to better restrict access to that log. To that end, Dan has posted a patch adding a new sysctl knob controlling access to the syslog() system call. Later versions of the patch include a configuration option for the default setting of this knob; with that, distributors can make the system log off limits for unprivileged users starting at boot.

Kernel addresses also show up in other places, though; for example, /proc/net/tcp contains the address of the sock structure associated with each open TCP connection. Dan worries about exposing the address of these structures, especially since many of them contain function pointers; if an attacker is somehow able to change the contents of kernel memory, this kind of address might facilitate the task of taking over the system. To raise the bar a bit, Dan posted a series of patches which replaces the pointer value with an integer value (often zero) if the process reading the associated /proc file is not suitably privileged.

Unlike the syslog patch, which has made it into the mainline, the /proc modification ran into some stiff opposition. It was described as "security theater," and developers worried that it would break applications which are legitimately using the pointer values. There were suggestions that, perhaps, pointer values could be hashed, or that a more general solution could be had by modifying the behavior of "%p" in format strings. We might see the "%p" patch at some point, but Dan has given up on the /proc patches for now, saying "It's clear that there's too much resistance to this effort for it to ever succeed, so I'm ceasing attempts to get this patch series through."
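The two ideas mentioned here, replacing a pointer with zero for unprivileged readers and hashing pointer values, can be contrasted in a short sketch. None of this is taken from the patches themselves; the function names, the keyed-hash construction, and the sample address are all assumptions chosen for illustration.

```python
import hashlib
import hmac


def censored(addr, privileged):
    """Show the real address only to suitably privileged readers."""
    return addr if privileged else 0


def hashed(addr, secret):
    """Replace an address with a keyed hash of itself.

    Equal addresses still map to equal values, so output can be
    correlated, but the real location is not disclosed to anybody
    who lacks the secret.
    """
    digest = hmac.new(secret, addr.to_bytes(8, "little"),
                      hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little")


sock_addr = 0xFFFF880012345678   # made-up kernel address
print(hex(censored(sock_addr, privileged=False)))  # 0x0
print(hex(hashed(sock_addr, b"per-boot secret")))
```

The hashed variant would keep applications working if they only use the value as an opaque identifier, which was one of the compatibility worries raised against simply zeroing the field.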

Making it difficult to find structures containing function pointers may make life harder for an attacker, but it still seems better to block the modification of those structures whenever possible, regardless of who knows their location. To that end, Kees Cook has announced his intent to try to lock down more of the kernel:

The proposal is simple: as much of the kernel should be read-only as possible, most especially function pointers and other execution control points, which are the easiest target to exploit when an arbitrary kernel memory write becomes available to an attacker.

Getting various structures marked const is an obvious starting point; "constification" patches have been produced by many developers over the years, but many structures still can be modified at run time. Beyond that, though, Kees would like to have working read-only and no-execute memory in loadable modules, "set once" pointers for things like the security module operations vector, and more; many of the changes he would like to see merged can currently be found in the grsecurity tree. It could be a long process, but Kees says that it would be a security win for everybody and that he would appreciate cooperation from subsystem maintainers.

Not all kernel vulnerabilities are in the core code; many, instead, are found in loadable modules. An attacker wishing to exploit a vulnerability in a module must first ensure that the module is loaded. Module loading is a privileged operation, but there are a number of ways in which an unprivileged user can cause the kernel to load a module anyway; the kernel normally goes out of its way to autoload modules on demand so that things "just work." It seems clear that a kernel which never allows users to trigger the loading of modules is less likely to be affected by any vulnerability which is found in a loadable module.

Dan has posted another patch (again, based on work done in the grsecurity tree) which makes the demand loading of modules harder. It replaces the existing modules_disable sysctl knob with a more flexible version; if it is set to one, only root can trigger the loading of modules. Setting it to two disables module loading entirely until the next boot. The changing of the existing ABI was not well received, so a future version of the patch will keep the existing switch and its semantics. Beyond that, doubts have been expressed regarding whether administrators will enable this option, since demand loading is a convenient feature.
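The proposed three-state knob boils down to a small decision table. The sketch below restates the semantics described above; the function name is invented, and treating zero as "no restriction" is an assumption about the default.

```python
def may_trigger_module_load(modules_disable, is_root):
    """Decide whether a request may cause a module to be loaded.

    0: no restriction (assumed default)
    1: only root can trigger module loading
    2: module loading is disabled until the next boot
    """
    if modules_disable == 0:
        return True
    if modules_disable == 1:
        return is_root
    return False
```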

Hardening the kernel to make the exploiting of vulnerabilities more difficult seems like a good thing, but it would also be nice if we could find those vulnerabilities before anybody even tries to exploit them. One technique which can help in this regard is "fuzzing," the process of passing random values into system calls and looking for unexpected behavior. Some attackers certainly have good fuzzing tools, but the development community seems to be rather less well equipped. So it is good to see some recent work by Dave Jones aimed at the creation of a more intelligent fuzzer. It turns out that, by making system call parameters a bit less fuzzy, the tool is more likely to get past the trivial checks and turn up real problems; the improved fuzzer has already turned up one real bug.
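The effect of "less fuzzy" parameters can be demonstrated with a toy model. Nothing below resembles Dave's fuzzer or makes real system calls; toy_syscall() is an invented stand-in whose trivial checks reject nearly all purely random arguments, while arguments drawn from plausible ranges get through to the code worth testing.

```python
import random

OPEN_FDS = {0, 1, 2}   # descriptors the toy "kernel" considers valid
VALID_FLAGS = 0x7      # only the low three flag bits are defined


def toy_syscall(fd, flags, length):
    """Stand-in for a system call with the usual up-front sanity checks."""
    if fd not in OPEN_FDS:
        return "EBADF"
    if flags & ~VALID_FLAGS:
        return "EINVAL"
    if length > 4096:
        return "EINVAL"
    return "reached deeper code"


def naive_args(rng):
    """Three completely random 32-bit values."""
    return rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(32)


def informed_args(rng):
    """Random values restricted to ranges the call will accept."""
    return (rng.choice(sorted(OPEN_FDS)),
            rng.getrandbits(3),
            rng.randrange(0, 4097))


def count_deep(make_args, trials=1000, seed=0):
    """Count how many calls get past the trivial checks."""
    rng = random.Random(seed)
    return sum(toy_syscall(*make_args(rng)) == "reached deeper code"
               for _ in range(trials))


print(count_deep(naive_args), count_deep(informed_args))
```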

The value of all this work may not be clear to everybody, and it probably will not all make it into the mainline kernel. But it does seem that we are seeing the beginning of a more focused effort to improve the security of the kernel and to make it harder to exploit the inevitable bugs. A more secure kernel may make it harder to gain true ownership of our gadgets in the future, but it still is generally a good thing.


Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.37-rc2

Architecture-specific

Catalin Marinas ARM: Add support for the Large Physical Address Extensions

Build system

dirk.brandewie@gmail.com Add the ability to link device blobs into vmlinux

Core kernel code

Paul Turner reducing overhead for tg->shares distribution

Michael Holzheu taskstats: Enhancements for precise process accounting (version 2)

Alexander Shishkin system time changes notification

Christopher Yeoh Cross Memory Attach v2

Steven Rostedt Format for the new stable events ABI

Steven Rostedt tracing/events: stable tracepoints

Mike Galbraith sched: automated per tty task groups

Development tools

Andi Kleen perf-events: Add support for supplementary event registers

Mathieu Desnoyers New tools: lttngtrace and lttngreport

shaohui.zheng@intel.com NUMA Hotplug Emulator - Introduction & Feedbacks

Ingo Molnar trace: Add user-space event tracing/injection

Device drivers

Marek Szyprowski Videobuf2 framework

Documentation

Michael Kerrisk man-pages-3.31 is released

Linus Walleij clocksource: document some basic concepts

Filesystems and block I/O

Josef Bacik Hole Punching V2

Jeff Hansen net/unix: Allow Unix sockets to be treated like normal files.

Nick Piggin [rfc] dcache scaling part 1

Memory management

Mel Gorman Use compaction to reduce a dependency on lumpy reclaim

Johan Mossberg hwmem: Hardware memory driver

Wu Fengguang IO-less dirty throttling v2

Michel Lespinasse Avoid dirtying pages during mlock

Security-related

Dan Rosenberg Restrict unprivileged access to kernel syslog

Jari Ruusu Announce loop-AES-v3.5b file/swap crypto package

Jarkko Sakkinen Smack: label for task objects

Marcus Meissner kernel/time: Make /proc/timer_list mode 0400

Virtualization and containers

Jeremy Fitzhardinge PV ticket locks without expanding spinlock

Miscellaneous

Bruno Randolf [PATCH v6] Add generic exponentially weighted moving average (EWMA) function

Douglas Gilbert sg3_utils-1.30 available

Douglas Gilbert sdparm 1.06 available

Page editor: Jonathan Corbet