
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.24-rc1. Fixes have been flowing into the mainline repository (along with a Japanese translation of the SubmittingPatches document), but -rc2 has not been released as of this writing.

For older kernels: the 2.6.22.11 stable update is in the review process now; it should be released sometime on or after November 2. It contains 26 patches addressing a number of problems. This is likely to be the last update to 2.6.22.

2.6.16.56-rc2 was released on October 29; it adds a handful of fixes.


Kernel development news

Quotes of the week

Do you know why Unix was a success and MULTICS a failure? It's because Unix had mode bits and MULTICS had ACLs. Fortunately for those of us who wear titles like "Security Expert" or "Trust Technologist" with pride there are enough clinical paranoids in positions of authority to keep the Trusted System niche from closing up completely and hence supporting our Rock Star Lifestyles. The good news is that the situation is no worse than that faced by the people who are bringing you Infiniband or Itanium, neither of which will ever be the life of the party either. Sure security is important, but I learned (in college, and yes they had colleges way back then) not to drink too much at parties I'd crashed.
-- Casey Schaufler

Please always prepare and test patches against the latest kernel. 2.6.23 is very much _not_ the latest kernel - there is a 50MB diff between 2.6.23 and 2.6.24-rc1. That's a lot of difference.
-- Andrew Morton

Rule #1 in kernel programming: don't *ever* think that things actually work the way they are documented to work.
-- Linus Torvalds


The chained scatterlist API

By Jonathan Corbet
October 29, 2007
When asked which of the changes in 2.6.24 was most likely to create problems, an informed observer might well point at the i386/x86_64 merger. As it happens, that large patch set has gone in with relatively few hitches, but a rather smaller change has created quite a bit of fallout. The change in question is the updated API for the management of scatterlists, which are used in scatter/gather I/O. This work broke a number of in-tree drivers, so it seems likely to affect a lot of out-of-tree code as well.

Scatter/gather I/O allows the system to perform DMA I/O operations on buffers which are scattered throughout physical memory. Consider, for example, the case of a large (multi-page) buffer created in user space. The application sees a contiguous range of virtual addresses, but the physical pages behind those addresses will almost certainly not be adjacent to each other. If that buffer is to be written to a device in a single I/O operation, one of two things must be done: (1) the data must be copied into a physically-contiguous buffer, or (2) the device must be able to work with a list of physical addresses and lengths, grabbing the right amount of data from each segment. Scatter/gather I/O, by eliminating the need to copy data into contiguous buffers, can greatly increase the efficiency of I/O operations while simultaneously getting around the problem that the creation of large, physically-contiguous buffers can be problematic in the first place.

Within the kernel, a buffer to be used in a scatter/gather DMA operation is represented by an array of one or more scatterlist structures, defined in <linux/scatterlist.h>. This array has traditionally been constrained to fit within a single page, which imposes a maximum length on scatter/gather operations. That limit has proved to be a bottleneck on high-end systems, which could otherwise benefit from transferring very large buffers (usually to and from disk devices). As a result, there has been a search for ways to get around that limit; the large block size patches which occasionally surface on the mailing lists are one approach. But the solution which has made it into the 2.6.24 kernel is to remove the limit on the length of scatter/gather lists by allowing them to be chained.

A chained scatter/gather list can be made up of more than one page, and those pages, too, are likely to be scattered throughout physical memory. When this chaining is done, a couple of low-order bits in the buffer pointer are used to mark chain entries and the end of the list. This usage is not something which driver code needs to worry about, but the existence of special bits and chain pointers forces some changes to how drivers work with scatterlists.

Drivers which do not perform chaining will allocate their scatterlist arrays in the usual way - usually through a call to kcalloc() or some such. Prior to 2.6.24, there was no initialization step required, beyond, perhaps, zeroing the entire array. That has changed, however; drivers should now initialize a scatterlist array with:

    void sg_init_table(struct scatterlist *sg, unsigned int nents);

Here, sg points to the allocated array, and nents is the number of allocated scatter/gather entries.

As before, a driver should loop through the segments of the buffer, setting one scatterlist entry for each. It is no longer possible to set the page pointer directly, however: that pointer does not exist in 2.6.24. Instead, the usual way to set a scatterlist entry will be with one of:

    void sg_set_page(struct scatterlist *sg, struct page *page,
		     unsigned int len, unsigned int offset);

    void sg_set_buf(struct scatterlist *sg, const void *buf,
	      	    unsigned int buflen);

2.6.24 scatterlists also require that the end of the list be explicitly marked. This marking is performed when sg_init_table() is called, so drivers will not normally have to mark the end explicitly. Should the I/O operation not use all of the entries which were allocated in the list, though, the driver should mark the final segment with:

    void sg_mark_end(struct scatterlist *sg, unsigned int nents);

Here, nents is the number of valid entries in the scatterlist.

After the scatterlist has been mapped (with a function like dma_map_sg()), the driver will need to program the resulting DMA addresses into the hardware. The old approach of just stepping through the array will no longer work; instead, a driver should move on to the next entry in a scatterlist with:

    struct scatterlist *sg_next(struct scatterlist *sg);

The return value will be the next entry to process - or NULL if the end of the list has been reached. There is also a for_each_sg() macro which can be used to iterate through an entire scatterlist; it will typically be used in code which looks like:

    int i;
    struct scatterlist *list, *sgentry;

    /* Fill in list and pass it to dma_map_sg().  Then... */
    for_each_sg(list, sgentry, nentries, i) {
	program_hw(device, sg_dma_address(sgentry), sg_dma_len(sgentry));
    }

Drivers which wish to take advantage of the chaining feature must do just a little more work. Each piece of the scatterlist must be allocated independently, then those pieces must be chained together with:

    void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
		  struct scatterlist *next);

This call turns the final scatterlist entry of prv (prv[prv_nents - 1]) into a chain link to next. If the chaining is done while the list is being filled, prv should have no more than prv_nents-1 segments stored into it. Alternatively, a driver can chain together the pieces of the list ahead of time (remembering to allocate one entry for each chain link), then use sg_next() to fill the list without the need to worry about where the chain links are.

As of this writing, this API is still evolving in response to issues which have come up with in-tree drivers. It seems unlikely that any more substantial changes will be made before the 2.6.24 release, but surprises are always possible.


Fixing CAP_SETPCAP

By Jake Edge
October 31, 2007

Linux capabilities have been around for almost ten years now – they were originally merged into a 2.1 kernel – but they haven't gotten a lot of use in that time. One pretty basic missing feature, support for associating capabilities with files, has been merged for 2.6.24. This allows a longstanding hack, which redefines the proper usage of CAP_SETPCAP, to be fixed; this too has been merged into 2.6.24.

A bit of review is probably in order. Capabilities are a way to separate individual privileges that are normally all granted to the root user. There are currently 31 different capabilities defined (in <linux/capability.h>), but there are efforts underway to allow for expansion. The idea is that a program should be able to set the system time, for example, without needing the entire set of privileges that come with a setuid(0) program.

Capabilities originally came from a proposed POSIX standard that was eventually not adopted, but, in the meantime, got included into Linux. The feature has languished since, for a number of reasons, but perhaps the largest was that there was no way to associate executable programs with a set of capability bits. Now that capability bits can be stored in the extended attributes of files, the process can get the proper capabilities when the program is invoked. Standard UNIX permissions still apply – users can only execute programs they have an x bit for.

In order to use capabilities at all, prior to being able to store them with files, a method was needed to set the capabilities of a running process. The CAP_SETPCAP capability was co-opted for this purpose. A process with this capability – which, in practice, meant a root process – could set the capabilities of another process. If that process was meant to be able to do the same – something that needs to be carefully considered – it could get the CAP_SETPCAP bit as well.

This could really only be used to add capabilities to long-running processes that were not run as root (which has all of the capabilities), or to remove some capabilities from daemons run as root. Other schemes using setuid wrappers for utility programs that needed some privileges could also be imagined, but distributions or tools that use capabilities are not widespread.

CAP_SETPCAP was never meant to have this behavior, so the recent patch restores it to its original meaning. As odd as it might seem at first, CAP_SETPCAP is only meant to allow changes to a process's own capabilities; in fact, with this patch applied, there is no way for a process to change another running process's capabilities. That is probably the biggest user-visible change.

Capabilities are not a single set of bits; instead, there are three sets of bits representing the effective, permitted, and inheritable capabilities of a process. Files, similarly, have three capability sets, which are combined with those of the process executing the file using the "capability rules" (described in the patch and in an LWN article from a year ago) to determine the three sets for the resulting process.

For processes, the effective set contains those capabilities currently enabled – a process might drop some of the capabilities it is allowed once it has performed the corresponding privileged operation – while the permitted set is a superset of the effective set, including all capabilities allowed to that process. The inheritable set contains those capabilities that are passed on to a new program started by an exec() call, which is where the new CAP_SETPCAP comes into play; a process with this capability can change its inheritable set to include any capability, including those that are not in its permitted set.

This allows processes to bestow privileges that they do not possess upon their children, which provides for some interesting uses. It helps further partition privileges by not requiring a process to have a particular capability simply to pass it on to children. The example provided in the patch illustrates this nicely: the login program does not require many privileges, but through some policy mechanism (pam_cap for example) could allow certain users to have extra capabilities. Because the login process does not itself possess those extra capabilities, this could limit the damage an exploit of login could do.

It is unclear whether these recent additions to the capability feature set will result in more capability users. There is a lot of work in the kernel security space right now as kernel hackers and security folks try to come up with sensible security solutions for Linux. The complexity of SELinux, along with the fact that many administrators disable it rather than try to figure it out, seems to have the community casting about for other solutions. It is possible that capabilities might be a part of another solution, though its complexities are far from trivial. Though most of the major distributions have already made their security model choice, a capabilities-based distribution would be interesting to see; it might make a nice project for a smaller, up-and-coming, distribution to try.


Notes from a container

By Jonathan Corbet
October 29, 2007
"Containers" are a form of lightweight virtualization as represented by projects like OpenVZ. While full virtualization creates a new virtual machine upon which the guest system runs, container implementations work by erecting walls around groups of processes. The result is that, while virtualized guests each run their own kernel (and can run different operating systems than the host), containerized systems all run on the host's kernel. So containers lack some of the flexibility of full virtualization, but they tend to be quite a bit more efficient.

As of 2.6.23, virtualization is quite well supported on Linux, at least for the x86 architecture. Container support, instead, lags a little behind. It turns out that, in many ways, containers are harder to implement than virtualization is. A container implementation must wrap a namespace layer around every global resource found in the kernel, and there are a lot of those resources: processes, filesystems, devices, firewall rules, even the system time. Finding ways to wrap all of these resources in a way which satisfies the needs of the various container projects out there, and which also does not irritate kernel developers who may have no interest in containers, has been a bit of a challenge.

Full container support will get quite a bit closer once the 2.6.24 kernel is released. The merger of a number of important patches in this development cycle fills in some important pieces, though a certain amount of work remains to be done.

Once upon a time, there was a patch set called process containers. The containers subsystem allows an administrator (or administrative daemon) to group processes into hierarchies of containers; each hierarchy is managed by one or more "subsystems." The original "containers" name was considered to be too generic - this code is an important part of a container solution, but it's far from the whole thing. So containers have now been renamed "control groups" (or "cgroups") and merged for 2.6.24.

Control groups need not be used for containers; for example, the group scheduling feature (also merged for 2.6.24) uses control groups to set the scheduling boundaries. But it makes sense to pair control groups with the management of the various namespaces and resource management in general to create a framework for a containers implementation.

The management of control groups is straightforward. The system administrator starts by mounting a special cgroup filesystem, associating the subsystems of interest with the filesystem at mount time. There can be more than one such filesystem mounted, as long as each subsystem is associated with at most one of them. So the administrator could create one cgroup filesystem to manage scheduling and a completely different one to associate processes with namespaces.

Once the filesystem is mounted, specific groups are created by making directories within the cgroup filesystem. Putting a process into a control group is a simple matter of writing its process ID into the tasks virtual file in the cgroup directory. Processes can be moved between control groups at will.

The concept of a process ID has gotten more complicated, though, since the PID namespace code was also merged. A PID namespace is a view of the processes on the system. On a "normal" Linux system, there is only the global PID namespace, and all processes can be found there. On a system with PID namespaces, different processes can have very different views of what is running on the system. When a new PID namespace is created, the only visible process is the one which created that namespace; it becomes, in essence, the init process for that namespace. Any descendants of that process will be visible in the new namespace, but they will never be able to see anything running outside of that namespace.

Virtualizing process IDs in this way complicates a number of things. A process which creates a namespace remains visible to its parent in the old namespace - and it may not have the same process ID in both namespaces. So processes can have more than one ID, and the same process ID may be found referring to different processes in different namespaces. For example, it is fairly common in containers implementations to have the per-namespace init process have ID 1 in its namespace.

What all of this means is that process IDs only make sense when placed into a specific context. That, in turn, sets a trap for any kernel code which works with process IDs; any such code must take care to maintain the association between a process ID and the namespace in which it is defined. To make life easier (and safer), the containers developers have been working for some time to eliminate (to the greatest extent possible) use of process IDs within the kernel itself. Kernel code should use task_struct pointers (which are always unambiguous) to refer to specific processes; a process ID, instead, has become a cookie for communication with user space, and not much more.

This job of cleaning up PID use is not complete at this point. In fact, the process ID namespace work has a great many loose ends in general, to the point that some of the developers do not think that it is really ready to be used yet. In particular, there is concern that some of the management APIs could change, breaking code which is written for the 2.6.24 API. Adding new user-space APIs is always problematic in this regard: getting an API right is hard, and getting it right the first time is even harder. But user-space APIs are supposed to stay constant once they are merged; there is no provision for any sort of stabilization period where things can change. For PID namespaces, what's likely to happen is that the feature will be marked "experimental" in the hope that nobody will use it in its 2.6.24 form.

Also merged for 2.6.24 is the network namespace patch. The idea behind this code is to allow processes within each namespace to have an entirely different view of the network stack. That includes the available interfaces, routing tables, firewall rules, and so on. These patches are in a relatively early state; they add the infrastructure to track different namespaces, but not a whole lot more. Quite a few internal networking APIs have been changed to take a namespace parameter, but, in most cases, the code simply fails any operation which is attempted in anything other than the default, root namespace. There is a new "veth" virtual network device which can be used to create tunnels between namespaces.

The PID and network namespace patches have added a couple of lines to <linux/sched.h>:

    #define CLONE_NEWPID	0x20000000	/* New pid namespace */
    #define CLONE_NEWNET	0x40000000	/* New network namespace */

These entries highlight an interesting problem: the CLONE_ flags are passed to the kernel as a 32-bit value. As of this writing, there are only two bits left for new flags. So the containers developers are going to run out of flags; how they plan to deal with that problem is not clear at this point.

These developers are also working on the management of containers, and, in particular, how to move between them. One of the things likely to come out of that work in the near future is a proposal for a new system call:

    int hijack(unsigned long clone_flags, int which, int id);

This system call behaves much like clone() in that it creates a new process, but with an interesting twist. The new process created by clone() takes all of its resources - including namespaces - from the calling process; these resources will be copied or shared as directed by the clone_flags argument. A call to hijack(), instead, obtains all of those resources from the process whose ID is given in the id parameter. So it is possible to write a little program which forks via a hijack() call and runs a shell in the resulting child process; that shell will be running with all of the namespaces of the hijacked process.

To make life easier for people working with containers, the which parameter was added in recent versions of this API. If which is passed as 1, the call treats id as a process ID, as described above. A value of 2, instead, says that id is actually an open file descriptor for the tasks file in a cgroup control directory. In this case, hijack() finds the lead process for that control group and obtains resources from there.

This system call is new, and it has not seen a whole lot of review outside of the containers mailing list. So chances are that some changes will be requested once it becomes more widely visible; among other things, a name change might be called for. In general, there is a lot yet to be done with the containers code, but progress is visibly being made. There will come a point where the mainline kernel comes equipped with complete container capabilities.


Patches and updates

Kernel trees

Steven Rostedt: 2.6.23-rt3
Steven Rostedt: 2.6.23.1-rt4
Steven Rostedt: 2.6.23.1-rt5
Adrian Bunk: Linux 2.6.16.56-rc2

Architecture-specific

Core kernel code

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds