[go: up one dir, main page]

|
|
Log in / Subscribe / Register

System call updates: indirect(), timerfd(), and hijack()

By Jonathan Corbet
November 28, 2007
Last week's discussion of the proposed indirect() system call ended with some complaints from developers on the ugliness of the interface. Since then there has been some talk about system call interfaces in general, but not a whole lot of ideas for how indirect() could be done better.

The leading alternative would be that pushed by H. Peter Anvin: rather than use indirect() to extend a system call, simply make a new system call with the desired additional parameters. Then, usually, the old implementation can be replaced with a simple stub which calls the new version with the default values for the new parameters. It is a simple approach which easily maintains binary compatibility with very little runtime cost. Since there is no particular shortage of system call numbers, this is a process which could go on for a long time.

The management of increasing numbers of system calls does impose a cost, though; each one of those system calls is a user-space API which cannot ever be broken. The indirect() approach, instead, does not add more system calls. As long as the addition of parameters (with default values of zero) is done with care, avoiding API problems should be relatively easy to do.

There are also limits on how many parameters can be easily passed to system calls; on most systems, that limit is around six. Any system call requiring more arguments must already do uncomfortable things with indirect blocks. Creating new system calls with additional parameters will create more cases where this sort of indirect parameter handling is required. So the approach used by indirect() will find itself being used, in some form, anyway.

The key argument, though, still appears to be the syslet/threadlet mechanism. The ability to make any system call asynchronous has a lot of appeal, but doing so requires some additional information - a place to store the result of the call, if nothing else. Asynchronous system calls, in Linux, are, for all practical purposes, a type of indirect call. The proposed indirect() interface looks like it should be able to accommodate asynchronous calls nicely - though the precise API has not, yet, been nailed down.

As a result of all this, chances are that some form of indirect() will find its way into the mainline - though there is still time for somebody to come up with a better idea.

Meanwhile, the last time timerfd() was discussed here, it had been disabled in the 2.6.23 kernel as a result of complaints about its interface. Since then, little has happened with timerfd(), with the result that it will almost certainly not be present in 2.6.24 either. Some work has been done with this system call, though, and a new API proposal has been posted. This version has three system calls, the first of which is timerfd_create():

    int timerfd_create(int clockid, int flags);

The clockid argument tells the system which clock should be used: CLOCK_MONOTONIC or CLOCK_REALTIME. The flags argument is a recent addition; it is currently unused and must be zero. It was added on the assumption that somebody, somewhere, will always want some sort of behavior modification and one might as well avoid the need for an indirect version while it's easy. The return value from timerfd_create() is a file descriptor which can be passed to read() or any of the poll() variants. But, first, the timer should probably be programmed with:

    int timerfd_settime(int fd, 
                        int flags,
		        const struct itimerspec *timer,
		    	struct itimerspec *old_timer);

Here, fd is a file descriptor obtained from timerfd_create(), flags contains TFD_TIMER_ABSTIME if the timer is being set to an absolute time, and timer is the expiration time for the timer. If old_timer is not NULL, the location pointed to will be set to the previous value of the timer.

It is also possible to query the value of the timer with:

    int timerfd_gettime(int fd, struct itimerspec *timer);

The value returned in *timer will be the current setting of the timer associated with fd.

There's not been a whole lot of comments on this version of the API, so something very similar to it will probably be merged. It would normally be considered to be too late to put a change like this into 2.6.24, but the 2.6.24-rc3-mm2 patch log says "Probably 2.6.24?". So one never knows. If this change is not merged soon, it will almost certainly become available for 2.6.25.

Finally, the hijack() system call continues to be developed on relatively quiet kernel subsystem lists. This call (described here in October) behaves much like clone() in that it creates a new process. Unlike clone(), however, hijack() causes the new process to share resources with a specified third process rather than with the parent. Its main reason for existence is to make it easy to enter different namespaces.

The hijack() interface remains almost unchanged:

    int hijack(unsigned long clone_flags, int which, int id);

The specified id value is interpreted according to which, which now has three possible values:

  • HIJACK_PID says that id is a process ID; the newly-created process will share resources (including namespaces) with the indicated process.

  • HIJACK_CG says that id is an open file descriptor for the tasks file in a target control group. In this case, the kernel will find a process within that control group and use it as the source for resources and namespaces.

  • HIJACK_NS is the newest option; like HIJACK_CG, it is an open file descriptor indicating a control group. In this case, though, only the control group itself and any associated namespaces will be inherited by the new process. This version is intended for use when entry into an empty control group (where there are no processes to inherit from) is desired.

This new system call still has not seen any exposure on linux-kernel; it may well not survive its first experience there in its current form. If nothing else, a name change (to something which is more descriptive of the real function and, preferably, which does not put users onto intelligence agency watch lists) may well be called for. But a full container implementation on Linux will clearly need some sort of enter_container() system call at some point.

Index entries for this article
KernelContainers
Kernelindirect()
KernelSystem calls
Kerneltimerfd()


to post comments

System call updates: indirect(), timerfd(), and hijack()

Posted Dec 1, 2007 6:24 UTC (Sat) by kbob (guest, #1770) [Link]

Shouldn't timerfd_create() be spelled timerfd_creat()?

(-:


Copyright © 2007, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds