Fun with file descriptors
This issue is a problem for kernel developers. They would rather not create new, file-descriptor-based services (completion events for syslet-based asynchronous I/O, for example) if glibc will not use those services. So there has been a search for alternatives, most of which involve creating a separate space for "system" file descriptors. Linus suggested one way of doing this:
Davide Libenzi took this idea forward with a patch to create a non-sequential file descriptor area. The current kernel tracks file descriptors in a linear array - a technique which works well as long as the "lowest available descriptor" rule holds. As soon as one starts setting high-order bits in file descriptor numbers, however, the linear array becomes rather less practical. So Davide's patch creates a separate, linked-list data structure used for the non-sequential file descriptor range. The second part of the patch set then fixes up the dup2() system call to use the new file descriptor range. The normal behavior of dup2() has not changed, but if the destination file descriptor is passed as FD_UNSEQ_ALLOC, a random file descriptor will be allocated from the non-sequential area. A specific file descriptor in that area can be requested by passing a number higher than FD_UNSEQ_BASE.
This approach has the advantage of not requiring any new system calls or changing the default user-space binary interface at all. But according to Ulrich Drepper, that attribute is not an advantage at all. Since using this capability requires application changes in any case, Ulrich would rather just see a new system call created; he proposes:
int nonseqfd(int fd, int flags);
This system call would duplicate the open file descriptor fd into the non-sequential space, optionally closing fd in the process. The flags parameter would allow other attributes of the new file descriptor to be controlled. Of particular interest is whether that descriptor shows up in the /proc/pid/fd directory. The optimal way of closing all open file descriptors, apparently, is to read that directory to see which descriptors are currently open. Keeping special descriptors out of that directory (perhaps shifting them to a parallel private-fd directory) will prevent well-meaning applications from closing the library's file descriptors.
It has been suggested that the open() system call should get a flag which would cause it to select a non-sequential file descriptor from the outset, eliminating the need for a separate call to nonseqfd(). There are, however, a number of system calls which create file descriptors but which have no flags parameter and which, thus, will never be able to return non-sequential file descriptors; socket() is a classic example. So there will still be a need for a system call which can duplicate a file descriptor into the new space.
Ulrich has requested that all file descriptors in the non-sequential space be allocated randomly. He would rather not ever see a situation where application developers think they can rely on any specific allocation behavior when using that space. There have also been suggestions that the non-sequential space could be useful for for high-performance applications which hold large numbers of file descriptors open - web servers, for example. Such applications usually have no use for the "lowest available descriptor" guarantee and would happily do without the overhead of implementing that guarantee. Davide's current implementation does not appear to have been written with thousands of non-sequential file descriptors in mind, though.
On another front, Ulrich has been working on a race condition which comes up with certain types of applications. It is possible to request that a file descriptor be automatically closed if the process performs an exec(); the fcntl() system call is used for this purpose. The problem is that there is some time between when the file descriptor is created (with an open() call, perhaps) and the subsequent fcntl() call. If another thread forks and runs a new program between those two calls, its copy of the new file descriptor will not have the close-on-exec flag set and will thus remain open.
Solving that problem generally will take some work, but fixing the
open case is relatively easy. Ulrich is proposing a new
O_CLOEXEC flag for this purpose. There does not appear to be much
opposition to this idea, so the new flag might well make an appearance in
2.6.23.
| Index entries for this article | |
|---|---|
| Kernel | File descriptors |
| Kernel | System calls |