execns()

[Posted July 11, 2006 by corbet]

The developers behind a whole range of virtualization and containerization projects are continuing to work on ways to get the isolation features they need into the mainline kernel. Much of that work is centered around the elimination of global namespaces and additions to the unshare() system call so that interested processes can retreat into their own, private namespaces. For example, on mainline Linux systems today, the process ID namespace is global - a given process ID identifies the same process for every other process on the system. The container developers would like to move away from a global PID namespace so that containers can present their own process IDs to the processes trapped inside. Many other kernel namespaces are receiving the same sort of treatment.

Cedric Le Goater has posted a patch set which takes this work forward in an interesting way by de-globalizing another namespace and adding a different interface for creating new namespaces. The new namespace type added by the patch is the "user" namespace - the system's view of user ID values. For the most part, the kernel just uses user IDs for the enforcement of permissions; it does not really care if one set of processes interprets user ID values differently than another. So, if processes within one container cannot see resources (processes, SYSV IPC, filesystems) belonging to another container, there is little opportunity for processes to interfere with each other, even if they are running with the same numeric user ID value. That user ID can map to two entirely different accounts in the different containers, and the isolation provided by those containers will keep them separate.

The one little exception is the user_struct structure maintained in kernel/user.c. This structure exists to allow the kernel to enforce per-user resource limits; to that end, one is allocated for each user ID currently active on the system. The function responsible for looking up one of these structures (find_user()) implements a global user ID namespace, so processes sharing a user ID number in different containers will affect each other's resource limits.
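The effect of a global lookup can be seen in a toy user-space model. The sketch below is not the kernel's code: toy_user, toy_find_user(), toy_charge_process(), and the ns_id tag are all invented for illustration. It only shows that keying the lookup on the user ID alone makes two containers share one accounting structure, while adding a namespace tag to the key keeps them apart.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only; these names are illustrative and are not the kernel's. */
struct toy_user {
    int ns_id;              /* hypothetical namespace tag */
    unsigned int uid;       /* numeric user ID within that namespace */
    int processes;          /* one per-user counter, for resource limits */
    struct toy_user *next;
};

static struct toy_user *toy_users;  /* one list holding every entry */

/*
 * Find the entry for (ns_id, uid), creating it if needed.  If every
 * caller passes the same ns_id, this behaves like a single global
 * user ID namespace.
 */
static struct toy_user *toy_find_user(int ns_id, unsigned int uid)
{
    struct toy_user *u;

    for (u = toy_users; u != NULL; u = u->next)
        if (u->ns_id == ns_id && u->uid == uid)
            return u;

    u = calloc(1, sizeof(*u));
    if (u == NULL)
        return NULL;
    u->ns_id = ns_id;
    u->uid = uid;
    u->next = toy_users;
    toy_users = u;
    return u;
}

/* Charge one new process to a user and return that user's new count. */
static int toy_charge_process(int ns_id, unsigned int uid)
{
    struct toy_user *u = toy_find_user(ns_id, uid);

    assert(u != NULL);      /* toy model: allocation failure is fatal */
    return ++u->processes;
}
```

In this model, two processes with user ID 1000 charged under the same ns_id count against the same limit; charged under different ns_id values, each container gets its own counter.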

Cedric's patch fixes this problem by creating a new namespace type for user IDs, allowing resource limits to be isolated within containers. The implementation of this namespace is simple, but allowing processes to move into a new user namespace with unshare(), as it turns out, is not. When a process gets around to calling unshare(), it may have a long list of resources which are reflected in the user_struct structure. Disconnecting from the old structure will require the system to somehow disassociate the process's current resource usage from that structure and add it to the new one instead. This process is detailed and error-prone; even if it works once, keeping it maintained and functional into the future could be a challenge. The same challenge applies to SYSV IPC namespaces. A process which holds references to a SYSV semaphore, for example, must have those references taken away, any undo information handled properly, and so on.
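To make the bookkeeping problem concrete, here is a minimal user-space sketch of what moving a running process from one accounting structure to another involves. The types and the acct_switch_user() function are hypothetical, and the three counters are only loosely modeled on the sort of per-user totals the kernel keeps; the real structure tracks more than this, and each additional counter is one more thing a live unshare() would have to transfer correctly.

```c
#include <assert.h>
#include <stddef.h>

/* Per-user totals; a hypothetical, reduced stand-in for user_struct. */
struct acct_user {
    long processes;
    long files;
    long sigpending;
};

/* What a single process currently has charged to its user. */
struct acct_task {
    struct acct_user *user;
    long files;
    long sigpending;
};

/*
 * Move every charge held by the task from its current user to
 * new_user.  Forgetting any one counter here would leave the old
 * structure permanently over-counted and the new one under-counted.
 */
static void acct_switch_user(struct acct_task *t, struct acct_user *new_user)
{
    struct acct_user *old = t->user;

    assert(old != NULL && new_user != NULL);
    if (old == new_user)
        return;

    old->processes--;
    new_user->processes++;

    old->files -= t->files;
    new_user->files += t->files;

    old->sigpending -= t->sigpending;
    new_user->sigpending += t->sigpending;

    t->user = new_user;
}
```

A process which starts a new program with nothing open and nothing pending has almost nothing to transfer, which is the observation behind doing the switch at exec time.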

Rather than try to fix up unshare() to handle all of these issues, Cedric has taken a different approach: only allow a process to disconnect from namespaces when all of its references to those namespaces are being shut down anyway. That time is when the process calls a form of exec() to run a new program. So Cedric has created a new form of the execve() call:

    int execns(int unshare_flags, char *filename, char **argv, char **envp);

This call will function like execve(), in that it will cause the process to run the program found in filename with the given arguments and environment. The new unshare_flags argument, however, allows the caller to specify a set of namespaces to be unshared at the same time. As a result, the new program starts fresh with its new namespaces and no dangling references into the older ones. To help ensure that things happen this way, execns() closes all open files, regardless of whether they are marked "close on exec."
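Since execns() is only a proposal, it cannot be called on a mainline system. The rough shape of the operation can be approximated in user space with the existing unshare() and execve() calls, as in the sketch below. The execns_approx() function is hypothetical, and it is not equivalent to the patch: it unshares while the process still holds its old references, which is exactly the situation the kernel-side call is meant to avoid. It also assumes that "all open files" includes the standard descriptors.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* needed for unshare() in <sched.h> */
#endif
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical user-space stand-in for the proposed execns().
 * Returns -1 with errno set on failure; does not return on success.
 */
static int execns_approx(int unshare_flags, char *filename,
                         char **argv, char **envp)
{
    long fd, max_fd;

    assert(filename != NULL);

    /* Fail early, before anything has been closed. */
    if (unshare(unshare_flags) < 0)
        return -1;

    /*
     * Close every descriptor, whether or not it is marked
     * close-on-exec.  This loop can be slow if the descriptor
     * limit is very large.
     */
    max_fd = sysconf(_SC_OPEN_MAX);
    if (max_fd < 0)
        max_fd = 1024;      /* assumed fallback when no limit is reported */
    for (fd = 0; fd < max_fd; fd++)
        close((int)fd);

    /* If this fails, the caller has already lost its open files. */
    return execve(filename, argv, envp);
}
```

A caller would pass whichever flags unshare() accepts on the running kernel, CLONE_NEWNS for example, along with the usual execve() arguments. Unsharing a namespace normally requires privilege, so an unprivileged caller should expect the function to return -1.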

Moving namespace creation into exec() would seem to make some sense. The creation of namespaces is a rare act, done as part of the establishment of a new container; it's not something that running processes just occasionally decide to do. The execns() call will allow a container's init-like process to start with a clean slate while, with luck, simplifying the unsharing logic within the kernel.

Index entries for this article
Kernel	execns()
Kernel	unshare()
Kernel	Virtualization/Containers


execns()

Posted Jul 13, 2006 20:24 UTC (Thu) by iabervon (subscriber, #722)

My naive thought is that, if you unshare the user ID namespace, you should still have the same user_struct; you just wouldn't necessarily find it under your original UID. I'd think that if a whole-system user starts a process in a new container, the limits for root in the new container would be those of the original whole-system user, at least until things ran setreuid(). I'm also not clear why setreuid() wouldn't need all the complicated stuff in any case, since it must be handling the process changing user_structs.