execns()
Cedric Le Goater has posted a patch set which takes this work forward in an interesting way by de-globalizing another namespace and adding a different interface for creating new namespaces. The new namespace type added by the patch is the "user" namespace - the system's view of user ID values. For the most part, the kernel just uses user IDs for the enforcement of permissions; it does not really care if one set of processes interprets user ID values differently than another. So, if processes within one container cannot see resources (processes, SYSV IPC, filesystems) belonging to another container, there is little opportunity for processes to interfere with each other, even if they are running with the same numeric user ID value. That user ID can map to two entirely different accounts in the different containers, and the isolation provided by those containers will keep them separate.
The one little exception is the user_struct structure maintained in kernel/user.c. This structure exists to allow the kernel to enforce per-user resource limits; to that end, one is allocated for each user ID currently active on the system. The function responsible for looking up one of these structures (find_user()) implements a global user ID namespace, so processes sharing a user ID number in different containers will affect each others' resource limits.
Cedric's patch fixes this problem by creating a new namespace type for user IDs, allowing resource limits to be isolated within containers. The implementation of this namespace is simple, but allowing processes to move into a new user namespace with unshare(), as it turns out, is not. When a process gets around to calling unshare(), it may have a long list of resources which are reflected in the user_struct structure. Disconnecting from the old structure will require the system to somehow disassociate the process's current resource usage from that structure and add them to the new one instead. This process is detailed and error-prone; even if it works once, keeping it maintained and functional into the future could be a challenge. The same challenge applies to SYSV IPC namespaces. A process which holds references to a SYSV semaphore, for example, must have those references taken away, any undo information handled properly, and so on.
Rather than try to fix up unshare() to handle all of these issues, Cedric has taken a different approach: only allow a process to disconnect from namespaces when all of its references to those namespaces are being shut down anyway. That time is when the process calls a form of exec() to run a new program. So Cedric has created a new form of the execve() call:
int execns(int unshare_flags, char *filename, char **argv, char **envp);
This call will function like execve, in that it will cause the process to run the program found in filename with the given arguments and environment. The new unshare_flags argument, however, allows the caller to specify a set of namespaces to be unshared at the same time. As a result, the new program starts fresh with its new namespaces and no dangling references into the older ones. To help ensure that things happen this way, execns() closes all open files, regardless of whether they are marked "close on exec."
Moving namespace creation into exec() would seem to make some
sense. The creation of namespaces is a rare act, done as part of the
establishment of a new container; it's not something that running processes
just occasionally decide to do. The execns() will allow a
container's init-like process to start with a clean slate while,
with luck, simplifying the unsharing logic within the kernel.
| Index entries for this article | |
|---|---|
| Kernel | execns() |
| Kernel | unshare() |
| Kernel | Virtualization/Containers |