User namespaces progress
The results of the user namespaces work on Linux have been a long time in
coming, probably because they are the most complex of the various namespaces that have been
added to the kernel so far. The first pieces of the implementation started
appearing when Linux 2.6.23 (released in late 2007) added the
CLONE_NEWUSER flag for the clone() and unshare() system calls. By
Linux 2.6.29, that flag also became meaningful in the clone()
system call. However, until now many of the pieces necessary for a complete
implementation have remained absent.
We last looked at user namespaces back in April, when Eric Biederman was working to push a raft of patches into the kernel with the goal of bringing the implementation closer to completion. Eric is now engaged in pushing further patches into the kernel with the goal of having a more or less complete implementation of user namespaces in Linux 3.8. Thus, it seems to be time to have another look at this work. First, however, a brief recap of user namespaces is probably in order.
User namespaces allow per-namespace mappings of user and group IDs. In the context of containers, this means that users and groups may have privileges for certain operations inside the container without having those privileges outside the container. (In other words, a process's set of capabilities for operations inside a user namespace can be quite different from its set of capabilities in the host system.) One of the specific goals of user namespaces is to allow a process to have root privileges for operations inside the container, while at the same time being a normal unprivileged process on the wider system hosting the container.
To support this behavior, each of a process's user IDs has, in effect, two values: one inside the container and another outside the container. Similar remarks hold true for group IDs. This duality is accomplished by maintaining a per-user-namespace mapping of user IDs: each user namespace has a table that maps user IDs on the host system to corresponding user IDs in the namespace. This mapping is set and viewed by writing and reading the /proc/PID/uid_map pseudo-file, where PID is the process ID of one of the processes in the user namespace. Thus, for example, user ID 1000 on the host system might be mapped to user ID 0 inside a namespace; a process with a user ID of 1000 would thus be a normal user on the host system, but would have root privileges inside the namespace. If no mapping is provided for a particular user ID on the host system, then, within the namespace, the user ID is mapped to the value provided in the file /proc/sys/kernel/overflowuid (the default value in this file is 65534). Our earlier article went into more details of the implementation.
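The lookup described above can be sketched in a few lines of Python. This is only an illustrative model, not kernel code: the three-field line format (first ID inside the namespace, first ID outside it, length of the range) is an assumption taken from later documentation of uid_map rather than from this article, and the function names are hypothetical.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid


def parse_uid_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, length) tuples.

    Assumed line format: "<ID inside namespace> <ID outside namespace> <length>".
    """
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def host_to_namespace(host_uid, entries, overflow=OVERFLOW_UID):
    """Return the user ID seen inside the namespace for a host user ID."""
    for inside, outside, length in entries:
        if outside <= host_uid < outside + length:
            return inside + (host_uid - outside)
    return overflow  # no mapping: fall back to the overflow user ID


def namespace_to_host(ns_uid, entries):
    """Return the host user ID for a namespace user ID, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= ns_uid < inside + length:
            return outside + (ns_uid - inside)
    return None
```

With the single mapping line "0 1000 1", host user ID 1000 appears as 0 inside the namespace, while any other host user ID falls back to the overflow value.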
One further point worth noting is that the description given in the previous paragraph looks at things from the perspective of a single user namespace. However, user namespaces can be nested, with user and group ID mappings applied at each nesting level. This means that a process might have distinct user and group IDs in each of the nested user namespaces in which it is a member.
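To make the nesting concrete, the following sketch walks a user ID outward through a chain of mappings, one per nesting level. It is a simplified model under stated assumptions: each mapping is a list of hypothetical (inside_start, outside_start, length) tuples relating a namespace to its parent, and the ID ranges in the usage note are invented for illustration.

```python
def translate_up(uid, mapping):
    """Map a user ID in a namespace to the ID seen in its parent, or None."""
    for inside, outside, length in mapping:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None


def ids_at_each_level(uid, chain):
    """Return the user ID seen at each nesting level, innermost first.

    `chain` lists one mapping per level, from the innermost namespace
    outward. The walk stops early if some level has no mapping for the ID.
    """
    ids = [uid]
    for mapping in chain:
        uid = translate_up(uid, mapping)
        if uid is None:
            break
        ids.append(uid)
    return ids
```

For example, if an inner namespace maps its ID 0 to ID 1000 in its parent, and the parent maps its IDs 0 to 65535 onto host IDs starting at 100000, then `ids_at_each_level(0, [[(0, 1000, 1)], [(0, 100000, 65536)]])` yields a distinct ID for each of the three levels.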
Eric has assembled a number of namespace-related patch sets for submission in the upcoming 3.8 merge window. Chief among these is the set that completes the main pieces of the user namespace infrastructure. With the changes in this set, unprivileged processes can now create new user namespaces (using clone(CLONE_NEWUSER)). This is safe, says Eric, because:
The point that Eric is making here is that following the work (described in our earlier article) to implement the kuid_t and kgid_t types within the kernel, and the conversion of various calls to capable() to its namespace analog, ns_capable(), having a user ID of zero inside a user namespace no longer grants special privileges outside the namespace. (capable() is the kernel function that checks whether a process has a capability; ns_capable() checks whether a process has a capability inside a namespace.)
The creator of a new user namespace starts off with a full set of permitted and effective capabilities within the namespace, regardless of its user ID or capabilities on the host system. The creating process thus has root privileges, for the purpose of setting up the environment inside the namespace in preparation for the creation or the addition of other processes inside the namespace. Among other things, this means that the (unprivileged) creator of the user namespace (or indeed any process with suitable capabilities inside the namespace) can in turn create all other types of namespaces, such as network, mount, and PID namespaces (those operations require the CAP_SYS_ADMIN capability). Because the effect of creating those namespaces is limited to the members of the user namespace, no damage can be done in the host system.
Other notable user-space changes in Eric's patches include extending the unshare() system call so that it can be employed to create user namespaces, and extensions that allow a process to use the setns() system call to enter an existing user namespace.
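On a kernel that includes these changes, the behavior described above could be probed from user space along the following lines. This is a hedged sketch rather than a reference implementation: it calls unshare() through ctypes, the constants are copied from the kernel headers, the helper names are hypothetical, and the probe simply returns None on systems where creating a user namespace is not permitted.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # from <linux/sched.h>
CAP_SYS_ADMIN = 21          # from <linux/capability.h>


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/PID/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_capability(mask, cap):
    """Return True if capability number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


def probe_new_user_namespace():
    """Fork a child that moves into a new user namespace and reports back.

    Returns (effective UID, effective capability mask) as seen by the
    child inside the new namespace, or None if unshare() failed.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: single-threaded, so unshare(CLONE_NEWUSER) may be used
        os.close(read_fd)
        report = "error"
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                with open("/proc/self/status") as status:
                    mask = parse_cap_eff(status.read())
                report = f"{os.geteuid()} {mask:x}"
        finally:
            os.write(write_fd, report.encode())
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        report = pipe.read()
    os.waitpid(pid, 0)
    if report == "error":
        return None
    euid, mask = report.split()
    return int(euid), int(mask, 16)
```

Since the child writes no ID mapping, the user ID it reports should be the overflow value rather than 0; the capability mask, checked with `has_capability(mask, CAP_SYS_ADMIN)`, is what reflects the privileges held inside the new namespace.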
Looking at some of the other patches in the series gives an idea of just how subtle some of the details are that must be dealt with in order to create a workable implementation of user namespaces. For example, one of the patches deals with the behavior of set-user-ID (and set-group-ID) programs. When a set-user-ID program is executed (via the execve() system call), the effective user ID of the process is changed to match the user ID of the executable file. When a process inside a user namespace executes a set-user-ID program, the effect is to change the process's effective user ID inside the namespace to whatever value was mapped for the file user ID. Returning to the example used above, where user ID 1000 on the host system is mapped to user ID 0 inside the namespace, if a process inside the user namespace executes a set-user-ID program owned by user ID 1000, then the process will assume an effective user ID of 0 (inside the namespace).
However, what should be done if the file user ID has no mapping inside the namespace? One possibility would be for the execve() call to fail. However, Eric's patch implements another approach: the set-user-ID bit is ignored in this case, so that the new program is executed, but the process's effective user ID is left unchanged. Eric's reasoning is that this mirrors the semantics of executing a set-user-ID program that resides on a filesystem that was mounted with the MS_NOSUID flag. Those semantics have been in place since Linux 2.4, so the kernel code paths for this behavior should be well tested.
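The decision just described can be summarized as a small model. The function below is purely illustrative and is not derived from Eric's patch: its name and parameters are hypothetical, and the mapping is represented as assumed (inside_start, outside_start, length) tuples.

```python
def euid_after_exec(current_euid, file_owner_host_uid, setuid_bit,
                    uid_map, nosuid_mount=False):
    """Model the effective user ID, as seen inside the namespace, after execve().

    `uid_map` holds (inside_start, outside_start, length) tuples that
    relate namespace user IDs to host user IDs.
    """
    if not setuid_bit or nosuid_mount:
        return current_euid  # nothing to honor; the program simply runs
    for inside, outside, length in uid_map:
        if outside <= file_owner_host_uid < outside + length:
            # The file owner is mapped: take on its namespace-side user ID.
            return inside + (file_owner_host_uid - outside)
    # Unmapped owner: the set-user-ID bit is ignored, as with MS_NOSUID.
    return current_euid
```

The last line is the case the patch addresses: the program still executes, but without any change of identity.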
Another notable piece of work in Eric's patch set concerns the files in the /proc/PID/ns directory. This directory contains one file for each type of namespace of which the process is a member (thus, for each process, there are the files ipc, mnt, net, pid, user, and uts). These files already serve a couple of purposes. Passing an open file descriptor for one of these files to setns() allows a process to join an existing namespace. Holding an open file descriptor for one of these files, or bind mounting one of the files to some other location in the filesystem, will keep a namespace alive even if all current members of the namespace terminate. Among other things, the latter feature allows the piecemeal construction of the contents of a container.
With a patch in Eric's recent series, a single /proc inode is now created per namespace, and the /proc/PID/ns files are instead implemented as special symbolic links that refer to that inode. The practical upshot is that if two processes are in, say, the same user namespace, then calling stat() on the respective /proc/PID/ns/user files will return the same inode numbers (in the st_ino field of the returned stat structure). This provides a mechanism for discovering if two processes are in the same namespace, a long-requested feature.
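A user-space check based on this mechanism might look like the following sketch. The helper names are hypothetical; comparing the device ID (st_dev) alongside st_ino is an extra precaution assumed here, not something the text above requires.

```python
import os


def same_inode(path_a, path_b):
    """Return True if both paths resolve to the same (device, inode) pair."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)  # stat() follows symlinks
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def same_namespace(pid_a, pid_b, ns_type="user"):
    """Return True if two processes are in the same namespace of `ns_type`.

    `ns_type` is one of the file names found in /proc/PID/ns, such as
    "user", "net", or "pid".
    """
    return same_inode(f"/proc/{pid_a}/ns/{ns_type}",
                      f"/proc/{pid_b}/ns/{ns_type}")
```

A caller needs sufficient permission to examine the other process's /proc/PID/ns entries; otherwise os.stat() raises an exception, which this sketch leaves to the caller to handle.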
This article has covered just the patch set to complete the user namespace implementation. However, at the same time, Eric is pushing a number of related patch sets towards the mainline, including: changes to the networking stack so that user namespace root users can create network namespaces; enhancements and clean-ups of the PID namespace code that, among other things, add unshare() and setns() support for PID namespaces; enhancements to the mount namespace code that allow user namespace root users to call chroot() and to create and manipulate mount namespaces; and a series of patches that add support for user namespaces to a number of filesystems that do not yet provide that support.
It's worth emphasizing one of the points that Eric noted in a documentation patch for the user namespace work, and elaborated on in a private mail. Beyond the practicalities of supporting containers, there is another significant driving force behind the user namespaces work: to free the UNIX/Linux API of the "handcuffs" imposed by set-user-ID and set-group-ID programs. Many of the user-space APIs provided by the kernel are root-only simply to prevent the possibility of accidentally or maliciously distorting the run-time environment of privileged programs, with the effect that those programs are confused into doing something that they were not designed to do. By limiting the effect of root privileges to a user namespace, and allowing unprivileged users to create user namespaces, it now becomes possible to give non-root programs access to interesting functionality that was formerly limited to the root user.
There have been a few Acked-by: mails sent in response to
Eric's patches, and a few small questions, but the patches have otherwise
passed largely without comment, and no one has raised objections. It seems
likely that this is because the patches have been around in one form or
another for a long time, and Eric has gone to considerable effort
to address objections that were raised earlier during the user namespaces
work. Thus, there's a good chance that Eric's pull request to have the
patches merged in the
currently open 3.8 merge window will be successful, and that a complete
implementation of user namespaces is now very close to reality.