User namespaces progress

By Michael Kerrisk
December 13, 2012

The results of the user namespaces work on Linux have been a long time in coming, probably because they are the most complex of the various namespaces that have been added to the kernel so far. The first pieces of the implementation started appearing when Linux 2.6.23 (released in late 2007) added the CLONE_NEWUSER flag for the clone() system call. However, until now many of the pieces necessary for a complete implementation have remained absent.

We last looked at user namespaces back in April, when Eric Biederman was working to push a raft of patches into the kernel with the goal of bringing the implementation closer to completion. Eric is now engaged in pushing further patches into the kernel with the goal of having a more or less complete implementation of user namespaces in Linux 3.8. Thus, it seems to be time to have another look at this work. First, however, a brief recap of user namespaces is probably in order.

User namespaces allow per-namespace mappings of user and group IDs. In the context of containers, this means that users and groups may have privileges for certain operations inside the container without having those privileges outside the container. (In other words, a process's set of capabilities for operations inside a user namespace can be quite different from its set of capabilities in the host system.) One of the specific goals of user namespaces is to allow a process to have root privileges for operations inside the container, while at the same time being a normal unprivileged process on the wider system hosting the container.

To support this behavior, each of a process's user IDs has, in effect, two values: one inside the container and another outside the container. Similar remarks hold true for group IDs. This duality is accomplished by maintaining a per-user-namespace mapping of user IDs: each user namespace has a table that maps user IDs on the host system to corresponding user IDs in the namespace. This mapping is set and viewed by writing and reading the /proc/PID/uid_map pseudo-file, where PID is the process ID of one of the processes in the user namespace. Thus, for example, user ID 1000 on the host system might be mapped to user ID 0 inside a namespace; a process with a user ID of 1000 would thus be a normal user on the host system, but would have root privileges inside the namespace. If no mapping is provided for a particular user ID on the host system, then, within the namespace, the user ID is mapped to the value provided in the file /proc/sys/kernel/overflowuid (the default value in this file is 65534). Our earlier article went into more details of the implementation.
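As a rough illustration of the mapping table described above, the following Python sketch parses text in the three-column format used by uid_map files (first ID inside the namespace, first ID outside, length of the range) and translates an outside user ID into its value inside the namespace. The function names are invented for this example, and the overflow value is simply hard-coded to the default quoted above rather than read from /proc/sys/kernel/overflowuid.

```python
from typing import List, Tuple

# Default contents of /proc/sys/kernel/overflowuid, as quoted in the text.
OVERFLOW_UID = 65534

# One map line: (first ID inside the namespace, first ID outside, range length).
Extent = Tuple[int, int, int]


def parse_id_map(text: str) -> List[Extent]:
    """Parse uid_map-style text into a list of extents (hypothetical helper)."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 3:
            raise ValueError(f"expected three fields, got: {line!r}")
        inside, outside, count = (int(field) for field in fields)
        extents.append((inside, outside, count))
    return extents


def to_namespace_id(extents: List[Extent], outside_id: int,
                    overflow: int = OVERFLOW_UID) -> int:
    """Translate an outside ID to its value inside the namespace.

    IDs that fall in no extent are reported as the overflow ID.
    """
    for inside, outside, count in extents:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return overflow
```

With the single-line map "0 1000 1", to_namespace_id() yields 0 for host user ID 1000 and the overflow value for every other ID, which matches the example in the paragraph above.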

One further point worth noting is that the description given in the previous paragraph looks at things from the perspective of a single user namespace. However, user namespaces can be nested, with user and group ID mappings applied at each nesting level. This means that a process might have distinct user and group IDs in each of the nested user namespaces in which it is a member.
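The effect of nesting can be sketched by applying one mapping per level, starting from the outermost namespace. This is only a toy model under stated assumptions: each map is given as a list of (inside, outside, length) tuples, where the "outside" IDs of a nested namespace are expressed in terms of its parent, and an ID with no mapping at some level is reported as None from that level inward. The names map_down() and ids_per_level() are hypothetical.

```python
from typing import List, Optional, Tuple

# (first ID inside the namespace, first ID in the parent namespace, length)
Extent = Tuple[int, int, int]


def map_down(extents: List[Extent], outside_id: int) -> Optional[int]:
    """Translate a parent-namespace ID into the child namespace, if mapped."""
    for inside, outside, count in extents:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return None


def ids_per_level(chain: List[List[Extent]], host_id: int) -> List[Optional[int]]:
    """Return the ID seen at each nesting level, outermost namespace first."""
    seen = []
    current: Optional[int] = host_id
    for extents in chain:
        # Once an ID is unmapped at one level it stays unmapped further in.
        current = map_down(extents, current) if current is not None else None
        seen.append(current)
    return seen
```

For instance, if an outer namespace maps host IDs 1000-1009 to 0-9 and a namespace nested inside it maps the outer IDs 0-4 to 500-504, then ids_per_level() reports host ID 1000 as 0 in the outer namespace and 500 in the inner one.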

Eric has assembled a number of namespace-related patch sets for submission in the upcoming 3.8 merge window. Chief among these is the set that completes the main pieces of the user namespace infrastructure. With the changes in this set, unprivileged processes can now create new user namespaces (using clone(CLONE_NEWUSER)). This is safe, says Eric, because:

> Now that we have been through every permission check in the kernel having uid == 0 and gid == 0 in your local user namespace no longer adds any special privileges. Even having a full set of caps in your local user namespace is safe because capabilities are relative to your local user namespace, and do not confer unexpected privileges.

The point that Eric is making here is that following the work (described in our earlier article) to implement the kuid_t and kgid_t types within the kernel, and the conversion of various calls to capable() to its namespace analog, ns_capable(), having a user ID of zero inside a user namespace no longer grants special privileges outside the namespace. (capable() is the kernel function that checks whether a process has a capability; ns_capable() checks whether a process has a capability inside a namespace.)

The creator of a new user namespace starts off with a full set of permitted and effective capabilities within the namespace, regardless of its user ID or capabilities on the host system. The creating process thus has root privileges, for the purpose of setting up the environment inside the namespace in preparation for the creation or the addition of other processes inside the namespace. Among other things, this means that the (unprivileged) creator of the user namespace (or indeed any process with suitable capabilities inside the namespace) can in turn create all other types of namespaces, such as network, mount, and PID namespaces (those operations require the CAP_SYS_ADMIN capability). Because the effect of creating those namespaces is limited to the members of the user namespace, no damage can be done in the host system.
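One way to observe the capability set described above is the CapEff field of /proc/PID/status, which holds the effective capabilities as a hexadecimal bit mask. The sketch below decodes that field; the capability numbers are the ones defined in <linux/capability.h>, while the helper names are made up for this example. Note that the mask says nothing about which user namespace the capabilities apply to.

```python
# Capability bit numbers from <linux/capability.h>.
CAP_NET_BIND_SERVICE = 10
CAP_SYS_ADMIN = 21


def effective_caps(status_text: str) -> int:
    """Extract the CapEff bit mask from the text of a /proc/PID/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    """Report whether capability number `cap` is set in `mask`."""
    return bool(mask & (1 << cap))


def read_own_effective_caps() -> int:
    """Read the calling process's effective capability mask (Linux only)."""
    with open("/proc/self/status") as status:
        return effective_caps(status.read())
```

Run inside a freshly created user namespace, has_cap(read_own_effective_caps(), CAP_SYS_ADMIN) would be expected to be true even for a process that is unprivileged on the host; those capabilities are meaningful only for operations governed by that namespace.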

Other notable user-space changes in Eric's patches include extending the unshare() system call so that it can be employed to create user namespaces, and extensions that allow a process to use the setns() system call to enter an existing user namespace.
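A minimal sketch of creating a user namespace with unshare() might look like the following. It makes several assumptions: the C library is reached through ctypes, the kernel permits unprivileged user namespace creation, and the process is single-threaded. The write to /proc/self/setgroups is required by kernels from Linux 3.19 onward before an unprivileged process may write gid_map, which postdates the patches discussed here. The function names are hypothetical, and entering an existing namespace with setns() is not shown.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def single_id_map(inside_id: int, outside_id: int) -> str:
    """Build a one-line map covering a single ID, in uid_map file format."""
    return f"{inside_id} {outside_id} 1\n"


def become_namespace_root() -> None:
    """Move the calling process into a new user namespace as ID 0 (sketch)."""
    # Capture the IDs first: until a map is written, the process's IDs
    # appear as the overflow values inside the new namespace.
    uid, gid = os.getuid(), os.getgid()

    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWUSER) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

    # Assumption: on Linux 3.19 and later, setgroups() must be denied
    # before an unprivileged process can write gid_map.
    try:
        with open("/proc/self/setgroups", "w") as setgroups:
            setgroups.write("deny")
    except FileNotFoundError:
        pass  # older kernels do not have this file

    with open("/proc/self/uid_map", "w") as uid_map:
        uid_map.write(single_id_map(0, uid))
    with open("/proc/self/gid_map", "w") as gid_map:
        gid_map.write(single_id_map(0, gid))
```

After a successful call to become_namespace_root(), os.getuid() in the same process should report 0, although the process remains an ordinary user as far as the rest of the system is concerned.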

Looking at some of the other patches in the series gives an idea of just how subtle some of the details are that must be dealt with in order to create a workable implementation of user namespaces. For example, one of the patches deals with the behavior of set-user-ID (and set-group-ID) programs. When a set-user-ID program is executed (via the execve() system call), the effective user ID of the process is changed to match the user ID of the executable file. When a process inside a user namespace executes a set-user-ID program, the effect is to change the process's effective user ID inside the namespace to whatever value was mapped for the file user ID. Returning to the example used above, where user ID 1000 on the host system is mapped to user ID 0 inside the namespace, if a process inside the user namespace executes a set-user-ID program owned by user ID 1000, then the process will assume an effective user ID of 0 (inside the namespace).

However, what should be done if the file user ID has no mapping inside the namespace? One possibility would be for the execve() call to fail. However, Eric's patch implements another approach: the set-user-ID bit is ignored in this case, so that the new program is executed, but the process's effective user ID is left unchanged. Eric's reasoning is that this mirrors the semantics of executing a set-user-ID program that resides on a filesystem that was mounted with the MS_NOSUID flag. Those semantics have been in place since Linux 2.4, so the kernel code paths for this behavior should be well tested.
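The rule described in the last two paragraphs can be summarized in a small model. This is not kernel code, just a user-space sketch of the decision: the map is a list of (inside, outside, length) tuples, file owners are given as host (outside) IDs, and the function name euid_after_exec() is invented for this example.

```python
from typing import List, Optional, Tuple

# (first ID inside the namespace, first ID outside, length)
Extent = Tuple[int, int, int]


def map_down(extents: List[Extent], outside_id: int) -> Optional[int]:
    """Translate an outside ID into the namespace, or None if unmapped."""
    for inside, outside, count in extents:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return None


def euid_after_exec(extents: List[Extent], current_euid: int,
                    file_owner_outside: int, setuid_bit: bool) -> int:
    """Model the effective user ID (as seen in the namespace) after execve()."""
    if not setuid_bit:
        return current_euid
    mapped = map_down(extents, file_owner_outside)
    if mapped is None:
        # Owner has no mapping: the set-user-ID bit is ignored, much as on
        # a filesystem mounted with MS_NOSUID, and the program still runs.
        return current_euid
    return mapped
```

With host user ID 1000 mapped to 0, a set-user-ID program owned by host ID 1000 yields an effective ID of 0 in this model, while one owned by an unmapped ID leaves the effective ID as it was.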

Another notable piece of work in Eric's patch set concerns the files in the /proc/PID/ns directory. This directory contains one file for each type of namespace of which the process is a member (thus, for each process, there are the files ipc, mnt, net, pid, user, and uts). These files already serve a couple of purposes. Passing an open file descriptor for one of these files to setns() allows a process to join an existing namespace. Holding an open file descriptor for one of these files, or bind mounting one of the files to some other location in the filesystem, will keep a namespace alive even if all current members of the namespace terminate. Among other things, the latter feature allows the piecemeal construction of the contents of a container. With this patch in Eric's recent series, a single /proc inode is now created per namespace, and the /proc/PID/ns files are instead implemented as special symbolic links that refer to that inode. The practical upshot is that if two processes are in, say, the same user namespace, then calling stat() on the respective /proc/PID/ns/user files will return the same inode numbers (in the st_ino field of the returned stat structure). This provides a mechanism for discovering if two processes are in the same namespace, a long-requested feature.
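The inode comparison just described might be sketched as follows. The helper names are hypothetical; comparing the device ID (st_dev) as well as the inode number is an extra precaution taken in this sketch rather than something the text above requires. Inspecting another user's process may fail with a permission error.

```python
import os


def ns_link(pid: int, ns_type: str) -> str:
    """Path of the /proc/PID/ns entry for one namespace type, e.g. 'user'."""
    return f"/proc/{pid}/ns/{ns_type}"


def same_object(path_a: str, path_b: str) -> bool:
    """True if both paths resolve to the same device and inode."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def in_same_namespace(pid_a: int, pid_b: int, ns_type: str = "user") -> bool:
    """Check whether two processes share a namespace of the given type."""
    return same_object(ns_link(pid_a, ns_type), ns_link(pid_b, ns_type))
```

A call such as in_same_namespace(os.getpid(), 1) would then indicate whether the caller shares a user namespace with the init process, on a kernel that provides the /proc/PID/ns/user file.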

This article has covered just the patch set to complete the user namespace implementation. However, at the same time, Eric is pushing a number of related patch sets towards the mainline, including: changes to the networking stack so that user namespace root users can create network namespaces; enhancements and clean-ups of the PID namespace code that, among other things, add unshare() and setns() support for PID namespaces; enhancements to the mount namespace code that allow user namespace root users to call chroot() and to create and manipulate mount namespaces; and a series of patches that add support for user namespaces to a number of file systems that do not yet provide that support.

It's worth emphasizing one of the points that Eric noted in a documentation patch for the user namespace work, and elaborated on in a private mail. Beyond the practicalities of supporting containers, there is another significant driving force behind the user namespaces work: to free the UNIX/Linux API of the "handcuffs" imposed by set-user-ID and set-group-ID programs. Many of the user-space APIs provided by the kernel are root-only simply to prevent the possibility of accidentally or maliciously distorting the run-time environment of privileged programs, with the effect that those programs are confused into doing something that they were not designed to do. By limiting the effect of root privileges to a user namespace, and allowing unprivileged users to create user namespaces, it now becomes possible to give non-root programs access to interesting functionality that was formerly limited to the root user.

There have been a few Acked-by: mails sent in response to Eric's patches, and a few small questions, but the patches have otherwise passed largely without comment, and no one has raised objections. It seems likely that this is because the patches have been around in one form or another for a considerable period, and Eric has gone to considerable effort to address objections that were raised earlier during the user namespaces work. Thus, it seems that there's a good chance that Eric's pull request to have the patches merged in the currently open 3.8 merge window will be successful, and that a complete implementation of user namespaces is now very close to reality.

Posted Dec 13, 2012 19:41 UTC (Thu) by luto (subscriber, #39314)

This article just inspired me to do some review :)

http://lkml.kernel.org/r/<50CA2B55.5070402@amacapital.net>

(That link is currently broken. I think it will come to life soon.)

Posted Jan 3, 2013 1:56 UTC (Thu) by kevinm (guest, #69913)

So, what stops an unprivileged process from creating a new user namespace, so acquiring CAP_NET_BIND_SERVICE in the new namespace, then binding a privileged port?

Or does the creation of a new user namespace force the creation of a new namespace of all the other types at the same time?

Posted Jan 3, 2013 3:31 UTC (Thu) by mkerrisk (subscriber, #1978)

> So, what stops an unprivileged process from creating a new user namespace, so acquiring CAP_NET_BIND_SERVICE in the new namespace, then binding a privileged port?

I think the answer there is that while the unprivileged process that creates a user namespace gets all privileges for operations inside the namespace, that doesn't give it privilege for operations on objects (e.g., a network namespace) outside the user namespace. To do what you are thinking of would require creating a network namespace inside the user namespace; you could then bind to privileged ports inside that network namespace.

Posted Jan 3, 2013 5:50 UTC (Thu) by quotemstr (subscriber, #45331)

Besides: is the concept of a "privileged port" even relevant these days? It's a bad idea to make a security decision based on whether an incoming packet has a source port below 1024.

Posted Jan 3, 2013 7:59 UTC (Thu) by ebiederm (subscriber, #35028)

Privileged ports keep those pesky users off of the ports where you run your servers.

Posted Jan 3, 2013 16:42 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

And forces you to run daemons under the freaking root user. Great improvement, yes.

Posted Jan 3, 2013 17:18 UTC (Thu) by andresfreund (subscriber, #69562)

Err, no. They can change their uid away after start without any problems. Or they can get an additional CAP_NET_BIND_SERVICE without all the rest of root's powers.

Posted Jan 3, 2013 17:21 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Ok. I have a Java application that needs to listen on port 80.

How do I do it? I've actually tried multiple ways and all of them failed.

Posted Jan 3, 2013 17:40 UTC (Thu) by man_ls (guest, #15091)

Have you tried setcap?

setcap 'cap_net_bind_service=+ep' /path/to/program

It worked for me but it was not Java; in your case run setcap for the java binary.

Posted Jan 3, 2013 19:15 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

And now all Java programs have this privilege. Which is not that bad, since this restriction is brain-dead in the first place. But it also breaks during updates and is totally non-transparent (NOBODY checks file caps).

You might actually notice that I have an answer in the thread you've linked: http://stackoverflow.com/a/7701793/625001 However, while it works for erlang it somehow fails for Java. Don't ask me why.

Posted Jan 3, 2013 17:44 UTC (Thu) by andresfreund (subscriber, #69562)

It depends a bit on how you want to start java, but in general you can do stuff like:
$ nc -l 234
nc: Permission denied
$ cp `which nc` /tmp/nc && sudo setcap cap_net_bind_service+ep /tmp/nc
$ /tmp/nc -l 234
^C

In many scenarios you probably will end up using something like capsh or pam-cap.

Posted Jan 3, 2013 19:18 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

By copying a binary.

Beautiful. Not.

> In many scenarios you probably will end up using something like capsh or pam-cap.
I'll gladly send you a beer if you can give me a command line that actually works. I have tried all sorts of capsh command variations, but NONE of them works.

Posted Jan 3, 2013 21:05 UTC (Thu) by andresfreund (subscriber, #69562)

> By copying a binary.
> Beautiful. Not.

I only copied the binary because I do *not* want my normal nc to have the capability to bind to root-only ports.

> In many scenarios you probably will end up using something like capsh or pam-cap.

libpam-cap is probably easier for you:
apt-get install libpam-cap
pam-auth-update (enable "capabilities management")
sensible-editor /etc/security/capability.conf
# add "cap_net_bind_service cyberax"

It should be rather similar for other distributions.

Then start a new shell as your user (*not* via sudo "su - cyberax", use sudo -u cyberax, or su - cyberax from *your* user or such, pam_rootok makes a pretty unfortunate shortcut there) and voila:
andres@alap2:~$ sudo -u andres nc -l 434
^C

Posted Jan 3, 2013 21:09 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

That's much better than setting caps for executable files, but still has the problem of non-locality. It's impossible to understand from the daemon's command line that it magically acquires additional caps.

Posted Jan 3, 2013 21:21 UTC (Thu) by andresfreund (subscriber, #69562)

Hm, I don't really see that as a problem. But anyway:

sudo /sbin/capsh --caps=cap_net_bind_service+pei == --user=andres -- -c "nc -l 434"

Yes. Ugly. But it works. (capsh is/was a demo tool)

Posted Jan 3, 2013 21:24 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Doesn't work with Java, just tried it on my system (it's Debian Stable).

Posted Jan 3, 2013 7:57 UTC (Thu) by ebiederm (subscriber, #35028)

When entering a user namespace you drop privileges.

The best way I can explain it is to describe an April Fool's day joke that you can play on your friendly local sysadmin.

Create a binary and call it something like $HOME/bin/su. Have that binary call unshare(CLONE_NEWUSER) and write to /proc/[pid]/uid_map and /proc/[pid]/gid_map so that 0 in the current user namespace maps to the current uid and gid. Have this binary exec $SHELL. No privileges required.

Report that su is working without requiring root privileges in your account.

You can look around in /proc/self/status and see that your uid and gid are 0 and that you have all privs.

Extra points if you can get your local sysadmin to start trying to do things from your $HOME/bin/su, because things really won't work and if you don't realize what is going on you are likely to be quite frustrated. Services won't restart. You can't kill processes owned by other users, etc.

Having a pam module set it up so that the user that looks like root has a distinct uid from everyone else is trickier to set up but could be more entertaining.

So while you have CAP_NET_BIND_SERVICE and can bind to any port in any network namespace you create, creating a network namespace won't do you much good because that network namespace is not connected to any other network namespace.

Posted Jan 4, 2013 2:00 UTC (Fri) by kevinm (guest, #69913)

Thanks Eric.

So from this it sounds like all the other types of namespaces (net, pid, mount...) are "owned" by a user namespace (the one in which they were created). When a permission check is done, it is done using the user namespace that owns that namespace that the relevant resource is in - for example, when I try to bind a privileged port, the permission check is done using the user namespace that owns the current network namespace (not the user namespace of the current process, which might well be different). Does that sound like the right concept?