Namespaces in operation, part 6: more on user namespaces

By Michael Kerrisk
March 6, 2013

In this article, we continue last week's discussion of user namespaces. In particular, we look in more detail at the interaction of user namespaces and capabilities as well as the combination of user namespaces with other types of namespaces. For the moment at least, this article will conclude our series on namespaces.

User namespaces and capabilities

Each process is associated with a particular user namespace. A process created by a call to fork() or a call to clone() without the CLONE_NEWUSER flag is placed in the same user namespace as its parent process. A process can change its user-namespace membership using setns(), if it has the CAP_SYS_ADMIN capability in the target namespace; in that case, it obtains a full set of capabilities upon entering the target namespace.

On the other hand, a clone(CLONE_NEWUSER) call creates a new user namespace and places the new child process in that namespace. This call also establishes a parental relationship between the two namespaces: each user namespace (other than the initial namespace) has a parent—the user namespace of the process that created it using clone(CLONE_NEWUSER). A parental relationship between user namespaces is also established when a process calls unshare(CLONE_NEWUSER). The difference is that unshare() places the caller in the new user namespace, and the parent of that namespace is the caller's previous user namespace. As we'll see in a moment, the parental relationship between user namespaces is important because it defines the capabilities that a process may have in a child namespace.
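The unshare() case can be sketched in a few lines of C. This is only an illustration, not code from the article series: the function names current_userns() and show_unshare_effect() are invented here, and the sketch assumes a kernel built with user-namespace support and a single-threaded caller.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Copy the target of /proc/self/ns/user (a string of the form
   "user:[4026531837]") into buf. Returns the string length, or -1 on
   error. (Hypothetical helper, not part of any library.) */
ssize_t
current_userns(char *buf, size_t len)
{
    ssize_t n;

    if (len == 0)
        return -1;
    n = readlink("/proc/self/ns/user", buf, len - 1);
    if (n == -1)
        return -1;
    buf[n] = '\0';          /* readlink() does not null-terminate */
    return n;
}

/* Move the caller into a new user namespace, printing the namespace
   identifier before and after. Returns 0 on success, or -1 on error
   (for example, if the kernel lacks user-namespace support). */
int
show_unshare_effect(void)
{
    char before[64], after[64];

    if (current_userns(before, sizeof(before)) == -1)
        return -1;

    /* The caller itself changes namespace; the namespace it was in
       until now becomes the parent of the new one */
    if (unshare(CLONE_NEWUSER) == -1)
        return -1;

    if (current_userns(after, sizeof(after)) == -1)
        return -1;

    printf("before unshare(): %s\n", before);
    printf("after unshare():  %s\n", after);
    return 0;
}
```

Calling show_unshare_effect() from main() should print two different identifiers when the call succeeds; with clone(CLONE_NEWUSER), by contrast, only the child would see the new identifier.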

Each process also has three associated sets of capabilities: permitted, effective, and inheritable. The capabilities(7) manual page describes these three sets in some detail. In this article, it is mainly the effective capability set that is of interest to us. This set determines a process's ability to perform privileged operations.

User namespaces change the way in which (effective) capabilities are interpreted. First, having a capability inside a particular user namespace allows a process to perform operations only on resources governed by that namespace; we say more on this point below, when we talk about the interaction of user namespaces with other types of namespaces. In addition, whether or not a process has capabilities in a particular user namespace depends on its namespace membership and the parental relationship between user namespaces. The rules are as follows:

1. A process has a capability inside a user namespace if it is a member of the namespace and that capability is present in its effective capability set. A process may obtain capabilities in its effective set in a number of ways. The most common reasons are that it executed a program that conferred capabilities (a set-user-ID program or a program that has associated file capabilities) or it is the child of a call to clone(CLONE_NEWUSER), which automatically obtains a full set of capabilities.

2. If a process has a capability in a user namespace, then it has that capability in all child (and further removed descendant) namespaces as well. Put another way: creating a new user namespace does not isolate the members of that namespace from the effects of privileged processes in a parent namespace.

3. When a user namespace is created, the kernel records the effective user ID of the creating process as being the "owner" of the namespace. A process whose effective user ID matches that of the owner of a user namespace and which is a member of the parent namespace has all capabilities in the namespace. By virtue of the previous rule, those capabilities propagate down into all descendant namespaces as well. This means that after creation of a new user namespace, other processes owned by the same user in the parent namespace have all capabilities in the new namespace.
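The three rules can be expressed as a small user-space model. This is a toy sketch written for illustration, not kernel code: the struct and function names are hypothetical, and user IDs are compared directly, ignoring ID mapping.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Toy stand-ins for kernel state; these types are hypothetical */
struct user_ns {
    const struct user_ns *parent;   /* NULL for the initial namespace */
    uid_t owner;                    /* Effective UID of the creator */
};

struct process {
    const struct user_ns *ns;       /* Namespace the process belongs to */
    uid_t euid;                     /* Effective user ID */
    uint64_t effective;             /* Effective capability set (bit mask) */
};

/* Does 'p' have capability number 'cap' in user namespace 'target'? */
bool
has_cap_in_ns(const struct process *p, const struct user_ns *target, int cap)
{
    const struct user_ns *ns;

    /* Walk from the target up toward the initial namespace */
    for (ns = target; ns != NULL; ns = ns->parent) {

        /* Rule 1 (ns is the target) or rule 2 (ns is an ancestor of
           the target): the process is a member of this namespace, so
           its effective set decides */
        if (ns == p->ns)
            return ((p->effective >> cap) & 1) != 0;

        /* Rule 3, propagated downward by rule 2: the process is in
           the parent of this namespace and its effective UID matches
           the namespace's owner */
        if (ns->parent == p->ns && ns->owner == p->euid)
            return true;
    }

    /* The process's namespace is neither the target nor one of its
       ancestors */
    return false;
}
```

In this model, an unprivileged process in the initial namespace (empty effective set, effective UID 1000) has every capability in a child namespace owned by UID 1000, while a process with a full effective set in one child namespace has no capabilities in a sibling namespace.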

We can demonstrate the third rule with the help of a small program, userns_setns_test.c. This program takes one command-line argument: the pathname of a /proc/PID/ns/user file that identifies a user namespace. It creates a child in a new user namespace and then both the parent (which remains in the same user namespace as the shell that was used to invoke the program) and the child attempt to join the namespace specified on the command line using setns(); as noted above, setns() requires that the caller have the CAP_SYS_ADMIN capability in the target namespace.
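The setns() step that both processes attempt might look roughly like the following. This is a minimal sketch rather than an excerpt from userns_setns_test.c; the function name join_userns() is an assumption made for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Try to join the user namespace referred to by 'path' (a
   /proc/PID/ns/user file). Returns 0 on success, or -1 with errno
   set. (Hypothetical helper.) */
int
join_userns(const char *path)
{
    int fd, rc, saved_errno;

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* Passing CLONE_NEWUSER makes the call fail unless 'fd' really
       refers to a user namespace. The call succeeds only if the caller
       has CAP_SYS_ADMIN in the target namespace; otherwise it fails
       with EPERM. */
    rc = setns(fd, CLONE_NEWUSER);

    saved_errno = errno;    /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return rc;
}
```

A caller would report the outcome with something like perror("setns") when join_userns() returns -1.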

For our demonstration, we use this program in conjunction with the userns_child_exec.c program developed in the previous article in this series. First, we use that program to start a shell (we use ksh, simply to create a distinctively named process) running in a new user namespace:

    $ id -u
    1000
    $ readlink /proc/$$/ns/user       # Obtain ID for initial namespace
    user:[4026531837]
    $ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' ksh
    ksh$ echo $$                      # Obtain PID of shell
    528
    ksh$ readlink /proc/$$/ns/user    # This shell is in a new namespace
    user:[4026532318]

Now, we switch to a separate terminal window, to a shell running in the initial namespace, and run our test program:

    $ readlink /proc/$$/ns/user       # Verify that we are in parent namespace
    user:[4026531837]
    $ ./userns_setns_test /proc/528/ns/user
    parent: readlink("/proc/self/ns/user") ==> user:[4026531837]
    parent: setns() succeeded

    child:  readlink("/proc/self/ns/user") ==> user:[4026532319]
    child:  setns() failed: Operation not permitted

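The following diagram shows the parental relationships between the various processes (black arrows) and namespaces (blue arrows) that have been created:

[Diagram: the initial user namespace (4026531837) contains the two shells and the parent userns_setns_test process; it is the parent of namespace 4026532318, which contains ksh, and of namespace 4026532319, which contains the child of userns_setns_test.]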

Looking at the output of the readlink commands at the start of each shell session, we can see that the parent process created when the userns_setns_test program was run is in the initial user namespace (4026531837). (As noted in an earlier article in this series, these numbers are i-node numbers for symbolic links in the /proc/PID/ns directory.) As such, by rule three above, since the parent process had the same effective user ID (1000) as the process that created the new user namespace (4026532318), it had all capabilities in that namespace, including CAP_SYS_ADMIN; thus the setns() call in the parent succeeds.

On the other hand, the child process created by userns_setns_test is in a different namespace (4026532319)—in effect, a sibling namespace of the namespace where the ksh process is running. As such, the second of the rules described above does not apply, because that namespace is not an ancestor of namespace 4026532318. Thus, the child process does not have the CAP_SYS_ADMIN capability in that namespace and the setns() call fails.

Combining user namespaces with other types of namespaces

Creating namespaces other than user namespaces requires the CAP_SYS_ADMIN capability. On the other hand, creating a user namespace requires (since Linux 3.8) no capabilities, and the first process in the namespace gains a full set of capabilities (in the new user namespace). This means that that process can now create any other type of namespace using a second call to clone().

However, this two-step process is not necessary. It is also possible to include additional CLONE_NEW* flags in the same clone() (or unshare()) call that employs CLONE_NEWUSER to create the new user namespace. In this case, the kernel guarantees that the CLONE_NEWUSER flag is acted upon first, creating a new user namespace in which the to-be-created child has all capabilities. The kernel then acts on all of the remaining CLONE_NEW* flags, creating corresponding new namespaces and making the child a member of all of those namespaces.

Thus, for example, an unprivileged process can make a call of the following form to create a child process that is a member of both a new user namespace and a new UTS namespace:

    clone(child_func, stackp, CLONE_NEWUSER | CLONE_NEWUTS, arg);
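Filled out into something that compiles, the call might look like the sketch below. The stack size, the trivial child function, and the name run_in_new_namespaces() are assumptions made for this example; SIGCHLD is added to the flags so that the parent can wait for the child in the usual way.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)    /* Arbitrary size for the child stack */

/* Start function for the cloned child; it runs inside the new user
   and UTS namespaces */
static int
child_func(void *arg)
{
    (void) arg;
    return 0;               /* A real program might exec a shell here */
}

/* Create a child in new user and UTS namespaces and wait for it.
   Returns the child's exit status, or -1 on error. */
int
run_in_new_namespaces(void)
{
    char *stack;
    pid_t pid;
    int status;

    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so we pass a
       pointer to the top of the allocated region */
    pid = clone(child_func, stack + STACK_SIZE,
                CLONE_NEWUSER | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1)
        status = -1;
    free(stack);

    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

No privilege is needed to call run_in_new_namespaces(); if the kernel lacks user-namespace support, clone() fails and the function returns -1.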

We can use our userns_child_exec program to perform a clone() call equivalent to the above and execute a shell in the child process. The following command specifies the creation of a new UTS namespace (-u), and a new user namespace (-U) in which both user and group ID 1000 are mapped to 0:

    $ uname -n           # Display hostname for later reference
    antero
    $ ./userns_child_exec -u -U -M '0 1000 1' -G '0 1000 1' bash

As expected, the shell process has a full set of permitted and effective capabilities:

    $ id -u              # Show effective user and group ID of shell
    0
    $ id -g
    0
    $ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff

In the above output, the hexadecimal value 1fffffffff represents a capability set in which all 37 of the currently available Linux capabilities are enabled.
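The count of 37 can be checked by counting the bits that are set in the mask. The helper names in the following sketch are invented for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a capability mask as printed in /proc/PID/status, for example
   "0000001fffffffff" */
uint64_t
parse_cap_mask(const char *hex)
{
    return strtoull(hex, NULL, 16);
}

/* Count the capabilities (set bits) in a mask */
int
count_caps(uint64_t mask)
{
    int n = 0;

    while (mask != 0) {
        mask &= mask - 1;   /* Clear the lowest set bit */
        n++;
    }
    return n;
}

/* Is capability number 'cap' (0 to 63) present in the mask? */
int
has_cap(uint64_t mask, int cap)
{
    return (int) ((mask >> cap) & 1);
}
```

For the mask shown above, count_caps() yields 37, and has_cap() reports that capability number 21 (CAP_SYS_ADMIN) is present.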

We can now go on to modify the hostname—one of the global resources isolated by UTS namespaces—using the standard hostname command; that operation requires the CAP_SYS_ADMIN capability. First, we set the hostname to a new value, and then we review that value with the uname command:

    $ hostname bizarro     # Update hostname in this UTS namespace
    $ uname -n             # Verify the change
    bizarro

Switching to another terminal window—one that is running in the initial UTS namespace—we then check the hostname in that UTS namespace:

    $ uname -n             # Hostname in original UTS namespace is unchanged
    antero

From the above output, we can see that the change of hostname in the child UTS namespace is not visible in the parent UTS namespace.
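The same experiment can be carried out programmatically. In the following sketch, a forked child enters new user and UTS namespaces using unshare() and changes the hostname there, after which the parent checks that its own hostname is unchanged. The function name and its return-value convention are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a hostname change made in new user and UTS namespaces
   was invisible to the caller, 0 if that could not be demonstrated
   (for example, because the namespaces could not be created), and -1
   on other errors. (Hypothetical helper.) */
int
hostname_is_isolated(const char *new_name)
{
    struct utsname before, after;
    pid_t pid;
    int status;

    if (uname(&before) == -1)
        return -1;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                     /* Child */
        struct utsname uts;

        /* The child gains all capabilities in the new user namespace,
           which owns the new UTS namespace, so sethostname() is
           permitted */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) == -1)
            _exit(1);
        if (sethostname(new_name, strlen(new_name)) == -1)
            _exit(1);
        if (uname(&uts) == -1 || strcmp(uts.nodename, new_name) != 0)
            _exit(1);
        _exit(0);
    }

    /* Parent: still in the original UTS namespace */
    if (waitpid(pid, &status, 0) == -1 || uname(&after) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return 0;
    return strcmp(before.nodename, after.nodename) == 0 ? 1 : 0;
}
```

A call such as hostname_is_isolated("bizarro") requires no privilege in the calling process.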

Capabilities revisited

Although the kernel grants all capabilities to the initial process in a user namespace, this does not mean that process then has superuser privileges within the wider system. (It may, however, mean that unprivileged users now have access to exploits in kernel code that was formerly accessible only to root, as this mail on a vulnerability in tmpfs mounts notes.) When a new IPC, mount, network, PID, or UTS namespace is created via clone() or unshare(), the kernel records the user namespace of the creating process against the new namespace. Whenever a process operates on global resources governed by a namespace, permission checks are performed according to the process's capabilities in the user namespace that the kernel associated with that namespace.

For example, suppose that we create a new user namespace using clone(CLONE_NEWUSER). The resulting child process will have a full set of capabilities in the new user namespace, which means that it will, for example, be able to create other types of namespaces and be able to change its user and group IDs to other IDs that are mapped in the namespace. (In the previous article in this series, we saw that only a privileged process in the parent user namespace can create mappings to IDs other than the effective user and group ID of the process that created the namespace, so there is no security loophole here.)

On the other hand, the child process would not be able to mount a filesystem. The child process is still in the initial mount namespace, and in order to mount a filesystem in that namespace, it would need to have capabilities in the user namespace associated with that mount namespace (i.e., it would need capabilities in the initial user namespace), which it does not have. Analogous statements apply for the global resources isolated by IPC, network, PID, and UTS namespaces.

Furthermore, the child process would not be able to perform privileged operations that require capabilities that are not (currently) governed by namespaces. Thus, for example, the child could not do things such as raising its hard resource limits, setting the system time, setting process priorities, or loading kernel modules~~, or rebooting the system~~. All of those operations require capabilities that sit outside the user namespace hierarchy; in effect, those operations require that the caller have capabilities in the initial user namespace.

By isolating the effect of capabilities to namespaces, user namespaces thus deliver on the promise of safely allowing unprivileged users access to functionality that was formerly limited to the root user. This in turn creates interesting possibilities for new kinds of user-space applications. For example, it now becomes possible for unprivileged users to run Linux containers without root privileges, to construct Chrome-style sandboxes without the use of set-user-ID-root helpers, to implement fakeroot-type applications without employing dynamic-linking tricks, and to implement chroot()-based applications for process isolation. Barring kernel bugs, applications that employ user namespaces to access privileged kernel functionality are more secure than traditional applications based on set-user-ID-root: with a user-namespace-based approach, even if an application is compromised, it does not have any privileges that can be used to do damage in the wider system.

The author would like to thank Eric Biederman for answering many questions that came up as he experimented with namespaces during the course of writing this article series.


Namespaces in operation, part 6: more on user namespaces

Posted Mar 6, 2013 18:04 UTC (Wed) by johill (subscriber, #25196) [Link] (2 responses)

Hmm, what are those namespace IDs (4026531837 and 4026532318)? Those numbers are "prettier" in hex (0xeffffffd and 0xf00001de respectively) but it's not obvious (to me) where they come from. Did I miss that in an earlier article?

Namespaces in operation, part 6: more on user namespaces

Posted Mar 6, 2013 20:06 UTC (Wed) by nix (subscriber, #2304) [Link]

They're inode numbers of /proc/$pid/ns/user. (This is an identity for the namespace because all members of a given namespace have, in effect, all their "ns/user"s hardlinked together.)

Namespaces in operation, part 6: more on user namespaces

Posted Mar 6, 2013 20:09 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

I believe you did miss it in an earlier article.

Those numbers are proc inode numbers, which happen to be recorded in the symlink.

If you need persistent names for any kind of namespace it is recommended that you do mount --bind /proc/PID/ns/user /a/filesystem/path.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 6, 2013 18:45 UTC (Wed) by justincormack (subscriber, #70439) [Link] (2 responses)

reboot is actually a special case now (though not apparently documented in the man pages). The reboot syscall in a pid namespace just sends a SIGHUP to the namespace's init process rather than performing a real reboot, so in this sense it is also a namespace capability.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 6, 2013 23:41 UTC (Wed) by mkerrisk (subscriber, #1978) [Link] (1 responses)

> reboot is actually a special case now

Thanks for that pointer; I'd missed that 3.9 change. I've added a strikethrough in the article. And something will make its way into the man page soonish.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 7, 2013 11:33 UTC (Thu) by lyda (subscriber, #7429) [Link]

As an aside, this is why the LWN comment section is the best comment section online. In a short exchange a missing piece of info was noted, an explanation of what happens was offered, the author of the article addressed the concern and a project gets a patch that makes it better.

Not only was everything "civil," everything was positive and productive.

Filesystems and security.

Posted Mar 6, 2013 20:21 UTC (Wed) by ebiederm (subscriber, #35028) [Link] (1 responses)

It is true that user namespaces allow userspace to interact with more code, and thus somewhat increase the surface one has to worry about for kernel exploits.

However, most filesystems can not be mounted with just user namespace permissions. Even for the filesystems you can mount with user namespace permissions, remount is not supported. So while the cited tmpfs issue could have been a problem if it occurred elsewhere in tmpfs, it was not exploitable with user namespaces.

The recent work on converting all of the filesystems for user namespace support is essentially a constructive, compiler-checked proof that all of the uids and gids that come from userspace are properly converted into kuids, making it safe to use existing filesystems in the presence of multiple user namespaces.

Filesystems and security.

Posted Mar 14, 2013 15:38 UTC (Thu) by impossible7 (guest, #89863) [Link]

> However, most filesystems can not be mounted with just user namespace permissions.

Are there any plans to change this? If not, is there a reason for that?

It would be nice if an ordinary user could create user and mount namespaces and then e.g. mount an ext4 fs from a block device that they own.

Resource limits

Posted Mar 6, 2013 20:39 UTC (Wed) by ebiederm (subscriber, #35028) [Link] (4 responses)

I expect there are some system administrators out there looking at this and asking: What resource limits are in place that can be used to limit user namespaces?

The only answer at this point is memory control groups. Everything takes up memory, and limiting the amount of memory used limits everything else. Unfortunately, the classic Unix rlimit-based resource limits fall down when multiple processes are involved.

Resource limits

Posted Mar 7, 2013 11:36 UTC (Thu) by lyda (subscriber, #7429) [Link] (3 responses)

What options exist or interact with this? I worked on a system that doled out memory, cpu and disk usage to a subtree of processes in a container. That was mainly handled in userland; is there work being done to manage this within the kernel or is the feeling that userland is the correct place for this?

Resource limits

Posted Mar 8, 2013 0:52 UTC (Fri) by ebiederm (subscriber, #35028) [Link]

At a very basic level I don't see anything in any of the namespaces really being any different from any other process. The big differences are that it is now possible to allocate kinds of resources that no one has added rlimits for, and that if /etc/subuid is set up and your users have multiple uids, per-user limits go from mostly useless to totally useless.

To my knowledge there is not much in the control groups that is namespace or container specific. Although I seem to remember a network memory controller that had a connection with the network namespace.

Beyond that it all depends on how heavy a sandbox you want to run. Certainly with ptrace and a firm hand you can implement very fine control on processes.

When done well I think the lightest weight solutions will live in the kernel. Certainly the cpu controller seems to live up to that notion.

But honestly whatever works and whatever is easiest.

If there is any consensus of feeling on the matter it is that cgroups are ugly but they are the best general solution we have to the problem so far.

Beyond that it looks like most of the time resource consumption is not a problem for most people, with the result that technology to implement and enforce resource limits is frequently neglected.

I hope that helps a little.

Eric

Resource limits

Posted Mar 11, 2013 18:05 UTC (Mon) by BernardB (subscriber, #47903) [Link] (1 responses)

I'm also interested in having cgroup management supported within namespaces. I've seen a couple of patchsets posted to LKML to attempt to achieve this - most recently from Gao Feng. It seemed to get strong opposition from Tejun Heo though, who was pushing for a userland solution: http://article.gmane.org/gmane.linux.kernel.containers/24825

I haven't seen anything more recent though.

Resource limits

Posted Mar 11, 2013 21:05 UTC (Mon) by ebiederm (subscriber, #35028) [Link]

I don't know if userland vs kernel is the appropriate way to characterize the debate.

But yes there is a question of how unprivileged users can take advantage of the facilities cgroups offer, and how we can integrate cgroup support cleanly into containers.

Last I heard, mount --bind /cgroupfs/my/group /path/to/container/cgroupfs worked as a good approximation to what many people want.

I have not had a chance to look at it in any detail beyond that. It is nothing fundamentally hard; it is just something that someone familiar with all the details needs to spend some time on to iron out.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 7, 2013 13:57 UTC (Thu) by rfrancoise (subscriber, #15508) [Link] (1 responses)

Thank you Michael for this fantastic article series!

Namespaces in operation, part 6: more on user namespaces

Posted Mar 10, 2013 15:19 UTC (Sun) by cstanhop (subscriber, #4740) [Link]

Yes. Thank you! This has been a great, comprehensive introduction to the namespace features.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 8, 2013 17:07 UTC (Fri) by alexl (guest, #19068) [Link] (5 responses)

I tried this in the latest Fedora kernel (3.8.1-201.fc18.x86_64), but it seems that user namespaces don't work:

./userns_child_exec -U
clone: Invalid argument

strace says:
clone(child_stack=0x7020f0, flags=CLONE_NEWUSER|SIGCHLD) = -1 EINVAL (Invalid argument)

Isn't this supposed to work in 3.8.1?

Namespaces in operation, part 6: more on user namespaces

Posted Mar 8, 2013 19:47 UTC (Fri) by ebiederm (subscriber, #35028) [Link]

zcat /proc/config.gz | grep USER_NS

Were user namespaces enabled?

I expect the generic fedora build enables one of the filesystems that has not had a kuid/kgid conversion in 3.8 and thus can not enable user namespaces.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 8, 2013 20:58 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (3 responses)

Looking at the config in the kernel srpm, the user namespace isn't supported even in Rawhide yet (3.9.0-rc1).

Namespaces in operation, part 6: more on user namespaces

Posted Mar 8, 2013 22:28 UTC (Fri) by ebiederm (subscriber, #35028) [Link] (2 responses)

I suspect it will have to wait until 3.10 when XFS stops turning off user namespaces.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 8, 2013 23:07 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

I was actually planning on building my own kernel RPM with user namespaces this weekend. I don't use XFS anywhere, so I'm fine with the tradeoff.

Namespaces in operation, part 6: more on user namespaces

Posted Jul 7, 2013 12:58 UTC (Sun) by Tobu (subscriber, #24111) [Link]

The XFS work isn't merged yet… Does the same sort of complexity exist for Btrfs, which exposes low-level structures through the search ioctl?

Namespaces in operation, part 6: more on user namespaces

Posted Mar 11, 2013 14:44 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (2 responses)

Is there going to be a tool to help manage namespaces (I'd guess from util-linux)?

Namespaces in operation, part 6: more on user namespaces

Posted Mar 11, 2013 20:56 UTC (Mon) by ebiederm (subscriber, #35028) [Link]

In the latest git of util-linux there is nsenter and unshare.

There is code in lxc.

Just for network namespaces there is code in iproute.

There is work underway to get support for multiple uids and gids per user into shadow-utils.

Namespaces in operation, part 6: more on user namespaces

Posted Mar 18, 2013 9:51 UTC (Mon) by Da_Blitz (guest, #50583) [Link]

If none of the other options out there float your boat (LXC, the unshare command, or others), then you may want to try a tool I wrote to play with this from Python, called asylum. It's available at http://code.pocketnix.org/asylum

Namespaces in operation, part 6: more on user namespaces

Posted Jun 27, 2013 19:35 UTC (Thu) by Urhixidur (guest, #91620) [Link] (1 responses)

What happens to the user ID mappings of a process when it switches user namespaces? Are they zapped?

I also presume the owner IDs of files are translated just like the owner IDs of processes? If a process in a user namespace tries to run an executable whose owner ID has not been mapped, will it be denied access?

Namespaces in operation, part 6: more on user namespaces

Posted Feb 13, 2019 9:40 UTC (Wed) by mkerrisk (subscriber, #1978) [Link]

When a process switches user namespaces, its UID map will be the UID map of the user namespace that it moved to.

Running an executable whose UID and GID have not been mapped is fine: since the file notionally has the "overflow IDs", this means that the executing process will need o+x permission on the file to do the exec.