Filesystem mounts in user namespaces

By Jonathan Corbet
July 29, 2015

User namespaces are a sandbox within which an otherwise unprivileged user can appear to be root. Given a kernel with user namespaces enabled, any user can create such a namespace and, while they are running within it, do anything that root is allowed to do. The intent is that system administrators can allow the creation of these namespaces by unprivileged users, secure in the knowledge that any actions carried out within them will, in the context of the wider system, be constrained by the user's global credentials. There are some limits to what can be done within user namespaces though, with the mounting of filesystems being near the top of the list. A recent attempt to ease that limitation shows just how hard the problem is to solve.
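As a rough illustration of "appearing to be root", the following sketch forks a child that creates a user namespace with unshare(CLONE_NEWUSER) and maps UID 0 inside it to the caller's ordinary UID. The function name userns_root_demo() and its return codes are invented for this example; whether the unshare() call succeeds depends on how the kernel and distribution are configured.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of any library.
 * Returns 0 if the child saw itself as UID 0 inside a new user namespace,
 * 1 if the namespace was created but the UID mapping did not take effect,
 * 2 if unshare(CLONE_NEWUSER) was refused, and 3 if fork/wait failed.
 */
int userns_root_demo(void)
{
    uid_t outer_uid = geteuid();
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 3;

    if (pid == 0) {
        char map[64];
        int fd, len;

        if (unshare(CLONE_NEWUSER) != 0)
            _exit(2);

        /* Map UID 0 in the namespace to our unprivileged outer UID. */
        len = snprintf(map, sizeof(map), "0 %lu 1\n",
                       (unsigned long)outer_uid);
        assert(len > 0 && (size_t)len < sizeof(map));

        fd = open("/proc/self/uid_map", O_WRONLY);
        if (fd < 0 || write(fd, map, (size_t)len) != len)
            _exit(1);
        close(fd);

        /* Inside the namespace, the process now reports UID 0. */
        _exit(getuid() == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return 3;
    return WEXITSTATUS(status);
}
```

A return value of 0 means only that the child looked like root from the inside; seen from the rest of the system, it was still running with the caller's global credentials.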

Linux systems place a lot of trust in their underlying filesystems; the data stored there can, in the form of setuid files, file capabilities, and security labels, be used to empower processes with elevated privileges. When user namespaces were first added, finding a way to safely restrict that trust within a namespace was deemed to be an overly hard problem, so, for the most part, the mount() system call is denied to processes running within user namespaces, even if they are privileged in their namespaces.

There are, of course, exceptions. The restriction applies to new filesystem mounts. Operations that apply to already-mounted filesystems (bind mounts, in particular) are allowed. Even with new mounts, there is an exception for filesystems that, via the FS_USERNS_MOUNT flag, identify themselves as being safe for use within user namespaces. The list of such filesystems is short; it includes /proc, sysfs, ramfs, tmpfs, and not much more. Restricting mounts to these filesystems makes user namespaces safer to deploy, but it also restricts their functionality. In particular, containers running within their own user namespace are limited in how they can change their own filesystem layout.
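The rules in the paragraph above can be modeled in a few lines of user-space C. This is a sketch of the decision as described in this article, not the kernel's code: the demo_ names, the flag value, and the two sample filesystem types are assumptions made for illustration, and the caller is assumed to be privileged within its own user namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative value only; not the kernel's definition of FS_USERNS_MOUNT. */
#define DEMO_FS_USERNS_MOUNT 0x1u

enum demo_mount_kind { DEMO_NEW_MOUNT, DEMO_BIND_MOUNT };

struct demo_fs_type {
    const char *name;
    unsigned int fs_flags;
};

/* Sample types for the model: tmpfs opts in, ext4 does not. */
const struct demo_fs_type demo_tmpfs = { "tmpfs", DEMO_FS_USERNS_MOUNT };
const struct demo_fs_type demo_ext4 = { "ext4", 0 };

/*
 * Decide whether a mount request should be allowed for a caller that is
 * privileged in its own user namespace.
 */
bool demo_may_mount(const struct demo_fs_type *fs,
                    enum demo_mount_kind kind,
                    bool in_initial_userns)
{
    assert(fs != NULL);

    /* Outside of any nested user namespace, the restriction does not apply. */
    if (in_initial_userns)
        return true;

    /* Operations on already-mounted filesystems, such as bind mounts. */
    if (kind == DEMO_BIND_MOUNT)
        return true;

    /* New mounts require the filesystem type to have opted in. */
    return (fs->fs_flags & DEMO_FS_USERNS_MOUNT) != 0;
}
```

With these definitions, demo_may_mount(&demo_ext4, DEMO_NEW_MOUNT, false) is refused, while the same request for demo_tmpfs, or a bind mount of either, is allowed.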

It would be nice to offer more flexibility with regard to the filesystem layout within user namespaces if it were possible. In an attempt to make that happen, Seth Forshee recently posted a patch set laying the groundwork for a relaxation of the restrictions on filesystem mounts within user namespaces. It is only the first of a series of steps in that direction, but even this step raised a number of difficult questions.

The patch set starts by adding a new field (s_user_ns) to the super_block structure for each mounted filesystem to track the namespace from which it was mounted. It also adds a check to ensure that the mounting process has CAP_SYS_ADMIN inside its current namespace — but does not remove the check on whether the filesystem is marked for mounting within namespaces. An early patch also changes the way the "no devices" flag (set, among other ways, with the nodev option to mount) is handled, making it impossible (most of the time) to open device files on filesystems mounted outside of the root user namespace.
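A simplified model of these first patches might look like the sketch below. Only the s_user_ns field name comes from the patch set as described here; the demo_ structures and functions are hypothetical, the initial namespace is represented by a NULL parent pointer, and the "most of the time" exceptions to the device-file rule are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* A NULL parent marks the initial (root) user namespace in this model. */
struct demo_user_ns {
    const struct demo_user_ns *parent;
};

struct demo_super_block {
    const struct demo_user_ns *s_user_ns;  /* namespace that mounted it */
    bool nodev;                            /* "no devices" flag */
};

/* Record the mounting namespace, but only for a suitably privileged caller. */
int demo_mount(struct demo_super_block *sb,
               const struct demo_user_ns *mounter_ns,
               bool cap_sys_admin_in_ns,
               bool nodev_option)
{
    assert(sb != NULL && mounter_ns != NULL);

    if (!cap_sys_admin_in_ns)
        return -EPERM;

    sb->s_user_ns = mounter_ns;
    sb->nodev = nodev_option;
    return 0;
}

/* Device files are usable only on filesystems mounted from the root namespace. */
bool demo_may_open_device(const struct demo_super_block *sb)
{
    assert(sb != NULL && sb->s_user_ns != NULL);

    if (sb->nodev)
        return false;
    return sb->s_user_ns->parent == NULL;
}
```

The point of recording the namespace at mount time is that later checks, like the device-file test here, can ask where a filesystem came from rather than who is currently using it.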

The next step is to deal with the handling of setuid and file capabilities. A user could create an arbitrary filesystem and add a setuid-root executable file to it. That filesystem, when mounted within a user namespace, would enable running as root within that namespace. Thus far, that is not a problem; a user capable of mounting the filesystem within that namespace is already privileged, so respecting setuid and file capabilities adds no privilege-escalation risk. If such a file leaks out of the namespace, though, it could be used to escalate privileges in a namespace where the user is not otherwise privileged. To avoid that possibility, the setuid and file-capabilities code is made to check for a filesystem mounted in a foreign namespace and to ignore setuid and file capabilities in that case.
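The "foreign namespace" test can be sketched as follows. The model assumes, as an interpretation rather than something spelled out above, that a filesystem counts as foreign unless the executing process lives in the namespace that mounted it or in one of that namespace's descendants. All demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A NULL parent marks the initial user namespace in this model. */
struct demo_user_ns {
    const struct demo_user_ns *parent;
};

struct demo_super_block {
    const struct demo_user_ns *s_user_ns;  /* namespace that mounted it */
};

/* True if ns is target itself or one of target's descendants. */
bool demo_ns_within(const struct demo_user_ns *ns,
                    const struct demo_user_ns *target)
{
    for (; ns != NULL; ns = ns->parent) {
        if (ns == target)
            return true;
    }
    return false;
}

/*
 * Should setuid bits and file capabilities on this filesystem be respected
 * for a process running in task_ns?  (Assumed rule; see the text above.)
 */
bool demo_honor_setuid(const struct demo_super_block *sb,
                       const struct demo_user_ns *task_ns)
{
    assert(sb != NULL && sb->s_user_ns != NULL && task_ns != NULL);
    return demo_ns_within(task_ns, sb->s_user_ns);
}
```

Under this rule, a setuid-root binary on a filesystem mounted inside a container is honored within that container, but is treated as an ordinary executable if a process in the initial namespace, or in a sibling container, gets hold of it.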

There was little disagreement over the changes as described thus far. That agreement did not hold into the next part of the patch set, though, which caused the SELinux and Smack security modules to ignore labels attached to files in user-namespace-mounted filesystems. Again, the concern that drove this change was the possibility that a maliciously labeled object could be passed out of the namespace, most likely by passing an open file descriptor to a cooperating process. Simply ignoring security labels makes that particular problem go away.

But, as Casey Schaufler pointed out, that fix comes at the cost of adding other problems. It essentially makes user namespaces incompatible with security modules which, given the increasing use of security modules, is probably not a desirable outcome. A Smack-based system, in the absence of security labels, will probably not respond in a useful way; the most likely outcome is to make all files readable and none writable. With SELinux, instead, as noted by Stephen Smalley, the result would be to make all files inaccessible.

There was a fair amount of discussion about how severe the problem would be and whether security-module support needs to be present in this feature from the beginning. A solution that seemed to gather some support was to look at the security labels for the backing store of any given filesystem. Most filesystems mounted within a user namespace will be loopback mounts to a local file; that file will have its own security labels. Those labels can be propagated into the mounted filesystem and applied to every file found therein. That makes the access policy for files within the filesystem the same as the policy for the file containing the filesystem itself.
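The backing-store idea amounts to a simple substitution, sketched below with hypothetical structures and made-up label strings. The equality test in demo_may_access() is a toy policy standing in for whatever a real security module would do with the label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct demo_sb {
    bool mounted_in_userns;     /* mounted from a non-initial namespace? */
    const char *backing_label;  /* label of the file holding the image */
};

struct demo_inode {
    const struct demo_sb *sb;
    const char *on_disk_label;  /* label stored inside the filesystem */
};

/* For user-namespace mounts, every file inherits the backing file's label. */
const char *demo_effective_label(const struct demo_inode *inode)
{
    assert(inode != NULL && inode->sb != NULL);

    if (inode->sb->mounted_in_userns)
        return inode->sb->backing_label;
    return inode->on_disk_label;
}

/* Toy policy: access is granted only when the labels match exactly. */
bool demo_may_access(const struct demo_inode *inode, const char *allowed_label)
{
    return strcmp(demo_effective_label(inode), allowed_label) == 0;
}
```

A file inside the image that claims a privileged label gains nothing by it: the label written by the image's creator is never consulted, so the file is exactly as accessible as the image file itself.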

There was one other problem which, despite the fact that it is likely to be harder to solve, saw less discussion. Dave Chinner initially raised the issue of using a hostile filesystem to attack the kernel directly. Filesystems have to be able to trust the underlying data they are working with; if that data can be manipulated, there is no end of problems that can result. One possibility is denial-of-service attacks through the creation of hard-link loops in the filesystem metadata, but there are almost certainly more sinister opportunities available as well.
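To make the hard-link-loop case concrete, here is a user-space sketch of the kind of ancestor check needed before a directory is attached to a new parent. The demo_dentry structure is a stand-in invented for this example, with a NULL parent marking the root, and it covers only this one narrow attack.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_dentry {
    struct demo_dentry *parent;  /* NULL for the root of the tree */
    const char *name;
};

/* True if candidate appears somewhere above d in the tree. */
bool demo_is_ancestor(const struct demo_dentry *candidate,
                      const struct demo_dentry *d)
{
    assert(candidate != NULL && d != NULL);

    for (d = d->parent; d != NULL; d = d->parent) {
        if (d == candidate)
            return true;
    }
    return false;
}

/* Attach child under new_parent unless doing so would create a cycle. */
int demo_attach(struct demo_dentry *child, struct demo_dentry *new_parent)
{
    assert(child != NULL && new_parent != NULL);

    if (child == new_parent || demo_is_ancestor(child, new_parent))
        return -ELOOP;

    child->parent = new_parent;
    return 0;
}
```

Given directories /a and /a/b, a corrupted image claiming that a is also a child of b would be rejected with -ELOOP instead of producing a cycle that path walks could never escape.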

Fixing this problem, Dave said, is not a simple task:

> The only way a filesystem would be able to trust what it reads from disk has not been tampered with in a system with untrusted mounts is if it has some kind of cryptographically secure signature in the metadata and the attacker is unable to access the key for that signature. No filesystem we have has that capability and AFAIA there are no plans for any filesystem to implement such tamper detection.

One might be tempted to point out that this problem already exists: in some system configurations, unprivileged users can plug in a USB drive containing a malicious filesystem and mount it now. The difference is that such attacks require physical presence, while an attack via a filesystem image stored in a file can be made from anywhere. That significantly increases the attack surface exposed by the kernel and, according to Dave, is asking for a great deal of trouble.

It is also tempting to think that this problem could be handled by extensive checking of the filesystem at mount time. Even without the delays that would be imposed by, essentially, running fsck every time the filesystem is mounted, though, this idea has problems. Finding every potential attack in a filesystem image will always be a challenging task, even if the filesystem itself is stable. But, in this case, an attacker will have control over the underlying backing store and, thus, will be able to change things at any time. A filesystem that is correct and safe at mount time may no longer be shortly thereafter.
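The consequence for filesystem code is that validation cannot be a one-time event. The sketch below uses a toy on-disk directory entry (one length byte followed by the name), invented for this example, and checks the untrusted length on every read. It shows the defensive pattern only; it does not claim to make a real filesystem safe against a changing backing store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy format: block[0] holds the name length, followed by the name bytes.
 * Copies the name into out as a NUL-terminated string.
 * Returns 0 on success, -1 if the entry does not fit the block or buffer.
 */
int demo_read_name(const unsigned char *block, size_t block_len,
                   char *out, size_t out_len)
{
    size_t name_len;

    assert(block != NULL && out != NULL);

    if (block_len < 1 || out_len < 1)
        return -1;

    /* Fetch the untrusted length once, then check that single value. */
    name_len = block[0];
    if (name_len > block_len - 1 || name_len > out_len - 1)
        return -1;

    memcpy(out, block + 1, name_len);
    out[name_len] = '\0';
    return 0;
}
```

An entry that passes this check at mount time can be rewritten a moment later to claim a length of 200 bytes in a four-byte block. Only a check made at the time of each read catches that, which is where the performance concern comes from.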

There were no proposals for solutions to the hostile-filesystem problem. In truth, it is not at all clear that the kernel can be made safe against attacks via user-manipulated filesystem images or that such a fix, if it were possible, would not bring an unacceptable performance penalty with it. But, in the absence of some sort of assurance that they can be made safe, unprivileged filesystem mounts are unlikely to gain acceptance; even if the feature gets into the kernel, distributions would be likely to disable it.

In the end, the only safe way to allow unprivileged mounts may be the filesystem in user space (FUSE) mechanism, which isolates the filesystem code into a separate user-space process. FUSE is not the ideal solution for those who would like to add native unprivileged filesystem mount support to user namespaces, but, unless the security concerns can be addressed, it may, in the end, be the only viable solution.



Filesystem mounts in user namespaces

Posted Jul 30, 2015 6:13 UTC (Thu) by butlerm (subscriber, #13312) [Link] (13 responses)

It ought to be a goal of every filesystem implementation for any kind of corruption to a filesystem image, whether the filesystem is mounted or not, to yield no side effects outside of the data and metadata returned to clients of that filesystem. No crashing, no hanging, no unbounded resource consumption, etc. Otherwise the entire system is at risk from much more mundane causes than a direct attack.

If those conditions are met, does it really matter if a filesystem mounted inside a user namespace is corrupt in every other way?

Filesystem mounts in user namespaces

Posted Jul 30, 2015 7:34 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> It ought to be a goal of every filesystem implementation

Certainly, one goal of several including high performance and reliability. Developer resources need to be apportioned to these goals wisely. Guarding against maliciously corrupt media is understandably not high on the list of priorities though (well, it might be for isofs, udf, and VFAT).

One way to get more attention in this area is to start fuzz-testing filesystems and seeing if you can cause misbehaviour. Without being able to measure improvements, it is hard to motivate them.

Filesystem mounts in user namespaces

Posted Jul 30, 2015 10:39 UTC (Thu) by Fowl (subscriber, #65667) [Link] (10 responses)

I think a malicious attacker who can control much more of the state of the system, by supplying the value of every read() call and poking at the filesystem from the other side simultaneously, is likely to find it significantly easier to exploit, and it is harder to defend against.

Maliciously different data between two accesses is not something that could happen before. (modulo maybe network filesystems?)

I know *nix doesn't really love mandatory locking, but maybe it could be appropriate to at least initially restrict mounts to devices/files that can be mandatory locked (i.e. limit to a single level of untrusted recursion)?

Filesystem mounts in user namespaces

Posted Jul 30, 2015 17:04 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (9 responses)

> Maliciously different data between two accesses is not something that could happen before. (modulo maybe network filesystems?)

It could happen with any filesystem mounted from a USB device. For example, the "USB stick" could actually be an embedded computer emulating a USB stick, and thus able to return arbitrary data for each read.

Filesystem mounts in user namespaces

Posted Jul 30, 2015 23:09 UTC (Thu) by Fowl (subscriber, #65667) [Link] (1 responses)

Indeed. "BadUSB" type devices can do pretty evil things.

However, in many scenarios (e.g. a cloud server) attaching a USB device is a slightly higher bar, don't you think?

Filesystem mounts in user namespaces

Posted Jul 31, 2015 2:38 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

The ability to attach a USB device is a higher bar. I was simply pointing out that this situation is not unprecedented. There are scenarios where untrusted users with limited, supervised physical access are expected to be able to plug in their own USB storage devices without compromising the entire system. Any data coming from a removable drive ought to be treated as unsanitized input. For that matter, applying the same rule to non-removable drives would help to improve robustness in the face of data corruption.

Filesystem mounts in user namespaces

Posted Aug 6, 2015 15:55 UTC (Thu) by Wol (subscriber, #4433) [Link] (6 responses)

Or indeed, any hard drive ...

Just as USB sticks have been programmed to be malicious, the same attacks have been shown to work by reprogramming a hard drive's firmware.

If you can't trust the hardware, you're stuffed. And most hardware nowadays is "smart" and contains a processor and operating system of some sort - indeed, some hard drives run Linux :-)

Cheers,
Wol

Filesystem mounts in user namespaces

Posted Aug 6, 2015 16:59 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (5 responses)

> Or indeed, any hard drive ...

Sure, but unlike USB, untrusted users are rarely permitted access to a SATA or SCSI port. If the user has that level of physical access then there isn't much anyone can do—they could replace the entire system if they wanted—but a Linux-powered kiosk in a public place should be able to read and write user-provided USB storage devices (or SD cards, etc.) without major security issues due to assumptions in the filesystem layer.

Filesystem mounts in user namespaces

Posted Aug 6, 2015 18:35 UTC (Thu) by raven667 (subscriber, #5198) [Link] (3 responses)

That may not be practically achievable because it requires a large body of code to be perfect, which will never happen. Given that you can't make a large body of code perfect, what's your next plan for containing the risk? How about a small system which handles the USB and filesystem that passes just the file data over an internal network to the main system, so the actual attack vector against the main system is that file-passing network interface, which can be made much simpler than all of the USB and filesystem drivers?

In most cases people will just eat the risk and deal with the fact that kiosks can be broken into if you try. Putting a keylogger on the kiosk might not even reduce its usability enough for the owner to care, and certainly not enough to pay more money to increase security.

Filesystem mounts in user namespaces

Posted Aug 7, 2015 2:01 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (2 responses)

> How about a small system which handles the USB and filesystem that passes just the file data over an internal network to the main system...

You mean like a microkernel? Kidding aside, this wouldn't be too hard to do with User-Mode Linux and FUSE. I think there is already a project to allow FUSE mounts of any supported filesystem with UML; it just needs support for disk images backed by libusb.

Filesystem mounts in user namespaces

Posted Aug 7, 2015 2:58 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> You mean like a microkernel?

Haha, we are already well into microkernel territory when we talk about VMs or containers. The whole idea of microkernels is to use the hardware memory protection to enforce separation between services, which is what VMs and containers do, rather than any specific implementation. You could break the system into sections for hardware interaction, sections for user interaction and backend data storage with VMs to enforce the separation.

Filesystem mounts in user namespaces

Posted Aug 7, 2015 15:56 UTC (Fri) by ewan (guest, #5533) [Link]

>You mean like a microkernel?

If you're building a real kiosk it could even be hardware. Taking a photo printing kiosk for example, the USB/SD card readers could be connected to a simple device (even down to the level of an Arduino-esque microcontroller) that then transfers files over a limited interface to the real system that prints things, drives touchscreens, and takes credit card payments.

Filesystem mounts in user namespaces

Posted Aug 7, 2015 0:58 UTC (Fri) by dlang (guest, #313) [Link]

> Sure, but unlike USB, untrusted users are rarely permitted access to a SATA or SCSI port

You haven't noticed that laptops and wireless access points are shipping with external SATA ports on them? In fact, I've seen a few USB3/eSATA combined ports.

Filesystem mounts in user namespaces

Posted Aug 6, 2015 16:01 UTC (Thu) by Wol (subscriber, #4433) [Link]

> It ought to be a goal of every filesystem implementation

It ought to be a goal of every power station to implement a perpetual motion engine and generate free energy ...

Just because it's a sensible goal, doesn't mean it's actually possible to achieve it.

Cheers,
Wol

Filesystem mounts in user namespaces

Posted Jul 30, 2015 6:59 UTC (Thu) by smurf (subscriber, #17840) [Link]

> There was one other problem which, despite the fact that it is likely
> to be harder to solve, saw less discussion.

What do you mean, "despite"? _Because_.

Filesystem mounts in user namespaces

Posted Jul 30, 2015 11:31 UTC (Thu) by flussence (guest, #85566) [Link]

I'd be happy to use FUSE drivers to mount my normal disk partitions if it were made possible; I've been using it to avoid NFS successfully for years.

Filesystem mounts in user namespaces

Posted Jul 30, 2015 16:54 UTC (Thu) by fandingo (guest, #67019) [Link] (1 responses)

It seems like this suffers from the same problem that capabilities have. It's clear that different policies are necessary depending on the use case and environment, but this doesn't play well at all with the crude permission flags that are employed in seemingly every syscall. It's just not practical to encode an entire policy in the bits of an int.

I understand that not everyone likes LSMs, and there are a number of non-LSM alternatives (e.g. Capsicum). Nonetheless, it seems that those more complex policy execution and enforcement engines are the proper place to handle this complexity. I don't see any other way to create the expressive policies users need without throwing out the notion of controlling this behavior with syscall flags. It's well past time to abandon using integer flags to define and enforce security policy.

I think the best course of action is to neuter file system user name spaces, preventing them from doing anything privileged, and allow LSMs to sort out what gets access to specific privileged actions.

In regards to the hostile file system threat, I fear that's intractable. The Linux Kernel project has always been terrible at that sort of thing because there are few -- if any -- resources dedicated to it. Given the abject failure to secure the public syscall interface, it's unsurprising that private file system structures have insufficient validation and safety mechanisms.

Filesystem mounts in user namespaces

Posted Aug 3, 2015 11:40 UTC (Mon) by rwmj (guest, #5474) [Link]

I think the bigger problem is that there's no mathematical/logical model around the security in Linux[1]. It's just layers of stuff added on top of other stuff, with ad hoc changes (such as this one) stirred in. It's probably too late for Linux, but let's hope the next free OS gets this right.

[1] It is possible, apparently. See: https://en.wikipedia.org/wiki/EROS_%28microkernel%29

Filesystem mounts in user namespaces

Posted Aug 7, 2015 1:01 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (4 responses)

The one really evil thing I saw mentioned that's problematic to catch/handle at runtime is circular (directory) hardlinks. Does anyone know what FUSE does to deal with this, if anything? And does anyone know of any other really fundamental (i.e. tied in with posix filesystem semantics) issues?

Filesystem mounts in user namespaces

Posted Aug 7, 2015 1:19 UTC (Fri) by neilbrown (subscriber, #359) [Link] (3 responses)

> circular (directory) hardlinks.

I don't think these are hard to handle - but I don't think that they are handled well. This, to me, is the key point. The code hasn't been audited against attack-by-filesystem.

When a lookup finds a directory that happens to already exist somewhere in the dcache, d_splice_alias will be called which will call __d_unalias which uses __d_move to move the original dentry to the new location.

If the original was an ancestor, you now have a loop which will never be freed and will cause nasty messages when you unmount the filesystem.

This is trivial to fix by putting a call to is_subdir() in there somewhere - __d_unalias already takes all the locks needed to make this reliable.

So: easy to fix, but never audited. Feel free to submit a patch :-)

Filesystem mounts in user namespaces

Posted Aug 7, 2015 1:45 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (1 responses)

Oh yeah, the lack of auditing is definitely scary - that's not quite what I was getting at, though. Because there's never been any real pressure to be concerned about this attack vector before, I'm worried that there might be security issues that are just baked into the semantics of posix filesystems - or at least, the VFS interface.

Like it's straightforward enough to e.g. in bcachefs audit all the code that reads in a btree node for buffer overflows and whatnot; there, the issues one is looking for are all local. An adversarial device modifying fs metadata underneath us to do god knows what at the filesystem level? Ugh...

It would be nice if it motivates people to start caring about FUSE performance, too.

Filesystem mounts in user namespaces

Posted Aug 7, 2015 2:18 UTC (Fri) by neilbrown (subscriber, #359) [Link]

> I'm worried that there might be security issues that are just baked into the semantics of posix filesystems - or at least, the VFS interface

There are things like SETUID and device-special files of course, but they are already understood and handled.

The only other thing I can think of is that you could create a (nearly) arbitrarily deep directory tree and consume all of memory in the dcache - because all ancestors of anything in the dcache must also be in the dcache.

If mem-cgroups is able to fail allocations for new inodes or dentries, then this isn't a problem. If it can't, then this could be a denial-of-service vector.

Mind you, you can already do:
% while :;do mkdir a;cd a;done

so it isn't really anything new.

Filesystem mounts in user namespaces

Posted Aug 26, 2015 18:36 UTC (Wed) by bfields (subscriber, #19510) [Link]

> When a lookup finds a directory that happens to already exist somewhere in the dcache, d_splice_alias will be called

which does:

    if (unlikely(d_ancestor(new, dentry))) {
        write_sequnlock(&rename_lock);
        spin_unlock(&inode->i_lock);
        dput(new);
        new = ERR_PTR(-ELOOP);
        pr_warn_ratelimited(
            "VFS: Lookup of '%s' in %s %s"
             " would have caused loop\n",
             dentry->d_name.name,
             inode->i_sb->s_type->name,
             inode->i_sb->s_id);

(First checked in d_splice_alias by 95ad5c291313 "dcache: d_splice_alias should detect loops", in v3.17, though that logic was always in d_materialise_unique, I think. And Al combined the two more recently.)