The inherent fragility of seccomp()

By Jonathan Corbet
November 10, 2017

Kernel developers have worried for years that tracepoints could lead to applications depending on obscure implementation details; the consequent need to preserve existing behavior to avoid causing regressions could end up impeding future development. A recent report shows that the seccomp() system call is also more prone to regressions than users may expect — but kernel developers are unlikely to cause these regressions and, indeed, have little ability to prevent them. Programs using seccomp() will have an inherently higher risk of breaking when software is updated.

seccomp() allows the establishment of a filter that will restrict the set of system calls available to a process. It has obvious uses related to sandboxing; if an application has no need for, say, the open() system call, blocking access to that call entirely can reduce the damage that can be caused if that application is compromised. As more systems and programs are subjected to hardening, the use of seccomp() can be expected to continue to increase.

Michael Kerrisk recently reported that upgrading to glibc 2.26 broke one of his demonstration applications. That program was using seccomp() to block access to the open() system call. The problem that he ran into comes down to the fact that applications almost never invoke system calls directly; instead, they call wrappers that have been defined by the C library.

The glibc open() wrapper has, since the beginning, been a wrapper around the kernel's open() system call. But open() is an old interface that has long been superseded by openat(). The older call still exists because applications expect it to be there, but it is implemented as a special case of openat() within the kernel itself. In glibc 2.26, the open() wrapper was changed to call openat() instead. This change was not visible to ordinary applications, but it will break seccomp() filters that behave differently for open() and openat().
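To make the failure mode concrete, here is a minimal sketch (not Kerrisk's program) that assembles a deny-list filter in classic BPF, the form that seccomp() filters take, and runs it through a toy interpreter instead of installing it in the kernel. The x86-64 syscall numbers, the helper names, and the interpreter are assumptions made for illustration; a real filter would also check seccomp_data.arch before looking at the syscall number.

```python
import struct

# Classic BPF opcodes, as defined in <linux/bpf_common.h>.
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word from seccomp_data
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: compare accumulator with a constant
BPF_RET_K = 0x06     # BPF_RET | BPF_K: return a constant action

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
EPERM = 1
SYS_OPEN, SYS_OPENAT = 2, 257  # x86-64 numbers; other architectures differ


def insn(code, jt, jf, k):
    """Pack one struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }."""
    return struct.pack("HBBI", code, jt, jf, k)


def deny_filter(denied):
    """Fail the listed syscalls with EPERM and allow everything else."""
    prog = [insn(BPF_LD_W_ABS, 0, 0, 0)]  # offset 0 is seccomp_data.nr
    for i, nr in enumerate(denied):
        # On a match, skip the remaining tests and the ALLOW return.
        prog.append(insn(BPF_JEQ_K, len(denied) - i, 0, nr))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM))
    return b"".join(prog)


def evaluate(prog, nr):
    """Toy interpreter for the three opcodes above (the kernel does this for real)."""
    data = struct.pack("iIQ6Q", nr, 0, 0, *([0] * 6))  # struct seccomp_data
    pc, acc = 0, 0
    while True:
        code, jt, jf, k = struct.unpack_from("HBBI", prog, pc * 8)
        pc += 1
        if code == BPF_LD_W_ABS:
            (acc,) = struct.unpack_from("I", data, k)
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


prog = deny_filter([SYS_OPEN])
# Older glibc: open() issues the open syscall, which the filter rejects.
old_glibc = evaluate(prog, SYS_OPEN)
# glibc 2.26: open() issues openat, which the filter never mentions.
new_glibc = evaluate(prog, SYS_OPENAT)
```

In this sketch old_glibc is the EPERM action and new_glibc is SECCOMP_RET_ALLOW: the application's source has not changed, but the filter has silently stopped blocking file opens.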

Kerrisk was not really complaining about the change, but he did want to inform the glibc developers that there were user-visible effects from it: "I want to raise awareness that these sorts of changes have the potential to possibly cause breakages for some code using seccomp, and note that I think such changes should not be made lightly or gratuitously". The developers should, he was suggesting, keep the possibility of breaking seccomp() filters in mind when making changes, and they should document such changes when they cannot be avoided.

Florian Weimer, however, disagreed:

I have the opposite view: We should make such changes as often as possible, to remind people that seccomp filters (and certain SELinux and AppArmor policies) are incompatible with the GNU/Linux model, where everything is developed separately and not maintained within a single source tree (unlike say OpenBSD). This means that you really can't deviate from the upstream Linux userspace ABI (in the broadest possible sense) and still expect things to work.

Another way of putting this might be: seccomp() filters are not considered to be a part of the ABI that is provided by glibc, so incompatible changes there are not considered regressions. They are, instead, a consequence of filtering below the glibc level while expecting behavior above that level to remain unchanged.

Weimer's point of view would appear to be the one that will govern glibc development going forward. So Kerrisk has proposed some man-page changes to make the fragility of seccomp() filters a bit less surprising to developers. Playing the game at this level will require a fairly deep understanding of what is going on and the ability to adapt to future C-library changes.

This outcome could be seen as an argument in favor of a filtering interface like OpenBSD's pledge(). Like seccomp(), pledge() is used to limit the set of system calls available to a process, but pledge() is defined in terms of broad swathes of functionality rather than working at the level of individual system calls. It can be used to allow basic file I/O, for example, while disabling the opening (or creation) of new files. pledge() is far less expressive than seccomp() and cannot implement anything close to the same range of policies but, for basic filtering, it seems far less likely to generate surprises after a kernel or library update.

But Linux doesn't have pledge() and seems unlikely to get it. seccomp() can certainly get the sandboxing job done, but developers who use it should expect to spend some ongoing effort maintaining their filters.

Index entries for this article
Kernel	Security/seccomp
Security	Linux kernel/Seccomp


The inherent fragility of seccomp()

Posted Nov 10, 2017 21:07 UTC (Fri) by juliank (guest, #45896) [Link] (5 responses)

This is why, when designing the seccomp filtering for APT's downloading code [1], I looked at a list of all syscalls and picked all similar ones. So if I pick open(), I also pick openat(), for example. In fact, I broadly categorized it into

(1) base set of permissions (normal file I/O, sysv IPC [if fakeroot is used])
(2) directory reading
(3) sockets

See https://anonscm.debian.org/cgit/apt/apt.git/tree/methods/... and later lines.

This will break eventually if a new syscall is introduced. I consider two ways to solve that:

(1) Keep a list of all syscalls that have been checked in the source code, and regularly (on CI) check if there are new ones. If new ones appear, they have to be compared to the existing set, and if similar enough, added to the list.
(2) make syscalls return ENOSYS instead of aborting the program. This should cause libc to fall back from new optimised syscalls to old syscalls, as it has to maintain a certain base level

Combining the two should yield a maintainable result.
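As a rough sketch of what option (1) could look like in a CI job: the code below is not APT's implementation. It assumes the list of existing syscalls comes from a table in the style of the kernel's arch/x86/entry/syscalls/syscall_64.tbl (number, ABI, name, entry point), and the group contents and the sample table are made up for illustration.

```python
# Hypothetical groups, loosely following the three categories above.
GROUPS = {
    "base": {"read", "write", "close", "open", "openat"},
    "directory": {"getdents", "getdents64"},
    "sockets": {"socket", "connect", "sendto", "recvfrom"},
}
# Syscalls that were reviewed and deliberately left out of every group.
REVIEWED_AND_DENIED = {"ptrace", "kexec_load"}


def parse_syscall_table(text):
    """Extract syscall names from lines of '<number> <abi> <name> [<entry point>]'."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 3:
            names.add(fields[2])
    return names


def unreviewed(table_text):
    """Return the syscalls in the table that nobody has classified yet."""
    known = set().union(*GROUPS.values()) | REVIEWED_AND_DENIED
    return sorted(parse_syscall_table(table_text) - known)


SAMPLE_TABLE = """\
# excerpt in the style of syscall_64.tbl
0\tcommon\tread\tsys_read
1\tcommon\twrite\tsys_write
2\tcommon\topen\tsys_open
257\tcommon\topenat\tsys_openat
437\tcommon\topenat2\tsys_openat2
"""

# A CI job would fail the build when this list is non-empty.
new_calls = unreviewed(SAMPLE_TABLE)
```

With the sample table, new_calls contains only openat2, which a maintainer would then file under the same group as open and openat.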

[1] https://juliank.wordpress.com/2017/10/23/apt-1-6-alpha-1-...

Another thing people don't consider is NSS modules and LD_PRELOAD. They could be doing all kinds of weird stuff when you call getaddrinfo(). For example, they could use SysV IPC to talk to another process, like a DNS cache. Evil little bastards. We had the same problem with people running apt in fakeroot: fakeroot needs SysV IPC to talk to its metadata daemon thing, and those syscalls were not whitelisted. I hacked in support for that: if FAKED_MODE is set in the environment, it now adds the IPC syscalls. Ugly.

The inherent fragility of seccomp()

Posted Nov 10, 2017 21:10 UTC (Fri) by juliank (guest, #45896) [Link] (1 responses)

Maybe we could start a libseccomp-easy where we consolidate groups of syscalls and maintain that in a central place, optimally in libseccomp. It would be similar to pledge, except for the paths component - that would require kernel changes AFAICT.

The inherent fragility of seccomp()

Posted Nov 11, 2017 1:45 UTC (Sat) by pkern (subscriber, #32883) [Link]

Well, systemd does that: https://www.freedesktop.org/software/systemd/man/systemd....

At the same time, as stated in the original post, AppArmor also leaks the details of the libraries an application loads into the profiles. And if the application execs something, you need to account for whatever the exec'ed program does.

The inherent fragility of seccomp()

Posted Nov 11, 2017 0:14 UTC (Sat) by nix (subscriber, #2304) [Link] (2 responses)

> (1) Keep a list of all syscalls that have been checked in the source code, and regularly (on CI) check if there are new ones. If new ones appear, they have to be compared to the existing set, and if similar enough, added to the list.

You have to check all libraries your program uses, as well, and all libraries those libraries use, and so on ad infinitum. Oh and don't forget LD_PRELOADed libraries, dynamically loaded plugins, etc etc etc. (Particularly relevant if things like Gtk are in use because of the possibility of accessibility and IM plugins that call out to weird hardware and the like that you have quite possibly never realised even exists: but speech recognition for blind people sometimes relies on LD_PRELOAD to interpose all console I/O, etc etc etc... the list of obscure edge cases crucial to someone that this breaks is endless, and IMHO unmaintainable.)

> (2) make syscalls return ENOSYS instead of aborting the program. This should cause libc to fall back from new optimised syscalls to old syscalls, as it has to maintain a certain base level

See my comment below for a case where the affected syscall was getpid(). getpid() is guaranteed to never fail, so nobody ever checks to see if it failed.

I just checked the seccomp filters active in a bunch of programs running on the system on which I'm typing this. Several of them still do not whitelist getpid(), almost a year after glibc 2.25 was released. I guess they're working by luck. The first such example is something that really *needs* seccomp, too: ntpd 4.2.8p10. It calls getpid() multiple times in the very same source file where it sets up a filter list that excludes getpid(): the obscure and out-of-the-way ntpd/ntpd.c. One of its calls does not check for failure, so can easily end up trying to set a process group of (pid_t)-1... it's in a tangle of conditionals that mean that most of the time, if you're lucky, you'll end up not compiling in that code -- but there are several other calls elsewhere in the source tree... and oh yes it also links to OpenSSL's libcrypto. Any bets on whether *that* calls getpid()? Repeat for every other syscall it doesn't allow past, and every syscall it allows past but only with argument checking.

This is not a maintainable strategy for any but the simplest programs.

The inherent fragility of seccomp()

Posted Nov 11, 2017 8:38 UTC (Sat) by alonz (subscriber, #815) [Link]

I believe the OP meant something subtly different: he wasn't planning to check which syscalls the program uses, rather just what syscalls exist in the kernel. When new syscalls are added, he would add them to the appropriate group in the filters (e.g., if it's a new way to open files, it will be filtered the same as all other open* syscalls). And until this update happens, the filters will ensure glibc (or any other library) will get ENOSYS for this new syscall, forcing it to fall back to older syscalls.

In a sense, this just implements a poor-man's-pledge, with the CI system ensuring it evolves together with the kernel (or at least trying to).

The inherent fragility of seccomp()

Posted Nov 11, 2017 16:26 UTC (Sat) by marcH (subscriber, #57642) [Link]

Another random example of how seccomp breaks corner cases. This one took a few months to realize:

https://bugs.chromium.org/p/chromium/issues/detail?id=772273
sslh seccomp policy blocks ssh to ChromeOS over link-local IPv6 addresses
https://chromium-review.googlesource.com/c/chromiumos/ove...

The inherent fragility of seccomp()

Posted Nov 10, 2017 21:59 UTC (Fri) by luto (subscriber, #39314) [Link] (7 responses)

I've never understood why this is such a big deal. Whitelist the okay syscalls, handle known-bad ones sensibly, and force -ENOSYS from everything else. Glibc needs to work on old kernels, so it has to handle -ENOSYS correctly.

The inherent fragility of seccomp()

Posted Nov 10, 2017 22:41 UTC (Fri) by arnd (subscriber, #8866) [Link]

glibc has the concept of a minimum kernel version, currently linux-3.2 IIRC. If a system call was available on all architectures in that version, the glibc policy is to assume it works. Removing backwards compatibility fallbacks is generally considered a good thing here, but that is what caused the issue.

Part of the problem is that we have reduced the set of available syscalls on modern architectures: anything that uses include/uapi/asm-generic/unistd.h, for instance, intentionally offers only openat() but not open(). When glibc can reasonably assume that openat() is available on all architectures, the logical next step is to always call that to reduce the differences between architectures.

The inherent fragility of seccomp()

Posted Nov 10, 2017 22:43 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

glibc can be compiled with a guaranteed minimum kernel version, and will skip compatibility code if the minimum kernel version is higher or equal to the one that included a particular system call. You can search the libc manual for "--enable-kernel".
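The interaction between an ENOSYS-returning filter and a minimum kernel version can be shown with a toy model. This is not glibc's code: FilteredKernel, open_wrapper() and the has_fallback flag are hypothetical names standing in for the kernel plus filter, the libc wrapper, and the effect of building with a higher "--enable-kernel" setting.

```python
import errno


class FilteredKernel:
    """Stand-in for a kernel plus a seccomp filter that returns ENOSYS by default."""

    def __init__(self, allowed):
        self.allowed = set(allowed)

    def syscall(self, name):
        if name not in self.allowed:
            raise OSError(errno.ENOSYS, f"{name} rejected by filter")
        return f"{name} ok"


def open_wrapper(kernel, has_fallback):
    """Hypothetical libc open(): prefer openat, optionally fall back to open."""
    try:
        return kernel.syscall("openat")
    except OSError as exc:
        if exc.errno != errno.ENOSYS or not has_fallback:
            raise
        return kernel.syscall("open")


# A filter written when open() still meant the open syscall.
sandbox = FilteredKernel(allowed={"open", "read", "write"})

# A libc that still supports kernels without openat keeps the fallback path.
with_fallback = open_wrapper(sandbox, has_fallback=True)

# A libc whose minimum kernel guarantees openat has dropped the fallback,
# so the same filter now makes open() fail outright.
try:
    open_wrapper(sandbox, has_fallback=False)
    without_fallback = "ok"
except OSError as exc:
    without_fallback = errno.errorcode[exc.errno]
```

In the model, the first call succeeds through the old syscall while the second surfaces ENOSYS to the application, which is the situation described above once compatibility code is skipped.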

The inherent fragility of seccomp()

Posted Nov 10, 2017 22:45 UTC (Fri) by juliank (guest, #45896) [Link]

One problem with ENOSYS is that you can get weird behaviour in programs due to them not checking errors properly. It's much easier to detect issues when trapping; you can even write a signal handler that writes the blocked syscall to stdout (well, to fd 1 :D) [or just look at a backtrace]. My approach would be to have a list of syscalls, mark the good ones, add traps for all other syscalls in the list, and return -ENOSYS from all other (new) syscalls (or EINVAL for stuff like prctl). This way you have a defined baseline. You can even regularly trap new syscalls if you continue maintaining the software.

The backtrace thing with the trap signal is especially useful for stuff like NSS modules and preloaded libraries.
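The three-tier policy described above could be written down roughly as follows. This is only a sketch of the decision table, not a filter: the action constants are the values from <linux/seccomp.h>, while the KNOWN and GOOD sets and the action_for() helper are hypothetical, and the EINVAL case for prctl-style calls is left out.

```python
import errno

# Action values from <linux/seccomp.h>.
SECCOMP_RET_TRAP = 0x00030000   # deliver SIGSYS to the calling thread
SECCOMP_RET_ERRNO = 0x00050000  # fail with the errno carried in the low 16 bits
SECCOMP_RET_ALLOW = 0x7FFF0000

# Hypothetical baseline: every syscall that existed when the policy was reviewed.
KNOWN = {"read", "write", "close", "open", "openat", "ptrace", "mount"}
# The subset of the baseline that the program is allowed to use.
GOOD = {"read", "write", "close", "open", "openat"}


def action_for(name):
    """Map a syscall name to the seccomp action the filter should return."""
    if name in GOOD:
        return SECCOMP_RET_ALLOW
    if name in KNOWN:
        # Reviewed and unwanted: fail loudly so a backtrace shows the caller.
        return SECCOMP_RET_TRAP
    # Newer than the baseline: pretend the kernel is old so libc falls back.
    return SECCOMP_RET_ERRNO | errno.ENOSYS
```

A filter generator would emit one rule per entry in KNOWN and use the ENOSYS action as the default, so that re-running the review against a newer syscall list only moves names between the sets.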

The inherent fragility of seccomp()

Posted Nov 11, 2017 0:00 UTC (Sat) by nix (subscriber, #2304) [Link] (3 responses)

That entirely depends on the syscall.

This is not the first instance of such breakage: in glibc 2.25, BIND's named daemon stopped working. The failure was catastrophic: rather than daemonizing, it hung forever, which had a tendency to bring boots to a grinding halt: if it didn't, it had a tendency to bring whole networks to a halt if this wasn't noticed and all machines were eventually rebooted after an update. The cause? glibc 2.25 dropped the internal caching of getpid() which it had long done, since it didn't speed much up, added a lot of complexity, introduced subtle bugs, and broke horribly with PID namespaces. When this was done, threaded programs which called getpid() for the first time after activating their seccomp filters needed to whitelist it in those filters, where they never needed to before. BIND had not done so, and called getpid() before daemonizing. Strangely neither it nor glibc expected getpid() to fail. POSIX guarantees it cannot fail, but thanks to seccomp it now can.

Worse yet, this sort of failure can happen even if the call is only made in some non-glibc library, even if the library has no idea the seccomp filters are in force in the first place, and even if the program installing the filter has no idea the library was calling the function (perhaps it wasn't when the filters were added, and who can check every change ever made to every library your program depends on, even indirectly?)

Expecting glibc and other libraries to avoid making changes that break seccomp filters is tantamount to demanding that they never change the set of syscalls they invoke (or the arguments passed to them, because who knows what validation those filters are carrying out) in any situation ever, which would make library development on Linux essentially impossible.

I don't see a way to fix this in the current model other than to demand that all seccomp-filtered programs be statically linked and never upgraded (which would make it impossible to fix security holes in them or any libraries they used: this is of course ridiculous). The increasing use of seccomp is placing silent landmines beneath the feet of everyone using every seccomp-filtered program. This is a shame, because if programs were never upgraded and their behaviour was completely predictable, seccomp would be an excellent way to prevent malicious behaviour. However, in a world like that, programs would all already be secure and we wouldn't need seccomp in the first place.

The obvious fix, to introduce LD_AUDIT-style filtering on *library* arguments, falls at the same hurdle, for the same reason: as long as the filters are process-wide, some library getting upgraded can unintentionally violate the contract of the filter and break. The only solution I can see that would work reliably would be for each library to filter *its own* calls, so that it could at least in theory adjust its filters as its own set of expected calls changed: a sort of DT_SYMBOLIC per-.so filter for inter-shared library function calls. God knows how to implement that without totally wrecking performance though: it would mean every filtered call would have to go through the PLT and ld.so, at the very least: the very opposite of the increasing reduction in lazy binding that's actually happening. I suspect there are more complexities I haven't considered, too.

The inherent fragility of seccomp()

Posted Nov 11, 2017 3:37 UTC (Sat) by patrakov (subscriber, #97174) [Link]

I believe the current situation has some similarity to the decade-old sendmail bug:
https://sites.google.com/site/fullycapable/Home/thesendma...

There, it was also a syscall failing, that could not fail previously (with sendmail not checking the return), due to a new security mechanism (capabilities).

The inherent fragility of seccomp()

Posted Nov 12, 2017 8:27 UTC (Sun) by epa (subscriber, #39769) [Link] (1 responses)

Surely the right way to do seccomp() is to just kill the process with a signal if it calls a system call that isn't allowed. That would be much less dangerous than a pernicious weakening of all API promises, where random things can start failing even if POSIX guarantees they don't.

Programs that are seccomp-aware, and want to handle these things defensively, could arrange to trap the signal. Otherwise existing code would at least either work correctly or fail cleanly.

The inherent fragility of seccomp()

Posted Nov 13, 2017 1:01 UTC (Mon) by simcop2387 (subscriber, #101710) [Link]

This is one of the things I do for enabling arbitrary code execution on a pastebin I run, along with a bunch of other techniques to sandbox the whole thing and prevent it from having any system-visible side effects. The code is published on CPAN: https://metacpan.org/pod/App::EvalServerAdvanced . I'm working on making it easier to handle arbitrary programs that can be sent to the sandbox and keep things secure.

The inherent fragility of seccomp()

Posted Nov 11, 2017 4:59 UTC (Sat) by roc (subscriber, #30627) [Link] (3 responses)

We hit similar issues with some rr tests as well. Various rr things not related to seccomp also had to be updated to handle openat efficiently.

There isn't really a good solution here. pledge() won't scale to a broader software ecosystem. Trying to let libraries express their syscall requirements and collect those transitively would be complicated and prone to errors that over-expose the kernel. Probably a more capability-based kernel API would be better, but it's hard to get there from here.

The inherent fragility of seccomp()

Posted Nov 12, 2017 8:59 UTC (Sun) by wahern (subscriber, #37304) [Link] (2 responses)

Getting there from here is nearly as easy as a single merge: https://github.com/google/capsicum-linux

The inherent fragility of seccomp()

Posted Nov 12, 2017 9:27 UTC (Sun) by roc (subscriber, #30627) [Link] (1 responses)

Getting that merged would in itself be a monumental task.

Then you'd have to rewrite glibc and most other userspace libraries and applications to use capsicum-enabled APIs.

It could be great, but don't claim it's easy.

The inherent fragility of seccomp()

Posted Nov 13, 2017 21:42 UTC (Mon) by wahern (subscriber, #37304) [Link]

True, actually using Capsicum from applications takes considerable work unless you're starting from scratch. But getting it merged seems more like a political rather than a technical issue, as most of the technical work exists for the taking.

Getting over that political hurdle seems daunting, unfortunately. AFAIK the CLONE_FD patch (https://lwn.net/Articles/638613/), necessary for implementing Capsicum's pdfork() interface, _still_ hasn't been merged.

Regarding glibc, I'm not sure how much of an impact it would have on glibc. The particular case of open vs. openat is irrelevant because applications are supposed to be using openat in the Capsicum model, anyhow. The benefit of Capsicum is that it builds upon the existing, de facto file descriptors-as-capabilities model in Unix. From the perspective of libc, playing nice with Capsicum is roughly similar to refactoring to better leverage the latest evolutions of POSIX and privilege separation best practices. For example, use getrandom() instead of expecting to open /dev/urandom. And stop relying on /proc so heavily because it's not always visible. These are things glibc has to do, anyhow.

The inherent fragility of seccomp()

Posted Nov 13, 2017 7:00 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (6 responses)

Maybe it's time to move the development of libc and other foundational userspace libraries into the kernel repository so that they evolve together.

The inherent fragility of seccomp()

Posted Nov 13, 2017 14:43 UTC (Mon) by musicinmybrain (subscriber, #42780) [Link]

There’s more than one credible libc out there.

The inherent fragility of seccomp()

Posted Nov 13, 2017 15:08 UTC (Mon) by nix (subscriber, #2304) [Link] (4 responses)

The problem is this doesn't just apply to "foundational" libraries. It applies to all of them, the complete set of all syscalls that might be made by all libraries in the address space throughout the lifetime of the seccomped process, and you just can't tell what those might be, not at compile time, not at link time, not even at install time.

e.g. if your sshd has something obscure LD_PRELOADed into it for the sake of a blind user, now you have to adapt to the new syscalls it makes in routine operation, even though you probably had no idea the thing existed. (OK, in this case, the blind user would more likely be preloading something into the ssh *client*, which is not seccomped, but if we're going to try seccomping anything associated with a user interface we'll suddenly have to consider input methods and God knows what getting preloaded in or plugged in).

The inherent fragility of seccomp()

Posted Nov 14, 2017 23:41 UTC (Tue) by wahern (subscriber, #37304) [Link] (3 responses)

> OK, in this case, the blind user would more likely be preloading something into the ssh *client*, which is not seccomped

Which brings up another benefit of pledge over seccomp--pledge doesn't require root privileges to invoke. Almost all the standard utilities in OpenBSD call pledge, _including_ ssh(1). pledge can do this because it's not inherited across exec, which smartly sidesteps all the messy security issues with the setuid and setgid executable bits.

The inherent fragility of seccomp()

Posted Nov 15, 2017 0:34 UTC (Wed) by nix (subscriber, #2304) [Link]

> Which brings up another benefit of pledge over seccomp--pledge doesn't require root privileges to invoke.

Neither does the installation of a seccomp filter, as long as you have done a prctl(PR_SET_NO_NEW_PRIVS, 1) first to ensure that you can't go invoking setuid programs, etc, later on. Heck, it was basically designed for Chromium's renderers, and no way are they run as root except by absolute lunatics :)

(This is how it avoids the old sendmail cap attack: setuid programs or their children can't be fooled into running with an unexpected seccomp filter installed before the setuid took effect, because installation of a filter requires turning permanently off the ability to invoke setuid programs in the process hierarchy that has the filter in force.)
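A minimal sketch of that unprivileged sequence, done from Python through ctypes purely for illustration: a forked child sets PR_SET_NO_NEW_PRIVS, installs a filter that fails only getpid() with ENOSYS, and reports what getpid() then returns. The prctl() constants and AUDIT_ARCH values are taken from the kernel headers; the helper names, the use of prctl(PR_SET_SECCOMP) rather than the seccomp() system call, and the choice to allow foreign ABIs are assumptions of this sketch (a real filter would normally kill on an unexpected architecture).

```python
import ctypes
import errno
import os
import platform
import struct

PR_SET_NO_NEW_PRIVS = 38
PR_SET_SECCOMP = 22
SECCOMP_MODE_FILTER = 2
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000

# (AUDIT_ARCH_* value, getpid syscall number) per machine type.
ARCHES = {"x86_64": (0xC000003E, 39), "aarch64": (0xC00000B7, 172)}


class SockFprog(ctypes.Structure):  # struct sock_fprog
    _fields_ = [("len", ctypes.c_ushort), ("filter", ctypes.c_void_p)]


def getpid_filter(audit_arch, nr_getpid):
    """Allow everything except getpid, which fails with ENOSYS."""
    def insn(code, jt, jf, k):
        return struct.pack("HBBI", code, jt, jf, k)
    return b"".join([
        insn(0x20, 0, 0, 4),           # load seccomp_data.arch
        insn(0x15, 0, 3, audit_arch),  # other ABI: jump to ALLOW (demo only)
        insn(0x20, 0, 0, 0),           # load seccomp_data.nr
        insn(0x15, 0, 1, nr_getpid),   # not getpid: jump to ALLOW
        insn(0x06, 0, 0, SECCOMP_RET_ERRNO | errno.ENOSYS),
        insn(0x06, 0, 0, SECCOMP_RET_ALLOW),
    ])


def getpid_under_filter():
    """Fork, install the filter without privilege, report getpid() as text."""
    audit_arch, nr_getpid = ARCHES[platform.machine()]
    prog = getpid_filter(audit_arch, nr_getpid)
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: give up setuid transitions, then install the filter
        try:
            buf = ctypes.create_string_buffer(prog, len(prog))
            fprog = SockFprog(len(prog) // 8, ctypes.addressof(buf))
            ok = libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0
            ok = ok and libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER,
                                   ctypes.addressof(fprog), 0, 0) == 0
            os.write(wfd, str(os.getpid() if ok else "install failed").encode())
        finally:
            os._exit(0)
    os.close(wfd)
    result = os.read(rfd, 64).decode()
    os.close(rfd)
    os.waitpid(pid, 0)
    return result
```

On a Linux x86-64 or arm64 system with a C library that no longer caches the PID, calling getpid_under_filter() as an ordinary user should yield a negative number in place of a process ID, with no exception raised: the "cannot fail" call has failed quietly, as discussed earlier in the thread.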

The inherent fragility of seccomp()

Posted Dec 11, 2017 1:13 UTC (Mon) by roc (subscriber, #30627) [Link] (1 responses)

If the security constraints are not carried across execve(), then execve() has to be blocked or the constraints are worthless. That's a problem; I've recently implemented a seccomp sandbox around an application that definitely had to use execve().

The inherent fragility of seccomp()

Posted Dec 11, 2017 7:14 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

Not necessarily - in combination with an LSM policy you could restrict which things can be execve()ed. But the fact that all of these security features are effectively orthogonal makes it pretty hard to write an overarching policy.

openat() was available before - why they are blaming glibc?

Posted Nov 15, 2017 12:29 UTC (Wed) by sasha (guest, #16070) [Link] (4 responses)

I do not understand why somebody blames glibc at all. There were 2 system calls: open() and openat(). Some "security" filter decided that it wants to prevent user from calling open(), but they forget about openat(). An application may use openat() with any (g|uc|musl)libc, it is just a syscall. So this "security filter" is just stupid and does not provide any security at all. The new glibc release accidentally showed the hole in the filter, thank you very much. If the developers of this "security filter" blame glibc for this, then it looks... strange.

openat() was available before - why they are blaming glibc?

Posted Nov 16, 2017 8:24 UTC (Thu) by smcv (subscriber, #53363) [Link] (3 responses)

> Some "security" filter decided that it wants to prevent user from calling open(), but they forget about openat().

The situation here seems to be the other way round: a whitelist-based filter allowed a particular program to call the open syscall (and therefore open files), but in recent glibc, the open(2) wrapper function actually uses the more general openat syscall, which the filter didn't allow. This caused that program to become unable to open files - not vulnerable, but also not usable ("failing closed").

openat() was available before - why they are blaming glibc?

Posted Nov 16, 2017 9:19 UTC (Thu) by jem (subscriber, #24231) [Link] (2 responses)

I'm not convinced. I don't think it is fair to blame Glibc for system calls suddenly disappearing from underneath it at the whim of some random system administrator or application developer. If you use drastic tools like seccomp(), you should really know what you are doing and be prepared for surprises like changing library implementations. In the case of open() vs. openat(), I wonder what the reason was for whitelisting one but not the other. Maybe somebody was just sloppy and simply forgot openat() existed?

openat() was available before - why they are blaming glibc?

Posted Nov 16, 2017 13:27 UTC (Thu) by vadim (subscriber, #35271) [Link]

The problem goes as follows:

1. Programmer writes code, wants more protection and decides on the use of seccomp.
2. Programmer looks at what the code needs, and comes up with the 'open' syscall. However the code doesn't call the syscall directly, but the wrapper glibc provides.
3. Code is finished, programmer moves on to the next project.
4. Kernel development goes on, and the 'openat' syscall gets created.
5. Glibc adds usage of openat, so that in some cases the glibc-provided open wrapper actually calls 'openat'.
6. In those cases, the previously written code ends up using the 'openat' syscall which is not whitelisted because it didn't exist when the code was written, or because the 'open' wrapper always used the 'open' syscall and nothing else, and this changed later.

The 'open' call doesn't go anywhere. Glibc just doesn't promise to do an exact 1 to 1 wrapper, or not to introduce internal usage of additional new syscalls for its own internal reasons. When you call glibc open(), glibc may actually invoke a new, more advanced syscall like openat instead, or use additional syscalls in the wrapper.

openat() was available before - why they are blaming glibc?

Posted Nov 16, 2017 14:09 UTC (Thu) by corbet (editor, #1) [Link]

This conversation confuses me a bit. Who is blaming glibc? The article is about a particular kernel functionality that is prone to breakage. The term "fragility" in the title was applied to seccomp(), not glibc, after all. I've not seen comments blaming glibc either...?