Seccomp filters: No clear path

By Jake Edge
July 7, 2011

Patches to expand the functionality of seccomp ("secure computing") have been floating around for two years or more without making any real progress into the mainline. There are a number of projects that are interested in using an expanded seccomp, but the patches themselves seem to have run into a "catch-22" situation. There are conflicting visions of how the feature should be added, without a clear sense that any of the options will be acceptable to all of the maintainers involved. That leaves a useful feature without a clear path into the kernel, which is undoubtedly frustrating to some.

We first looked at seccomp sandboxing a little over two years ago, when Adam Langley posted patches that would provide a way for a process to restrict the system calls that it (and its children) could make. The idea is to allow processes to sandbox themselves by choosing which system calls are available, rather than being restricted to just the four hard-coded system calls that the existing seccomp implementation allows (read(), write(), exit(), and sigreturn()). The impetus behind Langley's patches was to provide an easier mechanism for sandboxing processes in the Chromium web browser—and to eventually remove the somewhat convoluted sandbox that Chromium currently uses on Linux.

At the time of that proposal, Ingo Molnar suggested that Ftrace-style filtering would make the expanded seccomp much more useful. That idea wasn't universally hailed at the time, and the seccomp feature went mostly dormant until it was restarted by Will Drewry back in April. Drewry took Molnar's suggestions and implemented a version of seccomp that would allow system calls to be enabled, disabled, or filtered with simple boolean expressions (e.g. sys_read: (fd == 0)).

While Molnar was pleased with the progress, he didn't think it went far enough and suggested that a perf-like interface be used instead of prctl(), which is used by the existing seccomp. He had some fairly wide-ranging ideas that using perf events in a more active way could lead to better kernel security solutions than the existing Linux Security Modules (LSM) approach provides. Once again, this idea was not universally popular. The LSM developers, in particular, were not enamored by that idea.

Nevertheless, Drewry implemented a proof of concept along the lines of what Molnar had suggested. That led to complaints from a somewhat surprising direction, as both Peter Zijlstra and Thomas Gleixner strongly objected to perf being used in an active role. Their responses didn't leave room for any middle ground, with Zijlstra, who is one of the perf maintainers along with Molnar, saying that he and Gleixner would NAK "any and all patches that extend perf/ftrace beyond the passive observing role".

All of which led Drewry, who must be feeling a bit whipsawed at this point, to return to the patchset that seemed to have the most support: using Ftrace/perf-style filters, but maintaining the prctl() interface that is currently used by seccomp. Linus Torvalds had expressed some skepticism that the feature would have any real users, but Drewry outlined how it would be used by Chromium, and several other developers spoke up in favor of expanding seccomp, saying that QEMU, Linux containers (LXC), and others would use the feature. Those endorsements, along with resolving some other technical concerns, was enough for Torvalds to remove his objection to the feature. But, as might be guessed, Molnar is still not satisfied with the approach.

When Drewry reposted the patchset toward the end of June, and asked what the next steps were, Molnar noted that his concerns were not being addressed: "You are pushing the 'filter engine' approach currently, not the (much) more unified 'event filters' approach." But Drewry is trying to find a balance between the needs of the potential users, other maintainers, and Molnar's requests, which is somewhere between difficult and impossible:

Based on the support from potential API consumers, I believe there is interest in this patch series, and I worry that just like with the last two attempts in the last two years, this series will be relegated to the lwn archives in anticipation of a future solution that uses infrastructure that isn't quite ready. I'm trying to approach a problem that can be addressed today in a flexible, future-friendly way, rather than try to open up a larger cross-kernel impacting patch series that I'm unsure of exactly how to integrate sanely and don't know that I can commit to doing.

But Molnar is adamant that the "filter engine" approach is short-sighted, citing the diffstats of the various implementations as evidence:

Not doing it right because "it's too much work", especially as the trivial 'proof of concept' prototype already gave us something very promising that worked to a fair degree:

       bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
 filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
 event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)

are pretty hollow arguments to me. That diffstat sums up my argument of proper structure pretty well.

But, as Drewry points out, there is still a lot of work to be done to get beyond the proof-of-concept and to a fully fleshed-out solution. Given that the approach has already received several NAKs, doing all of that work has a very uncertain future. Drewry would like to see the feature be available soon, and is concerned that working on the larger problem is likely to delay that significantly, if it can ever get beyond the objections: "If all the other work is a prerequisite for system call restriction, I'll be very lucky to see anything this calendar year assuming I can even write the patches in that time."

Molnar is undeterred, however, suggesting that there is a path into the kernel through the tree that he co-maintains:

Do it properly generalized - as shown by the prototype patch. I can give you all help that is needed for that: we can host intermediate stages in -tip and we can push upstream step by step. You won't have to maintain some large in-limbo set of patches. 95% of the work you've identified will be warmly welcome by everyone and will be utilized well beyond sandboxing! That's not a bad starting position to get something controversial upstream: most of the crazy trees are 95% crazy.

The problem, of course, is that the 5% is the piece that Drewry and others are most interested in seeing (i.e. the system call restrictions for sandboxing) in the kernel. So, what Molnar seems to be offering is a fairly sizable chunk of work that could, in the end, still leave the "interesting" part out in the cold. Molnar may be confident that he can overcome the objections from Zijlstra and Gleixner, but Drewry can hardly be as sanguine. He describes the problem as he sees it:

It seems like a catch-22. There's not a perfectly clear path forward, and anything that looks like the perf-style proof of concept will be NACK'd by other maintainers. While I believe we could lift perf up off its foundation and create a shared location for storing perf events and ftrace events so that they will be inherited the same way (currently nack'd by linus) and walked the same way (kinda), the syscall interface couldn't currently be shared (also nack'd by perf), and creating a new one is possible modeled on the perf one, but it's also unclear what the ABI should be for a generic filtering system.

Both Zijlstra and Gleixner have been absent from the most recent discussion, so it's a little hard to guess what their thoughts are. In the absence of any kind of posting softening their stances, though, it would be a bad idea to believe that they have changed their minds.

It's a problem that we have seen before, where a new feature is, to some extent, held hostage to requests that a larger problem be solved. The problem was discussed at the 2009 Kernel Summit, where there was agreement that those requests should be advisory in nature, rather than demands. In this case, Molnar is not really demanding that the bigger task be done, just that he is uninterested in taking the code via the -tip tree unless it solves the larger problem.

It is unclear where things go from here. Drewry said that he would look at trying to do things Molnar's way ("but if my only chance of any form of this being ACK'd is to write it such that it shares code with perf and has a shiny new ABI, then I'll queue up the work for when I can start trying to tackle it"), but it may be a ways off. In the meantime, there are various projects interested in using the feature.

If falling back to the bitmask version of the feature solves enough of the problem for those projects, there is the possibility of trying to get that into the kernel via another tree (e.g. the security tree). There would undoubtedly be objections from Molnar, but if enough users lined up behind it, that might be a reasonable approach. It would create an ABI that would need to be maintained going forward, which is one of Molnar's objections, but it would solve problems for Chromium and others.

Steven Rostedt suggested adding the seccomp expansion as a discussion item for the Kernel Summit in October, which might provide a path forward. It's likely that most or all of the interested parties will be there (unlike the Linux Security Summit that will be held with Plumbers in September, which was suggested as an alternative). While a face-to-face discussion could be helpful, it might be a stretch to believe that the disagreement between active vs. passive perf could be resolved that way. On the other hand, it could lead to some kind of decree about the proper direction from Torvalds. That could go a long way toward resolving the issue.

Index entries for this article
Kernel	Security/seccomp
Security	Linux kernel