Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
[Posted May 12, 2009 by corbet]
From: Ingo Molnar <mingo-AT-elte.hu>
To: Adam Langley <agl-AT-google.com>, Andrew Morton <akpm-AT-linux-foundation.org>, Frédéric Weisbecker <fweisbec-AT-gmail.com>, Tom Zanussi <tzanussi-AT-gmail.com>, Li Zefan <l
Subject: Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
Date: Fri, 8 May 2009 00:14:47 +0200
Message-ID: <20090507221447.GE28770@elte.hu>
Cc: linux-kernel-AT-vger.kernel.org, markus-AT-google.com
(i've restored the Cc: line of the previous thread)
* Adam Langley <agl@google.com> wrote:
> (This is a discussion email rather than a patch which I'm
> seriously proposing be landed.)
>
> In a recent thread[1] my colleague, Markus, mentioned that we
> (Chrome Linux) are investigating using seccomp to implement our
> rendering sandbox[2] on Linux.
>
> In the same thread, Ingo mentioned[3] that he thought a bitmap of
> allowed system calls would be reasonable. If we had such a thing,
> many of the acrobatics that we currently need could be avoided.
> Since we need to support the currently existing kernels, we'll
> need to have the code for both, but allowing signal handling,
> gettimeofday, epoll etc would save a lot of overhead for common
> operations.
>
> The patch below implements such a scheme. It's written on top of
> the current seccomp for the moment, although it looks like seccomp
> might be written in terms of ftrace soon[4].
>
> Briefly, it adds a second seccomp mode (2) where one uploads a
> bitmask. Syscall n is allowed if, and only if, bit n is true in
> the bitmask. If n is beyond the range of the bitmask, the syscall
> is denied.
>
> If prctl is allowed by the bitmask, then a process may switch to
> mode 1, or may set a new bitmask iff the new bitmask is a subset
> of the current one. (Possibly moving to mode 1 should only be
> allowed if read, write, sigreturn, exit are in the currently
> allowed set.)
>
> If a process forks/clones, the child inherits the seccomp state of
> the parent. (And hopefully I'm managing the memory correctly
> here.)
>
> Ingo subsequently floated the idea of a more expressive interface
> based on ftrace which could introspect the arguments, although I
> think the discussion had fallen off list at that point.
>
> He suggested using an ftrace parser which I'm not familiar with, but can
> be summed up with:
> seccomp_prctl("sys_write", "fd == 3") // allow writes only to fd 3
It's the ftrace filter parser and execution engine.
I.e. we first parse the filter expression when setting up a seccomp
context. Each syscall has the following attributes:
    on        # enabled unconditionally
    off       # disabled unconditionally
    filtered
In the filtered case, the filter can be simple:
"fd == 0"
To restrict sys_write() to a single fd (but still allow sys_read()
from other fds).
Or as complex as:
(fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)
To restrict IO to two specific fds and to restrict output to a
specific memory address and to restrict size to 4K or smaller.
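
As a rough illustration of the three per-syscall states described above, here is a minimal user-space sketch in Python. It is not the kernel interface: the names `policy`, `allowed` and `write_filter` are hypothetical, the filter is hand-written rather than parsed from a string, and denying unlisted syscalls is an assumption made for the example.

```python
from typing import Callable, Dict, Union

# Hypothetical names throughout; this only models the on/off/filtered idea.
ON, OFF = "on", "off"
Filter = Callable[[Dict[str, int]], bool]
Policy = Dict[str, Union[str, Filter]]


def write_filter(args: Dict[str, int]) -> bool:
    # Mirrors: (fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)
    return (
        args["fd"] in (4, 5)
        and args["buf"] == 0x12340000
        and args["size"] <= 4096
    )


policy: Policy = {
    "sys_read": ON,             # enabled unconditionally
    "sys_open": OFF,            # disabled unconditionally
    "sys_write": write_filter,  # filtered
}


def allowed(policy: Policy, name: str, args: Dict[str, int]) -> bool:
    rule = policy.get(name, OFF)  # assumption: unlisted syscalls are denied
    if rule == ON:
        return True
    if rule == OFF:
        return False
    return bool(rule(args))       # filtered: decided by the argument values
```

With this policy, a `sys_write` of 4096 bytes from `0x12340000` to fd 4 passes, the same write to fd 3 is refused, and `sys_read` is accepted on any fd.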
This is how the filter engine works: we parse the string and save it
into a binary expression structure (cache) that can later on be run
by the engine in a pretty fast way. (without any string parsing or
formatting overhead in the validation fastpath)
The filter is thus evaluated in the sandbox task's context, without
the need for any context-switching. It's very, very fast. It is i
think faster than LSM rules, and it is also atomic and lockless (RCU
based).
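
The parse-once, evaluate-many split described above can be sketched in user space as follows. This is only an illustration in Python, not the ftrace filter code: `compile_filter` and `evaluate` are hypothetical names, the grammar is limited to `==`, `<=`, `&&`, `||` and parentheses, and the nested-tuple tree merely stands in for the kernel's cached expression structure.

```python
import re

# One token: an operator, a parenthesis, a hex or decimal number, or a field name.
TOKEN = re.compile(r"\s*(\|\||&&|==|<=|\(|\)|0x[0-9a-fA-F]+|\d+|[A-Za-z_]\w*)")
NAME = re.compile(r"[A-Za-z_]\w*")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"bad filter near {text[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def compile_filter(text):
    """Parse the string once into a nested-tuple expression tree."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of filter")
        pos += 1
        return tok

    def parse_or():
        node = parse_and()
        while peek() == "||":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_atom()
        while peek() == "&&":
            take()
            node = ("and", node, parse_atom())
        return node

    def parse_atom():
        tok = take()
        if tok == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if not NAME.fullmatch(tok):
            raise ValueError(f"expected field name, got {tok!r}")
        op = take()
        if op not in ("==", "<="):
            raise ValueError(f"expected comparison, got {op!r}")
        return (op, tok, int(take(), 0))  # base 0 accepts 0x... and decimal

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"trailing token {peek()!r}")
    return tree


def evaluate(node, args):
    """Fast path: walk the prebuilt tree, no string handling involved."""
    op, left, right = node
    if op == "or":
        return evaluate(left, args) or evaluate(right, args)
    if op == "and":
        return evaluate(left, args) and evaluate(right, args)
    if op == "==":
        return args[left] == right
    return args[left] <= right  # op == "<="
```

The string is handled only in `compile_filter`; every later check calls `evaluate` on the stored tree with the syscall's argument values, which is the property the fastpath relies on.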
> In general, I believe that ftrace based solutions cannot safely
> validate arguments which are in user-space memory when multiple
> threads could be racing to change the memory between ftrace and
> the eventual copy_from_user. Because of this, many useful
> arguments (such as the sockaddr to connect, the filename to open
> etc) are out of reach. LSM hooks appear to be the best way to
> impose limits in such cases. (Which we are also experimenting
> with).
That assessment is incorrect: there's no real difference in safety
between the two approaches here.
LSM cannot magically inspect user-space memory either when multiple
threads may access it. The point would be to define filters for
system call _arguments_, which are inherently thread-local and safe.
> However, such a parser could be very useful in one particular
> case: socketcall on IA32. Allowing recvmsg and sendmsg, but not
> socket, connect etc is certainly something that we would be
> interested in.
There are two problems with the bitmap scheme, which i also
suggested in a previous thread but then found to be lacking:
1) enumeration: you define a bitmap. That will be problematic
between compat and native 64-bit (both have different syscall
vectors).
2) flexibility. It's an on/off selection per syscall. With the
filter we have on, off, or filtered. That's a _whole_ lot more
flexible.
The filter expression based solution does not suffer from this: it
is string enumerated. "sys_read" means that syscall, and we could
specify whether it's the compat or the native one.
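
The enumeration problem can be made concrete with a small sketch. The syscall numbers below are a short excerpt of the i386 and x86_64 tables; the helper names (`bitmask_allows`, `names_allowed_by_mask`, `name_allows`) are hypothetical and the code only models the idea in Python, it is not the proposed kernel interface.

```python
# A few syscall numbers from two x86 ABIs (excerpt only).
I386 = {"sys_exit": 1, "sys_fork": 2, "sys_read": 3, "sys_write": 4}
X86_64 = {"sys_read": 0, "sys_write": 1, "sys_open": 2, "sys_close": 3}


def bitmask_allows(mask: int, nr: int, nbits: int) -> bool:
    # Allowed iff bit nr is set; numbers beyond the mask are denied.
    return 0 <= nr < nbits and bool((mask >> nr) & 1)


def names_allowed_by_mask(mask: int, nbits: int, table: dict) -> set:
    # Which of the listed syscalls does this mask let through on this ABI?
    return {name for name, nr in table.items()
            if bitmask_allows(mask, nr, nbits)}


def name_allows(allowed_names: set, nr: int, table: dict) -> bool:
    # String-enumerated policy: resolve the number per ABI, then match by name.
    by_number = {number: name for name, number in table.items()}
    return by_number.get(nr) in allowed_names


# A mask built from the x86_64 numbers for read and write (bits 0 and 1).
mask = (1 << X86_64["sys_read"]) | (1 << X86_64["sys_write"])

native = names_allowed_by_mask(mask, 8, X86_64)  # read and write
compat = names_allowed_by_mask(mask, 8, I386)    # only sys_exit: bit 1 is exit
```

The same two bits permit read and write natively but permit exit for a compat task, whereas a name-keyed policy such as `{"sys_read", "sys_write"}` keeps its meaning under either table.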
Ingo