Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.

[Posted May 12, 2009 by corbet]

From:		Ingo Molnar <mingo-AT-elte.hu>
To:		Adam Langley <agl-AT-google.com>, Andrew Morton <akpm-AT-linux-foundation.org>, Frédéric Weisbecker <fweisbec-AT-gmail.com>, Tom Zanussi <tzanussi-AT-gmail.com>, Li Zefan <l
Subject:		Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
Date:		Fri, 8 May 2009 00:14:47 +0200
Message-ID:		<20090507221447.GE28770@elte.hu>
Cc:		linux-kernel-AT-vger.kernel.org, markus-AT-google.com

(i've restored the Cc: line of the previous thread)

* Adam Langley <agl@google.com> wrote:

> (This is a discussion email rather than a patch which I'm 
> seriously proposing be landed.)
> 
> In a recent thread[1] my colleague, Markus, mentioned that we 
> (Chrome Linux) are investigating using seccomp to implement our 
> rendering sandbox[2] on Linux.
> 
> In the same thread, Ingo mentioned[3] that he thought a bitmap of 
> allowed system calls would be reasonable. If we had such a thing, 
> many of the acrobatics that we currently need could be avoided. 
> Since we need to support the currently existing kernels, we'll 
> need to have the code for both, but allowing signal handling, 
> gettimeofday, epoll etc would save a lot of overhead for common 
> operations.
> 
> The patch below implements such a scheme. It's written on top of 
> the current seccomp for the moment, although it looks like seccomp 
> might be written in terms of ftrace soon[4].
> 
> Briefly, it adds a second seccomp mode (2) where one uploads a 
> bitmask. Syscall n is allowed if, and only if, bit n is true in 
> the bitmask. If n is beyond the range of the bitmask, the syscall 
> is denied.
> 
> If prctl is allowed by the bitmask, then a process may switch to 
> mode 1, or may set a new bitmask iff the new bitmask is a subset 
> of the current one. (Possibly moving to mode 1 should only be 
> allowed if read, write, sigreturn, exit are in the currently 
> allowed set.)
> 
> If a process forks/clones, the child inherits the seccomp state of 
> the parent. (And hopefully I'm managing the memory correctly 
> here.)
> 
> Ingo subsequently floated the idea of a more expressive interface 
> based on ftrace which could introspect the arguments, although I 
> think the discussion had fallen off list at that point.
> 
> He suggested using an ftrace parser which I'm not familiar with, but can
> be summed up with:
>   seccomp_prctl("sys_write", "fd == 3")  // allow writes only to fd 3

It's the ftrace filter parser and execution engine.

I.e. we first parse the filter expression when setting up a seccomp 
context. Each syscall has the following attributes:

 on                # enabled unconditionally
 off               # disabled unconditionally
 filtered          # enabled only when the filter expression matches

In the filtered case, the filter can be simple:

	"fd == 0"

This restricts sys_write() to a single fd (but still allows 
sys_read() from other fds).

Or as complex as:

	(fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)

This restricts IO to two specific fds, restricts output to a 
specific memory address, and restricts size to 4K or smaller.
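
A rough way to picture the three states (an illustrative Python 
sketch, not kernel code): the POLICY table, the Mode enum and 
allowed() are hypothetical names, and denying syscalls that are 
missing from the table is an assumption, not something stated above. 
The filter is a plain Python predicate here, standing in for a 
compiled filter expression.

```python
from enum import Enum


class Mode(Enum):
    ON = "on"              # enabled unconditionally
    OFF = "off"            # disabled unconditionally
    FILTERED = "filtered"  # enabled only if the predicate accepts the args


# Hypothetical policy table: syscall name -> (mode, predicate or None).
POLICY = {
    "sys_read": (Mode.ON, None),
    "sys_open": (Mode.OFF, None),
    "sys_write": (
        Mode.FILTERED,
        # Mirrors the complex example: two fds, one buffer, at most 4K.
        lambda a: a["fd"] in (4, 5)
        and a["buf"] == 0x12340000
        and a["size"] <= 4096,
    ),
}


def allowed(name, args):
    """Return True if the call passes the policy.

    Assumption: syscalls absent from the table are treated as "off".
    """
    mode, predicate = POLICY.get(name, (Mode.OFF, None))
    if mode is Mode.ON:
        return True
    if mode is Mode.OFF:
        return False
    return bool(predicate(args))
```

With this table, a sys_write() to fd 4 from the permitted buffer 
passes, while the same call on fd 3 is refused and sys_read() is 
never checked at all.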

This is how the filter engine works: we parse the string and save it 
into a binary expression structure (cache) that can later on be run 
by the engine in a pretty fast way. (without any string parsing or 
formatting overhead in the validation fastpath)
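
The parse-once, evaluate-many idea can be sketched in user space. 
This is a toy model in Python, not the ftrace filter code: 
compile_filter() and evaluate() are made-up names, the grammar covers 
only integer comparisons joined by && and ||, and giving && higher 
precedence than || is an assumption.

```python
import operator
import re

TOKEN = re.compile(
    r"\s*(\|\||&&|==|!=|<=|>=|<|>|\(|\)|0[xX][0-9a-fA-F]+|\d+|[A-Za-z_]\w*)"
)
COMPARE = {"==": operator.eq, "!=": operator.ne, "<=": operator.le,
           ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def tokenize(text):
    """Split a filter string into tokens, rejecting anything unrecognised."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"bad filter syntax at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def compile_filter(text):
    """Parse a filter string once into a nested-tuple expression tree."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of filter")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        node = parse_and()
        while peek() == "||":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_primary()
        while peek() == "&&":
            take()
            node = ("and", node, parse_primary())
        return node

    def parse_primary():
        token = take()
        if token == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        field, op, value = token, take(), take()
        if not field.isidentifier() or op not in COMPARE:
            raise ValueError(f"bad comparison near {field!r}")
        # int(..., 0) accepts both decimal and 0x-prefixed literals.
        return ("cmp", field, COMPARE[op], int(value, 0))

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree


def evaluate(tree, args):
    """Walk the pre-parsed tree; no string handling happens here."""
    if tree[0] == "cmp":
        _, field, compare, value = tree
        return compare(args[field], value)
    kind, left, right = tree
    if kind == "and":
        return evaluate(left, args) and evaluate(right, args)
    return evaluate(left, args) or evaluate(right, args)
```

All string work happens in compile_filter(); evaluate() only walks 
the stored tree against a dict of argument values, which is the 
property that keeps the validation fastpath cheap.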

The filter is thus evaluated in the sandbox task's context, without 
the need for any context-switching. It's very, very fast. It is, i 
think, faster than LSM rules, and it is also atomic and lockless (RCU 
based).

> In general, I believe that ftrace based solutions cannot safely 
> validate arguments which are in user-space memory when multiple 
> threads could be racing to change the memory between ftrace and 
> the eventual copy_from_user. Because of this, many useful 
> arguments (such as the sockaddr to connect, the filename to open 
> etc) are out of reach. LSM hooks appear to be the best way to 
> impose limits in such cases. (Which we are also experimenting 
> with).

That assessment is incorrect: there's really no difference in safety 
between the two approaches here.

LSM cannot magically inspect user-space memory either when multiple 
threads may access it. The point would be to define filters for 
system call _arguments_, which are inherently thread-local and safe.

> However, such a parser could be very useful in one particular 
> case: socketcall on IA32. Allowing recvmsg and sendmsg, but not 
> socket, connect etc is certainly something that we would be 
> interested in.

There are two problems with the bitmap scheme, which i also 
suggested in a previous thread but then found to be lacking:

1) enumeration: you define a bitmap. That will be problematic 
   between compat and native 64-bit (both have different syscall 
   vectors).

2) flexibility. It's an on/off selection per syscall. With the 
   filter we have on, off, or filtered. That's a _whole_ lot more 
   flexible.

The filter expression based solution does not suffer from these 
problems: it is string enumerated. "sys_read" means that syscall, and 
we could specify whether it's the compat or the native one.
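
To make the enumeration problem concrete, here is a small Python 
sketch of the bitmap semantics from the quoted proposal (bit n set 
means syscall n is allowed, out-of-range numbers are denied, a new 
bitmap must be a subset of the current one). The helper names are 
hypothetical. The numbers are the real read/write syscall numbers on 
i386 and x86-64.

```python
def bitmap_allows(bitmap, nbits, nr):
    """Syscall nr is allowed iff bit nr is set; out-of-range is denied."""
    if nr < 0 or nr >= nbits:
        return False
    return ((bitmap >> nr) & 1) == 1


def is_subset(new, current):
    """True if the new bitmap enables nothing the current one forbids."""
    return (new & ~current) == 0


# read/write syscall numbers on the two x86 ABIs.
SYSCALL_NR = {
    "i386": {"sys_read": 3, "sys_write": 4},
    "x86_64": {"sys_read": 0, "sys_write": 1},
}


def bitmap_for(abi, names):
    """Build the bitmap that allows the named syscalls on one ABI."""
    bitmap = 0
    for name in names:
        bitmap |= 1 << SYSCALL_NR[abi][name]
    return bitmap


native = bitmap_for("x86_64", ["sys_read", "sys_write"])
compat = bitmap_for("i386", ["sys_read", "sys_write"])
```

The same two syscalls need bits 0 and 1 on x86-64 but bits 3 and 4 on 
i386, so a bitmap written for one ABI allows the wrong calls on the 
other. Keying the policy by name lets each ABI resolve "sys_read" to 
its own number.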

	Ingo