Proper handling of unknown flags in system calls

February 26, 2014

This article was contributed by Michael Kerrisk.

As noted by various commenters on our earlier article, "Flags as a system call API design pattern", there is an important step that is required when equipping a system call with a flags argument. If the kernel does not take care to return an error if unknown flags are set, it is setting the stage for a number of compatibility problems in the future. Unfortunately, the history of Linux (and Unix) system call development shows that this lesson has been hard to learn.

In particular, the implementation of any system call that takes a flags argument (or indeed any input structure argument that has a bit-flags field) should always include a check of the following form:

	if (flags & ~(FL_XXX | FL_YYY))
	    return -EINVAL;

Here, FL_XXX and FL_YYY form the hypothetical set of flags that the system call understands, and the effect of this check is to deliver an error when the caller specifies any bit value other than one in the set. Checks like this future-proof the API against the day when the system call understands additional flags. Suppose that the system call adds a new flag, FL_ZZZ, and adjusts its check to:

	if (flags & ~(FL_XXX | FL_YYY | FL_ZZZ))
	    return -EINVAL;

A user-space application is now able to check whether it is running on a kernel where the system call supports FL_ZZZ by checking for an EINVAL error when making the system call. This allows the application to flexibly deal with system call differences across kernel versions.
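As a rough illustration of that probing pattern, the sketch below models the two kernel versions as ordinary C functions, since the system call and its flags are hypothetical. The names old_kernel_call(), new_kernel_call(), probe_supports_zzz(), call_with_fallback(), and the numeric flag values are all assumptions made for this example; a real application would invoke the actual system call and inspect errno instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the values are chosen only for illustration. */
#define FL_XXX 0x1u
#define FL_YYY 0x2u
#define FL_ZZZ 0x4u

/* Signature shared by both simulated versions of the system call. */
typedef int (*syscall_fn)(unsigned int flags);

/* Simulated older kernel: knows only FL_XXX and FL_YYY. */
int old_kernel_call(unsigned int flags)
{
	if (flags & ~(FL_XXX | FL_YYY))
		return -EINVAL;
	return 0;
}

/* Simulated newer kernel: also accepts FL_ZZZ. */
int new_kernel_call(unsigned int flags)
{
	if (flags & ~(FL_XXX | FL_YYY | FL_ZZZ))
		return -EINVAL;
	return 0;
}

/* Probe: FL_ZZZ is supported unless the call rejects it with EINVAL. */
bool probe_supports_zzz(syscall_fn call)
{
	assert(call != NULL);
	return call(FL_ZZZ) != -EINVAL;
}

/* Try the call as requested; if FL_ZZZ is rejected, retry without it. */
int call_with_fallback(syscall_fn call, unsigned int flags)
{
	int ret;

	assert(call != NULL);
	ret = call(flags);
	if (ret == -EINVAL && (flags & FL_ZZZ))
		ret = call(flags & ~FL_ZZZ);
	return ret;
}
```

Note that a probe like this only works because the older kernel rejects the unknown bit; if it silently accepted FL_ZZZ, both versions would look identical to the caller. A real probe also has to account for whatever side effects the call has when it succeeds.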

Although implementing flags checks such as the above inside the kernel might seem simple and obvious, it turns out that dozens of system calls don't make this check, including clock_nanosleep(), clone(), epoll_ctl(), fcntl(F_SETFL), mmap(), msgrcv(), msgsnd(), open(), recv(), send(), sigaction(), splice(), unshare(), and many others.

Most of those system calls have been around for several years. In more recent times, most new system calls that have a flags argument include the required check. However, such checks are missing even in a few system calls added in recent kernel versions, such as open_by_handle_at() (2.6.39), recvmmsg() (2.6.33), and sendmmsg() (3.0). In each of those recent cases, the implementer was presumably emulating the lack of checking that was done in the corresponding earlier system call (open(), recv(), send()). However, the failure to add the checks represents a missed opportunity to improve on the original API.

For each of the system calls that lack a check on the flags argument, user-space applications have no easy way of detecting what API flags a particular kernel version supports. Furthermore, failure to implement such checks in the kernel can also complicate the lives of kernel developers, as a couple of examples demonstrate.

When the kernel fails to check that only valid bits are passed in flags, user-space applications can, with impunity, place random garbage in the "unused" bits of flags. If a kernel developer then decides to make use of one of the hitherto unused bits, this may lead to surprising breakage in user-space applications, which in turn may require the kernel developer to write suboptimal implementations of new user-space API features. One recent example of this was in the implementation of the EPOLLWAKEUP flag, where avoiding user-space breakage meant that the kernel silently ignored this flag if the caller did not have the CAP_BLOCK_SUSPEND capability. Ideally, of course, the kernel would have informed the caller by returning an error from the call. Consequently, applications that want to be absolutely sure that the call will succeed must explicitly check beforehand that they have the CAP_BLOCK_SUSPEND capability.
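One way an application might perform that advance check is to read its effective capability set from /proc/self/status, as in the sketch below. This is only one possible approach (libcap offers another), and the helper names capeff_line_has_block_suspend() and have_block_suspend() are invented for this example. The capability number 36 is the value of CAP_BLOCK_SUSPEND in <linux/capability.h>.

```c
#include <assert.h>
#include <stdio.h>

/* CAP_BLOCK_SUSPEND is capability number 36 in <linux/capability.h>. */
#define CAP_BLOCK_SUSPEND_NR 36

/*
 * Examine one line of /proc/self/status. Returns 1 if it is the CapEff
 * line and the CAP_BLOCK_SUSPEND bit is set, 0 if it is the CapEff line
 * and the bit is clear, and -1 if it is some other line.
 */
int capeff_line_has_block_suspend(const char *line)
{
	unsigned long long mask;

	assert(line != NULL);
	if (sscanf(line, "CapEff: %llx", &mask) != 1)
		return -1;
	return (int)((mask >> CAP_BLOCK_SUSPEND_NR) & 1);
}

/* Returns 1 if the capability is effective, 0 if not, -1 if unknown. */
int have_block_suspend(void)
{
	char line[256];
	int result = -1;
	FILE *fp = fopen("/proc/self/status", "r");

	if (fp == NULL)
		return -1;
	while (fgets(line, sizeof(line), fp) != NULL) {
		result = capeff_line_has_block_suspend(line);
		if (result != -1)
			break;
	}
	fclose(fp);
	return result;
}
```

An application would then include EPOLLWAKEUP in its epoll_ctl() request only when have_block_suspend() returns 1, and otherwise treat the wakeup guarantee as unavailable rather than relying on the call's return value.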

An even more recent example was in the implementation of the O_TMPFILE flag for open(), where the flag definition incorporated the O_DIRECTORY flag, with the goal that older kernels that do not support O_TMPFILE would give an error if the flag was specified in a call to open(). This was necessary, because applications that create temporary files are often security conscious, and need to know whether their requests to create hidden temporary files have been honored. Without this fix, the O_TMPFILE flag would be silently ignored on older kernels, and an application might end up creating a visible file. An unpleasant side effect of that implementation is that user-space applications must check for two different errors from open() in order to determine whether they are running on a kernel that doesn't support O_TMPFILE.
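The error handling this imposes on applications might look something like the following sketch. The helper names tmpfile_unsupported() and open_hidden_tmpfile() are invented for this example. According to the open(2) manual page, a kernel without O_TMPFILE fails with EISDIR when the directory exists and ENOENT when it does not; EOPNOTSUPP is included here as well because it indicates that the filesystem (rather than the kernel) lacks support.

```c
#define _GNU_SOURCE	/* Exposes O_TMPFILE in <fcntl.h> on glibc. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Does this errno value suggest that O_TMPFILE is unavailable?
 * EISDIR and ENOENT come from kernels that predate the flag;
 * EOPNOTSUPP comes from a filesystem that does not implement it.
 * ENOENT is ambiguous: a newer kernel also returns it when the
 * directory really does not exist.
 */
bool tmpfile_unsupported(int err)
{
	return err == EISDIR || err == ENOENT || err == EOPNOTSUPP;
}

/* Open an unnamed temporary file in dir, or return -1 with errno set. */
int open_hidden_tmpfile(const char *dir)
{
	assert(dir != NULL);
#ifdef O_TMPFILE
	return open(dir, O_TMPFILE | O_RDWR, 0600);
#else
	/* Headers too old to know the flag: report it as unsupported. */
	errno = EOPNOTSUPP;
	return -1;
#endif
}
```

A security-conscious caller would check tmpfile_unsupported(errno) after a failed open_hidden_tmpfile() and then either refuse to continue or fall back to a visible temporary file by some other means, knowing that the file will not be hidden.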

Finally, it is worth mentioning that a few system calls have added the required flags check after the call was first implemented. Two examples are the ancient system calls umount2() (check added in Linux 2.6.34) and swapon() (check added in Linux 3.4). In addition, the mremap() call, which first appeared in Linux 2.0, added the check in Linux 2.4, and the timerfd_settime() system call, which first appeared in Linux 2.6.25, added the check in Linux 2.6.29.

However, the addition of flags checks to these system calls represents an exception to the general rule that such checks cannot be added after the fact, because doing so would break existing applications that happen to pass random garbage in the "unused" bits of flags. With umount2() and swapon(), the change was possible presumably because there are few users of these system calls other than the mount and swapon commands, and those programs could be modified if the kernel change caused them to break. In the case of timerfd_settime(), the change was made soon after the initial implementation, when there were likely to have been few users of the interface. And in the case of mremap(), the change was made at the time of a major kernel version change (from 2.2 to 2.4), when such ABI changes were occasionally permitted; with the contemporary 10-week release cycle, such changes are not permitted.

Thus if the check on unused flag bits is not included in the initial implementation, it is often impossible to add it later. The clear conclusion is that any addition of flag bits to a system call should come with the proper checks from the outset.

Distinguish between old and new programs

Posted Feb 27, 2014 11:30 UTC (Thu) by epa (subscriber, #39769) [Link] (3 responses)

Add a new system call api_level(). By default the level is 0. Any extra flags passed to clone() will be quietly ignored. If the process sets api_level(1) then strict checking is enabled on the flags.

When a new flag is added to clone() it can continue to be ignored with api_level 0 and only have effect with 1 and above.

I imagine that the standard C library would call api_level as soon as the process is created.

Distinguish between old and new programs

Posted Feb 27, 2014 12:52 UTC (Thu) by HelloWorld (guest, #56129) [Link]

Yes, some sort of approach to handle this is needed. There's no excuse for inflicting this kind of brain damage:
> the kernel silently ignored this flag if the caller did not have the CAP_BLOCK_SUSPEND capability
on all future developer generations to come.

Distinguish between old and new programs

Posted Mar 2, 2014 23:58 UTC (Sun) by andreasb (guest, #80258) [Link] (1 responses)

This sounds like an api_level() global to the process. There's a big problem with that: You can't link different code using different api_levels into one program, unless you want to call api_level() before every syscall (and properly protect it in multithreaded programs) which sounds quite impractical.

Linked libraries would be the obvious biggest problem. Also, if you want to use a new flag for one syscall in your program and have to raise api_level for that, you'd have to audit the complete source for any syscalls incompatible with that api_level.

Distinguish between old and new programs

Posted Mar 3, 2014 10:31 UTC (Mon) by epa (subscriber, #39769) [Link]

You're right, it would be for the whole process. Better than nothing, I think. I don't agree that you would always have to 'audit the complete source' because most access to the kernel goes through the standard C library, which will perform its own validation or scrubbing on flags. The upgrade to api_level would be by upgrading to a newer version of libc. It is common for systems to have older libcs available to support older binaries.

Hidden flag

Posted Feb 27, 2014 22:31 UTC (Thu) by cesarb (subscriber, #6266) [Link] (3 responses)

If the syscall has at least one pointer parameter, there is an easy fix: use the top bit of the pointer as a "validate flags" flag. Older kernels will return an error ("wtf are you trying to do with a kernel pointer, userspace program?"). Newer kernels will mask it and carry on.

Note: I am joking.

Hidden flag

Posted Feb 28, 2014 10:17 UTC (Fri) by renox (guest, #23785) [Link] (2 responses)

Can't do this only for the OS: having access to the top bits of pointers (on 64-bit address spaces) would be very useful for garbage collectors, which could then efficiently record which pointers they have visited.

That said, it's still possible: two bits for the GC (in case one isn't enough), two bits for the 'OS compatibility' (in case one isn't enough), which leaves 60 bits for the real address, more than enough.

Note: I am *not* joking.

Hidden flag

Posted Feb 28, 2014 22:50 UTC (Fri) by cesarb (subscriber, #6266) [Link]

Whatever bits the userspace program is using for GC or other tagging already have to be masked out before passing the pointer to the kernel, so it does not matter if the same bits are used for userspace tags and system call flags. They won't conflict.

That said, I stand by my proposal being a joke (though my jokes being taken seriously has happened before). It only works if the address space is split right in the middle; it would work for x86-64 and for 32-bit architectures with a 2G/2G split, but would not work for 32-bit architectures with a 3G/1G split, where there are valid userspace pointers with the high bit set. It's also a horribly ugly kludge.

Hidden flag

Posted Jul 18, 2014 22:25 UTC (Fri) by jengelh (guest, #33263) [Link]

>having access to the top bit of a pointers (on 64bit address spaces) would be very useful for garbage collectors which could then record efficiently which pointers they have visited or not.

Unless I am mistaken, AddressSanitizer already does something like that to keep state.

https://code.google.com/p/address-sanitizer/wiki/AddressS...

Proper handling of unknown flags in system calls

Posted Feb 28, 2014 14:41 UTC (Fri) by paulj (subscriber, #341) [Link]

I have to say, building APIs that require that users of new flags have to search for which subset are supported by testing for EINVAL is horrid.

Catching misuse, and "standardizing after the fact"

Posted Feb 28, 2014 15:13 UTC (Fri) by davecb (subscriber, #1574) [Link] (2 responses)

We had exactly this problem on Solaris. David J. Brown of spec-1170 fame went through and removed all the edge-cases from our implementation, over a number of years.

We offered a "next version" guarantee and provided a tool, "appcert". If your program passed appcert but didn't run on the next release of Solaris, it was Sun's fault and we were on the hook to fix it. Think of it as "validate once, run forever".

Mostly, people failed appcert, but a few passed and blew up, thus exposing new bugs for us before they shipped their product.

Linux could do something similar: announce an addition is coming in the next release and ship a test tool to detect breakage. Obviously one would also ship another optional tool to let a broken app keep running (;-))

In the specific case of flags, it's easy to write an interposer (interceptor) that checks the flags. Easy, in this case, was on the order of 30 seconds ...

$ cat open.c
/*
 * open -- intercept calls to open to catch flag errors
 * Generated by mkinterposer from a template by David Collier-Brown.
 * This is free software, copyright is disclaimed by the author.
 */
#define _GNU_SOURCE	/* For RTLD_NEXT. */
#define MASK 0xDEADBEEF	/* Placeholder: bits that must not be set in flags. */

#include <dlfcn.h>	/* For dlsym(). */
#include <link.h>
#include <assert.h>	/* For assert() macro. */
#include <fcntl.h>	/* For open(). */
#include <stdarg.h>	/* For va_start() and friends. */
#include <sys/stat.h>	/* For open(). */
#include <sys/types.h>	/* For open(). */
/*
 * open -- intercept open
 */
int open(const char *pathname, int flags, ...) {
	static int (*actual)(const char *, int, ...) = NULL;
	mode_t mode = 0;

	if (actual == NULL) {
		/* Look up the next definition of open() (normally libc's). */
		actual = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
		assert(actual != NULL);
	}
	if (flags & O_CREAT) {
		/* The mode argument is only present when creating a file. */
		va_list ap;

		va_start(ap, flags);
		mode = (mode_t)va_arg(ap, int);
		va_end(ap);
	}
	assert((flags & MASK) == 0);
	return actual(pathname, flags, mode);
}

The same kind of interposer can be used to fix the flags, for people who don't have the source code for a buggy program they still want to run.
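A flag-fixing variant might replace the assert() on the flags with a masking step along the lines of the sketch below. The helper names scrub_flags() and unknown_flags(), and the KNOWN_OPEN_FLAGS list, are assumptions made for illustration; the list is deliberately short and is not a complete set of valid open() flags for any particular kernel.

```c
#include <assert.h>
#include <fcntl.h>

/*
 * Bits this sketch treats as valid for open(). This is an assumption
 * for illustration only, not a complete list for any kernel.
 */
#define KNOWN_OPEN_FLAGS (O_ACCMODE | O_CREAT | O_EXCL | O_NOCTTY | \
			  O_TRUNC | O_APPEND | O_NONBLOCK)

/* Return the bits of flags that fall outside known_mask. */
int unknown_flags(int flags, int known_mask)
{
	return flags & ~known_mask;
}

/* Return flags with every bit outside known_mask cleared. */
int scrub_flags(int flags, int known_mask)
{
	int scrubbed = flags & known_mask;

	/* Postcondition: nothing unknown survives the scrub. */
	assert(unknown_flags(scrubbed, known_mask) == 0);
	return scrubbed;
}
```

Inside the interposer, the call to the real function would then pass scrub_flags(flags, KNOWN_OPEN_FLAGS) instead of flags, perhaps logging unknown_flags(flags, KNOWN_OPEN_FLAGS) first so that the misbehaving caller can be identified.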

If anyone wants the mkinterposer script or a complete set of interposers, send me mail.

--dave
davecb@spamcop.net

Catching misuse, and "standardizing after the fact"

Posted Feb 28, 2014 18:42 UTC (Fri) by mm7323 (subscriber, #87386) [Link] (1 responses)

Perhaps strace(1) would be a good starting point? As a side benefit it would at least update strace with knowledge of all valid flags and system calls.

Catching misuse, and "standardizing after the fact"

Posted Feb 28, 2014 19:46 UTC (Fri) by davecb (subscriber, #1574) [Link]

We prototyped this sort of thing using strace (truss, actually: it was Solaris) but that's an intrusive, heavyweight program and not suitable for production. It's also a lot of work!

Adding a library and one more call/return is harmless, and can be added to a regression-test suite with LD_PRELOAD=/path/to/interposer.so.1

As it happens, we fell in love with shared libraries and built several tools, including apptrace, a standard-library strace equivalent, a shared library profiler and a transaction performance monitor, all lightweight and non-crash-inducing.

Proper handling of unknown flags in system calls

Posted Mar 3, 2014 8:01 UTC (Mon) by dlang (guest, #313) [Link] (3 responses)

I'll also point out the example of the ecn flag bit in TCP. It was marked as 'reserved' for years, so suitably paranoid firewalls either dropped packets with it, or zeroed it out. This caused grief when the bit was redefined to mean something new and large portions of the Internet couldn't use it.

In this case, I think the firewalls were doing the right thing. They didn't know what this bit meant and so had no idea what the effect of allowing it would be.

You need to think of this sort of situation when you talk about how to handle unknown flags.

Proper handling of unknown flags in system calls

Posted Mar 5, 2014 14:29 UTC (Wed) by mkerrisk (subscriber, #1978) [Link] (2 responses)

> I'll also point out the example of the ecn flag bit in TCP. It was
> marked as 'reserved' for years, so suitably paranoid firewalls
> either dropped packets with it, or zeroed it out. This caused
> grief when the bit was redefined to mean something new and large
> portions of the Internet couldn't use it.

(Just as background for others, and to make sure that we're talking about the same thing, I presume we are talking about this: http://tools.ietf.org/search/rfc3360#section-3 )

Someone else also mentioned this example to me. I haven't thought about it at length, but it seems to me that this case is different in kind. The problem there is that *a third party* is mediating the conversation between two end-points. In that case, things get more tricky. But, the case of a direct two-party user-space-to-kernel conversation is simpler, and I think the "check for invalid flags" rule always applies.

Proper handling of unknown flags in system calls

Posted Mar 5, 2014 14:31 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

You are correct, but we are getting to the point where there are also 'third party' security tools trying to mediate between userspace and the kernel (seccomp, SELinux policies, etc) so the comparison isn't as dissimilar as it may seem at first glance.

Proper handling of unknown flags in system calls

Posted Mar 5, 2014 15:22 UTC (Wed) by mkerrisk (subscriber, #1978) [Link]

I see where you're going (I think), but I don't think it changes things from the perspective of the kernel's system call service routine, which should still do the EINVAL flags check. However, different rules may be needed for the intermediaries that you refer to.