mseal() and what comes after

By Jonathan Corbet
October 20, 2023

Jeff Xu recently proposed the addition of a new system call, named mseal(), that would allow applications to prevent modifications to selected memory mappings. It would enable the hardening of user-space applications against certain types of attacks; some other operating systems have this type of feature already. There is support for adding this type of mechanism to the Linux kernel as well, but it has become clear that mseal() will not land in the mainline in anything resembling its current form. Instead, it has become an example of how not to do kernel development at a number of levels.

Xu described the new system call's purpose as:

Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped.

The target user for this functionality is the Chrome browser which, among other things, includes a just-in-time (JIT) compilation engine for JavaScript code. Since it generates executable code on the fly, JIT compilation must be done with care, lest it create (and run) the wrong kind of code. As described in this blog post by Stephen Röttger, a lot of effort has gone into control-flow integrity and preventing the JIT system from becoming a tool for an attacker. If, however, an attacker is able to somehow force a memory-management system call that changes memory permissions, all bets are off. Thus, the Chrome developers would like to have a mechanism that puts those system calls off-limits for specific regions of memory, hardening the browser against that sort of attack.

The cover letter notes that mseal() is similar to mimmutable(), which was added recently to OpenBSD. The prototype for the proposed system call is quite different from mimmutable(), though:

    int mseal(void *addr, size_t len, unsigned int types, unsigned int flags);

The range of memory to be affected is indicated by addr and len. The flags must be passed as zero, and types controls which system calls will be blocked on that address range:

MM_SEAL_MPROTECT: mprotect() and pkey_mprotect()
MM_SEAL_MMAP: mmap()
MM_SEAL_MUNMAP: munmap()
MM_SEAL_MREMAP: mremap()
MM_SEAL_MSEAL: future mseal() calls

Linus Torvalds was quick to object to the patch series, saying: "I have no objections to adding some kind of 'lock down memory mappings' model, but this isn't it". He had a number of complaints about the details of the implementation, but he later made it clear the design of the system call was wrong. Blocking munmap(), for example, makes little sense if other operations that can result in the unmapping of addresses (mmap() and mremap(), for example), are still allowed. The effort that was put into only blocking operations from specific system calls, he said, was overtly wrong; if unmapping a range of memory (for example) is blocked, it must be blocked from all directions or the protection provided will be illusory.
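
Torvalds's point about mmap() can be observed with existing system calls. What follows is a minimal sketch, not something taken from the patch discussion; the helper name mapping_replaced_by_mmap() is made up for illustration. It maps an anonymous page, stores a byte in it, then calls mmap() with MAP_FIXED at the same address, so the old page is discarded without munmap() ever being called:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if an mmap(MAP_FIXED) call replaced an existing mapping
 * (munmap() is never called on it), 0 if the old contents survived,
 * and -1 if the setup failed.
 */
int mapping_replaced_by_mmap(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    /* Original mapping: one anonymous page holding a marker byte. */
    char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED)
        return -1;
    old[0] = 'A';

    /* MAP_FIXED discards whatever was mapped at the requested address. */
    char *fresh = mmap(old, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (fresh == MAP_FAILED) {
        munmap(old, len);
        return -1;
    }
    assert(fresh == old);            /* same address as before ...      */

    int replaced = (fresh[0] == 0);  /* ... but a new, zero-filled page */
    munmap(fresh, len);
    return replaced;
}
```

A seal that only intercepted munmap() would have nothing to say about this sequence, which is why the protection has to be attached to the address range rather than to a particular system call.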

Matthew Wilcox questioned the complexity of the interface, suggesting instead that a couple of flags added to mprotect() would suffice. A memory region, he said, should either be immutable (with the possible option of further reducing access) or not, without regard to which system call was used. He later added:

This is the seccomp disease, only worse because you're trying to deny individual syscalls instead of building up a list of permitted syscalls. If we introduce a new syscall tomorrow which can affect VMAs, we'll then make it the application's fault for not disabling that new syscall. That's terrible design!
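
The difference Wilcox describes can be modeled in a few lines of user-space C. This is only a toy model under stated assumptions: the bit values given to the MM_SEAL_* names are invented (the article does not list the real ones), and OP_FUTURE is a hypothetical stand-in for a system call added after the application was written.

```c
#include <assert.h>
#include <stdbool.h>

/* Operations that can modify a mapping. OP_FUTURE is hypothetical: it
 * stands in for a system call added after the application shipped. */
enum vma_op { OP_MPROTECT, OP_MMAP, OP_MUNMAP, OP_MREMAP, OP_FUTURE };

/* Assumed bit values, chosen only for this sketch. */
#define MM_SEAL_MPROTECT (1u << 0)
#define MM_SEAL_MMAP     (1u << 1)
#define MM_SEAL_MUNMAP   (1u << 2)
#define MM_SEAL_MREMAP   (1u << 3)

/* Deny-list model: an operation is blocked only if its bit was set. */
bool denylist_blocks(unsigned int types, enum vma_op op)
{
    switch (op) {
    case OP_MPROTECT: return types & MM_SEAL_MPROTECT;
    case OP_MMAP:     return types & MM_SEAL_MMAP;
    case OP_MUNMAP:   return types & MM_SEAL_MUNMAP;
    case OP_MREMAP:   return types & MM_SEAL_MREMAP;
    default:          return false;  /* unknown call: no bit to match */
    }
}

/* Single-property model: an immutable range rejects every modification,
 * whichever system call asks for it. */
bool immutable_blocks(bool immutable, enum vma_op op)
{
    (void)op;
    return immutable;
}
```

With every known bit set, denylist_blocks() still returns false for OP_FUTURE, while immutable_blocks(true, OP_FUTURE) returns true; the second model fails closed when the set of system calls grows.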

The conversation even brought about a rare appearance on linux-kernel by OpenBSD maintainer Theo de Raadt, who agreed with Torvalds and suggested that Linux should simply add mimmutable() rather than reinventing that functionality in a more complex form. Torvalds was amenable to this idea, though he suggested adding a flags argument for future changes — an idea that de Raadt disliked. That reflects the fact that OpenBSD controls its user space, so it can add a flags parameter later if the need arises; Linux has no such luxury, so that parameter must be present from the beginning if it is to exist at all.

Xu, instead, resisted the idea, prompting a typical (if relatively mild) de Raadt response. Indeed, Xu continued to cling to his proposed design despite the comments he had received, leading to a somewhat exasperated post from Wilcox, who tried to direct the conversation toward what the patch series is actually trying to accomplish:

Let's start with the purpose. The point of mimmutable/mseal/whatever is to fix the mapping of an address range to its underlying object, be it a particular file mapping or anonymous memory. After the call succeeds, it must not be possible to make any address in that virtual range point into any other object.
The secondary purpose is to lock down permissions on that range. Possibly to fix them where they are, possibly to allow RW->RO transitions.
With those purposes in mind, you should be able to deduce for any syscall or any madvise(), ... whether it should be allowed.
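
The two permission policies Wilcox mentions can be written down as simple predicates. This is a sketch of the idea only, not kernel code; the function names are invented for illustration, and the PROT_* constants come from <sys/mman.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

#define PROT_MASK (PROT_READ | PROT_WRITE | PROT_EXEC)

/* Policy 1: permissions are fixed where they are. */
bool prot_change_allowed_fixed(int old_prot, int new_prot)
{
    assert((old_prot & ~PROT_MASK) == 0 && (new_prot & ~PROT_MASK) == 0);
    return new_prot == old_prot;
}

/* Policy 2: permissions may only be reduced (e.g. RW -> RO), so the
 * new value must not contain any bit that the old value lacks. */
bool prot_change_allowed_reduce(int old_prot, int new_prot)
{
    assert((old_prot & ~PROT_MASK) == 0 && (new_prot & ~PROT_MASK) == 0);
    return (new_prot & ~old_prot) == 0;
}
```

Under the second policy a change from PROT_READ | PROT_WRITE to PROT_READ passes, while PROT_READ to PROT_READ | PROT_EXEC does not.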

Xu, Wilcox concluded, needed to do a better job of listening to the developers who were trying to help him.

At this point, it is clear that mseal() will not enter the kernel in anything like its current form. That leads to the question of what should be done instead. Röttger jumped into the conversation to point out that a pure mimmutable() solution does not do everything that the Chrome developers would like to see; they have cases where they want to prevent unmapping, but still need to be able to change memory protections with mprotect(). De Raadt described that case as "partial sealing", which, in his view, means that the memory in question is not actually protected.
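
To make the Chrome case concrete, here is a rough sketch of the pattern a JIT follows: the buffer must stay mapped at its address for its whole lifetime, yet its protection is flipped with mprotect() each time code is emitted. The function names are invented for this example and nothing here is Chrome code. A real JIT would lock the buffer with PROT_READ | PROT_EXEC; the protection is a parameter so that the sketch can be exercised with PROT_READ alone.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one writable anonymous page to stand in for a JIT code buffer. */
void *jit_buffer_create(size_t *len_out)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    *len_out = len;
    return buf;
}

/*
 * Make the buffer writable, copy n bytes of "code" into it, then drop
 * write access again by applying locked_prot. Returns 0 on success.
 */
int jit_buffer_emit(void *buf, size_t len, const void *code, size_t n,
                    int locked_prot)
{
    assert(n <= len);
    if (mprotect(buf, len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    memcpy(buf, code, n);
    return mprotect(buf, len, locked_prot);
}
```

A seal that froze permissions entirely would make the second call to jit_buffer_emit() fail, which is the gap Röttger pointed out.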

There will presumably be some sort of follow-up proposal that maintains that capability while removing the more complex options provided by mseal(). But whether that proposal will be mimmutable() or some variant thereof remains to be seen.

One can point to a number of things that went wrong here. The original proposal was seen by many as an implementation of what the Chrome developers said they wanted without looking deeply at what the real requirements (for Chrome and any other potential users) are. Google has no shortage of experienced developers who could have reviewed this submission before it was posted publicly, but that does not appear to have happened, with the result that a relatively inexperienced developer was put into a difficult position. Feedback on the proposal was resisted rather than listened to. The result was an interaction that pleased nobody.

Despite all of that, there is a use case here that everybody involved appears to see as legitimate. So it is just a matter of finding the right solution to the problem, and hopefully that problem is better understood now. If the next attempt looks a lot more like mimmutable() and reflects the feedback that has been given, the kernel may yet get the sealing capability that addresses the Chrome use case and provides for wider user-space hardening as well.


Posted Oct 20, 2023 19:48 UTC (Fri) by tux3 (subscriber, #101245) [Link] (2 responses)

The mention of glibc setting executable stack during dlopen() was a surprise to me. I'd vaguely heard of GNUSTACK as a historical kerfuffle that I thought had been taken care of.

Is there background one might want to read on why this behavior hasn't been able to be put to rest yet?

Posted Oct 22, 2023 12:24 UTC (Sun) by dvdeug (subscriber, #10998) [Link]

I can't find a great source for it from my cell phone, but GNAT (the Ada compiler) still uses an executable stack for nested function pointers.

Posted Oct 26, 2023 7:02 UTC (Thu) by koomi (guest, #165546) [Link]

That wormhole to the stone age has been put to great use in the recent ssh-agent forwarding exploit.

I guess it would break too much stuff to disable it entirely, but maybe there are some knobs to twiddle?

Posted Oct 20, 2023 22:00 UTC (Fri) by rywang014 (subscriber, #167182) [Link] (1 responses)

So an extra flag in mprotect() can do all the work, right?

Posted Oct 21, 2023 3:29 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Yes. That's essentially what the mimmutable() syscall is.

Posted Oct 21, 2023 14:02 UTC (Sat) by mtodorov (guest, #158788) [Link] (7 responses)

Just curious, do you think that the mseal() or immutable() syscall would break libc live patching by TuxCare?

Thanks.

Posted Oct 21, 2023 14:07 UTC (Sat) by mtodorov (guest, #158788) [Link]

To correct myself, this is obviously mseal() and mimmutable() breaking live libc patching by TuxCare.

But while trying to fix the typo, the spellchecker tried to be smarter than me again and automatically changed mimmutable() to immutable() :-)

Posted Oct 21, 2023 16:31 UTC (Sat) by willy (subscriber, #9762) [Link] (5 responses)

That depends on how it works. If it works by patching individual running processes, e.g. with ptrace(POKETEXT), then it will be denied. If it patches the on-disk binary then there's a chance it'll work. It probably does the former.

Possibly more relevant to people is that mimmutable() will prevent inserting breakpoints. I believe OpenBSD copes with this by disabling mimmutable() when running under a debugger.

Posted Oct 22, 2023 7:38 UTC (Sun) by roc (subscriber, #30627) [Link]

It will break uprobes too.

In rr we'll be able to work around any mimmutable-style API by emulating it --- intercept the mimmutable calls, don't send them to the kernel, and apply the semantics ourselves when the application makes mmap/munmap/etc calls. This would actually be better than "disabling mimmutable() when running under a debugger" because you might want to debug bugs involving use of mimmutable()...

Posted Oct 22, 2023 8:10 UTC (Sun) by mtodorov (guest, #158788) [Link] (3 responses)

It seems kind of awkward that OpenBSD leaves the unmapped pages in the range mimmutable(addr, len) in a state one cannot account for (essentially undefined); that effectively makes them lost to the process (an especially tough issue for daemons that are supposed to run for months or years without a restart).

I am not an expert on this, but it seems prudent to be able to replace an in-memory copy of glibc with a patched, signed version on-the-fly. I don't know how the TuxCare guys do it in their KernelCare+ libpatch service we're testing because the technology is state-of-the-art and proprietary.

I can't really see what patching on-disk binary semantics would be? What is supposed to happen with the mimmutable() protected page when the page in the underlying mapped library changes on-disk?

Posted Oct 22, 2023 14:50 UTC (Sun) by WolfWings (subscriber, #56790) [Link]

mimmutable would only prevent changes to the memory region configuration, it doesn't (by itself) make the memory region read-only.

So LongRunningDaemon could have library.so mmapped RO, and FancyPatcherTool could map library.so RW, and modifications made by FancyPatcherTool would show up in LongRunningDaemon's mmap; it just couldn't make such modifications itself. There are exceptions, here be dragons, etc, etc.

Posted Oct 22, 2023 16:15 UTC (Sun) by willy (subscriber, #9762) [Link] (1 responses)

Ideally, daemons would not run for years without a restart. They would respond to SIGHUP by serializing their state to storage and exec() themselves, picking up new libraries automatically. Ensuring that the new daemon actually starts correctly is left as an exercise to the programmer. (What do you mean, somebody edited the config file three months ago, introducing a syntax error, and didn't run the checker first?!)

mimmutable() simply makes this address range's mapping to the object unchangeable. It does nothing to make the object underneath unchangeable. That's left to other mechanisms (e.g. libfoo.so should have r-x file permissions!)

So a sufficiently privileged process (like, say, a binary patching program) can modify the contents of both the file and the page cache. mimmutable is simply a piece of the puzzle, it isn't everything.

Of course, this relies on the process having not COWed the page cache. If it had mapped the file MAP_PRIVATE and stored to it, it would have its own copy of the previous contents of that page, plus its own modifications, and changes to the underlying file would not be reflected in its address space. This is quite unlikely with ELF (unless you insert a uprobe or breakpoint), but IIRC it was fairly common with a.out (and was a motivation for switching to ELF)
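
The page-cache behavior described here can be observed directly on Linux. The following is a small sketch under stated assumptions: the function name and the scratch file under /tmp are invented for illustration, and POSIX leaves it unspecified whether file changes show up in an untouched MAP_PRIVATE mapping, so this reflects Linux behavior only. The "patcher" role is played by pwrite() on the same file.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fills seen[] with what a MAP_PRIVATE mapping observes:
 *   seen[0]: first byte after the file changes, before any store to
 *            the mapping (page still shared with the page cache);
 *   seen[1]: first byte after a store to the mapping (page COWed)
 *            followed by another change to the file.
 * Returns 0 on success, -1 if the setup failed.
 */
int private_mapping_view(char seen[2])
{
    char path[] = "/tmp/cow-demo-XXXXXX";   /* hypothetical scratch file */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int rc = -1;

    assert(seen != NULL);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    if (ftruncate(fd, (off_t)page) != 0 || pwrite(fd, "A", 1, 0) != 1) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    if (pwrite(fd, "B", 1, 0) == 1) {       /* "patcher" edits the file  */
        seen[0] = map[0];
        map[1] = 'x';                        /* store: page is now private */
        if (pwrite(fd, "C", 1, 0) == 1) {   /* second edit to the file   */
            seen[1] = map[0];
            rc = 0;
        }
    }

    munmap(map, page);
    close(fd);
    return rc;
}
```

On Linux the first edit is visible through the mapping and the second is not, because the store in between gave the process its own copy of the page.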

Posted Nov 2, 2023 13:44 UTC (Thu) by sammythesnake (guest, #17693) [Link]

If the COW mechanism isn't reciprocal* then you could end up with some *really* nasty behaviour if a process has made changes to one page that are incompatible with something the patching changed in another page - every obscure bug you can think of is possible in that scenario...

* I.e. if the changes made by the patching program to pages the COW mapped program didn't change are seen by that program, but pages changed via COW remain in place without being overwritten by the patcher's version - does creating a COW mapping imply all other mappings are thereby made COW, too? If the COW-ness *is* reciprocal, then of course that means that any patching is invisible/ineffective for any memory that's been COW mapped, which doesn't seem optimal, either...

Posted Oct 22, 2023 11:56 UTC (Sun) by summentier (subscriber, #100638) [Link] (5 responses)

I'm intrigued by Theo de Raadt's offhand comment from his cited email:

> I am pretty sure Linux will never get as far as we got. Even our main stacks are marked immutable, but in Linux that would conflict with glibc ld.so mprotecting RWX the stack if you dlopen() a shared library with GNUSTACK, a very bad idea which needs a different fight...

I am afraid I don't know enough about memory management to understand this point. Does anybody care to elaborate? Is it really a fundamental (architectural) flaw in Linux?

Posted Oct 22, 2023 12:52 UTC (Sun) by randomguy3 (subscriber, #71063) [Link] (2 responses)

https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart has some info about GNU_STACK - it's an ELF header that indicates whether the code in the file requires the stack to be executable (e.g. the compiler might make use of an executable stack to implement nested functions). This presumably allows the loader to decide whether to set the no-execute bit on the stack when an executable is run (by looking at the value of this header for the executable and any libraries it links to).

I would assume de Raadt's comment means that if you dlopen() a library that has GNU_STACK indicating an executable stack is required, the stack's no-execution protection will be removed (to allow you to actually use the library you just loaded). If you've made the protection bits immutable, this is no longer possible.
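
For the curious, a process can inspect these headers for everything it currently has loaded. The sketch below is an illustration rather than anything ld.so itself does: it uses the glibc dl_iterate_phdr() interface to walk the program headers of each loaded object and counts those whose PT_GNU_STACK entry has the PF_X flag set. The struct and function names are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <elf.h>
#include <link.h>
#include <stddef.h>

struct stack_report {
    int objects;          /* loaded ELF objects visited            */
    int with_header;      /* objects carrying a PT_GNU_STACK entry */
    int want_exec_stack;  /* objects whose PT_GNU_STACK sets PF_X  */
};

/* Callback invoked by dl_iterate_phdr() once per loaded object. */
static int check_object(struct dl_phdr_info *info, size_t size, void *data)
{
    struct stack_report *r = data;
    (void)size;
    assert(r != NULL);

    r->objects++;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        if (info->dlpi_phdr[i].p_type != PT_GNU_STACK)
            continue;
        r->with_header++;
        if (info->dlpi_phdr[i].p_flags & PF_X)
            r->want_exec_stack++;
    }
    return 0;   /* zero means "keep iterating" */
}

/* Scan the main program and every shared object loaded so far. */
struct stack_report scan_loaded_objects(void)
{
    struct stack_report r = {0, 0, 0};
    dl_iterate_phdr(check_object, &r);
    return r;
}
```

Calling scan_loaded_objects() before and after a dlopen() would show whether the newly loaded library is one that asks for an executable stack.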

Posted Oct 24, 2023 3:37 UTC (Tue) by calumapplepie (guest, #143655) [Link] (1 responses)

How about we add a new flag on binaries, GNU_STACK_FORBID, which says "I solemnly swear that this program won't open anything with a GNU_STACK". Distributions can globally enable the flag on all their binaries, allowing ld.so to apply immutability to the stack.

Similar things can be done for other OpenBSD protections. Stuff like RAW_SYSCALL_FORBID (to block syscalls outside of libc), or a new section that declares what sections of the desired memory map are the stack to allow for the pointer protection. For most programs, the compiler knows where the stack is, and for the exceptions, distributors can remove the flags from binaries. Existing programs would obviously lack the flags, and just keep working.

Posted Oct 24, 2023 11:54 UTC (Tue) by dullfire (guest, #111432) [Link]

So a note: the presence of a GNU_STACK header allows ld.so to possibly NOT mark the stack as executable. If the header is missing, ld.so will be forced to mark the stack as executable. So forbidding loading anything will definitely not achieve the desired results.

Additionally, certain gcc nested functions require stack trampolines (and thus an executable stack). I don't know about the prevalence of this feature, especially in shared libs. However, a "GNU_STACK_FORBID" would have to be a compile-time opt-in thing (since you could not know if it is actually needed until runtime).

Finally, the compiler will have no clue where the stack is. Address space layout randomization means it could be basically anywhere. Additionally, there's[1] 1 stack per thread, but the thread number is runtime determined. So even if ASLR was disabled, and GNU_STACK (or another mechanism added) was extended to allow specifying the stack address (AFAIK it does not currently allow this), only single threaded programs would have "compile time" knowledge of the stack address.

That said: a GNU_STACK_MASK might be a good idea. Basically a mask of disallowed permissions on the stack. Any attempt to dlopen() a library that would violate the mask could return an error, and a load-time ld.so attempt to violate a program's mask could just throw an error and exit before main().

[1] Actually there are good reasons to have multiple stacks per thread. One of them is a SIGSEGV handler, which requires a separate stack. Another, though rare, is any attempt to use any form of userland threading.

Posted Oct 24, 2023 18:31 UTC (Tue) by ballombe (subscriber, #9523) [Link] (1 responses)

> Is it really a fundamental (architectural) flaw in Linux?

It is not related to Linux.

gcc implements an extension to C called nested functions, which requires the use of a trampoline when the address of a nested function is needed (this is an instance of the funarg problem). The use of a trampoline requires the stack to be executable.

The use of nested functions in GNU C is rare since they are not part of the C standard and hence not idiomatic. However, the mechanism is used to implement nested functions in other languages which have them (e.g. Algol and Pascal derivatives).

GNU_STACK provides a way to mark the stack executable only when this feature (address of nested function) actually requires it.

If you want to fix that, you need either to find a way to implement trampolines with a non-executable stack or to remove that feature entirely.

Now, the relevance to this issue is that if your code does not use that feature but uses dlopen(), dlopen() can potentially load a module that does use it, and in that case dlopen() will have to make the stack executable.
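
For hand-written C, one familiar way to avoid needing the address of a nested function at all is to pass the captured state explicitly alongside an ordinary function pointer. The sketch below illustrates that convention only; it is not how gcc implements nested functions, and all of the names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* A callback plus the state a nested function would have captured. */
typedef int (*visit_fn)(int value, void *ctx);

/* Calls fn(value, ctx) for each element; stops on a nonzero return. */
void for_each(const int *values, size_t n, visit_fn fn, void *ctx)
{
    assert(fn != NULL);
    for (size_t i = 0; i < n; i++)
        if (fn(values[i], ctx))
            break;
}

/* State that a nested function would have read from the enclosing frame. */
struct sum_ctx {
    long total;
};

static int add_to_sum(int value, void *ctx)
{
    struct sum_ctx *s = ctx;
    s->total += value;
    return 0;
}

/* The "enclosing" function: its local state travels via the ctx pointer,
 * so no code has to be generated on the stack. */
long sum_values(const int *values, size_t n)
{
    struct sum_ctx s = { 0 };
    for_each(values, n, add_to_sum, &s);
    return s.total;
}
```

The cost is that every interface taking a callback must also take a context argument, which is exactly what a trampoline lets existing function-pointer interfaces avoid.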

Posted Oct 25, 2023 10:30 UTC (Wed) by geert (subscriber, #98403) [Link]

https://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html

Posted Nov 2, 2023 6:56 UTC (Thu) by dxin (guest, #136611) [Link]

I think the Linux people are also misunderstanding the purpose of this feature. A "partial seal" is meaningless against attackers but is "partially" meaningful as a defense against bugs. Limiting the damage that could be done by an attacker is never the top concern of an individual dev; it's the damage done by bugs that they want to minimize, even if just partially, because that's what drives them crazy every day.