mseal() gets closer

By Jonathan Corbet
January 19, 2024

The proposed mseal() system call stirred up some controversy when it was first posted in October 2023. Since then, it has been evolving in a quieter fashion, and seems to have reached a point where the relevant commenters are willing to accept it. Should mseal() be merged in a future development cycle, it will look rather different than it did at the outset.

As a reminder, mseal() was created as a way of preventing changes to portions of the virtual address space. It is meant to thwart attacks that depend on changing memory that is supposed to be read-only or otherwise messing with a process's idea of how its memory is laid out. An attacker who can change memory permissions or mappings may, for example, be able to circumvent control-flow-integrity protections. By using mseal(), a process can prevent changes of that type from being made. The initial user is expected to be the Chrome browser, where it will be used to further harden the program against memory-based attacks.

mseal(), as proposed in October, had this prototype:

    int mseal(void *addr, size_t len, unsigned int types, unsigned int flags);

The types parameter, which allowed the caller to fine-tune the changes that mseal() would prohibit, was one of the more controversial features, with a number of people questioning why anything other than an outright ban on changes would be useful. Even so, version 2 (posted shortly after the first version) and version 3 (posted in mid-December) retained that parameter. In response to the latter posting, Linus Torvalds reiterated his dislike of that aspect of the API, and asked: "I want to know why we don't just do the BSD immutable thing, and why we need this multi-level sealing thing".

Chrome developer Stephen Röttger answered that Chrome needed the ability to allow madvise(MADV_DONTNEED) in specific places where the region could otherwise be sealed. This operation was forbidden in sealed memory because it is essentially a mapping change; it discards the underlying memory, which (for anonymous pages) will be refilled with zeroes if it is accessed again. It is useful for discarding unneeded cached data, for example, but it also has the potential to create surprises. In the Chrome case, the types argument was used to allow MADV_DONTNEED on writable anonymous memory, that is, memory that the process has the ability to write directly even when it is sealed. Torvalds replied that the proper solution was for mseal() to only allow MADV_DONTNEED if the mapping in question is writable. Indeed, he thought that restriction might make sense even in the absence of sealing.
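The zero-refill behavior described above can be observed without mseal() at all. The following is a minimal sketch, assuming a Linux system and a private anonymous mapping; the helper name byte_after_dontneed() is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one private anonymous page, dirty it, discard
 * it with MADV_DONTNEED, and return the first byte as read back afterward.
 * Returns -1 if any of the setup steps fails.
 */
int byte_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Fill the page with a recognizable pattern. */
    memset(p, 0xAB, len);
    assert(p[0] == 0xAB);

    /* Tell the kernel the contents are no longer needed. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The next access faults in a fresh, zero-filled page. */
    int value = p[0];
    munmap(p, len);
    return value;
}
```

The mapping never goes away, but its contents do, which is why this operation on read-only sealed memory would amount to modifying data that was supposed to be immutable.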

As a result of that discussion, version 4, posted in early January, implemented the new semantics with regard to MADV_DONTNEED. This version also finally dropped the types parameter; memory is now either sealed or not. Torvalds was satisfied by the changes; he declared that "this seems all reasonable to me now" and withdrew from the discussion. The fifth version brought only small changes, suggesting that the major concerns have been addressed; Kees Cook noted that "this code is looking to land". Since then, version 6 was posted with a few more small changes.

If mseal() is merged in this form, its prototype will be:

    int mseal(void *addr, size_t len, unsigned long flags);

The addr and len arguments describe the range of memory to be sealed; the flags argument is currently unused and must be zero. The system call will only be available on 64-bit systems. This documentation patch contains more information about its use.
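As a rough sketch of what a caller might look like, the code below invokes the system call through syscall(2), since no C-library wrapper is assumed to exist. The helper names seal_range() and seal_blocks_mprotect() are hypothetical, and the fallback number 462 is an assumption based on the number the system call was eventually assigned; the code treats any failure of mseal() (for example ENOSYS on a kernel without it) as "not available".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal() is system call 462 if the headers do not say so. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Hypothetical helper: returns 0 on success or a negative errno value. */
int seal_range(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    /* The flags argument is unused and must be zero. */
    return syscall(__NR_mseal, addr, len, 0UL) == 0 ? 0 : -errno;
}

/*
 * Map a page, seal it, then try to change its protection.
 * Returns 1 if the seal blocked mprotect() with EPERM, 0 if mseal() is
 * not available on this kernel, and -1 on any unexpected result.
 */
int seal_blocks_mprotect(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (seal_range(p, len) != 0) {
        munmap(p, len);          /* not sealed, so this is still allowed */
        return 0;
    }

    /* From here on the page cannot be unmapped; it is leaked on purpose. */
    if (mprotect(p, len, PROT_READ) == 0)
        return -1;               /* the seal should have refused this */
    return errno == EPERM ? 1 : -1;
}
```

A caller would check for a return value of 1 to confirm that sealing is both supported and effective on the running kernel.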

So this story may have run its course, but there is still one aspect of it that has been somewhat swept under the rug. OpenBSD has a similar system call, mimmutable(), that has been in place since 2022. It, too, prevents modifications to a specific range of the address space. Over the course of the conversation, simply implementing mimmutable() for Linux has been suggested a number of times. Jeff Xu, the developer of mseal(), has always shrugged off that suggestion, to the point that Theo de Raadt, the creator of mimmutable(), suggested that "maybe this proposal should be using the name chromesyscall() instead". It seems that implementing mimmutable() for Linux has never been seriously considered.

As mseal() has gotten simpler, though, the features that differentiated it from mimmutable() have melted away, to the point that they do almost the same thing. About the only difference is that mimmutable() allows downgrading permissions (setting memory read-only even if it has been sealed), while mseal() does not; OpenBSD may yet remove that feature, though, further reducing the difference in semantics between the two system calls. Given that, it may be worth asking, one more time, why Linux doesn't just adopt the existing interface and add mimmutable(). It is not a question that has been directly addressed.

Possible answers do exist. mseal() carries the flags parameter that long experience says is a good idea, even if the immediate need for it is not apparent. It may also be that the use of this system call will always be so specialized and low-level that any code using it will need to be system-specific in any case, in which case there may be little value in using the same name. Finally, adding an mimmutable() wrapper around mseal() in the C library would be an almost trivial undertaking if it were deemed worthwhile.
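To give a sense of how small such a wrapper could be, here is a sketch that follows the two-argument OpenBSD prototype and simply passes zero for the flags. It is not taken from any C library; the fallback system-call number 462 is an assumption, and the remaining semantic difference (OpenBSD's permission downgrading) is not emulated.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal() is system call 462 if the headers do not say so. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/*
 * Hypothetical compatibility wrapper with the OpenBSD-style signature.
 * Returns 0 on success, or -1 with errno set (ENOSYS on kernels that
 * lack mseal()).
 */
int mimmutable(void *addr, size_t len)
{
    return (int)syscall(__NR_mseal, addr, len, 0UL);
}
```

Code written against the OpenBSD interface could then be compiled unchanged, as long as it does not rely on reducing permissions after sealing.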

If and when mseal() is merged, it will initially only benefit the Chrome browser (and its small band of users). As the mseal() cover letter points out, though, Röttger is working on adding support to the GNU C Library so that most programs would be able to run with a fair amount of sealing automatically applied. That would greatly increase the use of this new system call, and the ability to use it in the C library would increase confidence that the API is correct. That seems likely to truly seal the deal.

Index entries for this article
Kernel	Releases/6.10
Kernel	Security/System calls

Naming things - mimmutable vs mseal

Posted Jan 20, 2024 8:12 UTC (Sat) by fredrik (subscriber, #232) [Link] (2 responses)

Naming things is hard, but I'm curious how it works in this case, and wonder if someone can shed some light on it?

AFAIU the mseal API has now evolved to become semantically very similar to mimmutable in OpenBSD, right? Especially if OpenBSD were to remove the ability to reduce the permissions on a sealed region, and we ignore the currently unused flags argument for mseal.

So, assuming they are semantically the same at this point, is there any advantage to change the name of mseal to mimmutable now?

AFAIU, once a new user space API is committed to mainline and released, the name and semantics are more or less set in stone due to Linus' pledge to kernel API stability, right? I.e., renaming it is only possible before it goes in.

Which leaves my question, does it make sense to do so from some point of view, say technically or to indicate to users that they do the same thing?

I can imagine that one argument against adopting the OpenBSD name would be that if the semantics of the API in Linux were to diverge from OpenBSD at some point in the future, retaining the original name from OpenBSD would muddy the waters at that point. OTOH, userspace APIs are expected to be semantically stable too, so perhaps that isn't possible anyway.

What other pros and cons are there to having the same or different names, assuming the API is semantically equal?

Now, it may very well be that the authors of mseal think that their name simply is a better choice to convey its purpose. And I assume that no libc standards committee was involved in picking the name mimmutable for OpenBSD. Basically the only reason mimmutable would have precedence over mseal is that OpenBSD picked that name before any implementation appeared in the Linux kernel.

PS. I really do not intend to start a heated bikeshedding debate with these questions. So if anyone else feels an irresistible urge to do so, please raise another thread for that. Thanks! And do consider that the editors of LWN probably would be very happy if you abstained entirely. :)

Naming things - mimmutable vs mseal

Posted Jan 20, 2024 15:58 UTC (Sat) by Karellen (subscriber, #67644) [Link] (1 responses)

Is it worth considering that whatever libc API is used to wrap the syscall does not necessarily need to use the same name as the syscall? Or, libc could present multiple wrappers for the same syscall, including adding a BSD-like mimmutable() that calls mseal() under the hood? (e.g. AIUI glibc open() actually calls openat(2) on Linux these days, ignoring Linux's native open(2) syscall altogether.)

Of course, glibc devs may be opposed to this for any number of perfectly valid reasons, but that doesn't mean the option isn't there. Or, if not glibc, then another libc might want to.
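The open()/openat() relationship mentioned above can be sketched in a few lines. This is an illustration of the idea rather than glibc's actual code; the function name open_via_openat() is made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: offers open()-style arguments but issues the
 * openat system call, with AT_FDCWD making relative paths resolve
 * against the current working directory just as open() would.
 */
int open_via_openat(const char *path, int flags, mode_t mode)
{
    return (int)syscall(SYS_openat, AT_FDCWD, path, flags, mode);
}
```

The caller sees one name while the kernel sees another, which is the same separation a mimmutable() wrapper over mseal() would rely on.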

Naming things - mimmutable vs mseal

Posted Jan 20, 2024 20:03 UTC (Sat) by dezgeg (guest, #92243) [Link]

setuid() is one good example where the Linux syscall has different semantics than the libc setuid(). The syscall version only applies to the current thread which is not POSIX compatible. It's then libc which does setuid() for each thread of the process to match POSIX.

mseal() gets closer

Posted Jan 20, 2024 20:52 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

About naming, I actually started looking up WTF is "mimmu table" when I read the article about mseal/mimmutable several months ago. Can we settle on mseal(), please?

mseal() gets closer

Posted Jan 24, 2024 5:43 UTC (Wed) by amarao (guest, #87073) [Link]

But what does it have to do with Microsoft eals?

If you want to misread, you will.

mseal_all()

Posted Jan 21, 2024 11:02 UTC (Sun) by itsmycpu (guest, #139639) [Link] (11 responses)

Does a simple app, which plans to do not much else than to allocate more memory via malloc-like functions, have something worth sealing?

I'd guess a C lib can't just seal everything by default, so I wonder if a "mseal_all()" function would make sense that has no parameters at all.
It could be used by those who wouldn't know much about how to use mseal() and its parameters.

(Of course in addition to mseal(). )

mseal_all()

Posted Jan 21, 2024 22:58 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (10 responses)

I'm not certain what you propose for mseal_all to do, but there are not a whole lot of options that make sense to me:

1. mseal the process's whole address space. But then you can't create any new memory mappings, because they would have to fall somewhere within the process's address space, and the whole address space is sealed. You also can't modify the data in any existing memory mapping, so your process is pretty much not allowed to touch anything that is not a register. I guess you can move the instruction pointer around (with nops or branch instructions?), but not much else.
2. mseal all parts of the address space that have mappings. But then you probably break half of libc, because libc has significant amounts of live data it expects to be able to update. Unfortunately, that includes libc malloc, which has free lists, block metadata, etc. that it has to update when you malloc. It also includes a significant amount of unallocated heap memory that malloc currently believes it has the right to hand out (without having to call sbrk or mmap), so even if malloc could somehow return a block of memory, that block might already be sealed. The only way around this is to mmap your own private memory arena and then use a third-party malloc that can be confined to that arena. But even then, you've still broken things like fread/fwrite (which have userspace buffers), and probably lots of other stuff.
3. A library function in libc, that seals all parts of the address space that have mappings, except for anything that would cause libc to break. We can subdivide this into two kinds of memory: Memory that is owned by libc (and that libc is OK with sealing), and memory that is not owned by libc. The former can be sealed by libc automatically for all processes during startup, and so we don't need a function to do it. The latter is potentially dangerous, because if you don't own a given piece of memory, you have no way of knowing whether it is safe to seal (so instead of breaking libc, you'd just end up breaking some other library instead).

mseal_all()

Posted Jan 21, 2024 23:10 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (9 responses)

Correction: mseal does not prevent you from writing to the memory, so actually this is less of a problem than I thought. Still not sure it's workable, however, because nearly any nontrivial malloc implementation will eventually want to call sbrk or mmap, and that would be prevented by sealing the whole address space. With cooperation from libc, some kind of more restricted sealing might be possible, however.
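A small experiment along these lines is sketched below: it seals a single writable page, writes to it, tries to unmap it, and then creates an unrelated mapping. The function name sealed_page_behavior() is made up, the fallback system-call number 462 is an assumption, and the expected EPERM result reflects the semantics described in the article rather than anything guaranteed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal() is system call 462 if the headers do not say so. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/*
 * Returns 1 if a sealed page stays writable, refuses munmap() with EPERM,
 * and does not stop new mappings elsewhere; 0 if mseal() is unavailable;
 * -1 on any unexpected result.
 */
int sealed_page_behavior(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (syscall(__NR_mseal, p, len, 0UL) != 0) {
        munmap(p, len);
        return 0;
    }

    /* Sealing fixes the mapping, not the data: stores still work. */
    p[0] = 42;
    if (p[0] != 42)
        return -1;

    /* Removing the mapping is what the seal forbids. */
    if (munmap(p, len) == 0 || errno != EPERM)
        return -1;

    /* Only the sealed range is affected; other mappings can be created. */
    void *q = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (q == MAP_FAILED)
        return -1;
    munmap(q, len);
    return 1;
}
```

The last step is what an all-encompassing seal would lose, since an allocator needs exactly that ability to grow.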

mseal_all()

Posted Jan 22, 2024 0:44 UTC (Mon) by itsmycpu (guest, #139639) [Link] (8 responses)

Yes, my understanding is that mseal() fixes the permissions (like read/write/execute) only.

So the idea is that an app first creates any memory definitions it needs, and then, assuming it arrives at a point where it doesn't want to change or add any more, at that point it calls mseal_all() to prevent any further unwanted or accidental modifications.

This assumes that there is a way to do this without preventing the mere allocation of more memory. If that currently isn't possible, maybe it can be made possible.

I'm not sure if a C lib is in the best position to do this, the kernel might have a better overview of the process's resources, and the kernel might be in a better position to do this securely.

mseal_all()

Posted Jan 22, 2024 4:21 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (7 responses)

> This assumes that there is a way to do this without preventing the mere allocation of more memory. If that currently isn't possible, maybe it can be made possible.

There is no general way to do that. To explain why, I'm going to introduce two made-up terms:

* Some portion of memory is kernel-allocated if it belongs to a valid mapping of some kind. In other words, memory is kernel-allocated if it is possible to dereference a pointer to that memory without segfaulting.
* Some portion of memory is userspace-allocated if it is valid stack memory (as defined by the architecture etc.), if it is statically allocated, or if it has been returned by malloc or some malloc-like function and not subsequently freed. In other words, memory is userspace-allocated if it would be "valid" for the program to actually use the memory for some purpose, without needing to do any malloc-like bookkeeping. The term "valid" is intentionally undefined, because the semantics of malloc and malloc-like functions will depend on the implementation and API, but in general, this is roughly synonymous with C's notion of pointer validity (i.e. you are not generally allowed to just make up your own pointers into the heap and do whatever you like with them).

The basic problem here is that the total amount of kernel-allocated memory is finite. Once the address space is sealed, you cannot add any more mappings, so you cannot kernel-allocate any more memory. Therefore, the only way to create more userspace-allocated memory is to use the kernel-allocated memory you already have, and you will eventually run out.

The other problem is that, in practice, glibc malloc tacitly assumes it can just mmap whatever it wants, whenever it wants. If you malloc a large amount of memory in one call, it will not fiddle around with the existing heap memory, it will just pass the arguments through to mmap, and give you a whole new mapping. This mapping may later be unmapped if you call free (or maybe it isn't, I haven't actually read the source code). It also calls mmap if it detects thread contention, which is probably difficult to predict in real applications. These are theoretically changeable behaviors, but I doubt the glibc people would be happy with the resulting performance regressions.

For these reasons, I think it would have to be a libc service, because only libc has the necessary userspace knowledge to figure out which mappings are safe to seal, which ones might need to be created in the future, and which ones might need to be unmapped in the future. But I think that starts to become redundant to "just automatically mseal everything that libc knows it can safely mseal," and you don't need a function for that, it can just automatically happen at startup.

mseal_all()

Posted Jan 22, 2024 6:13 UTC (Mon) by itsmycpu (guest, #139639) [Link] (6 responses)

> But I think that starts to become redundant to "just automatically mseal everything that libc knows it can safely mseal,"
> and you don't need a function for that, it can just automatically happen at startup.

Well at startup the app might first want to create a few mappings (directly or indirectly) that are unknown to the userspace lib, and maybe not directly known to the app either.

I don't know if you can reliably assume that a userspace lib isn't modified or replaced, in part or whole. I'd probably rather have a kernel function at least for "most", that maybe doesn't prevent additional new mappings, if that wouldn't work. Or perhaps limits future mappings to some degree.

Maybe the userspace lib would have options like "Use mseal() to prevent new heap and/or stack mappings from being made executable", just for example, if that can't be enforced from the kernel side.

mseal_all()

Posted Jan 22, 2024 9:19 UTC (Mon) by itsmycpu (guest, #139639) [Link]

Maybe of interest in this context, this text regarding OpenBSD (linked in a previous article) talks about various ways in which both the kernel and the "shared library linker" could automatically apply seals:

https://lwn.net/Articles/915662/

mseal_all()

Posted Jan 22, 2024 20:52 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (4 responses)

> Well at startup the app might first want to create a few mappings (directly or indirectly) that are unknown to the userspace lib, and maybe not directly known to the app either.
>
> I don't know if you can reliably assume that a userspace lib isn't modified or replaced, in part or whole. [...]

This is precisely my point. The only piece of code that knows whether a given mapping is safe to seal is the piece of code that actually created that mapping. Neither the kernel, nor the application, nor libc can safely seal a mapping that it does not have direct knowledge of.

1. The loader can probably(!) seal things like the .text segment and other very basic "this memory is always statically allocated" segments. If the loader is not prepared to do that, then I suppose libc could probably figure it out.
2. libc can seal mappings that it creates during startup or for other internal purposes, but probably not any mappings involved in malloc (because they might need to be unmapped later when freed). Similarly, it can probably seal mappings created with functions like dlopen(3), but then you can't unmap them when you dlclose(3) them, so maybe that's a bad idea?
3. The application can seal mappings that it creates manually, if desired.
4. libfoo can seal mappings that it creates manually, if it somehow(?) knows that those mappings will never be unmapped or remapped. For most libraries, that seems a bit presumptuous, but I suppose some libraries might explicitly say in their API "this function creates a permanent allocation that cannot be freed, so don't call it in a loop or something, because you'll leak memory."

To the best of my understanding, the kernel is not in a position to distinguish any of these items from each other - it just sees them all as "mappings." So a kernel-side mseal_all() would be very much an all-or-nothing operation, and since that's obviously unworkable (you can't seal random mappings out from under random bits of code without warning them!), it would have to be a userspace function that knows the difference between these mappings and can selectively seal just the mappings that are safe to seal.

(1) and (2) can be done automatically at startup (or when the mapping is created), so mseal_all() doesn't need to touch them (it would be redundant). It might be nice for mseal_all() to do (3), to save the application writer the trouble of calling mseal() repeatedly, but the problem is that you can't reasonably distinguish (3) from (4) at runtime, and it is certainly not safe for an application to seal (4) behind libfoo's back, because...

> Maybe the userspace lib would have options like "Use mseal() to prevent new heap and/or stack mappings from being made executable", just for example, if that can't be enforced from the kernel side.

...the lib would have to refrain from unmapping or re-mapping anything that has been sealed behind its back. Which pretty much means the lib has to be using its own internal malloc-like function (and not libc malloc), and that function has to be designed to never discard or resize a mapping (unlike libc malloc in practice). Of course, you could also have a lib that just uses libc malloc, and never directly creates a mapping itself, and that would be fine if your mseal_all() avoids touching libc-owned mappings. The problem is, what if your lib is itself a custom allocator, but not one that is aware of these funky "don't remap anything" rules? Then you basically can't use mseal_all() when that lib is loaded, or else you will break it. At that point, it's probably cleaner to just tell application writers to manually call mseal() on the specific mappings that are safe to seal.
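The point that mappings carry no record of which piece of code owns them can be seen from userspace by reading /proc/self/maps. The sketch below, which assumes a Linux system with /proc mounted and uses a made-up function name, lists each mapping's address range and permissions; nothing in that output says whether a given anonymous region belongs to malloc, to a library, or to the application.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: count the calling process's mappings, optionally
 * printing "start-end perms" for each one to 'out' (which may be NULL).
 * Returns the number of mappings found, or -1 if /proc/self/maps cannot
 * be opened.
 */
int count_mappings(FILE *out)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    char line[4096];
    int count = 0;
    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        char perms[5];

        /* Each entry begins with "start-end perms", e.g. "...-... r-xp". */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3) {
            if (out != NULL)
                fprintf(out, "%lx-%lx %s\n", start, end, perms);
            count++;
        }
    }

    fclose(maps);
    return count;
}
```

Calling count_mappings(stdout) in a small program typically shows the executable, the heap, shared libraries, and the stack side by side, distinguished only by path names and permission bits.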

mseal_all()

Posted Jan 22, 2024 22:39 UTC (Mon) by itsmycpu (guest, #139639) [Link] (3 responses)

> This is precisely my point. The only piece of code that knows whether a given mapping is safe to seal is
> the piece of code that actually created that mapping. Neither the kernel, nor the application,
> nor libc can safely seal a mapping that it does not have direct knowledge of.

You are surely right in many ways, however I'd like to question this for a simple application that does fancy things only during initialization if at all.
Perhaps, after setting everything up, a simple app can say: From this point on, only simple things should happen:

For example, no existing mappings that are writable should become executable anymore, and no existing mappings that are executable should become writable anymore. Maybe this requires additional features in mseal() or elsewhere. Also, glibc should be able to say: this new mapping should not be changeable to 'executable', but it should remain possible to free it.

In any case, the text I quoted implies that the kernel and the "shared library linker" can automatically seal many mappings, and that would be partial success.

mseal_all()

Posted Jan 22, 2024 23:05 UTC (Mon) by itsmycpu (guest, #139639) [Link] (2 responses)

Or something like this: mseal_all() would mean that all existing mappings perhaps have certain unconditional restrictions, plus the additional restriction that other operations on them can only be performed by code that is now sealed and in a read-only memory area. (This would mean the mappings internally receive a timestamp, and any attempt to change a restricted mapping involves comparing the code's seal-timestamp to the mapping's seal-timestamp.)

mseal_all()

Posted Jan 23, 2024 1:24 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

> This would mean the mappings internally receive a timestamp, and any attempt to change a restricted mapping involves comparing the code's seal-timestamp to the mapping's seal-timestamp.

While I agree that in principle a mechanism like this might prove useful, it is far more complex than the mechanism which is currently proposed, and I'm not sure it would make sense to tie it to this particular API (especially seeing as they just got finished *removing* the concept of different "types" of sealing).

One thing I do feel obligated to point out, as said by the immortal James Mickens[1]: "Gadgets are eternal. There will always be gadgets, there were gadgets before we got here, there'll be gadgets after we're dead." In other words, you really can't say "this was executed by read-only code, therefore it must be non-malicious," because of ROP-style attacks. No matter how many control flow invariants you try to enforce, sooner or later somebody is going to invent another way of fiddling the instruction pointer into a clever position and exploiting code that already exists.

[1]: https://youtu.be/ajGX7odA87k?si=y0eIv5UAtcv28-Zd&t=1874

mseal_all()

Posted Jan 23, 2024 2:00 UTC (Tue) by itsmycpu (guest, #139639) [Link]

> While I agree that in principle a mechanism like this might prove useful, it is far more complex than the mechanism which is currently proposed, and I'm not sure it would make sense to tie it to this particular API (especially seeing as they just got finished *removing* the concept of different "types" of sealing).

Yes, of course this would be a separate step. And thanks, I guess.

> One thing I do feel obligated to point out, as said by the immortal James Mickens[1]: "Gadgets are eternal. There will always be gadgets, there were gadgets before we got here, there'll be gadgets after we're dead." In other words, you really can't say "this was executed by read-only code, therefore it must be non-malicious," because of ROP-style attacks. No matter how many control flow invariants you try to enforce, sooner or later somebody is going to invent another way of fiddling the instruction pointer into a clever position and exploiting code that already exists.

Sure, a high bar which probably can't be reached by any single measure. Any sealing, automatic or explicit, be it from libc, the kernel, the loader, or otherwise, that doesn't require (potentially simple) apps to figure out address ranges, would be another separate step in this sense.

mseal() gets closer

Posted Feb 1, 2024 1:55 UTC (Thu) by kmeyer (subscriber, #50720) [Link] (1 responses)

> Röttger is working on adding support to the GNU C Library so that most programs would be able to run with a fair amount of sealing automatically applied.

Did you link to that somewhere or have more details? I'm curious what glibc would seal by default.

mseal() gets closer

Posted Feb 1, 2024 1:58 UTC (Thu) by kmeyer (subscriber, #50720) [Link]

Maybe https://patchwork.kernel.org/project/linux-mm/cover/20240... :

> The specific scenario currently in mind is
> glibc's use case of loading and sealing ELF executables. To this end,
> Stephen is working on a change to glibc to add sealing support to the
> dynamic linker, which will seal all non-writable segments at startup.

Ok, interesting.