Generalizing address-space isolation

By Jonathan Corbet
November 5, 2019

Linux systems have traditionally run with a single address space that is shared by user and kernel space. That changed with the advent of the Meltdown vulnerability, which forced the merging of kernel page-table isolation (KPTI) at the end of 2017. But, Mike Rapoport said during his 2019 Open Source Summit Europe talk, that may not be the end of the story for address-space isolation. There is a good case to be made for increasing the separation of address spaces, but implementing that may require some fundamental changes in how kernel memory management works.

Currently, Linux systems still use a single address space, at least when they are running in kernel mode. It is efficient and convenient to have everything visible, but there are security benefits to be had from splitting the address space apart. Memory that is not actually mapped is a lot harder for an attacker to get at. The first step in that direction was KPTI. It has performance costs, especially around transitions between user and kernel space, but there was no other option that would address the Meltdown problem. For many, that's all the address-space isolation they would like to see, but that hasn't stopped Rapoport from working to expand its use.

Beyond KPTI

Recently, he tried to extend this idea with system-call address-space isolation, which implemented a restricted address space for system calls. When a system call is invoked, most of the code and data space within the kernel is initially unmapped and inaccessible; any access to that space will generate a page fault. The kernel can then check to determine whether it thinks the access is safe; if so, the address in question is mapped, otherwise the calling process will be killed.
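
The fault-driven flow described above can be modeled in a few lines of user-space C. This is only a toy sketch, not code from the patch set: the structure restricted_ctx, the functions ctx_enter_syscall() and ctx_handle_fault(), and the placeholder policy in access_is_safe() are all names and assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8  /* size of the toy kernel address space, in pages */

enum fault_result { FAULT_MAPPED, FAULT_KILLED };

/* Hypothetical restricted context: tracks which pages are mapped. */
struct restricted_ctx {
    bool mapped[NPAGES];
};

/* Placeholder policy (an assumption): even-numbered pages are "safe". */
static bool access_is_safe(size_t page)
{
    return page % 2 == 0;
}

/* On system-call entry, start with everything unmapped. */
static void ctx_enter_syscall(struct restricted_ctx *ctx)
{
    for (size_t i = 0; i < NPAGES; i++)
        ctx->mapped[i] = false;
}

/* Fault path: map the page if the access looks safe, otherwise kill. */
static enum fault_result ctx_handle_fault(struct restricted_ctx *ctx,
                                          size_t page)
{
    assert(page < NPAGES);
    if (!access_is_safe(page))
        return FAULT_KILLED;
    ctx->mapped[page] = true;
    return FAULT_MAPPED;
}
```

A caller would invoke ctx_enter_syscall() once per system call and ctx_handle_fault() for each access to an unmapped page; a FAULT_KILLED result stands in for killing the calling process.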

There are potentially a few use cases for this kind of protection, but the immediate objective was to defend against return-oriented programming (ROP) attacks. If the target of a jump matches a known symbol in the kernel, the jump can be considered safe and allowed to proceed; the page containing that address will be mapped for the duration of the call. ROP attacks work by jumping into code that is not associated with a kernel symbol, so most of them would be blocked by this mechanism. Mapping the occasional page for safe references will make some code available to ROP attacks again, but it will still be a fraction of the entire kernel text (which is available in current kernels).
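
As a rough illustration of the known-symbol test, the sketch below treats a jump as safe only when the target is exactly the start address of a symbol in a sorted table. Both the exact-match rule and the function name jump_target_is_safe() are assumptions made for this example; the actual patches may define "matches a known symbol" differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Binary search over a sorted table of symbol start addresses.
 * Returns true only if the target lands exactly on a symbol start;
 * a jump into the middle of a function (a typical ROP gadget) fails.
 */
static bool jump_target_is_safe(const uintptr_t *syms, size_t nsyms,
                                uintptr_t target)
{
    size_t lo = 0, hi = nsyms;

    assert(syms != NULL || nsyms == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (syms[mid] == target)
            return true;
        if (syms[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}
```

With a table holding 0x1000, 0x1040, and 0x1200, a jump to 0x1040 would be allowed, while a jump to 0x1044 would be refused.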

These patches have not been merged, though, for a number of reasons. One is that nobody really knows how to check data accesses for safety; the known-symbol test used for text is not applicable to data. A system call with invalid parameters can still result in mapping a fair amount of code, making ROP gadgets available to an attacker. The patch set also slowed execution considerably, which always makes acceptance harder.

The L1TF and MDS speculative-execution vulnerabilities bring some challenges of their own. In particular, they allow a host system to be attacked from guests, and are thus frightening to cloud providers. The defense, for now, is to disable hyperthreading, but that can have a significant performance cost. A better solution, Rapoport said, might be another form of address-space isolation; in this case, it would be a special kernel mapping used whenever control passes into the kernel from a guest system. This "KVM isolation" mechanism was posted by Alexandre Chartre in May, but has not been merged.

Other address-space isolation ideas are also circulating. One of these would be to map specific kernel data only for the process that needs to access it. That would be done by setting up a private range in the kernel page tables. Kernel code could allocate memory in this space with new functions like kmalloc_proclocal(). For extra isolation, memory allocated in this way would be removed from the kernel's "direct map", which is a linear mapping of all of the system's physical memory. Taking pages out of the direct map has its own performance issues, though, since it requires breaking up huge pages into small pages.
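
The cost of breaking up huge pages can be seen with a little arithmetic. The toy model below assumes x86-64 sizes (a 2 MiB huge page covering 512 pages of 4 KiB each); the structure dm_region and its helper functions are invented for this sketch and do not correspond to real kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SMALL_PER_HUGE 512  /* 2 MiB / 4 KiB on x86-64 */

/* Toy model of one 2 MiB region of the direct map. */
struct dm_region {
    bool split;                    /* false: one huge mapping covers it */
    bool present[SMALL_PER_HUGE];  /* per-page state, used once split */
};

static void dm_init(struct dm_region *r)
{
    r->split = false;
    for (size_t i = 0; i < SMALL_PER_HUGE; i++)
        r->present[i] = true;
}

/* Unmapping a single 4 KiB page forces the huge mapping to be split. */
static void dm_remove_page(struct dm_region *r, size_t idx)
{
    assert(idx < SMALL_PER_HUGE);
    r->split = true;
    r->present[idx] = false;
}

/* Count the mapping entries needed for what is still present. */
static size_t dm_entries(const struct dm_region *r)
{
    size_t n = 0;

    if (!r->split)
        return 1;
    for (size_t i = 0; i < SMALL_PER_HUGE; i++)
        n += r->present[i] ? 1 : 0;
    return n;
}
```

In this model, removing one page turns a single huge-page entry into 511 small-page entries, which hints at the extra TLB pressure such splitting can cause.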

Then, there are user-exclusive mappings — user-space mappings that are only visible to the owning process. These could be used to store secrets (cryptographic keys, for example) in memory where they could not be (easily) accessed from outside. Once again, this memory would be removed from the direct map; it would also not be swappable. The MAP_EXCLUSIVE patch series implementing this behavior was posted recently.

Finally, Rapoport also mentioned namespace-based isolation: kernel memory that is tied to a specific namespace and which is only visible to processes running within that namespace. This turns out to get tricky when dealing with network namespaces, though. The sk_buff structures used to represent packets would be obvious candidates for isolation, but they also often cross namespace boundaries.

Generalizing address-space isolation

While each of the address-space isolation mechanisms described above is different, there are some common factors between all of them. They are all concerned with creating a restricted address space from existing memory, then making this space available when entering the proper execution context. So Rapoport is working on an in-kernel API to support address-space isolation mechanisms in general. That is going to require some interesting changes, though.

The kernel's memory-management code currently assumes that the mm_struct structure attached to a process is equivalent to that process's page tables, but that connection will need to be broken. A new pg_table structure will need to be created to represent page tables; there will also be an associated API to manipulate these page tables. A particular challenge will be the creation of a mechanism that can safely free kernel page tables.

Creating the restricted contexts is, instead, relatively easy. Some, like KPTI, are set up at boot time; others will need to be established at the right time: process creation, association with a namespace, etc. The context-switch code will need to be able to switch between restricted address spaces; again, switching the kernel's page tables is likely to be tricky. There will need to be code to free these restricted address spaces as well, with appropriate care taken to avoid the inconvenience that would result from, say, freeing the main kernel page tables.
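
One way to picture the separation of page tables from the mm_struct, and the care needed when freeing them, is a reference-counted page-table object that refuses to tear down the main kernel tables. The sketch below is a user-space toy under that assumption; toy_pg_table, toy_mm, and the attach/detach helpers are hypothetical and are not the proposed pg_table API, whose details were not given in the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page-table object decoupled from the mm. */
struct toy_pg_table {
    int  users;           /* contexts currently using these tables */
    bool is_main_kernel;  /* the main kernel tables are never freed */
    bool freed;
};

/* The mm merely points at a set of page tables. */
struct toy_mm {
    struct toy_pg_table *pgt;
};

static void toy_mm_attach(struct toy_mm *mm, struct toy_pg_table *pgt)
{
    assert(pgt != NULL && !pgt->freed);
    pgt->users++;
    mm->pgt = pgt;
}

/* Drop a reference; free the tables only when unused and not main. */
static void toy_mm_detach(struct toy_mm *mm)
{
    struct toy_pg_table *pgt = mm->pgt;

    assert(pgt != NULL && pgt->users > 0);
    mm->pgt = NULL;
    if (--pgt->users == 0 && !pgt->is_main_kernel)
        pgt->freed = true;
}
```

Here a restricted set of tables is released when its last user detaches, while the tables marked is_main_kernel survive even when their user count drops to zero.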

Once the infrastructure is in place, the kernel will need to gain support for private memory allocations. Functions like alloc_page() and kmalloc() will need to gain awareness of the context into which memory is being allocated; there will be a new __GFP_EXCLUSIVE flag to request an allocation into a restricted context. Once again, pages so allocated will need to be removed from the kernel's direct mapping (and returned to it once they are freed). Extra care will need to be taken with objects that need to cross context boundaries.
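
The allocation behavior described above might look roughly like the following toy page allocator. It is a user-space sketch only: TOY_GFP_EXCLUSIVE is a stand-in for the proposed __GFP_EXCLUSIVE flag, and toy_alloc_page() and toy_free_page() are made-up helpers, not the kernel's alloc_page() interface.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 4
#define TOY_GFP_KERNEL    0x0u
#define TOY_GFP_EXCLUSIVE 0x1u  /* stand-in for the proposed __GFP_EXCLUSIVE */

struct toy_page {
    bool in_use;
    bool in_direct_map;
};

/* All pages start out free and present in the direct map. */
static struct toy_page pages[NPAGES] = {
    { false, true }, { false, true }, { false, true }, { false, true },
};

/* Returns the index of the allocated page, or -1 if none is free. */
static int toy_alloc_page(unsigned int gfp)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pages[i].in_use)
            continue;
        pages[i].in_use = true;
        if (gfp & TOY_GFP_EXCLUSIVE)
            pages[i].in_direct_map = false;  /* hide from the direct map */
        return i;
    }
    return -1;
}

/* Freeing a page puts it back into the direct map. */
static void toy_free_page(int idx)
{
    assert(idx >= 0 && idx < NPAGES && pages[idx].in_use);
    pages[idx].in_direct_map = true;
    pages[idx].in_use = false;
}
```

An ordinary allocation leaves the page visible in the direct map, while one made with TOY_GFP_EXCLUSIVE removes it until toy_free_page() is called.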

Finally, the slab caches will also need to be enhanced to support this behavior. Some of the necessary mechanism is already there in the form of the caching used by the memory controller. Slab memory is often freed from a context other than the one in which it was allocated, though, leading to a number of potential traps.

Rapoport concluded by stating that address-space isolation needs to be investigated; it offers a way of significantly reducing the kernel's attack surface, even in the presence of hardware bugs. Whether the security gained this way justifies the extra complexity required to implement it is something that will have to be evaluated as the patches take shape. Expect to see some interesting patches on the mailing lists in the near future as this work is developed.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting your editor's travel to the event.]

Index entries for this article
Kernel	Memory management/Address-space isolation
Kernel	Security/Meltdown and Spectre
Conference	Open Source Summit Europe/2019


Generalizing address-space isolation

Posted Nov 6, 2019 14:50 UTC (Wed) by TheGopher (subscriber, #59256) [Link] (15 responses)

On the path towards a microkernel?

Generalizing address-space isolation

Posted Nov 6, 2019 15:01 UTC (Wed) by comio (subscriber, #115526) [Link] (14 responses)

Tanenbaum was right ;)

https://groups.google.com/forum/#!topic/comp.os.minix/wlh...

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 6, 2019 21:23 UTC (Wed) by kreijack (guest, #43513) [Link] (13 responses)

[Please take this post as an ironic one]

This was (is) a fantastic thread!
Some of my favorite parts:

https://groups.google.com/d/msg/comp.os.minix/wlhw16QWltI...
Andy Tanenbaum (29/01/92)
[...]
> In the meantime, RISC chips happened, and some of them are running at over
> 100 MIPS. Speeds of 200 MIPS and more are likely in the coming years.
> These things are not going to suddenly vanish. What is going to happen
> is that they will gradually take over from the 80x86 line.
[...]
> MINIX was designed to be reasonably portable, and has been ported from the
> Intel line to the 680x0 (Atari, Amiga, Macintosh), SPARC, and NS32016.
> LINUX is tied fairly closely to the 80x86. Not the way to go.
[...]

https://groups.google.com/d/msg/comp.os.minix/wlhw16QWltI...
Linus, a few replies below:
[...]
> Linus "my first, and hopefully last flamefest" Torvalds

Linus, further comments below
https://groups.google.com/d/msg/comp.os.minix/wlhw16QWltI...
[...]
> Tanenbaum Writes
> > I still maintain the point that designing a monolithic kernel in 1991 is
> > a fundamental error. Be thankful you are not my student. You would not
> > get a high grade for such a design :-)

> Well, I probably won't get too good grades even without you: I had an
> argument (completely unrelated - not even pertaining to OS's) with the
> person here at the university that teaches OS design. I wonder when
> I'll learn :)

BTW, when I was young I really believed that the RISC processor would be the future...

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 6, 2019 23:36 UTC (Wed) by roc (subscriber, #30627) [Link] (3 responses)

To be fair, modern x86 processors are RISC cores wrapped in an x86 decoder.

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 7, 2019 2:44 UTC (Thu) by gus3 (guest, #61103) [Link]

And Qemu is a core wrapped in a RISC decoder. Church-Turing.

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 7, 2019 19:01 UTC (Thu) by kreijack (guest, #43513) [Link] (1 responses)

> To be fair, modern x86 processors are RISC cores wrapped in an x86 decoder.
Look from the other side: the ISA is not so important; the technology has evolved to the point that the decoder is not a bottleneck anymore.

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 18, 2019 12:44 UTC (Mon) by renox (guest, #23785) [Link]

>Look from the other side: the ISA is not so important; the technology has evolved to the point that the decoder is not a bottleneck anymore.

1) Only if you don't care about the power used by the decoder; it sure has not helped Intel compete against ARM in embedded CPUs.

2) The ISA still matters: from memory, going from x86 to x86-64 allowed a 10% improvement in benchmarks.

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 7, 2019 12:48 UTC (Thu) by anselm (subscriber, #2796) [Link] (6 responses)

> when I was young I really believed that the RISC processor would be the future

It seems to me that ARM processors aren't doing too badly these days …

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 7, 2019 14:32 UTC (Thu) by excors (subscriber, #95769) [Link] (4 responses)

Is ARM more RISCy than x86 nowadays? They both have complex instruction decoders because the instruction set doesn't match their high-performance internal architecture. E.g. some ARMs have instruction fusion for sequences like mov+movk (loading a 32-bit immediate), adrp+ldr (PC-relative load), cmp+branch, etc, then instructions get broken down again into micro-ops for execution, in a similar way to x86. The compiler-facing instruction set on ARM is more RISC-like than x86 but that seems largely irrelevant to the CPU's performance; the instruction set on both is essentially just a poorly-designed compression format for micro-ops.

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 7, 2019 23:43 UTC (Thu) by flussence (guest, #85566) [Link] (1 responses)

Isn't it technically still reduced instruction set if you have a huge pile of reduced instruction sets that happen to share a CPU? :)

(is Thumb still a thing on arm64?)

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 10, 2019 22:36 UTC (Sun) by excors (subscriber, #95769) [Link]

(There is no 64-bit version of Thumb, although ARMv8-A CPUs still support 32-bit Thumb and non-Thumb encodings so it's not saving any complexity. I guess they figured that nobody cares that much about code density outside of M-series CPUs, and those people can carry on using ARMv8-M which is 32-bit-only and Thumb-only.)

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 21, 2019 15:51 UTC (Thu) by mwsealey (subscriber, #71282) [Link] (1 responses)

Yes, Arm is more RISCy than x86, in the sense not that it has FEWER instructions but that every instruction has less scope. RISC was never about reducing the number of possible combinations of opcodes, but about making sure that each 'instruction' was orthogonal up to the point that it remained useful. x86 has ADDs which can sign extend, load or store memory, AND NOT combined, and so on.

RISC decouples those so that if you have a need to do AND, NOT and ANDN, then you don't actually need a special instruction that can ANDN, you don't need memory operands for arithmetic if you have LDR and STR. Yes, it takes more instructions to do the same job, but in the end not a larger amount of time -- there isn't much of a case for LDR+LDR+ADD+STR taking any longer than MOV+ADD to actually execute.

Where Arm is less RISCy than academic RISC is the flexible second operand (i.e. you can shift and sign/zero extend inputs) which is definitely a convenience for code density. I don't think there was any consideration for 'complex decode logic' in Arm, the point was to take advantage of registers and being able to compartmentalize your memory accesses (because memory bandwidth is always bad).

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 21, 2019 15:59 UTC (Thu) by TomH (subscriber, #56149) [Link]

Hate to disappoint you but ARM has BIC which is really ANDN with a different name ;-)

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 7, 2019 18:58 UTC (Thu) by kreijack (guest, #43513) [Link]

> It seems to me that ARM processors aren't doing too badly these days …

In 1990, the Acorn Archimedes (one of the first ARM-based machines that I remember) was a lot faster than the 386 (by a factor of two or more). At the time, ARM was a younger processor than the x86, so the expectation was a bright future for this kind of CPU. In fact, in those years, every new CPU was a RISC one.

Now ARM does have a better performance-per-watt ratio, but the absolute maximum performance of an ARM CPU is lower than, or at most equal to, that of an x86.

x86 technology has evolved to the point that the decoder is no longer the bottleneck. So the ISA doesn't matter anymore.

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 10, 2019 16:51 UTC (Sun) by jezuch (subscriber, #52988) [Link] (1 responses)

BTW, I have recently learned that human (natural) languages have a deep parallel to the CISC/RISC split. Some have simple syllables that can be pronounced quickly, others have complex syllables that take more time to articulate. But the end result is that when you measure information content over time, all languages have basically the same throughput of around 40 bps. So perhaps this is also what happened with RISC: yes, you could execute instructions faster, but this was completely offset by the fact that you needed more of them to do the same thing.

I wonder if the languages with complex syllables are also just fronts for a language with much simpler syllables underneath, though ;)

Tanenbaum and Torvalds [was Generalizing address-space isolation]

Posted Nov 11, 2019 18:17 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link]

My impression is that a big part of what happened was that the RISC vs CISC battle took place when raw computing power was the main limit on how fast a computer could accomplish things, so there was a serious issue of whether it was better to have many small operations with high clock speed or fewer bigger operations with a slower clock speed. But what's happened since then is that processors have outstripped the rest of the system to the point that instruction set complexity is no longer where the action is. Instead, the limit on the computer is how fast and efficiently you can get instructions and data to it. That means the big battle is now things like how big your cache is and how cleverly you use it, how you can minimize RAM latency, and whether you can avoid backtracking in a deep pipeline by using speculative execution.