Huge pages and persistent memory

By Jonathan Corbet
March 17, 2015

LSFMM 2015

One of the final sessions in the memory-management track of LSFMM 2015 had to do with the intersection of persistent memory and huge pages. Since persistent memory looks set to come in huge sizes, using huge pages to deal with it looks like a performance win on a number of levels. But support for huge pages on these devices is not yet in the kernel.

Matthew Wilcox started off by saying that he has posted a patch set adding huge-page support to the DAX subsystem, but that only one other developer had reviewed the code. The biggest complaint was the introduction of the pmd_special() function, which tracks a "special" bit at the page middle directory (PMD) level of the page table hierarchy, the level at which huge pages are managed.

Some background: the kernel allows architecture-level code to mark specific pages as being "special" by providing a pte_special() function. These pages have some characteristic that causes them to behave differently than ordinary memory. In cases where the architecture has enough bits available in its page table entries, pte_special() just checks a bit there; otherwise things get more complicated. The core memory-management code treats so-marked pages, well, specially; for example, virtual memory areas containing "special" pages should also have a find_special_page() method to get the associated struct page.
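For readers who have not looked at this code, the simple case (a spare bit in the page table entry) can be modeled in a few lines of user-space C. This is only a sketch: the model_* names and the choice of bit 9 are assumptions made for illustration, not the kernel's definitions, and each real architecture picks its own bit and entry layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software-available bit; real architectures choose their own. */
#define MODEL_PTE_SPECIAL (UINT64_C(1) << 9)

/* A page table entry modeled as a single 64-bit word. */
typedef struct { uint64_t val; } model_pte_t;

static_assert(sizeof(model_pte_t) == sizeof(uint64_t),
              "the modeled entry must stay one word wide");

/* Test the "special" bit, as a pte_special()-style helper would. */
static inline bool model_pte_special(model_pte_t pte)
{
    return (pte.val & MODEL_PTE_SPECIAL) != 0;
}

/* Return a copy of the entry with the "special" bit set. */
static inline model_pte_t model_pte_mkspecial(model_pte_t pte)
{
    pte.val |= MODEL_PTE_SPECIAL;
    return pte;
}
```

An architecture without a spare bit cannot use this approach, which is where the "more complicated" cases come from.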

Back to the discussion: adding pmd_special() requires that the "specialness" of the huge page be tracked at the PMD level. It is not clear that every architecture has a free bit available in the PMD to track that state. In theory, free bits should abound there since as many as 20 bits in the lower part of the entry are not needed to map to a page frame number, but some quick searching by developers in the room revealed that, on x86 at least, the "extra" bits must be set to zero. For now, though, Matthew is using the same bit that pte_special() uses, so his code should work on every architecture that supports pte_special().

In the case of huge pages backed by persistent memory, the pmd_special() bit indicates to the memory-management code that there is no associated page structure. Andrea Arcangeli asked why a special bit was needed to mark that condition; Matthew responded that it's because he doesn't really understand the memory-management subsystem, so he implemented something he knew he could make work.

This code may eventually be pushed in a direction where pmd_special() is no longer needed. But there are some other issues that come up. Matthew raised one: what happens when an application creates a MAP_PRIVATE mapping of a file into memory, then writes to a page in that mapping? The write will cause the memory-management code to allocate anonymous memory to replace the 2MB huge page being written to; the question is: should it allocate and copy a full 2MB page, or just copy the 4KB page that was actually written? Andy Lutomirski suggested that the answer had to be to copy 4KB; copying the full 2MB for each single-page change would be too expensive. But Kirill Shutemov replied that copy-on-write for huge pages does a 2MB copy now; the behavior with persistent memory, he said, should be consistent.
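The MAP_PRIVATE behavior in question can be observed from user space. The following is a minimal sketch using an ordinary temporary file and a single 4KB mapping (not DAX, and not a huge page); the function name and the /tmp path template are assumptions made for the example. It stores one byte through a private mapping and then checks that the file itself did not change, which is the copy-on-write step under discussion.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a store through a MAP_PRIVATE mapping left the file
 * unchanged, 0 if the file changed, and -1 on a setup failure.
 */
static int private_write_stays_private(void)
{
    char path[] = "/tmp/cow-demo-XXXXXX";   /* hypothetical scratch file */
    const size_t len = 4096;
    char on_disk = 0;
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file vanishes on close */

    if (ftruncate(fd, len) == 0 && pwrite(fd, "A", 1, 0) == 1) {
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            assert(map[0] == 'A');          /* mapping shows file contents */
            map[0] = 'B';                   /* store triggers copy-on-write */
            if (pread(fd, &on_disk, 1, 0) == 1)
                result = (on_disk == 'A' && map[0] == 'B');
            munmap(map, len);
        }
    }
    close(fd);
    return result;
}
```

With a 2MB huge page in place of the 4KB page, the same one-byte store would copy either 4KB or the full 2MB, a factor of 512, which is the cost Andy was concerned about.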

Matthew moved on to the topic of in-kernel uses for persistent memory. There will be some interesting ones, he thought, but how it should all work has yet to be worked out. HP, for example, is using ioremap() to map persistent memory into the kernel as if it were device memory; Matthew said that seems like the wrong approach to him. We should, he said, be using logical interfaces to persistent memory rather than direct physical interfaces like ioremap(). So he would like to see the creation of some sort of mapping interface implemented within the virtual filesystem layer that would allow persistent memory to be mapped into the kernel's address space.

Andy said that the pstore mechanism could benefit from directly-mapped persistent memory. There was also talk of maybe being able to load kernel modules from persistent memory without the need to copy them into "regular" memory. It might be possible to even map the entire kernel, but there is one little catch: the kernel patches its own code for a number of reasons, including use of optimal instructions for the specific hardware in use and turning tracepoints on and off. If the kernel were mapped from persistent memory, that patching would change the version stored in the device as well — probably not the desired result.

Finally, Matthew said, there have been requests for the ability to use extra-huge, 1GB pages as well as 2MB pages. He is looking at adding that functionality, but he has been struck by the amount of code duplication that exists at each of the four page table levels. He has some thoughts about creating a level-independent "virtual page table entry" abstraction that could be used to get rid of much of that duplication. The reaction from the assembled memory-management developers was cautiously positive; Matthew was encouraged to implement this abstraction within the DAX code. If it works out well there, it can then spread into the rest of the memory-management code.
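To give a sense of what level-independent code might look like, here is a small sketch that computes the table index and the mapped size for any level from one parameterized function, rather than from one function per level. It assumes an x86-64-style layout with 4KB base pages and nine index bits at each of four levels; the names are hypothetical, and this is not Matthew's proposed abstraction, which had not been written at the time of the session.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K   12   /* 4KB base pages (assumed) */
#define BITS_PER_LEVEL  9    /* 512 entries per table (assumed) */
#define NUM_LEVELS      4    /* level 0 = PTE, 1 = PMD, 2 = PUD, 3 = PGD */

/* Address bits below this shift are covered by one entry at 'level'. */
static inline unsigned level_shift(unsigned level)
{
    assert(level < NUM_LEVELS);
    return PAGE_SHIFT_4K + BITS_PER_LEVEL * level;
}

/* Bytes mapped by a single entry at 'level': 4KB, 2MB, 1GB, 512GB. */
static inline uint64_t level_map_size(unsigned level)
{
    return UINT64_C(1) << level_shift(level);
}

/* Index into the table at 'level' for a given virtual address. */
static inline unsigned level_index(uint64_t vaddr, unsigned level)
{
    return (unsigned)((vaddr >> level_shift(level)) &
                      ((1u << BITS_PER_LEVEL) - 1));
}
```

Under these assumptions, level_map_size(1) is the 2MB huge-page size and level_map_size(2) is the 1GB size mentioned above; the only thing that differs between levels is the shift.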

Index entries for this article
Kernel	Memory management/Nonvolatile memory
Conference	Storage, Filesystem, and Memory-Management Summit/2015


Posted Mar 20, 2015 4:06 UTC (Fri) by dlang (guest, #313)

> the kernel patches its own code for a number of reasons, including use of optimal instructions for the specific hardware in use and turning tracepoints on and off. If the kernel were mapped from persistent memory, that patching would change the version stored in the device as well — probably not the desired result.

why not?

If the kernel can apply the appropriate patches for different hardware, etc., it shouldn't matter which form is on the persistent media. (In addition, how much are these things going to migrate from one system to another?)

For things like tracepoints, I can see value in having them persist across a reboot. There should be a boot-time flag to disable all tracepoints.

Yes, it will require a little bit of change, but it doesn't seem like a huge amount of work, and there is value in letting things persist.

Posted Mar 24, 2015 15:16 UTC (Tue) by roblucid (guest, #48964)

Well, writing to a storage medium involves "wear".
Checksums used for security purposes will be invalidated.
Finally, images that are written to while in use can't generally be shared.

Those are a few reasons, off the top of my head, why sysadmins may be likely to prefer read-only.

Posted Apr 13, 2015 12:27 UTC (Mon) by mingo (subscriber, #31122)

The problem with leaving a kernel image patched is that the code that runs up to the patching function is often patched as well. So the kernel assumes that the bootup image is 'unpatched', with generic instructions in it that will work on all CPUs that are supported by that image.

So if you patch it for a given specific CPU, and then migrate the device over to another system with an incompatible CPU, it might crash (or even corrupt data) before it has a chance to patch things.

It's not an unsolvable problem, but has to be kept in mind as an extra complication.

Posted Apr 13, 2015 15:39 UTC (Mon) by nye (subscriber, #51576)

>So if you patch it for a given specific CPU, and then migrate the device over to another system with an incompatible CPU, it might crash (or even corrupt data) before it has a chance to patch things.

Is that a use case that anyone would actually expect to work? If so, is that a reasonable expectation?

To me, that sounds a lot like doing a suspend to disk, removing the disk, putting it in another machine containing entirely different hardware, and expecting to somehow successfully resume from disk - and I presume nobody would expect *that* to work? It sounds like 'simply' being able to detect that this is something that's happened and saying 'no way' should be sufficient.