
Memory-management: tiered memory, huge pages, and EROFS

By Jonathan Corbet
August 15, 2024
The kernel's memory-management developers have been busy in recent times; it can be hard to keep up with all that has been happening in this core area. In an attempt to catch up, here is a look at recent work affecting tiered-memory systems, underutilized huge pages, and duplicated file data in the Enhanced Read-Only Filesystem (EROFS).

Promotion and demotion

Tiered-memory systems are built with multiple types of memory that have different performance characteristics; the upper tiers are usually faster, while lower tiers are slower but more voluminous. To make the best use of these memory tiers, the system must be able to optimally place each page. Heavily used pages should normally go into the fast tiers, while memory that is only occasionally used is better placed in the slower tiers. Since usage patterns change over time, the optimal placement of memory will also change; the system must be able to move pages between tiers based on current usage. Promoting and demoting pages in this way is one of the biggest challenges in tiered-memory support.

Promotion is usually the easier side of the problem; it is not hard for the system to detect when memory is being accessed and move it to a faster tier. In current kernels, though, this migration only works for memory that has been mapped into a process's address space; the machinery requires that memory be referred to by a virtual memory area (VMA) to function. As a result, heavily used memory that is not currently mapped will not be promoted.

This situation comes about for page-cache pages that are being accessed by way of system calls (such as read() and write()), but which are not mapped into any address space. Memory-access speed can be just as important for such pages, though, so this inability to promote them can hurt performance.

This patch series from Gregory Price is an attempt to address that problem. The migration code in current kernels (migrate_misplaced_folio_prepare() in particular) needs to consult the VMA that maps a given folio (set of pages) prior to migration; if that folio is both shared and mapped with execute permission, then the migration will not happen. Pages that are not mapped at all, though, cannot meet that condition, so the absence of a VMA just means that this check need not be performed. With that change (and a couple of other adjustments) in place, it is simply a matter of adding an appropriate call in the swap code to migrate folios from a lower to a higher tier when they are referenced.
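
As a rough sketch of the idea (not the actual patch, and the helper name here is invented), the shared-and-executable test only applies when a VMA exists at all:

    /*
     * Conceptual sketch only: an unmapped page-cache folio has no VMA, so
     * the shared-and-executable veto simply does not apply to it.
     */
    #include <linux/mm.h>

    static bool misplaced_folio_may_migrate(struct folio *folio,
                                            struct vm_area_struct *vma)
    {
            if (!vma)
                    return true;    /* not mapped anywhere: nothing to check */

            /* shared, executable mappings are left where they are */
            if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_EXEC))
                    return false;

            return true;
    }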

A kernel that is trying to appropriately place memory will always be running a bit behind the game; it cannot detect a changed access pattern without first watching the new pattern play out. Sometimes, though, an application will know that it will be shifting its attention from one range of memory to another. Informing the kernel of that fact might help the system ensure that memory is in the best location before it is needed; at least, that is the intent behind this patch from "BiscuitOS Broiler".

Quite simply, this patch adds two new operations to the madvise() system call. They are called MADV_DEMOTE and MADV_PROMOTE; they do exactly what one would expect. An application can use these operations to explicitly request the movement of memory between tiers in cases where it knows that the access pattern is about to change.
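
A hedged sketch of how an application might use the proposed operations follows; MADV_PROMOTE and MADV_DEMOTE exist only in the patch, not in mainline headers, so the guard is required and the code is illustrative rather than tested.

    /*
     * Sketch only: MADV_PROMOTE comes from the proposed patch and is not in
     * mainline headers, so this compiles to a no-op on a stock system.
     */
    #include <sys/mman.h>
    #include <stddef.h>

    static int promote_range(void *addr, size_t len)
    {
    #ifdef MADV_PROMOTE
            /* hint that this range is about to become hot */
            return madvise(addr, len, MADV_PROMOTE);
    #else
            (void)addr;
            (void)len;
            return 0;
    #endif
    }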

There is nothing technically challenging about this work, but it is also not clear that it is necessary. The kernel already provides a system call, migrate_pages(), that can be used to move pages between tiers; David Hildenbrand asked why migrate_pages() is not sufficient in this case. The answer seems to be that madvise() is found in the C library, but the wrapper for migrate_pages() is in the extra libnuma library instead. As Hildenbrand answered, that is not a huge impediment to its use. So, while making this feature available via madvise() might be convenient for some users, that convenience seems unlikely to be enough to justify adding this new feature to the kernel.
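
For comparison, here is a minimal sketch of requesting the same kind of move with the existing migrate_pages() interface through libnuma; the assumption that node 0 is the fast tier and node 1 the slow tier is entirely system-specific.

    /*
     * Move the calling process's pages from an assumed slow tier (node 1)
     * to an assumed fast tier (node 0); the node numbers are assumptions.
     * Build with: cc demo.c -lnuma
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
            struct bitmask *from, *to;
            long left;

            if (numa_available() < 0) {
                    fprintf(stderr, "no NUMA support\n");
                    return 1;
            }

            from = numa_allocate_nodemask();
            to = numa_allocate_nodemask();
            numa_bitmask_setbit(from, 1);   /* slow tier (e.g. CXL memory) */
            numa_bitmask_setbit(to, 0);     /* fast tier (local DRAM) */

            /* pid 0 means "this process"; returns pages it could not move */
            left = numa_migrate_pages(0, from, to);
            if (left < 0)
                    perror("numa_migrate_pages");

            numa_free_nodemask(from);
            numa_free_nodemask(to);
            return left != 0;
    }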

Reclaiming underutilized huge pages

The use of huge pages can improve application performance, by reducing both the usage of the system's translation lookaside buffer (TLB) and memory-management overhead in the kernel. But huge pages can suffer from internal fragmentation; if only a small part of the memory within a huge page is actually used, the resulting waste can be significant. The corresponding increase in memory use has inhibited the adoption of huge pages in many settings that would otherwise benefit from them.

One way to get the best of both worlds might be to actively detect huge pages that are not fully used, split them apart into base pages, then reclaim the unused base pages; that is the objective of this patch series from Usama Arif. It makes two core changes to the memory-management subsystem aimed at recovering memory that is currently wasted due to internal fragmentation.

The first of those changes takes effect whenever a huge page is split apart and mapped at the base-page level, as often happens even in current kernels. As things stand now, splitting a huge page will leave the full set of base pages in its wake, meaning that the amount of memory in use does not change. But, if the huge page is an anonymous (user-space data) page, any base pages within it that have not been used will only contain zeroes. Those base pages can be replaced in the owning process's page tables with references to the shared zero page, freeing that memory. Arif's patch set makes that happen by checking the contents of base pages during the splitting process and freeing any pages found to hold only zeroes.
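
The check itself is conceptually simple; a minimal sketch (not the code from Arif's series) could use the kernel's memchr_inv() helper to test each base page for non-zero bytes:

    /*
     * Conceptual sketch, not the series' code: returns true if the base
     * page holds only zero bytes and could therefore be dropped in favor
     * of a mapping of the shared zero page.
     */
    #include <linux/highmem.h>
    #include <linux/string.h>

    static bool base_page_is_zero_filled(struct page *page)
    {
            void *addr = kmap_local_page(page);
            bool zero = memchr_inv(addr, 0, PAGE_SIZE) == NULL;

            kunmap_local(addr);
            return zero;
    }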

That will free underutilized memory when a page is being split, which is a start. It would work even better, though, if the kernel could actively find underutilized huge pages and split them when memory is tight; that is the objective of the second change in Arif's patch set.

A huge page, as represented by a folio within the kernel, can at times be partially mapped, meaning that not all of the base pages within the huge page have been mapped in the owning process's page tables. When a fully mapped folio is partially unmapped for any reason, the folio is added to the "deferred split list". If, at some later point, the kernel needs to find some free memory, it will attempt to split the folios on the deferred list, then work to reclaim the base pages within each of them.

Arif's patch set causes the kernel to add all huge pages to the deferred list whenever they are either faulted in or created from base pages by the khugepaged thread. When memory gets tight and the deferred list is processed, these huge pages (which are probably still fully mapped) will be checked for zero-filled base pages; if the number of such pages exceeds a configurable threshold, the huge page will be split and all of those zero-filled base pages will be immediately freed. If the threshold is not met, the huge page will instead be considered fully used and will be removed from the deferred list.

It is worth noting that the threshold is an absolute number; for the tests mentioned in the cover letter it was set to 409, which is roughly 80% of a 512-page huge page. This mechanism means that, while this feature can split underutilized PMD-sized huge pages implemented by the processor, it will not be able to operate on smaller, multi-size huge pages implemented in software. On systems using PMD-sized huge pages, though, the results reported in the cover letter show that this change can provide the performance benefits that come from enabling transparent huge pages while clawing back most of the extra memory that would otherwise be wasted.
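
Put together, the decision made when the deferred list is shrunk amounts to something like the sketch below (the names are invented, and it reuses base_page_is_zero_filled() from the earlier sketch); a threshold of 409 corresponds to roughly 80% of a 512-page PMD-sized huge page.

    /*
     * Conceptual sketch of the deferred-split decision; the names here are
     * invented and do not match the patch set's actual interface.
     */
    #include <linux/mm.h>

    static bool should_split_underused_folio(struct folio *folio,
                                             unsigned long threshold)
    {
            long i, zero_filled = 0;

            for (i = 0; i < folio_nr_pages(folio); i++)
                    if (base_page_is_zero_filled(folio_page(folio, i)))
                            zero_filled++;

            /* e.g. threshold = 409, about 80% of a 512-page PMD folio */
            return zero_filled > threshold;
    }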

Page-cache deduplication for EROFS

Surprisingly often, a system's memory will contain multiple pages containing the same data. When this happens with anonymous pages, the kernel samepage merging (KSM) feature can perform deduplication, recovering some memory (albeit with some security concerns). The situation with file-backed pages is harder, though. Filesystems that can cause a single file to appear with multiple names and inodes (as can happen with Btrfs snapshots or in filesystems that provide a "reflink" feature) are one case in point; if more than one name is used, multiple copies of a file's data can appear in the page cache. This can also happen in the mundane cases where files contain the same data; container images can duplicate data in this way.

The problem with deduplicating such pages is that each page in the page cache must refer back to the file from which it came; there is no concept in the kernel of a page coming from multiple files. If a page is written to, or if a file changes by some other means, the kernel has to do the right thing at all levels. So those duplicate pages remain duplicated.

Hongzhen Luo has come up with a solution for the EROFS filesystem, though — at the file level, at least. EROFS is a read-only filesystem, so the problems that come from possible changes to its files do not arise here.

An EROFS filesystem can be created with a special extended attribute, called trusted.erofs.fingerprint, attached to each file; the content of that attribute is a hash of the file's contents. When a file in the filesystem is opened for reading, the hash will be stored in an XArray-based data structure, associated with the file's inode. Anytime another file is opened, its hash is looked up in that data structure; if there is a match, the open is rerouted to the inode of the file that was opened first.
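
The bookkeeping can be imagined roughly as in the sketch below; the helper names are invented, and the real series derives the XArray index from the trusted.erofs.fingerprint hash and handles collisions and lifetimes in more detail.

    /*
     * Conceptual sketch only; these helpers are invented. The first file
     * opened with a given fingerprint registers its inode; later opens of
     * identical files look it up and are redirected to it.
     */
    #include <linux/xarray.h>
    #include <linux/fs.h>

    static DEFINE_XARRAY(erofs_dedup_inodes);

    static struct inode *erofs_lookup_shared_inode(unsigned long hash_key)
    {
            return xa_load(&erofs_dedup_inodes, hash_key);
    }

    static int erofs_register_shared_inode(unsigned long hash_key,
                                           struct inode *inode)
    {
            /* fails with -EBUSY if another inode already claimed this hash */
            return xa_insert(&erofs_dedup_inodes, hash_key, inode, GFP_KERNEL);
    }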

This mechanism can result in a number of processes holding file descriptors to different files on disk that all refer to a single file within the kernel. Since the files have the same contents, though, this difference is not visible to user space (though an fstat() call might return a surprising inode number). Within the kernel, redirecting file descriptors for multiple identical files to a single file means that only one copy of that file's contents needs to be stored in the page cache.

The benchmark results included with the series show a significant reduction in memory use for a number of different applications. Since this feature is contained entirely within the EROFS filesystem, it seems unlikely to run into the sorts of challenges that often await core memory-management patches. Deduplication of file-backed data in the page cache remains a hard problem in the general case, but it appears to have been at least partially solved for this one narrow case.

Index entries for this article
Kernel: Filesystems/EROFS
Kernel: Memory management/Huge pages
Kernel: Memory management/Tiered-memory systems



Hard links?

Posted Aug 15, 2024 15:56 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (9 responses)

SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?

Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.

What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.

Hard links?

Posted Aug 15, 2024 20:44 UTC (Thu) by iabervon (subscriber, #722) [Link]

My impression is that EROFS isn't for the use case of filesystems mapped one-to-one with physical devices. It's for the case where you have a single physical device storing a dozen images, each used with different overlays to make different containers. I believe the implementation is such that you'd have three storage areas: one with all of the file contents, one with the mapping of paths in /mnt/foo, and one with the mapping of paths in /mnt/bar, and all three would be on the same physical device. Currently, despite the fact that both /mnt/foo/my.txt and /mnt/bar/my.txt use the same storage for the file contents, the cache stores it twice.

Hard links?

Posted Aug 16, 2024 9:27 UTC (Fri) by hsiangkao (subscriber, #123981) [Link] (5 responses)

> SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?

Simply because hard links cannot resolve cases where files have identical data but different metadata (such as timestamps and/or file modes). That is quite common due to image rebuilds and the like, but users need to keep the different metadata.

> Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.
> What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.

This patchset is still far from complete; ideally it needs an anonymous address_space for both identified inodes. If one instance is unmounted, it will re-route to a different source on another device.

Sorry about the incomplete implementation so far, which is due to his current level of kernel knowledge.

Hard links?

Posted Aug 16, 2024 9:46 UTC (Fri) by hsiangkao (subscriber, #123981) [Link] (1 responses)

>> SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?
> Simply because hard links cannot resolve cases where files have identical data but different metadata (such as timestamps and/or file modes). That is quite common due to image rebuilds and the like, but users need to keep the different metadata.

Also, hard links cannot resolve cross-mount instances, which is common for image-based golden-filesystem cases, of course.

Hard links?

Posted Sep 4, 2024 1:59 UTC (Wed) by viro (subscriber, #7872) [Link]

Huh? You can't call link(2) between different mounts, but then your filesystem is readonly, so you can't call any directory-modifying syscalls anyway. Different mounts can very well share struct inode instances - otherwise cache coherency would've been an awful PITA.

Metadata differences make for a valid reason; any mount-related stuff is a red herring.

Hard links?

Posted Aug 16, 2024 13:06 UTC (Fri) by hsiangkao (subscriber, #123981) [Link] (2 responses)

>> Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.
>> What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.
> This patchset is still far from complete; ideally it needs an anonymous address_space for both identified inodes. If one instance is unmounted, it will re-route to a different source on another device.
> Sorry about the incomplete implementation so far, which is due to his current level of kernel knowledge.

To add a few words: for the case you mentioned, there is nothing different from page-based page-cache sharing, or even page-cache sharing within the same device (considering data link errors, various device-mapper targets, and more).

That is, you turn on this page-cache feature _only if_ you trust all data sources, since each data source can contribute data to the system's shared page cache.

Hard links?

Posted Aug 16, 2024 14:38 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

Thanks for the explanation. Super helpful; sounds like a nice feature once the kinks get worked out.

Another SQ: might it be possible to piggyback on top of fs-verity and make it general across filesystems --- at least read-only ones? fs-verity already maintains per-file hashes in metadata, and fs-verity ensures that nobody's lying about what's in the file.

Also, have you considered some kind of "RAID 1" striped IO strategy? If I have block devices A and B with read only filesystems containing identical file X, why not read X from A and B in parallel? I can think of all sorts of fun things to do with the knowledge that two files on the same filesystem are identical.

Hard links?

Posted Aug 16, 2024 15:14 UTC (Fri) by hsiangkao (subscriber, #123981) [Link]

> Another SQ: might it be possible to piggyback on top of fs-verity and make it general across filesystems --- at least read-only ones? fs-verity already maintains per-file hashes in metadata, and fs-verity ensures that nobody's lying about what's in the file.

I tend to keep this feature knob privileged all the time (yeah, I know EROFS mounting needs privilege). The privileged user mount program could decide whether it uses dm-verity/fs-verity or other(TM) technologies to enable this feature safely (so we trust the privileged user mount program).

> Also, have you considered some kind of "RAID 1" striped IO strategy? If I have block devices A and B with read only filesystems containing identical file X, why not read X from A and B in parallel? I can think of all sorts of fun things to do with the knowledge that two files on the same filesystem are identical.

I guess it will be helpful (and it's still compatible if Linux MM gains a finer page-based page-cache sharing mechanism in the future), but let's make this feature work with the original use case first.

Also, I think it would be better to reuse the current EROFS trusted domain ID concept. All sources in the same trusted domain (e.g. the same physical device, the same network host, or a user-defined trusted array could form a domain) could share page cache safely, but sources from different trusted domains won't share, by design.

Hard links?

Posted Aug 16, 2024 12:17 UTC (Fri) by HongzhenLuo (guest, #172928) [Link] (1 responses)

> SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?

Let's take the container scenario as an example. Due to different mount points, accessing the same file in two containers will generate different inodes within the EROFS file system. Therefore, it's not feasible to share the page cache simply through hard links.

> Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.
> What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.

As @hsiangkao said, the current patch set is still incomplete, although it has shown positive effects in saving memory. The current implementation maps files with identical content to the same list and uses the page cache of a certain inode in that list to fulfill read requests. When an inode is destroyed, it is removed from the list. However, if the inode currently being destroyed happens to be the inode backing the shared page cache, it must wait for all reads to complete before being removed from the list (this is also a drawback of the latest implementation). A solution that can alleviate this situation is to perform reads using an anonymous inode: all files with the same content are mapped to the same anonymous inode, and reading is performed through the page cache of that anonymous inode. When an inode is destroyed, the anonymous inode will obtain the necessary information to complete the read request from one of the remaining inodes. From the perspective of the unmounted instance, this will take hardly any waiting time.

Finally, I am a new kernel developer. Due to my limited knowledge of the kernel, I am sorry for the incompleteness in the current implementation.

Hard links?

Posted Sep 4, 2024 0:52 UTC (Wed) by fest3er (guest, #60379) [Link]

"Finally, I am a new kernel developer. Due to my limited knowledge of the kernel, I am sorry for the incompleteness in the current implementation."

No need for apologies. As a work-in-progress, it is expected to have functionality holes and to change significantly as inefficiencies are found, methods are polished, and you learn more about the kernel innards.

cases where it might be better to keep less-frequently accessed data in nearer memory

Posted Aug 15, 2024 20:34 UTC (Thu) by dankamongmen (subscriber, #35141) [Link]

i'm trying to think of cases where Bélády's "algorithm" doesn't strictly want this condition met i.e. where optimizing for frequency of access is suboptimal independently of policy (i probably ought reread Bélády 1969). i guess it then comes down to scheduling and timing?

assume i have a process that continuously walks 32MB of data, snugly filling L3 (assume inclusive caches), and this is 99.999% of my memory accesses across the machine. then i have one process that touches 4KB once per second. once p1 warms cache, each time processes change, i hit a single page. if i discount the cache, it makes sense to keep P1's pages warm over P2, as they are referenced far more often. but they're actually very rarely referenced, due to cache hits. so keep P2's page warm at all costs (reducing latency of its inevitable capacity miss); P1 has a better chance of loading the page it needs early, since it's interacting with so many more pages.

so let's say P1 is 64MB of data, and thus it can't fit in L3. now i'm actually hitting my P1 pages far more often than i hit my P2 page, and i ought prioritize them over the lousy P2 page, right? possibly not -- if P1's computation cannot keep up with L3 bandwidth, and i have sufficient memory bandwidth, L3 misses can be hidden entirely underneath the rolling computation. P2 has nothing to hide its prompt miss under, and stalls. so i probably always want to keep the P2 page live, even at the exclusion of a P1 page that is hit far more often. total transfer from memory remains the same either way, right?

i'm sure priorities and fairness complicate this further.

just thinking out loud.

Are these calls prone to abuse?

Posted Aug 16, 2024 1:40 UTC (Fri) by DanilaBerezin (guest, #168271) [Link] (2 responses)

> Quite simply, this patch adds two new operations to the madvise() system call. They are called MADV_DEMOTE and MADV_PROMOTE; they do exactly what one would expect. An application can use these operations to explicitly request the movement of memory between tiers in cases where it knows that the access pattern is about to change.

Do these calls really just explicitly tell the kernel to promote or demote the memory? As in, barring some sort of error condition, these calls are guaranteed to succeed in either demoting or promoting the memory? If that's the case, couldn't user space programs just abuse these calls to constantly give themselves memory in higher tiers? I would hope that it's just a suggestion (as is in the name of "madvise") to the kernel that may or may not be followed.

Are these calls prone to abuse?

Posted Aug 17, 2024 11:40 UTC (Sat) by grawity (subscriber, #80596) [Link] (1 responses)

Is that generally a problem on Linux? Do we currently have programs abusively mlocking or renicing themselves to get more resources?

Are these calls prone to abuse?

Posted Aug 18, 2024 16:02 UTC (Sun) by DanilaBerezin (guest, #168271) [Link]

I'm honestly not sure, but my instinct is to not trust user space so that kind of raises alarms in my head.

zero pages

Posted Aug 17, 2024 17:22 UTC (Sat) by mst@redhat.com (guest, #60682) [Link]

Why is special machinery needed to find the zero pages? It looks like a subset of the problem that KSM is trying to address, does it not?


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds