

The state of the page in 2023

By Jonathan Corbet
May 17, 2023

LSFMM+BPF
The conversion of the kernel's memory-management subsystem over to folios was never going to be done in a day. At a plenary session at the start of the second day of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Matthew Wilcox discussed the current state and future direction of this work. Quite a lot of progress has been made — and a lot of work remains to be done.

There is no single best page size for the management of memory in a Linux system, Wilcox began. On some benchmarks, using 64KB pages produces significantly better results, but others do better with 4KB base pages. In general, though, managing a system with 4KB pages is inefficient at best; at that size, the kernel must scan through millions of page structures to provide basic management functions. Beyond that, the page structure is badly overloaded and difficult to understand. If it needs to grow for one page type, it must grow for all, meaning in practice that it simply cannot grow, because somebody will always push back on it.

To address this problem, Wilcox and others are trying to split struct page into a set of more specialized structures. Eventually struct page itself will shrink to a single pointer, where the least-significant bits are used to indicate what type of usage-specific structure is pointed to. The work to move slab-allocator information out of struct page has already been completed. There are plans (in varying states of completion) to make similar changes for pages representing page tables, compressed swap storage, poisoned pages, folios, free memory, device memory, and more.
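The tagged-pointer idea described above can be sketched in userspace C. This is not the kernel's actual implementation (the real encoding and type names may differ); it is a minimal illustration of how a pointer's least-significant bits, which are always zero for an aligned structure, can carry a type tag identifying the usage-specific descriptor:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor-type tags; the kernel's real encoding may differ. */
enum memdesc_type {
	MEMDESC_FOLIO  = 0,
	MEMDESC_SLAB   = 1,
	MEMDESC_PTDESC = 2,
	MEMDESC_MASK   = 3,	/* the low two bits of the word hold the tag */
};

/*
 * Pack a descriptor pointer and its type tag into a single word,
 * exploiting the fact that the descriptor is at least four-byte
 * aligned, so its low two bits are always zero.
 */
static uintptr_t memdesc_pack(void *desc, enum memdesc_type type)
{
	assert(((uintptr_t)desc & MEMDESC_MASK) == 0);
	return (uintptr_t)desc | type;
}

/* Recover the type tag from a packed word. */
static enum memdesc_type memdesc_type_of(uintptr_t packed)
{
	return packed & MEMDESC_MASK;
}

/* Recover the descriptor pointer by masking off the tag bits. */
static void *memdesc_ptr(uintptr_t packed)
{
	return (void *)(packed & ~(uintptr_t)MEMDESC_MASK);
}
```

With this scheme, struct page shrinks to one word while still being able to lead the kernel to a slab, page-table, or folio descriptor of whatever size that use case needs.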

When will this work be done? Wilcox waved his hands and said "two years" to general laughter. There have been 1,087 commits in the mainline so far that mention folios. The page cache has been fully converted, as have the slab allocators. The tail-page portion of compound pages has been moved to folios, allowing the removal of another section from struct page. The address_space_operations structure has been converted — except for three functions that will soon be deleted entirely.

There are three filesystems (XFS, AFS, and EROFS) that have been fully converted, as have the iomap and netfs infrastructure layers. A number of other filesystems, including ext4, NFS, and tmpfs, can use single-page folios now. The get_user_pages() family of functions uses folios internally, though its API is still based on struct page. Much of the internal memory-management code has been converted. One might be tempted to think that this work is nearly done, but there is still a lot of code outside of the memory-management layer that uses struct page and will need to be converted.

Every conversion that is done makes the kernel a little smaller, Wilcox said, due to the simplifying assumption that there are no pointers to tail pages. Over time, this shrinkage adds up.

There are plenty of topics to discuss for the future, he said. One is the conversion of the buffer-head layer, which is in progress (and which was the subject of the next session). Folios will make it easier to support large filesystem block sizes. The get_user_pages() interfaces need to be redesigned, and there are more filesystem conversions to do. A big task is enabling multi-page anonymous-memory folios. Most of the work done so far has been with file-backed pages, but anonymous memory is also important.

One change that is worth thinking about, he said, is reclaiming the __GFP_COMP allocation flag. This flag requests the creation of a compound page (as opposed to a simple higher-order page); that results in the addition of a bunch of metadata to the tail pages. This is useful for developers working on kernel-hardening projects, who can use it to determine whether a copy operation is overrunning the underlying allocation. They would like the kernel to always create compound pages and to drop non-compound allocations entirely; so, Wilcox suggested, the page allocator could simply do that by default and the __GFP_COMP flag could be removed.

He mentioned some pitfalls that developers working on folio conversions should be aware of. Some folio functions have different semantics than the page-oriented functions they replace; the return values may be different, for example. These changes have been carefully thought about, he said, and result in better interfaces overall, but they are something to be aware of when working in this area.

Multi-page folios can also cause surprises for code that is not expecting them. He mentioned filesystems that check for the end of a file by calculating whether an offset lands within a given page; now they must be aware that it could happen anywhere within a multi-page folio. Anytime a developer encounters a definition involving the string PAGE (PAGE_SIZE, for example), it is time to be careful. And so on.
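The end-of-file pitfall mentioned above can be shown with a small userspace sketch. The names here are invented stand-ins (the kernel's actual helpers are folio_pos() and folio_size(), but this code does not run in the kernel): a check that assumes EOF can only land within the current PAGE_SIZE chunk gives the wrong answer once the folio spans multiple pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/*
 * A userspace stand-in for a folio: its byte offset in the file and its
 * size (a power-of-two multiple of PAGE_SIZE), mirroring what the
 * kernel's folio_pos() and folio_size() helpers report.
 */
struct fake_folio {
	size_t pos;	/* file offset of the folio's first byte */
	size_t size;	/* PAGE_SIZE, 2*PAGE_SIZE, 4*PAGE_SIZE, ... */
};

/*
 * Old-style check: assumes EOF can only land inside the single
 * PAGE_SIZE chunk being examined. Wrong for multi-page folios.
 */
static bool eof_in_page(size_t page_pos, size_t file_size)
{
	return file_size > page_pos && file_size < page_pos + PAGE_SIZE;
}

/* Folio-aware check: EOF can land anywhere within the folio. */
static bool eof_in_folio(const struct fake_folio *f, size_t file_size)
{
	return file_size > f->pos && file_size < f->pos + f->size;
}
```

For a 16KB folio at offset zero in a 6000-byte file, the folio-aware check correctly reports that EOF falls inside the folio, while the page-based check at offset zero misses it because 6000 is past the first 4KB page.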

There are also, he concluded, a few misconceptions about folios that are worth clearing up. One of those is that there can be only one lock per folio; he confessed that he doesn't quite understand why there is confusion here. There was always just one lock per compound page as well. The page lock is not highly contended; whenever it looks like a page lock is being contended, it is more likely to be an indication of threads waiting for I/O to complete.

Some developers seem to think that dirtiness can only be tracked at the folio level. It is still entirely possible to track smaller chunks within a folio, though; that is up to the filesystem and how it handles its memory. The idea that page poisoning affects a whole folio is also incorrect; that is a per-page status.

As the session wound down, David Hildenbrand said that, while folios are good, there is still often a need to allocate memory in 4KB chunks. Page-table pages, he said, would waste a lot of memory if allocated in larger sizes. What is really needed is the ability to allocate in a range of sizes, depending on how the memory will be used. Wilcox closed the session by saying that is exactly the outcome that the developers are working toward.

Index entries for this article
Kernel: Memory management/Folios
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023



The state of the page in 2023

Posted May 17, 2023 14:45 UTC (Wed) by hsiangkao (subscriber, #123981)

Actually, the uncompressed part of EROFS (which is iomap-based) is fully converted.

The folio conversion of the compressed part (plus sub-page block-size support for it) is ongoing, and I plan to finish it this year.

The state of the page in 2023

Posted May 17, 2023 18:46 UTC (Wed) by bluss (guest, #47454)

Are slides or recordings available from LSFMM+BPF?

The state of the page in 2023

Posted May 17, 2023 18:50 UTC (Wed) by jake (editor, #205)

> Are slides or recordings available from LSFMM+BPF?

Not yet, but they are supposed to start appearing on the LF YouTube site in a couple of weeks ...

https://www.youtube.com/@LinuxfoundationOrg/videos

jake

The state of the page in 2023

Posted May 29, 2023 19:41 UTC (Mon) by bluss (guest, #47454)

And the videos have arrived :). The recording for the talk in the article is https://www.youtube.com/watch?v=U0FwqTTtBRk

Huge

Posted May 18, 2023 1:25 UTC (Thu) by ncm (guest, #165)

The hugepage situation is desperately in need of cleanup. The transparent-hugepages fiasco poisoned the well, so systems and programs have devolved to static provisioning; can we ever recover? We need programs not to need to know anything about huge pages at all, so that they just get used automatically whenever that would be useful, as they now (apparently) are in FreeBSD. In particular, hugetlbfs should end up synonymous with tmpfs, and mapping a regular file on any filesystem, most particularly a .so, should get them automatically anywhere they would fit.

Huge code pages

Posted May 18, 2023 2:50 UTC (Thu) by DemiMarie (subscriber, #164188)

Using huge pages for executable code is not entirely a win, as it means that the entire huge page must be resident in memory. Time for a benchmark?

Huge code pages

Posted May 18, 2023 3:06 UTC (Thu) by willy (subscriber, #9762)

The benchmarks have been done. The wins are big enough that I'm being inveigled to make it happen on earlier kernel versions.

Huge

Posted May 18, 2023 3:03 UTC (Thu) by willy (subscriber, #9762)

There's no option to disable or control the use of large folios. Once the filesystem supports them, the page cache decides how large to make them.

Hugetlbfs does need to go away, but that's a big job as it has grown features like reservations and page table sharing which need to be made generic.

Huge

Posted May 18, 2023 10:50 UTC (Thu) by adobriyan (subscriber, #30858)

> We need programs not to need to know anything about them, so they just get used automatically whenever that would be useful, as they now (apparently) have in FreeBSD.

Is this what "transparent" means in THP: that programs don't need to know about their mappings' page tables?

It would be nice if the kernel provided some guarantees without countless knobs, some of which are global:

1) get mapping sizes,

2) get a mapping, including flags which
a) specify mapping sizes in order of preference,
b) specify how hard the kernel has to work to get one (similar to GFP_KERNEL vs. GFP_ATOMIC);
this flag basically says "do everything up to and including starting filesystem writeback".

For example: "dd bs=2MB" may try to get a 2MB buffer, quickly get ENOMEM on a fragmented machine, and fall back to 4KB pages.

Another example: a video game may do "mmap((size_t[]){2<<20}, MAP_DWIM);",
because it wants some memory for its internal allocators (gamedev hates malloc), and
2MB pages are important so that performance doesn't degrade visibly.

3) in theory, mremap() could change the underlying page-table size

Huge

Posted May 18, 2023 14:30 UTC (Thu) by willy (subscriber, #9762)

The way this actually works for your dd case is that we see a 2MB read, try to allocate a 2MB folio. If that fails, we fall down the orders; 2x1MB, 4x512kB, etc. We use the GFP flags that will wake kswapd, but not do direct reclaim.

There are some minor modifications to that if the 2MB read is not 2MB-aligned in the file.

What's more interesting is when the dd uses, say, 1kB blocks. Readahead kicks in and we start using 16kB, then 64kB, then 256kB folios. This works out really well, although I'm sure the algorithm could be tuned better. There's no reasoning behind the current one other than "try to use large folios to test the rest of the code".
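The fall-down-the-orders behavior described here can be sketched in a few lines of userspace C. This is an illustration, not the page cache's actual code: the function names are invented, the "allocator" is simulated by a fragmentation limit, and the real code covers the remainder of a large read with multiple smaller folios rather than a single one.

```c
#include <assert.h>

/*
 * Simulated allocator: pretend that any allocation above
 * max_avail_order fails, as it would on a fragmented machine.
 * Returns nonzero on "success".
 */
static int try_alloc(int order, int max_avail_order)
{
	return order <= max_avail_order;
}

/*
 * Fall down the orders: try the preferred order first (order 9 is
 * 2MB with 4kB pages), then 1MB, 512kB, and so on, down to a single
 * 4kB page at order 0. Returns the order that succeeded, or -1 if
 * even order 0 failed.
 */
static int alloc_folio_order(int preferred_order, int max_avail_order)
{
	for (int order = preferred_order; order >= 0; order--)
		if (try_alloc(order, max_avail_order))
			return order;
	return -1;
}
```

On an unfragmented machine the 2MB (order-9) attempt succeeds immediately; as fragmentation worsens, the loop quietly degrades to smaller folios without the caller having to care.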

Huge

Posted May 19, 2023 11:03 UTC (Fri) by adobriyan (subscriber, #30858)

> The way this actually works for your dd case is that we see a 2MB read, try to allocate a 2MB folio.

Is it a 2MB page underneath?

dd might be a bad example because it doesn't really process the data, just DMAs it back and forth; rsync is a better example.

Huge

Posted May 19, 2023 16:29 UTC (Fri) by willy (subscriber, #9762)

For those who haven't adapted to the new naming yet, yes ;-)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

* A folio is a physically, virtually and logically contiguous set
* of bytes. It is a power-of-two in size, and it is aligned to that
* same power-of-two. It is at least as large as %PAGE_SIZE. If it is
* in the page cache, it is at a file offset which is a multiple of that
* power-of-two. It may be mapped into userspace at an address which is
* at an arbitrary page offset, but its kernel virtual address is aligned
* to its size.

So a folio is the new name for "potentially compound page". A folio pointer points either to a head page or an order-0 page.

It doesn't really matter what the application is; a read() call is a read call. The only hint we take from the application is the length of the read (and it's only a hint; we may choose folios of other sizes based on our own determination).

Huge

Posted May 30, 2023 19:22 UTC (Tue) by Paf (subscriber, #91811)

This continues to be remarkable work; and there’s even useful documentation(!).

Huge

Posted Jul 19, 2023 21:50 UTC (Wed) by knotapun (guest, #166136)

> Another example: video game may do "mmap((size_t[]){2<<20}, MAP_DWIM);",
> because it wants some for internal allocators (gamedev hates malloc), and
> 2MB is important so that performance doesn't degrade visibly.

Can you really blame them? SLAB and SLUB are essentially the same thing; one is just user space.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds