Sidestepping kernel memory management with DMEMFS

By Jonathan Corbet
December 7, 2020
One of the kernel's primary jobs is to manage the memory installed in the system. Over the years, though, there have been various reasons for removing a portion of the system's memory from the kernel's view. One of the latest can be seen in a mechanism called DMEMFS, which is being proposed as a way to get around some inefficiency in how the kernel keeps track of RAM.

In the early years, the motivation for hiding memory from the kernel was to avoid the problems caused by fragmentation. Allocating large contiguous areas tended to be nearly impossible after a system had been running for some time, creating problems for hardware that absolutely could not function without such areas. Once upon a time, an out-of-tree patch called "bigphysarea" was often used to reserve a range of memory for such allocations; since the kernel did not get its hands on this memory directly, it could not fragment it. LWN first captured a bigphysarea announcement in 1999, but the patch had been around for some time by then.

In the relatively recent past (2010), the contiguous memory allocator (CMA) patches provided a similar functionality using the same technique. Since then, though, the problem of allocating large contiguous areas has gotten much smaller. The kernel's own defragmentation mechanisms have improved considerably, and simply having more memory around also helps. CMA now relies on compaction and no longer uses a carved-out memory region.

DMEMFS has a different motivation. The kernel tracks memory via a data structure called the "memory map", which is essentially an array of page structures. A great deal of information is packed into this structure to tell the kernel how each page is used, track its position on various lists, connect it to its backing store, and more. Much effort has been expended over the years to keep struct page as small as possible, but it still occupies 64 bytes on 64-bit systems.

That may not seem like a lot of memory but, with the usual page size of 4KB, there are a lot of these structures in a contemporary system. A laptop with 16GB installed has 4,194,304 pages, meaning that 256MB of memory is used just to keep track of memory. Losing that much memory on a laptop is perhaps tolerable, but there are other settings where it hurts more. In the patch posting, author Yulei Zhang points out that a hosting provider running servers with 320GB of installed memory is losing 5GB of that memory to page structures. If that memory could be reclaimed, the provider could cram more guests into the machine, increasing the revenue that each server brings in — a metric that hosting providers pay a lot of attention to.
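
The arithmetic behind those numbers is easy to verify; this quick sanity check reproduces the article's figures and is not code from the patch set:

```python
# struct page costs 64 bytes for every 4KB page of RAM,
# i.e. the memory map consumes 1/64 of installed memory.
PAGE_SIZE = 4096
STRUCT_PAGE = 64

def page_overhead(ram_bytes, page_size=PAGE_SIZE, struct_page=STRUCT_PAGE):
    """Bytes of memory-map overhead for a system with ram_bytes of RAM."""
    return (ram_bytes // page_size) * struct_page

GiB = 1 << 30
assert 16 * GiB // PAGE_SIZE == 4_194_304          # pages on a 16GB laptop
assert page_overhead(16 * GiB) == 256 * (1 << 20)  # 256MB of struct pages
assert page_overhead(320 * GiB) == 5 * GiB         # 5GB on a 320GB server
```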

As described above, DMEMFS works by carving out a portion of system memory at boot time; the exact amount is controlled with the dmem= command-line parameter. The specified amount is reserved on each NUMA node in the system; if that amount starts with "!", it tells DMEMFS how much memory to give the kernel while grabbing the rest. Once the system has booted, this carved-out memory can be allocated by mounting the dmemfs filesystem and creating one or more files of the desired size. A call to mmap() will map that memory into a process's address space. A DMEMFS file can also be handed to QEMU as the backing store for a guest machine. This memory supports NUMA policies and can provide huge pages, just like ordinary memory.
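
The consumer side is the usual file-backed mmap() sequence: create a file of the desired size, then map it. The sketch below uses a regular temporary file as a stand-in, since dmemfs itself is out of tree; with dmemfs the file would be created under the dmemfs mount point (any path shown is hypothetical):

```python
import mmap
import os
import tempfile

def map_backing_file(size=4096, payload=b"hello"):
    """Round-trip data through an mmap() of a sized backing file --
    the same open/truncate/map sequence a dmemfs consumer would use."""
    # With dmemfs this would be a file under the dmemfs mount
    # (e.g. /mnt/dmem/guest0 -- hypothetical); a temp file stands in
    # so the sketch runs anywhere.
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)            # the file must be sized first
        with mmap.mmap(fd, size) as mem:  # map into the address space
            mem[:len(payload)] = payload
            return bytes(mem[:len(payload)])
    finally:
        os.close(fd)
        os.unlink(path)
```

QEMU uses essentially this sequence when handed a dmemfs file as a guest's backing store, mapping the carved-out memory into the guest's address space.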

Perhaps DMEMFS is the best solution for use cases where making the most use of memory is paramount and there is no need for the kernel to actually manage that memory. But this is a 37-part patch set adding over 3,400 lines of code dedicated to cutting the kernel out of the management of an important system resource; whenever something like that comes along, it's worth looking at the source of the problem and whether other solutions might exist.

Back in 1991, when the first Linux kernel was posted, machines had less memory than they do now. This archived PC ad is instructive; a high-end system featured 4MB of RAM and cost a mere 3,700 1991 dollars. That system used 4KB pages, so there were a total of 1,024 pages for the kernel to manage. Most personal-computer systems at that time had less memory than that.

Contemporary computers are rather larger, but the page size remains 4KB, so the number of pages managed by the kernel has increased by three orders of magnitude or so. The number of page structures has increased accordingly. Those structures have also gotten larger; the transition to 64-bit processors, in particular, led to a significant increase in the size of struct page. The increased size hurts, obviously, but the sheer number of page structures also increases overhead in many places in the kernel.

One possible solution is to increase the size of the pages managed by the kernel, clustering multiple physical pages into larger groups if need be. Some architectures support use of larger page sizes now; the arm64 kernel can use a 64KB page size, for example. Over the years, numerous developers have attempted to implement some sort of generalized page clustering in the kernel, but none of those efforts have made it into the mainline. The complexity of the task has been one impediment to getting this work merged, but it's not the only one.

The other concern with using larger page sizes is internal fragmentation — wasting memory in situations where full pages must be allocated but only a small amount of memory is needed. A classic example is representing a small file in the page cache. A one-line shell script may fit into less than 100 bytes, but it still needs a full page in the page cache. Anything beyond those 100 bytes is wasted; larger pages will clearly waste more. Decades-old folk wisdom says that most files on Unix systems are small; that may be less true than it once was, though.
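
The waste for the one-line-script case is easy to quantify; this is illustrative arithmetic only:

```python
# Internal fragmentation: bytes lost when a file is cached
# in whole pages of a given size.
def wasted(file_bytes, page_size):
    """Round-up loss for a file occupying whole pages."""
    pages = -(-file_bytes // page_size)   # ceiling division
    return pages * page_size - file_bytes

assert wasted(100, 4 * 1024) == 3996      # ~3.9KB lost at 4KB pages
assert wasted(100, 64 * 1024) == 65436    # ~64KB lost at 64KB pages
assert wasted(4096, 4 * 1024) == 0        # a page-sized file wastes nothing
```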

Your editor is unaware of any studies that have made a serious effort to measure the memory lost to internal fragmentation on real-world systems with a larger page size, but this concern has made it hard to get patches merged anyway. At the moment, the work that is closest is probably the large pages in the page cache effort by Matthew Wilcox — which applies at a higher level and will not reduce the number of page structures in the system.

As a result, the DMEMFS patches may need to be merged for the simple reason that they exist now and work. But it may well be that the real solution to this problem lies elsewhere; rather than hide pages from the kernel, reduce the overhead within the kernel by dealing with memory in larger chunks. It seems inevitable that increasing memory sizes will eventually force that change; said change has, however, proved entirely evitable for many years of memory-size growth. Until the kernel can be changed to deal with memory more efficiently, there may be no choice other than merging workarounds that simply take the kernel out of the picture.

Index entries for this article
Kernel: Memory management/struct page



Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 0:46 UTC (Tue) by hansendc (subscriber, #7363) [Link]

I guess it matters how much you value the features that the kernel brings along with that ~1.5% of RAM "tax". If you or your customers dislike swap and KSM in the first place, you probably look at that 1.5% as being a total waste. In the long run, I suspect that 1.5% of your RAM is a pretty cheap price to pay for all the features the kernel brings.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 1:31 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (12 responses)

> A one-line shell script may fit into less than 100 bytes, but it still needs a full page in the page cache.
Has anybody looked at not storing files in page cache unless they are larger than 4kb? Plenty of code uses read/write calls that can work fine with files that are not backed by real pages.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 1:56 UTC (Tue) by gus3 (guest, #61103) [Link] (1 responses)

> Plenty of code uses read/write calls that can work fine with files that are not backed by real pages.

Isn't that the buffer cache?

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 14:49 UTC (Tue) by mageta (subscriber, #89696) [Link]

The page cache is the buffer cache. There is no separate buffer cache anymore.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 6:49 UTC (Tue) by glqhw (guest, #131853) [Link] (7 responses)

> Has anybody looked at not storing files in page cache unless they are larger than 4kb?
Here are some difficulties:
1. This means that reading such files will ALWAYS lead to I/Os. Most of the time, I/O means more latency compared with accessing memory.
2. An executable file such as a shell script needs to be mmapped before running. Memory mapping requires page caching.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 7:11 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> 1. This means that reading such files will ALWAYS lead to I/Os. Most of the time, I/O means more latency compared with accessing memory.
I'm not proposing removing the cache entirely, just using some kind of a high-granularity malloc-ish allocator for small files instead of page-based granularity.

> 2. An executable file such as a shell script needs to be mmapped before running. Memory mapping requires page caching.
Sure. There should be a mechanism to transform this kind of cache into full page-based cache (and probably back).

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 8:58 UTC (Tue) by matthias (subscriber, #94967) [Link] (5 responses)

>> 2. An executable file such as a shell script needs to be mmapped before running. Memory mapping requires page caching.
> Sure. There should be a mechanism to transform this kind of cache into full page-based cache (and probably back).

Am I understanding you correctly, that you propose to store many small files in one page and whenever someone wants to mmap, copy the file to a fresh page and remove this page again once the file is no longer needed?

Indeed this sounds like a good compromise: a more compact cache, with the downside of copying a few hundred bytes before a file can be accessed. And if there is a process accessing many small files, you can even copy all of them to the same page. It may be tempting to avoid the copy altogether, but then the kernel would first have to check that the page backing a file about to be mmapped contains no information (especially no other file) that the process should not have access to.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 9:29 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> Am I understanding you correctly, that you propose to store many small files in one page and whenever someone wants to mmap, copy the file to a fresh page and remove this page again once the file is no longer needed?
Essentially. I was thinking about just using something like SLAB for it.

If we go to 64kb pages (or even 16kb like apparently on the Apple Silicon), the overhead of unused pages would be kinda significant.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 21:09 UTC (Tue) by willy (subscriber, #9762) [Link]

The THP patchset attempts to provide this benefit without this overhead.

As Jon notes, the struct pages still exist, but they essentially go unused. Memory is managed in larger chunks. Exactly how much larger depends on how the application uses the file. Small files would still use a 4kB page, but if you open() a file and read 32kB from it, the page cache will allocate an order-3 page and manage that 32kB as a single entity.

YMMV, do feel free to try out the current patchset (if you use XFS)
https://git.infradead.org/users/willy/pagecache.git/
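
The order-based sizing Wilcox describes is simple arithmetic; this is a toy illustration, not code from the patch set:

```python
# A page of "order" n covers 2**n base pages, so an order-3 page
# is 8 * 4KB = 32KB managed as a single entity by the page cache.
BASE_PAGE = 4096

def order_size(order):
    """Bytes covered by an order-n compound page."""
    return (1 << order) * BASE_PAGE

assert order_size(0) == 4 * 1024    # a plain page
assert order_size(3) == 32 * 1024   # the 32KB read in the example
```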

Sidestepping kernel memory management with DMEMFS

Posted Dec 10, 2020 16:25 UTC (Thu) by Karellen (subscriber, #67644) [Link] (2 responses)

Given the existence of huge pages, clearly not all pages on a system have to be the same size. Why not define "tiny pages" of e.g. 256 bytes that can be used to back small files, with 4/16/64KB "normal" pages, and 1/2/4MB huge pages (and larger?) - which are chosen per-allocation to minimise wastage?

Yes, PAGESIZE might need to be the smallest of these (i.e. 256 bytes), but is there any reason that would cause any insoluble problems?

Sidestepping kernel memory management with DMEMFS

Posted Dec 10, 2020 19:29 UTC (Thu) by geert (subscriber, #98403) [Link] (1 responses)

Because the possible page sizes depend on what the actual MMU hardware supports?

You can find a list of supported sizes for contemporary architectures at https://en.wikipedia.org/wiki/Page_(computer_memory).
The smallest page size in the list is 4 KiB, which is something most architectures settled on more than 20 years ago.
Older systems did support smaller page sizes. IIRC the MC68451 and MC68851 supported pages as small as 1 KiB. Still larger than 256 bytes, though.

Sidestepping kernel memory management with DMEMFS

Posted Dec 11, 2020 9:14 UTC (Fri) by geert (subscriber, #98403) [Link]

Correction after discovering more information on the 'net: the MC68851 supported page sizes from 256 bytes to 32 KiB.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 16:54 UTC (Tue) by tlamp (subscriber, #108540) [Link] (1 responses)

Couldn't the page cache merge such small files into a single page transparently? Seems a bit weird to me that caching 40 ~100-byte files would need 40 pages (160 KiB) instead of just one - I mean, there's probably some overhead from an access structure, but it should still be much better.

I could imagine that there may be some security issues (some side channel?) or complexity issues making this non trivial, but that's just speculation.
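
The numbers in the comment check out; this is illustrative arithmetic only, and the packed layout it compares against is hypothetical:

```python
# Forty ~100-byte files cached at page granularity versus
# (hypothetically) packed together into shared pages.
PAGE = 4096

def cache_bytes_one_page_each(n_files):
    return n_files * PAGE                   # today: one page per file

def cache_bytes_packed(n_files, file_bytes=100):
    total = n_files * file_bytes
    return -(-total // PAGE) * PAGE         # ceiling to whole pages

assert cache_bytes_one_page_each(40) == 160 * 1024  # 160 KiB, as stated
assert cache_bytes_packed(40) == 4096               # all 40 fit in one page
```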

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 21:05 UTC (Tue) by willy (subscriber, #9762) [Link]

The mmap() interface requires that each file starts at offset 0 in the page.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 7:49 UTC (Tue) by LtWorf (subscriber, #124958) [Link] (3 responses)

> Decades-old folk wisdom says that most files on Unix systems are small; that may be less true that it once was, though.

Using a javascript based software brings you thousands of js files that are around 100 bytes each.

Sidestepping kernel memory management with DMEMFS

Posted Dec 10, 2020 3:19 UTC (Thu) by willy (subscriber, #9762) [Link] (2 responses)

Sounds like .js files should be stored in a .jar file. Or a git-style .pack file. Or ... something. No filesystem (nor OS) handles millions of hundred-byte files nearly as well as it handles hundreds of million-byte files.

Sidestepping kernel memory management with DMEMFS

Posted Dec 10, 2020 15:51 UTC (Thu) by imMute (guest, #96323) [Link] (1 responses)

The "thousands of small files" is only during development. Most JS builds end up merging the source files into one big file for distribution.

This is because HTTP is similarly inefficient at transferring many small files versus one larger one.

Sidestepping kernel memory management with DMEMFS

Posted Dec 10, 2020 19:59 UTC (Thu) by LtWorf (subscriber, #124958) [Link]

Not necessarily… it's completely up to the developers.

See kibana for example (an elasticsearch product). It ships with a staggering amount of js files and random garbage, since the repos of those libraries are all cloned. So it contains licenses, README, .h files, example files, and whatnot.

The power of npm :D (the js tool to pull dependencies).

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 21:17 UTC (Tue) by willy (subscriber, #9762) [Link]

One of the biggest problems with patchsets like dmemfs and other approaches to the problem is that if there is no struct page for a particular blob of memory, Linux is not able to do I/O to it. There are various historical reasons for this, but it's something that's going to have to be addressed. It'll be a huge undertaking. If you've heard of data structures like the 'struct bio', the 'struct scatterlist' and 'struct skbuff', they're full of references to struct pages. So every driver (networking, block, usb) that touches hardware will have to change. It is a daunting prospect.

I think Joao's approach has a much higher chance of success.
https://lore.kernel.org/linux-mm/20201208172901.17384-1-j...
Obviously I'm biased because I've been 'helping' with (aka throwing peanuts at) his approach, but I'd recommend taking a look at it.

Sidestepping kernel memory management with DMEMFS

Posted Dec 8, 2020 21:23 UTC (Tue) by neilbrown (subscriber, #359) [Link] (1 responses)

This all seems reminiscent of NVRAM. If you have terabytes of NVRAM and want to mmap some of it, you need the struct page, but where do you put it?
I recall talk of dynamically allocating struct page on demand, but I don't know what the final outcome was.

Was there a resolution that avoided permanently allocating the struct-page?

If there was, could it be used here? e.g. tell the kernel to treat some chunk of memory just like NVRAM, and allocate struct-page only on demand??

Sidestepping kernel memory management with DMEMFS

Posted Dec 9, 2020 14:52 UTC (Wed) by darnok (subscriber, #20299) [Link]

https://lore.kernel.org/linux-mm/20201208172901.17384-1-j...
https://lore.kernel.org/linux-nvdimm/20200110190313.17144...

There is that - which adds a DEV-DAX and does the same thing but with fewer code changes.

Sidestepping kernel memory management with DMEMFS

Posted Dec 12, 2020 18:51 UTC (Sat) by anton (subscriber, #25547) [Link]

As it happens, I have recently tried to measure the additional internal fragmentation from larger pages. The results are (memory sizes in KB):
 VMAs unique      used    total    8KB   16KB   32KB    64KB
 7552   2333    555964  1033320   6704  22344  56344  125144  machine1
82836  25276   5346060 15707448  76072 223000 514472 1113672  machine2
47017  15425 105490636 60186068  40804 134492 319852  708588  machine3
So the additional cost of going to 8KB pages seems to be quite modest.

One other solution that comes to my mind: File system developers have tried to tackle the overhead of block meta-data by instead using meta-data for extents, which may be larger than blocks, but make space management more complex. Maybe a similar approach could help for reducing page meta-data for memory.

How is RAM usage for the Kernel's MM a problem?

Posted Dec 15, 2020 19:10 UTC (Tue) by pr1268 (guest, #24648) [Link] (1 responses)

A laptop with 16GB installed has 4,194,304 pages, meaning that 256MB of memory is used just to keep track of memory. [...] a hosting provider running servers with 320GB of installed memory is losing 5GB of that memory to page structures.

I don't get it... It seems that since time immemorial [1], 64-bit Linux has always used 1/64 of physical RAM just to maintain page structures. No matter if one has 512 MB or 64 GB, that system is using 1/64 (=1.5625%) of RAM to keep track of the RAM.

So that begs the question: How come this is *just now* becoming an issue? And are you really losing the memory? After all, with pretty much any shared computing resource, there's always going to be some administrative overhead to maintaining said resource.

[1] At least since 1994, when Linux was ported to the 64-bit DEC Alpha (IIRC; someone correct me if I'm wrong).

How is RAM usage for the Kernel's MM a problem?

Posted Dec 15, 2020 20:29 UTC (Tue) by james (guest, #1325) [Link]

With virtualisation, you're paying the tax twice — once in the guest and once in the host.

As you say, you pretty much need to pay that for a general-purpose operating system. But for an OS that is only hosting standardised virtual machines running known workloads, you don't need to. I think the point of the "5 GB" is that we're now getting to the point where the tax in the host is roughly the amount you need for another guest.

Sidestepping kernel memory management with DMEMFS

Posted Dec 21, 2020 15:43 UTC (Mon) by rlhamil (guest, #6472) [Link]

Various modern CPUs can support multiple page sizes. And OSs can not just use those internally, but make them available to processes. Solaris has allowed stack and heap page size preferences (from those available on the particular hardware) to be set to non-default values, and mmap(2) together with memcntl(2) to achieve that effect for mmap() calls. Used prudently, that can considerably reduce page table entries required. I have the impression that on some CPUs, that can be used by a hypervisor too, handy insofar as VMs are big memory users. I think VirtualBox can do that on a Solaris host, but only with a suitable Intel CPU (not with those lacking the needed features nor with an AMD, even if it has some virtualization acceleration features).


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds