Transparent hugepages

By Jonathan Corbet
October 28, 2009

Most Linux systems divide memory into 4096-byte pages; for the bulk of the memory management code, that is the smallest unit of memory which can be manipulated. 4KB is an increase over what early virtual memory systems used; 512 bytes was once common. But it is still small relative to both the amount of physical memory available on contemporary systems and the working set size of applications running on those systems. That means that the operating system has more pages to manage than it did some years back.

Most current processors can work with pages larger than 4KB. There are advantages to using larger pages: the size of page tables decreases, as does the number of page faults required to get an application into RAM. There is also a significant performance advantage that derives from the fact that large pages require fewer translation lookaside buffer (TLB) slots. These slots are a highly contended resource on most systems; reducing TLB misses can improve performance considerably for a number of large-memory workloads.
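
The arithmetic behind those advantages is easy to work through. What follows is a rough sketch only: the 4GB working set and the 64-entry TLB are assumptions chosen for illustration, not figures from any particular processor, and the function names are made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Page sizes discussed in the article (x86).
PAGE_SIZES = {"4KB": 4 * KIB, "2MB": 2 * MIB, "1GB": GIB}


def pages_needed(region_bytes, page_size):
    """Number of pages required to map region_bytes, rounded up."""
    return -(-region_bytes // page_size)


def tlb_reach(entries, page_size):
    """Bytes of address space covered if every TLB entry maps one page."""
    return entries * page_size


if __name__ == "__main__":
    working_set = 4 * GIB  # assumed working set, illustration only
    tlb_entries = 64       # assumed TLB size; real hardware varies
    for name, size in PAGE_SIZES.items():
        print(name, pages_needed(working_set, size), tlb_reach(tlb_entries, size))
```

Under these assumptions, a 4GB working set needs over a million 4KB pages but only 2048 pages of 2MB each, and the same 64 TLB entries cover 128MB instead of 256KB.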

There are also disadvantages to using larger pages. The amount of wasted memory will increase as a result of internal fragmentation; extra data dragged around with sparsely-accessed memory can also be costly. Larger pages take longer to transfer from secondary storage, increasing page fault latency (while decreasing page fault counts). The time required to simply clear very large pages can create significant kernel latencies. For all of these reasons, operating systems have generally stuck to smaller pages. Besides, having a single, small page size simply works and has the benefit of many years of experience.
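
Internal fragmentation can be illustrated the same way. This is a simplified sketch which assumes that a request is simply rounded up to a whole number of pages; the 100,000-byte request is an arbitrary example value.

```python
def wasted_bytes(request_bytes, page_size):
    """Bytes allocated but unused when a request is rounded up to whole pages."""
    remainder = request_bytes % page_size
    return 0 if remainder == 0 else page_size - remainder


if __name__ == "__main__":
    request = 100_000  # arbitrary example request, in bytes
    for label, size in (("4KB", 4096), ("2MB", 2 * 1024 * 1024)):
        print(label, wasted_bytes(request, size))
```

With 4KB pages the example request wastes 2400 bytes; backed by a single 2MB page, it wastes nearly all of that page.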

There are exceptions, though. The mapping of kernel virtual memory is done with huge pages. And, for user space, there is "hugetlbfs," which can be used to create and use large pages for anonymous data. Hugetlbfs was added to satisfy an immediate need felt by large database management systems, which use large memory arrays. It is narrowly aimed at a small number of use cases, and comes with significant limitations: huge pages must be reserved ahead of time, cannot transparently fall back to smaller pages, are locked into memory, and must be set up via a special API. That worked well as long as the only user was a certain proprietary database manager. But there is increasing interest in using large pages elsewhere; virtualization, in particular, seems to be creating a new set of demands for this feature.

A host setting up memory ranges for virtualized guests would like to be able to use large pages for that purpose. But if large pages are not available, the system should simply fall back to using lots of smaller pages. It should be possible to swap large pages when needed. And the virtualized guest should not need to know anything about the use of large pages by the host. In other words, it would be nice if the Linux memory management code handled large pages just like normal pages. But that is not how things happen now; hugetlbfs is, for all practical purposes, a separate, parallel memory management subsystem.

Andrea Arcangeli has posted a transparent hugepage patch which attempts to remedy this situation by removing the disconnect between large pages and the regular Linux virtual memory subsystem. His goals are fairly ambitious: he would like an application to be able to request large pages with a simple madvise() system call. If large pages are available, the system will provide them to the application in response to page faults; if not, smaller pages will be used.
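
From user space, such a request might look something like the sketch below. It makes some assumptions: it uses the MADV_HUGEPAGE flag name found in later mainline kernels (the RFC patch discussed here may spell things differently), it needs Python 3.8 or later on Linux for mmap.madvise(), and map_with_hugepage_hint() is a hypothetical helper written for this example. Since the call is only advice, the code treats a missing or rejected flag as a non-event and carries on with ordinary pages.

```python
import mmap

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumed 2MB huge page size (x86)


def map_with_hugepage_hint(length):
    """Map anonymous memory and ask, as a hint only, for huge pages.

    Returns (mapping, hinted). hinted is False when the platform lacks
    MADV_HUGEPAGE or the kernel rejects the advice; the mapping is
    usable with normal pages either way.
    """
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # absent on some platforms
    hinted = False
    if advice is not None:
        try:
            buf.madvise(advice)  # applies to the whole mapping
            hinted = True
        except OSError:
            pass  # kernel declined the hint; fall back silently
    return buf, hinted


if __name__ == "__main__":
    buf, hinted = map_with_hugepage_hint(2 * HUGE_PAGE_SIZE)
    buf[:5] = b"hello"  # touching the memory triggers the page fault
    print("hint accepted:", hinted)
    buf.close()
```

Note that a successful madvise() call does not say whether large pages were actually used; that decision is made at page-fault time, which is the point of the "transparent" design.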

Beyond that, the patch makes large pages swappable. That is not as easy as it sounds; the swap subsystem is not currently able to deal with memory in anything other than PAGE_SIZE units. So swapping out a large page requires splitting it into its component parts first. This feature works, but not everybody agrees that it's worthwhile. Christoph Lameter commented that workloads which are performance-sensitive go out of their way to avoid swapping anyway, but that may become less true on a host filling up with virtualized guests.
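
As a toy model of what splitting implies (and emphatically not the kernel's implementation), the sketch below breaks one 2MB page into the PAGE_SIZE units that the swap subsystem can handle. The constants assume x86 with 4KB base pages, and split_huge_page() is a name invented for this example.

```python
PAGE_SIZE = 4096                   # assumed base page size
HUGE_PAGE_SIZE = 2 * 1024 * 1024   # assumed huge page size


def split_huge_page(base):
    """Return the start addresses of the base pages making up one huge page."""
    if base % HUGE_PAGE_SIZE != 0:
        raise ValueError("huge page base must be aligned to HUGE_PAGE_SIZE")
    return [base + offset for offset in range(0, HUGE_PAGE_SIZE, PAGE_SIZE)]


if __name__ == "__main__":
    parts = split_huge_page(4 * HUGE_PAGE_SIZE)
    print(len(parts), hex(parts[0]), hex(parts[-1]))
```

Each 2MB page thus becomes 512 independent swap candidates.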

A future feature is transparent reassembly of large pages. If such a page has been split (or simply could not be allocated in the first place), the application will have a number of smaller pages scattered in memory. Should a large page become available, it would be nice if the memory management code would notice and migrate those small pages into one large page. This could, potentially, even happen for applications which have never requested large pages at all; the kernel would just provide them by default whenever it seemed to make sense. That would make large pages truly transparent and, perhaps, decrease system memory fragmentation at the same time.

This is an ambitious patch to the core of the Linux kernel, so it is perhaps amusing that the chief complaint seems to be that it does not go far enough. Modern x86 processors can support a number of page sizes, up to a massive 1GB. Andrea's patch is currently aiming for the use of 2MB pages, though - quite a bit smaller. The reasoning is simple: 1GB pages are an unwieldy unit of memory to work with. No Linux system that has been running for any period of time will have that much contiguous memory lying around, and the latency involved with operations like clearing pages would be severe. But Andi Kleen thinks this approach is short-sighted; today's massive chunk of memory is tomorrow's brief email. Andi would rather that the system not be designed around today's limitations; for the moment, no agreement has been reached on that point.
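
Some quick numbers show why 1GB pages are harder to work with. The sketch assumes 4KB base pages and, purely for illustration, a memory-clearing rate of 10GB per second; real rates depend heavily on the hardware.

```python
PAGE_SIZE = 4096
MIB = 1024 * 1024
GIB = 1024 * MIB


def contiguous_base_pages(huge_page_size, base_page_size=PAGE_SIZE):
    """Base pages that must be physically contiguous to form one huge page."""
    return huge_page_size // base_page_size


def clear_time_seconds(huge_page_size, bytes_per_second):
    """Time to zero one huge page at the given (assumed) clearing rate."""
    return huge_page_size / bytes_per_second


if __name__ == "__main__":
    rate = 10 * GIB  # assumed clearing rate, bytes per second
    for label, size in (("2MB", 2 * MIB), ("1GB", GIB)):
        print(label, contiguous_base_pages(size), clear_time_seconds(size, rate))
```

A 2MB page needs 512 contiguous base pages; a 1GB page needs 262,144 of them, and at the assumed rate it takes a tenth of a second just to clear.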

In any case, this patch is an early RFC; it's not headed toward the mainline in the near future. It's clearly something that Linux needs, though; making full use of the processor's capabilities requires treating large pages as first-class memory-management objects. Eventually we should all be using large pages - though we may not know it.

!this

Posted Oct 29, 2009 8:12 UTC (Thu) by jzbiciak (guest, #5246) [Link] (4 responses)

today's massive chunk of memory is tomorrow's brief email.

I can't say I look forward to the 2MB (and later, 1GB) "Me too!" email. Or as the kids seem to say these days, "THIS."

!this

Posted Oct 31, 2009 23:37 UTC (Sat) by man_ls (guest, #15091) [Link] (3 responses)

Tell that to Outlook and its insistence on pasting BMP versions of images around. I am guilty of sending multi-MB messages like this at work, every time I send a screenshot for clarification. The standard solution: paste the images into a Word document and send that instead. Ick.

!this

Posted Nov 1, 2009 1:46 UTC (Sun) by jzbiciak (guest, #5246) [Link] (2 responses)

You can save other formats from MS Paint (including PNG!), and Outlook will let you attach them. (Open MS Paint, take your screen shot with Alt-PrtSc, hit Ctrl-V in Paint followed by Save As... Voila!) That's significantly less hurl-inducing than Word-encapsulated JPGs.

Also, there's a Windows version of GIMP that also works well for acquiring and cropping screenshots.

That said, I still don't want to see the 1GB "Me too!" email.

!this

Posted Nov 1, 2009 7:18 UTC (Sun) by man_ls (guest, #15091) [Link] (1 responses)

I have the GIMP for Windows and use it for serious work; it works very well. But the 5-second "alt+print screen, ctrl+v" message that doesn't even touch the hard disk, and that contains a bug report or a bug resolution, still takes about 5 MB. There, my timesaver is your spacecruncher.

!this

Posted Nov 1, 2009 15:17 UTC (Sun) by jzbiciak (guest, #5246) [Link]

Fair enough.

Mind you, I was comparing launching MS Paint vs. launching MS Word, since you mentioned sending screenshots wrapped in a Word document. (And I only mentioned GIMP in case you wanted to get fancier than what MS Paint lets you do.)

Transparent hugepages

Posted Oct 30, 2009 8:21 UTC (Fri) by MisterIO (guest, #36192) [Link] (2 responses)

Couldn't they add this feature in a parameterized form? Then the maximum hugepage size could be chosen at compile time.

Transparent hugepages

Posted Oct 30, 2009 14:53 UTC (Fri) by dtlin (subscriber, #36537) [Link]

Only if somebody writes the code first. Right now I think it's one-size-only.

Transparent hugepages

Posted Nov 8, 2009 3:29 UTC (Sun) by butlerm (subscriber, #13312) [Link]

Hugepages here rely on multiple page size support in hardware to minimize
translation lookaside buffer (TLB) overhead in the CPU. As such, you can
only use the size or handful of sizes the CPU supports.

It is possible to do something similar in software alone, to create larger
than normal "virtual" pages, but while that may have certain internal
efficiencies, it doesn't reduce the TLB lookup overhead of all the smaller
physical pages at all.

Transparent hugepages

Posted Oct 30, 2009 16:04 UTC (Fri) by alejluther (subscriber, #5404) [Link]

It seems hugepages are going to become more important in the future, and supporting them properly will take a lot of work, including major changes to the kernel.

Isn't such a feature reason enough to start a 2.7 kernel series?

Transparent hugepages

Posted Oct 30, 2009 16:33 UTC (Fri) by giraffedata (guest, #1954) [Link]

he would like an application to be able to request large pages with a simple madvise() system call

That sounds like an abuse of madvise(). madvise() isn't supposed to instruct the OS on how to provide virtual memory. It's supposed to advise the OS on how the process will use the memory. "I will access this range uniformly" would be something that might inspire the OS to use large pages.

Transparent hugepages

Posted Oct 31, 2009 7:23 UTC (Sat) by ch (guest, #4097) [Link] (1 responses)

What systems used 512-byte pages?

Transparent hugepages

Posted Oct 31, 2009 13:12 UTC (Sat) by avik (guest, #704) [Link]

VAX.

Note to lwn.net comment filter: there actually is text in this comment.