Large pages, large blocks, and large problems

By Jonathan Corbet
September 19, 2007

Most of the core virtual memory subsystem developers met for a mini-summit just before the 2007 Kernel Summit in Cambridge. They came away feeling that they had resolved a number of VM scalability problems. Subsequent discussions have made it clear that, perhaps, this conclusion was a bit premature. They may well have resolved things, but it is not clear that everybody came to the same resolution.

All of the issues at hand relate to scalability in one way or another. While the virtual memory subsystem has been through a great many changes aimed at making it work well on contemporary systems, one key aspect of how it works has remained essentially unchanged since the beginning: the 4096-byte (on most architectures) page size. Over that time, the amount of memory installed on a typical system has grown by about three orders of magnitude - that's 1000 times more pages that the kernel must manage and 1000 times more page faults which must be handled. Since it does not appear that this trend will stop soon, there is a clear scalability problem which must be managed.

This problem is complicated by the way that Linux tends to fragment its memory. Almost all memory allocations are done in units of a single page, with the result that system RAM tends to get scattered into large numbers of single-page chunks. The kernel's memory allocator tries to keep larger groups of pages together, but there are limits to how successful it can be. The file /proc/buddyinfo can be illustrative here; on a system which has been running and busy for a while, the number of higher-order (larger) pages, as shown in the rightmost columns, will be very small.

The main response to memory fragmentation has been to avoid higher-order allocations at almost any cost. There are very few places in the kernel where allocations of multiple contiguous pages are done. This approach has worked for some time, but avoiding larger allocations does not always make the need for such allocations go away. In fact, there are many things which could benefit from larger contiguous memory areas, including:

Applications which use large amounts of memory will be working with large numbers of pages. The translation lookaside buffer (TLB) in the CPU, which speeds virtual address lookups, is generally relatively small, to the point that large applications run up a lot of time-consuming TLB misses. Larger pages require fewer TLB entries, and will thus result in faster execution. The hugetlbfs extension was created for just this purpose, but it is a specialized mechanism used by few applications, and it does not do anything special to make large contiguous memory regions easier for the kernel to find.
I/O operations can work better with larger contiguous chunks of data to work with. Users trying to use "jumbo frames" (extra-large packets) on high-performance network adapters have been experiencing problems for a while. Many devices are limited in the number of scatter/gather entries they support for a single operation, so small buffers limit the overall I/O operation size. Disk devices are pushing toward larger sector sizes which would best be supported by larger contiguous buffers within the kernel.
Filesystems are feeling pressure to use larger block sizes for a number of performance reasons. This message from David Chinner provides an excellent explanation of why filesystems benefit from larger blocks. But it is hard (on Linux) for a filesystem to work with block sizes larger than the page size; XFS does it, but the resulting code is seen as non-optimal and is not as fast as it could be. Most other filesystems do not even try; as a result, an ext3 filesystem created on a system with 8192-byte pages cannot be mounted on a system with smaller pages.

None of these issues are a surprise; developers have seen them coming for some time. So there are a number of potential solutions waiting on the wings. What is lacking is a consensus on which solution is the best way to go.

One piece of the puzzle may be Mel Gorman's fragmentation avoidance work, which has been discussed here more than once. Mel's patches seek to separate allocations which can be moved in physical memory from those which cannot. When movable allocations are grouped together, the kernel can, when necessary, create higher-order groups of pages by relocating allocations which are in the way. Some of Mel's work is in 2.6.23; more may be merged for 2.6.24. The lumpy reclaim patches, also in 2.6.23, encourage the creation of large blocks by targeting adjacent pages when memory is being reclaimed.

The immediate cause for the current discussion is a new version of Christoph Lameter's large block size patches. Christoph has filled in the largest remaining gap in that patch set by implementing mmap() support. This code enables the page cache to manage chunks of file data larger than a single page which, in turn, addresses many of the I/O and filesystem issues. Christoph has given a long list of reasons why this patch should be merged, but agreement is not universal.

At the top of the list of objections would appear to be the fact that the large block size patches require the availability of higher-order pages to work; there is no fallback if memory becomes sufficiently fragmented that those allocations are not available. So a system which has filesystems using larger block sizes will fall apart in the absence of large, contiguous blocks of memory - and, as we have seen, that is not an uncommon situation on Linux systems. The fragmentation avoidance patches can improve the situation quite a bit, but there is no guarantee that [PULL QUOTE: If this patch set is merged, some developers want it to include a loud warning to discourage users from actually expecting it to work. END QUOTE] fragmentation will not occur, either as a result of the wrong workload or a deliberate attack. So, if this patch set is merged, some developers want it to include a loud warning to discourage users (and distributors) from actually expecting it to work.

An alternative is Nick Piggin's fsblock work. People like to complain about the buffer head layer in current kernels, but that layer has a purpose: it tracks the mapping between page cache blocks and the associated physical disk sectors. The fsblock patch replaces the buffer head code with a new implementation with the goals of better performance and cleaner abstractions.

One of the things fsblock can do is support large blocks for filesystems. The current patch does not use higher-order allocations to implement this support; instead, large blocks are made virtually contiguous in the vmalloc() space through a call to vmap() - a technique used by XFS now. The advantage of using vmap() is that the filesystem code can see large, contiguous blocks without the need for physical adjacency, so fragmentation is not an issue.

On the other hand, using vmap() is quite slow, the address space available for vmap() on 32-bit systems is small enough to cause problems, and using vmap() does nothing to help at the I/O level. So Nick plans to extend fsblock to implement large blocks with contiguous allocations, but with a fallback to vmap() when large allocations are not available. In theory, this approach should be be best of both worlds, giving the benefits of large blocks without unseemly explosions in the presence of fragmentation. Says Nick:

However fsblock can do everything that higher order pagecache can do in terms of avoiding vmap and giving contiguous memory to block devices by opportunistically allocating higher orders of pages, and falling back to vmap if they cannot be satisfied.

From the conversation, it seems that a number of developers see fsblock as the future. But it is not something for the near future. The patch is big, intrusive, and scary, which will slow its progress (and memory management patches have a tendency to merge at a glacial pace to begin with). It lacks the opportunistic large block feature. Only the Minix filesystem has been updated to use fsblock, and that patch was rather large. Everybody (including Nick) anticipates that more complex filesystems - those with features like journaling - will present surprises and require changes of unknown size. Fsblock is not a near-term solution.

One recently-posted patch from Christoph could help fill in some of the gaps. His "virtual compound page" patch allows kernel code to request a large, contiguous allocation; that request will be satisfied with physically contiguous memory if possible. If that memory is not available, virtually contiguous memory will be returned instead. Beyond providing opportunistic large block allocation for fsblock, this feature could conceivably be used in a number of places where vmalloc() is called now, resulting in better performance when memory is not overly fragmented.

Meanwhile, Andrea Arcangeli has been relatively quiet for some time, but one should not forget that he is the author of much of the VM code in the kernel now. He advocates a different approach entirely:

From my part I am really convinced the only sane way to approach the VM scalability and larger-physically contiguous pages problem is the CONFIG_PAGE_SHIFT patch (aka large PAGE_SIZE from Hugh for 2.4).

The CONFIG_PAGE_SHIFT patch is a rework of an old idea: separate the size of a page as seen by the operating system from the hardware's notion of the page size. Hardware pages can be clustered together to create larger software pages which, in turn, become the basic unit of memory management. If all pages in the system were, say, 64KB in length, a 64KB buffer would be a single-page allocation with no fragmentation issues at all.

If the system is to go to larger pages, creating them in software is about the only option. Most processors support more than one hardware page size, but the smallest of the larger page sizes tend to be too large for general use. For example, i386 processors have no page sizes between 4KB and 2MB. Clustering pages in software enables the use of more reasonable page sizes and creates the flexibility needed to optimize the page size for the expected load on the system. This approach will make large block support easy, and it will help with the I/O performance issues as well. Page clustering is not helpful for TLB pressure problems, but there is little to be done there in any sort of general way.

The biggest problem, perhaps, with page clustering is that it replaces external fragmentation with internal fragmentation. A 64KB page will, when used as the page cache for a 1KB file, waste 63KB of memory. There are provisions in Andrea's patch for splitting large pages to handle this situation; Andrea claims that this splitting will not lead to the same sort of fragmentation seen on current systems, but he has not, yet, convinced the others of this fact.

Conclusions from this discussion are hard to come by; at one point Mel Gorman asked: "Are we going to agree on some sort of plan or are we just going to handwave ourselves to death?" Linus has just called the whole discussion "idiotic". What may happen is that the large block size patches go in - with warnings - as a way of keeping a small subset of users happy and providing more information about the problem space. Memory management hacking requires a certain amount of black-magic handwaving in the best of times; there is no reason to believe that the waving of hands is going to slow down anytime soon this time around.

Index entries for this article
Kernel	fsblock
Kernel	Huge pages
Kernel	Memory management/Large allocations

to post comments

Large pages, large blocks, and large problems

Posted Sep 19, 2007 16:22 UTC (Wed) by james (guest, #1325) [Link] (20 responses)

Some of Linus' thoughts (presumably) can be found at Real World Technologies (and associated thread):

[The performance cost of] Page table handling stays pretty constant - you basically get TLB misses proportionately to your data size, which means that the more TLB misses you get, the more data cache misses you get!
So realistically, TLB costs are never going to grow in any unbounded kind of manner - they are always limited by (and generally much smaller than) the D$ costs! There are loads that are more TLB-intensive than others (and loads that are more D$ intensive, of course!), but in the end, TLB's aren't the problem.
Unless the CPU micro-architecture is unbalanced, of course. There have certainly been uarchs that increased the cache size a lot without increasing the TLB size. Now, of course they'll be TLB-limited! But that's not really a fundamental issue, it's just an unbalanced design.

and

You want good cache behavior if you have 256GB of memory, or your performance will suck. It's that easy. And if you have good locality in the D$, then the TLB's will work fine.

Large pages, large blocks, and large problems

Posted Sep 19, 2007 16:51 UTC (Wed) by avik (guest, #704) [Link] (4 responses)

It's an oversimplification. Suppose you have a large memory machine and
you're doing completely random access. With 4KB pages, you'll get a tlb
miss and two cache hits (one for the data and one for the pte). With
large pages, the page tables can be all cached and you only take a tlb
miss and a single cache miss.

So for this contrived workload, you get a ~2X speedup by using large
pages. Obviously real workloads will get less speedup, but it is still
significant.

Another way of stating this is that large pages don't just increase the
coverage of the tlb, they also increase the pagetable coverage of the
data cache, which can be much more significant.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 0:52 UTC (Thu) by sayler (guest, #3164) [Link] (1 responses)

Yes, *but* you have a commensurate increase in the cost (in silicon area, latency, validation) to support large-page-TLB entries. IIRC (from the above RWT thread) modern Intel processors support only 2 TLB entries for large pages!

In other words, you may lose some performance by going to large pages because of resource constraints.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 5:51 UTC (Thu) by avik (guest, #704) [Link]

I'm talking about current hardware, not proposing changes to hardware.

The workload I described will gain a 2X boost from using large pages
regardless of the tlb's allocation of large and small pages. And while it
is not a real-life workload, others have demonstrated nice performance
improvements with large pages.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 17:17 UTC (Fri) by vonbrand (subscriber, #4458) [Link] (1 responses)

The cost is the same once the data is in RAM. Loading and storing large pages is costlier. It is not at all that simple.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 19:48 UTC (Fri) by avik (guest, #704) [Link]

My comment (and Linus' remarks) is talking about large pages, not large
blocks. The assumption is that this is a memory workload, not an I/O
workload. There's no disk I/O involved.

The story repeats itself

Posted Sep 19, 2007 17:00 UTC (Wed) by khim (subscriber, #9252) [Link] (6 responses)

Deja vu: the same thing happened when question about more the 4GB of memory in 32-bit system was raised. Of course it's insane to try to handle 16GiB or 32GiB with 32bit CPU. Of course the proper solution is switch to 64bit CPU. But when the most popular arch is 32-bit and users need huge memory systems - what can you do ?

Here we have the same situation. Most CPUs are balanced (PPC, Athlon64, etc), but there are one vendor which sells two architectures of CPU which are seriously unbalanced. Can we safely ignore this obscure vendor and these crippled CPUs ? Not when vendor is called Intel and CPUs are Pentium 4 and Core 2. They both only have 128-items TLB (enough for 512KiB with 4KiB pages) and 4MiB of cache (at least in some models). That's 8 times difference! Yes, this is insane, yes, this is problem of CPU design. Yet when it's the most popular vendor and the "crippled architecture" is the most popular CPU from said vendor - you can not just ignore the problem and hope that it'll go away.

There are rumors that Vista SP1 will use 4MB pages to speedup I/O...

The story repeats itself

Posted Sep 19, 2007 17:23 UTC (Wed) by proski (guest, #104) [Link] (4 responses)

Isn't Core 2 64-bit? That problem is going away. And 16 Gb of RAM costs significantly more than a CPU and a motherboard that can support it properly, and it has always been like that.

The same applies to the cache. Changing the CPU will cost fraction of the memory it's supposed to support.

The story repeats itself

Posted Sep 19, 2007 17:38 UTC (Wed) by khim (subscriber, #9252) [Link] (3 responses)

Core 2 is 64bit only, but people needed 16GiB of RAM years ago when x86-compatible 64bit CPUs were just a project. Thus PAE support was added to Linux.

The same - with TLB today: may be someday we'll have the truly balanced architecture but today - we don't. Add it's not clear if we'll have balanced architecture tomorrow: TLB must be fast (or else it's useless) but it's hard to create large and fast cache. Of course it's possible to use 2-level TLB (like AMD does today), but it's not some minor modification - it's possible that we'll be forced to wait few years till the new Intel's design. And all these years Linux will be worse then Windows... not a good position...

The story repeats itself

Posted Sep 19, 2007 22:44 UTC (Wed) by proski (guest, #104) [Link]

I don't see any references to Windows in the story.

The story repeats itself

Posted Sep 20, 2007 0:52 UTC (Thu) by sayler (guest, #3164) [Link] (1 responses)

"Core 2 is 64bit only"

No.

Oops

Posted Sep 20, 2007 21:09 UTC (Thu) by khim (subscriber, #9252) [Link]

Perils of editing. Initially I wanted so say that only "Core 2" is 64bit while Pentium 4 (except latest models of Prescott), Pentium 3 and so on are 32bit. Phrase was too cumbersome and I've removed everything but the "only" word was left after editing...

The story repeats itself

Posted Sep 27, 2007 13:31 UTC (Thu) by anton (subscriber, #25547) [Link]

[...] Pentium 4 and Core 2. They both only have 128-items TLB (enough for 512KiB with 4KiB pages)

Note that this assumes that each page is fully utilized with hot data, i.e., the best case. That's rarely the case and that's why cache lines are smaller than a page. So, in the worst case, 128 TLB entries cover only 128 cache lines, i.e., with 64-byte L1 cache lines, 8KB. I have no data on typical cases, but if half the page contains hot data, a 128-entry TLB will cover only 256KB of data in the caches. You can then hope that your working set is smaller, or you will see TLB thrashing.

If page utilization is a problem, larger pages will probably have limited benefit: the relative utilization (hot data/page size) will probably go down. However, as long as the absolute utilization (hot data/page) goes up, larger pages will still be useful in terms of reducing TLB misses; they do have other costs, though.

Large pages, large blocks, and large problems

Posted Sep 19, 2007 18:39 UTC (Wed) by eSk (guest, #45221) [Link] (7 responses)

I've never understood Linus' aversion against general support for superpages. He always seems to make use of this argument that it does not matter (performance or otherwise). However, at the OSDI '02 Navarro published a very thorough analysis of how superpage support helped performance by increasing TLB coverage [1]. The page sizes used were 8KB, 64KB, 512KB, and 4MB and the whole thing was implemented in FreeBSD. He also described the techiques used for promoting and demoting superpages (i.e., automatically converting beween different page sizes as needed). He didn't even talk about other advantages of using superpages (e.g., for supporting large blocks), but the results he obtained were still impressive. BTW, another side effect of using superpages is that you tend to get better cache utilization because you automatically get a cache coloring effect by using the larger page sizes.

[1] Navarro et al., "Practical, transparent operating system support for superpages", OSDI '02.
http://www.usenix.org/events/osdi2002/tech/full_papers/na...

Large pages, large blocks, and large problems

Posted Sep 20, 2007 0:54 UTC (Thu) by sayler (guest, #3164) [Link] (6 responses)

"This system is implemented in FreeBSD on the Alpha architecture," The Alpha has (had?) good support for higher-order page allocation at the hardware level that is not currently present in current-gen Intel and AMD chips.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 11:13 UTC (Thu) by eSk (guest, #45221) [Link] (1 responses)

Sure. Running these experiments on any x86-lineage chip would obviously not work (except perhaps in a simulator) because of lacking the wider range of page sizes. The point I was trying to make is that Linus argues strongly against the usefulness of any page size above 4K/8K, except for some special cases where very large superpages are used explicitly.

Large pages, large blocks, and large problems

Posted Sep 27, 2007 13:43 UTC (Thu) by anton (subscriber, #25547) [Link]

Running these experiments on any x86-lineage chip would obviously not work (except perhaps in a simulator) because of lacking the wider range of page sizes.

On a machine that has plenty of memory for what's running on it, a variation of the Navarro approach might still be useful even on an IA32/AMD64 machine. The OS would rarely feel enough memory pressure to consider splitting a large page.

Also, since Linux also supports architectures that have finer-grained page size steps, it should not look just at what IA32/AMD64 support. And once a popular OS like Linux supports it, the hardware designers at Intel and AMD can better justify adding additional page sizes to their hardware.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 13:42 UTC (Thu) by zlynx (guest, #2285) [Link] (3 responses)

Intel's Itanium supports page sizes in all powers of 2 from 4K to 256M. That's a current-gen Intel chip.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 14:58 UTC (Fri) by jamesh (guest, #1159) [Link]

It may be present generation, but any software developer who made decisions based on the assumption of increased adoption of Itanium is crazy.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 17:21 UTC (Fri) by vonbrand (subscriber, #4458) [Link] (1 responses)

Current generation intel chips are clones of AMD64.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 18:33 UTC (Fri) by zlynx (guest, #2285) [Link]

Not really, or they would have included more of the good parts like the IOMMU and memory controller.

Large pages, large blocks, and large problems

Posted Sep 19, 2007 20:44 UTC (Wed) by vblum (guest, #1151) [Link] (3 responses)

How do applications with large allocated arrays for numerics (say) get around the page size / memory fragmentation issue? Wouldn't one expect them to experience some slowdown when the system's memory gets heavily fragmented, if the basic arrays are significantly larger than the cache size? just wondering ...

Large pages, large blocks, and large problems

Posted Sep 19, 2007 22:06 UTC (Wed) by i3839 (guest, #31386) [Link] (2 responses)

Only the kernel accesses memory by physical addresses, userspace processes use virtual memory, so "fragmented memory" can look like one contiguous chunk to them.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 9:18 UTC (Thu) by vblum (guest, #1151) [Link] (1 responses)

Yes, masking fragmentation at a low level will make things look contiguous, but in reality someone somewhere has to do the work to bring the fragments together. Is the adverse impact on performance (wall-clock time for large operations) really negligible? (thanks to clever cache use?)

Large pages, large blocks, and large problems

Posted Sep 20, 2007 12:55 UTC (Thu) by i3839 (guest, #31386) [Link]

Not really.

There is always a mapping made between the virtual addresses and the physical ones, so what's different is which physical pages are chosen. Choosing contiguous physical pages is harder and takes more time, but it doesn't give any advantages. Bringing the fragments together is done by mapping them to a contiguous virtual memory range.

Mapping to contiguous physical pages might be faster if the hardware supports (really) variable sized pages, but as far as I know none does.

Fragmentation doesn't slow things down, it only makes it harder to allocate chunks bigger than one pagesize.

The demand for bigger pagesizes comes from people who think it will reduce overhead for their workload, because everything that's done per page can be done less often. The main thing preventing this from working seamlessly is fragmentation.

Large pages, large blocks, and large problems

Posted Sep 19, 2007 23:42 UTC (Wed) by riddochc (guest, #43) [Link] (5 responses)

Memory management is not something I know much about - relevant classes I took at the university were limited to pre-386 assembly and digital logic, so I'm not even sure that my understanding of a "page" is accurate.

Can someone recommend reading materials (preferably free, but don't rule something really good out just on that account) on this stuff?

And otherwise, I'm curious how the proposed changes would affect the work of people who are trying to get Linux to scale down to smaller devices?

Large pages, large blocks, and large problems

Posted Sep 20, 2007 0:49 UTC (Thu) by drag (guest, #31333) [Link] (2 responses)

> And otherwise, I'm curious how the proposed changes would affect the work of people who are trying to get Linux to scale down to smaller devices?

I donno. All of this stuff is over my head, but I expect it would either have a generally null effect to generally positive effect.

I know that a popular task to put Linux to use for is those little embedded 'NAS' controllers. You know, those things running little ARM proccessor or something lightweight like that were you can shove 3 or 4 SATA drives into and they cost around 100-200 bucks or so.

I know that for Gigabit speed networks, and faster interconnects, one of the major problems you have, in terms of performance, is that they are still using very tiny MTU's originally developed for 10Mbit/s networks. 'Jumbo frames' are were you take the small 1500 bytes and bump the size up to 9500bytes or even higher. This leads to significantly less interrupts being generated by the controller and much less TCP overhead. IF all your hardware and network hardware supports it. You can realy get very significant network performance improvements. Sometimes 2x the performance at half the cpu usage.

Then if you take that further and are able to use large packets with large disk blocks, say that if you strip away the ethernet frame and tcp information the datagram of the packet and the size of the disk block is the same size, then I suppose you can reduce overhead and increase performance even more.

All in all this would allow people to make slower/cheaper proccessors and perform better. Cheaper, faster embedded Linux devices.

Of course this is all very idealized. Lots of switches and NICs don't support jumbo packets, most people will still use Widnows with SMB which is just naturally slow, and most people don't have the abilty to configure the network in this way even if they know how. Plus the sorts of CPU they use I don't know if they would even have those large memory page sizes supported.

Oh well.

> Can someone recommend reading materials (preferably free, but don't rule something really good out just on that account) on this stuff?

Ever checked out http://kernelnewbies.org/
or http://www.linux-books.us/linux_general_0014.php ?

Large pages, large blocks, and large problems

Posted Sep 20, 2007 1:58 UTC (Thu) by sayler (guest, #3164) [Link] (1 responses)

In general, I agree with what you say, but keep in mind that Ethernet frames are inherently variable in size, that is, you can have 1500, 1501, 1502, ... byte frames and the transmission time will increase nearly linearly.

We have much coarser choices for page sizes. Even on Alpha (which apparently did a good job here), page size choices were something like 8k ** 2*N where N ran between 0 and 3..

There is some other somewhat interesting data here: http://lists.freebsd.org/pipermail/freebsd-hackers/2003-O... showing measured {i,d}tlb size for various page sizes on various uArchs.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 15:16 UTC (Fri) by jamesh (guest, #1159) [Link]

It is true that ethernet frames are variable size, but it also states that the maximum payload size is 1500 bytes as the grandparent post says. You need to have some upper limit in order to make hardware that can reliably store and forward packets (as a switch would need to do when forwarding a packet to a slower network).

Ethernet frames larger than 1500 bytes are non-standard and commonly known as "jumbo frames". And as you can guess, they'll only work if all the hardware involved in the link supports the larger frames.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 14:04 UTC (Thu) by lethal (guest, #36121) [Link]

The biggest problem on the small side is transparent usage of large TLBs, the idea being something akin to Andrea's CONFIG_PAGE_SHIFT but relative to the TLB size whilst maintaining normal PAGE_SIZE'ed PTEs. One thing that was tossed about at kernel summit was the idea of having the VM provide base page and range hints for contiguous page frames which could be optimized for in the TLB miss handler for software-loaded TLBs (many embedded systems, where TLBs are very small, for example). Namely, for some extra performance hit in the architecture-specific hot path we have the ability to cut off linear page faults directly, rather than speculatively (this is an important distinction between this approach and the rice superpages as well as the approaches used by HP-UX and IRIX).

The other issue is that the d-cache does grow, and the TLB doesn't always scale accordingly. For heavy shared library and multi-threading apps, folks love to toss on copious amounts of slower cache, to the point where there's insufficient TLB coverage to make it out of cache, and thus, thrashing ensues when small pages are used. On ia64 the answer to this is always to bump up PAGE_SIZE, where 64kB tends to be a requirement to make it out of cache (and these are _huge_ TLBs!). On embedded where the TLBs are orders of magnitude smaller and consistently under pressure, bumping up the page size is simply not an option. We don't want a large page size, we want a large TLB entry size that can span multiple pages in order to reduce the amount of application time we waste on linear faulting.

I brought this up at kernel summit, and Linus supported the idea of VM hinting for page ranges, so it will be interesting to see where this work goes. Not only will such things tie in with Christoph's work, it also operates under the assumption that we're not fragmented out of the box, too. Thus, there's also a dependence on Mel's work, especially if one is to consider ways to passively provide hints during page reclaim or so.

It is worth differentiating between large pages and large TLBs. Large pages on embedded outside of application specific use (ie, hugetlbfs) are generally undesirable. The general embedded case is usually reasonably large memory apertures (relative to TLB and PAGE_SIZE), especially in peripheral space. Then a combination of many small files and some very big ones. The places where we have explicit control over the TLB size (ie, ioremap()) are already handled by the architectures that care, so in terms of transparency, it's simply anonymous and file-backed pages where VM hinting is helpful. Background scanning is mentioned from time to time, but is unrealistic for these applications since the system is usually doing run-time power management, also.

The picture today is certainly much less bleak than it was even just a year ago, but there is still a lot of work to be done.

Large pages, large blocks, and large problems

Posted Sep 23, 2007 23:23 UTC (Sun) by man_ls (guest, #15091) [Link]

I found Henson's article here on LWN quite good: KHB: Transparent support for large pages. The referred paper by Navarro (cited here too) is a joy to read, maybe due to the contribution from Alan Cox.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 0:29 UTC (Thu) by MisterIO (guest, #36192) [Link] (1 responses)

I don't clearly understand what exactly might be the problem if these kind of patches would be accepted in the mainline that makes Linus so much against them.Could someone please explain this?

Large pages, large blocks, and large problems

Posted Sep 27, 2007 13:43 UTC (Thu) by forthy (guest, #1525) [Link]

The reason: Linus is an idiot. Ok, I don't really think so, but he started with the name calling; he should go over to OpenBSD (or better: create his own *BSD) and become another Theo de Raadt ;-). Honestly, what he's doing is factless rants against a discussion that's definitely fact-based. Larger pages help to keep a large memory organized without too many TLB misses. Larger blocks help to keep a large disk with high transfer rates and low seek capability fast. The fact that Linux has big problems with large pages and blocks means probably it's doing it completely wrong. The fact that people discuss this, and write code, even though Linus shouts against them is encouraging. The big flamefest this generates is discouraging. If this also results in a veto from Linus once the code is ready, even worse. Then I think it's time to fork Linux.

Large pages actually make your effective data cache size bigger

Posted Sep 21, 2007 4:21 UTC (Fri) by adsharma (guest, #8967) [Link]

A lot of people tend to think that large pages are about TLB. But for most apps I've seen - they're actually about making the caches bigger on x86.

With 4K pages, the application data and page table pages compete for the data cache. With 2MB pages, the page tables are much smaller - effectively allowing more app data to be cached.