Reworking vmap()

By Jonathan Corbet
October 21, 2008

Kernel memory is normally allocated in relatively small chunks - usually just a single page at a time. As the size of an allocation grows, satisfying that allocation with physically-contiguous pages gets progressively harder. So most of the kernel has been written with an eye toward avoiding the use of large, contiguous allocations. There are times, though, when a large memory array needs to be virtually contiguous, but not necessarily physically contiguous. One example is the allocation of space for loadable modules; any given module should live in a single, contiguous address range, but nobody cares how it's laid out in physical RAM. For cases like this, the kernel provides a set of functions like vmalloc() and vmap().

Functions like vmalloc() have long been known to be somewhat expensive to use. They have to work with a single shared (and limited) address range, and they require making changes to the kernel's page tables. Page table changes, in turn, require translation lookaside buffer (TLB) flushes, which are a costly, all-CPUs operation. So kernel developers have generally tried to avoid using these functions in performance-critical parts of the kernel.

Nick Piggin has noticed, though, that the performance characteristics of vmalloc() and friends are catching up with us. The vmalloc() address space is kept on a linked list and protected by a global lock, which does not scale very well. But the real cost is in freeing memory regions in this space; the ensuing TLB flush must be done using an inter-processor interrupt to every CPU, each of which must then flush its own TLB. People normally do not buy more CPUs unless they have more work to run on them, so systems with more processors will, as a general rule, be performing more mapping and freeing in the vmalloc() range. As systems grow, there will be more global TLB flushes, each of which disrupts more processors. In other words, the amount of work grows proportional to the square of the number of processors - meaning that everything falls down, eventually.

To make things worse, Nick has a longstanding series of patches which, among other things, do a lot of vmap() calls to support larger block sizes in the filesystem layer and page cache. Merging those patches would add significantly to the amount of time the system spends managing the vmalloc() space, which would not be a good thing. So fixing vmalloc() seems like a good thing to do first. As of 2.6.28, Nick has, in fact, fixed the management of kernel virtual allocations.

The first step is to get rid of the linked list and its corresponding global lock. Instead, a red-black tree is used to track ranges of available address space; finding a suitable region can now be done without having to traverse a long list. The tree is still protected by a global lock, which poses potential scalability problems. To avoid this issue, Nick's patch creates a separate, per-CPU list of small address ranges which can be allocated and freed in a lockless manner. New functions must be called to make use of this facility:

    void *vm_map_ram(struct page **pages, unsigned int count, 
                     int node, pgprot_t prot);
    void vm_unmap_ram(const void *mem, unsigned int count);

A call to vm_map_ram() will create a virtually-contiguous mapping for the given pages. The associated data structures will be allocated on the given NUMA node; the memory will have the protection specified in prot. With the version of the patch merged for 2.6.28, mappings of up to 64 pages can be made from the per-cpu lists.

Note that these functions do not allocate memory, they just create a virtual mapping for a given set of pages. They are a replacement for vmap() and vunmap(), not vmalloc() and vfree(). It is probably possible to rewrite vmalloc() to use this mechanism, but that has not happened. So vmalloc() calls still require the acquisition of a global lock.

There's another trick in this patch set which is used by all of the kernel virtual address management functions. Nick realized that it is not actually necessary to flush TLBs across the system immediately after an address range is freed. Since those addresses are being given back to the system, no code will be making use of them afterward, so it does not matter if a processor's TLB contains a stale mapping for them. All that really matters is that the TLB gets cleaned out before those addresses are used again elsewhere. So unmapped regions can be allowed to accumulate, then all flushed with a single operation. That cuts the number of TLB flushes significantly.

How much faster do things run? Nicks patch (the merged version can be found here) contains some benchmark results. With an artificial test aimed at demonstrating the difference, the new code runs 25 times faster. By changing the vmap() code in the XFS filesystem to use vm_map_ram() instead, some workloads were sped up by a factor of twenty. So it seems to work.

Index entries for this article
Kernel	Memory management
Kernel	vmalloc()

to post comments

Reworking vmap()

Posted Oct 27, 2008 9:02 UTC (Mon) by kleptog (subscriber, #1183) [Link] (1 responses)

I've often wondered why processer designers didn't include an instruction to invalidate TLB entires directly. Tagged TLB entries are nice but for stuff like this a simple "throw out any TLB entires relating to address X" would solve most of the problems, no?

Reworking vmap()

Posted Dec 31, 2008 19:03 UTC (Wed) by terminus (subscriber, #49129) [Link]

I believe there are instructions to flush particular TLB entries -- invlpg for one.

AFAICS the trouble in this case stems from the inefficiency of a global TLB flush. Post a vfree() you have changed some global state on your *local* CPU. However, the other remote TLBs don't know about this change in state and will happily resolve references to the vfree'd range unless told otherwise. Thus, the IPIs to inform them to flush TLB.

Reworking vmap()

Posted Oct 30, 2008 9:19 UTC (Thu) by gat3way (guest, #47864) [Link]

I don't think a TLB flush is needed on a uniprocessor machine as far as vmalloc() is concerned.

Furthermore I am not quite sure which is the bigger problem here - flushing TLB on all CPUs....or the vmalloc()'s locking mechanism.

Definitely the systems will 'grow' in means of more kernel memory allocations which in turn means more TLB flushes. However the RAM is getting faster which makes me think that a TLB flush is getting less of a hassle as compared to global locks which can definitely become more and more serious as the number of CPU cores grows in time :)