Reworking vmap()
Functions like vmalloc() have long been known to be somewhat expensive to use. They have to work with a single shared (and limited) address range, and they require making changes to the kernel's page tables. Page table changes, in turn, require translation lookaside buffer (TLB) flushes, which are a costly, all-CPUs operation. So kernel developers have generally tried to avoid using these functions in performance-critical parts of the kernel.
Nick Piggin has noticed, though, that the performance characteristics of vmalloc() and friends are catching up with us. The vmalloc() address space is kept on a linked list and protected by a global lock, which does not scale very well. But the real cost is in freeing memory regions in this space; the ensuing TLB flush must be done using an inter-processor interrupt to every CPU, each of which must then flush its own TLB. People normally do not buy more CPUs unless they have more work to run on them, so systems with more processors will, as a general rule, be performing more mapping and freeing in the vmalloc() range. As systems grow, there will be more global TLB flushes, each of which disrupts more processors. In other words, the amount of work grows proportional to the square of the number of processors - meaning that everything falls down, eventually.
To make things worse, Nick has a longstanding series of patches which, among other things, do a lot of vmap() calls to support larger block sizes in the filesystem layer and page cache. Merging those patches would add significantly to the amount of time the system spends managing the vmalloc() space, which would not be a good thing. So fixing vmalloc() seems like a good thing to do first. As of 2.6.28, Nick has, in fact, fixed the management of kernel virtual allocations.
The first step is to get rid of the linked list and its corresponding global lock. Instead, a red-black tree is used to track ranges of available address space; finding a suitable region can now be done without having to traverse a long list. The tree is still protected by a global lock, which poses potential scalability problems. To avoid this issue, Nick's patch creates a separate, per-CPU list of small address ranges which can be allocated and freed in a lockless manner. New functions must be called to make use of this facility:
void *vm_map_ram(struct page **pages, unsigned int count,
int node, pgprot_t prot);
void vm_unmap_ram(const void *mem, unsigned int count);
A call to vm_map_ram() will create a virtually-contiguous mapping for the given pages. The associated data structures will be allocated on the given NUMA node; the memory will have the protection specified in prot. With the version of the patch merged for 2.6.28, mappings of up to 64 pages can be made from the per-cpu lists.
Note that these functions do not allocate memory, they just create a virtual mapping for a given set of pages. They are a replacement for vmap() and vunmap(), not vmalloc() and vfree(). It is probably possible to rewrite vmalloc() to use this mechanism, but that has not happened. So vmalloc() calls still require the acquisition of a global lock.
There's another trick in this patch set which is used by all of the kernel virtual address management functions. Nick realized that it is not actually necessary to flush TLBs across the system immediately after an address range is freed. Since those addresses are being given back to the system, no code will be making use of them afterward, so it does not matter if a processor's TLB contains a stale mapping for them. All that really matters is that the TLB gets cleaned out before those addresses are used again elsewhere. So unmapped regions can be allowed to accumulate, then all flushed with a single operation. That cuts the number of TLB flushes significantly.
How much faster do things run? Nicks patch (the merged version can be
found here)
contains some benchmark results. With an artificial test aimed at demonstrating
the difference, the new code runs 25 times faster. By changing the
vmap() code in the XFS filesystem to use vm_map_ram()
instead, some workloads were sped up by a factor of twenty. So it seems to
work.
| Index entries for this article | |
|---|---|
| Kernel | Memory management |
| Kernel | vmalloc() |