The BPF allocator runs into trouble
In current kernels, memory space for BPF programs (after JIT translation) is allocated using the same code that allocates space for loadable kernel modules; this would seem to make sense since, in either case, that space will be used for executable code that runs within the kernel. But there is a key difference between those two use cases. Kernel modules are relatively static; they are almost never removed once they have been loaded. BPF programs, instead, can come and go frequently; there can be thousands of loading and unloading events over the life of the system.
That difference turns out to be important. Memory for executable code must, unsurprisingly, have execute permissions set and thus, must also be read-only. That requires this memory to have its own mapping in the page tables, meaning that it must be split out of the kernel's (huge-page) direct mapping. That breaks up the direct map into smaller pages. Over time, this has the effect of fragmenting the direct map, which can affect performance measurably. The main goal for the BPF allocator was to segregate these allocations into a set of dedicated, huge pages and avoid this fragmentation.
Shortly after this code was merged, though, the regression reports, along with more general expressions of concern, started to roll in. That drew the attention of Linus Torvalds and other developers, and revealed a series of problems. While some of those problems were in the BPF allocator itself, the most disruptive issue come down to an older change made in an entirely different subsystem: the vmalloc() allocator.
vmalloc() and huge pages
vmalloc() (along with its inevitable variants) differs from other kernel memory-allocation interfaces in that it returns memory that is virtually contiguous, but which may be physically scattered. It is thus good for larger allocations where the memory need not be physically contiguous. Heavy use of vmalloc() was once discouraged due to its higher overhead and the shortage of available address space on 32-bit systems, but attitudes have changed over time. It is now reasonably common to use vmalloc() as a way of avoiding the possibility that a larger allocation might fail due to memory fragmentation. Functions like kvmalloc(), which will automatically fall back to vmalloc() if an ordinary allocation is not possible, have been added in recent years.
In 2021, Nick Piggin enhanced vmalloc() with the ability to allocate huge pages if the requested size is large enough. One might well wonder why this was useful, since vmalloc() is explicitly meant for cases when the memory need not be physically contiguous; the answer, of course, is that huge pages can give better performance by reducing pressure on the CPU's translation lookaside buffer. The kernel has a few larger allocations that can benefit from this improvement, so it was merged for the 5.13 kernel.
Even at the time, there were some caveats, though. There are places in the kernel that will be unpleasantly surprised by receiving huge pages in response to a vmalloc() call; these include the PowerPC module loader. So Piggin also added a flag, VM_NO_HUGE_VMAP, which requests that only base pages be used. Of course, vmalloc() takes no flags, so the ability to avoid huge-page allocations only could only be accessed via the low-level __vmalloc_node_range() function until vmalloc_no_huge() was added later in the 5.13 cycle. Huge-page allocations were also not enabled for the x86 architecture at that time since nobody had put in the time to look for potential problems there.
The first patch in the BPF-allocator series enabled huge-page allocations in vmalloc() for the x86 architecture; that was needed to make huge pages available to the BPF allocator. It all seemed to work fine until wider testing started to turn up problems; it seems that enabling huge pages in vmalloc() on x86 might not have been the best idea. Except that the problem actually had little to do with the x86 architecture.
When vmalloc() (as it existed at the beginning of the 5.18 cycle) would allocate a huge page in response to a request, the result was a compound page — a set of contiguous base pages that behaves like a single, larger page. These pages are organized differently; most of the information regarding their use is stored in the page structure for the first ("head") base page. The page structures for the following ("tail") pages mostly just contain a pointer to the head page. It is important not to treat tail pages as being independent, or bad things will happen.
Bad things happened. It turns out that the kernel does not lack for code that assumes it can treat memory from vmalloc() as being made up of base pages; this code will tweak individual page structures without noticing that it is dealing with tail pages. That leads to corruption of the system memory map and a kernel oops once that corruption is noticed. One case where this is known to happen, which was first noticed by Rick Edgecombe, is driver code calling vmalloc_to_page() to obtain a page structure somewhere within a vmalloc() allocation (and, thus, possibly in the middle of a compound page). It turns out that there are quite a few drivers using vmalloc_to_page(); each of those is almost certainly broken if the memory involved is made up of compound pages.
This particular problem was eventually fixed by Piggin; the
code now splits allocated huge pages back into base pages (while retaining the
huge-page mapping), taking tail pages out of the picture.
But there were some other surprises lurking within the vmalloc()
subsystem as well; as the issues accumulated, Torvalds concluded
that "HUGE_VMALLOC was badly misdesigned
". It was, he
said, buggy from the beginning; the problems only turned up now because
enabling the feature on the x86 architecture resulted in far wider testing.
Resolutions for 5.18
Piggins's fix was merged for the 5.18-rc4 prepatch. Meanwhile, Song Liu, the author of the BPF allocator patches, was working to find a set of solutions that would allow that allocator to be used safely; the result was a four-part patch set that:
- Removed the VM_NO_HUGE_VMAP flag in favor of a new VM_ALLOW_HUGE_VMAP variant. That changes the sense of the flag, making huge-page allocations an opt-in feature rather than opt-out.
- Caused alloc_large_system_hash() (which is used to allocate space for large hash tables) to opt into huge-page allocations, since they are known to be safe there.
- Added a function called module_alloc_huge() which also enables huge-page allocations.
- Used module_alloc_huge() to allocate the space used by the BPF allocator.
This response might have been sufficient if the wider use of huge
pages in vmalloc() was the only problem. Torvalds, however, didn't
like what he saw in the BPF allocator code either. Among other things,
he pointed out that it enabled execute permission on the allocated memory
without initializing it first, adding a bunch of random executable text to
the kernel's address space. He concluded: "I really don't think
this is ready for prime-time
".
Following through on that conclusion, he decided to apply just Liu's first patch, which had the effect of disabling huge-page allocations in vmalloc() entirely (since nothing used the new opt-in flag). Initially he intended to stop there, but later decided that the second patch was also safe to apply. Then he even went one step further, adding a patch of his own enabling huge-page allocations in kvmalloc(). The reasoning here is that memory returned from that function might have come from a slab allocator, so recipients should not be using low-level tricks with the underlying page structures in any case.
Liu has since fixed the uninitialized-memory problem in another patch series. BPF maintainer Alexei Starovoitov has tried to make the case that this work should be applied as well, making the BPF allocator available in the 5.18 release. Torvalds remains unconvinced, though, so this work seems more likely to be 5.19 (or possibly even later) material. BPF users will probably just have to wait one more cycle to have access to the specialized memory allocator.
There are a number of conclusions that can be drawn from this little
episode. Tweaking low-level memory-management features is tricky and can
create problems in surprising places. There is a lot of value in the
widespread testing that comes with the more popular architectures; it will
turn up bugs that can remain hidden on architectures with smaller user
bases. But, perhaps most significantly, this is the kind of problem that
lends credence to the claim that access to struct page should
never have been allowed outside of the memory-management subsystem.
Exposing such low-level details to the kernel as a whole was always going
to lead to surprises of this type. Weaning the rest of the kernel off of
struct page (which is just beginning
to happen) will be a long and difficult task, but may well be
worth the pain.
| Index entries for this article | |
|---|---|
| Kernel | BPF/Memory management |
| Kernel | Memory management/BPF |
| Kernel | Memory management/Direct map |
| Kernel | Releases/5.18 |
| Kernel | vmalloc() |