
A memory allocator for BPF code

By Jonathan Corbet
February 4, 2022
Loading a BPF program into the kernel involves a lot of steps, including verification, permissions checking, linking to in-kernel helper functions, and compilation to the native instruction format. Underneath all of that, though, lies one other simple task: allocating some memory to store the compiled BPF program in the kernel's address space. It turns out that this allocation can be somewhat wasteful of memory in current kernels, especially on systems where large numbers of BPF programs are loaded. This patch set from Song Liu seeks to remedy this problem by introducing yet another specialized memory allocator into the kernel.

The kernel allows BPF programs to be reasonably large, but most of them are, in practice, quite small. BPF can do quite a bit with a few hundred bytes of code space. But the kernel allocates space for BPF programs in units of full base (usually 4KB) pages, with all of the space past the end of each program being wasted for as long as that program remains loaded. For a small program, most of the page allocated to hold the code is unused; if there are many programs loaded, the amount of wasted memory can start to add up.

Liu's patch set adds a new "bpf_prog_pack" allocator to address this problem. It would be, on the surface, one of the simplest memory allocators in the system. It maintains a list of huge pages to hold BPF programs, allocating new ones as needed. Whenever space is needed for a new BPF program, a simple, first-fit algorithm allocates the amount needed. A bitmap associated with each huge page (or "pack") tracks the free space (in 64B chunks), thus automatically coalescing chunks returned to the allocator when programs are unloaded. It is not written for speed, but it does not need to be; even on a system that makes heavy use of BPF, allocation and free operations will be relatively rare.

The advantages of this simple allocator are reasonably clear. Because multiple programs can be packed into a single page, the memory that is wasted due to internal fragmentation will be greatly reduced. Performance will be helped by putting BPF programs into huge pages, which reduces translation lookaside buffer (TLB) pressure. But it is worth asking why yet another allocator is needed when the kernel already has memory-management code that has been extensively developed and tuned over the years. The answer is that BPF brings some special needs that cannot be met with existing allocators.

The problem with using the kernel's page allocator without a subdividing layer has already been described: too much space is wasted. But the kernel's slab allocators have been designed for exactly this task: they take large chunks from the page allocator and pass them out to the rest of the kernel in smaller pieces. So it might be natural to think about using the slab allocator here; Liu does not explain why that was not done, but there are a couple of plausible reasons for this choice.

One reason is that the slab allocators are optimized for the allocation of equal-sized chunks. But there is no one size that fits all for BPF programs, so the simple expedient of using a dedicated slab and kmem_cache_alloc() will not work here. It would be possible to use multiple slabs, of course, either directly within the BPF code or simply by allocating space with kmalloc(), but there would be internal fragmentation issues there (albeit smaller) either way. The power-of-two approach used by kmalloc() would still waste a fair amount of space at the end of each program.

There is a more fundamental problem with using the slab allocator, though. BPF programs are executable code, which must be handled specially. Code requires execute permissions at the page level, meaning that it cannot be intermixed with data (as kmalloc() would do). After all, developers have gone out of their way for years to ensure that data cannot be made executable within the kernel. So the slab allocators, which are meant to manage space for data, are not really suitable for this application.

A certain amount of trickery is required to make the program-packing scheme work, even with the special-purpose allocator. Updating code in a running system is fraught with perils; if a CPU in the system tries to run code while it is being changed, the results tend to be bad. The kernel relies fairly heavily on self-modifying code, so the mechanisms to do this modification safely are there, but the task must still be handled carefully. The inability to write to executable memory in straightforward ways, along with the fact that executable memory in the kernel must be read-only, presents challenges for the JIT compiler.

So when the new allocator is invoked to find a spot for a new program, it actually allocates two buffers. The first is the read-only, executable space where the program is destined to live; the second is ordinary memory obtained with kvmalloc(). The JIT compiler will compile into the second buffer, which can be written to, but any addresses it generates must be relative to the first buffer instead. Once the compilation process is complete, the kernel's "text poke" machinery, enhanced with a new copy function, can be used to move the compiled program into the read-only space and the temporary buffer freed.

This patch set has been through eight revisions as of this writing, with a fair amount of review activity creating a stream of new items to fix. At this point, though, there may not be a whole lot more to change. So it would not be entirely surprising to see this code find its way into 5.18; heavy BPF users should benefit with reduced memory usage and a small performance improvement.

Index entries for this article
Kernel: BPF/Memory management
Kernel: Memory management/BPF
Kernel: Releases/5.18



Pointed question, I guess...

Posted Feb 4, 2022 20:56 UTC (Fri) by warrax (subscriber, #103205) [Link]

I guess it's not that much about the allocator specifically, but I'm wondering... as this BPF thing gains more and more capabilities... is there *any* attempt at formal verification that extensions don't break the critical guarantees of the verification step?

A memory allocator for BPF code

Posted Feb 6, 2022 0:06 UTC (Sun) by Sesse (subscriber, #53779) [Link] (27 responses)

You sure have a lot of BPF programs if you feel that a huge page (2MB) is a good minimum size to spend on them!

A memory allocator for BPF code

Posted Feb 6, 2022 8:00 UTC (Sun) by epa (subscriber, #39769) [Link] (23 responses)

I think you must have a lot if you care at all that each one uses a whole four kilobytes normally.

A memory allocator for BPF code

Posted Feb 6, 2022 14:30 UTC (Sun) by jhoblitt (subscriber, #77733) [Link] (22 responses)

There are already kubernetes CNIs that use bpf to create the overlay network. Maybe someone is pushing packet filtering into bpf as one rule per prog?

A memory allocator for BPF code

Posted Feb 8, 2022 7:53 UTC (Tue) by bartoc (guest, #124262) [Link] (21 responses)

I remain baffled that overlay networks are so ingrained in Kubernetes

A memory allocator for BPF code

Posted Feb 8, 2022 21:22 UTC (Tue) by atnot (guest, #124910) [Link] (20 responses)

I think they are unfortunately pretty inevitable when everyone has to share the same 24/16 bits of address space, especially over existing networks usually not built to the security and scalability requirements of running thousands of containers.

A memory allocator for BPF code

Posted Feb 8, 2022 21:27 UTC (Tue) by Sesse (subscriber, #53779) [Link] (7 responses)

Really, “has no”? I assume these things support IPv6?

A memory allocator for BPF code

Posted Feb 8, 2022 22:50 UTC (Tue) by atnot (guest, #124910) [Link] (6 responses)

It does theoretically, except that:

* It is only somewhat usable in recent versions
* docker still does not enable it out of the box so nobody builds oci images with ipv6 in mind
* very little of the tooling has ever encountered a v6 address nevermind a v6-only environment
* neither have most developers working in the space
* the cloud and corporate environments these systems run in are usually rigidly v4-only

Sometimes I wonder where we'd be today if the people at docker had decided to use v6 internally right away. Not like anyone would have batted an eye with all of the other quirks of docker. But instead every organization using 172.16.0.0/16 internally now has to deal with an endless stream of users running docker complaining about not being able to access the network. Oh well.

A memory allocator for BPF code

Posted Feb 8, 2022 22:59 UTC (Tue) by Sesse (subscriber, #53779) [Link] (5 responses)

OK, that's pretty insane (it's not like IPv6 has been particularly exotic for the last decade or two). But I guess people are so incredibly wed to their IPv4 thinking that they'd rather add thirty layers of complexity than switch to the version with more address space :-)

A memory allocator for BPF code

Posted Feb 9, 2022 6:58 UTC (Wed) by bartoc (guest, #124262) [Link] (4 responses)

Yeah having to administer bgp servers of all things always seemed more painful to me than just getting ipv6 working correctly.

I think part of it is that it comes from Google's Borg, and Google is the main contributor to the ecosystem. Google built out its datacenter architecture and management tools before ipv6 was a "thing", and deploying ipv6 at the lowest levels of the "hyperscalers'" data centers is .... really, really scary, so they didn't. And then k8s needs to actually work on Google infrastructure (and indeed, there's no incentive to make it work anywhere else).

A memory allocator for BPF code

Posted Feb 11, 2022 20:12 UTC (Fri) by jd (guest, #26381) [Link] (3 responses)

Google was founded in 1998, two years after I'd set up Manchester University's IPv6 node. So, yes, it's before IPv6 became mainstream, but stacks were in Linux as of 2.0.20 (with a patch) and were in the main kernel by early 2.1 IIRC. (They'd already been in Windows - thanks to FTP Software, and Solaris.) KAME was already out for the BSDs, I think.

At that time, IIRC, NRL were distributing a library for stack-independent network code. (The connection could be IPv4, IPv6 or indeed any other supported protocol and that detail would be hidden from the application.) Since Google's use case was not that complicated, Google could even have written a small bit of code to do the same thing, although it would have meant getting the web server to support it, which would have added to what they had to maintain.

So Google could have either supported IPv6 directly or developed their code to simply not care about that layer at all. Now, in hindsight, the extra work then (when things were simple) might have made sense, although it's just as possible that they simply didn't have the time or money to throw in features that they couldn't sell at the time. And, let's be honest, there was quite a lot of cynicism about IPv6 back then as well.

Today, to have a protocol-independent communications layer would be a LOT of work for everyone, I'm not sure you could introduce such a system at this point, and switching a massive heap of interdependent legacy IPv4 code to IPv6 would be almost as painful.

A memory allocator for BPF code

Posted Feb 11, 2022 20:39 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

It's worth noting there's another company with great engineering accomplishments, namely Facebook (or maybe Meta now?). They went with IPv6-first internal networks in their DCs. They also have their own container-orchestration solution, Tupperware.
I'd love to see a comparison (with emphasis on networking) between Tupperware and Kubernetes/Borg.

A memory allocator for BPF code

Posted Feb 12, 2022 1:07 UTC (Sat) by Sesse (subscriber, #53779) [Link] (1 responses)

FWIW, I did large chunks of Google's IPv6 porting back in the day, and nearly all code was indeed written protocol-independently, so slotting in IPv6 support was fairly easy. You had to fix a few central abstractions, and some stuff around logging and ACLs and such, but it turns out most code doesn't care much about what an IP address _is_, just that you can store it and send it around to other parts of the code. (We were, de facto, three people who did most of it over a period of 1–2 years, not all of us full-time.) IIRC, you could have pulled up an IPv6-only Borg cluster around early 2011 or so if you wanted, and have it run real production-like workloads. But changing how cluster operations work is a completely different game; one that I never pursued, and I have no idea how it works internally now (I've since quit and then rejoined Google, but in a completely different part of the company). And public cloud happened after all of us had moved on to other systems, so no, I'm not responsible for any deficiencies it might have :-)

A memory allocator for BPF code

Posted May 1, 2022 12:01 UTC (Sun) by MaZe (subscriber, #53908) [Link]

Cluster stuff is now (finally!) basically done and ipv6-only.
Though of course there's always weird special cases and exceptions, like critical ipv4-only hardware (temperature sensors, ntp/gps and the like).
The current true battlefront is in cloud... and to a lesser extent trying to get to ipv6-only corp (due to rfc1918 [incl. cgnat] exhaustion).

A memory allocator for BPF code

Posted Feb 8, 2022 21:40 UTC (Tue) by jhoblitt (subscriber, #77733) [Link] (11 responses)

It doesn't particularly matter how large the addressable space is. The options are either to try to coordinate the ports used by potentially hundreds of containers dynamically scheduled onto the same host, or to use an abstraction layer such that every container can have a service listening on port 80. I can think of no examples of a container orchestrator that went with port coordination. Docker Swarm, Mesos, ECS, k8s, Cloud Foundry, etc. all support an overlay.

A memory allocator for BPF code

Posted Feb 8, 2022 21:43 UTC (Tue) by Sesse (subscriber, #53779) [Link] (10 responses)

> It doesn't particularly matter how large the addressable space is. The options are either to try to coordinate the ports used by potentially hundreds of containers dynamically scheduled onto the same host, or to use an abstraction layer such that every container can have a service listening on port 80.

Why is it not an option to give each container an IPv6 subnet?

> I can think of no examples of a container orchestrator that went with port coordination.

Borg did.

A memory allocator for BPF code

Posted Feb 8, 2022 23:02 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

You certainly can, and that's what the newest K8s does. But K8s overlay network also provides some security guarantees, because it's trusted and is not open to the public Internet.

A memory allocator for BPF code

Posted Feb 9, 2022 7:34 UTC (Wed) by bartoc (guest, #124262) [Link] (1 responses)

That's sort of true, I suppose, but it's true in the same way that NAT provides security guarantees (you can't have an overlay network without doing that stuff, but you can do that stuff without having an overlay network).

A memory allocator for BPF code

Posted Feb 9, 2022 7:51 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

K8s network is not open to the public Internet at all. If you receive a packet from it, there's a guarantee that it has been sent by somebody within the trusted net.

A memory allocator for BPF code

Posted Feb 9, 2022 1:19 UTC (Wed) by jhoblitt (subscriber, #77733) [Link] (6 responses)

Giving each container its own network address, distinct from the container host's, is called an overlay network. Whether it is ipv4, ipv6, ipx, or appletalk is just an implementation detail.

k8s supports ipv4, ipv6, and last year dual stack came out of beta.

A memory allocator for BPF code

Posted Feb 9, 2022 7:31 UTC (Wed) by bartoc (guest, #124262) [Link]

No, "overlay networks" are not the same as "giving each container its own IP".

Overlay networks imply some kind of packet encapsulation; with ipv6 that's not necessary.

A memory allocator for BPF code

Posted Feb 11, 2022 21:18 UTC (Fri) by jd (guest, #26381) [Link] (4 responses)

Well, given the way IPv6 has been used in practice, you're absolutely right. And if you are content to only consider what is mainstream practice, nothing after this point will be of the slightest interest.

According to how it was originally designed (a prefix which defines the route, and a suffix that defines any given physical or virtual machine that is connected via that route), there was a bit more freedom as any physical or virtual interface was absolutely guaranteed a unique IPv6 address. This was when autoconfiguring networks were to be the way and DHCP was seen as a maintenance-heavy dead end.

(This is how the original mobility system worked. If you moved a machine from one network to another, you simply notified everyone connected that your prefix had changed and all packets - including those in transit - would be diverted to your new prefix. Your old address would be marked transient and would remain usable for existing packets but not new ones. This necessitates a unique suffix.)

This means that to have a new subnet, you simply add a byte to the prefix to identify the new network. As long as there were bytes left you could use, defining new subnets was trivial. Which meant one interface had one IP address. There was no concept of multihome. In order to have more than one IP address go through an interface, you needed a virtual network, where each virtual interface had one IP address. Yes, it's still an overlay network, but it became part of the design rather than an add-on, and there was no distinction between what was software and what was hardware. It was just one network.

Telebit devised an extension for this, although I think it vanished with them. As far as I could understand, their system allowed the creation of networks that were utterly transparent to the outside world through some sort of NAT. Your traffic could even go over these transparent segments and back onto the visible network but it would look like a single hop to the outside world, which wasn't bad for nascent IPv6 in 1996. That would have let you get past the prefix length limit.

Of course, nothing actually stops you from using the autoconfigure protocol for containers and virtual switches in a virtual network of containers - well, other than this really doesn't play nice with software that assumes a static IP. You're only guaranteed a static suffix.

If you're happy with mainstream protocols, whether or not it's standard use, then stop here. Because it's about to get scary.

IPv6 evolved out of a couple of experimental IPng protocols, one of which was TUBA. Now, with TUBA, the idea was that addresses were variable length. (I did say it was about to get scary.) IIRC, routing would be done by moving the cursor up or down from the current position to just the next few bytes to see where the packet should go. Something hardware engineers were getting ready to mutiny over, from the sounds of things. How this would have ever worked is left as an exercise for Stephen King fans. But in principle, it would have meant that you couldn't run out of space on the prefix.

A memory allocator for BPF code

Posted Feb 11, 2022 21:29 UTC (Fri) by jhoblitt (subscriber, #77733) [Link] (1 responses)

There is also the small issue that until extremely recently, ipv6 support in network equipment often could not be trusted. It is easy to wave your hands and be dismissive of this, but all of the so-called hyperscalers had massive difficulty in rolling out ipv6. It is easy to say a switch supports ipv6 on the box, but does it work with the same reliability and hardware acceleration as ipv4? Say, with NDP working reliably over an EVPN-based spine+leaf deployment? The answer was obviously no until practically last week.

A memory allocator for BPF code

Posted Feb 12, 2022 1:10 UTC (Sat) by Sesse (subscriber, #53779) [Link]

> It is easy to wave your hands and be dismissive of this but all of the so called hyperscalers had massive difficulty in rolling out ipv6.

Citation needed, please? Or at least a clarification of what “rolling out IPv6” would entail, because e.g. www.google.com has had AAAA records since June 6th, 2012 (World IPv6 Launch).

A memory allocator for BPF code

Posted Feb 16, 2022 19:49 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

> IPv6 evolved out of a couple of experimental IPng protocols, one of which was TUBA.

Wait, the real answer to the ignoramuses going “LOL IPv6, why don't we just make IPv4 addresses longer?” was sitting right there in an RFC (1347) all this time?!

I almost want them to find it and try and implement it, just for the schadenfreude.

A memory allocator for BPF code

Posted Feb 16, 2022 20:14 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

TUBA (RFC 1347) keeps TCP and UDP but replaces the IP layer with the Connectionless Network Layer Protocol (CLNP). No compatibility was offered via translation or tunneling between IP and CLNP networks, and CLNP diverges more from IPv4 than IPv6 does. Ergo, TUBA would not have been any less disruptive than dual-stack IPv4+IPv6, and represents a step backward from something like 464XLAT, which permits IPv6-only networks to communicate with IPv4-only hosts via stateful NAT and protocol translation. Plus, of course, all the hardware engineers' perfectly legitimate objections to variable-length addresses.

A memory allocator for BPF code

Posted Feb 6, 2022 8:10 UTC (Sun) by zdzichu (subscriber, #17118) [Link] (2 responses)

There could be more than you expect. You can use the bpftool prog command to see them.

On my typical Fedora laptop, there are 21 programs loaded. On my home server there are 470. All of them loaded by systemd and libvirtd, nothing custom.

A memory allocator for BPF code

Posted Feb 6, 2022 16:25 UTC (Sun) by plugwash (subscriber, #29694) [Link] (1 responses)

A huge page is 512 regular pages, so it sounds like you would need hundreds of BPF programs to see a memory-usage benefit from this patch. From your numbers, it sounds like that would not be the case on your laptop.

Of course you could argue that a couple of wasted megabytes is lost in the noise on a modern desktop/laptop but.............

A memory allocator for BPF code

Posted Feb 6, 2022 17:46 UTC (Sun) by Sesse (subscriber, #53779) [Link]

More precisely: You'd need hundreds just to break even, and in the thousands to see any gains.

My server seems to have 20–30 BPF programs, but I guess this will eventually increase. Maybe this allocator will be optional?

A memory allocator for BPF code

Posted Feb 6, 2022 17:29 UTC (Sun) by ttuttle (subscriber, #51118) [Link]

BPF programs come in as bytecode, and this allocator holds machine code. By the time it's running, the code is already stored in a temporary buffer (from kvmalloc), so it knows how much space it needs.

How does the BPF code itself know how big a buffer to ask kvmalloc for? Or is it one of those "ask for a reasonable size and double it every time it fills up" kind of deals?

A memory allocator for BPF code

Posted Feb 7, 2022 10:41 UTC (Mon) by k3ninho (subscriber, #50375) [Link] (2 responses)

This 'packing and fragmentation' pattern looks like a filesystem -- would extending RAMFS with de-allocation (so the kernel can manage the pool) be a better way to manage permissions and pack smaller-than-4K items across pages?

K3n.

RAMFS?

Posted Feb 8, 2022 8:56 UTC (Tue) by matthias (subscriber, #94967) [Link] (1 responses)

I guess the answer is a simple no.

Does RAMFS even support block sizes smaller than the page size? The new allocator uses 64B as its block size. Also, a filesystem has a lot of overhead: each file needs an inode and a directory entry. None of this is needed for BPF programs; knowing their address (in memory) and maybe their size is enough to use them. All the overhead a filesystem needs to represent a single file would be much bigger than the program itself.

And the filesystem cannot manage the permissions. Filesystem permissions are a completely different thing. Here, we are talking about the access bits used by the memory management unit, and a filesystem will not help with that. You cannot mix data with executable code, as the two need different bits set in the page tables, and those bits are only available per page. So you need a pool of memory that holds no data, just executable code.

RAMFS?

Posted Feb 8, 2022 11:45 UTC (Tue) by k3ninho (subscriber, #50375) [Link]

I was so hoping we could apply everything-is-a-file to this filter-all-the-things pattern...

K3n.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds