Modernizing BPF for the next 10 years
BPF was first generalized beyond packet filtering more than a decade ago. In that time, it has changed a lot, becoming much more capable. Alexei Starovoitov kicked off the second day of the BPF track at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit by leading a session discussing which changes to BPF are going to come in the next ten years as it continues evolving. He proposed several ideas, including expanding the number of registers available to BPF programs, dynamic deadlock detection, and relaxing some existing limits of the verifier.
Starovoitov started with a recap of the last ten years of BPF development. BPF's initial use case was for networking — hence the name "Berkeley Packet Filters". In 2015, this expanded into a new generation of tools, including using it for tracing. Everything in BPF has evolved for a reason, he said. Once support for BPF existed in the kernel, user-space tools like katran and Cilium started popping up to take advantage of it.
As tracing tools evolved, they needed access to kernel headers in order to understand how the data they were seeing was laid out. This led to things like the BPF Type Format (BTF), "compile once — run everywhere" (CO-RE), and teaching Clang how to do symbolic field accesses instead of using a known offset, he explained. Other BPF usability features were also introduced by necessity.
In C, the natural representation of global variables is to store them in .data sections (or .bss, .rodata, etc.). For BPF programs, however, it often makes sense to share global variables with user space for configuration or reporting purposes, which means that data needs to be stored in a BPF map. Even if the globals are stored in a BPF map, however, userspace still might not know how they are arranged. The solution is BPF's skeleton support that generates types that match how the compiler lays out global variables, making them easier to access programmatically.
The way BPF code is written has also changed a lot. Initially, the verifier had no support for function calls — meaning that programmers had to annotate all of their functions with always_inline. That is no longer necessary, but many users don't seem to know that because of the variety of old examples floating around the internet. "I feel we need to do something in terms of evangelizing the better practices," Starovoitov said. Loops are in a similar situation; they were not supported initially, but over time have become increasingly usable. Yet there are still patches submitted to the mailing list today that use bpf_loop(), even though open-coded iterators (a better replacement) were added two years ago.
The interfaces used by BPF code have also seen some changes. The initial mechanism for BPF to call into kernel code was helper functions. These are considered a stable interface, so it's hard to experiment with new possibilities. They've been supplemented by kfuncs, which are an unstable interface that can be augmented from anywhere, including kernel modules. At the time of Starovoitov's talk, there were 211 helper functions and 164 kfuncs in the kernel.
For kernel code calling into BPF, the same flexibility is offered by the struct_ops mechanism. The feature was first introduced for sched_ext, but has turned out to have many users across the kernel. Implementing the feature did take some time, Starovoitov said, because the necessary translation to and from the BPF calling convention is not trivial. But overall, struct_ops has been an incredibly useful feature.
It is also not the only change to BPF driven by sched_ext; the necessary integration with the scheduler means it drove many features that improve data sharing between the kernel and BPF programs. The most notable is perhaps kptrs, which are direct pointers to kernel data structures that rely on the BPF verifier to track ownership and lifetime information.
The general trend has, of course, been toward making BPF programs more capable. The most recent changes that Starovoitov discussed were BPF arenas and cond_break, which represent a big step toward BPF being able to implement arbitrary algorithms and data structures. Adding these means that extending BPF no longer needs as many kernel changes, and it also turns a lot of static verification into runtime verification, Starovoitov said. Now that these are in place, there will start to be more BPF libraries. Libraries for regular expressions and hash tables already exist — a string manipulation library is probably next.
The future
Right now, library code is reused between BPF programs using the oldest library-management technique: copy-and-paste. That needs to change, he said. BPF developers need some way to distribute shared BPF code as libraries — ideally, an approach modeled on Rust or Python, where libraries are distributed as source only. C and C++ don't really handle dependency management well, and BPF should learn from their mistakes.
Not every desirable library can be written in BPF today; there are some additional enhancements necessary to enable truly arbitrary algorithms. sched_ext can't do everything users might want — notably, reimplementing EEVDF (the current default Linux scheduler) — because of the current limits on locks. Only one lock can be held at a time, and the program cannot call kfuncs while it is held. Worse, bugs in the verifier code that ensures this property could potentially cause deadlocks. What BPF needs, Starovoitov asserted, is another line of defense, so that locks taken by BPF programs (or the core infrastructure of the BPF VM) can't interfere with the kernel. Once that is possible, run-time deadlock detection can be implemented, and then it will become possible to relax the restrictions the verifier currently puts on locks.
Many people see relaxing the verifier's requirements as something necessary to make BPF Turing-complete but contrary to popular belief, it already is, he said. BPF arenas are almost the last feature required to demonstrate this fact by implementing an interpreter in BPF. The only missing part is support for jump tables and indirect gotos. Writing an interpreter in BPF is a "toy motivation" that he does not actually expect anyone to want to do, Starovoitov clarified, but jump tables are something BPF will need to remain relevant.
Related to indirect gotos is support for tail calls. BPF does already have bpf_tail_call(), but Starovoitov called that a hack, saying it was cumbersome to use. A cleaner solution would be to use a dedicated instruction for indirect calls — which BPF has actually had since the beginning, as long as the call targets are global functions. The missing component is verifier support; the verifier needs to be changed to understand the concept of "the address of the instruction".
Efficiency
Even if there is little else needed to make BPF more flexible, there are plenty of things that could make it more efficient, such as dedicated bit-manipulation instructions or changes to the calling convention, Starovoitov explained. In particular, a function attribute that can be used to mark functions that don't need to use caller-saved registers could let compilers be smarter about register allocation and reduce the number of registers spilled to the stack. Another possible idea for better register allocation is just to increase the number of available registers. When BPF was first being developed, x86 was the dominant architecture under consideration; it does not offer many registers compared to other architectures. For modern BPF, the Arm and RISC-V architectures are important contenders — but BPF programs can't take advantage of the larger number of available registers.
Starovoitov mentioned a few ways that the BPF developers could approach the problem, such as switching to virtual registers and doing register allocation in the kernel. Other possibilities include rejecting programs that use too many registers for a given architecture, fat binaries compiled for different numbers of registers, or having the verifier track spills to the stack and convert them to registers when possible. David Vernet pointed out that BPF's instruction encoding only has 4 bits for registers, so using more than 16 registers would be difficult. Starovoitov replied that new instructions could potentially add more space for registers, noting that he feels that the restriction on the number of registers is showing its age.
Increasing the number of available registers could also open the door to passing six or more arguments. Right now, BPF has a maximum of five arguments per function, because only five registers are used for passing arguments. That could be worked around by using the stack, but extra registers would be an easier solution. The key constraint around changing the BPF calling convention is making sure it can be efficiently mapped to the kernel's calling convention, so the right solution is not immediately obvious.
More uses
Starovoitov mentioned one final pie-in-the-sky idea for better interoperability between BPF and the kernel: with extra registers and an expanded calling convention, it might be feasible to compile the kernel to the BPF ISA. That would open the door to a number of previously unimagined applications, such as using BPF debugging across the whole kernel.
There are other more feasible improvements, though, such as permitting alloca(). BPF programs are currently restricted to a stack of 512 bytes, which makes using alloca() impractical, since there is not much space available for allocations. While it may be possible to expand the size of the BPF stack, another solution is to use a "divided stack" that tracks return addresses and local variables on separate stacks. The 512-byte stack that the kernel is aware of could then be saved purely for function calls, with a larger stack allocated (perhaps in a BPF arena) on the fly.
Vernet questioned how desirable alloca() was in BPF programs, noting that it creates a bunch of extra instructions compared to using a static allocation — and, in fact, alloca() is forbidden in the kernel for that and other reasons. Starovoitov acknowledged that it was forbidden in the kernel, but didn't think that was relevant to BPF programs. alloca() is much cheaper than a heap allocation, and can be guaranteed to succeed. BPF programs might find it useful for holding structures that vary in size depending on the size of kernel structures, a complication that comes up fairly frequently.
Of course, all of this future flexibility and dynamism will come at a cost. "Not everything can be done statically," Starovoitov said. Making BPF programs safely cancellable, with run-time timeouts to augment program verification, will probably become necessary. Starovoitov said that work was already in progress. Debugging all of these new features is not likely to become a problem, however, because BPF already has good observability. BTF debugging info is a good match for C code (and kernel code), and existing tools like bpftrace use it to great effect. One thing that is missing is letting these same observability tools work in user space. Starovoitov noted that it is a chicken-and-egg problem: user-mode BPF probes are not fast, so why bother supporting them? But they won't become fast without additional support. Also, many languages used in user space don't match BTF semantics as well, which may require changes to the format.
A potential side effect of making BPF more capable is making more work for the verifier. Currently, the verifier has a one-million-instruction limit, just to bound the amount of time it will spend on a pathological BPF program. For large programs, however, any verifier or compiler change could potentially make it hit the limit, which is a "horrible user experience". There is no real solution yet, Starovoitov said, but the problem is something that BPF developers must consciously focus on in order to fix. There are some workarounds, such as testing compiler changes in the BPF CI. Another possible solution would be to relax the limit, and let the verifier go beyond one million instructions if it can tell that it is making forward progress.
The last idea Starovoitov introduced was making BPF into a kernel module. He noted that the main problem for vendors shipping products that need BPF is the variety of different kernels in use; it is difficult to make a BPF program work across all of them. One potential solution, he explained, would be to make it possible to upgrade the BPF subsystem independently of the base kernel. That would, itself, be a significant challenge — the existing kernel module mechanism is not enough to support it — but it could solve some persistent problems.
Questions
Starovoitov finished by saying that he thought BPF's next growth areas were likely to be in Linux security modules (LSMs), other security use cases, and continuing improvements to sched_ext. The audience had several questions. One person asked whether Starovoitov thought BPF was harder or easier to use now, after a decade of changes. Starovoitov replied that the BPF developers have made it much easier to use, but that they couldn't actually simplify some of the core design behind BPF because of stability guarantees, so it wasn't as easy to use as it could be. Another person asked whether Starovoitov expected BPF to see growth outside the kernel. Starovoitov replied that the main power of BPF is in observability and safety. There may be use cases that call for that combination outside the kernel, but user space as a whole would not benefit from BPF, he said.
Another member of the audience asked about how to communicate all of this to the many BPF users who were not present at LFSMM+BPF; they suggested doing outreach to other conferences to help spread some of these ideas. Starovoitov said that he agreed completely that this was a good idea, but that he is not specialized in doing evangelism. He called on the other people present in the session to help spread the word.
BPF has grown remarkably quickly, and there's no particular sign that it will stop. Many users are finding value in the greater observability and configurability it brings to the kernel. At the same time, it is clear that there are still big plans to change BPF. It may look quite different in another decade.
| Index entries for this article | |
|---|---|
| Kernel | BPF |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2024 |