
Leading items

Welcome to the LWN.net Weekly Edition for September 19, 2024

This edition contains the following feature content:

  • An update on BPF generation from GCC: the GCC BPF backend nears production readiness.
  • Debating ifupdown replacements for Debian trixie: Debian developers discuss the distribution's default network-configuration tools.
  • Fedora evicts WolfSSL: a package removal exposes gaps in Fedora's documentation and review process.
  • Kernel developers at Cauldron: kernel and toolchain developers discuss their mutual needs.
  • A discussion of Rust safety documentation: standardizing safety comments for Rust kernel code.
  • Some 6.11 development statistics: where the code in the 6.11 release came from.
  • The RCU API, 2024 edition: recent changes to the kernel's read-copy-update API.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

An update on BPF generation from GCC

By Jonathan Corbet
September 17, 2024

Cauldron
The generation of binary code for the kernel's BPF virtual machine has been limited to the Clang compiler since the beginning; even developers who use GCC to build kernels must use Clang to compile to BPF. Work has been underway for some years on adding a BPF backend to GCC as well; the developers involved ran a session at the 2024 GNU Tools Cauldron to provide an update on that project. It would seem that the BPF backend is close to being ready for production use.

[David Faust] The session was run by David Faust, Cupertino Miranda, and José Marchesi, with Faust doing most of the talking. He started by saying that the core of the BPF backend is now nearly in a production-ready state. It is able to compile all of the kernel test cases — but they do not yet all run correctly. In each case, the generated code is correct, but it is unable to get past the BPF verifier. Most of the self-tests pass at this point, though.

The BPF backend for GCC has been shipped with Oracle Linux for some time, and has more recently found its way into Debian, Fedora, and Gentoo, which is now using GCC to build systemd's BPF programs.

Keeping up with the latest BPF developments is an ongoing task for the GCC developers; there are a number of new instructions that have been added in recent years that must be supported and used to their full potential. These include unconditional byte-swap instructions, jumps with a 32-bit displacement, signed memory-load and register-move operations, and signed division and modulus instructions.

Another significant project in the last year has been addressing one of the biggest divergences from Clang: the handling of compile-time integer overflow, as can happen when assigning a negative number to an unsigned value. Both compiler teams agreed to accept any value that will fit into the intended data type and let the BPF virtual machine deal with it from there.
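
For illustration, the kind of construct at issue looks like the following; this is a minimal sketch rather than code from any particular BPF program:

    /* A negative constant assigned to an unsigned type: the value is
     * converted modulo 2^32, so mask ends up as 0xffffffff.  Both GCC and
     * Clang now accept this for BPF targets and leave any further checking
     * to the BPF virtual machine. */
    unsigned int build_mask(void)
    {
            unsigned int mask = -1;

            return mask;
    }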

The GCC backend can now place additional useful information into the generated ELF header, including the version of the BPF virtual machine that was targeted. That information can be used by disassemblers to interpret the binary code correctly.

An important milestone is the completion of support for the compile once, run everywhere ("CO-RE") mechanism, which allows BPF code to run correctly on multiple versions of the kernel. The compiler now produces the necessary (for CO-RE) BPF type format (BTF) data by default when the -g option is provided. The compiler will also generate the somewhat strange, C-like assembly syntax produced by Clang, despite a strong preference on the part of the GCC developers for something that looks more like traditional assembly languages. There is, in any case, a lot of assembly using this format in various header files, so the format must be supported.

There is a new feature macro, __BPF_CPU_VERSION, indicating the version of the virtual machine that a program is being compiled for; the -mcpu= flag can be used to target a specific machine version. There are also options to enable specific instruction types that would otherwise not be available in a given machine version; evidently there are distributors out there that backport implementations of some instructions to older kernels and would like those instructions to be supported.
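
As a rough sketch of how that macro might be used, consider the fragment below; the version threshold, the HAVE_SIGNED_DIV name, and the compiler invocation in the comment are assumptions made for illustration, not details given in the talk:

    /* Hypothetical: compiled with something like
     * "bpf-unknown-none-gcc -mcpu=v3 -O2 -c prog.c"; guard use of newer
     * instruction classes on the targeted virtual-machine version. */
    #if defined(__BPF_CPU_VERSION) && __BPF_CPU_VERSION >= 4
    #define HAVE_SIGNED_DIV 1      /* signed division/modulus available */
    #else
    #define HAVE_SIGNED_DIV 0      /* must emulate or avoid it */
    #endif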

Segher Boessenkool pointed out that the PowerPC backend used to have a similar ability to enable specific instructions, but that the developers had been forced to remove that capability. As the number of possible instructions grows, the number of combinations that must be tested explodes. Faust agreed that this could be a problem, but said that the testing is manageable for now.

[Cupertino Miranda] Miranda spoke briefly about the CO-RE implementation, which uses type-based relocation to adjust offsets at program-load time. Imagine a BPF program that accesses specific fields within a kernel structure. If a newer kernel version adds a field to the beginning of the structure, the offsets of all the other fields will change. CO-RE makes a note of all these references, calculates the proper offsets when a program is loaded, and adjusts the program accordingly.
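
To make that concrete, here is a minimal CO-RE sketch of the sort seen in BPF programs built with libbpf; it assumes a generated vmlinux.h and libbpf's helper headers, and the traced function and field were chosen arbitrarily for illustration:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    char LICENSE[] SEC("license") = "GPL";

    SEC("kprobe/do_nanosleep")
    int trace_nanosleep(void *ctx)
    {
            struct task_struct *task;
            pid_t pid;

            task = (struct task_struct *)bpf_get_current_task();
            /* BPF_CORE_READ() records a CO-RE relocation for the field
             * offset; the loader patches it for the running kernel, so the
             * program keeps working if task_struct's layout changes. */
            pid = BPF_CORE_READ(task, pid);
            bpf_printk("pid %d entered do_nanosleep", pid);
            return 0;
    }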

This functionality, Miranda said, is implemented using various GCC builtin operations, but the developers have been working to reduce them over time. Instead, relocation information is placed into the GIMPLE intermediate representation of the program. The result is a simpler and more correct implementation, and the kernel's CO-RE self-tests are now passing when built with GCC.

Faust returned to talk more about the generation of BTF data, which is necessary for the loading of programs into the kernel. BTF, he said, shares ancestry with the CTF type format; it is "an ugly domain-specific cousin". GCC can generate BTF, but only for programs written in C. The BTF code has been refactored over the last year, partly with the idea of making it work properly with link-time optimization — a goal that remains unrealized.

Originally, GCC was emitting BTF for every type that was declared in a given program, regardless of whether the type was actually used. Since BPF programs typically have to include a number of kernel headers, the declaration of unused types is a regular occurrence. The result was that even the most trivial of programs would include BTF data for something like 9,000 types, leading to the creation of output files that were vastly larger than the ones Clang produced.

With the new -gprune-btf option, though, GCC can now remove BTF for unused types, eliminating that waste. This option is not the default because in some cases, such as when the kernel itself is being built, the creation of BTF for all types is necessary. Clang does not produce BTF in this way for full-kernel builds; instead, the pahole tool is used to translate the needed information from DWARF debugging info.

A useful future enhancement would be to add a way to encode the flags used by the kernel in type declarations in BTF. These flags include __user, which indicates that a given pointer contains a user-space address. The BPF verifier could make use of this information to further check program correctness. The proposed solution is to encode this information as strings in the DWARF and BTF output. It would be nicer to treat a flag as a type qualifier (like const, for example), but doing so can break applications that consume DWARF data. For the time being, this information can only be attached to the child of a pointer type, so it only works with pointer values. There are some schemes being considered to improve this solution, but they are in an early stage.

Another future task is adding proper BTF support to the binutils package. That would allow, for example, the merging and deduplication of BTF data in the linker, which is needed to support link-time optimization.

[José Marchesi] An ongoing problem is verifier-aware compilation. Getting code past the BPF verifier is a constant frustration for BPF developers. Often, code will be transformed by a compiler's optimization passes into something that the verifier is unable to understand; one version of a compiler might work, while others do not. This topic was discussed at the 2023 GNU Tools Cauldron, Faust said, but little progress has been made since then.

Marchesi took the microphone as time was running out to say that there is now a defined memory model for BPF programs, based on the kernel's memory model. He would like to provide built-in functions that provide proper atomic operations implementing that model in BPF.

He also touched briefly on maintenance issues for the BPF backend. One ongoing task is developing a better assembly syntax for BPF — then getting the rest of the world to adopt it. Meanwhile, there are a lot of BPF users who have found themselves stuck using one specific version of the Clang compiler, since it is the only one that will generate code for their programs that can pass the verifier. GCC users, he said, can be expected to find themselves in a similar situation, but it would be good to find a way to do better. He closed the session with an unanswered question about what sort of maintenance model might help to improve this situation.

[ Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event. ]

Comments (3 posted)

Debating ifupdown replacements for Debian trixie

By Joe Brockmeier
September 12, 2024

Debian does not have an official way to configure networking. Instead, it has four recommended ways to configure networking, one of which is the venerable ifupdown, which has been part of Debian since the turn of the century and is showing its age. A conversation about its maintainability and possible replacement with ifupdown‑ng has led to discussions about the default network-management tools for Debian "trixie" (Debian 13, which is expected in 2025) and beyond. No route to consensus has been found, yet.

Time to retire ifupdown?

The classic ifupdown is a set of custom scripts for configuring networking in Debian that became a project in its own right after the Debian "potato" release in 2000. Debian now has not one, but several implementations of ifupdown. In addition to ifupdown-ng, there is ifupdown2, which is an implementation in Python with a largely closed development model involving a private repository where changes are later pushed into a public repository. BusyBox has its own ifupdown implementation as well.

The network-tool debate had its genesis in an off-list discussion that included Martin-Éric Racine and Santiago Ruano Rincón, who is the current maintainer of ifupdown. Racine explained that the private discussion had been about "the plethora of unresolved issues with ifupdown and Santiago's limited time to devote to its maintenance". Racine said that Ruano Rincón had asked if he would be interested in maintaining ifupdown, and then had questioned the need for multiple implementations of it. The conclusion was that one implementation of ifupdown, maintained by a team, would be better than many implementations.

Racine had sent the conclusion to Daniel Gröber, one of the maintainers of ifupdown-ng. Gröber forwarded snippets of the off-list conversation to debian-devel on July 7, saying that "I don't think we should be having this/these discussions in private". In the email, Gröber asked Ruano Rincón about the future maintenance of ifupdown and replacing it with ifupdown‑ng; he made the case that ifupdown-ng had a better core design, modern code, an active upstream, and a test suite. Overall, he said, ifupdown-ng has "the potential to fully replace ifupdown without breaking anyone's system doing it". He acknowledged that it was not fully compatible, but said that he was working on it and was optimistic that it could be.

Ruano Rincón replied that he had hit design issues with ifupdown and that it would make sense to focus efforts on ifupdown‑ng as a replacement, but only once it was a drop-in replacement for ifupdown, and it is not there yet. He ruled out ifupdown2 as the replacement due to its Python dependencies, lack of upstream community, and failure to support basics like DHCP for IPv6. He also asked for guidance on an unresolved discussion that began in June 2023 about changing the recommended DHCP client from ISC DHCP (which is no longer being developed) to dhcpcd.

On July 9, Racine followed up to say that he had looked at ifupdown-ng and that its syntax was not a drop-in replacement for ifupdown. Marco d'Itri agreed: "either it's drop-in compatible or we may as well switch the default to NM [NetworkManager] and/or systemd‑networkd".

Popping open Pandora's box

After the topic of replacing ifupdown had been broached by d'Itri, Matthias Urlichs said: "Well, here's a heretical thought: why don't we do that anyway, at least for new installations?" That sparked a lively discussion about the idea. Simon McVittie noted that Debian was already there to an extent. New laptop and desktop installs default to NetworkManager, and he argued that systemd‑networkd would be a better choice for server and embedded installations. The need for Debian to have its own network tools had passed, he said, "and we now have non‑distro‑specific network management components that can do just as good a job (if not better)". Even better, systemd‑networkd and NetworkManager were used "outside the Debian bubble", and that meant many more people were responsible for their maintenance.

A switch to NetworkManager or systemd‑networkd, Michael Biebl pointed out, would require modifying the Debian installer. Lukas Märdian made a pitch for adopting Netplan, a project sponsored by Canonical for generating network configurations for multiple tools, such as systemd‑networkd or NetworkManager, from a YAML description of the network interfaces. He said support for Netplan was "pretty close" to being merged into the Debian installer: "Netplan should be considered a unification layer on top of those networking daemons, which allows Debian as a project to use common language around networking."

Simon Richter complained that systemd's roadmap was unclear and that support for features Debian depends on "could be discontinued at any moment, and it would not be the first time a feature Debian relied on was removed". Debian is a volunteer project that cannot afford to be in "perpetual firefighting mode". He asked if it would be possible to put in place a process to deal with breaking changes, and suggested that "a long-term support commitment from upstream" would significantly reduce the effort needed to plan for those changes.

What's wrong with ifupdown?

Gröber later asked for specifics on problems with ifupdown's design or maintenance. Ruano Rincón pointed to design problems with ifupdown such as the one described in this bug report. The ifupdown utility refuses to configure IPv4 on an interface if there is a configuration error for IPv6. That did not convince Gröber, who said that he had only seen two technical arguments against ifupdown and wanted to know what problems ifupdown causes that require removing "all traces" of it from a minimal Debian install.

Josh Triplett responded that there was a simple answer: having multiple network tools "trying to poke at the same network devices, *or* network tools working around each other carefully" should not be the default. Installers should treat network-management tools as mutually exclusive, and allow users to install more than one later if they so desire.

What about Netplan?

In early August, Märdian led a birds-of-a-feather (BoF) session at DebConf 2024 about how to "improve and homogenize" the networking landscape in Debian. He presented three scenarios, found in his slides, dubbed "Status Quo++", "Minimized", and "Universal", along with his summary of the work that would still need to be done to make them possible. The status quo scenario he proposed involved transitioning to ifupdown‑ng, which would require switching to dhcpcd-base and making ifupdown‑ng a drop-in replacement for ifupdown. The minimized scenario involved switching to systemd‑networkd for base and server images, or NetworkManager for desktop images. That would require work on the Debian installer, migrating the existing ifupdown configurations, dropping ifupdown, and enabling autopkgtest support for those packages.

Finally, the universal scenario would keep ifupdown[-ng] in the base installation, NetworkManager for desktop installs, and enable systemd‑networkd on server installs. On top of those, it would add Netplan as "a common configuration interface across variants". Users could fall back to any of the underlying tools without fussing with Netplan configuration if desired. The only remaining work for this, according to the slides, would be for ftpmasters to raise the priority of the netplan-generator package so that it is installed by default.

Private to public

During the BoF, Märdian suggested setting a deadline for a decision of "six to eight weeks". He also committed to writing an email to debian-devel that summarized the BoF discussion for those not at DebConf. That did not quite go as planned. Before starting a public discussion on debian-devel, he had privately emailed networking maintainers to gauge their opinions. On August 20, Gröber forwarded parts of Märdian's email to debian-devel before Märdian took the discussion public.

In the email, Märdian had proposed a compromise that would make systemd-networkd the recommended networking tool for minimal installations. Either ifupdown, or ifupdown-ng if it could be made drop-in compatible, would be a fallback and remain available for existing installations that were upgrading to trixie. Then ifupdown[-ng] would be dropped in "forky", the release following trixie. Märdian had added, "I'd like to avoid drama and calling the CTTE [Debian's technical committee] to make a decision on our behalf, but rather find a compromise between us networking maintainers."

Gröber, reasonably, said that he was not interested in doing the work of getting ifupdown-ng ready to be the default in trixie if it would be removed in forky. He also said that the choice of network-configuration tools should be based on the wants of the majority of Debian users, and not a technical decision among maintainers. He suggested an informal vote to gauge the preference of Debian developers and the wider community, and asked the list how similar decisions had been made in the past. If a vote showed that he was "alone in wanting Debian to retain [its] identity as Debian" he would step aside on the matter.

Märdian followed up and said that it was fine for his points to be made public, but he wanted to add the parts that Gröber had dropped from the original mail. In his reply, he mentioned that he had formed a new Debian networking team after discussions at the BoF at DebConf that would "bundle resources and help each other out with critical maintenance & have a common channel for communications". However, no further details about that team and its membership were shared.

He said that the decision about networking tools needed to be made by developers, "because someone needs to do the work":

Otherwise, we would be suffering the same way we did with classic ifupdown in the past years, as it's becoming harder and harder to maintain. We need a healthy upstream project and people willing to do the actual maintenance work in Debian.

To decide or not to decide

Gröber replied that building a community around networking in Debian should happen in public. "Who knows the publicity might just spawn some (much needed) new contributors." He was "a bit peeved" that the Pandora's Box of removing ifupdown had been opened because it impacted his ability to find people willing to work on preserving it. It was also wrong to put time pressure on deciding, he said, when release dates for trixie were not even announced. "Remember: it's done when it's done. This is a big decision let's do it right."

The discussion continued with a number of people chiming in with support for their favored networking tools. It also featured a fair amount of lobbying for Netplan by Märdian, though some were not convinced. For that matter, no consensus or plan of action was reached—nor even an agreement that one was actually needed. Thomas Goirand strongly opposed the idea that it should be decided by a vote, though. "We do not need another systemd vote". He recommended involving Debian's technical committee "if we can't agree reasonably, rather than yet-another-GR".

Ruano Rincón came back to the conversation on September 8, and thanked Märdian for taking care of the topic. He reiterated his interest in replacing ifupdown with ifupdown-ng as soon as possible. The choice of a default network-configuration tool was a secondary question, but he suggested that Debian would benefit from "moving to something shared with other distros". He said that providing users with a more modern, better-maintained tool was more important than retaining Debian's identity through its choice of networking tool.

But he was unprepared to give a recommendation about the right tool to choose. He had not had time to do a thorough review and comparison of the tools, so he was not ready to give his opinion on how to move forward yet. "Just please, let's make this the last thread about the default network stack in future Debian releases."

Thus far, there is no further discussion on the topic and the only consensus seems to be that "classic" ifupdown's days are numbered. Just how numbered, and what will replace it, seem no clearer today than when the discussions began in July. One hopes that the final outcome will be a clear consensus on Debian's default networking configuration tool or tools, as well as who makes that choice in the future.

Comments (53 posted)

Fedora evicts WolfSSL

By Joe Brockmeier
September 16, 2024

The Fedora Engineering Steering Committee (FESCo) has voted to immediately remove the WolfSSL package from all of Fedora's repositories due to its maintainer failing to gain approval to package a new cryptography library for Fedora. WolfSSL's brief journey through Fedora's package system highlights gaps in documentation, as well as in the package‑review process. The good news is that this may stir Fedora to improve its documentation and revive a formal security team.

Fedora and cryptography

Fedora's packaging guidelines require that every application entering Fedora be checked for compliance with the policies on cryptography, but those policies could be written more clearly and are in need of an update. For example, the crypto policies say that new libraries "must comply with the crypto policies to enter Fedora", which seems oddly circular since the reader would likely think that is what they are reading. However, that is meant to be a reference to Fedora's crypto‑policies project; new crypto libraries must have full integration with that system.

The crypto-policies project, maintained by Alexander Sosedkin, is an effort to unify the crypto policies for the whole distribution and also simplify the management of crypto applications and libraries. This means, in part, that Fedora has a limited set of approved crypto "back-ends" such as OpenSSL, GnuTLS, and Libreswan.

Fedora users can set system-wide crypto policies, such as a legacy policy when older encryption algorithms are needed for compatibility or the FIPS policy for conformance with FIPS 140 requirements. This system was adopted with the Fedora 21 release in 2014. The change proposal has a description of the system, and the crypto-policies man page describes the available policies and tools. A new crypto library would need to integrate with this system, but it would first have to be accepted into Fedora at all.

Crypto libraries new to Fedora are required to get approval before being added, though the documentation does not do a great job of describing that process. Even though the packaging guidelines are a bit confusing, it should be clear enough that packagers need to consult with the Fedora security team, and then gain an exception from the Fedora packaging committee before being added to the repositories. There is one minor problem with this, though: the Fedora security team has been defunct for a while, and the policies have not been updated to reflect this.

WolfSSL

This posed difficulties for Andrew Bauer, who wanted to package WolfSSL, an SSL/TLS library written in C, as a dependency for another package he maintains, Netatalk. That software is an open-source implementation of the Apple Filing Protocol, which allows Linux (and other) systems to act as AppleShare file servers for new and old Apple systems. Really old, in fact—Netatalk version 2 supports systems as far back as the Apple II and version 3 supports Macs running Mac OS 7.5 (released in 1994) through today's Macs.

The Netatalk project had dropped OpenSSL in favor of WolfSSL with its 3.2.0 release in June because OpenSSL 3.0 had deprecated encryption algorithms that it needed to use for authentication with ancient Mac OS clients. This meant that Bauer could no longer rely on OpenSSL to fill the needs of the Netatalk package in Fedora, and needed to package WolfSSL. However, WolfSSL would need special approval to enter Fedora beyond simple package approval.

With Netatalk requiring WolfSSL, Bauer created a package and filed a review request on August 3. As part of the request, he included the rpmlint output that had flagged the package as non-compliant with Fedora's packaging policy. Bauer said:

Grepping the source code, one can see that wolfssl calls wolfSSL_CTX_set_cipher_list() rather than SSL_CTX_set_cipher_list(). Thus, this is a false positive and can be ignored.

And ignore it he did. So did Felix Wang, who reviewed the package and approved it on August 17. It might have helped if the error had been more explicit, or if it were listed among the common rpmlint issues in the package documentation. An issue was opened about the lack of description in 2023. The error is meant to alert the user that the application may not be using the system-wide crypto policy. FESCo member Fabio Valentini replied to the WolfSSL review‑request ticket to say that it would require further review and pointed to the crypto policies.

Security team snafu

So, Bauer sought help via the fedora‑devel list on August 18 to figure out how to contact the non-existent Fedora security team per the packaging guidelines. Neal Gompa responded that the "formal security team disbanded many years ago". He suggested that Bauer visit the Fedora security team room on Matrix, which Bauer did. (Note that this link and several others require logging into a Matrix server. It is possible to log in without creating an account: click "Continue", select the Element client, and then click "Continue in your browser".)

He said that "a question was asked whether this package requires approval from Fedora Security". He got a quick suggestion from Demi Marie Obenour about compilation options that did not address the question of approval directly. He thanked Obenour, and seems to have left the channel before he received messages about Fedora's crypto policies or suggesting that new crypto libraries would require a change request. Bauer plunged ahead and updated the review ticket to report he'd been given advice on building WolfSSL, and was requesting a package repository now. The WolfSSL package entered the testing repositories for EPEL 9, Fedora 39, 40, 41 (which is still in development), and rawhide on August 20.

Valentini replied to Bauer's email on fedora-devel saying that the question of whether WolfSSL complies with system crypto libraries had not been answered. He was unhappy that it had already been imported into Fedora. Daniel P. Berrangé wrote that it seemed non-compliant and that a Fedora packaging committee (FPC) exception is required.

More than two weeks later, Valentini followed up in the review ticket and said it looked like Bauer had ignored his earlier comment about crypto compliance completely. The two went back and forth a bit. Bauer again complained the documentation was unclear and said: "Once we identify the source of authority, looks like that's Felix, I'll be glad to [work] through this."

Retirement

Instead of debating further with Bauer, Valentini opened a ticket with FESCo about the matter. He noted that "the reviewer of the package (not the submitter)" had filed a request with the packaging committee to approve the package, but it had already made its way into Fedora's stable repositories.

Stephen Gallagher said that he'd spoken with the crypto-policies maintainer and members of Red Hat's security team. Their opinion was that WolfSSL should not be permitted in Fedora without honoring crypto-policies. Gallagher proposed that FESCo vote on the following:

WolfSSL is immediately retired from Fedora. The maintainers may file a new package review request when WolfSSL respects the crypto system policy. This review request must be presented to the FPC, who must approve it before it is added back to the repositories.

On September 10, Valentini wrote that Gallagher's proposal had been discussed, voted on, and approved by FESCo. Five votes were in favor of the proposal, none against or abstaining.

Typically, packages are only retired from rawhide or branched versions of rawhide that are in the process of becoming the next stable Fedora release but have not entered final freeze. Retiring packages from stable versions of Fedora is a rare step, and usually only undertaken when a package has licensing problems. Once the package is retired, it is removed from subsequent composes and will be removed from the archives. It will not be removed from systems it was installed on unless it is also added to the fedora-obsolete-packages package. That has not been done in this case. As of this writing, the package is still installable, but it should be disappearing soon.

Aftermath

There are several points when Bauer clearly should have slowed down and gotten the message that he needed to consult Fedora's packaging committee. If nothing else, Valentini was waving a caution flag that Bauer should have heeded.

Still, he is not wrong that the documentation leaves much to be desired in terms of clarity and completeness. And, as Matthew Miller noted in the FESCo ticket, the security team is in a "disordered state" and it may be time to "reorganize and re-formalize the security team". Michel Lind commented that he had been sounding out frequent posters in the Matrix security room about improving things, so the project may be gathering steam to put a real team in place, once again, to avoid a recurrence of non-packagers giving random advice that is taken as official. It might also behoove Fedora to consider whether a chat room is a real substitute for an email list that does not emphasize real-time communication or require a special client to use.

Fedora's policies may also need some updating to take into account use cases that require special consideration for compatibility with obsolete hardware and software. It is a net good that Fedora users can use software like Netatalk to connect with classic computers, and it would be a shame if Fedora's policies blocked such software permanently. In the end, no real harm has been done, and Bauer's over-eagerness has helped the Fedora project identify a few real problems to address.

Comments (45 posted)

Kernel developers at Cauldron

By Jonathan Corbet
September 18, 2024

Cauldron
A Linux system is made up of a large number of interdependent components, all of which must support each other well. It can thus be surprising that the developers working on those components do not, it seems, often speak with each other. In the hope of improving that situation, efforts have been made in recent years to attract toolchain developers to the kernel-heavy Linux Plumbers Conference. This year, though, the opposite happened as well: the 2024 GNU Tools Cauldron hosted a discussion where kernel developers were invited to discuss their needs.

David Malcolm started the discussion by asking whether there is interest in performing more static analysis on the kernel. Steve Rostedt pointed out some of the tools that are used for that purpose now, noting that sparse is useful for checking pointer annotations in the kernel. It can, for example, find code that does not treat user-space pointers with appropriate caution or follow the read-copy-update locking rules. David Faust said that there had been a proposal to incorporate the sparse annotations into BPF type format (BTF) tags, which might be possible with the help of C2x attributes.
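
As a reminder of what such an annotation looks like, here is a small sketch of the __user marking that sparse checks; the function itself is made up for illustration:

    #include <linux/uaccess.h>
    #include <linux/errno.h>

    /* __user marks a pointer that holds a user-space address; sparse warns
     * if it is dereferenced directly rather than accessed through the
     * proper helpers. */
    static int read_user_flag(const int __user *uptr)
    {
            int val;

            if (copy_from_user(&val, uptr, sizeof(val)))
                    return -EFAULT;
            /* return *uptr;   <- sparse would flag this direct dereference */
            return val;
    }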

Rostedt suggested that this kind of annotation could have helped to find a recent BPF bug. The BPF verifier was unaware of the fact that some tracepoints could fire with a null pointer value passed in and, as a result, did not require BPF programs attached to those tracepoints to check for null. That meant that some BPF programs were able to crash the kernel, which is not supposed to be possible. A "could be null" annotation could help the verifier in such situations.

Malcolm pointed out that GCC has supported a nonnull attribute for a long time, but that is the opposite of the needed "might be null and must be checked" meaning. It is an optimization-related feature, and not really suited for this purpose. As the discussion moved on, Malcolm said that he had created a tracker bug on using GCC to run static analysis on the kernel.
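
For reference, the existing attribute looks like the sketch below (the function is hypothetical); the "might be NULL and must be checked" counterpart discussed in the session does not yet exist:

    #include <string.h>

    /* nonnull tells GCC that the first argument is never NULL; the compiler
     * may warn about NULL arguments at call sites and may even remove
     * "redundant" NULL checks, which is why it cannot express "may be NULL,
     * check before use". */
    __attribute__((nonnull(1)))
    static size_t name_length(const char *name)
    {
            return strlen(name);
    }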

José Marchesi brought up the topic of the struct layout randomization GCC plugin that can be used to build the kernel. There is a problem, in that the randomization happens after the creation of debug data; if a structure's layout is reordered, the debug information will be incorrect. Clang, instead, orders the work correctly and does not have this problem. Segher Boessenkool asserted that the plugin is simply broken, and Sam James said that the plugin situation in general is "a mess". There is evidently a fix in circulation that depends on a new plugin hook for GCC. But I pointed out that the kernel project would like to move away from GCC plugins entirely in favor of having the necessary features supported directly by the compiler. Fixing the plugin would be welcome, but replacing it with a proper implementation would be better.

There was some discussion about the value of struct layout randomization in general, with some calling it "security through obscurity". Rostedt, though, defended the technique as a useful way to limit the effectiveness of exploits. Bradley Kuhn said that there are some "serious licensing issues" around some of the GCC plugins used by the kernel, with some users getting "aggressive emails" from the original author of some of that work. It would be far better, he said, to rewrite that functionality from scratch, built into the compiler.

It was also mentioned that layout reordering could be used to optimize structures by eliminating internal holes. Boessenkool pointed out that this would violate the standard, which requires the address of a structure to be equal to the address of its first element. That seems like a price that some users would be willing to pay for a useful (and optional) feature, though. This part of the discussion concluded that struct layout reordering would be useful to have in GCC, but nobody said that they would work to actually implement it.
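
The constraint Boessenkool cited is the C rule that a pointer to a structure, suitably converted, points to its first member; a minimal illustration (not taken from the discussion):

    #include <assert.h>
    #include <stddef.h>

    struct message {
            short type;          /* padding follows on most ABIs */
            long long payload;
    };

    int main(void)
    {
            struct message m;

            /* Guaranteed by the standard: the structure and its first member
             * share an address, so moving "type" to close the hole would be
             * observable to conforming programs. */
            assert((void *)&m == (void *)&m.type);
            assert(offsetof(struct message, type) == 0);
            return 0;
    }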

Next on the list was the unwinding of user-space stacks in the kernel, possibly by making use of SFrame data. There is interest in possibly moving the kernel over to SFrame from the ORC format used by the kernel now, Rostedt said. User-space unwinding would be useful for the generation of profiles that include both the user and kernel sides of the equation. It would be implemented by setting a task flag in the kernel saying that a user-space stack trace is needed; that trace would be generated just prior to returning to user space from the kernel.

Creating that trace would require access to user-space SFrame data from within the kernel, though. That, in turn, could require the addition of a system call for user space to provide that data. To complicate things, the SFrame situation could change rapidly over time, especially in processes running a just-in-time compiler. So perhaps the SFrame data would be stored in a memory region that is shared between the kernel and user space, eliminating the need to make a system call every time something changes.

The final topic that was discussed was a desire to obtain hints from the compiler about functions that do not return or which contain jump tables. As Josh Poimboeuf explained, this information is needed to make the kernel's objtool utility work properly on 64-bit Arm systems; that, in turn, is needed to support live patching. Indu Bhagat said that this information could be useful for some control-flow-integrity applications as well. The discussion wandered inconclusively for a while after that, with no clear solution identified.

The session did not manage to address half of the potential subjects that had been listed at the beginning. It did show, though, that there is value in getting groups of developers to talk with each other about their needs and wishes. The discussion will continue at the Linux Plumbers Conference in Vienna and, presumably, at future events as well.

[ Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event. ]

Comments (23 posted)

A discussion of Rust safety documentation

By Daroc Alden
September 17, 2024

Kangrejos 2024

Kangrejos 2024 started off with a talk from Benno Lossin about his recent work to establish a standard for safety documentation in Rust kernel code. Lossin began his talk by giving a brief review of what safety documentation is, and why it's needed, before moving on to the current status of his work. Safety documentation is easier to read and write when there's a shared vocabulary for discussing common requirements; Lossin wants to establish that shared vocabulary for Rust code in the Linux kernel.

Safety documentation has two parts, Lossin explained: requirements and justifications. Requirements are comments attached to functions that have been marked unsafe, and explain what must be true for the function to be used safely. He gave the example of the Arc::into_raw() and Arc::from_raw() functions that convert between a reference-counted smart pointer (Arc) and a plain pointer. For example, from_raw() must be called once for each call to into_raw() on a given allocation; otherwise, the reference count will be incorrect. Also, from_raw() must be given a pointer that really did come from into_raw(), or it will do bad things to whatever object is being pointed to when the Arc is dropped and the reference count is decremented.

[Benno Lossin]

Those requirements should be spelled out in a safety comment attached to the function, Lossin said, so that users of the function know how to use it correctly. Furthermore, uses of that function should also have the second kind of safety documentation: justifications. Having a comment explaining why the function's requirements hold at that particular call site makes it easier for reviewers to check whether the author of a patch really got their logic right, which prevents mistakes.

Lossin briefly addressed the objection that these kinds of comments are not traditional in kernel C code. He said that Rust has "higher stakes" — since Rust's safety is its primary reason to exist, the Rust-for-Linux folks had better make it easy to write correct Rust code. Also, Rust has some more complex language features, such as smart pointers and compile-time lifetime tracking, that can be harder to work with when writing low-level unsafe code. Finally, actually writing out one's assumptions can sometimes catch errors that would otherwise have gone unnoticed.

Once everyone had been brought up to date on what Lossin meant by safety documentation, he talked about the current status of his effort to document standards for safety documentation. He pointed out that safety documentation has actually been required by reviewers since the beginning; what he's asking for is to standardize the wording and formatting to make those comments easier to understand.

Almost all existing unsafe blocks in the kernel have associated documentation. Soon, the project will enable a lint to warn about undocumented unsafe blocks. This is good, Lossin said, but there's a catch: the comments do not always use the same terminology, which can make it difficult to know whether they're correct.

For example, some comments say that their function needs a "valid pointer" and some say that it needs a "valid, non-null pointer". But all valid pointers in Rust are non-null. So is this just two ways of saying the same thing? Or is one of the comments using "valid" in a way that's not consistent with Rust's language documentation? It's impossible to tell what the actual requirements are without reading the code, which rather defeats the point of having concise safety documentation to read, Lossin said.

Ideally, all of the comments would be correct, complete, and easy to understand. That's easier to accomplish if there's a shared vocabulary for common conventions — an author shouldn't need to write "valid, non-null" when just "valid" will do. Lossin suggested that they might want to standardize a dictionary of common terms, so that authors can write as little as possible, but readers will still be able to understand. Plus, having an explicit resource saying how to read safety documentation will make it easier for learners to come up to speed, and reduce the chances of misunderstandings between maintainers.

Lossin ended the introductory portion of his talk by calling for people to read his RFC, and get on the Rust-for-Linux project's Zulip chat server to talk about it. Then, he opened things up for a discussion.

Richard Weinberger started by asking whether the safety documentation was meant to be human-readable or machine-readable. Lossin thought that was a good question for discussion, and tossed it back to the attendees. Andreas Hindborg, one of the organizers of Kangrejos, said "I like it when the documentation is human-readable".

Daniel Almeida agreed, asking why the project would want the documentation to be machine-readable. Lossin suggested that it might be possible to have tooling process it, for formal verification. Almeida objected, saying that the existing code linter can ensure that the comments are present, and asked what more was needed.

Lossin suggested that perhaps they could write tooling to check the comments for correctness. Paul McKenney pointed out that if it were both human- and machine-readable, there could be tooling that would expand terse comments into a more detailed form using the dictionary Lossin had composed.

Other attendees remained skeptical. One person pointed out that unsafe code is used exactly when Rust cannot statically guarantee something is safe — so trying to run static analysis on those parts sounds like a fool's errand. Lossin objected that they could simplify common checks, such as checking that a pointer is valid, or check kernel-specific invariants that the wider Rust world doesn't care about. He also pointed out that the Rust compiler has macros that can be used for annotations — those could be used to attach checks to different functions. The downside is that it would raise the difficulty of writing kernel Rust code, something that he was worried about.

McKenney thought that even "stupid" formal verification can be helpful for catching mistakes. Even something as simple as checking that the requirement and justification comments match, somehow, would be valuable.

Weinberger pointed out that the kernel already has sparse, which works on C code, and that could potentially be expanded to Rust. Hindborg noted that there are people using formal verification on some drivers, such as checking that there are no writes outside of a DMA buffer. Formally verifying Rust code has, in the past, proved to be easier than formally verifying C code.

Despite the possibility of improved tooling, Hindborg was against using machine-readable comments, saying that the Rust-for-Linux project is already bringing in a new language that's hard to accept. Adding some kind of formal notation on top of that will not go well.

Lossin noted that you could make English machine-readable by restricting the grammar and vocabulary. He also pointed out that there are existing tools for formal verification. Despite that, he thought that it made more sense to "surgically insert" formal verification only where it would matter the most, making it opt-in.

Almeida thought that even Lossin's idea of having a standardized dictionary of terms might be a step too far, saying that it could make it harder for people to contribute. Lossin suggested letting contributors know that they can ask for help when writing the comments during the review process if they have trouble with the dictionary.

Miguel Ojeda raised the idea of having different standards for core and leaf functions — the Rust-for-Linux developers could set an example with the core APIs, and let that propagate. Lossin said that the normal documentation already works like that; the kernel crate is entirely documented (as is the Rust standard library), but it's up to individual subsystem maintainers to pick a standard that works for them. Since, ideally, most drivers will not need to use unsafe Rust, maybe the safety documentation is only needed for the kernel crate.

José Marchesi asked how much of the information Lossin wanted comments to capture could instead be expressed in Rust itself. Lossin asked whether he meant in the type system or at run time, and Marchesi clarified that he meant at run time. In Ada, he said, there can be preconditions to a function, and he wondered if Rust had anything similar.

Lossin agreed that it could be a good option to add Rust macros for that, if it could be made to work. Ojeda noted that some formal-verification people are already working on this, and may eventually get pre- and post-conditions into the language itself. But he didn't think that obviated the need for comments.

Gary Guo raised the topic of foreign functions — sometimes, a function is unsafe to call only because it is a foreign function, and there aren't really any requirements to satisfy before using it. Lossin thought that the real problem there was the lack of documentation of the kernel's C functions. In order to know whether the FFI function is really safe, you need to look at the C code. He suggested that perhaps it could make sense to document foreign functions in the C code, and have bindgen (the C-to-Rust bindings generator) transfer the documentation. Guo said that there were a lot of functions that do not really require documentation, and that documenting things in multiple places would certainly not be helpful.

Maciej Falkowski asked why it was necessary to have obvious code accompanied by a safety comment. Lossin said that this gets into why safety documentation is needed in the first place. In C, the programmer needs to argue that the whole program is correct. Having unsafe blocks reduces that global problem to a local property, Lossin said. So if the unsafe blocks are small, only a small amount of context is needed to make sure it's right — which makes the whole system more correct. From that point of view, safety documentation is actually better if it's obvious, because that lets maintainers and people working on the code later have more confidence that the code is correct.

Overall, the attendees were mostly sympathetic to the need for standardized safety documentation, although there was clearly still some disagreement around exactly what form that should take, and whether it should be used as the basis for new tooling. Eventually, the session ran out of time.

[ Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our coverage of Kangrejos. ]

Comments (50 posted)

Some 6.11 development statistics

By Jonathan Corbet
September 16, 2024
The 6.11 kernel was released on September 15 after a typical nine-week development cycle. This release integrates 13,890 non-merge changesets, so it was a moderately busy cycle, slightly more so than 6.10 was. With a new release comes a new round of development statistics; read on for the details.

This release was contributed to by 1,970 developers, of whom 250 were first-time contributors. The most active contributors this time around were:

Most active 6.11 developers

By changesets
  Jeff Johnson              282    2.0%
  Krzysztof Kozlowski       266    1.9%
  Jani Nikula               228    1.6%
  Kent Overstreet           169    1.2%
  Ville Syrjälä             161    1.2%
  Christoph Hellwig         140    1.0%
  Dmitry Baryshkov          129    0.9%
  Michal Wajdeczko          128    0.9%
  Johannes Berg             125    0.9%
  Andy Shevchenko           114    0.8%
  Wolfram Sang               98    0.7%
  Thomas Zimmermann          94    0.7%
  Frank Li                   87    0.6%
  Dr. David Alan Gilbert     82    0.6%
  Sean Wang                  82    0.6%
  Douglas Anderson           76    0.5%
  Bartosz Golaszewski        72    0.5%
  Geert Uytterhoeven         71    0.5%
  Konrad Dybcio              70    0.5%
  Uwe Kleine-König           69    0.5%

By changed lines
  Aurabindo Pillai       227656   22.3%
  Hawking Zhang           83481    8.2%
  Ian Rogers              78043    7.7%
  Likun Gao                8820    0.9%
  Alexander Duyck          7908    0.8%
  Benjamin Tissoires       7685    0.8%
  Bitterblue Smith         7597    0.7%
  Ping-Ke Shih             7534    0.7%
  Eric Biggers             7375    0.7%
  Bartosz Golaszewski      7095    0.7%
  Christophe Leroy         6612    0.6%
  Kent Overstreet          6445    0.6%
  Johannes Berg            6320    0.6%
  Maxime Ripard            5627    0.6%
  Lorenzo Bianconi         5578    0.5%
  Michal Wajdeczko         5499    0.5%
  Frank Li                 5370    0.5%
  Dmitry Baryshkov         5324    0.5%
  Stefan Herdler           5054    0.5%
  Danila Tikhonov          5025    0.5%

The most prolific contributor of changesets this time around was Jeff Johnson, whose work consisted almost entirely of adding MODULE_DESCRIPTION() lines to modules that were lacking them. Krzysztof Kozlowski continued a long-running series of cleanups in many parts of the driver tree. Jani Nikula worked extensively in the graphics subsystem (and i915 driver specifically), Kent Overstreet continued to work to stabilize the bcachefs filesystem, and Ville Syrjälä joined Nikula in i915 driver work.

In the "changed lines" column, Aurabindo Pillai contributed 27 commits adding yet another big pile of amdgpu register definitions; Hawking Zhang's 21 commits made that pile even bigger. Ian Rogers added another set of perf vendor-event definitions. Likun Gao also worked on the amdgpu driver, and Alexander Duyck added the fbnic network driver.

The top testers and reviewers this time around were:

Test and review credits in 6.11

Tested-by
  Daniel Wheeler              331   22.5%
  Philipp Hortmann             66    4.5%
  Neil Armstrong               51    3.5%
  Babu Moger                   33    2.2%
  Pucha Himasekhar Reddy       29    2.0%
  Heiko Stuebner               18    1.2%
  Claudiu Beznea               17    1.2%
  Amit Pundir                  17    1.2%
  Nicolin Chen                 16    1.1%
  Chandan Kumar Rout           16    1.1%
  Tao Liu                      16    1.1%
  Miguel Luis                  15    1.0%
  Bryan O'Donoghue             13    0.9%
  Andrew Halaney               13    0.9%
  Sujai Buvaneswaran           13    0.9%

Reviewed-by
  Dmitry Baryshkov            243    2.6%
  Rodrigo Vivi                186    2.0%
  Krzysztof Kozlowski         181    1.9%
  Konrad Dybcio               165    1.7%
  Simon Horman                146    1.5%
  Christoph Hellwig           143    1.5%
  Jani Nikula                 132    1.4%
  Hawking Zhang               127    1.3%
  David Sterba                121    1.3%
  Rob Herring (Arm)           121    1.3%
  AngeloGioacchino Del Regno   97    1.0%
  Ilpo Järvinen                96    1.0%
  Linus Walleij                95    1.0%
  Neil Armstrong               93    1.0%
  Laurent Pinchart             89    0.9%

As always, Daniel Wheeler tests AMD graphics patches at a rate of about five per day. Other testers are somewhat less prolific, but their work is equally valuable. On the review side, Dmitry Baryshkov has been busy with numerous mobile drivers, Rodrigo Vivi reviewed lots of i915 graphics-driver patches, and Kozlowski reviewed many devicetree changes.

Looking at the Signed-off-by tags added to patches can yield some interesting insights. Specifically, tags added by people other than the author track the handling of patches, especially the point where any given patch turns into a commit in some repository. Those non-author signoffs show us who the gatekeepers to the kernel are. In 6.11, the most non-author signoffs came from:

  Who                     Signoffs         Subsystem
  Alex Deucher            1034    8.0%     AMD graphics
  Jakub Kicinski           581    4.5%     Networking
  Andrew Morton            560    4.4%     Memory management
  Mark Brown               531    4.1%     Regulator, sound, SPI
  Bjorn Andersson          484    3.8%     Qualcomm
  Greg Kroah-Hartman       425    3.3%     Drivers
  Hans Verkuil             239    1.9%     Media
  David S. Miller          234    1.8%     Networking
  Jonathan Cameron         231    1.8%     Industrial I/O
  Jens Axboe               227    1.8%     Block, io_uring
  Lee Jones                177    1.4%     LED, MFD
  Paolo Abeni              170    1.3%     Networking
  David Sterba             161    1.3%     Btrfs
  Kalle Valo               154    1.2%     WiFi
  Johannes Berg            144    1.1%     WiFi
  Bjorn Helgaas            139    1.1%     PCI
  Krzysztof Wilczyński     132    1.0%     PCI
  Shawn Guo                131    1.0%     NXP devicetree
  Namhyung Kim             131    1.0%     Perf
  Christian Brauner        131    1.0%     Filesystems

This table has changed a bit over time. Networking was once concentrated under a single maintainer, and thus often appeared at the top of the list; that maintainership has now been split across multiple developers. Greg Kroah-Hartman's traditional position near the top of the table has been ceded to others, at least for now, as churn in the staging tree has decreased.

Graphics drivers again led to a position at the top of the list. Interestingly, though, AMD graphics is represented here, but Intel graphics is not. That is because of the more distributed nature of maintainership on the Intel side. As we saw above, Nikula and Syrjälä both contributed many i915 graphics changes. But, since they committed those changes to the relevant repositories themselves, no other developer's signoff appears there. The Intel graphics subsystem is nearly unique in operating this way.

Associating non-author signoffs with employers yields this result:

  Google                  1477   11.5%
  Intel                   1350   10.5%
  AMD                     1327   10.3%
  Meta                    1060    8.2%
  Red Hat                  875    6.8%
  Linaro                   846    6.6%
  Qualcomm                 707    5.5%
  Arm                      667    5.2%
  Linux Foundation         503    3.9%
  (Unknown)                451    3.5%
  SUSE                     307    2.4%
  (None)                   281    2.2%
  Huawei Technologies      276    2.1%
  Cisco                    239    1.9%
  IBM                      223    1.7%
  NVIDIA                   200    1.6%
  Microsoft                196    1.5%
  Oracle                   168    1.3%
  MediaTek                 145    1.1%
  Texas Instruments        137    1.1%

[Note: an error in the above table was corrected on September 23.] About half of all commits going into the mainline pass through maintainers working for just five companies. This situation has been stable for many years, though the specific companies involved have changed somewhat over time.

Work on 6.11 was supported by 213 companies that we were able to identify. The most active of those companies were:

Most active 6.11 employers

By changesets
  Intel                   2045   14.7%
  AMD                     1237    8.9%
  (Unknown)                971    7.0%
  Google                   897    6.5%
  Linaro                   884    6.4%
  Red Hat                  647    4.7%
  (None)                   621    4.5%
  Qualcomm                 601    4.3%
  SUSE                     355    2.6%
  Renesas Electronics      305    2.2%
  NVIDIA                   283    2.0%
  IBM                      278    2.0%
  Huawei Technologies      274    2.0%
  Oracle                   257    1.9%
  Meta                     248    1.8%
  NXP Semiconductors       236    1.7%
  (Consultant)             221    1.6%
  Texas Instruments        175    1.3%
  BayLibre                 167    1.2%
  MediaTek                 145    1.0%

By lines changed
  AMD                   361622   35.5%
  Google                113096   11.1%
  Intel                  85054    8.3%
  (Unknown)              67772    6.6%
  Red Hat                36435    3.6%
  Linaro                 32680    3.2%
  Qualcomm               29029    2.8%
  (None)                 24823    2.4%
  Meta                   16056    1.6%
  NXP Semiconductors     14284    1.4%
  Realtek                13283    1.3%
  Collabora              11602    1.1%
  Oracle                 10985    1.1%
  NVIDIA                 10978    1.1%
  Renesas Electronics    10473    1.0%
  SUSE                    9971    1.0%
  Texas Instruments       9417    0.9%
  MediaTek                8459    0.8%
  IBM                     8232    0.8%
  ST Microelectronics     6991    0.7%

At this point, nearly 25% of the commits landing in the mainline came from developers working for just two chip manufacturers — Intel and AMD — and, as we have seen, quite a bit of their work is focused on keeping their graphics drivers working. Beyond that, there is not much that is noteworthy in the above numbers.

As of this writing, there are fewer than 10,000 commits in linux-next, suggesting that the 6.12 development cycle will be a relatively slow one, at least with regard to changeset counts. There are some significant changes on deck for that release, though. LWN will, of course, follow the development of that release as it happens.

(As a reminder, LWN subscribers can get the above information and more at any time by way of the LWN Kernel Source Database).

Comments (2 posted)

The RCU API, 2024 edition

September 13, 2024

This article was contributed by Paul McKenney

Read-copy-update (RCU) is a synchronization mechanism that was added to the Linux kernel in October 2002. RCU is most frequently used as a replacement for reader-writer locking, but is also used in a number of other ways. This mechanism is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast and also permits RCU readers to accomplish useful work even when running concurrently with RCU updaters. Those wishing an in-depth introduction to RCU are invited to consult the LWN series here, here, and here.
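
For readers who want a quick reminder of the basic pattern before diving into the API changes, here is a minimal sketch of RCU used in place of reader-writer locking; the structure, functions, and lock are made up for illustration:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>
    #include <linux/errno.h>

    struct config {
            int threshold;
    };

    static struct config __rcu *cur_config;
    static DEFINE_SPINLOCK(config_lock);    /* serializes updaters only */

    /* Reader: cheap, and never blocked by the updater. */
    static int get_threshold(void)
    {
            struct config *c;
            int ret = 0;

            rcu_read_lock();
            c = rcu_dereference(cur_config);
            if (c)
                    ret = c->threshold;
            rcu_read_unlock();
            return ret;
    }

    /* Updater: publish a new version, wait for readers, free the old one. */
    static int set_threshold(int value)
    {
            struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
            struct config *oldc;

            if (!newc)
                    return -ENOMEM;
            newc->threshold = value;

            spin_lock(&config_lock);
            oldc = rcu_dereference_protected(cur_config,
                                             lockdep_is_held(&config_lock));
            rcu_assign_pointer(cur_config, newc);
            spin_unlock(&config_lock);

            synchronize_rcu();      /* wait for pre-existing readers */
            kfree(oldc);
            return 0;
    }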

This article covers recent changes to the RCU API; it was contributed by Paul McKenney, Boqun Feng, Frederic Weisbecker, Joel Fernandes, Neeraj Upadhyay, and Uladzislau Rezki.

Although the basic idea behind RCU has not changed during the three decades following its introduction into DYNIX/ptx, the RCU API has evolved significantly since the 2010, 2014, and 2019 editions of the Linux-kernel RCU API. The most recent five years of this evolution is documented by the following sections.

  1. Summary of RCU API changes
  2. How did those 2019 predictions turn out?
  3. What next for the RCU API?

If that is not enough RCU for you, there are a lot more details to be found in the associated background-material article.

But first, a few announcements:

  1. People familiar with the 2019 edition of the RCU API will note several additional names on the byline. These people have taken up the challenge of learning the various Linux-kernel RCU implementations, each investing significant time over a period of years. Not that Paul intends to go anywhere anytime soon, his father having taken early retirement only recently, but Mother Nature might force him into her retirement program at any time. Paul therefore asks you to please take them seriously, because if something happens to him, they are your Linux-kernel RCU maintainers.
  2. It is wise to CC rcu@vger.kernel.org on any RCU-related email that might otherwise have been sent privately to Paul, who is likely to become less aggressive about checking email during off-hours.
  3. In a not-unrelated change, the source of truth for RCU commits has moved from Paul's venerable -rcu tree to a shared RCU tree. His -rcu tree will continue to exist, but will be one of several feeding into the shared RCU tree and elsewhere.

Summary of RCU API changes

These sections summarize some of the most visible changes to RCU:

  1. Lazy asynchronous RCU grace periods
  2. Reworking of kfree_rcu()
  3. Polled RCU grace periods
  4. Tasks Rude RCU and Tasks Trace RCU
  5. New SRCU read-side critical sections
  6. Read-side guards
  7. RCU callback dynamic (de-)offloading
  8. Miscellaneous

Lazy asynchronous RCU grace periods

The addition of lazy RCU grace periods was prompted by energy-efficiency concerns on battery-powered platforms, most notably Android and ChromeOS. This laziness affects callbacks queued via call_rcu(), and is not to be confused with the lazy processing of RCU callbacks from kfree_rcu(), which is covered in the next section.

New-age lazy call_rcu() grace periods are controlled at build time by a new CONFIG_RCU_LAZY kernel configuration option and, in kernels built with CONFIG_RCU_LAZY=y, at run time by a new rcutree.enable_rcu_lazy kernel boot parameter. There is an additional CONFIG_RCU_LAZY_DEFAULT_OFF configuration option that changes the default setting of this new kernel boot parameter to be disabled. In other words, when the kernel boot parameter is not specified, kernels built with CONFIG_RCU_LAZY=y but without specifying CONFIG_RCU_LAZY_DEFAULT_OFF (or, equivalently, built with CONFIG_RCU_LAZY_DEFAULT_OFF=n) will have lazy call_rcu() callbacks, while kernels built with CONFIG_RCU_LAZY=y and CONFIG_RCU_LAZY_DEFAULT_OFF=y will have non-lazy (hurried) call_rcu() callbacks.

The kernel boot parameter overrides the configuration options, so that (for example) booting a kernel built with both CONFIG_RCU_LAZY=y and CONFIG_RCU_LAZY_DEFAULT_OFF=y with rcutree.enable_rcu_lazy=1 will nevertheless result in lazy call_rcu() callbacks. Lazy callbacks will wait up to ten seconds before starting a grace period, thus greatly reducing the number of grace periods (and their associated wakeups of idle CPUs) on an almost-idle device. However, call_rcu() callbacks will only ever be lazy on rcu_nocbs CPUs. The reason for this is that systems with non-rcu_nocbs CPUs consume more energy than their rcu_nocbs counterparts, so those concerned with battery lifetime would be well-advised to first enable rcu_nocbs. This can be done by building the kernel with CONFIG_RCU_NOCB_CPU=y and booting it with rcu_nocbs=all.

Testing of lazy call_rcu() grace periods showed well in excess of 10% improvements in energy efficiency.

Some uses of call_rcu() cannot tolerate laziness, for example, when the callback function does a wakeup. These uses should instead invoke the new call_rcu_hurry() function. Note that invoking call_rcu_hurry() on a given CPU will hurry along any earlier lazy callbacks that were previously queued using call_rcu().
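
A minimal sketch, not drawn from any actual kernel subsystem, may make the distinction concrete; struct foo and the foo_*() names below are hypothetical:

    #include <linux/completion.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
        struct rcu_head rh;
        struct completion done;  /* initialized elsewhere with init_completion() */
    };

    /* Freeing memory can tolerate a lazy, batched grace period. */
    static void foo_free_cb(struct rcu_head *rhp)
    {
        kfree(container_of(rhp, struct foo, rh));
    }

    /* A callback that performs a wakeup must not be delayed by laziness. */
    static void foo_wakeup_cb(struct rcu_head *rhp)
    {
        complete(&container_of(rhp, struct foo, rh)->done);
    }

    static void foo_retire(struct foo *fp, bool wakes_waiter)
    {
        if (wakes_waiter)
            call_rcu_hurry(&fp->rh, foo_wakeup_cb);  /* no laziness, please */
        else
            call_rcu(&fp->rh, foo_free_cb);          /* laziness is fine here */
    }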

When suspending or hibernating, it is also important for all callbacks to hurry. (But don't take our word for it, ask a user about adding a ten-second delay to suspend!) RCU therefore has an rcu_pm_notify() function that hurries callbacks at the start of a suspend or hibernation operation. This same function re-enables laziness when this operation completes.

Reworking of kfree_rcu()

From the viewpoint of pre-existing, working code, kfree_rcu() works just like it always did. However, under the covers, performance has increased significantly through use of pages of pointers to track the memory and through use of kfree_bulk(). Both of these changes greatly improve cache locality, which provides up to a 12% increase in performance. Performance was increased still further for battery-powered devices via carefully developed heuristics that govern how long a partially populated page of pointers waits before being submitted to RCU.

In addition, kfree_rcu() can now handle memory from kmem_cache_alloc(), not due to any changes to kfree_rcu(), but rather due to changes to kfree(). However, please note that rcu_barrier() does not wait for memory being freed via kfree_rcu(), so that there currently is no way to safely invoke kmem_cache_destroy() on module exit if that module ever used kfree_rcu() on memory from the corresponding kmem_cache structure. A fix is in the works. In the meantime, such modules should continue using call_rcu() for this use case.
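
In the meantime, a module following that advice might look something like the following minimal sketch, in which struct bar, bar_cache, and the bar_*() names are all hypothetical:

    #include <linux/module.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct bar {
        struct rcu_head rh;
        int payload;
    };

    static struct kmem_cache *bar_cache;

    static void bar_reclaim_cb(struct rcu_head *rhp)
    {
        kmem_cache_free(bar_cache, container_of(rhp, struct bar, rh));
    }

    static void bar_retire(struct bar *bp)
    {
        /* Not kfree_rcu(): rcu_barrier() would not be guaranteed to wait for it. */
        call_rcu(&bp->rh, bar_reclaim_cb);
    }

    static int __init bar_init(void)
    {
        bar_cache = KMEM_CACHE(bar, 0);
        return bar_cache ? 0 : -ENOMEM;
    }

    static void __exit bar_exit(void)
    {
        rcu_barrier();                  /* wait for all outstanding bar_reclaim_cb() calls */
        kmem_cache_destroy(bar_cache);  /* now safe: no callbacks still reference the cache */
    }

    module_init(bar_init);
    module_exit(bar_exit);
    MODULE_LICENSE("GPL");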

One disadvantage of kfree_rcu() is that an rcu_head structure must be embedded within the structure to be freed, costing 16 additional bytes on 64-bit systems. For example, a structure referenced by p where the rcu_head structure is in a field named rh can be RCU-deferred-freed using:

    kfree_rcu(p, rh)

This issue is addressed by a new kfree_rcu_mightsleep() function, which takes a single pointer to the beginning of the object to be freed, as in kfree_rcu_mightsleep(p), thus avoiding the need for that rcu_head structure and its 16 bytes of memory.

Of course, there is always a catch, and in this case the catch is that, as the name suggests, kfree_rcu_mightsleep() might sleep. In fact, if it is unable to allocate the memory needed to track the object being deferred-freed, it will simply invoke synchronize_rcu(), which blocks for some tens of milliseconds, and then directly invokes kfree(). In contrast, kfree_rcu() reacts to low memory by invoking call_rcu(), thus giving up cache locality but not caller-visible latency. As always, choose wisely!
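
The resulting tradeoff might be summarized by this minimal sketch, in which struct baz and the two retire functions are hypothetical:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct baz {
        struct rcu_head rh;  /* needed only by the double-argument kfree_rcu() */
        long data;
    };

    /* Usable from any context, including interrupt handlers; never sleeps. */
    static void baz_retire_atomic(struct baz *p)
    {
        kfree_rcu(p, rh);
    }

    /* Process context only: may block in synchronize_rcu() under memory pressure. */
    static void baz_retire_sleepable(struct baz *p)
    {
        kfree_rcu_mightsleep(p);
    }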

There are versions of the kernel that have a single-argument variant of kfree_rcu() instead of kfree_rcu_mightsleep(). This single-argument approach proved to be a serious mistake that led to subtle bugs in which people forgot to specify the second argument, which fails (but only sometimes) from atomic contexts such as interrupt handlers. Therefore, recent kernels provide only the double-argument variant of kfree_rcu(). Where the old single-argument version would have been used, use kfree_rcu_mightsleep() instead.

There are also kvfree_rcu() and kvfree_rcu_mightsleep() functions that operate on vmalloc() memory.

In short, if you have RCU callbacks that do nothing but kfree(), vfree(), or kmem_cache_free() to immortal kmem_cache structures, these functions might save you both CPU time and a few lines of code. Many of these have already been converted, courtesy of a Coccinelle script and patches from Julia Lawall, but new code is written all the time.

Polled RCU grace periods

The historic RCU API does quite a bit of work for the user, who can simply invoke synchronize_rcu() and, upon its return, know that all pre-existing readers are done. Or, if the synchronize_rcu() function's blocking is problematic, pass a pointer and a callback function to call_rcu() and upon invocation of that callback function, know that all pre-existing readers are done. Better yet, if that callback function is doing nothing but invoking kfree() or kmem_cache_free(), just pass a pointer and rcu_head offset to kfree_rcu(), and rely on RCU to do everything, including the freeing. Or even better still, if within a latency-tolerant, non-atomic context, dispense with the rcu_head and pass just the pointer itself to kfree_rcu_mightsleep().

However, as reported here, there are situations in which these conveniences become counterproductive. For example, in caching situations, it might be that an object that has been queued for deferred free is once again needed. In such cases, it might be helpful to cancel a synchronize_rcu(), prevent a call_rcu() callback from being invoked, or to prevent an object passed to kfree_rcu() or kfree_rcu_mightsleep() from being freed. For another example, in situations that invoke many instances of call_rcu() in short time periods, RCU's choice of software-interrupt context for callback invocation might be suboptimal. Unfortunately, providing for these situations would further complicate the RCU API and implementation, and would also result in performance degradation.

Quick Quiz 1: Why would call_rcu() be helpful in scheduling polling for a particular grace period?

(Click for answer) Because the callback passed to call_rcu() tends to be invoked shortly after a grace period has completed, which is an excellent time to do polling for the end of a grace period. Alternatively, if the RCU-callback softirq context is inconvenient, you can instead use queue_rcu_work() to schedule a workqueue handler to execute shortly after a grace period completes.

In recent Linux kernels, RCU instead provides a complete polling API for managing RCU grace periods. This permits RCU users to take full control over all aspects of waiting for grace periods and responding to the end of a particular grace period. For example, the get_state_synchronize_rcu() function returns a cookie that can be passed to poll_state_synchronize_rcu(). This latter function will return true if a grace period has elapsed in the meantime. The user may choose any convenient method to schedule the polling for the end of the grace period, from mod_timer() up to and including use of call_rcu() itself. Once the grace period in question has elapsed, the user may choose any convenient context from which to free memory, or to undertake whatever other processing is required.
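
For example, a cache that sometimes resurrects retired objects might use the polled API roughly as in the following sketch; struct cache_entry and the function names are hypothetical, and the list management is omitted:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct cache_entry {
        unsigned long gp_cookie;  /* cookie from get_state_synchronize_rcu() */
        void *data;
    };

    /* Called when an entry is removed from the cache; does not block. */
    static void cache_entry_retire(struct cache_entry *e)
    {
        e->gp_cookie = get_state_synchronize_rcu();
        /* ... place e on a to-be-freed list, from which it might be resurrected ... */
    }

    /* Called later from any convenient context, for example a timer or workqueue. */
    static bool cache_entry_try_free(struct cache_entry *e)
    {
        if (!poll_state_synchronize_rcu(e->gp_cookie))
            return false;  /* grace period not yet over; try again later */
        kfree(e->data);
        kfree(e);
        return true;
    }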

For a bit more background on RCU's polled grace-period API, please see Stupid RCU Tricks: Waiting for Grace Periods From NMI Handlers or slides 34-on in the Reclamation Interactions with RCU LSFMM+BPF 2024 presentation.

Tasks Rude RCU and Tasks Trace RCU

The 2019 edition of the RCU API described the addition of Tasks RCU for use by ftrace and kprobes. These facilities use trampolines containing tracepoint code, and Tasks RCU is used to synchronize removal of a trampoline with tasks that might still be executing within it. Because ftrace and kprobes trampolines never do context switches, nor do they invoke functions that do context switches, a voluntary context switch suffices as a Tasks RCU quiescent state. This edition describes Tasks Rude RCU, which was consolidated from an open-coded implementation provided by Steve Rostedt, and also Tasks Trace RCU, which was added for use by sleepable BPF programs that might block.

Tasks Rude RCU augments Tasks RCU by handling the idle tasks that Tasks RCU ignores, thus permitting trampolines to be installed in the idle loop. One could instead prohibit tracepoints and kprobes in the idle loop, but the increasing quantity of power-management code living there makes such a prohibition unpalatable.

Therefore, true to its name, Tasks Rude RCU uses the schedule_on_each_cpu() function to force a context switch on each CPU, and thus force each CPU out of the idle loop. Those using battery-powered systems might well consider the resulting wakeups of deep-idle-state CPUs to be quite rude, hence the name.

Like Tasks RCU, Tasks Rude RCU has no read-side markers. It has synchronize_rcu_tasks_rude() and call_rcu_tasks_rude() functions to wait for a grace period, synchronously and asynchronously, respectively. The rcu_barrier_tasks_rude() function waits for the invocation of all callbacks queued by previous invocations of call_rcu_tasks_rude(), which is needed when unloading modules such as rcutorture that invoke call_rcu_tasks_rude(). However, neither call_rcu_tasks_rude() nor rcu_barrier_tasks_rude() is used in current mainline, which is likely to lead to their removal sooner rather than later.

Peter Zijlstra and Thomas Gleixner reworked the x86 entry/exit code, using noinstr and inlining, so that any function that RCU is not watching cannot be traced. On any architecture where this is completed, where the CPU-hotplug architecture-specific offline code path has been addressed, and where the maintainers feel confident that it will stay completed, synchronize_rcu_tasks_rude() can become a no-op and call_rcu_tasks_rude() can invoke its callback immediately from a clean context.

BPF programs use a combination of RCU, Tasks RCU, and Tasks Rude RCU for various purposes, including synchronizing uses of and removals of trampolines. RCU is used to protect entire BPF programs, which works well, but which prohibits BPF programs from blocking. This prohibition in turn prevents BPF programs from unconditionally loading the contents of user-space memory, because user-space accesses might result in page faults. That, in turn, might result in blocking, waiting for that user-space data to be paged back in. Therefore, BPF programs have been given conditional access to user-space memory, which either completes the access or indicates failure.

These failure indications can be quite inconvenient, so a new special-purpose Tasks Trace RCU flavor has been created that allows limited blocking within its read-side critical sections. As such, Tasks Trace RCU can be thought of as a variant of sleepable RCU (SRCU) with low-overhead read-side markers. Although, like SRCU, the implementation can tolerate arbitrary blocking, by convention Tasks Trace RCU readers are only permitted to block for long enough to handle a major page fault.

The Tasks Trace RCU read-side markers are rcu_read_lock_trace() and rcu_read_unlock_trace(), and there is a lockdep-enabled rcu_read_lock_trace_held() function that indicates whether or not the caller is within a Tasks Trace RCU read-side critical section. (When lockdep is disabled, this function always returns the value one.) Synchronous grace-period waits are provided by synchronize_rcu_tasks_trace() and asynchronous grace-period waits by call_rcu_tasks_trace(). The rcu_barrier_tasks_trace() function waits for the invocation of all callbacks queued by previous invocations of call_rcu_tasks_trace(). It is not unusual for BPF code to need to wait for both an RCU and a Tasks Trace RCU grace period, and a current accident of implementation means that any Tasks Trace RCU grace period is also a plain RCU grace period. This could of course change at any time, so there is an rcu_trace_implies_rcu_gp() function (which currently unconditionally returns true) that specifies whether or not this happy accident is still in effect.
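
A minimal reader/updater sketch using these markers might look as follows; my_data, my_data_ptr, and the function names are invented for illustration:

    #include <linux/rcupdate_trace.h>
    #include <linux/slab.h>

    struct my_data {
        int value;
    };

    static struct my_data __rcu *my_data_ptr;

    static int my_reader(void)
    {
        struct my_data *p;
        int ret = -1;

        rcu_read_lock_trace();
        p = rcu_dereference_check(my_data_ptr, rcu_read_lock_trace_held());
        if (p)
            ret = p->value;  /* limited blocking, e.g. a page fault, is permitted */
        rcu_read_unlock_trace();
        return ret;
    }

    static void my_updater(struct my_data *newp)
    {
        struct my_data *old;

        /* Caller is assumed to provide update-side mutual exclusion. */
        old = rcu_replace_pointer(my_data_ptr, newp, 1);
        synchronize_rcu_tasks_trace();  /* wait for all pre-existing trace readers */
        kfree(old);
    }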

Again, Tasks Trace RCU is quite specialized, so those wishing to use it should consult not only with its maintainers, but also with its current users.

New SRCU read-side critical sections

SRCU read-side critical sections use this_cpu_inc(), which excludes interrupt and software-interrupt handlers, but is not guaranteed to exclude non-maskable interrupt (NMI) handlers. Therefore, SRCU read-side critical sections may not be used in NMI handlers, at least not in portable code. This restriction became problematic for printk(), which is frequently called upon to do stack backtraces from NMI handlers. This situation motivated adding srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe(). These new API members instead use atomic_long_inc(), which can be more expensive than this_cpu_inc(), but which does exclude NMI handlers.

However, SRCU will complain if you use both the traditional and the NMI-safe API members on the same srcu_struct structure. In theory, it is possible to mix and match but, in practice, the rules for safely doing so are not consistent with good software-engineering practice. So if you need any of a given srcu_struct structure's read-side critical sections to appear in an NMI handler, use srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() to mark all of that srcu_struct structure's read-side critical sections. When lockdep is enabled, the kernel will complain bitterly if you attempt to mix and match NMI-safe and non-NMI-safe SRCU readers on the same srcu_struct structure.
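
For example, an srcu_struct with one reader that can run in NMI context might be handled as in the following hypothetical sketch (my_srcu, my_state, and the function name are invented):

    #include <linux/srcu.h>

    DEFINE_STATIC_SRCU(my_srcu);

    static int __rcu *my_state;

    /* May be called from an NMI handler, for example while dumping a backtrace. */
    static int read_state_nmisafe(void)
    {
        int idx, val = 0;
        int *p;

        idx = srcu_read_lock_nmisafe(&my_srcu);
        p = srcu_dereference(my_state, &my_srcu);
        if (p)
            val = *p;
        srcu_read_unlock_nmisafe(&my_srcu, idx);
        return val;
    }

Because one reader of my_srcu uses the NMI-safe markers, any process-context readers of that same srcu_struct must use them as well.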

An SRCU read-side critical section must be wholly contained within a given task. Discussions led to the belief that this restriction was too severe, resulting in the new srcu_down_read() and srcu_up_read() API members, by analogy to down_read() and up_read(). However, these APIs have not yet seen any use. If they continue to be unused, they will be removed.

In addition, the list_for_each_entry_srcu() and hlist_for_each_entry_srcu() iterators were added, and are actually in use.

Finally, the cleanup_srcu_struct_quiesced() function was removed because the deadlock issue that led to its creation was resolved by adding WQ_MEM_RECLAIM workqueues. Therefore, any code that would previously have used cleanup_srcu_struct_quiesced() can now use cleanup_srcu_struct() instead.

Read-side guards

Zijlstra introduced read-side guards for RCU and SRCU, and Johannes Berg made the RCU read-side guards safe for the sparse static-analysis tool. These guards use the __cleanup__ attribute to cause a read-side critical section to be exited as soon as the scope ends. This enables the RAII (resource acquisition is initialization) pattern for RCU and SRCU. For example, from fs/libfs.c:

 1 static inline struct dentry *get_stashed_dentry(struct dentry *stashed)
 2 {
 3   struct dentry *dentry;
 4
 5   guard(rcu)();
 6   dentry = READ_ONCE(stashed);
 7   if (!dentry)
 8     return NULL;
 9   if (!lockref_get_not_dead(&dentry->d_lockref))
10     return NULL;
11   return dentry;
12 }
Quick Quiz 2: What is with the extra pair of parentheses on line 5?

Click for answer For lock-based guards, these would specify which lock to acquire. But RCU is global in nature, so does not need anything between the second pair of parentheses.

Those willing to look more deeply under the covers will see that the (rcu) is the argument to the guard() macro and the () is an argument to the constructor function that enters the RCU read-side critical section.

Line 5 creates an RCU read-side critical section that extends to the end of the enclosing scope, that is, to the end of the function body.

An SRCU read-side guard must specify which srcu_struct to use, for example, as follows:

 1 static void gpiochip_setup_devs(void)
 2 {
 3   struct gpio_device *gdev;
 4   int ret;
 5
 6   guard(srcu)(&gpio_devices_srcu);
 7
 8   list_for_each_entry_srcu(gdev, &gpio_devices, list,
 9          srcu_read_lock_held(&gpio_devices_srcu)) {
10     ret = gpiochip_setup_dev(gdev);
11     if (ret)
12       dev_err(&gdev->dev,
13         "Failed to initialize gpio device (%d)\n", ret);
14   }
15 }

Here, line 6 enters an SRCU read-side critical section using the srcu_struct structure named gpio_devices_srcu, and this critical section extends to the end of the enclosing scope.

But sometimes it is necessary to exit the critical section prior to the end of the enclosing scope. For this purpose, scoped_guard() creates an SRCU read-side critical section that covers only the following statement, which will often be a compound statement, for example, as follows:

 1 struct gpio_desc *gpio_to_desc(unsigned gpio)
 2 {
 3   struct gpio_device *gdev;
 4
 5   scoped_guard(srcu, &gpio_devices_srcu) {
 6     list_for_each_entry_srcu(gdev, &gpio_devices, list,
 7         srcu_read_lock_held(&gpio_devices_srcu)) {
 8       if (gdev->base <= gpio &&
 9           gdev->base + gdev->ngpio > gpio)
10         return &gdev->descs[gpio - gdev->base];
11     }
12   }
13
14   if (!gpio_is_valid(gpio))
15     pr_warn("invalid GPIO %d\n", gpio);
16
17   return NULL;
18 }
Quick Quiz 3: So why aren't there read-side guards for the three Tasks RCU variants?

Click for answer For Tasks RCU and Tasks Rude RCU, the readers are implicit, which means that there is no useful way to create a guard. For Tasks Trace RCU, read-side guards are easy to implement, and will be implemented should a use case arise.

Here, line 5 enters an SRCU read-side critical section that extends from line 6 through line 12.

In all cases, use of read-side guards avoids bugs in which code enters a critical section but then fails to exit it.

RCU callback dynamic (de-)offloading

RCU callbacks may be offloaded, which means that, instead of being invoked in software-interrupt context (usually on the CPU that queued them), they are instead processed (shepherded through the end of the grace period) in the context of rcuog kthreads, then invoked in the context of per-CPU rcuoc kthreads. Callback offloading can provide substantial improvements for both HPC and real-time workloads by removing the "noise" of callback invocation from the CPUs doing time-critical work, or, failing that, by letting the scheduler decide when and where the RCU callbacks should be invoked. These kthreads may be controlled by the system administrator using the usual set of scheduling facilities, ranging from taskset to control groups.

Quick Quiz 4: Why not always offload RCU callback invocation?

Click for answer Because offloaded callbacks require that call_rcu() explicitly synchronize with whatever CPU the corresponding rcuog kthread might be running on. This synchronization is not free, and poses the usual choice between high throughput (callbacks not offloaded) and real-time response (callbacks offloaded). And because RCU has no way of knowing which approach is best for a given workload, it must therefore defer to the better judgment of the system administrator.

By default, CPUs are never offloaded. Setting the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL kernel configuration option causes all CPUs to be offloaded. Either build-time choice may be overridden by the nohz_full or the rcu_nocbs kernel boot parameter, or both, if you feel that your future self will need the additional confusion.

As the number of CPUs on the typical computer system has increased, so has the number of workloads running on such a system, and in turn, the need to dynamically adjust those workloads. There is thus now an in-kernel facility to offload and de-offload RCU callbacks on specific CPUs. However, this facility has not yet been made available to user space due to persistent issues, including race conditions, deadlocks, and other hangs.

The current direction is to allow run-time offloading and de-offloading, but only for offline CPUs. This is expected to significantly simplify the code while supporting all known use cases. This facility will likely be made available to user space with the keenly anticipated run-time adjustment of the nohz_full kernel boot parameter.

Miscellaneous

The CONFIG_RCU_FAST_NO_HZ kernel configuration option was intended to improve energy efficiency, but a survey in 2021 showed that the only users of this option also offloaded RCU callbacks. In that case, all CONFIG_RCU_FAST_NO_HZ does is to provide a slight slowdown for transitions to and from idle. This configuration option was therefore removed.

The data-access APIs added rcu_dereference_raw_check(), rcu_replace_pointer(), and unrcu_pointer(), but also removed rcu_dereference_raw_notrace() and rcu_swap_protected().

The updater-validation APIs removed RCU_NONIDLE() because Zijlstra's and Gleixner's idle-loop rework removed the need for it.

The RCU list APIs added list_tail_rcu(), hlists_swap_heads_rcu(), hlist_nulls_add_tail_rcu(), and hlist_nulls_add_fake(), but removed hlist_bl_del_init_rcu().

How did those 2019 predictions turn out?

The 2019 article included a set of predictions about the future of the RCU API. Five years later, a look at how they turned out seems warranted.

  1. A kmem_cache counterpart to kfree_rcu() will likely be required.

    This had been predicted in 2014 as well but, yet again, this did not happen. But something even better happened, namely that kfree() now handles memory obtained from kmem_cache_alloc(). This means that there is no longer a need for something like a kmem_cache_free_rcu() because kfree_rcu() now just handles this case.

    Almost. But that is the subject of a new prediction.

  2. Inlining of TREE_PREEMPT_RCU's rcu_read_lock() primitive.

    And yet again, this did not happen.

    But not for lack of effort on the part of Lai Jiangshan, who provided not one but two patch series along these lines (2024, 2019).

    Quick Quiz 32: Inlining rcu_read_lock() sounds quite valuable. So why aren't Jiangshan's patches upstream?

    Click for answer They do remove function-call and task-structure-access overhead from rcu_read_lock(), but at the cost of additional code on the context-switch fast path. This might well be a good tradeoff, but actual performance results are required. Please feel free to give this patch series a spin on your favorite workload! (And of course to post the resulting performance results.)

  3. Additional forward-progress work, both in rcutorture and in RCU proper.

    There was significant work in this area, along with upgrades to the callback-flooding testing in rcutorture to help ensure that the improvements stay improved.

  4. Better handling of vCPU preemption within RCU readers.

    There have been some interesting experiments in this area, but no commits have yet made it to mainline.

  5. Adding rcutorture to kselftests, that is, adding a Makefile to tools/testing/selftests/rcutorture that carries out a quick rcutorture-based smoke test of RCU.

    This has been done. Even better, there are now more people who make frequent use of rcutorture.

  6. Disentangling rcu_barrier() from CPU hotplug operations, which could permit this function to be invoked from CPU-hotplug notifiers.

    This has been completed.

Quick Quiz 33: What happened to quick quizzes 5-31?

Click for answer They can be found in the background-material supplemental article, for everybody who wants more RCU.

Three and two halves out of six, which is not as good as one might hope, but better than they usually are.

It is also illuminating to list the unexpected changes. Some of these are hinted at above, but bear repeating:

  1. There is now a Tasks Rude RCU and a Tasks Trace RCU.
  2. There are a number of energy-efficiency improvements, including lazy RCU callbacks and lazy kfree_rcu() processing.
  3. There is now a (more) complete set of polled RCU grace-period APIs.
  4. SRCU read-side critical sections may now be in NMI handlers (using the new srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() functions) and may also span tasks (using the new srcu_down_read() and srcu_up_read() functions).
  5. RAII guards are now available for RCU and SRCU, courtesy of Zijlstra and Berg.
  6. There is now a kernel configuration option to reduce synchronize_rcu() latency during heavy RCU-callback loads.
  7. Expedited RCU uses kthread workers instead of workqueues, which enables priority boosting to also boost expedited grace-period processing.
  8. The expedited RCU CPU stall-warning timeout can now be set in milliseconds, and some users set it to 20 milliseconds. And so it is that Linux's stall warning timeout finally beats the 1990s DYNIX/ptx timeout of 1.5 seconds. For expedited grace periods, anyway.
  9. It is now OK to disable interrupts across rcu_read_unlock() even if the corresponding RCU read-side critical section might have been preempted. This is a welcome side-effect of Jiangshan's first attempt to inline rcu_read_lock() and rcu_read_unlock().
  10. RCU code now takes greater advantage of the Kernel Concurrency Sanitizer (KCSAN).
  11. CPUs may have their callbacks offloaded and de-offloaded at runtime, though this capability has not yet been made available to user space.
  12. Debug-objects testing for double call_rcu() bugs can now print out more information on the memory passed in.
  13. The rcutorture tests of RCU priority boosting are now much more stringent and resistant to false positives.
  14. There is now a kvm-remote.sh rcutorture test facility that spreads rcutorture tests over many remote systems.
  15. There is now a torture.sh test facility that does an overnight test of various torture tests.
  16. There is now a trivial textbook RCU implementation in rcutorture, just to keep the slides and textbooks honest.
  17. The lockdep facility now checks for SRCU-based deadlocks.
  18. The Linux-kernel memory model's (LKMM's) model of SRCU is now much more realistic.
  19. A great many much-appreciated features and fixes from more than 100 Linux-kernel developers.

What next for the RCU API?

As always, the most honest answer is that I do not know. That said, here are a few things that might happen:

  1. The slab allocator will defer acting on kmem_cache_destroy() until after memory sent to kfree_rcu() has been freed. This would permit a module to use kfree_rcu() on memory obtained from a kmem_cache that the module passes to kmem_cache_destroy() at module-unload time.
  2. TREE_PREEMPT_RCU's rcu_read_lock() primitive will be inlined. After all, why not triple down?
  3. Hazard pointers will be added to the Linux kernel's deferred-free toolbox.
  4. RCU callbacks will benefit from concurrent expedited grace periods.
  5. Although RCU is now capable of changing the callback-offloaded status of a given CPU at runtime, this has not been made available to user space. It seems likely that user space will gain this capability sooner rather than later, albeit with some restrictions.
  6. Further upgrades to RCU's energy-efficiency, latency, and simplicity.
  7. Someone will notice that rcu_barrier() is no longer guaranteed to wait for the grace periods corresponding to prior calls to synchronize_rcu(). (Late-breaking news: Vlastimil Babka has already noticed.)

But now as always, new use cases and workloads will place unanticipated demands on RCU.

Acknowledgments

We are all indebted to a huge number of people who have used, abused, poked at, and otherwise helped to improve the RCU API. Paul is grateful to Dan Kelley for his support of this effort.

This work represents the view of the authors and does not necessarily represent the view of the authors' respective employers.

Comments (8 posted)

Vanilla OS 2: an immutable distribution to run all software

September 17, 2024

This article was contributed by Koen Vervloesem

Vanilla OS, an immutable desktop Linux distribution designed for developers and advanced users, has recently published its 2.0 "Orchid" release. Previously based on Ubuntu, Vanilla OS has now shifted to Debian unstable ("sid"). The release has made it easier to install software from other distributions' package repositories, and it is now theoretically possible to install and run Android applications as well.

The idea behind Vanilla OS is to maintain a small, immutable core operating system while isolating additional software from that core through containerization or sandboxing. The core of Vanilla OS is based on Open Container Initiative (OCI) images composed of packages from Debian sid. These images are updated and managed by the ABRoot utility. Rather than handling system updates through a package manager, Vanilla OS downloads a new OCI image, and then uses the new image when it reboots. This is similar to other image-based operating systems, like Aeon and Bluefin that use A/B image update strategies—though each distribution has its own method of doing so. All of this happens under the hood, so users don't have to be aware of how updates work.

One of the reasons that Vanilla OS shifted from Ubuntu to Debian was that the latter offers a more vanilla GNOME experience by providing the software without significant customizations. As the project's announcement from last year said:

Ubuntu provides a modified version of the GNOME desktop, that does not match how GNOME envisions its desktop. One of the high-level goals of Vanilla OS is to be as vanilla as possible, so we reverted many of these changes to reach that goal.

Another concern cited in the announcement was Ubuntu's reliance on the Snap packaging format, with problems like slow startups and Canonical's control over the official Snap store.

Installing the system and applications

The project offers an ISO image for x86-64 systems, with no support for other architectures. Only UEFI systems are supported, and even on those systems, the installer seems quite finicky. On two physical test machines, the installer seemed to run fine until it aborted with a generic error message. I managed to install Vanilla OS in a virtual machine (using GNOME Boxes) only after selecting Debian testing as the operating system (and choosing UEFI, of course).

When the installer works, it is straightforward to use. After the first reboot, a configuration wizard guides the user through creating an account and selecting applications to install. After a second reboot, the user is prompted to choose a password, followed by the installation of the selected applications using Flatpak.

Once those applications are installed, the user is presented with a stock GNOME desktop environment. The recommended way to install applications in Vanilla OS is through the GNOME application Software, which is configured to install applications as Flatpaks from Flathub. By default, Flatpaks installed this way are updated automatically. Vanilla OS also checks for system updates on a schedule that can be set to daily, weekly (the default), or never. To minimize any impact on the user experience, the updater considers factors such as system load, battery level, and network connection before applying system updates. With the core operating system and applications set for automatic updates, Vanilla OS aims to relieve users of the often tedious task of keeping all packages updated.

Installing non-Flatpak packages

Although Flatpak has gained popularity, many applications are still not available as Flatpaks. Given that Vanilla OS is based on Debian, users might be tempted to install a .deb package directly into the core operating system when there's no Flatpak version for it. However, Vanilla OS doesn't offer that possibility. It addresses the need for Debian packages with the Vanilla System Operator (VSO v2), a container where users can safely install any .deb package.

VSO integrates software installed in a container with the desktop so that it appears just like any other desktop application. After downloading a .deb package, double-clicking on the file in the GNOME Files application launches the Sideload utility, which installs the .deb package into the VSO container. Once installed, the application is displayed among other applications in GNOME's activities overview. Launching the application runs it in a container, isolated from the operating system but integrated with the user's home directory. Thus, an application installed via a .deb package behaves similarly to one installed via Flatpak.

There's also a command-line method for installing .deb files. The default terminal emulator in Vanilla OS is Black Box. When launched, this application takes the user to a shell within the VSO container rather than the core operating system. Packages can be installed in this container using Debian's apt package manager.

This method is also useful for troubleshooting dependency issues. For instance, I decided to install an open-source ChatGPT alternative called Jan on Vanilla OS. When I attempted to install its .deb package with Sideload, the application failed to start without any clear indication of what was wrong. Opening Black Box and running the jan command in the shell revealed an error message regarding a missing library: libasound.so.2. It turned out that Jan's .deb package didn't list this library as a dependency. After executing sudo apt install libasound2 in Black Box, I was able to run Jan successfully.

Vanilla OS offers a similar method for installing Android apps by running them in a Waydroid container. However, the developers caution that this is still an experimental feature. I was able to initialize the Android subsystem with vso android init in the VSO shell but encountered issues when trying to manually open an Android .apk file with Sideload. Despite several troubleshooting attempts, I was unable to get the Android functionality running. This was possibly due to my test system running in a virtual machine, a scenario that the project has not tested.

Linux subsystems

While VSO is the default container in Vanilla OS, the operating system allows users to install separate "subsystems" based on various Linux distributions through Apx, a wrapper around Distrobox. The latter, in turn, is a wrapper around Podman, Docker, or the simple container manager lilipod to create containers that are highly integrated with the host operating system. Incidentally, Luca Di Maio, the creator of Distrobox and lilipod, is also part of the Vanilla OS team and one of its co-founders.

Vanilla OS includes a graphical user interface for Apx, enabling users to easily create a subsystem based on a stack (Alpine, Arch Linux, Fedora, openSUSE, Ubuntu, and Vanilla OS are supported by default) and a package manager (apk, apt, dnf, pacman, or zypper). After creating and starting the subsystem, the user can search its packages and install them using commands in the VSO shell, such as:

    apx gecko search qucs
    apx gecko install qucs-s

Here gecko is the name I gave to the subsystem in this example, based on an openSUSE container.

Applications installed this way are "exported", which is a feature of Distrobox. The application then appears in GNOME's activities overview. When clicking on the icon, the subsystem's container is automatically started, and the application launches. This provides an easy way to install software available in specific distribution repositories. Furthermore, the apx package management commands are consistent, regardless of the underlying package manager, so users don't need to remember the differences between apt, dnf, or zypper, for example.

[Vanilla OS 2 screenshot]

Apx offers a lot of flexibility. Users can add other package managers, such as yay instead of pacman for an Arch Linux subsystem. This involves mapping each apx package management command to the corresponding yay command. Similarly, users can create their own stack by referencing an image from one of the well-known container repositories. A stack can also have a list of pre-installed packages. This is useful for setting up a complete development stack in a container.

Development and documentation

Development is happening in the Vanilla OS GitHub organization, which has 24 members. Its more than 100 repositories include numerous container files and various custom tools. Although many of these tools were specifically created for Vanilla OS, the developers try to make them as distribution-independent as possible. For instance, there are instructions for using Apx in other distributions.

There's no fixed release cadence, but the lifecycle of Vanilla OS is clear: when a major version is released, the previous major release enters the maintenance state. This means that it only gets important bug fixes and security fixes. After six months in the maintenance state, the release is declared end of life. Updates are released at irregular intervals with detailed descriptions of bug fixes and enhancements. The project does not have a separate security-announcement list.

The transition from Vanilla OS 1 to 2 was significant, not only due to the shift from an Ubuntu base to Debian, but also because tools like ABRoot, VSO, and Apx were completely rewritten. However, the Vanilla OS documentation site still primarily covers ABRoot v1, VSO v1, and Apx v1. While the man pages seem to have been updated, information about the new features in Vanilla OS 2 remains fragmented, scattered across the project's web site, blog, and GitHub repositories.

Overall, Vanilla OS appears to be a good choice for advanced users who need applications from multiple distributions without risking system instability. Major applications can be easily installed as Flatpaks, while others can be installed in subsystems from various distributions. Thanks to the tight integration of the VSO and Apx containers with the host operating system, users can seamlessly work with all those applications without worrying about the underlying differences. And when something goes wrong with one of the applications, the core operating system isn't impacted and the VSO or Apx container can simply be recreated.

Comments (4 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds