Bao: a lightweight static partitioning hypervisor

By Jonathan Corbet
May 20, 2020

Developers of safety-critical systems tend to avoid Linux kernels for a number of fairly obvious reasons; Linux simply was not developed with that sort of use case in mind. There are increasingly compelling reasons to use Linux in such systems, though, leading to a search for the best way to do so safely. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), José Martins described Bao, a minimal hypervisor aimed at safety-critical deployments.

The actual target, he began, is "mixed-criticality" systems in which multiple software stacks run in parallel with each other; some of those stacks are safety-critical, while others are not. For example, a system could have a user interface running on Linux alongside the safety-critical application that it controls. There is an industry trend toward consolidating systems in this way, driven by power considerations and the availability of processors with numerous CPUs.

Virtualization is naturally interesting for the developers of such systems; it minimizes the effort required to port systems and eases the integration within them. Good virtualization provides fault isolation, preventing failures in one part of the system from interfering with others. Developers want the usual things from such a system: good performance, realtime guarantees, and strong security.

Martins spent some time looking at solutions like Xen and KVM. They were not designed for this kind of use case, but they end up being used anyway. Neither is an optimal solution; they use virtualized I/O mechanisms that add a lot of overhead, and their code bases are large and hard to audit.

Instead, he said, there is a role here for a static partitioning hypervisor, which can be seen as a thin configuration layer that divides up a system's resources. Under a system like this, there is a one-to-one mapping between virtual CPUs and physical CPUs, so there is no contention for CPU time. Devices are mapped directly into the guests, avoiding any added I/O overhead. Perhaps the best-known hypervisor of this type is Jailhouse, but that didn't meet Martins's needs; it depends on a Linux "root cell" to run the whole show, its boot time is relatively long, and there is still a big code base to audit. The Xen Dom0-less project can do direct device assignment, which is nice, but it falls short in other ways.
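The exclusive-assignment idea behind static partitioning can be shown in a few lines. This is only a sketch and not Bao's configuration format: the VmConfig structure, its field names, and validate_partitioning() are hypothetical, and exist just to show that each physical CPU and each device belongs to at most one guest.

```python
from dataclasses import dataclass, field


@dataclass
class VmConfig:
    """Hypothetical description of one guest; not Bao's real config format."""
    name: str
    cpus: set[int]                                   # physical CPUs, one vCPU each
    devices: set[str] = field(default_factory=set)   # devices mapped directly


def validate_partitioning(vms: list[VmConfig], num_cpus: int) -> None:
    """Raise ValueError unless every CPU and device has exactly one owner."""
    cpu_owner: dict[int, str] = {}
    dev_owner: dict[str, str] = {}
    for vm in vms:
        for cpu in sorted(vm.cpus):
            if not 0 <= cpu < num_cpus:
                raise ValueError(f"{vm.name}: CPU {cpu} does not exist")
            if cpu in cpu_owner:
                raise ValueError(
                    f"CPU {cpu} assigned to both {cpu_owner[cpu]} and {vm.name}")
            cpu_owner[cpu] = vm.name
        for dev in sorted(vm.devices):
            if dev in dev_owner:
                raise ValueError(
                    f"device {dev} assigned to both {dev_owner[dev]} and {vm.name}")
            dev_owner[dev] = vm.name


# Illustrative mixed-criticality layout on a four-CPU system.
linux = VmConfig("linux-ui", cpus={0, 1}, devices={"gpu", "eth0"})
rtos = VmConfig("rtos-control", cpus={2, 3}, devices={"can0"})
validate_partitioning([linux, rtos], num_cpus=4)   # no exception: no sharing
```

Because nothing is shared, there is nothing to schedule or arbitrate at run time; the check can be done once, when the configuration is built.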

So Martins set out to create Bao as a "type-1 bare-metal hypervisor" with a one-to-one CPU mapping. It doesn't depend on any sort of privileged virtual machine or operating system to boot. Bao provides a simple inter-VM communication mechanism based on shared memory and virtual interrupts. It depends on hardware assistance for many of its functions, including second-stage address translation, an I/O memory-management unit, and virtual interrupts. Bao can use huge pages to reduce translation lookaside buffer pressure and page-table memory use; it is also able to perform cache coloring for memory allocations to avoid low-level cache interference between machines.
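As a rough model of communication built on shared memory plus virtual interrupts, consider the toy below. Everything in it is an assumption made for illustration: the SharedChannel class, the four-byte length header, and the doorbell callback standing in for a virtual interrupt are not taken from Bao.

```python
from typing import Callable


class SharedChannel:
    """Toy mailbox: a shared buffer plus a doorbell (hypothetical layout)."""

    HEADER = 4  # bytes at the start of the region holding the payload length

    def __init__(self, size: int, doorbell: Callable[[], None]) -> None:
        self.mem = bytearray(size)   # stands in for the shared memory region
        self.doorbell = doorbell     # stands in for a virtual interrupt

    def send(self, payload: bytes) -> None:
        if len(payload) + self.HEADER > len(self.mem):
            raise ValueError("message does not fit in the shared region")
        # Write the payload first, then publish its length, then notify.
        self.mem[self.HEADER:self.HEADER + len(payload)] = payload
        self.mem[:self.HEADER] = len(payload).to_bytes(self.HEADER, "little")
        self.doorbell()

    def receive(self) -> bytes:
        length = int.from_bytes(self.mem[:self.HEADER], "little")
        return bytes(self.mem[self.HEADER:self.HEADER + length])


# The "receiving guest" reads the mailbox when its doorbell fires.
received: list[bytes] = []
channel = SharedChannel(64, doorbell=lambda: received.append(channel.receive()))
channel.send(b"setpoint=42")
```

The data path never goes through the hypervisor in this model; its only job is to deliver the notification.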

Bao currently targets the Armv8 architecture. There is a RISC-V port, but the virtualization specification for RISC-V is not ready, so this port is not interesting yet. Bao can run a number of guests, including bare-metal applications, Linux, Android, and various realtime operating systems.

Ideally, he said, Bao would just be a configuration layer that does its work and gets out of the way, but the hardware does not support this mode of operation. Interrupts, for example, have to be mediated through the hypervisor, which is unfortunate since that increases latency. The I/O memory-management unit has a limited number of stream registers, and doesn't cover all devices on some platforms. There is no partitioning mechanism for memory cache on Arm, so the hypervisor must handle isolation via cache coloring.
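Cache coloring in general works by giving each guest only those physical pages that map to a disjoint slice of a shared, physically indexed cache. The sketch below shows the arithmetic under assumed parameters (a 1MiB, 16-way cache and 4KiB pages); the function names are hypothetical and the talk did not describe Bao's allocator in this detail.

```python
PAGE_SIZE = 4096  # assumed base page size


def num_colors(cache_size: int, ways: int, page_size: int = PAGE_SIZE) -> int:
    """Number of page colors: pages that fit in one way of the cache."""
    return cache_size // (ways * page_size)


def page_color(phys_addr: int, colors: int, page_size: int = PAGE_SIZE) -> int:
    """Color of the page containing phys_addr (page frame number mod colors)."""
    return (phys_addr // page_size) % colors


def pages_for_vm(allowed: set[int], colors: int,
                 first_pfn: int, count: int) -> list[int]:
    """First `count` page frame numbers >= first_pfn whose color is allowed."""
    if not any(0 <= c < colors for c in allowed):
        raise ValueError("no usable color in the allowed set")
    pfns: list[int] = []
    pfn = first_pfn
    while len(pfns) < count:
        if pfn % colors in allowed:
            pfns.append(pfn)
        pfn += 1
    return pfns


colors = num_colors(cache_size=1024 * 1024, ways=16)   # 16 colors
guest_a = pages_for_vm({0, 1}, colors, first_pfn=0, count=4)
guest_b = pages_for_vm({2, 3}, colors, first_pfn=0, count=4)
print(colors, guest_a, guest_b)   # 16 [0, 1, 16, 17] [2, 3, 18, 19]
```

Two guests with disjoint color sets never compete for the same cache sets, at the cost of each guest seeing only a fraction of the cache and a scattered physical memory layout.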

The system is implemented in about 7,000 lines of code, and requires 50KB of memory on the target system. That is "somewhat small", he said, but he is working to get it smaller. Run-time memory requirements add up to about 250KB. Benchmark runs show that the hypervisor adds an execution-time overhead of about 2%. Turning on cache coloring increases that overhead, since that feature is incompatible with the use of huge pages. Interference tests currently show a significant amount of degradation caused by activity in other virtual machines; cache coloring helps but does not completely solve the problem.
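The conflict between coloring and huge pages follows from simple arithmetic, at least for page-granular coloring: a huge page is physically contiguous, so it covers every color. The sketch below assumes 16 colors and 4KiB base pages; those numbers and the helper name are illustrative rather than taken from the talk.

```python
def colors_in_mapping(base_addr: int, mapping_size: int,
                      colors: int, page_size: int = 4096) -> set[int]:
    """Set of page colors touched by one physically contiguous mapping."""
    first_pfn = base_addr // page_size
    pages = mapping_size // page_size
    return {(first_pfn + i) % colors for i in range(pages)}


COLORS = 16                      # assumed number of page colors
small = colors_in_mapping(0, 4096, COLORS)
huge = colors_in_mapping(0, 2 * 1024 * 1024, COLORS)
print(len(small), len(huge))     # 1 16
```

A single 2MiB mapping touches all 16 colors, so a guest restricted to a subset of colors has to be mapped with small pages, giving up the TLB and page-table savings that huge pages provide.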

Another issue is interrupt latency, which increases significantly due to the need for a round-trip through the hypervisor. There is a fair amount of cross-VM interference caused by interrupts as well; again, cache coloring helps, especially if it is used within the hypervisor too. He has found a way to map interrupts directly into guests, something that is made possible by the one-to-one CPU mapping. That increases the overhead for interrupts intended for the hypervisor itself, but those are relatively rare.

Current work includes adding support for trusted execution environments. The other approaches out there share the trusted code across virtual machines, which is not ideal. Arm is adding support for a "trusted hypervisor" mode but, for now, complex workarounds are required. Martins said that the "dual-world" approach used in this area is not inherently secure; a lot of code has to be added to the secure side, bringing the same old problems with it. It is better, he said, to limit the secure world to core security primitives; he is trying to do that by avoiding TrustZone completely and dedicating a virtual CPU to trusted work. This involves allowing multiple virtual CPUs to run on a single physical CPU.

Overall, he concluded, Bao has turned out to be a good fit for the intended use case.

Index entries for this article
Kernel	Virtualization
Conference	OS-Directed Power-Management Summit/2020


Bao: a lightweight static partitioning hypervisor

Posted May 20, 2020 16:38 UTC (Wed) by Sesse (subscriber, #53779)

So now, we have Boa, a lightweight web server with only one thread, and Bao, a lightweight hypervisor with only one user per thread? :-)

Bao: a lightweight static partitioning hypervisor

Posted May 20, 2020 17:55 UTC (Wed) by bittonye (guest, #138442)

Why not use seL4 with the CAmkES VM? (https://docs.sel4.systems/projects/virtualization/)

Bao: a lightweight static partitioning hypervisor

Posted May 21, 2020 23:27 UTC (Thu) by glenn (subscriber, #102223)

seL4 is great, but it has limitations. It's unable to support multi-socket/NUMA systems, for instance. This is an intentional design choice on the part of seL4's developers.

Muen?

Posted May 20, 2020 18:30 UTC (Wed) by justincormack (subscriber, #70439)

Muen is another static separation kernel (https://muen.sk/), which is formally verified. Currently x86 only, although there has been some Arm work AFAIK.

RISC-V H-ext

Posted May 20, 2020 20:11 UTC (Wed) by bjorntopel (subscriber, #80345)

The hypervisor extension is in pretty good shape. The WDC folks have been working on the QEMU/KVM port for quite some time. The latest version can be found here [1]. The spec has been getting a lot of input from KVM maintainer Paolo Bonzini, e.g. [2], which has been added to the specification. In RISC-V land a spec doesn't get ratified until there's silicon to support it.

Hopefully, we'll get a chip soon! :-)

[1] https://lore.kernel.org/linux-riscv/20200428073312.324684...
[2] https://groups.google.com/a/groups.riscv.org/forum/m/#!ms...

Bao: a lightweight static partitioning hypervisor

Posted May 29, 2020 13:50 UTC (Fri) by josemartins (guest, #139216)

Thank you for the article, Jonathan! However, I have a small correction. The hypervisor, for the example configuration I was talking about, is actually 50KB, not 50MB.

What's three orders of magnitude among friends?

Posted May 29, 2020 13:52 UTC (Fri) by corbet (editor, #1)

Oops...I even had that right in my notes, but somehow messed up when putting it into the article. Fixed now, apologies for the mistake.