Some KVM developments

[Posted January 9, 2007 by corbet]

The KVM patch set was covered here briefly last October. In short, KVM allows for (relatively) simple support of virtualized clients on recent processors. On a CPU with Intel's or AMD's hardware virtualization support, a hypervisor can open /dev/kvm and, through a series of ioctl() calls, create virtualized processors and launch guest systems on them. Compared to a full paravirtualization system like Xen, KVM is relatively small and straightforward; that is one of the reasons why KVM went into 2.6.20, while Xen remains on the outside.
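For a feel of what the open-and-ioctl() sequence looks like, here is a minimal sketch in Python. It is not taken from the KVM patches: the request numbers are the ones found in the current <linux/kvm.h> on x86 (the 2.6.20-era header may number them differently), and probe_kvm() is a hypothetical helper name. The function simply returns None on a machine without a usable /dev/kvm.

```python
import fcntl
import os

# Request numbers as defined by the current <linux/kvm.h> on x86, where
# _IO(type, nr) == (type << 8) | nr. Assumption: the 2.6.20 header may differ.
KVMIO = 0xAE
KVM_GET_API_VERSION = (KVMIO << 8) | 0x00
KVM_CREATE_VM = (KVMIO << 8) | 0x01
KVM_CREATE_VCPU = (KVMIO << 8) | 0x41


def probe_kvm(path="/dev/kvm"):
    """Create one VM with one virtual CPU; return the API version, or None on failure."""
    fds = []
    try:
        kvm_fd = os.open(path, os.O_RDWR)
        fds.append(kvm_fd)
        version = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
        # Each of these ioctl() calls hands back a new file descriptor.
        vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
        fds.append(vm_fd)
        vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)  # virtual CPU number 0
        fds.append(vcpu_fd)
        return version
    except OSError:
        # No device node, no permission, or no hardware virtualization support.
        return None
    finally:
        for fd in reversed(fds):
            os.close(fd)


if __name__ == "__main__":
    print("KVM API version:", probe_kvm())
```

A real hypervisor would go on to set up guest memory and registers and then run the virtual CPU; the sketch stops once the descriptors exist.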

While KVM is in the mainline, it is not exactly in a finished state yet, and it may see significant changes before and after the 2.6.20 release. One current problem has to do with the implementation of "shadow page tables," which does not perform as well as one would like. The solution is conceptually straightforward - at least, once one understands what shadow page tables do.

A page table, of course, is a mapping from a virtual address to the associated physical address (or a flag that said mapping does not currently exist). A virtualized operating system is given a range of "physical" memory to work with, and it implements its own page tables to map between its virtual address spaces and that memory range. But the guest's "physical" memory is a virtual range administered by the host; guests do not deal directly with "bare metal" memory. The result is that there are actually two sets of page tables between a virtual address space on a virtualized guest and the real, physical memory it maps to. The guest can set up one level of translation, but only the host can manage the mapping between the guest's "physical" memory and the real thing.

This situation is handled by way of shadow page tables. The virtualized client thinks it is maintaining its own page tables, but the processor does not actually use them. Instead, the host system implements a "shadow" table which mirrors the guest's table, but which maps guest virtual addresses directly to physical addresses. The shadow table starts out empty; every page fault on the guest then results in the filling in of the appropriate shadow entry. Once the guest has faulted in the pages it needs, it will be able to run at native speed with no further hypervisor attention required.
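The two levels of translation, and the way a shadow table collapses them into one, can be modeled in a few lines of Python. This is a toy sketch with flat dictionaries standing in for multi-level page tables; the class and field names are invented for illustration and do not correspond to anything in the KVM source.

```python
PAGE_SIZE = 4096


class ShadowMMU:
    """Toy model: the shadow table is filled lazily, one fault at a time."""

    def __init__(self, guest_pt, host_map):
        self.guest_pt = guest_pt  # guest virtual page -> guest "physical" page
        self.host_map = host_map  # guest "physical" page -> real physical page
        self.shadow = {}          # guest virtual page -> real physical page
        self.faults = 0

    def translate(self, guest_virtual_addr):
        vpn, offset = divmod(guest_virtual_addr, PAGE_SIZE)
        if vpn not in self.shadow:
            # Shadow miss: the hypervisor walks both tables once and
            # records the combined mapping.
            self.faults += 1
            guest_phys_page = self.guest_pt[vpn]
            self.shadow[vpn] = self.host_map[guest_phys_page]
        # Shadow hit: a single lookup, as the hardware would do.
        return self.shadow[vpn] * PAGE_SIZE + offset


mmu = ShadowMMU(guest_pt={0: 5, 1: 7}, host_map={5: 100, 7: 42})
first = mmu.translate(0x10)    # faults, then maps through pages 5 and 100
second = mmu.translate(0x10)   # served from the shadow table
```

After the first access to each page, the faults counter stops moving; that is the "native speed" state described above.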

With the version of KVM found in 2.6.20-rc4, that happy situation tends not to last for very long, though. Once the guest performs a context switch, the painfully-built shadow page table is dumped and a new one is started. Changing the shadow table is required, since the process running after the context switch will have a different set of address mappings. But, when the previous process gets back into the CPU, it would be nice if its shadow page tables were there waiting for it.

The shadow page table caching patch posted by Avi Kivity does just that. Rather than just dump the shadow table, it sets that table aside so that it can be loaded again the next time it's needed. The idea seems simple, but the implementation requires a 33-part patch - there are a lot of details to take care of. Much of the trouble comes from the fact that the host cannot always tell for sure when the guest has made a page table entry change. As a result, guest page tables must be write-protected. Whenever the guest makes a change, it will trap into the hypervisor, which can complete the change and update the shadow table accordingly.
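The effect of caching can be seen in a small simulation. The sketch below is an illustrative model, not Avi's implementation: shadow tables are keyed by a made-up identifier for the guest's page table root, page tables are flat dictionaries, and guest_pte_write() stands in for the trap taken when the guest writes to a write-protected page table.

```python
class Hypervisor:
    """Toy model of shadow page tables with an optional per-root cache."""

    def __init__(self, host_map, cache_shadows):
        self.host_map = host_map          # guest "physical" page -> real page
        self.cache_shadows = cache_shadows
        self.cache = {}                   # guest root id -> saved shadow table
        self.shadow = {}                  # shadow table currently in use
        self.root = None
        self.guest_pt = None
        self.faults = 0

    def context_switch(self, root, guest_pt):
        if self.cache_shadows and self.root is not None:
            self.cache[self.root] = self.shadow  # set aside, do not dump
        self.root, self.guest_pt = root, guest_pt
        self.shadow = self.cache.get(root, {}) if self.cache_shadows else {}

    def touch(self, vpn):
        if vpn not in self.shadow:
            self.faults += 1
            self.shadow[vpn] = self.host_map[self.guest_pt[vpn]]
        return self.shadow[vpn]

    def guest_pte_write(self, root, guest_pt, vpn, new_guest_page):
        # Write-protect trap: complete the guest's write, then bring any
        # existing shadow entry (active or cached) back in sync.
        guest_pt[vpn] = new_guest_page
        table = self.shadow if root == self.root else self.cache.get(root)
        if table is not None and vpn in table:
            table[vpn] = self.host_map[new_guest_page]


def run(cache_shadows, rounds=4):
    """Alternate between two guest processes; return the shadow fault count."""
    host_map = {g: 1000 + g for g in range(6)}
    procs = {"A": {0: 0, 1: 1, 2: 2}, "B": {0: 3, 1: 4, 2: 5}}
    hv = Hypervisor(host_map, cache_shadows)
    for _ in range(rounds):
        for root, page_table in procs.items():
            hv.context_switch(root, page_table)
            for vpn in page_table:
                hv.touch(vpn)
    return hv.faults
```

In this model, run(False) rebuilds three shadow entries on every one of the eight context switches, while run(True) only pays for the first visit of each process. The numbers are an artifact of the toy workload and say nothing about the real patch's performance.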

To make the write-protect mechanism work, the caching patch must add a reverse-mapping mechanism to allow it to trace faults back to the page table(s) of interest. There is also an interesting situation where, occasionally, a page will stop being used as a page table without the host system knowing about it. To detect that situation, the KVM code looks for overly-frequent or misaligned writes, either of which indicates (heuristically) that the function of the page has changed.
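A loose model of that bookkeeping might look like the following. Everything here is an assumption made for illustration: the entry size, the flood threshold, and the PageTableTracker name are not taken from the patch, and the real reverse map tracks finer-grained objects than the shadow table names used here.

```python
from collections import defaultdict

PTE_SIZE = 8           # assumed size of one page table entry, in bytes
WRITE_FLOOD_LIMIT = 3  # hypothetical threshold, not the value KVM uses


class PageTableTracker:
    """Toy reverse map from write-protected guest pages to shadow tables."""

    def __init__(self):
        self.rmap = defaultdict(set)         # guest page frame -> shadow tables
        self.write_count = defaultdict(int)  # writes seen per protected page

    def protect(self, gfn, shadow_id):
        self.rmap[gfn].add(shadow_id)

    def on_write_fault(self, gfn, offset):
        """Return the shadow tables to update; empty if the page is untracked."""
        if gfn not in self.rmap:
            return set()
        self.write_count[gfn] += 1
        misaligned = offset % PTE_SIZE != 0
        flooded = self.write_count[gfn] > WRITE_FLOOD_LIMIT
        if misaligned or flooded:
            # Heuristic: this no longer looks like page table traffic, so
            # stop write-protecting the page and forget its shadows.
            del self.rmap[gfn]
            del self.write_count[gfn]
            return set()
        return set(self.rmap[gfn])
```

An aligned write to a tracked page yields the shadow tables needing an update; a misaligned write, or one write too many, drops the page from tracking.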

The 2.6.20 kernel is in a relatively late stage of development, with the final release expected later this month. Even so, Avi would like to see this large change merged now. Ingo Molnar concurs, saying:

I have tested the new MMU changes quite extensively and they are converging nicely. It brings down context-switch costs by a factor of 10 and more, even for microbenchmarks: instead of throwing away the full shadow pagetable hierarchy we have worked so hard to construct this patchset allows the intelligent caching of shadow pagetables. The effect is human-visible as well - the system got visibly snappier

Since the KVM code is new for 2.6.20, changes within it cannot cause regressions for anybody. So this sort of feature addition is likely to be allowed, even this late in the development cycle.

Ingo has been busy on this front, announcing a patch entitled KVM paravirtualization for Linux. It is a set of patches which allows a Linux guest to run under KVM. It is a paravirtualization solution, though, rather than full virtualization: the guest system knows that it is running as a virtual guest. Paravirtualization should not be strictly necessary with hardware virtualization support, but a paravirtualized kernel can take some shortcuts which speed things up considerably. With these patches and the full set of KVM patches, Ingo is able to get benchmark results which are surprisingly close to native hardware speeds, and at least an order of magnitude faster than running under Qemu.

This patch is, in fact, the current form of the paravirt_ops concept. With paravirt_ops, low-level, hardware-specific operations are hidden behind a structure full of member functions. This paravirt_ops structure, by default, contains functions which operate on the hardware directly. Those functions can be replaced, however, by alternatives which operate through a hypervisor. Ingo's patch replaces a relatively small set of operations - mostly those involved with the maintenance of page tables.
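The pattern is easy to picture outside of the kernel. In the Python sketch below, a frozen dataclass plays the role of the structure full of function pointers; the operation names and the hypercall stand-in are invented for illustration and are not the kernel's actual paravirt_ops members.

```python
from dataclasses import dataclass, replace
from typing import Callable

log = []  # records which implementation handled each operation


def native_set_pte(addr, value):
    log.append(("native", addr, value))


def native_flush_tlb():
    log.append(("native_flush",))


def hypercall_set_pte(addr, value):
    # Stand-in for asking the hypervisor to make the change.
    log.append(("hypercall", addr, value))


@dataclass(frozen=True)
class ParavirtOps:
    """By default, every operation goes straight to the 'hardware'."""
    set_pte: Callable = native_set_pte
    flush_tlb: Callable = native_flush_tlb


native_ops = ParavirtOps()
# A paravirtualized guest overrides only the operations it cares about.
kvm_guest_ops = replace(native_ops, set_pte=hypercall_set_pte)

kvm_guest_ops.set_pte(0x1000, 5)  # routed through the hypervisor stand-in
kvm_guest_ops.flush_tlb()         # still the native implementation
```

Callers never need to know which environment they are running in; they just call through the structure.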

There was one interesting complaint which came out of Ingo's patch - even though Ingo's new code is not really the problem. The paravirt_ops structure is exported to modules, making it possible for loadable modules to work properly with hypervisors. But there are many operations in paravirt_ops which have never been made available to modules in the past. So paravirt_ops represents a significant widening of the module interface. Ingo responded with a patch which splits paravirt_ops into two structures, only one of which (paravirt_mod_ops) is exported to modules. It seems that the preferred approach, however, will be to create wrapper functions around the operations deemed suitable for modules and export those. That minimizes the intrusiveness of the patch and keeps the paravirt_ops structure out of module reach.

One remaining nagging little detail with the KVM subsystem is what the interface to user space will look like. Avi Kivity has noted that the API currently found in the mainline kernel has a number of shortcomings and will need some changes; many of those, it appears, are likely to show up in 2.6.21. The proposed API is still heavy on ioctl() calls, which does not sit well with all developers, but no alternatives have been proposed. This is a discussion which is likely to continue for some time yet.

Perhaps the most interesting outcome of all this, however, is how KVM is gaining momentum as the virtualization approach of choice - at least for contemporary and future hardware. One can almost see the interest in Xen (for example) fading; KVM comes across as a much simpler, more maintainable way to support both full virtualization and paravirtualization. The community seems to be converging on KVM as the low-level virtualization interface; commercial vendors of higher-level products will want to adapt to this interface if they want their products to be supported in the future.

Index entries for this article
Kernel	KVM
Kernel	paravirt_ops
Kernel	Virtualization/KVM


Some KVM developments

Posted Jan 13, 2007 19:47 UTC (Sat) by nlucas (subscriber, #33793) [Link] (2 responses)

I always thought the KVM way was better than the Xen way, since I had contact with the colinux way of doing things. With the added support for paravirtualization, it should not be long until KVM will also support CPU's without the VMX hardware support, which will make many people wonder why they need Xen.

Paravirtualization will not make you run Windows on it if your CPU doesn't have hardware support, but will be enough to test and run different Linux and other open source systems with much better performance than with Qemu (for many things, users will not notice the difference).

It's even conceivable that a KVM application could do it in a way that is transparent to the user: if the CPU doesn't support VMX, it will fall back to using Qemu emulation to run the native system image (as KVM also needs Qemu support for emulating some hardware). So KVM will behave as the Qemu hardware accelerator module (the Kqemu closed source driver).

Some KVM developments

Posted Jan 17, 2007 2:23 UTC (Wed) by ttfkam (guest, #29791) [Link]

It also opens up quite a few avenues that simply weren't possible before. For example:

Combined with the LinuxBIOS project, this allows a "bare" PC to be bootable while still allowing for Windows to be loaded; no longer either/or.

Dual booting is even easier as both OSes would be under the control of the hypervisor. With enough RAM and disk space, it would literally be two (or more) computers in one.

Sharing data between the OSes would no longer be dependent upon Linux support for NTFS -- or any other Windows or Mac filesystem -- but rather SMB/CIFS (read: Samba) access in both directions: Linux->Windows, Windows->Linux, Linux->OSX, and OSX->Linux.

Expanded use of kernel-built-in paravirtualization means that hardware drivers are handled at the hypervisor level while apps and config are at the guest level. Backup and restoration of servers cease to be hardware configuration-dependent. Having this in the vanilla kernel raises the possibility of a uniform backup utility for any and all flavors/distributions.

Syslog to the hypervisor by default so that even if a guest is compromised, the logs cannot be. (Or at least the difficulty in altering logs raises dramatically.)

That's off the top of my head. Any other bright ideas to share?

Some KVM developments

Posted Feb 5, 2007 19:12 UTC (Mon) by mmarq (guest, #2332) [Link]

"" With the added support for paravirtualization, it should not be long until KVM will also support CPU's without the VMX hardware support, which will make many people wonder why they need Xen. ""

Well, in the datacenter, the ability to *migrate* VMs on failure seems to me to be invaluable. The industry is converging on x86_64, so the ability to emulate MIPS or POWER does not seem to me to be something that is very hard to pass up.

Some KVM developments

Posted Jan 15, 2007 13:46 UTC (Mon) by job (guest, #670) [Link] (1 responses)

Very interesting. I would love to see an in-depth comparison and/or some benchmarks of KVM, Xen and VMware!

Some KVM developments

Posted Feb 5, 2007 21:25 UTC (Mon) by riel (subscriber, #3142) [Link]

It's not very in-depth, but a comparison of the various virtualization methods for Linux is available on http://virt.kernelnewbies.org/TechComparison

Some background info is available on other pages of that site.

Nested paging?

Posted Jan 19, 2007 6:15 UTC (Fri) by jking (guest, #42868) [Link]

Have the KVM developers looked at AMD's and Intel's hardware support for nested paging? If used properly, it would seem to make the entire shadow page table problem/patch moot.