Some KVM developments
While KVM is in the mainline, it is not exactly in a finished state yet, and it may see significant changes before and after the 2.6.20 release. One current problem has to do with the implementation of "shadow page tables," which does not perform as well as one would like. The solution is conceptually straightforward - at least, once one understands what shadow page tables do.
A page table, of course, is a mapping from a virtual address to the associated physical address (or a flag that said mapping does not currently exist). A virtualized operating system is given a range of "physical" memory to work with, and it implements its own page tables to map between its virtual address spaces and that memory range. But the guest's "physical" memory is a virtual range administered by the host; guests do not deal directly with "bare metal" memory. The result is that there are actually two sets of page tables between a virtual address space on a virtualized guest and the real, physical memory it maps to. The guest can set up one level of translation, but only the host can manage the mapping between the guest's "physical" memory and the real thing.
This situation is handled by way of shadow page tables. The virtualized client thinks it is maintaining its own page tables, but the processor does not actually use them. Instead, the host system implements a "shadow" table which mirror's the guest's table, but which maps guest virtual addresses directly to physical addresses. The shadow table starts out empty; every page fault on the guest then results in the filling in of the appropriate shadow entry. Once the guest has faulted in the pages it needs, it will be able to run at native speed with no further hypervisor attention required.
With the version of KVM found in 2.6.20-rc4, that happy situation tends not to last for very long, though. Once the guest performs a context switch, the painfully-built shadow page table is dumped and a new one is started. Changing the shadow table is required, since the process running after the context switch will have a different set of address mappings. But, when the previous process gets back into the CPU, it would be nice if its shadow page tables were there waiting for it.
The shadow page table caching patch posted by Avi Kivity does just that. Rather than just dump the shadow table, it sets that table aside so that it can be loaded again the next time it's needed. The idea seems simple, but the implementation requires a 33-part patch - there are a lot of details to take care of. Much of the trouble comes from the fact that the host cannot always tell for sure when the guest has made a page table entry change. As a result, guest page tables must be write-protected. Whenever the guest makes a change, it will trap into the hypervisor, which can complete the change and update the shadow table accordingly.
To make the write-protect mechanism work, the caching patch must add a reverse-mapping mechanism to allow it to trace faults back to the page table(s) of interest. There is also an interesting situation where, occasionally, a page will stop being used as a page table without the host system knowing about it. To detect that situation, the KVM code looks for overly-frequent or misaligned writes, either of which indicates (heuristically) that the function of the page has changed.
The 2.6.20 kernel is in a relatively late stage of development, with the final release expected later this month. Even so, Avi would like to see this large change merged now. Ingo Molnar concurs, saying:
Since the KVM code is new for 2.6.20, changes within it cannot cause regressions for anybody. So this sort of feature addition is likely to be allowed, even this late in the development cycle.
Ingo has been busy on this front, announcing a patch entitled KVM paravirtualization for Linux. It is a set of patches which allows a Linux guest to run under KVM. It is a paravirtualization solution, though, rather than full virtualization: the guest system knows that it is running as a virtual guest. Paravirtualization should not be strictly necessary with hardware virtualization support, but a paravirtualized kernel can take some shortcuts which speed things up considerably. With these patches and the full set of KVM patches, Ingo is able to get benchmark results which are surprisingly close to native hardware speeds, and at least an order of magnitude faster than running under Qemu.
This patch is, in fact, the current form of the paravirt_ops concept. With paravirt_ops, low-level, hardware-specific operations are hidden behind a structure full of member functions. This paravirt_ops structure, by default, contains functions which operate on the hardware directly. Those functions can be replaced, however, by alternatives which operate through a hypervisor. Ingo's patch replaces a relatively small set of operations - mostly those involved with the maintenance of page tables.
There was one interesting complaint which come out of Ingo's patch - even though Ingo's new code is not really the problem. The paravirt_ops structure is exported to modules, making it possible for loadable modules to work properly with hypervisors. But there are many operations in paravirt_ops which have never been made available to modules in the past. So paravirt_ops represents a significant widening of the module interface. Ingo responded with a patch which splits paravirt_ops into two structures, only one of which (paravirt_mod_ops) is exported to modules. It seems that the preferred approach, however, will be to create wrapper functions around the operations deemed suitable for modules and export those. That minimizes the intrusiveness of the patch and keeps the paravirt_ops structure out of module reach.
One remaining nagging little detail with the KVM subsystem is what the interface to user space will look like. Avi Kivity has noted that the API currently found in the mainline kernel has a number of shortcomings and will need some changes; many of those, it appears, are likely to show up in 2.6.21. The proposed API is still heavy on ioctl() calls, which does not sit well with all developers, but no alternatives have been proposed. This is a discussion which is likely to continue for some time yet.
Perhaps the most interesting outcome of all this, however, is how KVM is
gaining momentum as the virtualization approach of choice - at least for
contemporary and future hardware. One can almost see the interest in Xen
(for example) fading; KVM comes across as a much simpler, more maintainable
way to support full and paravirtualization. The community seems to be
converging on KVM as the low-level virtualization interface;
commercial vendors of higher-level products will want to adapt to this
interface if they want their products to be supported in the future.
| Index entries for this article | |
|---|---|
| Kernel | KVM |
| Kernel | paravirt_ops |
| Kernel | Virtualization/KVM |