Link-time optimization for the kernel
The idea behind LTO is to examine the entire program after the individual files have been compiled and exploit any additional optimization opportunities that appear. The most significant of those opportunities appears to be the inlining of small functions across object files. The compiler can also be more aggressive about detecting and eliminating unused code and data. Under the hood, LTO works by dumping the compiler's intermediate representation (the "GIMPLE" code) into the resulting object file whenever a source file is compiled. The actual LTO stage is then carried out by loading all of the GIMPLE code into a single in-core image, optimizing across the whole program, and generating the (presumably) further-optimized object code.
The LTO feature first appeared in GCC 4.5, but it has only really started to become useful in the 4.7 release. It still has a number of limitations; one of those is that all of the object files involved must be compiled with the same set of command-line options. That limitation turns out to be a problem with the kernel, as will be seen below.
Andi's LTO patch set weighs in at 74 changesets — not a small or unintrusive change. But it turns out that most of the changes have the same basic scope: ensuring that the compiler knows that specific symbols are needed even if they appear to be unused; that prevents the LTO stage from optimizing them away. For example, symbols exported to modules may not have any callers in the core kernel itself, but they need to be preserved for modules that may be loaded later. To that end, Andi's first patch defines a new attribute (__visible) used to mark such symbols; most of the remaining patches are dedicated to the addition of __visible attributes where they are needed.
Beyond that, there is a small set of fixes for specific problems encountered when building kernels with LTO. It seems that functions with long argument lists can get their arguments corrupted if the functions are inlined during the LTO stage; avoiding that requires marking the functions noinline. Andi complains "I wish there was a generic way to handle this. Seems like a ticking time bomb problem." In general, he acknowledges the possibility that LTO may introduce new, optimization-related bugs into the kernel; finding all of those could be a challenge.
Then there is the requirement that all files be built with the same set of options. Current kernels are not built that way; different options are used in different parts of the tree. In some places, this problem can be worked around by disabling specific optimizations that depend on different compiler flags than are used in the rest of the kernel. In others, though, features must simply be disabled to use LTO. These include the "modversions" feature (allowing kernel modules to be used with more than one kernel version) and the function tracer. Modversions seems to be fixable; getting ftrace to work may require changes to GCC, though.
It is also necessary, of course, to change the build system to use the GCC LTO feature. As of this writing, one must have a current GCC release; it is also necessary to install a development version of the binutils package for LTO to work. Even a minimal kernel requires about 4GB of memory for the LTO pass; an "allyesconfig" build could require as much as 9GB. Given that, the use of 32-bit systems for LTO kernel builds is out of the question; it is still possible, of course, to build a 32-bit kernel on a 64-bit system. The build will also take between two and four times as long as it does without LTO. So developers are unlikely to make much use of LTO for their own work, but it might be of interest to distributors and others who are building production kernels.
The fact that most people will not want to do LTO builds actually poses a bit of a problem. Given the potential for LTO to introduce subtle bugs, due either to optimization-related misunderstandings or simple bugs in the new LTO feature itself, widespread testing is clearly called for before LTO is used for production kernels. But if developers and testers are unwilling to do such heavyweight builds, that testing may be hard to come by. That will make it harder to achieve the level of confidence that will be needed before LTO-built kernels can be used in real-world settings.
Given the above challenges, the size of the patch set, and the ongoing maintenance burden of keeping LTO working, one might well wonder if it is all worth it. And that comes down entirely to the numbers: how much faster does the kernel get when LTO is used? Hard numbers are not readily available at this time; the LTO patch set is new and there are still a lot of things to be fixed. Andi reports that runs of the "hackbench" benchmark gain about 5%, while kernel builds don't change much at all. Some networking benchmarks improve as much as 18%. There are also some unspecified "minor regressions." The numbers are rough, but Andi believes they are encouraging enough to justify further work; he also expects the LTO implementation in GCC to improve over time.
Andi also suggests that, in the long term, LTO could help to improve the quality of the kernel code base by eliminating the need to put inline functions into include files.
All told, this is a patch set in a very early stage of development; it seems unlikely to be proposed for merging into a near-term kernel, even as an experimental feature. In the longer term, though, it could lead to faster kernels; use of LTO in the kernel could also help to drive improvements in the GCC implementation that would benefit all projects. So it is an effort that is worth keeping an eye on.