Kernel development
Brief items
Kernel release status
The current development kernel is 3.6-rc3, released on August 22. Linus says: "Shortlog appended, there's nothing here that makes me go 'OMG! Scary!' or makes me want to particularly mention it separately. All just random updates and fixes."
Previously, 3.6-rc2 was released on August 16. "Anyway, with all that said, things don't seem too bad. Yes, I ignored a few pull requests, but I have to say that there weren't all that many of those, and the rest looked pretty calm. Sure, there's 330+ commits in there, but considering that it's been two weeks, that's about expected (or even a bit low) for early -rc's. Yes, 3.5 may have been much less for -rc2, but that was unusual."
Stable updates: 2.6.34.13 and 3.2.28 were both released on August 20.
Quotes of the week
If we have carefully made a decision to inline a function, we should (now) use __always_inline. If we have carefully made a decision to not inline a function, we should use noinline. If we don't care, we should omit all such markings.
This leaves no place for "inline"?
Long-term support for the 3.4 kernel
Greg Kroah-Hartman has announced that the 3.4 kernel will receive stable updates for a period of at least two years. It joins 3.0 (which has at least one more year of support) on the long-term support list.
Kernel development news
The return of power-aware scheduling
Years of work to improve power utilization in Linux have made one thing clear: efficient power behavior must be implemented throughout the system. That certainly includes the CPU scheduler, but the kernel's scheduler currently has little in the way of logic aimed at minimizing power use. A recent proposal has started a discussion on how the scheduler might be made to be more power-aware. But, as this discussion shows, there is no single, straightforward answer to the question of how power-aware scheduling should be done.
Interestingly, the scheduler did have power-aware logic from 2.6.18 through 3.4. There was a sysctl knob (sched_mc_power_savings) that would cause the scheduler to try to group runnable processes onto the smallest possible number of cores, allowing others to go idle. That code was removed in 3.5 because it never worked very well and nobody was putting any effort into improving it. The result was the removal of some rather unloved code, but it also left the scheduler with no power awareness at all. Given the level of interest in power savings in almost every environment, having a power-unaware scheduler seems less than optimal; it was only a matter of time until somebody tried to put together a better solution.
Alex Shi started off the conversation with a rough proposal on how power awareness might be added back to the scheduler. This proposal envisions two modes, called "power" and "performance," that would be used by the scheduler to guide its decisions. Some of the first debate centered around how that policy would be chosen, with some developers suggesting that "performance" could be used while on AC power and "power" when on battery power. But that policy entirely ignores an important constituency: data centers. Operators of data centers are becoming increasingly concerned about power usage and its associated costs; many of them are likely to want to run in a lower-power mode regardless of where the power is coming from. The obvious conclusion is that the kernel needs to provide a mechanism by which the mode can be chosen; the policy can then be decided by the system administrator.
The harder question is: what would that policy decision actually do? The old power code tried to cause some cores, at least, to go completely idle so that they could go into a sleep state. The proposal from Alex takes a different approach. Alex claims that trying to idle a subset of the CPUs in the system is not going to save much power; instead, it is best to spread the runnable processes across the system as widely as possible and try to get to a point where all CPUs can go idle. That seems to be the best approach, on x86-class processors, anyway. On that architecture, no processor can go into a deep sleep state unless they all go into that state; having even a single processor running will keep the others in a less efficient sleep state. A single processor also keeps associated hardware — the memory controller, for example — in a powered-up state. The first CPU is by far the most expensive one; bringing in additional CPUs has a much lower incremental cost.
So the general rule seems to be: keep all of the processors busy as long as there is work to be done. This approach should lead to the quickest processing and the best cache utilization; it also gives the best power utilization. In other words, the best policy for power savings looks a lot like the best policy for performance. That conclusion came as a surprise to some but, as Arjan van de Ven pointed out, it makes some sense.
So why bother with multiple scheduling modes in the first place? Naturally enough, there are some complications that enter this picture and make it a little bit less neat. The first of these is that spreading load across processors only helps if the new processors are actually put to work for a substantial period of time, for values of "substantial" around 100μs. For any shorter period, the cost of bringing the CPU out of even a shallow sleep exceeds the savings gained from running a process there. So extra CPUs should not be brought into play for short-lived tasks. Properly implementing that policy is likely to require that the kernel gain a better understanding of the behavior of the processes running in any given workload.
There is also still scope for some differences of behavior between the two modes. In a performance-oriented mode, the scheduler might balance tasks more aggressively, trying to keep the load the same on all processors. In a power-savings mode, processes might stay a bit more tightly packed onto a smaller number of CPUs, especially processes that have an observed history of running for very short periods of time.
But the conversation has, arguably, only barely touched on the biggest complication of all. There was a lot of talk of what the optimal behavior is for current-generation x86 processors, but that is far from the only environment in which Linux runs. ARM processors have a complex set of facilities for power management, allowing much finer control over which parts of the system have power and clocks at any given time. The ARM world is also pushing the boundaries with asymmetric architectures like big.LITTLE; figuring out the optimal task placement for systems with more than one type of CPU is not going to be an easy task.
The problem is thus architecture-specific; optimal behavior on one architecture may yield poor results on another. But the eventual solution needs to work on all of the important architectures supported by Linux. And, preferably, it should be easily modifiable to work on future versions of those architectures, since the way to get the best power utilization is likely to change over time. That suggests that the mechanism currently used to describe architecture-specific details to the scheduler (scheduling domains) needs to grow the ability to describe parameters relevant to power management as well. An architecture-independent scheduler could then use those parameters to guide its behavior. That scheduler will also need a better understanding of process behavior; the almost-ready per-entity load tracking patch set may help in this regard.
Designing and implementing these changes is clearly not going to be a short-term job. It will require a fair amount of cooperation between the core scheduler developers and those working on specific architectures. But, given how long we have been without power management support in the scheduler, and given that the bulk of the real power savings are to be had elsewhere (in drivers and in user space, for example), we can wait a little longer while a proper scheduler solution is worked out.
Link-time optimization for the kernel
The kernel tends to place an upper limit on how quickly any given workload can run, so it is unsurprising that kernel developers are always on the lookout for ways to make the system go faster. Significant amounts of work can be put into optimizations that, on the surface, seem small. So when the opportunity comes to make the kernel go faster without the need to rewrite any performance-critical code paths, there will naturally be a fair amount of interest. Whether the "link-time optimization" (LTO) feature supported by recent versions of GCC is such an opportunity or not is yet to be proved, but Andi Kleen is determined to find out.
The idea behind LTO is to examine the entire program after the individual files have been compiled and exploit any additional optimization opportunities that appear. The most significant of those opportunities appears to be the inlining of small functions across object files. The compiler can also be more aggressive about detecting and eliminating unused code and data. Under the hood, LTO works by dumping the compiler's intermediate representation (the "GIMPLE" code) into the resulting object file whenever a source file is compiled. The actual LTO stage is then carried out by loading all of the GIMPLE code into a single in-core image and rewriting the (presumably) further-optimized object code.
The LTO feature first appeared in GCC 4.5, but it has only really started to become useful in the 4.7 release. It still has a number of limitations; one of those is that all of the object files involved must be compiled with the same set of command-line options. That limitation turns out to be a problem with the kernel, as will be seen below.
Andi's LTO patch set weighs in at 74 changesets — not a small or unintrusive change. But it turns out that most of the changes have the same basic scope: ensuring that the compiler knows that specific symbols are needed even if they appear to be unused; that prevents the LTO stage from optimizing them away. For example, symbols exported to modules may not have any callers in the core kernel itself, but they need to be preserved for modules that may be loaded later. To that end, Andi's first patch defines a new attribute (__visible) used to mark such symbols; most of the remaining patches are dedicated to the addition of __visible attributes where they are needed.
Beyond that, there is a small set of fixes for specific problems encountered when building kernels with LTO. It seems that functions with long argument lists can get their arguments corrupted if the functions are inlined during the LTO stage; avoiding that requires marking the functions noinline. Andi complains: "I wish there was a generic way to handle this. Seems like a ticking time bomb problem." In general, he acknowledges the possibility that LTO may introduce new, optimization-related bugs into the kernel; finding all of those could be a challenge.
Then there is the requirement that all files be built with the same set of options. Current kernels are not built that way; different options are used in different parts of the tree. In some places, this problem can be worked around by disabling specific optimizations that depend on different compiler flags than are used in the rest of the kernel. In others, though, features must simply be disabled to use LTO. These include the "modversions" feature (allowing kernel modules to be used with more than one kernel version) and the function tracer. Modversions seems to be fixable; getting ftrace to work may require changes to GCC, though.
It is also necessary, of course, to change the build system to use the GCC LTO feature. As of this writing, one must have a current GCC release; it is also necessary to install a development version of the binutils package for LTO to work. Even a minimal kernel requires about 4GB of memory for the LTO pass; an "allyesconfig" build could require as much as 9GB. Given that, the use of 32-bit systems for LTO kernel builds is out of the question; it is still possible, of course, to build a 32-bit kernel on a 64-bit system. The build will also take between two and four times as long as it does without LTO. So developers are unlikely to make much use of LTO for their own work, but it might be of interest to distributors and others who are building production kernels.
The fact that most people will not want to do LTO builds actually poses a bit of a problem. Given the potential for LTO to introduce subtle bugs, due either to optimization-related misunderstandings or simple bugs in the new LTO feature itself, widespread testing is clearly called for before LTO is used for production kernels. But if developers and testers are unwilling to do such heavyweight builds, that testing may be hard to come by. That will make it harder to achieve the level of confidence that will be needed before LTO-built kernels can be used in real-world settings.
Given the above challenges, the size of the patch set, and the ongoing maintenance burden of keeping LTO working, one might well wonder if it is all worth it. And that comes down entirely to the numbers: how much faster does the kernel get when LTO is used? Hard numbers are not readily available at this time; the LTO patch set is new and there are still a lot of things to be fixed. Andi reports that runs of the "hackbench" benchmark gain about 5%, while kernel builds don't change much at all. Some networking benchmarks improve as much as 18%. There are also some unspecified "minor regressions." The numbers are rough, but Andi believes they are encouraging enough to justify further work; he also expects the LTO implementation in GCC to improve over time.
Andi also suggests that, in the long term, LTO could help to improve the quality of the kernel code base by eliminating the need to put inline functions into include files.
All told, this is a patch set in a very early stage of development; it seems unlikely to be proposed for merging into a near-term kernel, even as an experimental feature. In the longer term, though, it could lead to faster kernels; use of LTO in the kernel could also help to drive improvements in the GCC implementation that would benefit all projects. So it is an effort that is worth keeping an eye on.
Ask a kernel developer: maintainer workflow
In this edition of "ask a kernel developer", I answer a multi-part question about kernel subsystem maintenance from a new maintainer. The workflow that I use to handle patches in the USB subsystem is used as an example to hopefully provide a guide for those who are new to the maintainer role.
As always, if you have unanswered questions relating to technical or procedural issues in Linux kernel development, ask them in the comment section, or email them directly to me. I will try to get to them in another installment down the road.
I have some questions about what I am supposed to be doing at different points of the release cycle. -rc1 and -rc2 are spelled out in Documentation/HOWTO, and I have a decent idea that patches I accept should be smaller and fix more critical bugs as the -rcX's roll out. The big question is what do I do with all of the other patches that come at random times?
First off, thanks so much for agreeing to maintain a kernel subsystem. Without maintainers like you, the Linux kernel development process would be much more chaotic and hard to navigate. I will try to explain how I have set up my development workflow and how I maintain the different subsystems I am in charge of. That example can help you determine how you wish to manage your own development trees, and how to handle incoming patches from developers.
To answer the question: yes, you will receive patches at any point in the release cycle, but not all of them can be sent on to Linus right away; it depends on where we are in that cycle. I'll go into more detail below, but for now, realize that, in my opinion, you should not require other developers to wait for particular points in the release cycle; instead, you should hold onto patches and send them upstream when they are appropriate. I think it is the maintainer's job to do the buffering.
How best do I organize my pull-request branches so that developers know which they can pull as dependencies, and which are for-next? I don't want to over-organize it, but I do want to make it easy for board submitters to test from my trees. Should my pull-request branches be long-lived, or should I kill them and create new ones after each cycle?
It's best to stick with a simple scheme for branches, work with that for a while, and then if you find that is too limiting, feel free to grow from there. I only have two branches in my git trees, one to feed to Linus for the current release cycle, and one that is for the next release cycle. This can be seen in the USB git tree on kernel.org, which shows three branches:
- master, which tracks Linus's tree
- usb-linus, which contains patches to go to Linus for this release cycle
- usb-next, which contains the patches to go to Linus for the next release cycle.
I receive patches from lots of different developers all the time. All patches, after they pass an initial "is this sane" glance, get copied to a mailbox that I call TODO. Every few days, depending on my workload, I go through the mailbox and pick out all of the patches that are to be applied to various trees I am responsible for. For this example, I'll search on anything that touches the USB tree and copy those messages to a temporary local mailbox on the filesystem called s (I name my local mailboxes for their ease of typing, not for any other good reason.)
After digging all of the USB patches out (which is really a simple filter for all threads that have the "drivers/usb" string in them), I take a closer look at the patches in the s mailbox.
First I look to find anything that would be applicable to Linus's current tree. This is usually a bug fix for something that was introduced during this merge window, or a regression for systems that were previously working just fine. I pick those out and save them to another temporary mailbox called s1.
Now it's time to start testing to see if the patches actually apply to the tree. I go into a directory that contains my usb tree and check to see what branch I am on:
$ cd linux/work/usb
$ git b
master 6dab7ed Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
* usb-linus 8f057d7 gpu/mfd/usb: Fix USB randconfig problems
usb-next 26f944b usb: hcd: use *resource_size_t* for specifying resource data
work-linus 8f057d7 gpu/mfd/usb: Fix USB randconfig problems
work-next 26f944b usb: hcd: use *resource_size_t* for specifying resource data
Note, I have the following aliases in my ~/.gitconfig file:
[alias]
dc = describe --contains
fp = format-patch -k -M -N
b = branch -v
These aliases let me use git b to see the current branch more easily, git fp to format patches in the style I need them in, and git dc to determine exactly which release contains a specific git commit.
As you can see by the list of branches, I have a local branch that mirrors the public versions of the usb-linus and usb-next branches called work-linus and work-next. I do the testing and development work in these local branches, and only when I feel they are "good enough" do I push them to the public facing branches and then out to kernel.org.
So, back to work. As I am working on patches [1, 2] that are to be sent to Linus first, let's change to the local working version of that branch:
$ git checkout work-linus
Switched to branch 'work-linus'
Then a quick sanity check to verify that the patches in s1 really will apply to this tree (sadly, they often do not):
$ p1 < ../s1
patching file drivers/usb/core/endpoint.c
patching file drivers/usb/core/quirks.c
patching file drivers/usb/core/sysfs.c
Hunk #2 FAILED at 210.
1 out of 2 hunks FAILED -- saving rejects to file drivers/usb/core/sysfs.c.rej
patching file drivers/usb/storage/transport.c
patching file include/linux/usb/quirks.h
(Note: the p1 command is really patch -p1 -g1 --dry-run, which I set up in my .alias file years ago after I quickly got tired of typing the full thing out.)
Here is an example of patches that will not apply to the work-linus branch, but it turns out that this was my fault. They were generated against the linux-next branch, and really should be queued up for the next merge window, not for this release.
So, let's switch back to the work-next branch, as that is where the patches really belong:
$ git checkout work-next
Switched to branch 'work-next'
And see if they apply there properly:
$ p1 < ../s1
patching file drivers/usb/core/endpoint.c
patching file drivers/usb/core/quirks.c
patching file drivers/usb/core/sysfs.c
patching file drivers/usb/storage/transport.c
patching file include/linux/usb/quirks.h
Much better.
Then I look at the patches themselves again in my email client, and edit anything that needs to be cleaned up. The changes could be in the Subject, the body of the patch, or any other things that need to be touched up. With developers who send patches all the time, no changes generally need to be done in this area, but, unfortunately, I end up editing this type of "metadata" all the time.
After the patches look clean, and I've done a review of them again to verify that I don't notice anything strange or suspicious, I do one last sanity check by running the checkpatch.pl tool:
$ ./scripts/checkpatch.pl ../s1
total: 0 errors, 0 warnings, 73 lines checked
../s1 has no obvious style problems and is ready for submission.
All looks good, so let's apply the patches to the branch and see if the build works properly:
$ git am -s ../s1
Applying: usb/endpoint: Set release callback in the struct device_type instead of in the device itself directly
Applying: usb: convert USB_QUIRK_RESET_MORPHS to USB_QUIRK_RESET
$ make -j8
If everything built, then it's time to test the patches. This can range from installing the changed kernel and ensuring that everything still works properly and the new modifications work as they say they should, to doing nothing more than verifying that the build didn't break if I do not have the hardware that the changed driver controls.
After this, and everything looks sane, it's time to push the patches to the public kernel.org repository, as well as notifying the developer that their patch was applied to the tree and where they can find it. This I do with a script called do.sh that has grown over the years; it was originally based on a script that Andrew Morton uses to notify developers when he applies their patches. You can find a copy of it and the rest of the helper scripts I use for kernel development in my gregkh-linux GitHub tree.
The script does the following:
- generates a patch for every changeset in the local branch that is not in the usb-next branch
- emails the developer that this patch has now been applied and where it can be found
- merges the branch to the local usb-next branch
- pushes the branch to the public git.kernel.org repository
- pushes the branch to a local backup server that is on write-only media
- switches back to the work-next branch
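In heavily simplified and hypothetical form (the real do.sh handles many more corner cases, and the mail step is stubbed out with an echo here), those steps might look like:

```shell
# Sketch of a do.sh-style publish step; the branch names follow the text,
# but everything else is an illustrative reconstruction.
do_sh_sketch () {
    work_branch=$1      # e.g. work-next
    public_branch=$2    # e.g. usb-next

    # One patch file per changeset not yet in the public branch.
    git checkout "$public_branch" &&
    git format-patch -k -M -N "$public_branch..$work_branch" &&

    # Tell each author that the patch has been applied (stubbed out).
    for p in *.patch; do
        echo "would mail the author of $p"
    done &&

    # Merge the work branch, publish it, and keep a backup copy.
    git merge "$work_branch" &&
    git push origin "$public_branch" &&
    git push backup "$public_branch" &&

    # Go back to the working branch.
    git checkout "$work_branch"
}
```

Called as do_sh_sketch work-next usb-next, it fast-forwards usb-next to match the work branch, pushes it to both remotes, and returns to work-next.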
After this, people do sometimes find problems with patches that need to be fixed up. But, since my trees are public, I can't rebase them—otherwise any developer who had previously pulled my branches would get messed up. Instead, I sometimes revert patches, or apply fix-up patches on top of the current tree to resolve issues. It isn't the cleanest solution at times, but it is better to do this than rebase a public tree, which is something that no one should ever do.
Hopefully, this description gives you an idea how you can manage your trees and the patches sent to you to make things easier for yourself, the linux-next maintainer, and any developer who relies on your tree.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Jonathan Corbet