Kernel development
Brief items
Kernel release status
The current development kernel is 3.6-rc3, released on August 22. Linus says: "Shortlog appended, there's nothing here that makes me go 'OMG! Scary!' or makes me want to particularly mention it separately. All just random updates and fixes."
Previously, 3.6-rc2 was released on August 16. "Anyway, with all that said, things don't seem too bad. Yes, I ignored a few pull requests, but I have to say that there weren't all that many of those, and the rest looked pretty calm. Sure, there's 330+ commits in there, but considering that it's been two weeks, that's about expected (or even a bit low) for early -rc's. Yes, 3.5 may have been much less for -rc2, but that was unusual."
Stable updates: 2.6.34.13 and 3.2.28 were both released on August 20.
Quotes of the week
If we have carefully made a decision to inline a function, we should (now) use __always_inline. If we have carefully made a decision to not inline a function, we should use noinline. If we don't care, we should omit all such markings.
This leaves no place for "inline"?
Long-term support for the 3.4 kernel
Greg Kroah-Hartman has announced that the 3.4 kernel will receive stable updates for a period of at least two years. It joins 3.0 (which has at least one more year of support) on the long-term support list.
Kernel development news
The return of power-aware scheduling
Years of work to improve power utilization in Linux have made one thing clear: efficient power behavior must be implemented throughout the system. That certainly includes the CPU scheduler, but the kernel's scheduler currently has little in the way of logic aimed at minimizing power use. A recent proposal has started a discussion on how the scheduler might be made to be more power-aware. But, as this discussion shows, there is no single, straightforward answer to the question of how power-aware scheduling should be done.
Interestingly, the scheduler did have power-aware logic from 2.6.18 through 3.4. There was a sysctl knob (sched_mc_power_savings) that would cause the scheduler to try to group runnable processes onto the smallest possible number of cores, allowing others to go idle. That code was removed in 3.5 because it never worked very well and nobody was putting any effort into improving it. The result was the removal of some rather unloved code, but it also left the scheduler with no power awareness at all. Given the level of interest in power savings in almost every environment, having a power-unaware scheduler seems less than optimal; it was only a matter of time until somebody tried to put together a better solution.
Alex Shi started off the conversation with a rough proposal on how power awareness might be added back to the scheduler. This proposal envisions two modes, called "power" and "performance," that would be used by the scheduler to guide its decisions. Some of the first debate centered around how that policy would be chosen, with some developers suggesting that "performance" could be used while on AC power and "power" when on battery power. But that policy entirely ignores an important constituency: data centers. Operators of data centers are becoming increasingly concerned about power usage and its associated costs; many of them are likely to want to run in a lower-power mode regardless of where the power is coming from. The obvious conclusion is that the kernel needs to provide a mechanism by which the mode can be chosen; the policy can then be decided by the system administrator.
The harder question is: what would that policy decision actually do? The old power code tried to cause some cores, at least, to go completely idle so that they could go into a sleep state. The proposal from Alex takes a different approach. Alex claims that trying to idle a subset of the CPUs in the system is not going to save much power; instead, it is best to spread the runnable processes across the system as widely as possible and try to get to a point where all CPUs can go idle. That seems to be the best approach, on x86-class processors, anyway. On that architecture, no processor can go into a deep sleep state unless they all go into that state; having even a single processor running will keep the others in a less efficient sleep state. A single processor also keeps associated hardware — the memory controller, for example — in a powered-up state. The first CPU is by far the most expensive one; bringing in additional CPUs has a much lower incremental cost.
So the general rule seems to be: keep all of the processors busy as long as there is work to be done. This approach should lead to the quickest processing and the best cache utilization; it also gives the best power utilization. In other words, the best policy for power savings looks a lot like the best policy for performance. That conclusion came as a surprise to some but, as Arjan van de Ven pointed out, it makes some sense.
So why bother with multiple scheduling modes in the first place? Naturally enough, there are some complications that enter this picture and make it a little bit less neat. The first of these is that spreading load across processors only helps if the new processors are actually put to work for a substantial period of time, for values of "substantial" around 100μs. For any shorter period, the cost of bringing the CPU out of even a shallow sleep exceeds the savings gained from running a process there. So extra CPUs should not be brought into play for short-lived tasks. Properly implementing that policy is likely to require that the kernel gain a better understanding of the behavior of the processes running in any given workload.
There is also still scope for some differences of behavior between the two modes. In a performance-oriented mode, the scheduler might balance tasks more aggressively, trying to keep the load the same on all processors. In a power-savings mode, processes might stay a bit more tightly packed onto a smaller number of CPUs, especially processes that have an observed history of running for very short periods of time.
But the conversation has, arguably, only barely touched on the biggest complication of all. There was a lot of talk of what the optimal behavior is for current-generation x86 processors, but that is far from the only environment in which Linux runs. ARM processors have a complex set of facilities for power management, allowing much finer control over which parts of the system have power and clocks at any given time. The ARM world is also pushing the boundaries with asymmetric architectures like big.LITTLE; figuring out the optimal task placement for systems with more than one type of CPU is not going to be an easy task.
The problem is thus architecture-specific; optimal behavior on one architecture may yield poor results on another. But the eventual solution needs to work on all of the important architectures supported by Linux. And, preferably, it should be easily modifiable to work on future versions of those architectures, since the way to get the best power utilization is likely to change over time. That suggests that the mechanism currently used to describe architecture-specific details to the scheduler (scheduling domains) needs to grow the ability to describe parameters relevant to power management as well. An architecture-independent scheduler could then use those parameters to guide its behavior. That scheduler will also need a better understanding of process behavior; the almost-ready per-entity load tracking patch set may help in this regard.
Designing and implementing these changes is clearly not going to be a short-term job. It will require a fair amount of cooperation between the core scheduler developers and those working on specific architectures. But, given how long we have been without power management support in the scheduler, and given that the bulk of the real power savings are to be had elsewhere (in drivers and in user space, for example), we can wait a little longer while a proper scheduler solution is worked out.
Link-time optimization for the kernel
The kernel tends to place an upper limit on how quickly any given workload can run, so it is unsurprising that kernel developers are always on the lookout for ways to make the system go faster. Significant amounts of work can be put into optimizations that, on the surface, seem small. So when the opportunity comes to make the kernel go faster without the need to rewrite any performance-critical code paths, there will naturally be a fair amount of interest. Whether the "link-time optimization" (LTO) feature supported by recent versions of GCC is such an opportunity or not is yet to be proved, but Andi Kleen is determined to find out.
The idea behind LTO is to examine the entire program after the individual files have been compiled and exploit any additional optimization opportunities that appear. The most significant of those opportunities appears to be the inlining of small functions across object files. The compiler can also be more aggressive about detecting and eliminating unused code and data. Under the hood, LTO works by dumping the compiler's intermediate representation (the "GIMPLE" code) into the resulting object file whenever a source file is compiled. The actual LTO stage is then carried out by loading all of the GIMPLE code into a single in-core image and rewriting the (presumably) further-optimized object code.
The LTO feature first appeared in GCC 4.5, but it has only really started to become useful in the 4.7 release. It still has a number of limitations; one of those is that all of the object files involved must be compiled with the same set of command-line options. That limitation turns out to be a problem with the kernel, as will be seen below.
Andi's LTO patch set weighs in at 74 changesets — not a small or unintrusive change. But it turns out that most of the changes have the same basic scope: ensuring that the compiler knows that specific symbols are needed even if they appear to be unused; that prevents the LTO stage from optimizing them away. For example, symbols exported to modules may not have any callers in the core kernel itself, but they need to be preserved for modules that may be loaded later. To that end, Andi's first patch defines a new attribute (__visible) used to mark such symbols; most of the remaining patches are dedicated to the addition of __visible attributes where they are needed.
Beyond that, there is a small set of fixes for specific problems encountered when building kernels with LTO. It seems that functions with long argument lists can get their arguments corrupted if the functions are inlined during the LTO stage; avoiding that requires marking the functions noinline. Andi complains: "I wish there was a generic way to handle this. Seems like a ticking time bomb problem." In general, he acknowledges the possibility that LTO may introduce new, optimization-related bugs into the kernel; finding all of those could be a challenge.
Then there is the requirement that all files be built with the same set of options. Current kernels are not built that way; different options are used in different parts of the tree. In some places, this problem can be worked around by disabling specific optimizations that depend on different compiler flags than are used in the rest of the kernel. In others, though, features must simply be disabled to use LTO. These include the "modversions" feature (allowing kernel modules to be used with more than one kernel version) and the function tracer. Modversions seems to be fixable; getting ftrace to work may require changes to GCC, though.
It is also necessary, of course, to change the build system to use the GCC LTO feature. As of this writing, one must have a current GCC release; it is also necessary to install a development version of the binutils package for LTO to work. Even a minimal kernel requires about 4GB of memory for the LTO pass; an "allyesconfig" build could require as much as 9GB. Given that, the use of 32-bit systems for LTO kernel builds is out of the question; it is still possible, of course, to build a 32-bit kernel on a 64-bit system. The build will also take between two and four times as long as it does without LTO. So developers are unlikely to make much use of LTO for their own work, but it might be of interest to distributors and others who are building production kernels.
The fact that most people will not want to do LTO builds actually poses a bit of a problem. Given the potential for LTO to introduce subtle bugs, due either to optimization-related misunderstandings or simple bugs in the new LTO feature itself, widespread testing is clearly called for before LTO is used for production kernels. But if developers and testers are unwilling to do such heavyweight builds, that testing may be hard to come by. That will make it harder to achieve the level of confidence that will be needed before LTO-built kernels can be used in real-world settings.
Given the above challenges, the size of the patch set, and the ongoing maintenance burden of keeping LTO working, one might well wonder if it is all worth it. And that comes down entirely to the numbers: how much faster does the kernel get when LTO is used? Hard numbers are not readily available at this time; the LTO patch set is new and there are still a lot of things to be fixed. Andi reports that runs of the "hackbench" benchmark gain about 5%, while kernel builds don't change much at all. Some networking benchmarks improve as much as 18%. There are also some unspecified "minor regressions." The numbers are rough, but Andi believes they are encouraging enough to justify further work; he also expects the LTO implementation in GCC to improve over time.
Andi also suggests that, in the long term, LTO could help to improve the quality of the kernel code base by eliminating the need to put inline functions into include files.
All told, this is a patch set in a very early stage of development; it seems unlikely to be proposed for merging into a near-term kernel, even as an experimental feature. In the longer term, though, it could lead to faster kernels; use of LTO in the kernel could also help to drive improvements in the GCC implementation that would benefit all projects. So it is an effort that is worth keeping an eye on.
Ask a kernel developer: maintainer workflow
In this edition of "ask a kernel developer", I answer a multi-part question about kernel subsystem maintenance from a new maintainer. The workflow that I use to handle patches in the USB subsystem is used as an example to hopefully provide a guide for those who are new to the maintainer role.
As always, if you have unanswered questions relating to technical or procedural issues in Linux kernel development, ask them in the comment section, or email them directly to me. I will try to get to them in another installment down the road.
I have some questions about what I am supposed to be doing at different points of the release cycle. -rc1 and -rc2 are spelled out in Documentation/HOWTO, and I have a decent idea that patches I accept should be smaller and fix more critical bugs as the -rcX's roll out. The big question is what do I do with all of the other patches that come at random times?
First off, thanks so much for agreeing to maintain a kernel subsystem. Without maintainers like you, the Linux kernel development process would be much more chaotic and hard to navigate. I will try to explain how I have set up my development workflow and how I maintain the different subsystems I am in charge of. That example can help you determine how you wish to manage your own development trees, and how to handle incoming patches from developers.
To answer the question: yes, you will receive patches at any point in the release cycle, but not all of them can be sent on to Linus right away; it depends on where we are in that cycle. I'll go into more detail below, but for now, realize that, in my opinion, you should not require other developers to wait for particular points in the release cycle; instead, you should hold onto patches and send them upstream when they are appropriate. I think it is the maintainer's job to do the buffering.
How best do I organize my pull-request branches so that developers know which they can pull as dependencies, and which are for-next? I don't want to over-organize it, but I do want to make it easy for board submitters to test from my trees. Should my pull-request branches be long-lived, or should I kill them and create new ones after each cycle?
It's best to stick with a simple scheme for branches, work with that for a while, and then if you find that is too limiting, feel free to grow from there. I only have two branches in my git trees, one to feed to Linus for the current release cycle, and one that is for the next release cycle. This can be seen in the USB git tree on kernel.org, which shows three branches:
- master, which tracks Linus's tree
- usb-linus, which contains patches to go to Linus for this release cycle
- usb-next, which contains the patches to go to Linus for the next release cycle.
I receive patches from lots of different developers all the time. All patches, after they pass an initial "is this sane" glance, get copied to a mailbox that I call TODO. Every few days, depending on my workload, I go through the mailbox and pick out all of the patches that are to be applied to various trees I am responsible for. For this example, I'll search on anything that touches the USB tree and copy those messages to a temporary local mailbox on the filesystem called s (I name my local mailboxes for their ease of typing, not for any other good reason.)
After digging all of the USB patches out (which is really a simple filter for all threads that have the "drivers/usb" string in them), I take a closer look at the patches in the s mailbox.
First I look to find anything that would be applicable to Linus's current tree. This is usually a bug fix for something that was introduced during this merge window, or a regression for systems that were previously working just fine. I pick those out and save them to another temporary mailbox called s1.
Now it's time to start testing to see if the patches actually apply to the tree. I go into a directory that contains my usb tree and check to see what branch I am on:
$ cd linux/work/usb
$ git b
master 6dab7ed Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
* usb-linus 8f057d7 gpu/mfd/usb: Fix USB randconfig problems
usb-next 26f944b usb: hcd: use *resource_size_t* for specifying resource data
work-linus 8f057d7 gpu/mfd/usb: Fix USB randconfig problems
work-next 26f944b usb: hcd: use *resource_size_t* for specifying resource data
Note, I have the following aliases in my ~/.gitconfig file:
[alias]
dc = describe --contains
fp = format-patch -k -M -N
b = branch -v
These aliases let me use git b to see the current branch more easily, git fp to format patches in the style I need them in, and git dc to determine exactly which release contains a specific git commit.
As you can see by the list of branches, I have a local branch that mirrors the public versions of the usb-linus and usb-next branches called work-linus and work-next. I do the testing and development work in these local branches, and only when I feel they are "good enough" do I push them to the public facing branches and then out to kernel.org.
So, back to work. As I am working on patches [1, 2] that are to be sent to Linus first, let's change to the local working version of that branch:
$ git checkout work-linus
Switched to branch 'work-linus'
Then a quick sanity check to verify that the patches in s1 really will apply to this tree (sadly, they often do not):
$ p1 < ../s1
patching file drivers/usb/core/endpoint.c
patching file drivers/usb/core/quirks.c
patching file drivers/usb/core/sysfs.c
Hunk #2 FAILED at 210.
1 out of 2 hunks FAILED -- saving rejects to file drivers/usb/core/sysfs.c.rej
patching file drivers/usb/storage/transport.c
patching file include/linux/usb/quirks.h
(Note: the p1 command is really patch -p1 -g1 --dry-run, which I set up in my .alias file years ago after I quickly got tired of typing the full thing out.)
Here is an example of patches that will not apply to the work-linus branch, but it turns out that this was my fault. They were generated against the linux-next branch, and really should be queued up for the next merge window, not for this release.
So, let's switch back to the work-next branch, as that is where the patches really belong:
$ git checkout work-next
Switched to branch 'work-next'
And see if they apply there properly:
$ p1 < ../s1
patching file drivers/usb/core/endpoint.c
patching file drivers/usb/core/quirks.c
patching file drivers/usb/core/sysfs.c
patching file drivers/usb/storage/transport.c
patching file include/linux/usb/quirks.h
Much better.
Then I look at the patches themselves again in my email client, and edit anything that needs to be cleaned up. The changes could be in the Subject, the body of the patch, or any other things that need to be touched up. With developers who send patches all the time, no changes generally need to be done in this area, but, unfortunately, I end up editing this type of "metadata" all the time.
After the patches look clean, and I've done a review of them again to verify that I don't notice anything strange or suspicious, I do one last sanity check by running the checkpatch.pl tool:
$ ./scripts/checkpatch.pl ../s1
total: 0 errors, 0 warnings, 73 lines checked
../s1 has no obvious style problems and is ready for submission.
All looks good, so let's apply the patches to the branch and see if the build works properly:
$ git am -s ../s1
Applying: usb/endpoint: Set release callback in the struct device_type instead of in the device itself directly
Applying: usb: convert USB_QUIRK_RESET_MORPHS to USB_QUIRK_RESET
$ make -j8
If everything built, then it's time to test the patches. This can range from installing the changed kernel and ensuring that everything still works properly and the new modifications work as they say they should, to doing nothing more than verifying that the build didn't break if I do not have the hardware that the changed driver controls.
After this, and everything looks sane, it's time to push the patches to the public kernel.org repository, as well as notifying the developer that their patch was applied to the tree and where they can find it. This I do with a script called do.sh that has grown over the years; it was originally based on a script that Andrew Morton uses to notify developers when he applies their patches. You can find a copy of it and the rest of the helper scripts I use for kernel development in my gregkh-linux GitHub tree.
The script does the following:
- generates a patch for every changeset in the local branch that is not in the usb-next branch
- emails the developer that this patch has now been applied and where it can be found
- merges the branch to the local usb-next branch
- pushes the branch to the public git.kernel.org repository
- pushes the branch to a local backup server that is on write-only media
- switches back to the work-next branch
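In heavily simplified and hypothetical form (the real do.sh handles many more corner cases, and the mail step is stubbed out with an echo here), those steps might look like:

```shell
# Sketch of a do.sh-style publish step; the branch names follow the text,
# but everything else is an illustrative reconstruction.
do_sh_sketch () {
    work_branch=$1      # e.g. work-next
    public_branch=$2    # e.g. usb-next

    # One patch file per changeset not yet in the public branch.
    git checkout "$public_branch" &&
    git format-patch -k -M -N "$public_branch..$work_branch" &&

    # Tell each author that the patch has been applied (stubbed out).
    for p in *.patch; do
        echo "would mail the author of $p"
    done &&

    # Merge the work branch, publish it, and keep a backup copy.
    git merge "$work_branch" &&
    git push origin "$public_branch" &&
    git push backup "$public_branch" &&

    # Go back to the working branch.
    git checkout "$work_branch"
}
```

Called as do_sh_sketch work-next usb-next, it fast-forwards usb-next to match the work branch, pushes it to both remotes, and returns to work-next.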
After this, people do sometimes find problems with patches that need to be fixed up. But, since my trees are public, I can't rebase them—otherwise any developer who had previously pulled my branches would get messed up. Instead, I sometimes revert patches, or apply fix-up patches on top of the current tree to resolve issues. It isn't the cleanest solution at times, but it is better to do this than rebase a public tree, which is something that no one should ever do.
Hopefully, this description gives you an idea how you can manage your trees and the patches sent to you to make things easier for yourself, the linux-next maintainer, and any developer who relies on your tree.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Jonathan Corbet