Kernel development
Brief items
Kernel release status
The current development kernel is 3.12-rc6, released on October 19. Linus said: "Nothing major happened last week, and the upcoming week is likely to be quiet too, since a lot of core maintainers will be in Edinburgh for the KS."
Stable updates:
3.11.6 and 3.10.17 were released on October 18,
followed by
3.4.67 and 3.0.101 on October 22.
This is the end of the line for 3.0: "I will NOT be doing any more
3.0.x kernel releases. If you rely on the 3.0.x kernel series, you should
now move to the 3.10.x kernel series, or, at the worse case, 3.4.x."
Nftables pulled for 3.13
For those who have been following nftables, the replacement firewall subsystem for the kernel: this code has just been pulled into the net-next tree. That means that, barring some sort of trouble, it will be merged in the 3.13 development cycle. This code is not ready to replace iptables yet, but the pace of the work should increase once this subsystem is in the mainline. Anybody wanting to try out nftables can see the quick howto page for instructions.
Kernel development news
The power-aware scheduling mini-summit
The first day of the 2013 Kernel Summit was set aside for mini-summits on various topics. One of those topics was the controversial area of power-aware scheduling. Numerous patches attempting to improve the scheduler have been posted, but none have come near the mainline. This gathering brought together developers from the embedded world and the core scheduler developers to try to make some forward progress in this area; whether they succeeded remains to be seen, but some tentative forward steps were identified.
Morten Rasmussen, the author of the big.LITTLE MP patch set, was
one of the organizers of this session. He started with a set of agenda items
and supporting slides; the topics to be discussed were:
- Unification of power-management policies, which are currently split
among multiple, uncoordinated subsystems.
- Task packing. Various patch sets have been posted, but none have
really gotten anywhere.
- Power drivers and, in particular, the creation of proper abstractions for hardware-level power management.
He attempted to get started with a discussion of the cpufreq and cpuidle subsystems, but didn't get very far before the conversation shifted.
The need for metrics
Ingo Molnar came in with a
complaint: none of the power-management work starts with measurements of
the system's power behavior. Without a coherent approach to measuring the
effects of a patch, there is no real way to judge these patches to decide
which ones should go in. We cannot, he said, merge scheduler patches on
faith, hoping that they somehow make things better.
What followed was a long and wide-ranging discussion of how such metrics might be made. What the scheduler developers would like is reasonably clear: how long did it take to execute a defined amount of work, and how much energy was consumed in the process? There is also some interest in what worst-case latencies were experienced by the workload while it was running. With reproducible numbers for these quantities, it should be possible to determine which patches help with which workloads and make some progress in this area.
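The three quantities named above (time to complete the work, energy consumed, and worst-case latency) could be derived from a trace of timestamped power readings and observed latencies. What follows is only a sketch under stated assumptions: the function name `summarize_run`, the sample format, and all of the numbers are invented for illustration, and no existing benchmark is being described.

```python
from typing import List, Tuple


def summarize_run(power_samples: List[Tuple[float, float]],
                  latencies_ms: List[float]) -> dict:
    """Summarize one benchmark run (hypothetical data format).

    power_samples: (time in seconds, power in watts) pairs, sorted by
                   time and covering the run from start to finish.
    latencies_ms:  wakeup latencies seen by the workload, in milliseconds.
    """
    if len(power_samples) < 2:
        raise ValueError("need at least two power samples")

    # Energy is the integral of power over time; approximate it with
    # the trapezoidal rule between consecutive samples.
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(power_samples, power_samples[1:]):
        energy_j += (p0 + p1) / 2.0 * (t1 - t0)

    elapsed_s = power_samples[-1][0] - power_samples[0][0]
    return {
        "elapsed_s": elapsed_s,
        "energy_j": energy_j,
        "worst_latency_ms": max(latencies_ms) if latencies_ms else 0.0,
    }


# Made-up trace: 2 W for one second, then a ramp to 4 W over the next.
example = summarize_run([(0.0, 2.0), (1.0, 2.0), (2.0, 4.0)],
                        [0.3, 1.2, 0.7])
```

Two patches could then be compared by running the same fixed workload under each and comparing the resulting summaries.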
Agreement with this approach was not universal, though. Mark Gross of Intel made the claim that this kind of performance metric was "the road to hell." Instead, he said, power benchmarks should focus on low-level behavior like processor sleep ("C") states. For any given sleep state, the processor must remain in that state for a given period of time before going into that state actually saves power. So the cpuidle subsystem must make an estimate of how long the processor will be idle before selecting an appropriate sleep state. The actual idle period is something that can then be measured; over time, one can come up with a picture of how well the kernel's estimates of idle time match up with reality. That, Mark said, is the kind of benchmark the kernel developers should be using.
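The kind of measurement Mark described might look something like the following. This is a simplified model, not the kernel's cpuidle code: the state table, the residency values, and the function names are all assumptions made for illustration, and a real governor also weighs factors such as exit latency.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SleepState:
    name: str
    target_residency_us: int  # minimum idle time for the state to save power


# Hypothetical table, ordered from the shallowest to the deepest state.
STATES = [
    SleepState("C1", 2),
    SleepState("C3", 100),
    SleepState("C6", 800),
]


def pick_state(predicted_idle_us: int,
               states: List[SleepState] = STATES) -> SleepState:
    """Return the deepest state whose break-even time fits the prediction.

    The shallowest state is the fallback when nothing else fits.
    """
    chosen = states[0]
    for state in states:
        if state.target_residency_us <= predicted_idle_us:
            chosen = state
    return chosen


def misprediction_rate(samples: List[Tuple[int, int]]) -> float:
    """Fraction of idle periods where the predicted and the measured idle
    times would have led to different sleep states.

    samples: (predicted idle in us, measured idle in us) pairs.
    """
    if not samples:
        return 0.0
    wrong = sum(1 for predicted, actual in samples
                if pick_state(predicted) != pick_state(actual))
    return wrong / len(samples)
```

With the invented table above, a prediction of 900us followed by an actual idle period of 120us counts as a miss: the model would have chosen C6 when C3 was the better fit.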
Others argued that idle-time estimation accuracy is a low-level measurement that may not map well onto what the users actually want: their work done quickly, without unacceptable latencies, and with a minimal use of power. But actual power-usage measurements have been hard to come by; the processor vendors appear to be reluctant to expose that information to the kernel. So the group never did come up with a good set of metrics to use for the evaluation of scheduler patches. In the end, Ingo said, the first patch that adds a reasonable power-oriented benchmark to the tools directory will go in; it can be refined from there.
What to do
From there, the discussion shifted toward how the scheduler's power behavior might be improved. It was agreed that there is a need for a better mechanism by which an application can indicate its latency requirements to the kernel. These latency requirements then need to be handled carefully: it will not do to keep all CPUs in the system awake because an application running on one of them needs bounded latencies.
There was some talk of adding some sort of energy use model to a simulator — either linsched or the perf sched utility — to allow for standardized testing of patches with specific workloads. Linsched was under development by an intern at Google, but the work was not completed, so it's still not ready for upstreaming. Ingo noted that the resources to do this work are available; there are, after all, developers interested in getting power-aware code into the scheduler.
Rafael Wysocki asked the scheduler developers: what information do you need
from the hardware to make better scheduling decisions? Paul Turner
responded: the cost of running the system in a given configuration; Peter
Zijlstra added that he would like to know the cost of starting up an
additional CPU. The response from Mark was that it all depends on which
generation of hardware is in use. In general, Intel seems to be reluctant
to make that information available, an approach which caused some visible
frustration among the scheduler developers.
Over time the discussion did somewhat converge on what the scheduler community would like to have. Some sort of cost metric should be attached to the scheduling domains infrastructure; it would tell the scheduler what the cost of firing up any given processor would be. That information would have to appear at multiple levels; bringing up the first processor in a different physical package will clearly cost more than adding a processor in an already-active package, for example.
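A multi-level cost metric of this sort can be sketched in a few lines. Everything here is hypothetical: the four-CPU topology, the cost values (in arbitrary units), and the function names are invented, and the real scheduling-domain structures look nothing like this.

```python
from typing import Dict, Set

# Hypothetical topology: CPU number -> physical package number.
CPU_PACKAGE: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1}

# Made-up costs in arbitrary units, one per level of the hierarchy.
PACKAGE_WAKE_COST = 10  # powering up an otherwise idle package
CPU_WAKE_COST = 2       # starting one more CPU


def wake_cost(cpu: int, active_cpus: Set[int]) -> int:
    """Cost of bringing `cpu` into service given the CPUs already running."""
    if cpu in active_cpus:
        return 0
    active_packages = {CPU_PACKAGE[c] for c in active_cpus}
    cost = CPU_WAKE_COST
    if CPU_PACKAGE[cpu] not in active_packages:
        # The first CPU in a sleeping package pays for the package too.
        cost += PACKAGE_WAKE_COST
    return cost


def cheapest_cpu_to_wake(active_cpus: Set[int]) -> int:
    """Pick the idle CPU that is cheapest to start; ties go to the
    lowest-numbered CPU."""
    idle = [c for c in sorted(CPU_PACKAGE) if c not in active_cpus]
    if not idle:
        raise ValueError("no idle CPU available")
    return min(idle, key=lambda c: wake_cost(c, active_cpus))
```

With only CPU 0 running, this model prefers CPU 1 (same package) over CPUs 2 and 3, which would require powering up the second package.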
Tied into this is the concept of performance ("P") states, generally thought of as the "CPU frequency." The frequency concept is somewhat outdated, but it persists in the kernel's cpufreq subsystem which, it was mostly agreed, should go away. The scheduler does need to understand the cost (and change in performance) associated with a P-state change, though; that would allow it to choose between increasing a CPU's P state or moving a task to a new CPU. How this information would be exposed by the CPU remains to be seen, but, if it were available, it would be possible to start making smarter scheduling decisions with it.
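If such cost information were available, the choice between raising a P-state and moving a task might reduce to a comparison like the one below. This is a toy model only; the `Option` type, the capacity and power figures, and the selection rule are assumptions, and nothing of the sort was proposed in this form at the session.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Option:
    description: str
    added_capacity: float  # extra work per second made available
    added_power: float     # extra watts drawn while in use


def choose(options: List[Option], needed_capacity: float) -> Option:
    """Among the options that provide enough capacity, take the one
    that adds the least power."""
    viable = [o for o in options if o.added_capacity >= needed_capacity]
    if not viable:
        raise ValueError("no option provides enough capacity")
    return min(viable, key=lambda o: o.added_power)


# Invented numbers for the two alternatives named in the text.
raise_p_state = Option("raise the P-state of the busy CPU",
                       added_capacity=0.3, added_power=1.5)
move_task = Option("move the task to a newly started CPU",
                   added_capacity=1.0, added_power=2.0)
```

Here a small shortfall in capacity is handled by the P-state change, while a larger one justifies the cost of bringing up another CPU.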
How those decisions would be made remains vague. There was talk of putting together some kind of set of standard workloads, but that seems premature. Paul suggested starting with a set of "stories" describing specific workloads in a human-comprehensible form. Once a collection of stories has been put together, developers can start trying to find the common themes that can be used to come up with algorithms for better, more power-aware scheduling.
There was some brief discussion of Morten's recent scheduler patches. It was agreed that they provide a reasonable start for the movement of CPU frequency and idle awareness into the scheduler itself. A focus on moving cpuidle into the scheduler first was suggested; most developers would rather see cpufreq just go away at this point.
And that was where the group was reminded that lunch had started nearly half an hour before. The guidance produced for the power-aware scheduling developers remains vague at best, but there are still some worthwhile conclusions, with the need for a set of plausible metrics being at the top of the list. That should be enough to enable this work to take a few baby steps forward.
(For more details, see the extensive notes from the session taken by Paul McKenney and Paul Walmsley).
[Your editor would like to thank the Linux Foundation for supporting his travel to Edinburgh.]
A look at the 3.12 development cycle
As of this writing, the 3.12-rc6 prepatch has been released, Linus seems happy with the state of the kernel, and, in general, there are few reports of problems on the mailing lists. If things continue to stabilize, the 3.12 cycle might be a short one, even by recent standards. So, clearly, it's time to get the traditional development statistics article out there.

There have been 10,480 non-merge changesets pulled into the mainline repository during this cycle (so far). That means that 3.12 may be the slowest cycle since 3.6 which, almost exactly one year ago, came out with 10,247 changesets. An unscientific look at recent release history suggests that kernels released in the (northern hemisphere) fall tend to include fewer changesets than those released at other times of the year, possibly reflecting lower productivity while developers take time off over the summer.
That said, 10,480 changesets is still quite a bit of work. Those changesets were contributed by 1,259 developers (a typical number for recent kernels), added 601,000 lines of code, and removed 279,000 for a net gain of 322,000 lines. The most active developers in this cycle were:
Most active 3.12 developers
By changesets
Sachin Kamat           261   2.5%
Jingoo Han             241   2.3%
Mark Brown             209   2.0%
Greg Kroah-Hartman     197   1.9%
H Hartley Sweeten      160   1.5%
Alex Deucher           151   1.4%
Laurent Pinchart       140   1.3%
Daniel Vetter          138   1.3%
Fabio Estevam          114   1.1%
Chris Metcalf          103   1.0%
Dan Carpenter           96   0.9%
Dave Chinner            90   0.9%
Peter Hurley            83   0.8%
Joe Perches             80   0.8%
Ben Hutchings           77   0.7%
Magnus Damm             76   0.7%
Rafael J. Wysocki       73   0.7%
Lars-Peter Clausen      73   0.7%
Trond Myklebust         67   0.6%
Axel Lin                65   0.6%

By changed lines
Larry Finger         92908  12.6%
Jesse Brandeburg     30520   4.2%
Greg Kroah-Hartman   29740   4.0%
H Hartley Sweeten    25932   3.5%
Alex Deucher         18026   2.5%
Ben Hutchings        17660   2.4%
Rob Clark            15703   2.1%
Bradley Grove        15687   2.1%
Dave Chinner         15099   2.1%
Scott Kilau          14712   2.0%
Lidza Louina         13474   1.8%
Laurent Pinchart     11676   1.6%
Rajendra Nayak       10866   1.5%
Chris Metcalf         8924   1.2%
Tomi Valkeinen        8881   1.2%
Feng-Hsin Chiang      6813   0.9%
Yuan-Hsin Chen        6813   0.9%
Ambresh K             5528   0.8%
Hans Verkuil          5385   0.7%
Atul Deshmukh         4849   0.7%
Sachin Kamat and Jingoo Han both contributed a wide range of cleanup patches all over the driver subsystem. Mark Brown continues to do substantial work in the sound, SPI, and regulator driver subsystems, among others. Greg Kroah-Hartman integrated a number of low-level device model changes, along with Lustre filesystem fixups and more. H. Hartley Sweeten's crusade to clean up the Comedi drivers continues; that work resulted in the removal of over 21,000 lines of code this time around.
On the "lines changed" side, Larry Finger added the Realtek RTL8188EU wireless network driver to the staging tree. Jesse Brandeburg added the Intel i40e network driver. Greg's and Hartley's work was, in both cases, dominated by the removal of large amounts of unneeded driver code, while Alex Deucher continues to add functionality to the Radeon driver.
A total of 212 employers (that we know of) supported work on the 3.12 release. The most active of those were:
Most active 3.12 employers
By changesets
Intel                          1028   9.8%
(None)                          964   9.2%
Linaro                          732   7.0%
Red Hat                         707   6.7%
(Unknown)                       609   5.8%
Samsung                         492   4.7%
IBM                             390   3.7%
Freescale                       256   2.4%
Renesas Electronics             249   2.4%
Texas Instruments               245   2.3%
Linux Foundation                225   2.1%
SUSE                            206   2.0%
Oracle                          183   1.7%
Free Electrons                  183   1.7%
(Consultant)                    182   1.7%
AMD                             178   1.7%
Vision Engraving                175   1.7%
[name missing]                  161   1.5%
Huawei Technologies             124   1.2%
Broadcom                        120   1.1%

By lines changed
(None)                       134134  18.3%
Intel                         61227   8.3%
Red Hat                       49820   6.8%
Linux Foundation              32955   4.5%
Vision Engraving              26848   3.7%
Linaro                        26081   3.5%
Texas Instruments             24518   3.3%
AMD                           24389   3.3%
(Unknown)                     22927   3.1%
Renesas Electronics           20656   2.8%
Outreach Program for Women    19649   2.7%
Solarflare Comm.              18303   2.5%
ATTO Technology               15688   2.1%
Digi International            14720   2.0%
Faraday Technology            13626   1.9%
IBM                           13554   1.8%
Samsung                       13083   1.8%
Tilera                        12256   1.7%
Freescale                     12104   1.6%
SUSE                          11265   1.5%
This is not the first time that Red Hat has been upstaged as the top corporate contributor, but it has never been as low as #4. Linaro, instead, continues to increase its contributions to the kernel, as do a number of mobile and embedded companies. The number of changes from volunteers is down slightly, in keeping with the steady trend over the last few years. Developers brought in through the Outreach Program for Women continued to contribute significantly during this cycle.
Your editor is often asked to summarize the origin of kernel patches geographically — how many are coming from $COUNTRY? That question can be hard to answer. But there is another question that is a bit easier: every commit in the repository has a time stamp, and that time stamp includes a time zone. It is a relatively easy matter to pass over a range of commits and summarize which time zones appear most often.
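A pass of that sort is easy to approximate. The sketch below is not the script used for this article; it assumes author dates in the form printed by git's `%ai` format (for example "2013-10-19 14:02:11 -0700"), and the function names, the repository path, and the revision range in the usage note are illustrative only.

```python
import subprocess
from collections import Counter
from typing import Iterable, List, Tuple


def tally_zones(dates: Iterable[str]) -> Counter:
    """Count UTC offsets in date strings like '2013-10-19 14:02:11 -0700'.

    Lines that do not have exactly three fields are skipped.
    """
    zones: Counter = Counter()
    for line in dates:
        fields = line.split()
        if len(fields) == 3:
            zones[fields[2]] += 1
    return zones


def zone_percentages(zones: Counter) -> List[Tuple[str, float]]:
    """Return (offset, percent of commits) pairs, most common first."""
    total = sum(zones.values())
    return [(zone, 100.0 * count / total)
            for zone, count in zones.most_common()]


def author_dates(repo: str, revision_range: str) -> List[str]:
    """Author dates of the non-merge commits in a range, one per commit."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--no-merges", "--format=%ai",
         revision_range],
        check=True, capture_output=True, text=True)
    return result.stdout.splitlines()
```

Given a kernel tree checked out in a directory called `linux` (a hypothetical path), something like `zone_percentages(tally_zones(author_dates("linux", "v3.11..v3.12-rc6")))` would produce a table of offsets. Using the author date rather than the committer date is a choice made here; the two can differ when a maintainer applies a patch in another time zone.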
The result of this work appears in the plot to the right. One should bear
in mind that this data is necessarily somewhat noisy; there is nothing that
constrains developers' machines to have their time set in the zone where
they physically reside. Daylight saving time can also add noise to the
picture. In the aggregate, though, there are some
interesting things to be seen here.
Starting at the top, +10 includes parts of Russia, parts of Indonesia, and, most importantly for this study, parts of Australia. The +9 zone, instead, is mainly Japan and Korea, while +8 is Western Australia and China. About 15% of the changes to 3.12 came from those three time zones.
There is only one country that lives in +5:30 — India. The number of contributions from India has been growing for a while; it's now 6% of the total. Going west from there, +2 to +4 will be dominated by continental Europe, with the central European time zone accounting for 23% of the changes in 3.12. The UK and Ireland, at +1, put in another 10%.
The western hemisphere (negative) zones will be dominated by North America. The -3 zone, however, only covers Newfoundland and Labrador in the northern hemisphere; the relative scarcity of kernel hackers in that part of the world leads one to conclude that 5% of the patches for 3.12 came from Brazil and Argentina, both of which reside in that zone. Your editor's time zone (-6), alas, was the source of only 1% of the changes going into this release; it must be time to pull together some white-space patches to improve that situation.
To state things more generally: one could say that Asia and Australia contributed 22% of the changes to 3.12, while Europe contributed 43%, North America 30%, and South America 5%. These numbers are clearly approximate, but they probably do not hugely distort the reality. The Linux kernel project truly is global in scope, with developers representing much of the planet participating. All told, it has the appearance of a healthy and thriving community.
The LPC Android microconference
A number of upstream community developers and Google Android developers got together at the 2013 Linux Plumbers Conference to discuss some of the non-graphics-related out-of-tree or still-in-staging portions of the Android patch set. This discussion followed the Android graphics microconference held earlier in the day. This article contains a summary of the major issues discussed at this gathering.
Unified IPC with binder and kdbus
Earlier in the year, Greg Kroah-Hartman from the Linux Foundation had expressed his goal of getting the user-space binder libraries to run over the still-in-development kdbus interprocess communication mechanism. The session started with a question to Greg: "how is that going?". Bashfully rubbing his head and getting playfully heckled by Kay Sievers (another of the kdbus developers) from Red Hat, Greg suggested that we maybe could check back next year. Greg mentioned that he still has it as a goal, but Kay's concerns about the viability of the concept have made him somewhat less confident. At this point Kay stepped in to describe some of the semantic differences between the mechanisms, such as how everything in kdbus is asynchronous, making it more like a network protocol, while binder calls block to completion in a way that's more akin to system calls. These differences in behavior make it very difficult to support both modes of operation elegantly and efficiently in one IPC mechanism.
When questioned whether this meant that binder should be merged upstream alongside kdbus, Greg was a bit hesitant given the reported security concerns with using binder in a non-Android environment. Colin Cross, one of the Android developers in attendance, noted that the security issue is quite easy to avoid, and wouldn't really be a problem, but from his perspective, he doesn't see an immediate need to get binder out of staging and officially merged upstream. Greg agreed, and also clarified there is no rush, as he is fine with binder staying in staging for however long is necessary.
The Android developers have often been dragged through the mud by the community for implementing their own solutions, focused on solving their own problems, rather than working with the community to fix the existing infrastructure to satisfy their needs. In that context, I somewhat jokingly prodded Greg to explain why it is acceptable to develop kdbus instead of fixing or expanding binder's features to support D-Bus. He acknowledged the contradiction but repeated that kdbus really seems to support a different model, and there isn't likely to be a clean way to support both in one implementation.
Kay mentioned that while, in his opinion, the interfaces likely can't be shared, he did see some hope for sharing some of the underlying infrastructure. He was particularly keen on the concept of "sealed file descriptors" as being something that Android could make use of. Sealed file descriptors are used when two applications need to share memory by passing a file descriptor; they allow the sender to "seal" the descriptor so that the receiver can be confident that the data won't be later modified by the sender. Kay mentioned the parallels with Android's ashmem, which is the interface Android uses to create shared memory regions that can be shared via file descriptor passing. At this point there was some confusion in the discussion; while the feature does seem useful, it didn't seem to actually mirror how Android currently uses ashmem or shared file descriptors, but it seemed like maybe it would be something that might indeed become more useful once the functionality is upstream.
I then discussed some of my reasons for being hopeful that Greg would be able to achieve his goal. In particular, there is the issue of binder's complexity and that there really is only one key developer on the Android team who understands the in-kernel binder driver. This makes binder somewhat risky since that developer could be hit by a bus or otherwise stop participating in discussions, making it hard to find someone to continue to maintain binder upstream. Additionally, things like ioctl() compatibility support are currently lacking in binder, and, because of its design, it's not easy for 32-bit and 64-bit applications to communicate with each other using binder. Greg noted that everything in kdbus is 64-bit safe, but he also didn't see why binder couldn't be fixed since it's not an official, stable API, which caused some cross talk about how important supporting legacy Android environments is. Colin spoke up to say it really wasn't a big issue, since, when it's needed, the binder library could be made to query the system and do the right thing.
With that the discussion seemed to be exhausted, with Greg saying we should check back in next year to see if his thinking has changed.
Netfilter extensions
The next topic was netfilter extensions. JP Abgrall from Google's Android team started with a brief background presentation (slides) on the "xt_qtaguid" filter he developed, along with some other netfilter modifications Android uses (in particular, xt_quota2 and xt_idletimer) in order to provide per-application network data statistics in its settings interface.
When these changes were submitted to the community, it was suggested that he look at using the already upstream xt_owner and xt_nfacct (both of which were merged after xt_qtaguid was developed) to provide the same functionality. But JP had a few concerns about the differences between those modules, particularly that using xt_owner and xt_nfacct would require manipulating the firewall rules during normal operation and would require excessive numbers of rules, both of which could cause performance problems.
Marcel Holtmann from Intel spoke up to say that he thought the functionality being provided was really nice and that his projects would like to have something similar. But there were some concerns about how it would work with control groups. JP mentioned he had briefly looked at control groups, but they didn't seem useful. This caused some quick debate between Marcel and JP on the particular differences between how Android and desktop environments utilize control groups, which I'm not sure clarified much.
Eric Dumazet, also from Google, but not an Android developer, piped in that he was a networking developer, that the functionality JP wanted with xt_qtaguid was already upstream, and that NFQUEUE is what they should be using. Since NFQUEUE pushes the packet decisions to user space, this caused a bit of an uproar in the room, as numerous folks were very skeptical that context switching to user space on every packet would be particularly fast or power-efficient.
Eric reassured everyone that it wasn't a concern, and that the enterprise world uses NFQUEUE for workloads up to some millions of packets per second without trouble. After a bit of contentious back-and-forth with JP, it seemed this issue wouldn't be resolved in the time remaining, and Eric suggested JP come over to his side of the Google campus at a later time to discuss it further.
Eric also asked about the out-of-tree xt_quota2 usage and why it was chosen instead of using the already-upstream xt_quota. JP mentioned that the in-tree quota code didn't seem useful at all, and the xt_quota2 code was already implemented. Eric suggested that if the upstream quota didn't work, it should be fixed instead of using xt_quota2. It was brought up that Jan Engelhardt, the developer of xt_quota2, had been contacted, and he said the xt_quota2 code had been rejected by the upstream developers. So a better description of the limitations of the upstream quota code will be needed to help persuade upstream maintainers that the functionality in xt_quota2 is useful.
Android gadget driver and ConfigFS gadget driver
Closing out the microconference session was the Android gadget driver and ConfigFS gadget driver discussion. To try to liven things up a bit, Benoit Goby, of the Google Android team, and Andrzej Pietrasiewicz from Samsung, were seated face-off style in the front of the room for a dramatic showdown.
Benoit started off with a bit of background on the Android gadget driver. This driver allows Android devices to provide a number of gadget functions, such as support for the picture transfer protocol (PTP), USB mass storage, Ethernet (for tethering), and more over a single USB connection. Additionally, unlike other multi-function gadget drivers already upstream, these multiplexed functions can be dynamically configured at run time. The Android gadget driver patches also provide additional functions that the upstream kernel doesn't yet support, like the media transfer protocol (MTP), the ADB debugger, Android Accessory and Android Audio gadgets. Additionally, the Android gadget driver supports FunctionFS gadgets, which allow gadget functionality to be implemented in user space. In fact, the adbd server has been modified to support the FunctionFS interface, removing the need for an in-kernel ADB function driver.
Andrzej then similarly described the ConfigFS gadget as a dynamically configurable gadget device that allows various gadget functions to be multiplexed. It is different from the Android gadget driver in that it uses ConfigFS for setting up and changing the configuration of the various functions. Andrzej talked a bit about the history of the ConfigFS gadget, noting that he originally had taken the Android gadget driver, removed anything that was Android-specific, renamed it Configurable Composite Gadget (CCG) and got it merged into staging. However, when upstream maintainers pushed for use of the ConfigFS interface, the CCG driver was abandoned and Andrzej, along with Sebastian Andrzej Siewior, focused on the ConfigFS gadget implementation. As of 3.10, the ConfigFS gadget is upstream; however, it's still missing some desired functionality, like the FunctionFS interface support.
When asked if he had any issues with the ConfigFS gadget as being a potential upstream replacement for the Android gadget driver, Benoit said he had no objections. Once FunctionFS support and the other Android-specific out-of-tree functions, like MTP, were merged, it would just be a matter of changing Android's user-space code to support it.
Discussion then moved to exactly what the best approach would be for upstreaming the Android-specific gadget functions, like ADB, MTP, Android Accessory and Android audio support. Benoit mentioned again that adbd already supports FunctionFS, so once FunctionFS support is added to the ConfigFS gadget then an ADB function isn't necessary. He also said the MTP implementation could probably be done in user space, but it wasn't clear if that was the best way forward. The Android Accessory function would probably still need to be merged, but the Android Audio interface, he thought, could likely be replaced with a different audio function (though specifically which one and if it was upstream was a little unclear).
When asked if this all sounded reasonable, Andrzej simply agreed, making this one of the least contentious topics of the day. While it didn't have the lively fireworks hoped for to keep folks awake at the end of a long week, it was a nice and optimistic way to end the planned discussions.
I'd like to thank everyone for attending and participating in the discussions, as well as Zach Pfeffer and Karim Yaghmour for co-organizing the microconference, and helping with the note taking and reporting.
Page editor: Jonathan Corbet