Kernel development
Brief items
Kernel release status
The current development kernel is 4.4-rc6, released on December 20. Linus said: "Things remain fairly normal. Last week rc5 was very small indeed, this week we have a slightly bigger rc6. The main difference is that rc6 had a network pull in it."
Stable updates: none have been released in the last week.
Quote of the week
Kernel development news
An (unsigned) long story about page allocation
The kernel project is famously willing to change any internal interface as needed for the long-term maintainability of the code. Effects on out-of-tree modules or other external code are not generally deemed to be reasons to keep an interface stable. But what happens if you want to change one of the oldest interfaces found within the kernel, one with many hundreds of call sites? It turns out that, in 2015, the appetite for interface churn may not be what it once was.
If one looks at mm/memory.c in the Linux 0.01 release, one finds that a page of memory is allocated with:
unsigned long get_free_page(void);
From the memory-management point of view, the system's RAM can be seen as a linear array of pages, so it can make a certain amount of sense to think of addresses as integer types — indexes into the array, essentially. Integers can also be used for arbitrary arithmetic; pointers in C can be used that way too, but one quickly gets into "undefined behavior" territory where an overly enthusiastic compiler may feel entitled to create all kinds of mayhem. So unsigned long was established as the return type from get_free_page() and, in general, as the way that one refers to an address that may appear in any place in memory.
Fast-forward to the 4.4-rc6 release and dig through a rather larger body of code, and one finds that pages are allocated with:
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
unsigned long __get_free_page(gfp_t gfp_mask);
The latter is a macro calling the former with an order of zero. Note that, more than 24 years after the 0.01 release, unsigned long is still used as the return type from __get_free_pages(). There are other variants (alloc_pages(), for example) that return struct page pointers, but much of low-level, page-oriented memory management in Linux is still done with unsigned long values.
The only problem is that, often, the kernel must deal with a page of memory as memory, modifying its contents. That requires a pointer. So even back in 0.01, one can find code like:
p = (struct task_struct *) get_free_page();
The unsigned long return value is immediately cast into the pointer value that is actually needed.
Al Viro did a survey of __get_free_pages() users in current kernels and concluded that "well above 90%" of the callers were using the return value as a pointer. That turns out to be a lot of casts, suggesting that the type of the return value for this function is not correct. So, he suggested, it might make sense to change it:
Some of those bugs, he pointed out, he found simply by looking at the code with this kind of transformation in mind. Ten days later, he showed up with a patch set making the change and asked for a verdict from Linus.
One might find various faults with Linus's response, but a lack of clarity will not be among them. He left no doubt that there was no place in the mainline for this particular patch set. The diffstat in Al's patch (568 files changed, 1956 insertions, 2202 deletions) was clearly frightening; it was enough, in its own right, to rule out the change. A patch this wide-ranging would create conflicts throughout the tree and make life difficult for those backporting patches. This interface, it seems, is too old and too entrenched for this kind of flag-day change; as Linus put it: "No way in hell do we suddenly change the semantics of an interface that has been around from basically day #1."
Still, as he clarified afterward, Linus isn't arguing for leaving everything exactly as it is. He accepted that most callers likely want a pointer value. But the way forward isn't to thrash up an interface like __get_free_pages(); instead, there are two approaches that, he said, could be taken.
The first of these would be to create a new, pointer-oriented interface that exists in parallel with __get_free_pages(). Then call sites could be converted at leisure over the course of what would probably be years.
The alternative, Linus said, is that code needing pointers could just allocate memory with kmalloc() instead. Once upon a time, that would not necessarily have been a good idea, since kmalloc() (implemented by the slab allocators) adds overhead to the page allocator and might have expanded the size of the returned memory beyond one page. Indeed, there was a period where an allocation of exactly one page would have consumed two physically contiguous pages when the slab housekeeping information was added. But those days are long in the past. In current kernels, kmalloc() is fast and requires little memory beyond that which is actually allocated. Indeed, Linus pointed out, kmalloc() may actually be faster than __get_free_pages() due to its use of per-CPU object caches.
So kmalloc() is probably the best option for many of the call sites currently using __get_free_pages(). The places where it is still inappropriate will be those needing multiple-page allocations and those needing allocations that are not only page-sized but page-aligned. In those cases, Linus said, the unsigned long return type might not be a bad thing, since "it's clearly not just a random pointer allocation if the bit pattern of the pointer matters."
After this discussion took place, Al did a pass over the __get_free_pages() call sites in the filesystem code and concluded that almost all of them truly would be better off using kmalloc(). So the end result of this work may be a slow shift in that direction and, perhaps, the creation of a new document telling kernel developers which memory allocator they should be using in which setting.
Two PaX features move toward the mainline
As the Kernel self-protection project (KSPP) ramps up in the month and a half since its formation, several features from the PaX project are starting their journey toward the mainline. The reception on the kernel-hardening mailing list that is being used to coordinate KSPP work has been positive, but the real test for these features will come when they are proposed for the mainline. Two specific patch sets have been posted recently, for PAX_REFCOUNT and PAX_MEMORY_SANITIZE, that we will look at here.
PAX_REFCOUNT
The idea behind the PAX_REFCOUNT patch set, posted by David Windsor, is to detect and handle overflows in reference-count variables. The kernel uses reference counts to track objects that have been allocated, incrementing or decrementing the count as references to them come and go; the kernel frees those objects when the count reaches zero. But if there is a path in the kernel where the count doesn't get decremented when an object reference gets dropped, an attacker could use that path to overflow and wrap the reference counter, effectively setting it to zero when there are actually still valid references to it. The object will be freed, but will still be used by those with references, leading to a use-after-free vulnerability.
This is not the first attempt to add this kind of overflow protection to the kernel. But when Windsor posted about a related idea for kref, which is a kernel abstraction for reference counts, back in 2012, the idea ran aground on how it handled the overflows. Like the original PaX patches, Windsor's patch would call BUG_ON() for reference counts that reached INT_MAX, instead of incrementing them. That would crash the kernel if the count ever reaches INT_MAX, which Greg Kroah-Hartman objected to:
And people wonder why I no longer have any hair...
But as Windsor and others pointed out, there is no sensible recovery that can be done if a reference count is about to wrap. An alternative might be to simply not change the counter (and put a warning into the kernel log) once it reaches INT_MAX, but that would lead to a memory leak. Overall, at least at the time, Kroah-Hartman was clearly skeptical of the whole idea—or even that a kref wrap could be exploited. However, Kees Cook did describe the way an exploit might work:
user1: alloc(), inc
user2: inc
user2: fail to dec
*repeat user2's actions until wrap*
user3: inc
user3: dec, free()
user1: operate on freed memory zomg
In the recent posting of PAX_REFCOUNT, Windsor has essentially broken up the PaX project's patches and applied them to the 4.2.6 stable kernel, though he is working on rebasing on linux-next. He noted a post on the grsecurity forums where the feature is well documented. The implementation changes the kernel's operations on atomic_t types so that overflows cannot occur; increments beyond INT_MAX are disallowed. In addition, processes that would have caused an overflow are sent a SIGKILL so that they can do no further damage. Windsor suggested that the signal might be too severe to start with:
The patches also create an atomic_unchecked_t type that acts just like today's atomic_t; it does no checking for overflow. In fact, the bulk of the patches are to various subsystems that use atomic variables but don't use them as reference counts; they are switched to use the new unchecked type. If the patches get merged, new users of atomic variables will need to determine if they are being used as reference counts or not to choose the proper atomic type.
So far, the comments on the patches have been light, but one suspects the code churn needed to switch all of those atomic types will bring some complaints when the patches get posted more widely. One could imagine creating a new type for those variables that need the checking, but that would require constant vigilance to ensure that any reference counts added to the kernel actually used the new type. That problem still exists with the posted patches, however, since new atomic_unchecked_t variables will need to be scrutinized to see that they aren't being used as reference counts.
PAX_MEMORY_SANITIZE
One way to mitigate the effects of use-after-free vulnerabilities or to stop various information leaks is to "sanitize" memory that is being freed by writing zeroes or some other constant value to it. That is the idea behind the PAX_MEMORY_SANITIZE feature. Laura Abbott posted a partial port of the feature to kernel-hardening on December 21.
In particular, Abbott's patches add the sanitization to the slab allocators (slab, slob, and slub), but not for the buddy allocator as the full PAX_MEMORY_SANITIZE feature does. That means "that allocations which go directly to the buddy allocator (i.e. large allocations) aren't sanitized". The actual sanitization is done using a fixed value (0xff for all architectures except x86-64, which uses 0xfe) that is written over the entire object before it is freed. Abbott plans to look into adding sanitization to the buddy allocator sometime in the new year.
Another change that Abbott made to the PaX version of the feature was to add an option to handle the sanitization in the slow path of the allocator.
Christoph Lameter complained that the feature was similar to the slab-poisoning feature, so it should use that mechanism instead. Abbott agreed that the features were similar, but said that poisoning is a debug feature and this work is targeting kernel hardening so "it seemed more appropriate to keep debug features and non-debug features separate hence the separate option and configuration".
The cost of sanitization is performance, of course. Abbott said she measured impacts of 3-20% depending on the benchmark. But the impact of compiling the feature into the kernel, but turning it off at runtime (using the sanitize_slab=off boot option), is negligible.
Lameter also suggested using the GFP_ZERO flag to make allocations be zeroed before being returned. If there were a mode that set that flag for all allocations it would provide "implied sanitization". But doing it that way would move the performance impact from the free path to the allocation side, which is typically more performance sensitive, as Dave Hansen pointed out. It also means that unallocated memory would still store the potentially sensitive contents of the previous object until it is allocated again.
Instead of writing the fixed sanitization value across the object, writing zeroes would potentially allow the allocation path to skip the zeroing step, Hansen suggested. That might reduce some of the performance impact, though doing the zeroing at allocation time does leave the object's memory cache-hot, as Lameter noted. But zeroing has another downside that Abbott mentioned:
Overall, both patches were fairly well-received, but the hardening list is likely made up of those who are predisposed to look favorably on these kinds of changes. Based on discussions at last year's Kernel Summit, mainline developers should in theory be more receptive to patches that seek to mitigate whole classes of security bugs. If these PaX features can get merged eventually, there are some even more intrusive ones that could also attempt to run the gauntlet of the linux-kernel mailing list. Just where the line is—or even if there is one—is still unclear, but patches like these may help define it.
Some 4.4 development statistics
The 4.4-rc6 kernel prepatch was released on December 20, right on schedule. The 4.4 development series as a whole appears to be on schedule, with the most probable release date for 4.4 final being January 3, after one more prepatch. Linus has suggested that he might delay the release for one week. Any such delay, though, would be to allow developers to recover from the holidays before starting a new merge window rather than anything needed to stabilize 4.4.
So, naturally, it is about time to look at this cycle's development activity. As of this writing, there have been 12,854 non-merge changesets pulled into the mainline this time around. It has thus been a busy cycle, though it would be surprising if we reached the number of changes seen in 4.2 (13,694), or the all-time record (13,722) set for 3.15.
The number of developers involved thus far is 1,548 — a large number, but slightly short of the 1,600 seen in 4.3. We may yet reach the 1,569 seen in the 4.2 cycle, though. Of those 1,548 contributors, 246 made their first kernel contribution in this development cycle — the lowest number since 3.13. The most active developers this time around were:
Most active 4.4 developers
By changesets
H Hartley Sweeten           288   2.2%
Mateusz Kulikowski          218   1.7%
Chaehyun Lim                179   1.4%
Leo Kim                     167   1.3%
Eric W. Biederman           163   1.3%
Shraddha Barke              147   1.1%
Ville Syrjälä               144   1.1%
Arnd Bergmann               143   1.1%
Eric Dumazet                123   1.0%
Tony Cho                    108   0.8%
Geert Uytterhoeven          105   0.8%
Glen Lee                    105   0.8%
Russell King                104   0.8%
Javier Martinez Canillas    101   0.8%
Sudip Mukherjee              96   0.7%
Christoph Hellwig            91   0.7%
Mike Rapoport                91   0.7%
Oleg Drokin                  89   0.7%
Luis de Bethencourt          89   0.7%
Andy Shevchenko              82   0.6%
By changed lines
Alex Deucher              32203   5.0%
Sreekanth Reddy           24009   3.7%
Yuval Mintz               20622   3.2%
Christoph Hellwig         15656   2.4%
huangdaode                14725   2.3%
Michael Chan              13137   2.0%
Lv Zheng                   9887   1.5%
Oleg Drokin                8434   1.3%
Deepa Dinamani             7797   1.2%
Jes Sorensen               7737   1.2%
Peter Senna Tschudin       7676   1.2%
Sudeep Dutt                6881   1.1%
Leo Kim                    6664   1.0%
Alexander Shishkin         6612   1.0%
Arnd Bergmann              5893   0.9%
Takashi Sakamoto           5837   0.9%
Jiri Pirko                 5350   0.8%
Adam Thomson               5123   0.8%
Eric Anholt                5041   0.8%
H Hartley Sweeten          5030   0.8%
After an absence for a few development cycles, H. Hartley Sweeten is back at the top of the per-changeset list for the ongoing work on the Comedi drivers in the staging tree. This code, at just under 100,000 lines, has now seen nearly 8,000 patches — and the work continues. Mateusz Kulikowski worked entirely with the rtl8192e staging driver, while Chaehyun Lim and Leo Kim both fixed up the wilc1000 staging driver. Eric Biederman is engaged in a substantial reworking of how the network stack handles network namespaces, with an emphasis on proper handling of packets that cross namespaces.
On the lines-changed side, Alex Deucher continues to work on the AMD graphics drivers, Sreekanth Reddy removed a bunch of code from the mpt2sas driver (and, as a result, was the developer removing the most code in this cycle), and Yuval Mintz added a couple of QLogic Ethernet drivers. Christoph Hellwig did a fair amount of cleanup throughout the driver and block subsystems, while huangdaode (the only name that appears in the logs) added support for the HiSilicon network subsystem.
The sum of these developers' effort resulted in the net addition of 242,000 lines of code to the kernel in this development cycle.
Work on 4.4 was supported by 202 employers that we could identify, a slight increase from 4.3. The most active companies working on 4.4 were:
Most active 4.4 employers
By changesets
Intel                      1660  12.9%
(Unknown)                  1139   8.9%
(None)                      684   5.3%
Samsung                     670   5.2%
Red Hat                     655   5.1%
Atmel                       449   3.5%
Linaro                      448   3.5%
(Consultant)                419   3.3%
Outreachy                   400   3.1%
IBM                         302   2.3%
Vision Engraving Systems    288   2.2%
                            273   2.1%
SUSE                        257   2.0%
ARM                         226   1.8%
Texas Instruments           210   1.6%
Freescale                   208   1.6%
Renesas Electronics         190   1.5%
AMD                         177   1.4%
Oracle                      173   1.3%
Broadcom                    169   1.3%
By lines changed
Intel                     85390  13.3%
(None)                    37078   5.8%
AMD                       36306   5.6%
Red Hat                   34937   5.4%
(Unknown)                 33739   5.2%
(Consultant)              30271   4.7%
Avago Technologies        27001   4.2%
QLogic                    24381   3.8%
Broadcom                  19318   3.0%
Atmel                     17856   2.8%
Samsung                   16508   2.6%
Linaro                    16154   2.5%
HiSilicon Technologies    15260   2.4%
Outreachy                 12765   2.0%
Renesas Electronics       11745   1.8%
Mellanox                  11590   1.8%
Freescale                 11392   1.8%
ARM                       10986   1.7%
IBM                       10402   1.6%
Texas Instruments         10345   1.6%
For many years, Red Hat stood alone at the top of both columns of this list. That situation has been changing for some time; at this point, it is more than fair to say that Red Hat has ceased to be the most active company in the kernel development community. That is not to slight the company's work, of course; Red Hat still funds many of our most active developers, and those developers, in the subsystem-maintainer role, signed off on 16% of the changes merged this time around. But, at this point, Red Hat is one of a number of top-tier companies working to improve the Linux kernel.
Speaking of signoffs, the most active developers and companies when it comes to signing off patches they did not write are:
Most non-author signoffs in 4.4
Developers
Greg Kroah-Hartman          2746  21.3%
David S. Miller             1048   8.1%
Daniel Vetter                447   3.5%
Andrew Morton                346   2.7%
Mark Brown                   343   2.7%
Ingo Molnar                  241   1.9%
Arnaldo Carvalho de Melo     224   1.7%
Tony Cho                     210   1.6%
Jeff Kirsher                 209   1.6%
Kalle Valo                   174   1.3%
Companies
Linux Foundation            2763  21.6%
Red Hat                     2060  16.1%
Intel                       1649  12.9%
Linaro                       820   6.4%
                             602   4.7%
(None)                       459   3.6%
SUSE                         392   3.1%
Atmel                        260   2.0%
Samsung                      260   2.0%
                             233   1.8%
The kernel's subsystem maintainers remain concentrated in relatively few companies though, arguably, they are spread out a bit more widely than they once were. While many companies are willing to support kernel development in specific areas, fewer of them see the need to support developers working in the subsystem-maintainer role.
In summary, 4.4, the final kernel development cycle of 2015, looks pretty typical. It was busier than most, but that, too, is typical, given the long-term trend toward larger development cycles. That busyness does not appear set to make this cycle longer than the 63 days that we have come to expect, though. Despite its occasional hiccups, the kernel-development machine continues to run smoothly.
Page editor: Jonathan Corbet