Kernel development [LWN.net]

Kernel release status

The current development kernel is 4.4-rc6, released on December 20. Linus said: "Things remain fairly normal. Last week rc5 was very small indeed, this week we have a slightly bigger rc6. The main difference is that rc6 had a network pull in it."

Stable updates: none have been released in the last week.

Quote of the week

In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar insufficiently represented reality, and changed the rules about calculating leap years to account for this. Similarly, in A.D. 2013 Rockchip hardware engineers found that the new Gregorian calendar still contained flaws, and that the month of November should be counted up to 31 days instead. Unfortunately it takes a long time for calendar changes to gain widespread adoption, and just like more than 300 years went by before the last Protestant nation implemented Greg's proposal, we will have to wait a while until all religions and operating system kernels acknowledge the inherent advantages of the Rockchip system. Until then we need to translate dates read from (and written to) Rockchip hardware back to the Gregorian format.

— Julius Werner

An (unsigned) long story about page allocation

By Jonathan Corbet
December 23, 2015

The kernel project is famously willing to change any internal interface as needed for the long-term maintainability of the code. Effects on out-of-tree modules or other external code are not generally deemed to be reasons to keep an interface stable. But what happens if you want to change one of the oldest interfaces found within the kernel — one with many hundreds of call sites? It turns out that, in 2015, the appetite for interface churn may not be what it once was.

If one looks at mm/memory.c in the Linux 0.01 release, one finds that a page of memory is allocated with:

    unsigned long get_free_page(void);

From the memory-management point of view, the system's RAM can be seen as a linear array of pages, so it can make a certain amount of sense to think of addresses as integer types — indexes into the array, essentially. Integers can also be used for arbitrary arithmetic; pointers in C can be used that way too, but one quickly gets into "undefined behavior" territory where an overly enthusiastic compiler may feel entitled to create all kinds of mayhem. So unsigned long was established as the return type from get_free_page() and, in general, as the way that one refers to an address that may appear in any place in memory.
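The "index into an array of pages" view can be sketched in a few lines of user-space C. This is only an illustration, not kernel code: the `sketch_` names and the 4096-byte page size are assumptions made here, and `uintptr_t` stands in for the kernel's use of unsigned long.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4096-byte pages (a shift of 12). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)

/* Index of the page containing addr, counting from the page-aligned base. */
uintptr_t sketch_page_index(uintptr_t base, uintptr_t addr)
{
    assert((base & (SKETCH_PAGE_SIZE - 1)) == 0); /* base must be page-aligned */
    assert(addr >= base);
    return (addr - base) >> SKETCH_PAGE_SHIFT;
}

/* Address of the first byte of page number idx. */
uintptr_t sketch_page_address(uintptr_t base, uintptr_t idx)
{
    return base + (idx << SKETCH_PAGE_SHIFT);
}
```

Because the values are unsigned integers, the subtraction, shift, and mask are all well defined; doing the same with pointers to unrelated objects would not be.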

Fast-forward to the 4.4-rc6 release and dig through a rather larger body of code, and one finds that pages are allocated with:

    unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
    unsigned long __get_free_page(gfp_t gfp_mask);

The latter is a macro calling the former with an order of zero. Note that, more than 24 years after the 0.01 release, unsigned long is still used as the return type from __get_free_pages(). There are other variants (alloc_pages(), for example) that return struct page pointers, but much of low-level, page-oriented memory management in Linux is still done with unsigned long values.

The only problem is that, often, the kernel must deal with a page of memory as memory, modifying its contents. That requires a pointer. So even back in 0.01, one can find code like:

    p = (struct task_struct *) get_free_page();

The unsigned long return value is immediately cast into the pointer value that is actually needed. Al Viro did a survey of __get_free_pages() users in current kernels and concluded that "well above 90%" of the callers were using the return value as a pointer. That turns out to be a lot of casts, suggesting that the type of the return value for this function is not correct. So, he suggested, it might make sense to change it:

In other words, switching to void * for return values of allocating and argument of freeing functions would reduce the amount of boilerplate quite nicely. What's more, fewer casts means better chance for typechecking to catch more bugs.

Some of those bugs, he pointed out, he found simply by looking at the code with this kind of transformation in mind. Ten days later, he showed up with a patch set making the change and asked for a verdict from Linus.

One might find various faults with Linus's response, but a lack of clarity will not be among them. He left no doubt that there was no place in the mainline for this particular patch set. The diffstat in Al's patch (568 files changed, 1956 insertions, 2202 deletions) was clearly frightening — enough, in its own right, to rule out the change. A patch this wide-ranging would create conflicts throughout the tree and make life difficult for those backporting patches. This interface, it seems, is too old and too entrenched for this kind of flag-day change; as Linus put it: "No way in hell do we suddenly change the semantics of an interface that has been around from basically day #1."

Still, as he clarified afterward, Linus isn't arguing for leaving everything exactly as it is. He accepted that most callers likely want a pointer value. But the way forward isn't to thrash up an interface like __get_free_pages(); instead, there are two approaches that, he said, could be taken.

The first of these would be to create a new, pointer-oriented interface that exists in parallel with __get_free_pages(). Then call sites could be converted at leisure over the course of what would probably be years.
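A rough user-space sketch of what such a parallel interface could look like follows. Every name here is hypothetical (no such functions were proposed under these names), `aligned_alloc()` merely stands in for the page allocator, and the page size is assumed to be 4096 bytes.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size for this illustration */

/* Like the kernel, this sketch relies on unsigned long holding a pointer. */
_Static_assert(sizeof(unsigned long) >= sizeof(void *),
               "unsigned long must be wide enough for a pointer");

/* Stand-in for the existing integer-returning interface; 0 means failure. */
unsigned long sketch_get_free_page(void)
{
    return (unsigned long)aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
}

void sketch_free_page(unsigned long addr)
{
    free((void *)addr);
}

/* Hypothetical parallel interface: same allocation, typed as a pointer. */
void *sketch_get_free_page_ptr(void)
{
    return (void *)sketch_get_free_page();
}

void sketch_free_page_ptr(void *page)
{
    sketch_free_page((unsigned long)page);
}

struct sketch_task {
    int pid;
};

/* A converted call site: no cast is needed to use the page as a structure. */
int sketch_demo(void)
{
    struct sketch_task *p = sketch_get_free_page_ptr();
    int pid;

    if (!p)
        return -1;
    assert(((unsigned long)p % SKETCH_PAGE_SIZE) == 0);
    p->pid = 42;
    pid = p->pid;
    sketch_free_page_ptr(p);
    return pid;
}
```

The old and new functions coexist, so each call site can be converted on its own schedule without a tree-wide flag day.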

The alternative, Linus said, is that code needing pointers could just allocate memory with kmalloc() instead. Once upon a time, that would not necessarily have been a good idea, since kmalloc() (implemented by the slab allocators) adds overhead to the page allocator and might have expanded the size of the returned memory beyond one page. Indeed, there was a period where an allocation of exactly one page would have consumed two physically contiguous pages when the slab housekeeping information was added. But those days are long in the past. In current kernels, kmalloc() is fast and requires little memory beyond that which is actually allocated. Indeed, Linus pointed out, kmalloc() may actually be faster than __get_free_pages() due to its use of per-CPU object caches.

So kmalloc() is probably the best option for many of the call sites currently using __get_free_pages(). The places where it is still inappropriate will be those needing multiple-page allocations and those needing allocations that are not only page-sized but page-aligned. In those cases, Linus said, the unsigned long return type might not be a bad thing, since "it's clearly not just a random pointer allocation if the bit pattern of the pointer matters."
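The rule of thumb in the paragraph above can be written down as a small decision function. This is a sketch of the reasoning only; the enum, the function names, and the 4096-byte page size are assumptions made for illustration and do not correspond to any kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this illustration */

enum sketch_allocator {
    SKETCH_USE_KMALLOC,
    SKETCH_USE_PAGE_ALLOCATOR
};

/* True when the low bits of the address are all zero, i.e. page-aligned. */
bool sketch_is_page_aligned(const void *p)
{
    return ((uintptr_t)p & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/*
 * Multi-page requests and requests that depend on page alignment stay
 * with the page allocator; everything else is a kmalloc() candidate.
 */
enum sketch_allocator sketch_choose(size_t size, bool needs_page_alignment)
{
    assert(size > 0);
    if (size > SKETCH_PAGE_SIZE || needs_page_alignment)
        return SKETCH_USE_PAGE_ALLOCATOR;
    return SKETCH_USE_KMALLOC;
}
```

The alignment test is exactly the sort of operation where the "bit pattern of the pointer matters", which is the case Linus singled out as a reasonable fit for an integer return type.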

After this discussion took place, Al did a pass over the __get_free_pages() call sites in the filesystem code and concluded that almost all of them truly would be better off using kmalloc(). So the end result of this work may be a slow shift in that direction and, perhaps, the creation of a new document telling kernel developers which memory allocator they should be using in which setting.

Two PaX features move toward the mainline

By Jake Edge
December 23, 2015

As the Kernel self-protection project (KSPP) has ramped up in the month and a half since its formation, several features from the PaX project have started their journey toward the mainline. The reception on the kernel-hardening mailing list that is being used to coordinate KSPP work has been positive, but the real test for these features will come when they are proposed for the mainline. Two specific patch sets have been posted recently, for PAX_REFCOUNT and PAX_MEMORY_SANITIZE, that we will look at here.

PAX_REFCOUNT

The idea behind the PAX_REFCOUNT patch set, posted by David Windsor, is to detect and handle overflows in reference-count variables. The kernel uses reference counts to track objects that have been allocated, incrementing or decrementing the count as references to them come and go; the kernel frees those objects when the count reaches zero. But if there is a path in the kernel where the count doesn't get decremented when an object reference gets dropped, an attacker could use that path to overflow and wrap the reference counter, effectively setting it to zero when there are actually still valid references to it. The object will be freed, but will still be used by those with references, leading to a use-after-free vulnerability.

This is not the first attempt to add this kind of overflow protection to the kernel. But when Windsor posted about a related idea for kref, which is a kernel abstraction for reference counts, back in 2012, the idea ran aground on how it handled the overflows. Like the original PaX patches, Windsor's patch would call BUG_ON() for reference counts that reached INT_MAX, instead of incrementing them. That would crash the kernel if the count ever reaches INT_MAX, which Greg Kroah-Hartman objected to:

So you are guaranteeing to crash a machine here if this fails? And you were trying to say this is a "security" based fix?

And people wonder why I no longer have any hair...

But as Windsor and others pointed out, there is no sensible recovery that can be done if a reference count is about to wrap. An alternative might be to simply not change the counter (and put a warning into the kernel log) once it reaches INT_MAX, but that would lead to a memory leak. Overall, at least at the time, Kroah-Hartman was clearly skeptical of the whole idea—or even that a kref wrap could be exploited. However, Kees Cook did describe the way an exploit might work:

Based on what I've seen, the "normal" exploit follows this pattern:

user1: alloc(), inc
user2: inc
user2: fail to dec
*repeat user2's actions until wrap*
user3: inc
user3: dec, free()
user1: operate on freed memory zomg
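The pattern above can be replayed in a toy user-space model. To keep the loop short, this sketch assumes a deliberately tiny 8-bit counter rather than the kernel's int-sized one, and a `freed` flag stands in for the real free(); all names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy object with a deliberately tiny (8-bit) reference count. */
struct sketch_obj {
    uint8_t refs;
    bool freed;
};

void sketch_get(struct sketch_obj *o)
{
    o->refs++;  /* 8-bit unsigned counter: 255 + 1 wraps to 0 */
}

void sketch_put(struct sketch_obj *o)
{
    assert(o->refs > 0);
    if (--o->refs == 0)
        o->freed = true;  /* stands in for free() */
}

/*
 * Replays the exploit pattern; returns true if the object ended up
 * freed while user1's reference was still outstanding.
 */
bool sketch_replay_wrap(void)
{
    struct sketch_obj o = { .refs = 0, .freed = false };
    int i;

    sketch_get(&o);              /* user1: alloc(), inc -> refs == 1 */
    for (i = 0; i < 255; i++)
        sketch_get(&o);          /* user2: inc, fail to dec -> wraps to 0 */
    sketch_get(&o);              /* user3: inc -> refs == 1 */
    sketch_put(&o);              /* user3: dec -> refs == 0, "free" */
    return o.freed;              /* user1 never dropped its reference */
}
```

With a 32-bit counter the same sequence needs billions of leaked references rather than 255, but the logic is identical.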

In the recent posting of PAX_REFCOUNT, Windsor has essentially broken up the PaX project's patches and applied them to the 4.2.6 stable kernel, though he is working on rebasing on linux-next. He noted a post on the grsecurity forums where the feature is well documented. The implementation changes the kernel's operations on atomic_t types so that overflows cannot occur; increments beyond INT_MAX are disallowed. In addition, processes that would have caused an overflow are sent a SIGKILL so that they can do no further damage. Windsor suggested that the signal might be too severe to start with:

When an overflow is detected, SIGKILL is sent to the offending process. This may be too drastic for an initial upstream submission. WARN_ON may be more appropriate until distros have some time to absorb it and report any unaddressed overflows.

The patches also create an atomic_unchecked_t type that acts just like today's atomic_t; it does no checking for overflow. In fact, the bulk of the patches are to various subsystems that use atomic variables but don't use them as reference counts; they are switched to use the new unchecked type. If the patches get merged, new users of atomic variables will need to determine if they are being used as reference counts or not to choose the proper atomic type.
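The split between a checked and an unchecked type might be sketched with C11 atomics as follows. This is not the PaX code: the struct and function names are hypothetical, and where the real feature sends SIGKILL to the offending process, this sketch simply refuses the increment and reports it to the caller.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Distinct struct types so the two kinds of counter cannot be mixed up. */
struct sketch_checked   { atomic_int v; };
struct sketch_unchecked { atomic_int v; };

/* Checked increment: never moves past INT_MAX; returns false if refused. */
bool sketch_checked_inc(struct sketch_checked *c)
{
    int old;

    assert(c != NULL);
    old = atomic_load(&c->v);
    do {
        if (old == INT_MAX)
            return false;   /* would overflow: leave the counter alone */
    } while (!atomic_compare_exchange_weak(&c->v, &old, old + 1));
    return true;
}

/* Unchecked increment: C11 atomic arithmetic wraps silently on overflow. */
void sketch_unchecked_inc(struct sketch_unchecked *u)
{
    atomic_fetch_add(&u->v, 1);
}

int sketch_unchecked_read(struct sketch_unchecked *u)
{
    return atomic_load(&u->v);
}
```

Giving the two counters different types means the compiler rejects an attempt to pass one where the other is expected, which is the property that makes the choice visible to reviewers.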

So far, the comments on the patches have been light, but one suspects the code churn needed to switch all of those atomic types will bring some complaints when the patches get posted more widely. One could imagine creating a new type for those variables that need the checking, but that would require constant vigilance to ensure that any reference counts added to the kernel actually used the new type. That problem still exists with the posted patches, however, since new atomic_unchecked_t variables will need to be scrutinized to see that they aren't being used as reference counts.

PAX_MEMORY_SANITIZE

One way to mitigate the effects of use-after-free vulnerabilities or to stop various information leaks is to "sanitize" memory that is being freed by writing zeroes or some other constant value to it. That is the idea behind the PAX_MEMORY_SANITIZE feature. Laura Abbott posted a partial port of the feature to kernel-hardening on December 21.

In particular, Abbott's patches add the sanitization to the slab allocators (slab, slob, and slub), but not to the buddy allocator, as the full PAX_MEMORY_SANITIZE feature does. That means "that allocations which go directly to the buddy allocator (i.e. large allocations) aren't sanitized". The actual sanitization is done using a fixed value (0xff for all architectures except x86-64, which uses 0xfe) that is written over the entire object before it is freed. Abbott plans to look into adding sanitization to the buddy allocator sometime in the new year. Another change that Abbott made to the PaX version of the feature was to add an option to handle the sanitization in the slow path of the allocator.
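The basic idea of sanitizing on free can be sketched in user space as follows. This is an illustration under stated assumptions, not the patch itself: the names are invented, the caller passes the object size (a slab allocator would know it from the cache), and a volatile store loop is used so that the compiler cannot drop the writes as dead stores ahead of free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill values as described above; only the x86-64 macro is checked here. */
#if defined(__x86_64__)
#define SKETCH_SANITIZE_VALUE 0xfe
#else
#define SKETCH_SANITIZE_VALUE 0xff
#endif

/* Overwrite the whole object so its old contents cannot be read later. */
void sketch_sanitize(void *obj, size_t size)
{
    volatile unsigned char *p = obj;
    size_t i;

    assert(obj != NULL);
    for (i = 0; i < size; i++)
        p[i] = SKETCH_SANITIZE_VALUE;
}

/* Free path: sanitize first, then hand the memory back. */
void sketch_sanitizing_free(void *obj, size_t size)
{
    if (!obj)
        return;
    sketch_sanitize(obj, size);
    free(obj);
}
```

All of the cost lands on the free path, which is where the measured overhead discussed below comes from.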

Christoph Lameter complained that the feature was similar to the slab-poisoning feature, so it should use that mechanism instead. Abbott agreed that the features were similar, but said that poisoning is a debug feature and this work is targeting kernel hardening so "it seemed more appropriate to keep debug features and non-debug features separate hence the separate option and configuration".

The cost of sanitization is performance, of course. Abbott said she measured impacts of 3-20% depending on the benchmark. But the impact of compiling the feature into the kernel, but turning it off at runtime (using the sanitize_slab=off boot option), is negligible.

Lameter also suggested using the __GFP_ZERO flag so that allocations are zeroed before being returned. If there were a mode that set that flag for all allocations, it would provide "implied sanitization". But doing it that way would move the performance impact from the free path to the allocation path, which is typically more performance-sensitive, as Dave Hansen pointed out. It also means that freed memory would still hold the potentially sensitive contents of the previous object until it is allocated again.

Instead of writing the fixed sanitization value across the object, writing zeroes would potentially allow the allocation path to skip the zeroing step, Hansen suggested. That might reduce some of the performance impact, though doing the zeroing at allocation time does leave the object's memory cache-hot, as Lameter noted. But zeroing has another downside that Abbott mentioned:

poisoning with non-zero memory makes it easier to determine that the error came from accessing the sanitized memory vs. some other case. I don't think the feature would be as strong if the memory was only zeroed vs. some other data value.

Overall, both patches were fairly well-received, but the hardening list is likely made up of those who are predisposed to look favorably on these kinds of changes. Based on discussions at last year's Kernel Summit, mainline developers should in theory be more receptive to patches that seek to mitigate whole classes of security bugs. If these PaX features can get merged eventually, there are some even more intrusive ones that could also attempt to run the gauntlet of the linux-kernel mailing list. Just where the line is—or even if there is one—is still unclear, but patches like these may help define it.

Some 4.4 development statistics

By Jonathan Corbet
December 23, 2015

The 4.4-rc6 kernel prepatch was released on December 20, right on schedule. The 4.4 development series as a whole appears to be on schedule, with the most probable release date for 4.4 final being January 3, after one more prepatch. Linus has suggested that he might delay the release for one week. Any such delay, though, would be to allow developers to recover from the holidays before starting a new merge window rather than anything needed to stabilize 4.4.

So, naturally, it is about time to look at this cycle's development activity. As of this writing, there have been 12,854 non-merge changesets pulled into the mainline this time around. It has thus been a busy cycle, though it would be surprising if we reached the number of changes seen in 4.2 (13,694), or the all-time record (13,722) set for 3.15.

The number of developers involved thus far is 1,548 — a large number, but slightly short of the 1,600 seen in 4.3. We may yet reach the 1,569 seen in the 4.2 cycle, though. Of those 1,548 contributors, 246 made their first kernel contribution in this development cycle — the lowest number since 3.13. The most active developers this time around were:

Most active 4.4 developers

By changesets

Developer                 Changesets  Percent
H Hartley Sweeten                288     2.2%
Mateusz Kulikowski               218     1.7%
Chaehyun Lim                     179     1.4%
Leo Kim                          167     1.3%
Eric W. Biederman                163     1.3%
Shraddha Barke                   147     1.1%
Ville Syrjälä                    144     1.1%
Arnd Bergmann                    143     1.1%
Eric Dumazet                     123     1.0%
Tony Cho                         108     0.8%
Geert Uytterhoeven               105     0.8%
Glen Lee                         105     0.8%
Russell King                     104     0.8%
Javier Martinez Canillas         101     0.8%
Sudip Mukherjee                   96     0.7%
Christoph Hellwig                 91     0.7%
Mike Rapoport                     91     0.7%
Oleg Drokin                       89     0.7%
Luis de Bethencourt               89     0.7%
Andy Shevchenko                   82     0.6%

By changed lines

Developer                      Lines  Percent
Alex Deucher                   32203     5.0%
Sreekanth Reddy                24009     3.7%
Yuval Mintz                    20622     3.2%
Christoph Hellwig              15656     2.4%
huangdaode                     14725     2.3%
Michael Chan                   13137     2.0%
Lv Zheng                        9887     1.5%
Oleg Drokin                     8434     1.3%
Deepa Dinamani                  7797     1.2%
Jes Sorensen                    7737     1.2%
Peter Senna Tschudin            7676     1.2%
Sudeep Dutt                     6881     1.1%
Leo Kim                         6664     1.0%
Alexander Shishkin              6612     1.0%
Arnd Bergmann                   5893     0.9%
Takashi Sakamoto                5837     0.9%
Jiri Pirko                      5350     0.8%
Adam Thomson                    5123     0.8%
Eric Anholt                     5041     0.8%
H Hartley Sweeten               5030     0.8%

After an absence for a few development cycles, H. Hartley Sweeten is back at the top of the per-changeset list for the ongoing work on the Comedi drivers in the staging tree. This code, at just under 100,000 lines, has now seen nearly 8,000 patches — and the work continues. Mateusz Kulikowski worked entirely with the rtl8192e staging driver, while Chaehyun Lim and Leo Kim both fixed up the wilc1000 staging driver. Eric Biederman is engaged in a substantial reworking of how the network stack handles network namespaces, with an emphasis on proper handling of packets that cross namespaces.

On the lines-changed side, Alex Deucher continues to work on the AMD graphics drivers, Sreekanth Reddy removed a bunch of code from the mpt2sas driver (and, as a result, was the developer removing the most code in this cycle), and Yuval Mintz added a couple of Qlogic Ethernet drivers. Christoph Hellwig did a fair amount of cleanup throughout the driver and block subsystems, while huangdaode (the only name that appears in the logs) added support for the Hisilicon network subsystem.

The sum of these developers' effort resulted in the net addition of 242,000 lines of code to the kernel in this development cycle.

Work on 4.4 was supported by 202 employers that we could identify, a slight increase from 4.3. The most active companies working on 4.4 were:

Most active 4.4 employers

By changesets

Employer                  Changesets  Percent
Intel                           1660    12.9%
(Unknown)                       1139     8.9%
(None)                           684     5.3%
Samsung                          670     5.2%
Red Hat                          655     5.1%
Atmel                            449     3.5%
Linaro                           448     3.5%
(Consultant)                     419     3.3%
Outreachy                        400     3.1%
IBM                              302     2.3%
Vision Engraving Systems         288     2.2%
Google                           273     2.1%
SUSE                             257     2.0%
ARM                              226     1.8%
Texas Instruments                210     1.6%
Freescale                        208     1.6%
Renesas Electronics              190     1.5%
AMD                              177     1.4%
Oracle                           173     1.3%
Broadcom                         169     1.3%

By lines changed

Employer                       Lines  Percent
Intel                          85390    13.3%
(None)                         37078     5.8%
AMD                            36306     5.6%
Red Hat                        34937     5.4%
(Unknown)                      33739     5.2%
(Consultant)                   30271     4.7%
Avago Technologies             27001     4.2%
QLogic                         24381     3.8%
Broadcom                       19318     3.0%
Atmel                          17856     2.8%
Samsung                        16508     2.6%
Linaro                         16154     2.5%
HiSilicon Technologies         15260     2.4%
Outreachy                      12765     2.0%
Renesas Electronics            11745     1.8%
Mellanox                       11590     1.8%
Freescale                      11392     1.8%
ARM                            10986     1.7%
IBM                            10402     1.6%
Texas Instruments              10345     1.6%

For many years, Red Hat stood alone at the top of both columns of this list. That situation has been changing for some time; at this point, it is more than fair to say that Red Hat has ceased to be the most active company in the kernel development community. That is not to slight the company's work, of course; Red Hat still funds many of our most active developers, and those developers, in the subsystem-maintainer role, signed off on 16% of the changes merged this time around. But, at this point, Red Hat is one of a number of top-tier companies working to improve the Linux kernel.

Speaking of signoffs, the most active developers and companies when it comes to signing off patches they did not write are:

Most non-author signoffs in 4.4

Developers

Developer                   Signoffs  Percent
Greg Kroah-Hartman              2746    21.3%
David S. Miller                 1048     8.1%
Daniel Vetter                    447     3.5%
Andrew Morton                    346     2.7%
Mark Brown                       343     2.7%
Ingo Molnar                      241     1.9%
Arnaldo Carvalho de Melo         224     1.7%
Tony Cho                         210     1.6%
Jeff Kirsher                     209     1.6%
Kalle Valo                       174     1.3%

Companies

Company                     Signoffs  Percent
Linux Foundation                2763    21.6%
Red Hat                         2060    16.1%
Intel                           1649    12.9%
Linaro                           820     6.4%
Google                           602     4.7%
(None)                           459     3.6%
SUSE                             392     3.1%
Atmel                            260     2.0%
Samsung                          260     2.0%
Facebook                         233     1.8%

The kernel's subsystem maintainers remain concentrated in relatively few companies though, arguably, they are spread out a bit more widely than they once were. While many companies are willing to support kernel development in specific areas, fewer of them see the need to support developers working in the subsystem-maintainer role.

In summary, 4.4, the final kernel development cycle of 2015, looks pretty typical. It was busier than most, but that, too, is typical, given the long-term trend toward larger development cycles. That busyness does not appear set to make this cycle longer than the 63 days that we have come to expect, though. Despite its occasional hiccups, the kernel-development machine continues to run smoothly.

Patches and updates

Linus Torvalds Linux 4.4-rc6
Sebastian Andrzej Siewior 4.1.15-rt17
Kamal Mostafa Linux 3.19.8-ckt12
Kamal Mostafa Linux 3.13.11-ckt32
David Woods arm64: Add support for PTE contiguous bit.
Fenghua Yu x86: Intel Cache Allocation Technology Support
Xiao Guangrong KVM: x86: track guest page access
serge.hallyn@ubuntu.com CGroup Namespaces (v8)
Josh Poimboeuf Compile-time stack metadata validation
Songjun Wu ASoC: atmel-pdmic: add driver for Atmel PDMIC
Jon Hunter Add support for Tegra210 AGIC
MaJun irqchip:support mbigen interrupt controller
Tomas Winkler mei: create proper iAMT watchdog driver
Sinan Kaya dma: add Qualcomm Technologies HIDMA driver
Sergei Shtylyov extcon: add Maxim MAX3355 driver
Ludovic Desroches Introduce at91_adc8xx driver
igal.liberman@freescale.com Freescale DPAA FMan
shh.xie@gmail.com net: phy: adds backplane driver for Freescale's PCS PHY
Faisal Latif add Intel(R) X722 iWARP driver
Keith Busch Driver for new "VMD" device
Mathieu Poirier Coresight integration with perf
David Daney pci: Add host controller driver for Cavium ThunderX PCIe
Liviu Dudau drm: Add support for the ARM HDLCD display controller
Yakir Yang Add Analogix Core Display Port Driver
Laurent Pinchart Request API and proof-of-concept implementation
Tiago Vignatti Direct userspace dma-buf mmap (v7)
Deepa Dinamani fs: Add 64 bit time support
Darrick J. Wong vfs: hoist reflink/dedupe ioctls to the VFS
Jan Kara ext[24]: MBCache rewrite
Theodore Ts'o ext4 crypto: backup and restore encrypted files
Vladimir Davydov Add swap accounting to cgroup2
Ross Zwisler DAX fsync/msync support
Dan Williams get_user_pages() for dax pte and pmd mappings
Al Viro free_pages stuff
Vitaly Kuznetsov memory-hotplug: add automatic onlining policy for the newly added memory
Jarno Rajahalme openvswitch: NAT support.
Craig Gallek Faster SO_REUSEPORT
Baolin Wang Introduce the bulk IV mode for improving the crypto engine efficiency
Stephan Mueller crypto: add algif_akcipher user space API
David Windsor Add PAX_REFCOUNT overflow protection
Laura Abbott Sanitization of slabs based on grsecurity/PaX
Linus Torvalds Batched user access support
Daniel Cashman Allow customizable random offset to mmap_base address.
Stefan Hajnoczi Add virtio transport for AF_VSOCK
Arnaldo Carvalho de Melo perf/core new feature: 'perf stat record/report'
Namhyung Kim perf tools: Support dynamic sort keys for tracepoints (v3)
Jiri Olsa perf stat: Add scripting support
Andy Yan misc: add reboot mode driver
