Kernel development

Brief items

Kernel release status

The current development kernel is 4.4-rc6, released on December 20. Linus said: "Things remain fairly normal. Last week rc5 was very small indeed, this week we have a slightly bigger rc6. The main difference is that rc6 had a network pull in it."

Stable updates: none have been released in the last week.

Quote of the week

In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar insufficiently represented reality, and changed the rules about calculating leap years to account for this. Similarly, in A.D. 2013 Rockchip hardware engineers found that the new Gregorian calendar still contained flaws, and that the month of November should be counted up to 31 days instead. Unfortunately it takes a long time for calendar changes to gain widespread adoption, and just like more than 300 years went by before the last Protestant nation implemented Greg's proposal, we will have to wait a while until all religions and operating system kernels acknowledge the inherent advantages of the Rockchip system. Until then we need to translate dates read from (and written to) Rockchip hardware back to the Gregorian format.
Julius Werner

Kernel development news

An (unsigned) long story about page allocation

By Jonathan Corbet
December 23, 2015
The kernel project is famously willing to change any internal interface as needed for the long-term maintainability of the code. Effects on out-of-tree modules or other external code are not generally deemed to be reasons to keep an interface stable. But what happens if you want to change one of the oldest interfaces found within the kernel — one with many hundreds of call sites? It turns out that, in 2015, the appetite for interface churn may not be what it once was.

If one looks at mm/memory.c in the Linux 0.01 release, one finds that a page of memory is allocated with:

    unsigned long get_free_page(void);

From the memory-management point of view, the system's RAM can be seen as a linear array of pages, so it can make a certain amount of sense to think of addresses as integer types — indexes into the array, essentially. Integers can also be used for arbitrary arithmetic; pointers in C can be used that way too, but one quickly gets into "undefined behavior" territory where an overly enthusiastic compiler may feel entitled to create all kinds of mayhem. So unsigned long was established as the return type from get_free_page() and, in general, as the way that one refers to an address that may appear in any place in memory.
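The array-of-pages view can be made concrete with a little integer arithmetic. What follows is a minimal userspace sketch rather than kernel code; it assumes 4KB pages, and the SKETCH_ constants and helper names are invented for illustration:

```c
#include <stdint.h>

/* Assumption for illustration: 4KB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)

/* Index of the page containing addr, treating memory as an array of pages. */
static uintptr_t page_index(uintptr_t addr)
{
    return addr >> SKETCH_PAGE_SHIFT;
}

/* Offset of addr within its page. */
static uintptr_t page_offset(uintptr_t addr)
{
    return addr & (SKETCH_PAGE_SIZE - 1);
}

/* Address of the first byte of the page containing addr. */
static uintptr_t page_base(uintptr_t addr)
{
    return addr & ~(SKETCH_PAGE_SIZE - 1);
}
```

With these helpers, the address 0x3004 falls in page 3 at offset 4. All of this is well-defined unsigned arithmetic, which is the attraction of an integer type for this sort of work.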

Fast-forward to the 4.4-rc6 release and dig through a rather larger body of code, and one finds that pages are allocated with:

    unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
    unsigned long __get_free_page(gfp_t gfp_mask);

The latter is a macro calling the former with an order of zero. Note that, more than 24 years after the 0.01 release, unsigned long is still used as the return type from __get_free_pages(). There are other variants (alloc_pages(), for example) that return struct page pointers, but much of low-level, page-oriented memory management in Linux is still done with unsigned long values.

The only problem is that, often, the kernel must deal with a page of memory as memory, modifying its contents. That requires a pointer. So even back in 0.01, one can find code like:

    p = (struct task_struct *) get_free_page();

The unsigned long return value is immediately cast into the pointer value that is actually needed. Al Viro did a survey of __get_free_pages() users in current kernels and concluded that "well above 90%" of the callers were using the return value as a pointer. That turns out to be a lot of casts, suggesting that the type of the return value for this function is not correct. So, he suggested, it might make sense to change it:

In other words, switching to void * for return values of allocating and argument of freeing functions would reduce the amount of boilerplate quite nicely. What's more, fewer casts means better chance for typechecking to catch more bugs.

Some of those bugs, he pointed out, he found simply by looking at the code with this kind of transformation in mind. Ten days later, he showed up with a patch set making the change and asked for a verdict from Linus.

One might find various faults with Linus's response, but a lack of clarity will not be among them. He left no doubt that there was no place in the mainline for this particular patch set. The diffstat in Al's patch (568 files changed, 1956 insertions, 2202 deletions) was clearly frightening — enough, in its own right, to rule out the change. A patch this wide-ranging would create conflicts throughout the tree and make life difficult for those backporting patches. This interface, it seems, is too old and too entrenched for this kind of flag-day change; as Linus put it: "No way in hell do we suddenly change the semantics of an interface that has been around from basically day #1."

Still, as he clarified afterward, Linus isn't arguing for leaving everything exactly as it is. He accepted that most callers likely want a pointer value. But the way forward isn't to thrash up an interface like __get_free_pages(); instead, there are two approaches that, he said, could be taken.

The first of these would be to create a new, pointer-oriented interface that exists in parallel with __get_free_pages(). Then call sites could be converted at leisure over the course of what would probably be years.
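As a rough illustration of what such a parallel interface might look like from a call site, here is a userspace sketch. Every name in it is hypothetical (no such kernel functions were proposed under these names), malloc() stands in for the page allocator, and unsigned long is assumed to be pointer-sized, as it is in the kernel:

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL

struct task_sketch {
    int pid;
};

/* Stand-in for the legacy interface: the address comes back as an integer. */
static unsigned long legacy_get_free_page(void)
{
    return (unsigned long)(uintptr_t)malloc(SKETCH_PAGE_SIZE);
}

static void legacy_free_page(unsigned long addr)
{
    free((void *)(uintptr_t)addr);
}

/* Hypothetical parallel interface: same allocation, pointer-typed. */
static void *page_alloc_ptr(void)
{
    return (void *)(uintptr_t)legacy_get_free_page();
}

static void page_free_ptr(void *page)
{
    legacy_free_page((unsigned long)(uintptr_t)page);
}

/* Unconverted call site: one cast on allocation, another on free. */
static int old_call_site(void)
{
    struct task_sketch *p = (struct task_sketch *)legacy_get_free_page();
    int pid;

    if (!p)
        return -1;
    p->pid = 42;
    pid = p->pid;
    legacy_free_page((unsigned long)p);
    return pid;
}

/* Converted call site: no casts are needed in either direction. */
static int new_call_site(void)
{
    struct task_sketch *p = page_alloc_ptr();
    int pid;

    if (!p)
        return -1;
    p->pid = 42;
    pid = p->pid;
    page_free_ptr(p);
    return pid;
}
```

Both call sites behave identically; the difference is that the converted one contains no casts, so handing an integer to page_free_ptr() by mistake would draw a compiler diagnostic. The two interfaces can coexist for as long as the conversion takes.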

The alternative, Linus said, is that code needing pointers could just allocate memory with kmalloc() instead. Once upon a time, that would not necessarily have been a good idea, since kmalloc() (implemented by the slab allocators) adds overhead to the page allocator and might have expanded the size of the returned memory beyond one page. Indeed, there was a period where an allocation of exactly one page would have consumed two physically contiguous pages when the slab housekeeping information was added. But those days are long in the past. In current kernels, kmalloc() is fast and requires little memory beyond that which is actually allocated. Indeed, Linus pointed out, kmalloc() may actually be faster than __get_free_pages() due to its use of per-CPU object caches.

So kmalloc() is probably the best option for many of the call sites currently using __get_free_pages(). The places where it is still inappropriate will be those needing multiple-page allocations and those needing allocations that are not only page-sized but page-aligned. In those cases, Linus said, the unsigned long return type might not be a bad thing, since "it's clearly not just a random pointer allocation if the bit pattern of the pointer matters."
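The two remaining cases can be sketched in userspace as follows. The order argument to __get_free_pages() is the base-2 logarithm of the number of pages requested, and "page-aligned" means that the low bits of the address are zero. The sketch assumes 4KB pages and uses C11's aligned_alloc() as a stand-in for the page allocator; the helper names are invented:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Bytes covered by an allocation of the given order: 2^order pages. */
static size_t bytes_for_order(unsigned int order)
{
    return (size_t)SKETCH_PAGE_SIZE << order;
}

/* Nonzero if p sits exactly on a page boundary. */
static int is_page_aligned(const void *p)
{
    return ((uintptr_t)p & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/* Userspace stand-in for a page-aligned, possibly multiple-page allocation. */
static void *alloc_pages_sketch(unsigned int order)
{
    return aligned_alloc(SKETCH_PAGE_SIZE, bytes_for_order(order));
}
```

An order-2 request covers 16KB here, and is_page_aligned() is exactly the kind of test on "the bit pattern of the pointer" that Linus was referring to.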

After this discussion took place, Al did a pass over the __get_free_pages() call sites in the filesystem code and concluded that almost all of them truly would be better off using kmalloc(). So the end result of this work may be a slow shift in that direction and, perhaps, the creation of a new document telling kernel developers which memory allocator they should be using in which setting.

Two PaX features move toward the mainline

By Jake Edge
December 23, 2015

In the month and a half since its formation, the Kernel self-protection project (KSPP) has been ramping up, and several features from the PaX project are starting their journey toward the mainline. The reception on the kernel-hardening mailing list that is being used to coordinate KSPP work has been positive, but the real test for these features will come when they are proposed for the mainline. Two specific patch sets have been posted recently, for PAX_REFCOUNT and PAX_MEMORY_SANITIZE, that we will look at here.

PAX_REFCOUNT

The idea behind the PAX_REFCOUNT patch set, posted by David Windsor, is to detect and handle overflows in reference-count variables. The kernel uses reference counts to track objects that have been allocated, incrementing or decrementing the count as references to them come and go; the kernel frees those objects when the count reaches zero. But if there is a path in the kernel where the count doesn't get decremented when an object reference gets dropped, an attacker could use that path to overflow and wrap the reference counter, effectively setting it to zero when there are actually still valid references to it. The object will be freed, but will still be used by those with references, leading to a use-after-free vulnerability.

This is not the first attempt to add this kind of overflow protection to the kernel. But when Windsor posted about a related idea for kref, which is a kernel abstraction for reference counts, back in 2012, the idea ran aground on how it handled the overflows. Like the original PaX patches, Windsor's patch would call BUG_ON() for reference counts that reached INT_MAX, instead of incrementing them. That would crash the kernel if the count ever reaches INT_MAX, which Greg Kroah-Hartman objected to:

So you are guaranteeing to crash a machine here if this fails? And you were trying to say this is a "security" based fix?

And people wonder why I no longer have any hair...

But as Windsor and others pointed out, there is no sensible recovery that can be done if a reference count is about to wrap. An alternative might be to simply not change the counter (and put a warning into the kernel log) once it reaches INT_MAX, but that would lead to a memory leak. Overall, at least at the time, Kroah-Hartman was clearly skeptical of the whole idea—or even that a kref wrap could be exploited. However, Kees Cook did describe the way an exploit might work:

Based on what I've seen, the "normal" exploit follows this pattern:

    user1: alloc(), inc
    user2: inc
    user2: fail to dec
    *repeat user2's actions until wrap*
    user3: inc
    user3: dec, free()
    user1: operate on freed memory zomg
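The sequence Cook describes can be simulated in a few lines of C. This is a toy model only: it uses a deliberately tiny 8-bit counter so that the wrap arrives after 256 increments (a kernel counter is int-sized, so a real attack needs far more iterations), and a flag stands in for the actual free():

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy object; the 8-bit counter is an assumption made to keep the loop short. */
struct toy_object {
    uint8_t refcount;
    bool    freed;
};

static void toy_get(struct toy_object *obj)
{
    obj->refcount++;            /* stored back modulo 256, so 255 wraps to 0 */
}

static void toy_put(struct toy_object *obj)
{
    if (--obj->refcount == 0)
        obj->freed = true;      /* stands in for free() */
}

/* Returns true if the object is "freed" while user1 still holds a reference. */
static bool simulate_wrap(void)
{
    struct toy_object obj = { .refcount = 0, .freed = false };
    int i;

    toy_get(&obj);              /* user1: alloc(), inc; count is 1 */
    for (i = 0; i < 255; i++)
        toy_get(&obj);          /* user2: inc without a matching dec */
    /* 256 increments in total: the 8-bit counter has wrapped to 0 */
    toy_get(&obj);              /* user3: inc; count is 1 */
    toy_put(&obj);              /* user3: dec; count is 0, object freed */
    return obj.freed;           /* user1 now holds a dangling reference */
}
```

The final put is entirely legitimate from user3's point of view; the damage was done by the leaked references that pushed the counter past its maximum.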

In the recent posting of PAX_REFCOUNT, Windsor has essentially broken up the PaX project's patches and applied them to the 4.2.6 stable kernel, though he is working on rebasing on linux-next. He noted a post on the grsecurity forums where the feature is well documented. The implementation changes the kernel's operations on atomic_t types so that overflows cannot occur; increments beyond INT_MAX are disallowed. In addition, processes that would have caused an overflow are sent a SIGKILL so that they can do no further damage. Windsor suggested that the signal might be too severe to start with:

When an overflow is detected, SIGKILL is sent to the offending process. This may be too drastic for an initial upstream submission. WARN_ON may be more appropriate until distros have some time to absorb it and report any unaddressed overflows.

The patches also create an atomic_unchecked_t type that acts just like today's atomic_t; it does no checking for overflow. In fact, the bulk of the patches are to various subsystems that use atomic variables but don't use them as reference counts; they are switched to use the new unchecked type. If the patches get merged, new users of atomic variables will need to determine if they are being used as reference counts or not to choose the proper atomic type.
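The split between checked and unchecked counters might be sketched with C11 atomics as shown below. This is not how the PaX patches implement the check, and the type and function names are invented; the sketch only illustrates the behavior described above, where an increment beyond INT_MAX is refused for one type and allowed to wrap for the other:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the checked and unchecked atomic types. */
typedef struct { atomic_int v; } checked_count_t;
typedef struct { atomic_int v; } unchecked_count_t;

/*
 * Increment unless the counter already sits at INT_MAX. Returns false when
 * the increment is refused; a caller could then warn or kill the offender.
 */
static bool checked_inc(checked_count_t *c)
{
    int old = atomic_load(&c->v);

    do {
        if (old == INT_MAX)
            return false;
        /* on failure, old is reloaded with the current value and we retry */
    } while (!atomic_compare_exchange_weak(&c->v, &old, old + 1));
    return true;
}

/* For statistics-style counters, where wrapping around is harmless. */
static void unchecked_inc(unchecked_count_t *c)
{
    atomic_fetch_add(&c->v, 1);   /* C11 atomic arithmetic wraps silently */
}
```

A counter stuck at INT_MAX can never again reach zero through balanced puts, so the object leaks, but a leak is a far milder failure than a use-after-free.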

So far, the comments on the patches have been light, but one suspects the code churn needed to switch all of those atomic types will bring some complaints when the patches get posted more widely. One could imagine creating a new type for those variables that need the checking, but that would require constant vigilance to ensure that any reference counts added to the kernel actually used the new type. That problem still exists with the posted patches, however, since new atomic_unchecked_t variables will need to be scrutinized to see that they aren't being used as reference counts.

PAX_MEMORY_SANITIZE

One way to mitigate the effects of use-after-free vulnerabilities, or to stop various information leaks, is to "sanitize" memory that is being freed by writing zeroes or some other constant value to it. That is the idea behind the PAX_MEMORY_SANITIZE feature. Laura Abbott posted a partial port of the feature to kernel-hardening on December 21.

In particular, Abbott's patches add the sanitization to the slab allocators (slab, slob, and slub), but not to the buddy allocator as the full PAX_MEMORY_SANITIZE feature does. That means "that allocations which go directly to the buddy allocator (i.e. large allocations) aren't sanitized". The actual sanitization is done using a fixed value (0xff for all architectures except x86-64, which uses 0xfe) that is written over the entire object before it is freed. Abbott plans to look into adding sanitization to the buddy allocator sometime in the new year. Another change that Abbott made to the PaX version of the feature was to add an option to handle the sanitization in the slow path of the allocator.
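The basic mechanism is simple enough to show with a toy allocator. The sketch below is a userspace illustration, not the posted patch: a small static pool stands in for a slab cache (so the memory can safely be inspected after it is released), the object size and count are arbitrary, and only the 0xfe value comes from the description above:

```c
#include <stddef.h>
#include <string.h>

#define SANITIZE_VALUE 0xfe     /* the x86-64 value mentioned above */
#define OBJ_SIZE       64       /* arbitrary, for illustration */
#define OBJ_COUNT      4

/* Toy fixed-size object pool standing in for a slab cache. */
static unsigned char pool[OBJ_COUNT][OBJ_SIZE];
static int in_use[OBJ_COUNT];

static void *pool_alloc(void)
{
    int i;

    for (i = 0; i < OBJ_COUNT; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL;
}

/* Overwrite the entire object before handing it back to the pool. */
static void pool_free_sanitized(void *obj)
{
    int i;

    for (i = 0; i < OBJ_COUNT; i++) {
        if (obj == (void *)pool[i]) {
            memset(pool[i], SANITIZE_VALUE, OBJ_SIZE);
            in_use[i] = 0;
            return;
        }
    }
}

/* Nonzero if every byte of the object still holds the sanitize value. */
static int looks_sanitized(const void *obj)
{
    const unsigned char *p = obj;
    size_t i;

    for (i = 0; i < OBJ_SIZE; i++)
        if (p[i] != SANITIZE_VALUE)
            return 0;
    return 1;
}
```

Once an object has passed through pool_free_sanitized(), a stale pointer to it reads only 0xfe bytes rather than whatever sensitive data the previous owner left behind.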

Christoph Lameter complained that the feature was similar to the slab-poisoning feature, so it should use that mechanism instead. Abbott agreed that the features were similar, but said that poisoning is a debug feature and this work is targeting kernel hardening so "it seemed more appropriate to keep debug features and non-debug features separate hence the separate option and configuration".

The cost of sanitization is performance, of course. Abbott said she measured impacts of 3-20% depending on the benchmark. But the impact of compiling the feature into the kernel, but turning it off at runtime (using the sanitize_slab=off boot option), is negligible.

Lameter also suggested using the GFP_ZERO flag to make allocations be zeroed before being returned. If there were a mode that set that flag for all allocations it would provide "implied sanitization". But doing it that way would move the performance impact from the free path to the allocation side, which is typically more performance sensitive, as Dave Hansen pointed out. It also means that unallocated memory would still store the potentially sensitive contents of the previous object until it is allocated again.

Instead of writing the fixed sanitization value across the object, writing zeroes would potentially allow the allocation path to skip the zeroing step, Hansen suggested. That might reduce some of the performance impact, though doing the zeroing at allocation time does leave the object's memory cache-hot, as Lameter noted. But zeroing has another downside that Abbott mentioned:

poisoning with non-zero memory makes it easier to determine that the error came from accessing the sanitized memory vs. some other case. I don't think the feature would be as strong if the memory was only zeroed vs. some other data value.

Overall, both patches were fairly well-received, but the hardening list is likely made up of those who are predisposed to look favorably on these kinds of changes. Based on discussions at last year's Kernel Summit, mainline developers should in theory be more receptive to patches that seek to mitigate whole classes of security bugs. If these PaX features can get merged eventually, there are some even more intrusive ones that could also attempt to run the gauntlet of the linux-kernel mailing list. Just where the line is—or even if there is one—is still unclear, but patches like these may help define it.

Some 4.4 development statistics

By Jonathan Corbet
December 23, 2015
The 4.4-rc6 kernel prepatch was released on December 20, right on schedule. The 4.4 development series as a whole appears to be on schedule, with the most probable release date for 4.4 final being January 3, after one more prepatch. Linus has suggested that he might delay the release for one week. Any such delay, though, would be to allow developers to recover from the holidays before starting a new merge window rather than anything needed to stabilize 4.4.

So, naturally, it is about time to look at this cycle's development activity. As of this writing, there have been 12,854 non-merge changesets pulled into the mainline this time around. It has thus been a busy cycle, though it would be surprising if we reached the number of changes seen in 4.2 (13,694), or the all-time record (13,722) set for 3.15.

The number of developers involved thus far is 1,548 — a large number, but slightly short of the 1,600 seen in 4.3. We may yet reach the 1,569 seen in the 4.2 cycle, though. Of those 1,548 contributors, 246 made their first kernel contribution in this development cycle — the lowest number since 3.13. The most active developers this time around were:

Most active 4.4 developers

By changesets
H Hartley Sweeten            288   2.2%
Mateusz Kulikowski           218   1.7%
Chaehyun Lim                 179   1.4%
Leo Kim                      167   1.3%
Eric W. Biederman            163   1.3%
Shraddha Barke               147   1.1%
Ville Syrjälä                144   1.1%
Arnd Bergmann                143   1.1%
Eric Dumazet                 123   1.0%
Tony Cho                     108   0.8%
Geert Uytterhoeven           105   0.8%
Glen Lee                     105   0.8%
Russell King                 104   0.8%
Javier Martinez Canillas     101   0.8%
Sudip Mukherjee               96   0.7%
Christoph Hellwig             91   0.7%
Mike Rapoport                 91   0.7%
Oleg Drokin                   89   0.7%
Luis de Bethencourt           89   0.7%
Andy Shevchenko               82   0.6%

By changed lines
Alex Deucher               32203   5.0%
Sreekanth Reddy            24009   3.7%
Yuval Mintz                20622   3.2%
Christoph Hellwig          15656   2.4%
huangdaode                 14725   2.3%
Michael Chan               13137   2.0%
Lv Zheng                    9887   1.5%
Oleg Drokin                 8434   1.3%
Deepa Dinamani              7797   1.2%
Jes Sorensen                7737   1.2%
Peter Senna Tschudin        7676   1.2%
Sudeep Dutt                 6881   1.1%
Leo Kim                     6664   1.0%
Alexander Shishkin          6612   1.0%
Arnd Bergmann               5893   0.9%
Takashi Sakamoto            5837   0.9%
Jiri Pirko                  5350   0.8%
Adam Thomson                5123   0.8%
Eric Anholt                 5041   0.8%
H Hartley Sweeten           5030   0.8%
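For readers checking the numbers, the changeset percentages above appear to be each developer's count divided by the 12,854 changesets merged so far, rounded to one decimal place. The following small sketch (with invented helper names) reproduces that arithmetic using integers only:

```c
#include <stddef.h>
#include <stdio.h>

/* Share of the total in tenths of a percent, rounded to nearest. */
static unsigned int share_tenths(unsigned long count, unsigned long total)
{
    return (unsigned int)((count * 1000 + total / 2) / total);
}

/* Format the share the way the tables print it, e.g. "2.2%". */
static void format_share(char *buf, size_t len,
                         unsigned long count, unsigned long total)
{
    unsigned int tenths = share_tenths(count, total);

    snprintf(buf, len, "%u.%u%%", tenths / 10, tenths % 10);
}
```

Feeding in the top entry, 288 changesets out of 12,854, yields 22 tenths of a percent, matching the 2.2% shown in the table.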

After an absence for a few development cycles, H. Hartley Sweeten is back at the top of the per-changeset list for the ongoing work on the Comedi drivers in the staging tree. This code, at just under 100,000 lines, has now seen nearly 8,000 patches — and the work continues. Mateusz Kulikowski worked entirely with the rtl8192e staging driver, while Chaehyun Lim and Leo Kim both fixed up the wilc1000 staging driver. Eric Biederman is engaged in a substantial reworking of how the network stack handles network namespaces, with an emphasis on proper handling of packets that cross namespaces.

On the lines-changed side, Alex Deucher continues to work on the AMD graphics drivers, Sreekanth Reddy removed a bunch of code from the mpt2sas driver (and, as a result, was the developer removing the most code in this cycle), and Yuval Mintz added a couple of QLogic Ethernet drivers. Christoph Hellwig did a fair amount of cleanup throughout the driver and block subsystems, while huangdaode (the only name that appears in the logs) added support for the HiSilicon network subsystem.

The sum of these developers' effort resulted in the net addition of 242,000 lines of code to the kernel in this development cycle.

Work on 4.4 was supported by 202 employers that we could identify, a slight increase from 4.3. The most active companies working on 4.4 were:

Most active 4.4 employers

By changesets
Intel                       1660  12.9%
(Unknown)                   1139   8.9%
(None)                       684   5.3%
Samsung                      670   5.2%
Red Hat                      655   5.1%
Atmel                        449   3.5%
Linaro                       448   3.5%
(Consultant)                 419   3.3%
Outreachy                    400   3.1%
IBM                          302   2.3%
Vision Engraving Systems     288   2.2%
Google                       273   2.1%
SUSE                         257   2.0%
ARM                          226   1.8%
Texas Instruments            210   1.6%
Freescale                    208   1.6%
Renesas Electronics          190   1.5%
AMD                          177   1.4%
Oracle                       173   1.3%
Broadcom                     169   1.3%

By lines changed
Intel                      85390  13.3%
(None)                     37078   5.8%
AMD                        36306   5.6%
Red Hat                    34937   5.4%
(Unknown)                  33739   5.2%
(Consultant)               30271   4.7%
Avago Technologies         27001   4.2%
QLogic                     24381   3.8%
Broadcom                   19318   3.0%
Atmel                      17856   2.8%
Samsung                    16508   2.6%
Linaro                     16154   2.5%
HiSilicon Technologies     15260   2.4%
Outreachy                  12765   2.0%
Renesas Electronics        11745   1.8%
Mellanox                   11590   1.8%
Freescale                  11392   1.8%
ARM                        10986   1.7%
IBM                        10402   1.6%
Texas Instruments          10345   1.6%

For many years, Red Hat stood alone at the top of both columns of this list. That situation has been changing for some time; at this point, it is more than fair to say that Red Hat has ceased to be the most active company in the kernel development community. That is not to slight the company's work, of course; Red Hat still funds many of our most active developers, and those developers, in the subsystem-maintainer role, signed off on 16% of the changes merged this time around. But, at this point, Red Hat is one of a number of top-tier companies working to improve the Linux kernel.

Speaking of signoffs, the most active developers and companies when it comes to signing off patches they did not write are:

Most non-author signoffs in 4.4

Developers
Greg Kroah-Hartman          2746  21.3%
David S. Miller             1048   8.1%
Daniel Vetter                447   3.5%
Andrew Morton                346   2.7%
Mark Brown                   343   2.7%
Ingo Molnar                  241   1.9%
Arnaldo Carvalho de Melo     224   1.7%
Tony Cho                     210   1.6%
Jeff Kirsher                 209   1.6%
Kalle Valo                   174   1.3%

Companies
Linux Foundation            2763  21.6%
Red Hat                     2060  16.1%
Intel                       1649  12.9%
Linaro                       820   6.4%
Google                       602   4.7%
(None)                       459   3.6%
SUSE                         392   3.1%
Atmel                        260   2.0%
Samsung                      260   2.0%
Facebook                     233   1.8%

The kernel's subsystem maintainers remain concentrated in relatively few companies though, arguably, they are spread out a bit more widely than they once were. While many companies are willing to support kernel development in specific areas, fewer of them see the need to support developers working in the subsystem-maintainer role.

In summary, 4.4, the final kernel development cycle for 2015, looks pretty typical. It was busier than most, but that, too, is typical, given the long-term trend toward larger development cycles. That busyness does not appear set to make this cycle longer than the 63 days that we have come to expect, though. Despite its occasional hiccups, the kernel-development machine continues to run smoothly.

Patches and updates

Kernel trees

Linus Torvalds: Linux 4.4-rc6
Sebastian Andrzej Siewior: 4.1.15-rt17
Kamal Mostafa: Linux 3.19.8-ckt12
Kamal Mostafa: Linux 3.13.11-ckt32

Core kernel code

serge.hallyn@ubuntu.com: CGroup Namespaces (v8)

Networking

Jarno Rajahalme: openvswitch: NAT support.
Craig Gallek: Faster SO_REUSEPORT

Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds