
Kernel development

Brief items

Kernel release status

The current development kernel is 3.3-rc7, released on March 10 despite Linus's earlier wish not to do any more 3.3 prepatches. "Now, none of the fixes here are all that scary in themselves, but there were just too many of them, and across various subsystems. Networking, memory management, drivers, you name it. And instead of having fewer commits than in -rc6, we have more of them. So my hope that things would calm down simply just didn't materialize."

Stable updates: the 3.2.10 and 3.0.24 updates were released on March 12; both contain a long list of important fixes. 3.2.11 followed one day later to fix a build problem.

The 2.6.34.11 update, containing almost 200 changes, is in the review process as of this writing.

Comments (none posted)

Quotes of the week

Programming is not just an act of telling a computer what to do: it is also an act of telling other programmers what you wished the computer to do. Both are important, and the latter deserves care.
-- Andrew Morton

Dammit, I'm continually surprised by the *idiots* out there that don't understand that binary compatibility is one of the absolute top priorities. The *only* reason for an OS kernel existing in the first place is to serve user-space. The kernel has no relevance on its own. Breaking existing binaries - and then not acknowledging how horribly bad that was - is just about the *worst* offense any kernel developer can do.
-- Linus Torvalds

Kernel developers have thick heads, in most cases thicker than processor manuals.
-- Borislav Petkov

Comments (9 posted)

Greg KH: The 2.6.32 Linux kernel

Greg Kroah-Hartman discusses the history of the 2.6.32 stable kernel series and why he has stopped supporting it. "With the 2.6.32 kernel being the base of these longterm enterprise distros, it was originally guessed that it would be hanging around for many many years to come. But, my old argument about how moving a kernel forward for an enterprise distro finally sunk in for 2 of the 3 major players. Both Oracle Linux and SLES 11, in their latest releases these few months, have moved to the 3.0 kernel as the base of them, despite leaving almost all other parts of the distro alone. They did this to take advantage of the better support for hardware, newer features, newer filesystems, and the hundreds of thousands of different changes that has happened in the kernel.org releases since way back in 2009."

Comments (none posted)

McKenney: Transactional Memory Everywhere: 2012 Update for HTM

Paul McKenney looks at the state of hardware transactional memory with an eye toward how it might be useful for current software. "Even with forward-progress guarantees, HTM is subject to aborts and rollbacks, which (aside from wasting energy) are failure paths. Failure code paths are in my experience difficult to work with. The possibility of failure is not handled particularly well by human brain cells, which are programmed for optimism. Failure code paths also pose difficulties for validations, particularly in cases where the probability of failure is low or in cases where multiple failures are required to reach a given code path."

Comments (20 posted)

A proposed plan for control groups

By Jonathan Corbet
March 14, 2012
After the late-February discussion on the future of control groups, Tejun Heo has boiled down the comments and come to some conclusions as to where he would like to go with this subsystem. The first of these is that multiple hierarchies are doomed in the long term:

At least to me, nobody seems to have strong enough justification for orthogonal multiple hierarchies, so, yeah, unless something else happens, I'm scheduling multiple hierarchy support for the chopping block. This is a long term thing (think years), so no need to panic right now and as is life plans may change and fail to materialize, but I intend to at least move away from it.

So there will, someday, be a single control group hierarchy. It will not, however, be tied to the process tree; it will be an independent tree of groups allowing processes to be combined in arbitrary ways.

The responses to Tejun's conclusions have mostly focused on details (how to handle controllers that are not fully hierarchical, for example). There does not appear to be any determined opposition to the idea of removing the multiple hierarchy feature at some point when it can be done without breaking systems, so users of control groups should consider the writing to be on the wall.

Comments (3 posted)

Kernel development news

Kernel competition in the enterprise space

By Jonathan Corbet
March 14, 2012
Kernel developers like to grumble about the kernels shipped by enterprise distributions. Those kernels tend to be managed in ways that ignore the best features of the Linux development process; indeed, sometimes they seem to work against that process. But, enterprise kernels and the systems built on them are also the platform on which the money that supports kernel development is made, so developers only push their complaints so far. For years, it has seemed that nothing could change the "enterprise mindset," but recent releases show that there may, indeed, be change brewing in this area.

Consider Red Hat Enterprise Linux 6; its kernel is ostensibly based on the 2.6.32 release. The actual kernel, as shipped by Red Hat, differs from 2.6.32 by around 7,700 patches, though. Many of those are fixes, but others are major new features, often backported from more recent releases. Thus, the RHEL "2.6.32" kernel includes features like per-session group scheduling, receive packet/flow steering, transparent huge pages, pstore, and, of course, support for a wide range of hardware that was not available when 2.6.32 shipped. Throw in a few out-of-tree features (SystemTap, for example), and the end result is a kernel far removed from anything shipped by kernel.org. That is why Red Hat has had no real use for the 2.6.32 stable kernel series for some years.

Red Hat's motivation for creating these kernels is not hard to understand; the company is trying to provide its customers with a combination of the stability that comes from well-aged software and the features, fixes, and performance improvements from the leading edge. This process, when it goes well, can give those customers the best of both worlds. On the other hand, the resulting kernels differ widely from the community's product, have not been tested by the community, and exclude recent features that have not been chosen for backporting. They are also quite expensive to create; behind Red Hat's many high-profile kernel hackers is an army of developers tasked with backporting features and keeping the resulting kernel stable and secure.

When developers grumble about enterprise kernels, what they are really saying is that enterprise distributions might be better served by simply updating to more current kernels. In the process they would get all those features, improvements, and bug fixes from the community, in the form that they were developed and tested by that community. Enterprise distributors shipping current kernels could dispense with much of their support expense and could better benefit from shared maintenance of stable kernel releases. The response that typically comes back is that enterprise customers worry about kernel version bumps (though massive changes hidden behind a minor number change are apparently not a problem) and that new kernels bring new bugs with them. The cost of stabilizing a new kernel release, it is suggested, could exceed that of backporting desired features into an older release.

Given that, it is interesting to see two other enterprise distributors pushing forward with newer kernels. Both SUSE Linux Enterprise Server 11 Service Pack 2 and Oracle's Unbreakable Enterprise Kernel Release 2 feature much more recent kernels - 3.0.10 and 3.0.16, respectively. In each case, the shift to a newer kernel is a clear attempt to create a more attractive distribution; we may be seeing the beginning of a change in the longstanding enterprise mindset.

SUSE seems firmly stuck in a second-place market position relative to Red Hat. As a result, the company will be searching for ways to differentiate its distribution from RHEL. SUSE almost certainly also lacks the kind of resources that Red Hat is able to apply to its enterprise kernels, so it will be looking for cheaper ways to provide a competitive set of features. Taking better advantage of the community's work by shipping more current kernels is one obvious way to do that. By shipping recent releases, SUSE does not have to backport fixes and features, and it is able to take advantage of the long-term stable support planned for the 3.0 kernel. In that context, it is not entirely surprising that SUSE has repeatedly pulled its customers forward, jumping from 2.6.27 to 2.6.32 in the Service Pack 1 release, then to 3.0.

Oracle, too, has a need to differentiate its distribution - even more so, given that said distribution is really just a rebranded RHEL. To that end, Oracle would like to push some of its in-house features like btrfs, which is optimistically labeled "production-ready" in a recent press release. If btrfs is indeed ready for production use, it certainly has only gotten there in very recent releases; moving to the 3.0 kernel allows Oracle to push this feature while minimizing the amount of work required to backport the most recent fixes. Oracle is offering this kernel with releases 5 and 6 of Oracle Linux; had Oracle stuck with Red Hat's RHEL 5 kernel, Oracle Linux 5 users would still be running something based on 2.6.18. For a company trying to provide a more feature-rich distribution on a budget, dropping in a current kernel must seem like a bargain.

What about the down side of new kernels - all those new bugs? Both companies have clearly tried to mitigate that risk by letting 3.0 stabilize for six months or so before shipping it to customers. There have been over 1,500 fixes applied in the 24 updates to 3.0 released so far. The real proof, though, is in users' experience. If SLES or Oracle Linux users experience bugs or performance regressions as a result of the kernel version change, they may soon start looking for alternatives. In the Oracle case, the original Red Hat kernel remains an option for customers; SUSE, instead, seems committed to the newer version.

Between these two distributions there should be enough users to eventually establish whether moving to newer kernels in the middle of an enterprise distribution's support period is a smart move or not. If it works out, SUSE and Oracle may benefit from an influx of customers who are tired of Red Hat's hybrid kernels. If the new kernels prove not to be enterprise-ready, instead, Red Hat's position may become even stronger. Learning which way things will go may take a while. Should Red Hat show up one day with a newer kernel for RHEL customers, though, we'll know that the issue has been decided at last.

Comments (10 posted)

The trouble with stable pages

By Jonathan Corbet
March 13, 2012
Traditionally, the kernel has allowed the modification of pages in memory while those pages are in the process of being written back to persistent storage. If a process writes to a section of a file that is currently under writeback, that specific writeback operation may or may not contain all of the most recently written data. This behavior is not normally a problem; all the data will get to disk eventually, and developers (should) know that if they want to get data to disk at a specific time, they should use the fsync() system call to get it there. That said, there are times when modifying under-writeback pages can create problems; those problems have been addressed, but now it appears that the cure may be as bad as the disease.
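
As a reminder of that pattern, here is a minimal userspace sketch (the file name and message are arbitrary): the write() may sit in the page cache indefinitely, while fsync() does not return until the data is on stable storage.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
	const char line[] = "appended log line\n";
	int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, line, strlen(line)) < 0)
		perror("write");
	/* The data may linger in the page cache; fsync() blocks until
	 * it (and the file's metadata) has reached the disk. */
	if (fsync(fd) < 0)
		perror("fsync");
	return close(fd);
    }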

Some storage hardware can transmit and store checksums along with data; those checksums can provide assurance that the data written to (or read from) disk matches what the processor thought it was writing. If the data in a page changes after the calculation of the checksum, though, that data will appear to be corrupted when the checksum is verified later on. Volatile data can also create problems on RAID devices and with filesystems implementing advanced features like data compression. For all of these reasons, the stable pages feature was added to ext4 for the 3.0 release (some other filesystems, btrfs included, have had stable pages for some time). With this feature, pages under writeback are marked as not being writable; any process attempting to write to such a page will block until the writeback completes. It is a relatively simple change that makes system behavior more deterministic and predictable.

That was the thought, anyway, and things do work out that way most of the time. But, occasionally, as described by Ted Ts'o, processes performing writes can find themselves blocked for lengthy periods of time (multiple seconds). Occasional latency spikes are not the sort of deterministic behavior the developers were after; they also leave users unamused.

In a general sense, it is not hard to imagine what may be going on after seeing this kind of problem report. The system in question is very busy, with many processes contending for the available I/O bandwidth. One process is happily minding its own business while appending to its log file. At some point, though, the final page in that log file is submitted for writeback; it then becomes unwritable. As soon as our hapless process tries to add another line to the file, it will be blocked waiting for that writeback to complete. Since the disks are contended and the I/O queues are long, that wait can go on for some time. By the time the process is allowed to proceed, it has suffered an extensive, unexpected period of latency.

Ted's proposed solution was to only implement stable pages if the data integrity features are built into the kernel. That fix is unlikely to be merged in that form for a few reasons. Many distributor kernels are likely to have the feature enabled, but it will actually be used on relatively few systems. As noted above, there are other places where changing data in pages under writeback can create problems. So the real solution may be some sort of runtime switch - perhaps a filesystem mount option - indicating when stable pages are needed.

It is also possible that the real problem is somewhere else. Chris Mason expressed discomfort with the idea of only using stable pages where they are strictly needed:

I'm not against only turning on stable pages when they are needed, but the code that isn't the default tends to be somewhat less used. So it does increase testing burden when we do want stable pages, and it tends to make for awkward bugs that are hard to reproduce because someone neglects to mention it.

According to Chris, writeback latencies simply should not be seen on the scale of multiple seconds; he would like to see some effort put into figuring out why that is happening. Then, perhaps, the real problem could be fixed. But it may be that the real problem is simply that the system's resources are heavily oversubscribed and the I/O queues are long. In that case, a real fix may be hard to come by.

Boaz Harrosh suggested avoiding writeback on the final pages of any files that have been modified in the last few seconds. That might help in the "appending to a log file" case, but will not avoid unpredictable latency resulting from modification of the file at any location other than the end. People have suggested that pages modified while under writeback could be copied, allowing the modification to proceed immediately and not interfere with the writeback. That solution, though, requires more memory (perhaps during a time when the system is desperately trying to free memory) and copying pages is not free. Another option, suggested by Ted, would be to add a callback to be invoked by the block layer just before a page is passed on to the device; that callback could calculate checksums and mark the page unwritable only for the (presumably much shorter) time that it is actually under I/O.

Other solutions certainly exist. The first step, though, would appear to be to get a real handle on the problem so that solutions are written with an understanding of where the latency is actually coming from. Then, perhaps, we can have a stable pages implementation that provides stable data with stable latency in all situations.

Comments (15 posted)

A deep dive into CMA

March 14, 2012

This article was contributed by Michal "mina86" Nazarewicz

The Contiguous Memory Allocator (or CMA), which LWN looked at back in June 2011, has been developed to allow allocation of big, physically-contiguous memory blocks. Simple in principle, it has grown quite complicated, requiring cooperation between many subsystems. Depending on one's perspective, there are different things to do and to watch out for with CMA. In this article, I will describe how to use CMA and how to integrate it with a given platform.

From a device driver author's point of view, nothing should change. CMA is integrated with the DMA subsystem, so the usual calls to the DMA API (such as dma_alloc_coherent()) should work as usual. In fact, device drivers should never need to call the CMA API directly, since instead of bus addresses and kernel mappings it operates on pages and page frame numbers (PFNs), and provides no mechanism for maintaining cache coherency.

For more information, see Documentation/DMA-API.txt and Documentation/DMA-API-HOWTO.txt. Those two documents describe the provided functions and give usage examples.
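
As a rough illustration of the driver's-eye view, here is a hedged sketch of a driver allocating a large coherent buffer through the standard DMA API; foo_setup_buffer() and the 4 MiB size are assumptions for illustration only.

    #include <linux/dma-mapping.h>

    static int foo_setup_buffer(struct device *dev)
    {
	dma_addr_t bus_addr;
	void *cpu_addr;

	/* With CMA wired into the architecture's DMA layer, a large
	 * coherent allocation like this one may be satisfied from a
	 * CMA area; the driver neither knows nor cares. */
	cpu_addr = dma_alloc_coherent(dev, 4 << 20, &bus_addr, GFP_KERNEL);
	if (!cpu_addr)
		return -ENOMEM;

	/* ... program bus_addr into the device, use cpu_addr ... */

	dma_free_coherent(dev, 4 << 20, cpu_addr, bus_addr);
	return 0;
    }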

Architecture integration

Of course, someone has to integrate CMA with the DMA subsystem of a given architecture. This is performed in a few fairly easy steps.

CMA works by reserving memory early at boot time. This memory, called a CMA area or a CMA context, is later returned to the buddy allocator so that it can be used by regular applications. To do the reservation, one needs to call:

    void dma_contiguous_reserve(phys_addr_t limit);

just after the low-level "memblock" allocator is initialized but prior to the buddy allocator setup. On ARM, for example, it is called in arm_memblock_init(); on x86 it is called in setup_arch(), just after memblock is set up.

The limit argument specifies the physical address above which no memory will be prepared for CMA. The intention is to limit CMA contexts to addresses that DMA can handle. In the case of ARM, the limit is the minimum of arm_dma_limit and arm_lowmem_limit. Passing zero will allow CMA to allocate its context as high as it wants. The only constraint is that the reserved memory must belong to the same zone.
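
A minimal sketch of what such a call site might look like; foo_arch_memblock_init() and the 1 GiB limit are hypothetical, standing in for whichever arch hook runs after memblock is populated.

    #include <linux/init.h>
    #include <linux/dma-contiguous.h>

    void __init foo_arch_memblock_init(void)
    {
	/* ... memblock has been populated from the boot memory map ... */

	/* Keep the default CMA context below 1 GiB so that limited
	 * DMA engines can reach it; pass 0 to impose no limit. */
	dma_contiguous_reserve(0x40000000);
    }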

The amount of reserved memory depends on a few Kconfig options and a cma kernel parameter. I will describe them further down in the article.

The dma_contiguous_reserve() function will reserve memory and prepare it to be used with CMA. On some architectures (eg. ARM) some architecture-specific work needs to be performed as well. To allow that, CMA will call the following function:

    void dma_contiguous_early_fixup(phys_addr_t base, unsigned long size);

It is the architecture's responsibility to provide it along with its declaration in the asm/dma-contiguous.h header file. If a given architecture does not need any special handling, it's enough to provide an empty function definition.

It will be called quite early, so some kernel facilities (kmalloc(), for example) will not yet be available. Furthermore, it may be called several times (since, as described below, several CMA contexts may exist).
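
For an architecture with nothing to fix up, one way to satisfy this requirement is an empty inline definition in the header itself; the hypothetical header guard below stands in for a real architecture's.

    /* asm/dma-contiguous.h on a hypothetical architecture that needs
     * no special handling of newly reserved CMA areas. */
    #ifndef ASMFOO_DMA_CONTIGUOUS_H
    #define ASMFOO_DMA_CONTIGUOUS_H

    #include <linux/types.h>

    static inline void dma_contiguous_early_fixup(phys_addr_t base,
						  unsigned long size)
    {
	/* nothing to do */
    }

    #endif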

The second thing to do is to change the architecture's DMA implementation to use the whole machinery. To allocate CMA memory one uses:

    struct page *dma_alloc_from_contiguous(struct device *dev, int count, unsigned int align);

Its first argument is the device that the allocation is performed on behalf of. The second specifies the number of pages (not bytes or order) to allocate. The third argument is the alignment expressed as a page order; it enables allocation of buffers whose physical addresses are aligned to 2^align pages. To avoid fragmentation, pass zero here if at all possible. It is worth noting that there is a Kconfig option (CONFIG_CMA_ALIGNMENT) which specifies the maximum alignment accepted by the function. Its default value is 8, meaning 256-page alignment.

The return value is the first of a sequence of count allocated pages.

To free the allocated buffer, one needs to call:

    bool dma_release_from_contiguous(struct device *dev, struct page *pages, int count);

The dev and count arguments are the same as before, whereas pages is what dma_alloc_from_contiguous() returned. If the region passed to the function did not come from CMA, the function will return false. Otherwise, it will return true. This removes the need for higher-level functions to track which allocations were made with CMA and which were made using some other method.
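
Putting the pair together, a hedged usage sketch for a 16-page buffer with no alignment requirement; foo_use_pages() is a hypothetical consumer of the buffer.

    static int foo_alloc_and_release(struct device *dev)
    {
	struct page *pages = dma_alloc_from_contiguous(dev, 16, 0);

	if (!pages)
		return -ENOMEM;

	foo_use_pages(pages, 16);

	/* Returns false only if the pages did not come from CMA. */
	WARN_ON(!dma_release_from_contiguous(dev, pages, 16));
	return 0;
    }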

Beware that dma_alloc_from_contiguous() may not be called from atomic context. It performs some “heavy” operations such as page migration, direct reclaim, etc., which may take a while. Because of that, to make dma_alloc_coherent() and friends work as advertised, the architecture needs to have a different method of allocating memory in atomic context.

The simplest solution is to put aside a bit of memory at boot time and perform atomic allocations from that. This is in fact what ARM is doing. Existing architectures most likely already have a special path for atomic allocations.
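
A hedged sketch of the resulting split in an architecture's allocation backend; the foo_*() helpers and the atomic pool behind them are hypothetical.

    static struct page *foo_dma_alloc_pages(struct device *dev, int count,
					    gfp_t gfp)
    {
	/* dma_alloc_from_contiguous() may sleep (migration, reclaim),
	 * so atomic requests must be served from the boot-time pool. */
	if (!(gfp & __GFP_WAIT))
		return foo_alloc_from_atomic_pool(count);

	return dma_alloc_from_contiguous(dev, count, 0);
    }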

Special memory requirements

At this point, most drivers should “just work”. They use the DMA API, which calls CMA. Life is beautiful. Except that some devices may have special memory requirements. For instance, Samsung's S5P Multi-format codec requires buffers to be located in different memory banks (which allows reading them through two memory channels, thus increasing memory bandwidth). Furthermore, one may want to separate some devices' allocations from others to limit fragmentation within CMA areas.

CMA operates on contexts. Devices use one global area by default, but private contexts can be used as well. There is a many-to-one mapping between struct devices and a struct cma (i.e., a CMA context). This means that a single device driver needs to have separate struct device objects to use more than one CMA context, while at the same time several struct device objects may point to the same CMA context.

To assign a CMA context to a device, all one needs to do is call:

    int dma_declare_contiguous(struct device *dev, unsigned long size,
			       phys_addr_t base, phys_addr_t limit);

As with dma_contiguous_reserve(), this needs to be called after memblock initializes but before too much memory gets grabbed from it. For ARM platforms, a convenient place to put the call to this function is in the machine's reserve() callback. This won't work for automatically probed devices or those loaded as modules, so some other mechanism will be needed if those kinds of devices require CMA contexts.

The first argument of the function is the device that the new context is to be assigned to. The second specifies the size in bytes (not in pages) to reserve for the area. The third is the physical address of the area or zero. The last one has the same meaning as dma_contiguous_reserve()'s limit argument. The return value is either zero or a negative error code.
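
A hedged sketch of a board file assigning a private 32 MiB area to a codec from the machine's reserve() callback; foo_codec_dev is a hypothetical platform device and the size is illustrative.

    static void __init foo_machine_reserve(void)
    {
	/* base = 0 lets CMA choose the placement; limit = 0 imposes
	 * no upper bound beyond the usual zone constraint. */
	if (dma_declare_contiguous(&foo_codec_dev.dev, 32 << 20, 0, 0))
		pr_warn("foo: unable to reserve codec CMA area\n");
    }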

There is a limit to how many “private” areas can be declared, namely CONFIG_CMA_AREAS. Its default value is seven but it can be safely increased if the need arises.

Things get a little bit more complicated if the same non-default CMA context needs to be used by two or more devices. The current API does not provide a trivial way to do that. What can be done is to use dev_get_cma_area() to figure out the CMA area that one device is using, and dev_set_cma_area() to set the same context to another device. This sequence must be called no sooner than in postcore_initcall(). Here is how it might look:

    static int __init foo_set_up_cma_areas(void)
    {
	struct cma *cma;

	/* device1 already has its private context; share it with device2. */
	cma = dev_get_cma_area(device1);
	dev_set_cma_area(device2, cma);
	return 0;
    }
    postcore_initcall(foo_set_up_cma_areas);

As a matter of fact, there is nothing special about the default context that is created by the dma_contiguous_reserve() function. It is in no way required and the system will work without it. If there is no default context, dma_alloc_from_contiguous() will return NULL for devices without assigned areas. dev_get_cma_area() can be used to distinguish between this situation and allocation failure.
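
A hedged sketch of that distinction; foo_try_cma() is hypothetical.

    static struct page *foo_try_cma(struct device *dev, int count)
    {
	struct page *pages = dma_alloc_from_contiguous(dev, count, 0);

	/* NULL plus no assigned area means "no CMA context",
	 * not a genuine allocation failure. */
	if (!pages && !dev_get_cma_area(dev))
		dev_dbg(dev, "no CMA context assigned\n");

	return pages;
    }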

dma_contiguous_reserve() does not take a size as an argument, so how does it know how much memory should be reserved? There are two sources of this information:

There is a set of Kconfig options, which specify the default size of the reservation. All of those options are located under “Device Drivers” » “Generic Driver Options” » “Contiguous Memory Allocator” in the Kconfig menu. They allow choosing from four possibilities: the size can be an absolute value in megabytes, a percentage of total memory, the smaller of the two, or the larger of the two. The default is to allocate 16 MiB.

There is also a cma= kernel command line option. It lets one specify the size of the area at boot time without the need to recompile the kernel. This option specifies the size in bytes and accepts the usual suffixes.
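
For example, appending the following to the kernel command line (64 MiB is an arbitrary illustrative value) overrides the Kconfig default:

    cma=64M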

So how does it work?

To understand how CMA works, one needs to know a little about migrate types and pageblocks.

When requesting memory from the buddy allocator, one provides a gfp_mask. Among other things, it specifies the "migrate type" of the requested page(s). One of the migrate types is MIGRATE_MOVABLE. The idea behind it is that data from a movable page can be migrated (or moved, hence the name), which works well for disk caches, process pages, etc.

To keep pages with the same migrate type together, the buddy allocator groups pages into "pageblocks," each having a migrate type assigned to it. The allocator then tries to allocate pages from pageblocks with a type corresponding to the request. If that's not possible, however, it will take pages from different pageblocks and may even change a pageblock's migrate type. This means that a non-movable page can be allocated from a MIGRATE_MOVABLE pageblock, which can also result in that pageblock changing its migrate type. This is undesirable for CMA, so it introduces a MIGRATE_CMA type which has one important property: only movable pages can be allocated from a MIGRATE_CMA pageblock.
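
For instance, a page-cache-style allocation like the sketch below requests movable memory (GFP_HIGHUSER_MOVABLE includes __GFP_MOVABLE), so the buddy allocator is free to satisfy it from a MIGRATE_CMA pageblock; the wrapper function is hypothetical.

    #include <linux/gfp.h>

    static struct page *foo_alloc_movable_page(void)
    {
	/* Movable: may land in a MIGRATE_CMA pageblock. */
	return alloc_page(GFP_HIGHUSER_MOVABLE);
    }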

So, at boot time, when the dma_contiguous_reserve() and/or dma_declare_contiguous() functions are called, CMA talks to memblock to reserve a portion of RAM, just to give it back to the buddy system later on with the underlying pageblocks' migrate type set to MIGRATE_CMA. The end result is that all the reserved pages end up back in the buddy allocator, so they can be used to satisfy movable page allocations.

During CMA allocation, dma_alloc_from_contiguous() chooses a page range and calls:

     int alloc_contig_range(unsigned long start, unsigned long end,
     	                    unsigned migratetype);

The start and end arguments specify the page frame numbers (that is, the PFN range) of the target memory. The last argument, migratetype, indicates the migration type of the underlying pageblocks; in the case of CMA, this is MIGRATE_CMA. The first thing this function does is to mark the pageblocks contained within the [start, end) range as MIGRATE_ISOLATE. The buddy allocator will never touch a pageblock with that migrate type. Changing the migrate type does not magically free pages, though; this is why __alloc_contig_migrate_range() is called next. It scans the PFN range and looks for pages that can be migrated away.

Migration is the process of copying a page to some other portion of system memory and updating any references to it. The former is straightforward and the latter is handled by the memory management subsystem. After its data has been migrated, the old page is freed by giving it back to the buddy allocator. This is why the containing pageblocks had to be marked as MIGRATE_ISOLATE beforehand. Had they been given a different migrate type, the buddy allocator would not think twice about using them to fulfill other allocation requests.

Now all of the pages that alloc_contig_range() cares about are (hopefully) free. The function takes them away from the buddy system, then changes the pageblocks' migrate type back to MIGRATE_CMA. Those pages are then returned to the caller.

Freeing memory is a much simpler process. dma_release_from_contiguous() delegates most of its work to:

     void free_contig_range(unsigned long pfn, unsigned nr_pages);

which simply iterates over all the pages and returns them to the buddy system.

Epilogue

The Contiguous Memory Allocator patch set has come a long way from its first version (and even longer from its predecessor, Physical Memory Management, posted almost three years ago). Along the way it lost some of its functionality, but it got better at what it still does. On complex platforms, it is likely that CMA won't be usable on its own, but will be used in combination with ION and dmabuf.

Even though it is at its 23rd version, CMA is still not perfect and, as always, there's still a lot that can be done to improve it. Hopefully though, getting it finally merged into the -mm tree will get more people working on it to create a solution that benefits everyone.

Comments (none posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.3-rc7
Greg KH Linux 3.2.10
Greg KH Linux 3.2.11
Greg KH Linux 3.0.24
Steven Rostedt 3.0.24-rt41
Steven Rostedt 3.0.23-rt39
Steven Rostedt 3.0.23-rt40

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Michael Kerrisk (man-pages) man-pages-3.37 is released

Filesystems and block I/O

Memory management

Networking

Security-related

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds