Kernel development
Brief items
Kernel release status
The current stable 2.6 kernel is 2.6.11.6, which was released (with a handful of security patches) on March 25.

The current 2.6.12 prepatch remains 2.6.12-rc1; no 2.6.12 prepatches have been released in the last week.
Linus's BitKeeper repository contains a number of architecture updates, an XFS update, some netpoll improvements, a new __nocast annotation which allows "sparse" to catch certain type mismatches, a change from io_remap_page_range() to io_remap_pfn_range(), and lots of fixes.
The current -mm tree is 2.6.12-rc1-mm3. Recent changes to -mm include the addition of David Miller's networking tree and Herbert Xu's crypto tree, some core page table handling cleanups, a big DVB update, a number of cleanups to the (ugly and insecure) ISO9660 filesystem code, and lots of fixes.
The current 2.4 prepatch is 2.4.30-rc4, released by Marcelo on March 30 with a couple of regression fixes. Previously, 2.4.30-rc3 was released on March 26. The -rc3 patch contained a single fix to a serious problem introduced in 2.4.30-rc2 which had been released (with several fixes) the day before.
Kernel development news
Quote of the week
Realtime preemption and read-copy-update
Ingo Molnar's massive realtime preemption patch is an attempt to bring near-realtime response to the stock Linux kernel. It works by making almost everything in the kernel preemptible: spinlocks turn into preemptible mutexes, interrupt handlers get moved into preemptible kernel threads, and so on. The result is a major change in how kernel code is scheduled, and quick response to external events. This work has been quieter in recent times, but it has not stalled by any means.

When LWN last looked at the realtime preemption patch, one of the remaining rough spots was its interaction with the read-copy-update (RCU) mechanism. RCU, remember, encapsulates a conceptually simple (though somewhat gnarlier in the implementation) technique. A resource of interest (a routing table entry, say) is referenced by a pointer. When that resource must be changed, a copy is made and the changes are applied there; the pointer is then redirected to the new copy. At some future, safe time, the old version can be freed. Linux RCU works by requiring that all accesses to RCU-protected data structures be atomic; with that constraint, a "safe time" can be defined as "after every processor on the system has scheduled." Since scheduling while holding a reference to an RCU-protected structure is against the rules, any such structure which was made inaccessible before all processors schedule cannot be referenced by any processor afterward.
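In code, the classic RCU pattern looks something like the following minimal sketch; the route_entry structure and the helper names are illustrative, not taken from any real subsystem:

struct route_entry {
    unsigned long dest;
    struct rcu_head rcu;            /* for deferred freeing */
};
static struct route_entry *current_route;

/* Reader side: no locks are taken, but sleeping is not allowed. */
unsigned long lookup_dest(void)
{
    unsigned long dest;

    rcu_read_lock();
    dest = rcu_dereference(current_route)->dest;
    rcu_read_unlock();
    return dest;
}

static void free_route(struct rcu_head *head)
{
    kfree(container_of(head, struct route_entry, rcu));
}

/* Updater side: publish the new copy, then queue the old one to be
 * freed once every processor has scheduled. */
void update_route(struct route_entry *new_route)
{
    struct route_entry *old = current_route;

    rcu_assign_pointer(current_route, new_route);
    call_rcu(&old->rcu, free_route);
}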
Since accesses to RCU-protected structures must be atomic, the RCU locking function (rcu_read_lock()) disables preemption. But disabling preemption is exactly what the realtime preemption patch is trying to get away from, so something had to give. Ingo had solved this problem by requiring that all RCU users identify an explicit lock which protects the structures in question, and by modifying the RCU locking functions to take that lock as a parameter. This approach was never optimal. It required the creation of a whole family of new RCU functions to cope with every type of lock that might be used and, at the same time, decreased the flexibility of the RCU read locking mechanism. To a great extent, it simply replaced RCU with more traditional locking which, while it works, lacks the scalability advantages that were the motivation for RCU in the first place.
The RCU issue was clearly on Ingo's mind:
So Ingo was pleased when RCU creator Paul McKenney proposed some approaches for making RCU and realtime preemption work together. Paul's message goes through a series of increasingly complex solutions, and is worth reading in its own right. The core idea, however, is that, in a fully preemptible world, RCU cannot depend on atomic access to data structures, and thus cannot use the "all processors have scheduled" heuristic to know that the time has come to execute a given set of RCU cleanup functions. So the tracking of code executing within RCU critical sections must be made more explicit. Paul's solutions used a reader/writer lock for that purpose, but the approach taken in Ingo's latest realtime preemption patch is a little different.
The code executed to go into an RCU-protected section now looks like this (when configured for realtime preemption):
void rcu_read_lock(void)
{
    if (current->rcu_read_lock_nesting++ == 0) {
        current->rcu_data = &get_cpu_var(rcu_data);
        atomic_inc(&current->rcu_data->active_readers);
        smp_mb__after_atomic_inc();
        put_cpu_var(rcu_data);
    }
}
The idea is simple: a per-CPU count of processes in RCU critical sections is kept. When a process goes into a critical section, a pointer to the current CPU's counter is stored with the task information, so that the right counter will be decremented later on. There is also a per-process variable which keeps track of RCU section nesting. No further work needs to be done before the process can access the protected structure; in particular, no locks are acquired.
When the process exits the critical section, this procedure is reversed: the nesting count is decremented. When that count reaches zero, the per-CPU count is decremented as well. If the per-CPU count drops to zero, that processor is deemed to have "quiesced," with no processes running within RCU critical sections. Once all CPUs have quiesced in this way (as tracked by a bitmask of processors in the system), all RCU cleanup functions queued before their respective processors quiesced can be called.
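The exit path does not appear in the excerpt above; reconstructed from the description (this is a sketch, not the actual patch code), the matching rcu_read_unlock() would look roughly like:

void rcu_read_unlock(void)
{
    if (--current->rcu_read_lock_nesting == 0) {
        /* Leaving the outermost critical section: drop this task's
         * reference on the per-CPU reader count that was taken in
         * rcu_read_lock(). */
        smp_mb__before_atomic_dec();
        atomic_dec(&current->rcu_data->active_readers);
        current->rcu_data = NULL;
    }
}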
This scheme restores the core RCU functionality, allowing lock-free access to fast-path data structures. It also retains the current RCU API, with the result that the realtime preemption patch becomes significantly less intrusive. It is not a perfect implementation, however. It requires that each CPU regularly find itself with no processes executing within RCU critical sections. Since these sections are now preemptible, the "quiet" times could be quite far apart on heavily-loaded systems. While the system is waiting for a processor to quiesce, the RCU callback structures for the cleanup functions will continue to accumulate, to the point that quite a bit of memory could be used before the cleanup actually happens. For the realtime case, this tradeoff is acceptable: latency, not memory use, is the most important factor. Since the existing RCU algorithm is used when realtime preemption is not configured in, everybody should be happy. In practice, further work may be required; in particular, it may be necessary to find a way to force RCU cleanup when the system gets low on memory. Meanwhile, however, the realtime preemption patch appears to have gotten past one more major hurdle on its way toward possible inclusion into the mainline.
The __nocast attribute
Attentive readers of patches being merged for 2.6.12-rc2 will have noticed the use of a new attribute: __nocast. For example, the prototype of kmalloc() has changed to:
void *kmalloc(size_t size, unsigned int __nocast flags);
For normal compilation, this attribute expands to an empty string; it has no effect. When the sparse tool is being used, however, the __nocast attribute disables many of the implicit type conversions performed by the compiler. In the kmalloc() case, sparse will complain whenever a signed integer value is passed as the flags argument. Since the GFP flags passed to kmalloc() are explicitly defined as unsigned values, they will not cause a warning to be issued. Any normal integer variable or constant, however, will be flagged. Similarly, the use of an integer value where an enumerated type is expected will be caught. Thus, this little tweak should help with the automated detection of another class of errors that the compiler will not find.
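As a quick illustration (the example function is invented for the purpose), this is the sort of thing sparse will and will not accept:

void *kmalloc(size_t size, unsigned int __nocast flags);

void example(void)
{
    int my_flags = GFP_KERNEL;      /* a signed integer variable */

    kmalloc(64, GFP_KERNEL);        /* fine: GFP flags are unsigned */
    kmalloc(64, my_flags);          /* sparse complains: implicit cast */
}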
io_remap_pfn_range()
io_remap_page_range() has always been a strange function. Its stated purpose is to portably map I/O memory into a process's address space. Its prototype has always differed from one system to the next, however, making portable use difficult. On most architectures it looks like this:
int io_remap_page_range(struct vm_area_struct *vma, unsigned long virt_addr,
                        unsigned long phys_addr, unsigned long size,
                        pgprot_t prot);
The sparc64 architecture, however, defines it this way:
int io_remap_page_range(struct vm_area_struct *vma, unsigned long virt_addr,
                        unsigned long phys_addr, unsigned long size,
                        pgprot_t prot, int space);
The extra argument (space) was necessary to deal with the inconvenient fact that I/O addresses on the sparc64 architecture would not fit into an unsigned long variable.
The change from remap_page_range() to remap_pfn_range() was done, in part, to address (so to speak) this issue. Since remapping must be done on a page-aligned basis anyway, there is no real point in using a regular physical address, which contains the offset within the page. Said offset, after all, must be zero. By using a page frame number instead, the range of the phys_addr argument is extended far enough to reach into I/O memory on all architectures. The remap_pfn_range() work stopped short of actually fixing the io_remap_page_range() problem, however.
Randy Dunlap has now finished the task with a set of patches adding io_remap_pfn_range():
int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
                       unsigned long pfn, unsigned long size,
                       pgprot_t prot);
This function has the same prototype on all architectures. In-tree callers have been modified, and the feature removal schedule has been updated: io_remap_page_range() will go away in September, 2005.
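As an illustration of the conversion, a driver's mmap() method now looks something like this sketch, where mydev_phys_addr is a hypothetical base address for the device's memory:

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /* The physical address is shifted down to a page frame number,
     * which fits in an unsigned long on every architecture. */
    if (io_remap_pfn_range(vma, vma->vm_start,
                           mydev_phys_addr >> PAGE_SHIFT,
                           size, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}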
Network block devices and OOM safety
iSCSI is, for all practical purposes, a way of attaching storage devices to a fast network interconnect and making them look like local SCSI drives. There is a great deal of interest in iSCSI for high-end "storage area network" applications, and a few competing iSCSI implementations exist for Linux. Top-quality Linux iSCSI support would be a good thing to have; it turns out, however, that iSCSI raises an interesting issue with how the block subsystem works, especially when it must interact with the networking layer.

When the system gets short of memory, one of the things it must do is to force dirty pages to be written to their backing store, so that those pages may be freed. This activity becomes doubly urgent when the system runs completely out of memory. What happens, however, if the act of writing those pages to disk also requires a memory allocation? In the iSCSI case, those pages must be written via a TCP socket, so the networking layer must be able to allocate enough memory to handle the TCP protocol's needs. If the system is completely out of memory, where will this additional allocation come from?
This particular problem was solved for the block layer some time ago with the mempool mechanism. A mempool sets aside a certain amount of memory for emergencies. When all else fails, the block layer can allocate needed memory from the mempool; in that way, it is guaranteed of being able to make at least some progress and free memory for the system.
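For those who have not encountered the interface, setting up and drawing from a mempool looks roughly like the following sketch (the cache and pool names are illustrative):

static kmem_cache_t *my_cache;      /* slab cache, created elsewhere */
static mempool_t *my_pool;

static int my_pool_init(void)
{
    /* Keep sixteen objects in reserve for when the slab allocator
     * fails under memory pressure. */
    my_pool = mempool_create(16, mempool_alloc_slab,
                             mempool_free_slab, my_cache);
    return my_pool ? 0 : -ENOMEM;
}

static void *my_get_object(void)
{
    /* Falls back to the reserve when memory is tight, so writeout
     * can always make some progress. */
    return mempool_alloc(my_pool, GFP_NOIO);
}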
A similar mechanism could be put in place for network-based devices, probably through a special socket option which would cause a mempool to be set up for a specific connection. Attaching a mempool to a socket would guarantee that the system could send data through that connection. Unfortunately, in this case, using a mempool in this way does not solve the entire problem.
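No such option exists; in rough, purely hypothetical terms, the idea might look like this from user space (SO_MEMPOOL and its value are invented for illustration):

#include <sys/socket.h>

#define SO_MEMPOOL  99              /* hypothetical option number */

static int attach_reserve(int iscsi_fd)
{
    int reserve = 64 * 1024;        /* bytes held back for this socket */

    return setsockopt(iscsi_fd, SOL_SOCKET, SO_MEMPOOL,
                      &reserve, sizeof(reserve));
}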
When a block driver writes data to a local device, it can easily tell when the operation has completed (and the relevant memory can be freed). In many cases, it is simply a matter of waiting for an interrupt and querying ports on the host controller. Newer, more complex protocols can be handled by setting aside a small amount of memory for replies from the controller. The controller is unlikely to overwhelm the system with spurious messages; about the only thing that will come back is responses to operations initiated by the system. In the iSCSI case, a write to the device cannot be deemed to have succeeded until the device sends back an acknowledgment, which will arrive as one of possibly many TCP packets. If the system does not have memory available to receive those packets and process the ACKs, it will be unable to complete the write operations and free up more memory. So everything stalls, or, in the worst case, deadlocks completely.
Just creating another mempool for incoming packets is not a solution, however. The number of packets arriving on a network interface can be huge, and the bulk of them are likely to be entirely unrelated to the crucial outstanding iSCSI operations. A system which is in an out-of-memory state simply cannot attempt to keep up with the full flood of packets arriving on its network interfaces. But, if it is unable to deal with the specific packets it is looking for, it may never get out of its memory crunch.
Various possible solutions have been floated. Many network interfaces can be programmed, in great detail, to drop uninteresting packets. So, when the system hits a memory crunch, it could instruct its network drivers to restrict the incoming packet stream to acknowledgments on high-priority connections. This approach would work, but it would require complicated communications between network drivers and the higher layers of the system. Network adaptors are also limited in the amount of programming they can handle; this limitation would restrict the number of iSCSI devices which could be reliably supported by the system.
Another possible solution was posted by Andrea Arcangeli. When an attempt to allocate memory for an incoming packet fails, the system would perform the allocation from one of the mempools (chosen at random) associated with sockets routed through the relevant interface. Once the packet was fed into the networking layer, a quick check would be made to see if the packet is, in fact, associated with one of the high-priority sockets; if not, it would be quickly dropped and the memory returned to the mempool. Packets belonging to high-priority sockets would be processed normally, resulting, hopefully, in the completion of write operations and the freeing of memory.
This discussion has not reached any sort of consensus, but it has made clear that a number of tricky issues arise when the block and networking layers must interact. The search for a solution, in this case, is likely to be deferred to the Kernel Summit, to be held in Ottawa this July. It should be an interesting session.
Kernel Planet launches
Dave Airlie has launched KernelPlanet.org, an aggregation of weblog entries from several kernel hackers.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet