RCU-safe reference counting

[Posted July 14, 2004 by corbet]

The "kref" mechanism is a simple structure for implementing reference-counted objects in the kernel; it was covered here last March. At the core of a kref is an atomic_t counter which contains the number of outstanding references. When that counter goes to zero, the object is no longer used and can be freed.

The kref functions are simple. Obtaining a reference is done with a call to kref_get():

    struct kref *kref_get(struct kref *kref)
    {
	WARN_ON(!atomic_read(&kref->refcount));
	atomic_inc(&kref->refcount);
	return kref;
    }

Releasing that reference is accomplished with kref_put():

    void kref_put(struct kref *kref)
    {
	if (atomic_dec_and_test(&kref->refcount)) {
	    kref->release(kref);
	}
    }

The use of atomic types makes these functions safe in multiprocessor or preemptive environments; the reference count will always be correct. Except, of course, when things go wrong. Consider the following order of operations performed by two kernel threads; they could be running on separate processors, or on a preemptive, uniprocessor system:

Thread 1	Thread 2
/* In kref_get() */ WARN_ON(!atomic_read(&kref->refcount));
	kref_put(&kref);
atomic_inc(&kref->refcount); return kref;

The first thread will be left thinking it holds a reference to an object which, in fact, has been deleted. As a general rule, good things cannot be expected to result from this situation. The kref code deals with this possibility by fiat: simultaneous calls to kref_get() and kref_put() on the same object are not allowed. In practice, this restriction usually requires that these operations be called under the protection of a lock somewhere.

Developers interested in high-end scalability, however, often try to use lock-free algorithms. Locks can easily become a performance bottleneck as the number of threads increases, so, if they can be eliminated, the kernel will scale better. That is the motivation behind the use of techniques like seqlocks and read-copy-update (RCU). The locking requirement associated with the kref type makes that type difficult to use with these techniques.

Ravikiran G Thirumalai recently posted a patch entitled "Refcounting of objects part of a lockfree collection" which implements a new locking type (called refcount_t) for dealing with objects managed using no-lock techniques. The explanation goes to great lengths to describe reference counting issued when working with RCU, but, in the end, all the patch is really doing, via a long path, is making a type which is like the kref, but which is not subject to the race described above.

kref_get(), as currently written, checks the reference count first; if that count is zero, the object has already been freed. The current implementation merely complains when this happens; one could argue that stronger action is called for. The real problem, though, is that this test and the subsequent incrementing of the reference count are not, together, atomic - other actions can come between the two. Ravikiran's patch addresses this issue by coding his _get() function differently:

    static inline int refcount_get_rcu(refcount_t *rc)
    {
	int c, old;
	c = atomic_read(&rc->count);
	while ( c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c) 
		c = old;
	return c;
    }

The core of this function is the call to cmpxchg(), which is an inline assembly function giving access to the processor's cmpxchg instruction. The function prototype looks like:

    int cmpxchg(int *location, int old, int new);

(The actual definition is a little more complex, depending on the real type of location). The purpose of this function is to (1) compare the contents of *location with old, (2) if and only if the two are the same, assign new to *location, and (3) return the old value. If cmpxchg() returns old, the operation succeeded; otherwise the value pointed to by location is unchanged. The key point is that all of these operations are performed in an atomic manner

cmpxchg() is, in other words, a form of test-and-set instruction. It is used here to increment the reference count in an atomic manner while being absolutely sure that nobody else can possibly have seen that count reach zero. When references are obtained in this way, the race described above cannot happen.

There is still a pitfall, however. If the reference-counted object were to be freed and reused before another thread tried to obtain a reference, that thread might see a random "reference count" and think it succeeded. Preventing that turn of events is where RCU comes in. The actual object is freed by way of an RCU callback, which cannot happen until every processor has scheduled. If any thread can see a pointer to the object, said object will continue to exist, though its reference count may be zero. After a complete quiescence cycle, no threads can see such a pointer, and the object can be safely deleted.

One other potential problem is that not all architectures offer a cmpxchg instruction. On such systems Ravikiran uses a rather more elaborate and unsightly scheme involving a hashed array of spinlocks; see the patch if morbid curiosity gets the better of you.

This effort seems worthwhile; when this technique is used for looking up file descriptors, tiobench performance improvements of 13% to 21% are claimed. There were objections, however, to the creation of a new reference counting API which is very similar to the kref API. As a result, the patch is likely to be rewritten to use krefs, extending that API as need be to supply the required semantics.

Index entries for this article
Kernel	cmpxchg()
Kernel	kref
Kernel	Race conditions
Kernel	Read-copy-update
Kernel	Reference counting

to post comments

RCU-safe reference counting

Posted Jul 15, 2004 20:32 UTC (Thu) by dvdeug (subscriber, #10998) [Link] (1 responses)

Using the assembly instruction cmpxchg is all fine and dandy if you're running on ix86 or AMD-64. What happens on the other architectures?

RCU-safe reference counting

Posted Jul 15, 2004 20:33 UTC (Thu) by dvdeug (subscriber, #10998) [Link]

Sorry; didn't read the article throughly enough.

RCU-safe reference counting

Posted Jul 15, 2004 22:24 UTC (Thu) by sjmadsen (guest, #4035) [Link] (2 responses)

Never mind that cmpxchg() is a spinlock in different clothes. It is ever so slightly more efficient
because taking the lock and the operation protected by the lock are one and the same, and you
don't have to free the lock afterwards.

Seems like the "rather more elaborate and unsightly scheme involving a hashed array of
spinlocks" could have been a single spinlock.

RCU-safe reference counting

Posted Jul 17, 2004 0:54 UTC (Sat) by giraffedata (guest, #1954) [Link]

>Never mind that cmpxchg() is a spinlock in different clothes

I believe it beats a spinlock because the processor can implement it without actually stopping all the other CPUs. It can go ahead and do the whole operation as if the other CPUs didn't exist, then check whether there had been any conflicting activity by another CPU, and in the unlikely event that there was, try again. The spinlock requires all the other CPUs (that want that lock) to stop doing work while the one holder runs.

But here, we aren't talking about using cmpxchg() per se instead of a spinlock. We're talking about using it as part of some RCU logic so that an entire sequence of (lookup, find, and add a reference) can happen at the same time as another thread is destroying the found object.

Traditionally, that whole sequence would be covered by a spinlock, and two CPUs at a time could not do it, even for different objects. Nor could any CPU destroy any object while another CPU was in this lookup sequence. CPUs would be waiting for another CPU to do something that isn't even an actual conflict.

RCU-safe reference counting

Posted Jul 19, 2004 9:26 UTC (Mon) by khim (subscriber, #9252) [Link]

Never mind that cmpxchg() is a spinlock in different clothes.

Yes and no. It's reuse of existing "spinlock". Each and every access to every byte of RAM on SMP system is covered by "spinlock" anyway - think about cache cogerency! Yes, it's not really implemented as spinlock and there are different implementations of idea but the end result is the same: access to every byte in modern SMP systems is covered by "sort of spinlock" (and if two processors stomp each other feets too often system slows down to a crawl as you can guess). And cmpxchg() is using this feature - that's all. Since each and every access to any variable should be held with such "spinlock" held it's not slower then simple access without cmpxchg() function.