Transactional memory in the dentry cache
One might think that it would be hard to improve on the performance of a lockref in this area, but there is one technology that might just help in this regard: transactional memory. For some years now, transactional memory has been one of those hardware features that is destined to solve a number of our scalability problems. As is often the case, that future has been slow to arrive in the real world. But some manufacturers are now shipping CPUs with transactional memory support built in, and software is starting to take advantage of this feature. See, for example, this article describing the work done to add transactional memory support to the GNU C library.
Linus recently got an itch to upgrade to a newer, faster desktop computer; when he chose his new processor, he made a point of getting one that provided transactional memory support. So, he decided, the time had come to try to use that support to further speed dentry cache locking. The result was a short, proof-of-concept patch that shows how things could work. The core part of the patch is worth showing directly:
    asm goto("xbegin %l[repeat]": : :"memory","ax":repeat);
    if (unlikely(d_unhashed(dentry)))
        goto xabort;
    if (unlikely(!is_simple_dput(dentry)))
        goto xabort;
    if (unlikely(!arch_spin_value_unlocked(dentry->d_lock.rlock.raw_lock)))
        goto xabort;
    dentry->d_lockref.count--;
    asm volatile("xend");
    return;

xabort:
    asm volatile("xabort $0");

repeat:
    /* Fallback to lockref code */
The first line adds the xbegin instruction that starts the transaction. This instruction behaves a little strangely, in that its influence extends over the code that follows. Execution will continue with the code after the xbegin, but, should the transaction abort for any reason, all changes made during the transaction will be rolled back and control will jump to the address provided to xbegin (the label repeat in this case).
What follows is a set of tests to determine whether it is truly safe to decrement the reference count without holding the dentry spinlock. In particular, the dentry must be hashed (have a name associated with it, essentially), be on the LRU list, and not be locked. Reading these fields of the dentry structure adds them to the transaction; should some other code make a change to one of them, the transaction will be aborted by the processor. If any of the tests show that the dentry is not in a suitable state for a quick reference count decrement, the code uses the xabort instruction to make the transaction fail. Otherwise, the reference count will be decremented and the xend instruction will bring the transaction to a successful conclusion. The reference count, too, will become part of the transaction; should any other processor also try to change it, the transaction will abort.
The code as written is clearly not intended for merging; one does not put
late-model x86 assembly code directly into generic filesystem code, after all.
But it is enough to get a sense for how well transactional memory works in
this situation. According to Linus, "profiles with this look beautiful", with the locking cost associated with the dput() operation reduced essentially to zero. So there may well be a place for transactional memory within the dentry cache.
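For readers who would rather see the shape of this fast path without inline assembly, GCC and Clang expose the same instructions as the _xbegin(), _xend(), and _xabort() intrinsics in <immintrin.h>. The stand-alone sketch below is not the kernel code; the structure and its fields are invented stand-ins for the dentry, but the pattern is the same: start a transaction, check that the lock appears free, do the unlocked decrement, and commit, with any abort sending the caller back to the slow path.

/* Stand-alone sketch (not kernel code); build with: gcc -O2 -mrtm rtm-demo.c
 * It needs a TSX-capable processor: _xbegin() faults on older CPUs. */
#include <immintrin.h>
#include <stdio.h>

/* Invented stand-in for the dentry fields the real patch looks at. */
struct obj {
    volatile int lock;    /* plays the role of dentry->d_lock */
    int count;            /* plays the role of dentry->d_lockref.count */
};

/* Try to drop a reference without touching the lock; zero means the
 * caller must fall back to the ordinary locked path. */
static int fast_put(struct obj *o)
{
    if (_xbegin() != _XBEGIN_STARTED)
        return 0;          /* transaction aborted or never started */
    if (o->lock != 0)      /* this read adds the lock word to the transaction */
        _xabort(0);        /* someone holds the lock: bail out */
    o->count--;            /* becomes visible only if the commit succeeds */
    _xend();
    return 1;
}

int main(void)
{
    struct obj o = { .lock = 0, .count = 2 };

    if (!fast_put(&o))
        puts("aborted; a real caller would take the lock here");
    printf("count = %d\n", o.count);
    return 0;
}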
Whether this technique will see much wider use in the kernel remains to be
seen, though. Andi Kleen posted a patch using
transactional memory for most kernel locks back in March, but that work
did not go very far, mostly because he could not provide any benchmark
numbers showing what kind of performance improvements could be expected.
In his patch, Linus made it clear that he suspects those improvements will be small,
saying "
So as far as I'm concerned, transactional memory is going to be
useful - *if* it is useful - only for specialized code. Some of
that might be architecture-internal lock implementations, other
things might be exactly the dput() kind of situation.
The end result is that Andi's patches may be a good starting point for
transactional memory support in a more architecture-independent way, but we
may not see transactional memory used as a general locking mechanism (or
lock-elision mechanism) in the kernel.
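For reference, the lock-elision technique behind those patches has a well-known general shape: read (but do not write) the lock word inside a transaction, so that non-conflicting critical sections can run concurrently, and fall back to a real lock acquisition on any abort. The sketch below is an invented illustration of that pattern using the x86 RTM intrinsics; it is not Andi Kleen's actual interface.

#include <immintrin.h>     /* RTM intrinsics; build with -mrtm */
#include <stdbool.h>

/* Invented lock type, purely for illustration. */
struct elided_lock {
    volatile int locked;   /* 0 = free, 1 = held */
};

/* Try to "acquire" the lock transactionally; true means the caller is now
 * running its critical section inside a hardware transaction. */
static bool elide_lock(struct elided_lock *l)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (l->locked == 0)
            return true;   /* the lock word is now in our read set */
        _xabort(0xff);     /* really held: abort and fall back */
    }
    return false;
}

/* Fallback when elision fails: take the lock for real. */
static void take_lock(struct elided_lock *l)
{
    while (__sync_lock_test_and_set(&l->locked, 1))
        ;                  /* trivial test-and-set spinlock */
}

static void elide_unlock(struct elided_lock *l)
{
    if (l->locked == 0)
        _xend();           /* we were transactional: commit */
    else
        __sync_lock_release(&l->locked);  /* we held the real lock */
}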
Along the lines of architecture-independence, it is worth noting that
Intel was not the first to ship a processor with transactional memory
support; the PowerPC
architecture has had similar functionality for a couple of years.
Naturally, the feature works a little differently in that architecture, but
the basic functionality is the same. So it should be possible to create
some sort of wrappers around transactional memory that can be used in
common code. That is, if kernel-space transactional memory can be made to work efficiently on PowerPC at all; there are some challenges in the way of getting that functionality working reliably. Also, there may be rather different ways to do this; Intel doesn't have an exclusive on smart people.
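What such wrappers might look like is anybody's guess; the following is nothing more than a hypothetical sketch (none of these names exist in the kernel) of how generic code could be insulated from the architecture-specific instructions.

#include <stdbool.h>

/* Purely hypothetical wrappers.  Each architecture would supply real
 * implementations (x86 with xbegin/xend, PowerPC with tbegin./tend.);
 * the stubs below stand in for an architecture with no usable hardware
 * support, so callers always fall through to their locked slow path. */
static inline bool htm_begin(void) { return false; }  /* start a transaction */
static inline void htm_end(void)   { }                /* commit it */
static inline void htm_abort(void) { }                /* abort back to htm_begin() */

/* An invented, dentry-like object, again purely for illustration. */
struct obj {
    int lock;              /* 0 = unlocked */
    int refcount;
};

void slow_locked_put(struct obj *o);   /* the ordinary, lock-taking path */

/* Common code could then be written once for every architecture: */
void put_object(struct obj *o)
{
    if (htm_begin()) {
        if (o->lock != 0)
            htm_abort();   /* contended: punt to the slow path */
        o->refcount--;
        htm_end();
        return;
    }
    slow_locked_put(o);
}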
What one might conclude is that, while transactional memory is a
useful technology for speeding up specific bits of high-profile kernel
code, it does not look like a general solution to the locking problem at
this point. That could change in the future, of course, as hardware
transactional memory gets faster and more capable. But, for now,
transactional memory looks mostly like a useful trick to squeeze a little
more performance out of a few highly optimized code paths.