Deferrable timers for user space
Posted Feb 27, 2014 22:41 UTC (Thu) by wahern (subscriber, #37304)
In reply to: Deferrable timers for user space by luto
Parent article: Deferrable timers for user space
Min-heap trees are more common, as most people aren't that familiar with hierarchical timing wheels, and it's easier to code or copy a min-heap. Also, a min-heap has O(1) average-case insertion for random timeout values, which can superficially make it look great in benchmarking.
A hierarchical timing wheel has similarities to a trie, but it's optimized for timeouts. Every operation is O(1) with a small constant factor, including (in my implementation) minimum lookup, and the algorithm works such that timeouts efficiently bubble up to be readily collected for servicing. Insertion and cancellation are simple linked-list operations, and in many scenarios timeouts are usually cancelled long before they expire. (Think of a 10-second timeout on an I/O operation which usually completes in a fraction of a second, necessitating a cancellation and a re-scheduling for the next operation. For an RB tree that's just killer in real terms. For high-resolution timers with tiny timeouts--on the order of microseconds--bunching becomes a concern because you're always cascading down a wheel, although in Big-O terms it's nothing. OTOH, if you get that kind of bunching it means you're already rapidly installing and cancelling timers by the hundreds or thousands, milliseconds or less apart, and for something like an RB tree, as it grows to thousands or tens of thousands of entries, it's not gonna scale in wall-clock terms. At least, that's how it looks to me. But apparently I'm missing something.)
The only potential issue is bunching. In Big-O terms it doesn't matter, because you're doing the same amount of work whether timeouts are bunched or not (they're not sorted within each slot), but there can be an issue with run-time latency.
Actually, there are two issues with latency. In traditional wheels there's the latency of the periodic timer, which limits granularity, in addition to the latency of cascading bunched timeouts. In my implementation no periodic timer is necessary, because I use word-sized bitmaps to memoize slot residency (and on Haswell something like find-first-set is a mere 3 cycles).
I operate in user space in a non-real-time setting, so the second latency issue is of little concern. But, frankly, the implementation is just so much faster than a min-heap (2-3x faster even for random insertion, where both are O(1)), let alone an RB tree (ridiculously faster), that I was simply surprised someone found an RB tree to work better for large collections of timeouts. And if there was an issue, I was curious whether simple smoothing functions or other hacks had been tried. Because I'm still in the middle of comprehensive testing, it's of significant interest to me.