Deferrable timers for user space

By Jonathan Corbet
February 26, 2014

The deferrable timers concept in the kernel dates back to 2007. A deferrable timer can be used in situations where an arbitrary delay between the timer's expiration and the execution of the timeout code is acceptable. In such situations, the expiration of the timer can be delayed until something else wakes up the CPU. Deferring expiration in this way minimizes the number of wakeups, which, in turn, helps to minimize power consumption.

Deferrable timers are available within the kernel, but they are not provided to user space. So the timer-related system calls (including timerd_settime(), clock_nanosleep(), and nanosleep()) will all make an effort to notify user space quickly on timer expiration, even if user space could tolerate some delay. That is irritating to developers who are working to improve power performance. The good news for those developers is that, after a couple of false starts, it now appears that deferrable timers may finally be made available to user space.

Some readers will certainly be thinking about the timer slack mechanism, which is available in user space. However, it affects all timers; for some applications, some timers may be more deferrable than others. Timers with slack can also only be deferred for so long, meaning that they might still wake a sleeping CPU when they expire. So deferrable timers may well address some use cases that are not well handled by timer slack.

Back in 2012, Anton Vorontsov sent out a patch set adding deferrable timer support to the timerfd_settime() system call. In putting together this patch, Anton ran into a problem: like all of the timing-related system calls, the timerfd mechanism uses the kernel's high-resolution timers. But high-resolution timers do not support deferrable operation; that functionality is limited to the older "timer wheel" mechanism (described in this article). The timer wheel is old code with a number of problems; it has been slated for removal for years. But nobody has done that work, so the timer wheel remains in place, and it remains the only facility with the deferrable option.

Anton's response was to split timers in the timerfd subsystem across both mechanisms. Regular timer requests would use the high-resolution timer subsystem as usual, but any request with the TFD_TIMER_DEFERRABLE flag set would, instead, be handled by the timer wheel. Among other things, that limited the resolution of deferrable timers to one jiffy (0.001 to 0.01 seconds, depending on the kernel configuration), but, if the timer is deferrable, low resolution is not necessarily a problem. Still, this patch set did not go very far, and Anton appears to have dropped it fairly quickly.

More recently, Alexey Perevalov has picked up this concept and tried to push it forward. He first posted a patch set in January; this posting generated rather more discussion than its predecessor did. John Stultz was concerned that only timerfd timers gained the new functionality; it would be better, he thought, to push it down to a lower level so that all timer-related system calls would benefit. Doing so, he thought, would likely entail adding the deferrable capability to the high-resolution timer subsystem.

Thomas Gleixner was rather more emphatic, stating that use of the timer wheel "is not going to happen". He suggested that this functionality should instead be added to high-resolution timers in the form of a new set of clock IDs. The clock ID is a parameter provided by user space describing which timekeeping regime should be used. CLOCK_REALTIME, for example, corresponds to the system clock; it can go backward should the administrator change the system time. The CLOCK_MONOTONIC clock, instead, is guaranteed to always move forward. There are a number of other clocks, including CLOCK_TAI, added in 3.10, which corresponds to international atomic time. Thomas tossed out a proof-of-concept patch adding new versions of all of these clocks (CLOCK_MONOTONIC_DEFERRABLE, for example) that would provide deferrable operation.

John, however, argued that clock IDs were not the right interface to expose to user space:

My reasoning was that the deferrablity isn't a clock domain, and is more of a modifier. Thus to keep the interfaces somewhat sane (and avoiding having to add N new clockids for each new modifier), we should utilize the flag arguments to timers. So instead of just having TIMER_ABSTIME, we could add TIMER_DEFER, etc, which we could utilize instead.

The use of separate clock IDs within the kernel as an implementation detail is fine, he said, but the right way for user space to request the feature is with modifier flags. Fortunately, almost all of the relevant system calls already have flag arguments; the big exception is nanosleep(), which is a call some developers would like to see simply vanish anyway. John's argument, when expressed in this way, prevailed with no real dissent.

Alexey posted a couple more versions of his patch set, but, to put it gently, they did not find approval with Thomas, who eventually posted a deferrable timers patch set of his own, showing how he thinks the problem should be solved. In this patch set, clock_nanosleep() gets a new TIMER_IS_DEFERRABLE flag, while timerfd_settime() gets TFD_TIMER_IS_DEFERRABLE. In either case, setting that flag causes the kernel to use one of the new deferrable internal clock IDs. Timers on those IDs never actually program the hardware, so they never generate interrupts and cannot wake the system. Instead, the expiration functions will be executed when the system is awakened for some other task. In the absence of the new flag, these system calls behave as they always have.

Thomas's patch set has not gotten much in the way of comments beyond low-level implementation details; if that situation persists for long, silence is likely to indicate consent. So, unless some surprise comes along, the kernel will probably offer deferrable timers to user space around 3.15 or so.

Index entries for this article
Kernel	Timers

to post comments

Deferrable timers for user space

Posted Feb 27, 2014 2:24 UTC (Thu) by josh (subscriber, #17465) [Link] (2 responses)

Is there any semantic difference between a deferrable timer and a timer with infinite timer slack?

Deferrable timers for user space

Posted Feb 27, 2014 20:11 UTC (Thu) by luto (subscriber, #39314) [Link] (1 responses)

Not that I know of.

I'd much prefer to have per-timer slack controls than to add a new kind of timer.

Deferrable timers for user space

Posted Feb 27, 2014 22:38 UTC (Thu) by josh (subscriber, #17465) [Link]

Agreed, and that's exactly why I was asking. When reading the article, I kept thinking "wait, that's a boolean flag, but where's the number for *how* deferrable the timer is?".

Deferrable timers for user space

Posted Feb 27, 2014 3:52 UTC (Thu) by wahern (subscriber, #37304) [Link] (4 responses)

I don't quite understand the aversion to timing wheels. I've implemented a hierarchical timing wheel here:

http://25thandclement.com/~william/projects/timeout.c.html

In my tests benchmarking my implementation to a min-heap queue, as well as red-black tree, there's simply no comparison, _even_ when you need to cascade timers. I've done sequential, random, all expirations, all cancel, etc, etc. Just... no comparison whatsoever. (I'll be posting my results in the coming weeks, when work allows.)

The notion of avoiding cascading makes no sense to me, because if a timer is canceled or expired you're still rebalancing the tree (previous LWN article said high-resolution timers are using a red-black tree). A rebalance of a large tree is larger the the number of cascading operations needed (which are usually fixed at a small number--2-6 linked-list operations per timeout for expiration). The only thing I can think of is an issue with jitter. But timing wheels are just so insanely fast... I just can't see it, and jitter is more a concern with hashed timing wheels, not hierarchical wheels. (Hashed wheels also have horrible worst-cast performance.) Plus, cancellations usually happen earlier rather than later. For example a 10 second timeout on a socket is likely to be canceled long before it needs to be moved to the next wheel. And if jitter really, really, really mattered, you can just install the timeouts in a random slot in a lower wheel, to smooth out the workload. This is _especially_ true of high-resolution timers, where no two timeouts could feasibly have the same low-order bits.

Did anybody in the kernel even bother to benchmark stuff?

Deferrable timers for user space

Posted Feb 27, 2014 14:25 UTC (Thu) by tglx (subscriber, #31301) [Link] (3 responses)

Sigh, we definitely do not need another timer wheel implementation. We have one already in kernel and we use it where appropriate.

See https://lwn.net/Articles/152436/ and https://lwn.net/Articles/152363/ for details.

> The notion of avoiding cascading makes no sense to me...

Then go and watch what recascading thousands of timers in a burst does to your system. That's real world, not some random benchmark.

> Did anybody in the kernel even bother to benchmark stuff?

No, we all use crystal balls and rely on our gut feelings.

Deferrable timers for user space

Posted Feb 27, 2014 19:42 UTC (Thu) by wahern (subscriber, #37304) [Link] (2 responses)

What's the ratio of expiration to cancellation? If you're trying to ditch the timing wheels (which I gathered from some other articles), I'm just really curious how socket timeouts would be effected.

And I have measured cascading. On tens of millions of timers. The issue I saw was that the RB trees just performed so poorly there was simply no comparison. Rebalancing of trees is a cache killer, disregarding the Big-O differences. I usually use RB trees everywhere, preferring them to hashes. But when writing a high-throughput network server, timeouts are just manipulated too darned often.

I never had any intention of trying to get my timer implementation in the Linux kernel. Why would you think that? I can't get over the Linux reverse naming scheme of go_pick_up_laundry_bob() instead of bob_laundry_pick_up(). That's just not how we roll in user-space.

But I would like to get an address for the crystal ball emporium. ;) If it told you RB trees had lower latency than hierarchical timing wheels for thousands of timers, then I really have no ground to stand on.

Deferrable timers for user space

Posted Feb 27, 2014 20:10 UTC (Thu) by luto (subscriber, #39314) [Link] (1 responses)

Why is an RB tree a reasonably comparison? Modern radix trees should be *way* better for this use, I think.

Deferrable timers for user space

Posted Feb 27, 2014 22:41 UTC (Thu) by wahern (subscriber, #37304) [Link]

Because RB trees are what are used for the high-resolution kernel timers, as far as I gather.

Min-heap trees are more common, as most people aren't that familiar with hierarchical timing wheels, and it's easier to code or copy a min-heap. Also, a min-heap has O(1) insertion for random timeout values, which can superficially make it look great in benchmarking.

A hierarchical timing wheel has similarities to a trie. But it's optimized for timeouts. Every operation is O(1) with a small constant factor, including (in my implementation) minimum lookup. And the algorithm works such that timeouts efficiently bubble up to be readily collected for servicing. Insertion and cancellation are a simple linked-list operation, and in many scenarios timeouts are usually cancelled long before they expire. (Think of having a 10 second timeout on an I/O operation which usually completes in a fraction of a second, necessitating a cancellation and another re-scheduling for the next operation. For an RB tree that's just killer in real terms. For high-resolution timers with tiny timeouts--on the order of microseconds--bunching becomes a concern because you're always cascading down a wheel, although in Big-O terms it's nothing. OTOH, if you get that kind of bunching it means you're already rapidly installing and canceling timers on the order of hundreds or thousands, milliseconds or less apart, and for something like an RB tree, as it grows to thousands of tens of thousands, it's not gonna scale in wall clock terms. At least, that's how it looks to me. But apparently I'm missing something.)

The only potential issue is bunching. In terms of Big-O it doesn't matter, because you're doing the same amount of work whether they're bunched or not (they're not sorted in each slot), but potentially there can be an issue with run-time latency.

Actually, there are two issues with latency. In traditional wheels there's the latency with the periodic timer, limiting granularity, in addition to latency with cascading bunched timeouts. In my implementation there is no periodic timer necessary because I use word-sized bitmaps to memoize slot residency (and on Haswell something like find-first-set is a mere 3 cycles).

I operate in user space in a non-real-time setting, so the second issue of latency is of little concern. But, frankly, the implementation is just so much faster than a min-heap (2-3x faster even for random insertion where both are O(1)), let alone an RB tree (ridiculously faster), that I was simply surprised that someone found an RB tree to work better for large collections of timeouts. And if there was an issue, I was curious if some simple smoothing functions or other hacks hadn't been tried. Because I'm still in the middle of comprehensive testing, it's of significant interest to me.

Deferrable timers for user space

Posted Mar 8, 2014 4:59 UTC (Sat) by gdt (subscriber, #6284) [Link] (1 responses)

Perhaps that interface isn't widely useful for applications programming. The moment you have a few timers which can all potentially expire at once the question of priorities (and priority inversion) raises its nasty head. That's much better dealt with in the kernel (which can arbitrate between all timers) than with a shim over the timer calls in an application.

Deferrable timers for user space

Posted Mar 9, 2014 6:12 UTC (Sun) by dlang (guest, #313) [Link]

even normal timers can expire at the same time and applications cope with that just fine (and I believe the way they do is that they aren't given any choice about what timer is processed first)

If an app is going to freak out because some timer are going to expire in a different order than expected, it already has a chance of having problems. This does extend the windows of the problem, but since timers never expire exactly when intended, and the system can sleep for a long time (after which the timers will have expired all at once anyway), this doesn't look like a new problem to me.

Deferrable timers for user space

Posted Jul 5, 2014 12:59 UTC (Sat) by Aeyoun (guest, #92213) [Link]

A very interesting topic. However, it is half a year later and it does not appear that any of this went anywhere. Or am I missing something?