The power-aware scheduling mini-summit

By Jonathan Corbet
October 23, 2013

Kernel Summit
The first day of the 2013 Kernel Summit was set aside for mini-summits on various topics. One of those topics was the controversial area of power-aware scheduling. Numerous patches attempting to improve the scheduler have been posted, but none have come near the mainline. This gathering brought together developers from the embedded world and the core scheduler developers to try to make some forward progress in this area; whether they succeeded remains to be seen, but some tentative forward steps were identified.

[Morten Rasmussen] Morten Rasmussen, the author of the big.LITTLE MP patch set, was one of the organizers of this session. He started with a set of agenda items and supporting slides; the topics to be discussed were:

  • Unification of power-management policies, which are currently split among multiple, uncoordinated subsystems.

  • Task packing. Various patch sets have been posted, but none have really gotten anywhere.

  • Power drivers and, in particular, the creation of proper abstractions for hardware-level power management.

He attempted to get started with a discussion of the cpufreq and cpuidle subsystems, but didn't get very far before the conversation shifted.

The need for metrics

[Paul McKenney and Ingo Molnar] Ingo Molnar came in with a complaint: none of the power-management work starts with measurements of the system's power behavior. Without a coherent approach to measuring the effects of a patch, there is no real way to judge these patches to decide which ones should go in. We cannot, he said, merge scheduler patches on faith, hoping that they somehow make things better.

What followed was a long and wide-ranging discussion of how such metrics might be made. What the scheduler developers would like is reasonably clear: how long did it take to execute a defined amount of work, and how much energy was consumed in the process? There is also some interest in what worst-case latencies were experienced by the workload while it was running. With reproducible numbers for these quantities, it should be possible to determine which patches help with which workloads and make some progress in this area.

Agreement with this approach was not universal, though. Mark Gross of Intel made the claim that this kind of performance metric was "the road to hell." Instead, he said, power benchmarks should focus on low-level behavior like processor sleep ("C") states. For any given sleep state, the processor must remain in that state for a given period of time before going into that state actually saves power. So the cpuidle subsystem must make an estimate of how long the processor must be idle before selecting an appropriate sleep state. The actual idle period is something that can then be measured; over time, one can come up with a picture of how well the kernel's estimates of idle time match up with reality. That, Mark said, is the kind of benchmark the kernel developers should be using.

Others argued that idle-time estimation accuracy is a low-level measurement that may not map well onto what the users actually want: their work done quickly, without unacceptable latencies, and with a minimal use of power. But actual power-usage measurements have been hard to come by; the processor vendors appear to be reluctant to expose that information to the kernel. So the group never did come up with a good set of metrics to use for the evaluation of scheduler patches. In the end, Ingo said, the first patch that adds a reasonable power-oriented benchmark to the tools directory will go in; it can be refined from there.

What to do

From there, the discussion shifted toward how the scheduler's power behavior might be improved. It was agreed that there is a need for a better mechanism by which an application can indicate its latency requirements to the kernel. These latency requirements then need to be handled carefully: it will not do to keep all CPUs in the system awake because an application running on one of them needs bounded latencies.

There was some talk of adding some sort of energy use model to a simulator — either linsched or the perf sched utility — to allow for standardized testing of patches with specific workloads. Linsched was under development by an intern at Google, but the work was not completed, so it's still not ready for upstreaming. Ingo noted that the resources to do this work are available; there are, after all, developers interested in getting power-aware code into the scheduler.

[Paul Turner and Peter Zijlstra] Rafael Wysocki asked the scheduler developers: what information do you need from the hardware to make better scheduling decisions? Paul Turner responded: the cost of running the system in a given configuration; Peter Zijlstra added that he would like to know the cost of starting up an additional CPU. The response from Mark was that it all depends on which generation of hardware is in use. In general, Intel seems to be reluctant to make that information available, an approach which caused some visible frustration among the scheduler developers.

Over time the discussion did somewhat converge on what the scheduler community would like to have. Some sort of cost metric should be attached to the scheduling domains infrastructure; it would tell the scheduler what the cost of firing up any given processor would be. That information would have to appear at multiple levels; bringing up the first processor in a different physical package will clearly cost more than adding a processor in an already-active package, for example.

Tied into this is the concept of performance ("P") states, generally thought of as the "CPU frequency." The frequency concept is somewhat outdated, but it persists in the kernel's cpufreq subsystem which, it was mostly agreed, should go away. The scheduler does need to understand the cost (and change in performance) associated with a P-state change, though; that would allow it to choose between increasing a CPU's P state or moving a task to a new CPU. How this information would be exposed by the CPU remains to be seen, but, if it were available, it would be possible to start making smarter scheduling decisions with it.

How those decisions would be made remains vague. There was talk of putting together some kind of set of standard workloads, but that seems premature. Paul suggested starting with a set of "stories" describing specific workloads in a human-comprehensible form. Once a collection of stories has been put together, developers can start trying to find the common themes that can be used to come up with algorithms for better, more power-aware scheduling.

There was some brief discussion of Morten's recent scheduler patches. It was agreed that they provide a reasonable start for the movement of CPU frequency and idle awareness into the scheduler itself. A focus on moving cpuidle into the scheduler first was suggested; most developers would rather see cpufreq just go away at this point.

And that was where the group was reminded that lunch had started nearly half an hour before. The guidance produced for the power-aware scheduling developers remains vague at best, but there are still some worthwhile conclusions, with the need for a set of plausible metrics being at the top of the list. That should be enough to enable this work to take a few baby steps forward.

(For more details, see the extensive notes from the session taken by Paul McKenney and Paul Walmsley).

[Your editor would like to thank the Linux Foundation for supporting his travel to Edinburgh.]

Index entries for this article
Kernel: Power management/CPU scheduling
Kernel: Scheduler/and power management
Conference: Kernel Summit/2013

The power-aware scheduling mini-summit

Posted Oct 25, 2013 23:27 UTC (Fri) by jpan9 (subscriber, #37902) [Link] (1 responses)

One more thing that might be worth considering is whether we can synchronize longer idle-time slots among all cores. This would give the hardware a chance to enter package-level C-states, which are getting deeper on newer processors. The energy saving can be significantly greater than with core-level idle alone.

If we can achieve synchronized idle, spreading work among more cores and entering idle at the same time might be better than keeping one core busy in a way that allows less package-level idle. Having information on the power-consumption ratio between core-level and package-level idle would help make that choice.

The power-aware scheduling mini-summit

Posted Oct 28, 2013 11:10 UTC (Mon) by cmarinas (subscriber, #39468) [Link]

That's an important part of the ARM big.LITTLE architecture - there are two clusters (a.k.a. packages) and one of the approaches is to pack tasks onto a single cluster (usually the little one) to let the other go into deep sleep state. The point the scheduler maintainers have been making is that the scheduler is not aware of these states and power hierarchy, so simple task packing may not be generic enough. Once the scheduler is fully aware of the C-states topology at both CPU and package level, it could make more informed decisions about task placement (it sounds simpler at the high level but rather difficult once you start implementing it).

The power-aware scheduling mini-summit

Posted Oct 28, 2013 11:39 UTC (Mon) by etienne (guest, #25256) [Link]

> bringing up the first processor in a different physical package will clearly cost more than adding a processor in an already-active package

Both cases give you one more processor; but in the former case the new processor has a complete memory cache for its workload, while in the latter case you will share the cache and so reduce the speed of the old workload...

The power-aware scheduling mini-summit

Posted Oct 30, 2013 19:38 UTC (Wed) by liam (guest, #84133) [Link] (2 responses)

I'm not seeing anything about task grouping (i.e., grouping timer-based wakeups so as to cause as few movements from idle as possible, within reason). That seems a pretty obvious (even if not obvious how to actually accomplish) place to look to go into deeper C-states.
Perhaps this has been mentioned elsewhere?
Also, has anyone considered creating a few "simple" workloads that run on an ideal scheduler? That is, a scheduler that knows "exactly" (to time-slice precision) how, and for how long, each task will run. That seems a reasonable metric against which to gauge further progress on the scheduler.

The power-aware scheduling mini-summit

Posted Oct 30, 2013 20:26 UTC (Wed) by corbet (editor, #1) [Link] (1 responses)

Coalescing wakeups is done with mechanisms like timer slack; it was solved some years ago.

The creation of sample workloads was part of Paul Turner's request for stories. The linsched simulator also works with canned workloads.

The power-aware scheduling mini-summit

Posted Oct 30, 2013 22:33 UTC (Wed) by liam (guest, #84133) [Link]

Perfect. Thanks Jonathan.

Seeing hidden motives everywhere?

Posted Oct 31, 2013 1:35 UTC (Thu) by vomlehn (guest, #45588) [Link] (2 responses)

So, what with the NSA apparently subverting standards processes, I think I'm starting to see hidden motives everywhere. Intel seems to have torpedoed a good discussion about exactly what needs to be measured. It sounds like they then declined to provide detailed power consumption information, in spite of the importance of such information. This sounds like they were fighting the goal of the meeting--not what we want to see from a major player.

Seeing hidden motives everywhere?

Posted Oct 31, 2013 1:51 UTC (Thu) by corbet (editor, #1) [Link] (1 responses)

It's easy to get paranoid in 2013, but that may be a bit more than is strictly necessary. Intel's developers were active participants in the meeting, and they have a clear interest in getting good power performance with Linux. They didn't torpedo anything; my apologies if the article gave a different impression.

If anything, it's a disagreement over approach. Intel's current thinking seems to be to solve these problems with a lot of magic buried in the hardware, but a lot of people think that the scheduler, which has a clue about what the system needs to accomplish, can usefully participate in the process. It will work out in the end.

Seeing hidden motives everywhere?

Posted Oct 31, 2013 21:17 UTC (Thu) by vomlehn (guest, #45588) [Link]

Thanks for the perspective, Jon. Being paranoid is tiring!

Using Info from User Space?

Posted Nov 2, 2013 20:06 UTC (Sat) by roblucid (guest, #48964) [Link] (5 responses)

Whether to run things power-economically, or to prioritize performance, seems like something only user space can know. Why run multiple cores in a race to idle on nice-ed jobs, which are by definition low priority?

Again, it'd be nice for a user to choose between saving watts and saving time. Starting up extra cores on processing spikes or wakeups seems a bit extravagant on hyper-threaded or FP-sharing AMD modules, simply for small increases in idle time. Yet I can see that benchmarks, or some workloads, would depend on rapid reaction to increasing load; without policy hints from user space, how can the scheduler do the right thing?

Using Info from User Space?

Posted Nov 2, 2013 21:09 UTC (Sat) by raven667 (subscriber, #5198) [Link] (4 responses)

The amount of work that needs to be done doesn't change and is determined by the user; the kernel should have the latitude to schedule the work it's been asked to do in the most efficient manner.

Using Info from User Space?

Posted Nov 3, 2013 5:09 UTC (Sun) by dlang (guest, #313) [Link] (3 responses)

which is better, powering two cores that share cache and cause the process to take 10% longer to run, or powering two cores that don't share cache and finish faster, but take 20% more power to run? (picking numbers out of thin air to represent a real tradeoff)

the kernel can't know whether the user would rather finish faster at the cost of more power, or save power even if it takes longer to finish, unless the user tells the kernel somehow. So it's not as simple as 'userspace determines the work to be done and the kernel schedules it in the most efficient manner', because 'most efficient' is not a well-defined term.

The other issue is that the user may want the system to remain at higher speeds even when idle to reduce the latency when new work arrives.

Using Info from User Space?

Posted Nov 3, 2013 15:14 UTC (Sun) by raven667 (subscriber, #5198) [Link] (2 responses)

I think the point is that the path which finishes faster is usually also the path which uses the least power, because the system can stop using power as soon as its work is done; from a power-management perspective, faster is better. So the tradeoff is that you can power two cores and take 10% longer, or power up four cores, use more power in the short term, finish fast, then shut them all off and save power.

Using Info from User Space?

Posted Nov 3, 2013 15:23 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I am also going to add that latency is a big part of this calculation. As you correctly point out, keeping the processing core hotted up reduces latency for new work that is unknown to the scheduler. The finish-fast theory is all about latency too: what is the latency of powering up more cores to add capacity versus just running on the cores which are already active, and which way will finish first and return to a lower power state fastest? The scheduler can guesstimate this for work it knows about, although interrupt-driven work has different properties.

Using Info from User Space?

Posted Nov 3, 2013 16:58 UTC (Sun) by dlang (guest, #313) [Link]

It's not as simple as you are painting it.

it's not just the cores that need to be powered, it's the cache and other aspects of the chip.

it's not powering up two cores to get it done, or four cores to get it done in half the time.

it's powering up two cores and one set of cache, or powering up two cores and two sets of cache to get done 10% faster

from a strict power-budget standpoint, powering one set of cache and taking 10% longer is probably a win, but is the user willing to sacrifice the performance?

and there are also secondary effects: the speed at which the CPU is set to run may affect video performance, and if the user is running something very demanding on the GPU, should you slow down the CPU (and therefore the GPU) to save power or not?

these are decisions that the kernel cannot make on its own, because it doesn't know the user's priorities.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds