LCA: Server power management

By Jonathan Corbet
January 26, 2011

Power management is often seen as a concern mostly for embedded and mobile systems. They worry about power management because we want our phones to run for longer between recharges and our laptops to not inflict burns on our thighs. But power management is equally important for data centers, which are currently responsible for about 3% of the total power consumption in the US. Keeping the net going in the US requires about ~~15TW~~ 15GW of power - the dedicated output of about 15 nuclear power plants. Clearly there would be some real value to saving some of that power. Matthew Garrett's talk during the Southern Plumbers Miniconf at linux.conf.au 2011 covered the work that is being done in that area and where Linux stands relative to other operating systems.

Much of the power consumed by data centers is not directly controllable by Linux - it is overhead which is consumed outside of the computers themselves. About one watt of power is consumed by overhead for each watt consumed by computation. This overhead includes network infrastructure and power supply loss, but the biggest component is air conditioning. So the obvious thing to do here is to create more efficient cooling and power infrastructure. Running at higher ambient temperatures, while uncomfortable for humans, can also help. The best contemporary data centers have been able to reduce their overhead to about 20% - a big improvement. Cogeneration techniques - using heat from data centers to warm buildings, for example - can reduce that overhead even further.

But we still have trouble. A 48-core system, Matthew says, will draw about 350W when it is idle; a rack full of such systems will still pull a lot of power. What can be done? Most power management attention has been focused on the CPU, which is where a lot of the power goes. As a result, an idle Intel CPU now draws approximately zero watts of power - it is "terrifying" how well it works. When the CPU is working, though, the situation is a bit different; the power consumption is about 20W per core, or about 960W for a busy 48-core system.

The clear implication is that we should keep the CPUs idle whenever possible. That can be tricky, though; it is hard to write software which does nothing. Or - as Matthew corrected himself - it's hard to write useful software which does nothing.

There are some trends which can be pointed to in this area. CPU power management is essentially solved; Linux is quite good at it. In fact, Linux is better than any other operating system with regard to CPU power; we have more time in deep idle states and fewer wakeups than others. So interest is shifting toward memory power management. If all of the CPUs in a package within the system can be idled, the associated memory controller will go idle as well. It's also possible to put memory into "self-refresh" mode if it is idle, reducing power use while preserving the contents. In other situations, running memory at a lower clock rate can reduce power usage. There will be a lot of work in this area because, at this point, memory looks like the biggest, lowest-hanging fruit.

Even more power can be saved by simply turning a system off; that is where virtualization comes into play. If applications are run on virtualized servers, those servers can be consolidated onto a small number of machines during times of low load, allowing the other machines to be powered down. There is a great deal of customer interest in this capability, but there is still work to be done; in particular, we need fast guest migration, which is a hard problem to solve.

The other hard problem is the fact that optimal power behavior may make tradeoffs which enterprise customers may be unwilling to make. Performance matters for these people, and, if that means expending more energy, they are willing to pay that cost. As an example, consider the gettimeofday() system call which, while having been ruthlessly optimized, can still be slower than some people would like. Simply reading the processor's time stamp counter (TSC) can be faster. The problem is that the TSC can become unreliable in the presence of power management. Once upon a time, changing the CPU frequency would change the rate of the TSC, but that problem has been solved by the CPU vendors for a few years now. So TSC problems are no longer an excuse to avoid lowering the clock frequency.

Unfortunately, that is not too useful, because it rarely makes sense to run a CPU at a lower frequency; best results usually come from running at full speed and spending more time in a sleep state ("C state"). But C states can stop the TSC altogether, once again creating problems for performance-sensitive users. In response, manufacturers have caused the TSC to run even when the CPU is sleeping. So, while virtualization remains a hassle, systems running on bare metal can expect the TSC to work properly in all power management states.

But that still doesn't satisfy some performance-sensitive users because deep C states create latency. It can take a millisecond to wake a CPU out of a deep sleep - that is a very long time in some applications. We have the pm_qos mechanism which can let the kernel know whether deep sleeps are acceptable at any given time, allowing power management to happen when latency is not an immediate concern. Not a perfect solution, but that may be as good as it gets for now.

Another interesting feature of contemporary CPUs is the "turbo" mode, which can allow a CPU to run in an overclocked mode for a period of time. Using this mode can get work done faster, allowing longer sleeps and better power behavior, but it depends on good power management if it is to work at all. If a core is to run in turbo mode, all other cores on the same die must be in a sleep state. The end result is that turbo mode can give good results for single-threaded workloads.

Some effort is going into powering down unused hardware components - I/O controllers, for example - even though the gains to be had in this area are relatively small. Many systems have quite a few USB ports, many of which are entirely unused. Versions 1 and 2 of the USB specification make powering down those port hard; even worse, those ports will repeatedly wake the CPU even if nothing is plugged in. USB 3 is better in this regard.

Unfortunately, even in this case, it's hard to power down the ports because it is a feature which is poorly specified, poorly documented, and poorly implemented. The reliability of the hardware varies; Windows tends not to use the PCI power management event infrastructure, so it often simply does not work. This problem has been solved by polling the hardware once every second; that is "the least bad thing" they could come up with. The result is better power behavior, but also up to one second of latency before the system responds to the plugging-in of a new USB device. Since, as Matthew noted, that one second is probably less than the user already lost while trying to insert the plug upside-down, it shouldn't be a problem.

Similar things can be done with other types of hardware - firewire ports, audio devices, SD ports, etc. It's just a matter of figuring out how to make it work. There is also some interest in reducing the power consumption of graphics processors (GPUs), even though enterprise systems tend not to have fancy GPUs. The level of support varies from one GPU to the next, but work is being done to improve power consumption for most of them.

Work for the future includes better CPU frequency governor development; we need to do better at ramping up the processor's frequency when there is work to be done. The scheduler needs tweaks to do a better job of consolidating jobs onto one package, allowing others to be powered down. And there is the continued exploitation of other power management features in hardware; there are a lot of them that we are not using. On the other hand, others are not using those features either, so they probably do not work.

In summary: Linux is doing pretty well with regard to enterprise-level power management; the GPU is the only place where we perform worse than Windows does. But we can always do better, so work will continue in that direction.

Index entries for this article
Kernel	Power management
Conference	linux.conf.au/2011

to post comments

LCA: Server power management

Posted Jan 27, 2011 5:33 UTC (Thu) by ctpm (guest, #35884) [Link] (1 responses)

"15TW of power"

Didn't you intend to say 15GW of power?

LCA: Server power management

Posted Jan 27, 2011 5:59 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

Oops. That's my fault, not Jon's - I managed to end up off by 3 orders of magnitude there. Amazingly, I managed this in a sufficiently inconsistent way that my other figures still all work out, so just pretend that every "Tera" is a "Giga".

LCA: Server power management

Posted Jan 27, 2011 8:48 UTC (Thu) by bronson (subscriber, #4806) [Link] (1 responses)

Windows is easy to beat (on my Thinkpad anyway). How does Linux do against OS X when running on a Mac?

LCA: Server power management

Posted Jan 27, 2011 17:39 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

On a laptop? Badly. Apples are heavily tuned in such a way that GPU power management is an important part of the power management situation, and right now we're not terribly good at that on anything other than Intel (which most Macs no longer use). Things are improving, but not there yet.

LCA: Server power management

Posted Jan 27, 2011 15:00 UTC (Thu) by MortenSickel (subscriber, #3238) [Link]

"This problem has been solved by polling the hardware once every second; (...) one second of latency before the system responds to the plugging-in of a new USB device. Since, as Matthew noted, that one second is probably less than the user already lost while trying to insert the plug upside-down, it shouldn't be a problem. "

Quote of the week!

LCA: Server power management

Posted Jan 28, 2011 16:29 UTC (Fri) by robert_s (subscriber, #42402) [Link] (2 responses)

"Another interesting feature of contemporary CPUs is the "turbo" mode, which can allow a CPU to run in an overclocked mode for a period of time. Using this mode can get work done faster, allowing longer sleeps and better power behavior, but it depends on good power management if it is to work at all."

From all I could find on the web about Turbo mode, the policy all seems to be done "automatically" by the hardware. If true, this seems like the worst design decision ever. At best, surely the kernel would be able to do a much better job with the extra context it has about the workload. At worst, doesn't this cause absolute havoc with the kernel's scheduler?

LCA: Server power management

Posted Feb 1, 2011 21:22 UTC (Tue) by martinfick (subscriber, #4455) [Link] (1 responses)

No, it usually involves OS support.

http://en.wikipedia.org/wiki/Intel_Turbo_Boost

"Turbo Boost activates when the operating system requests the highest performance state of the processor. "

LCA: Server power management

Posted Feb 1, 2011 22:41 UTC (Tue) by mjg59 (subscriber, #23239) [Link]

Right, but that just appears as another P state to the OS - it's entirely possible that the BIOS will have programmed that before the OS asserts any control over the functionality. The problem with turbo mode is that without knowing the full state of the processor (the C state of every thread, the temperature of every sensor, the current power draw of every core) and without knowing the algorithms used (which the vendors consider secret), the OS has no way of knowing whether asking for the highest performance state will give it performance identical to the next highest state, or anywhere up to 60% higher.