Saturday, September 15, 2007

Datacenter Power Management: Subverting the Dominant Paradigm

The purpose of this entry is to dispell the meme that increasing server utilization is a viable long term approach to power management. The dominant paradigm is that virtualization technology will allow you to increase your server utilization and therefore allow you to improve total performance per watt. White papers from AMD and Intel both discuss consolidation as a means to data center power reduction and PG&E incetivizes the use of virtualization technology. While encouraging managed virtual machines for power consumption is a good model, increasing utilization is not a viable long-term model for optimizing GOps/Watt.

The point I would like to make is that in order to decrease power consumption, it may be necessary to buy more chips and distribute the load better. As a rule-of-thumb, in traditional CMOS hardware design, power consumption is cubically dependent on clock frequency. In ultra-low-power systems employing subthreshold, adiabatic and asynchronous logic, this speed-power relationship has an even higher order.

An analogy that makes sense to everyone is that when you coast your car, you achieve optimal miles-per-gallon, but you can't achieve any meaningful speed. Applying a little gas results in an optimal moving speed, but once you apply too much gas to maintain a high speed, your fuel-efficiency drops again.

As a reference point, you can buy microcontrollers at 1 MHz consuming ~400 micro-Watts. Architectures with finer granularity of power and frequency management allow you to distribute 1000 1-MHz virtual machines onto 1000 1-MHz cores instead of consolidating them to run on a single 1-GHz core. By the cubic rule of thumb, each of the 1000 1-MHz cores consumes a billionth of the power, resulting in one millionth total power consumption. Static currents and fixed overhead causes this cubic model to break down at some "power-optimal clock speed."

Dynamic voltage and frequency scaling in multicore arrays may allow each core to have its own power/clock domain in a globally asynchronous model. To optimize for static currents, using dynamic threshold scaling (modulating the body bias voltage of the chip) along with dynamic voltage scaling seems to be a viable technique. Here's a spreadsheet model for leakage current in a transistor varying Temparature, power and threshold voltage across a reasonable range. At lower frequencies, higher threshold voltages can be used to offset leakage power consumption. Such dynamic leakage reduction cannot be achieve using only multi-threshold CMOS (using high-threshold power-enable transistors). Since this static leakage current factor is increasingly a dominant factor in chip power consumption as channel lengths shrink to 45 nm and below, using threshold scaling with power-frequency scaling results in a higher that cubically ordered relationship between power consumption and clock speed.

Instead of increasing the utilization of a single chip, we can fundamentally decrease our need for gigahertz frequency computation and thus decrease power consumption by increasing the parallelism of our systems. Invoking Seymour Cray: while 1024 chickens may not plow a field as quickly as 2 Strong Oxen, they may plow it fast enough and for a lot less chicken-feed.

1 comment:

Hasit said...

Hi Amir
So you mean that instead of utilizing high frequency (GHz range) clocks, we should instead opt for slower clocks, but at the same time parallelize the hardware to such an extent that the task i still completed in reasonable amount of time.
That s interesting. That would require massive parallelization. We can probably use reconfigurable logic (FPGAs) as co processors.
I believe that if we are successful in distributing the workload efficiently among the processors then we can achieve major speedup.
Any comments ?

-Hasit