Saturday, September 15, 2007

Datacenter Power Management: Subverting the Dominant Paradigm

The purpose of this entry is to dispel the meme that increasing server utilization is a viable long-term approach to power management. The dominant paradigm is that virtualization technology will allow you to increase your server utilization and therefore improve total performance per watt. White papers from AMD and Intel both discuss consolidation as a means to data center power reduction, and PG&E incentivizes the use of virtualization technology. While encouraging managed virtual machines is a good model for reducing power consumption, increasing utilization is not a viable long-term model for optimizing GOps/Watt.

The point I would like to make is that in order to decrease power consumption, it may be necessary to buy more chips and distribute the load better. As a rule of thumb in traditional CMOS hardware design, power consumption is cubically dependent on clock frequency: dynamic power scales as C*V^2*f, and since supply voltage must scale roughly linearly with frequency, power grows roughly as the cube of the clock. In ultra-low-power systems employing subthreshold, adiabatic and asynchronous logic, this speed-power relationship has an even higher order.

An analogy that makes sense to everyone: when you coast your car, you achieve optimal miles-per-gallon, but you can't achieve any meaningful speed. Applying a little gas yields an optimal cruising speed, but once you apply too much gas to maintain a high speed, your fuel efficiency drops again.

As a reference point, you can buy 1-MHz microcontrollers that consume roughly 400 microwatts. Architectures with finer granularity of power and frequency management would allow you to distribute 1000 1-MHz virtual machines onto 1000 1-MHz cores instead of consolidating them onto a single 1-GHz core. By the cubic rule of thumb, each of the 1000 1-MHz cores consumes a billionth of the power of the 1-GHz core, so the ensemble consumes one millionth the total power. Static currents and fixed overhead cause this cubic model to break down at some "power-optimal clock speed."
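
To make that arithmetic concrete, here is a minimal Python sketch of the cubic rule; the 10 W dynamic figure, the 100-microwatt static floor and the linear voltage-frequency scaling are illustrative assumptions, not measurements:

    # Back-of-the-envelope model, not measured data: dynamic power follows the
    # cubic rule of thumb (P = C*V^2*f with V scaled linearly with f), plus a
    # fixed static/leakage floor per core.

    def core_power_watts(freq_hz, static_watts=100e-6):
        F_REF, P_REF = 1e9, 10.0   # assume a 1-GHz core burns 10 W dynamic
        dynamic = P_REF * (freq_hz / F_REF) ** 3
        return dynamic + static_watts

    one_fast  = core_power_watts(1e9)          # one 1-GHz core
    many_slow = 1000 * core_power_watts(1e6)   # 1000 1-MHz cores, same total ops/s

    print(f"1 x 1 GHz   : {one_fast:.4f} W")   # ~10.0001 W
    print(f"1000 x 1 MHz: {many_slow:.4f} W")  # ~0.1000 W

    # With zero static power the ratio would be one million; the assumed
    # 100-uW-per-core floor caps the win near 100x here, which is exactly the
    # "power-optimal clock speed" breakdown described above.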

Dynamic voltage and frequency scaling in multicore arrays may allow each core to have its own power/clock domain in a globally asynchronous model. To optimize for static currents, using dynamic threshold scaling (modulating the body bias voltage of the chip) along with dynamic voltage scaling seems to be a viable technique. Here's a spreadsheet model for leakage current in a transistor, varying temperature, power and threshold voltage across a reasonable range. At lower frequencies, higher threshold voltages can be used to offset leakage power consumption. Such dynamic leakage reduction cannot be achieved using only multi-threshold CMOS (using high-threshold power-enable transistors). Since static leakage is an increasingly dominant factor in chip power consumption as channel lengths shrink to 45 nm and below, combining threshold scaling with power-frequency scaling results in a higher-than-cubic relationship between power consumption and clock speed.
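
As a sketch of why threshold scaling pays off, here is a toy subthreshold-leakage model in Python; the I0, n and voltage values are placeholders chosen only to show the exponential dependence, not numbers from any datasheet:

    # Toy subthreshold leakage model for an "off" transistor (Vgs = 0):
    # I_leak ~ I0 * exp(-Vth / (n * kT/q))
    import math

    K_OVER_Q = 8.617e-5   # Boltzmann constant over electron charge, in V/K

    def leakage_amps(vth_volts, temp_kelvin, i0=1e-6, n=1.5):
        v_thermal = K_OVER_Q * temp_kelvin   # thermal voltage, ~26 mV at 300 K
        return i0 * math.exp(-vth_volts / (n * v_thermal))

    for vth in (0.2, 0.3, 0.4):
        print(f"Vth = {vth:.1f} V: {leakage_amps(vth, 300):.3e} A per device")

    # Raising Vth from 0.2 V to 0.4 V cuts leakage by exp(0.2/(1.5 * 0.026)),
    # roughly 170x -- which is why biasing the threshold upward is attractive
    # whenever a core is running at a low clock.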

Instead of increasing the utilization of a single chip, we can fundamentally decrease our need for gigahertz-frequency computation, and thus decrease power consumption, by increasing the parallelism of our systems. Invoking Seymour Cray: while 1024 chickens may not plow a field as quickly as two strong oxen, they may plow it fast enough and for a lot less chicken feed.

Monday, September 10, 2007

Evolving Beyond One-Dimensional Programming; Why Foo-C is not a Parallel Programming Environment

In my last blog entry, I introduced the spreadsheet parallel programming model that I will be evangelizing for the next several years. The goal of this entry is to contrast the spreadsheet programming model with existing attempts at C-like parallel programming environments. I argue that one-dimensional von Neumann languages are insufficient for expressing the objectives of a parallel program and that any "parallel programming" extension to C is a temporary solution to a bigger problem.

A programming objective is most generally one of two things: a functional procedure or a state-modifying atomic action. In hardware, these notions respectively correspond to resolving a combinatorial circuit and latching a register. In software descriptions they correspond to evaluating a pure function and assigning the result to the left-hand-side variable. Often these concepts are not separated in the mind of the sequential imperative programmer; each piece of code is thought of as a sequence of instructions that is executed as it is read. This programming model matches the hardware most software engineers have used for the past 60 years, and so it is the dominant paradigm.
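
A minimal sketch of the two objectives, using Python for concreteness (the full adder is just an arbitrary illustrative function):

    # Functional objective -- like resolving a combinatorial circuit: the
    # outputs depend only on the inputs and evaluation has no side effects.
    def full_adder(a, b, cin):
        s = a ^ b ^ cin
        cout = (a & b) | (cin & (a ^ b))
        return s, cout

    # State-modifying atomic action -- like latching a register: the pure
    # function is evaluated first, then its result is committed to state as
    # one indivisible assignment.
    state = {"sum": 0, "carry": 0}
    state["sum"], state["carry"] = full_adder(1, 1, state["carry"])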

In the sequential school of thought, functions are just subroutines which return a value. Since most languages do not strictly isolate state assignment from functional value determination, the order in which functional computations and assignments are performed matters to the correctness of the program. Thus a program is generally some contortion within a language to achieve either a functional or atomic-action objective. In this sense, the programmer is devising not a solution to a problem, but an expression of the solution in some language.

We software programmers rarely write code the way we think about the problem; rather, we solve the problem in the manner the language provides. Computer programming has predominantly been the engineering practice of projecting combinatorial functions or atomic actions onto a one-dimensional execution order. The work of a parallelizing compiler may be considered a process of "re-functionalizing" the one-dimensional projection produced by the programmer. Unfortunately, sequential imperative control flow confounds a parallelizing compiler and introduces a bottleneck that cannot easily be anti-projected from one dimension to many. Speculation seems to be the favored mode for improving the concurrency of our code, but parallelism is not achieved through brute-force speculation alone, since speculation is an inefficient parallelization primitive when we explore too many unlikely paths.

Herein lies the key difference between software and reconfigurable hardware programming. Software is a control-centric model; reconfigurable computing is a data-centric model. Any sort of programmatic control flow in reconfigurable hardware must be explicitly managed by state machines, whereas control flow in an instruction stream is implicit (next instruction) or managed with branching. The C programmer mangles what could have been a multiplexer into control-flow case statements. An if statement evaluates only one path, and only after the predicate resolves, while a multiplexer resolves all of its inputs simultaneously as the predicate is being determined. Expressing this sort of parallelism is impossible in C since simultaneity is not a component of the language.
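
To spell out the distinction, here is a hypothetical sketch (in Python rather than C, purely for brevity); expensive_f and expensive_g are placeholder computations standing in for two datapaths:

    def expensive_f(x):
        return x * x          # stand-in for one combinatorial path

    def expensive_g(y):
        return y + 1          # stand-in for the other path

    # Control-flow style: only the chosen branch is ever evaluated, and
    # nothing starts until the predicate has resolved.
    def select_as_if(pred, x, y):
        if pred:
            return expensive_f(x)
        return expensive_g(y)

    # Multiplexer style: both inputs are computed regardless of the predicate,
    # which merely selects between finished results. In hardware the two paths
    # resolve simultaneously; a sequential language can only evaluate them in
    # order, which is exactly the expressiveness gap described above.
    def select_as_mux(pred, x, y):
        f_out = expensive_f(x)
        g_out = expensive_g(y)
        return f_out if pred else g_out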

Ultimately, the problem of a parallelizing compiler resolving C-like control flow is this: when do we want our conditional statements to be an "if" and when do we want them to be a multiplexer? By writing code in a von Neumann language we are forcing one-dimensionality on our code and explicitly sequentializing what could otherwise be a parallel expression.

Wednesday, September 05, 2007

A Replacement for the von Neumann Model

"What is a von Neumann computer? When von Neumann and others conceived it over thirty years ago, it was an elegant, practical, and unifying idea that simplified a number of engineering and programming problems that existed then. Although the conditions that produced its architecture have changed radically, we nevertheless still identify the notion of "computer" with this thirty year old concept."
John Backus, “Can Programming Be Liberated from the von Neumann Style?” Turing Award Lecture, 1977


Yesterday I submitted my Master's thesis to MIT's Department of Electrical Engineering and Computer Science. In my thesis I present my answer to the problem: how do we program parallel architectures?

My answer is to use a spreadsheet as a replacement for the von Neumann model, since modern hardware more accurately resembles an array of processing units than a single instruction-stream executor. Since spreadsheets are already a well-established dataflow development paradigm, tools already exist to make spreadsheet specification extremely rapid and straightforward. Unfortunately, existing spreadsheet applications aren't designed with the intent of being complete programming languages, and they often treat circular references as errors--though you can enable manual iteration in Excel (try making a counter, an IIR filter and a RISC CPU in Excel as an exercise--F9 to iterate). To make spreadsheets a better parallel programming tool, I added multiple interpreter modes (asynchronous, blocking and non-blocking assignments, as in Verilog), behavioral abstraction, guarded atomic action macros and compilation to FPGA, GPU and multicore targets.
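
For flavor, here is a toy Python sketch of the non-blocking interpreter mode (my own illustration of the idea, not code from the thesis): every cell is a formula over the previous state, and each "tick" evaluates all formulas before committing any result, so circular references behave like registers rather than errors.

    # Cells are formulas over the current state; a tick evaluates every cell
    # against the old values and then commits all results at once --
    # Verilog-style non-blocking assignment.
    cells = {
        "count": lambda s: s["count"] + 1,                     # a counter
        "iir":   lambda s: 0.9 * s["iir"] + 0.1 * s["count"],  # a simple IIR filter
    }
    state = {"count": 0, "iir": 0.0}

    def tick(state):
        # All formulas read the old state, so cells may reference each other
        # (even circularly) without any evaluation-order dependence.
        return {name: formula(state) for name, formula in cells.items()}

    for _ in range(5):   # the moral equivalent of pressing F9 five times
        state = tick(state)
        print(state)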

My thesis is titled "Compiling and Optimizing Spreadsheets for FPGA and Multicore Execution".

Here is the abstract:

A major barrier to developing systems on multicore and FPGA chips is the lack of an easy-to-use development environment. This thesis presents the RhoZeta spreadsheet compiler and Catalyst optimization system for programming multiprocessors and FPGAs. Any spreadsheet frontend may be extended to work with RhoZeta’s multiple interpreters and behavioral abstraction mechanisms. RhoZeta synchronizes a variety of cell interpreters acting on a global memory space. RhoZeta can also compile a group of cells to multithreaded C or Verilog. The result is an easy-to-use interface for programming multicore microprocessors and FPGAs. A spreadsheet environment presents parallelism and locality issues of modern hardware directly to the user and allows for a simple global memory synchronization model. Catalyst is a spreadsheet graph rewriting system based on performing behaviorally invariant guarded atomic actions while a system is being interpreted by RhoZeta. A number of optimization macros were developed to perform speculation, resource sharing and propagation of static assignments through a circuit. Parallelization of a 64-bit serial leading-zero-counter is demonstrated with Catalyst. Fault tolerance macros were also developed in Catalyst to protect against dynamic faults and to offset costs associated with testing semiconductors for static defects. A model for partitioning, placing and profiling spreadsheet execution in a heterogeneous hardware environment is also discussed. The RhoZeta system has been used to design several multithreaded and FPGA applications including a RISC emulator and a MIDI controlled modular synthesizer.