Monday, October 08, 2007

Mandelbrot Set in Excel

As part of my "anything you can do, I can do in Excel" crusade, I made a spreadsheet to compute the Mandelbrot set (2.5 MB file). It takes roughly 5 seconds per iteration on a single-core Pentium 4 at 2.6 GHz, and Excel seems to consume 128 MB of memory to compute the 400 x 350 x 2 sheets of cells (compared to 28 MB just to run Excel). The slowness and memory overhead (probably related) of this operation point to some serious design flaws in the underlying spreadsheet calculation system. Considering the spreadsheet file is around 2.5 MB, I can't see why the whole computation should require more than, say, 5 MB (one copy to store the current iteration and one to store the next in order to preserve atomicity). Since Excel's iteration uses blocking assignments in reading order, the iteration could actually require no more memory than the 2.5 MB sheet itself if atomicity of the iteration step is not a concern. Since this spreadsheet is embarrassingly parallel and requires vastly more floating-point resources than can fit on an FPGA, it makes a good benchmark for a resource-sharing compiler (aka a serializing compiler).
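
For anyone who doesn't want to open the workbook, here is a minimal C sketch of what the spreadsheet is doing (illustrative only; the grid size and coordinates are taken from the captions below, everything else is made up): two arrays hold the real and imaginary parts of z, and each recalculation writes z = z^2 + c into a second pair of arrays--the two copies that motivate my 5 MB estimate above.

```c
#include <stdio.h>

#define W 400
#define H 350
#define MINX -2.0
#define MAXY 1.0
#define DX 0.01
#define DY 0.01

static double zr[H][W], zi[H][W];     /* current iteration (the two "sheets") */
static double zr2[H][W], zi2[H][W];   /* next iteration (the second copy)     */

int main(void)
{
    for (int iter = 0; iter < 50; iter++) {
        /* One recalculation pass: z <- z^2 + c for every cell. */
        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++) {
                double cr = MINX + x * DX;
                double ci = MAXY - y * DY;
                double r = zr[y][x], i = zi[y][x];
                zr2[y][x] = r * r - i * i + cr;  /* Re(z^2) + Re(c) */
                zi2[y][x] = 2.0 * r * i + ci;    /* Im(z^2) + Im(c) */
            }
        }
        /* Swap buffers: the "next" sheet becomes the "current" sheet. */
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                zr[y][x] = zr2[y][x];
                zi[y][x] = zi2[y][x];
            }
    }
    /* A cell has diverged once |z| exceeds 2 -- this is what the
       conditional coloring in the pictures keys off. */
    printf("corner cell |z|^2: %f\n",
           zr[0][0] * zr[0][0] + zi[0][0] * zi[0][0]);
    return 0;
}
```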

Time for pictures. These are produced by Excel's conditional cell coloring. The images link to bigger PNG versions.


After 1 iteration (minx = -2, maxy = 1, dx = .01, dy = .01)

After 2 iterations

Several iterations later, it starts to resemble a Mandelbrot set.

Several more iterations and some cells start to diverge.

Everyone's favorite fractal: the Mandelbrot Set!

Zooming in on a particular section and iterating a bunch.

Saturday, September 15, 2007

Datacenter Power Management: Subverting the Dominant Paradigm

The purpose of this entry is to dispel the meme that increasing server utilization is a viable long-term approach to power management. The dominant paradigm is that virtualization technology will allow you to increase your server utilization and therefore improve total performance per watt. White papers from AMD and Intel both discuss consolidation as a means to data center power reduction, and PG&E incentivizes the use of virtualization technology. While encouraging managed virtual machines is a good model for controlling power consumption, increasing utilization is not a viable long-term model for optimizing GOps/Watt.

The point I would like to make is that in order to decrease power consumption, it may be necessary to buy more chips and distribute the load better. As a rule of thumb in traditional CMOS hardware design, power consumption is cubically dependent on clock frequency. In ultra-low-power systems employing subthreshold, adiabatic and asynchronous logic, this speed-power relationship has an even higher order.
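
To spell out where that rule of thumb comes from (the standard first-order CMOS scaling argument, not anything from the vendor papers above): dynamic switching power is roughly

$$P_{dyn} \approx \alpha \, C \, V_{dd}^{2} \, f$$

and the supply voltage needed to sustain a given clock frequency scales roughly linearly with that frequency over the normal operating range, $V_{dd} \propto f$, so

$$P_{dyn} \propto f^{3}$$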

An analogy that makes sense to everyone: when you coast your car you get optimal miles per gallon, but you can't sustain any meaningful speed. Applying a little gas gives you an optimal cruising speed, but once you have to apply a lot of gas to maintain a high speed, your fuel efficiency drops again.

As a reference point, you can buy microcontrollers running at 1 MHz that consume ~400 microwatts. Architectures with finer-grained power and frequency management would allow you to distribute 1000 1-MHz virtual machines onto 1000 1-MHz cores instead of consolidating them onto a single 1-GHz core. By the cubic rule of thumb, each of the 1000 1-MHz cores consumes a billionth of the power of the 1-GHz core, so the total is one millionth of the power. Static currents and fixed overhead cause this cubic model to break down at some "power-optimal clock speed."
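
Here is that arithmetic as a minimal C sketch of the idealized cubic model (it deliberately ignores the static currents and fixed overhead that break the model at the low end):

```c
#include <stdio.h>
#include <math.h>

/* Idealized cubic model: P(f) = f^3, normalized so that one core at
   1 GHz consumes 1.0 unit of power. Static leakage and fixed overhead
   are ignored -- they are exactly what breaks this model at the low end. */
static double dynamic_power(double freq_ghz)
{
    return pow(freq_ghz, 3.0);
}

int main(void)
{
    double one_big_core   = dynamic_power(1.0);        /* 1 x 1 GHz    */
    double one_small_core = dynamic_power(0.001);      /* 1 x 1 MHz    */
    double thousand_small = 1000.0 * one_small_core;   /* 1000 x 1 MHz */

    printf("1 x 1 GHz core:     %e\n", one_big_core);    /* 1.0  */
    printf("1 x 1 MHz core:     %e\n", one_small_core);  /* 1e-9 */
    printf("1000 x 1 MHz cores: %e\n", thousand_small);  /* 1e-6 */
    return 0;
}
```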

Dynamic voltage and frequency scaling in multicore arrays may allow each core to have its own power/clock domain in a globally asynchronous model. To optimize for static currents, using dynamic threshold scaling (modulating the body bias voltage of the chip) along with dynamic voltage scaling seems to be a viable technique. Here's a spreadsheet model for leakage current in a transistor, varying temperature, supply voltage and threshold voltage across a reasonable range. At lower frequencies, higher threshold voltages can be used to offset leakage power consumption. Such dynamic leakage reduction cannot be achieved using only multi-threshold CMOS (high-threshold power-enable transistors). Since static leakage current is an increasingly dominant factor in chip power consumption as channel lengths shrink to 45 nm and below, combining threshold scaling with power-frequency scaling yields a higher-than-cubic relationship between power consumption and clock speed.
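
I can't embed the spreadsheet here, but the shape of the argument follows from the textbook subthreshold leakage model; here is a rough C sketch of it (the pre-factor I0 and slope factor n are made-up illustrative values, not fitted to any real process):

```c
#include <stdio.h>
#include <math.h>

/* Textbook subthreshold leakage model:
 *   I_leak ~= I0 * exp((Vgs - Vth) / (n * vT)) * (1 - exp(-Vds / vT))
 * where vT = kT/q is the thermal voltage. I0 and n are illustrative
 * placeholders, not parameters of any real process. */
static double leakage(double vth, double vds, double temp_k)
{
    const double I0 = 1e-7;           /* placeholder pre-factor [A]        */
    const double n  = 1.4;            /* placeholder slope factor          */
    const double k_over_q = 8.617e-5; /* Boltzmann constant / charge [V/K] */
    double vT  = k_over_q * temp_k;   /* ~26 mV at 300 K                   */
    double vgs = 0.0;                 /* transistor nominally off          */

    return I0 * exp((vgs - vth) / (n * vT)) * (1.0 - exp(-vds / vT));
}

int main(void)
{
    /* Raising Vth when a core runs slowly cuts leakage exponentially. */
    for (double vth = 0.2; vth <= 0.51; vth += 0.1)
        printf("Vth = %.1f V  ->  I_leak = %.3e A  (at 1.0 V, 300 K)\n",
               vth, leakage(vth, 1.0, 300.0));
    return 0;
}
```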

Instead of increasing the utilization of a single chip, we can fundamentally decrease our need for gigahertz frequency computation and thus decrease power consumption by increasing the parallelism of our systems. Invoking Seymour Cray: while 1024 chickens may not plow a field as quickly as 2 Strong Oxen, they may plow it fast enough and for a lot less chicken-feed.

Monday, September 10, 2007

Evolving Beyond One-Dimensional Programming; Why Foo-C is not a Parallel Programming Environment

In my last blog entry, I introduced the spreadsheet parallel programming model that I will be evangelizing for the next several years. The goal of this entry is to contrast the spreadsheet programming model with existing attempts at C-like parallel programming environments. I argue that one-dimensional von Neumann languages are insufficient for expressing the objectives of a parallel program and that any "parallel programming" extension to C is a temporary solution to a bigger problem.

A programming objective is most generally one of two things: a functional procedure or a state modifying atomic action. In hardware, these notions respectively correspond to resolving a combinatorial circuit and latching a register. In software descriptions these correspond to evaluating a pure-function and assigning the result to the left-hand-side variable. Often these concepts are not separated in the mind of the sequential imperative programmer; each piece of code is thought of as a sequence of instructions that is executed as it is read. This programming model matches the hardware most software engineers have used for the past 60 years and so it is the dominant paradigm.

In the sequential school of thought, functions are just subroutines which return a value. Since most languages do not strictly isolate state assignment from functional value determination, the order in which functional computations and assignments are performed matters to the correctness of the program. Thus a program is generally some contortion within a language to achieve either a functional or atomic-action objective. In this sense, the programmer is devising not a solution to a problem, but an expression of the solution in some language.

We software programmers rarely create code the way we think about the problem; rather, we solve the problem in the manner the language provides. Computer programming has predominantly been the engineering practice of projecting combinatorial functions or atomic actions onto a one-dimensional execution order. The work of a parallelizing compiler may be considered a process of "re-functionalizing" the one-dimensional projection produced by the programmer. Unfortunately, sequential imperative control flow confounds a parallelizing compiler and introduces a bottleneck which cannot easily be anti-projected from one dimension to many. Speculation seems to be the dominant mode for improving the concurrency of our code, but parallelism is not achieved through brute-force speculation alone, since speculation is an inefficient parallelization primitive when we explore too many unlikely paths.

Herein lies the key difference between software and reconfigurable hardware programming. Software is a control-centric model and reconfigurable computing is a data-centric model. Any sort of programmatic control flow in reconfigurable hardware must be explicitly managed by state machines, whereas control flow in an instruction stream is implicit (next instruction) or managed with branching. The C programmer mangles what could have been a multiplexer into control-flow case statements. An if statement evaluates only one path, and only after the predicate resolves, while a multiplexer evaluates all of its inputs simultaneously as the predicate is being determined. Expressing this sort of parallelism is impossible in C since simultaneity is not a component of the language.

Ultimately, the problem a parallelizing compiler faces in resolving C-like control flow is this: when do we want our conditional statements to be an "if" and when do we want them to be a multiplexer? By writing code in a von Neumann language we are forcing one-dimensionality into our code and explicitly sequentializing what could otherwise be a parallel expression.
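
To make the contrast concrete, here is a minimal C sketch (slow_a and slow_b are hypothetical stand-ins for two expensive data paths):

```c
#include <stdio.h>

/* Hypothetical stand-ins for two expensive data paths. */
static int slow_a(int x) { return x * x + 1; }
static int slow_b(int x) { return x * 3 - 7; }

int main(void)
{
    int x = 5, sel = 1, out;

    /* "if" semantics: nothing happens until the predicate resolves,
       and then exactly one path is evaluated. */
    if (sel)
        out = slow_a(x);
    else
        out = slow_b(x);
    printf("if-style result:  %d\n", out);

    /* "mux" semantics: both data paths are computed up front (in
       hardware they would run concurrently while sel settles), and the
       predicate merely selects between finished results. A C compiler
       is free to collapse this back into a branch, which is exactly
       the one-dimensionality problem discussed above. */
    int path_a = slow_a(x);
    int path_b = slow_b(x);
    out = sel ? path_a : path_b;
    printf("mux-style result: %d\n", out);

    return 0;
}
```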

Wednesday, September 05, 2007

A Replacement for the von Neumann Model

"What is a von Neumann computer? When von Neumann and others conceived it over thirty years ago, it was an elegant, practical, and unifying idea that simplified a number of engineering and programming problems that existed then. Although the conditions that produced its architecture have changed radically, we nevertheless still identify the notion of "computer" with this thirty year old concept."
John Backus, “Can Programming Be Liberated from the von Neumann Style?” Turing Award Lecture, 1977


Yesterday I submitted my Master's thesis to MIT's department of Electrical Engineering and Computer Science. In my thesis I present my answer to the problem: how do we program parallel architectures?

My answer is to use a spreadsheet as a replacement for the von Neumann model, since modern hardware more closely resembles an array of processing units than a single instruction-stream executor. Since spreadsheets are already a well-established dataflow development paradigm, tools already exist to make spreadsheet specification extremely rapid and straightforward. Unfortunately, existing spreadsheet applications aren't designed to be complete programming languages, and they often treat a circular reference as an error--though you can enable manual iteration in Excel (try making a counter, an IIR filter and a RISC CPU in Excel as an exercise--F9 to iterate). To make spreadsheets a better parallel programming tool I added multiple interpreter modes (asynchronous, blocking and non-blocking assignments, as in Verilog), behavioral abstraction, guarded atomic action macros and compilation to FPGA, GPU and multicore.
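
As a rough illustration of why the interpreter mode matters (a toy sketch in C, not RhoZeta code): take three cells chained together, B1 = A1 and C1 = B1. Under blocking semantics each assignment takes effect immediately in reading order, so a value ripples through the whole chain in one recalculation; under non-blocking semantics every cell reads the previous iteration's values, so the value advances one cell per iteration, like registers in Verilog.

```c
#include <stdio.h>

/* Three chained cells: cells[1] = cells[0], cells[2] = cells[1]. */

static void iterate_blocking(int cells[3])
{
    /* Update in place, in reading order (Excel-style iteration). */
    cells[1] = cells[0];
    cells[2] = cells[1];   /* already sees the new cells[1] */
}

static void iterate_nonblocking(int cells[3])
{
    /* Compute from a snapshot of the old values, then commit
       (like Verilog's <= assignments, or a double-buffered sheet). */
    int next[3] = { cells[0], cells[0], cells[1] };
    for (int i = 0; i < 3; i++)
        cells[i] = next[i];
}

int main(void)
{
    int a[3] = { 7, 0, 0 }, b[3] = { 7, 0, 0 };

    iterate_blocking(a);
    iterate_nonblocking(b);

    printf("blocking after 1 pass:     %d %d %d\n", a[0], a[1], a[2]); /* 7 7 7 */
    printf("non-blocking after 1 pass: %d %d %d\n", b[0], b[1], b[2]); /* 7 7 0 */
    return 0;
}
```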

My thesis title is "Compiling and Optimizing Spreadsheets for FPGA and Multicore Execution"

Here is the abstract:

A major barrier to developing systems on multicore and FPGA chips is the lack of an easy-to-use development environment. This thesis presents the RhoZeta spreadsheet compiler and Catalyst optimization system for programming multiprocessors and FPGAs. Any spreadsheet frontend may be extended to work with RhoZeta’s multiple interpreters and behavioral abstraction mechanisms. RhoZeta synchronizes a variety of cell interpreters acting on a global memory space. RhoZeta can also compile a group of cells to multithreaded C or Verilog. The result is an easy-to-use interface for programming multicore microprocessors and FPGAs. A spreadsheet environment presents parallelism and locality issues of modern hardware directly to the user and allows for a simple global memory synchronization model. Catalyst is a spreadsheet graph rewriting system based on performing behaviorally invariant guarded atomic actions while a system is being interpreted by RhoZeta. A number of optimization macros were developed to perform speculation, resource sharing and propagation of static assignments through a circuit. Parallelization of a 64-bit serial leading-zero-counter is demonstrated with Catalyst. Fault tolerance macros were also developed in Catalyst to protect against dynamic faults and to offset costs associated with testing semiconductors for static defects. A model for partitioning, placing and profiling spreadsheet execution in a heterogeneous hardware environment is also discussed. The RhoZeta system has been used to design several multithreaded and FPGA applications including a RISC emulator and a MIDI controlled modular synthesizer.

Wednesday, August 08, 2007

There's probably a few hundred of us

I haven't updated this blog in a while. I've been traveling and writing a thesis full of monadic deliciousness and behavioral invariant transformations--coming soon. I started a job last month making a PDP-11 emulator at Quickware.

There's probably a few hundred of us who are watching the accelerated computing meme spread. It is catching on, and there's probably a few dozen of us ready to put our mouth where the money is when Wall Street catches on too--email me, by the way (my name at where I work).

Just wanted to point out that big supercomputers are used to analyze carbon dioxide emissions due to the ever-increasing amount of energy required to power our big supercomputers.

And programming environments always seem to suck unless you build them yourself. Thank Gerry for Scheme though.

Wednesday, April 25, 2007

AMD White Paper

Joe Landman over at scalability.org has written up a white paper for AMD on accelerated computing technology.

Friday, January 26, 2007

Recognition via Synthesis and Replication

Many recognition systems work by applying transformations on input data to map a feature space to a more "friendly" feature space. The transformations are generally destructive, non-reversible filters (e.g. edge detection or color removal in video, frequency-selective filtering in audio). This implies a destruction of information in the signal path: by filtering data we reduce the variance of our state space to focus on "interesting" features. After filtering the data, the result is a more "familiar" feature space on which to perform reductive analysis.

In order to recognize speech we may eliminate all uninteresting frequencies by taking local maxima above a certain threshold. By looking at the relative amplitudes of the interesting frequencies, we can identify a set of conditions that categorize each of the sounds into phonemes (record yourself saying ooh and ahh and look at an FFT and you'll get an idea of what a computer might want to look for). The conditions for categorizing an input stream are often trained against a known data set.
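
As a toy version of that pipeline (a minimal sketch only; a real front end would use an FFT library and better features than a naive DFT, and the 500 Hz + 1500 Hz test signal is made up):

```c
#include <stdio.h>
#include <math.h>

#define N  256                /* samples per analysis frame      */
#define FS 8000.0             /* sample rate in Hz, illustrative */
#define PI 3.14159265358979323846

/* Naive DFT magnitude for bin k -- O(N) per bin, fine for a sketch. */
static double bin_magnitude(const double *x, int k)
{
    double re = 0.0, im = 0.0;
    for (int n = 0; n < N; n++) {
        double w = 2.0 * PI * k * n / N;
        re += x[n] * cos(w);
        im -= x[n] * sin(w);
    }
    return sqrt(re * re + im * im);
}

int main(void)
{
    double x[N], mag[N / 2];

    /* Stand-in input: a 500 Hz + 1500 Hz mixture ("ooh" vs "ahh" would
       show different peak patterns in the same way). */
    for (int n = 0; n < N; n++)
        x[n] = sin(2.0 * PI * 500.0 * n / FS)
             + 0.5 * sin(2.0 * PI * 1500.0 * n / FS);

    for (int k = 0; k < N / 2; k++)
        mag[k] = bin_magnitude(x, k);

    /* Keep only local maxima above a threshold -- the "interesting"
       frequencies whose relative amplitudes a classifier would use. */
    double threshold = 10.0;
    for (int k = 1; k < N / 2 - 1; k++)
        if (mag[k] > threshold && mag[k] > mag[k - 1] && mag[k] > mag[k + 1])
            printf("peak near %6.1f Hz, magnitude %6.1f\n",
                   k * FS / N, mag[k]);
    return 0;
}
```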

Today at the HKN Student Project expo, two 6.111 projects showed how an FPGA can be used to identify a person's hands (a juggling simulation) and colored patches and gloves (full-body DDR). These systems locate a color by filtering the video data for that color and identifying its center of mass. Such filter-and-reduce recognition systems (the name alludes to Google's MapReduce) will be an important part of artificially intelligent systems. Real-world solutions, while sharing this common programming model, still require a significant amount of energy to program and train, resulting in highly domain-specific implementations.
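
A minimal sketch of that filter-and-reduce loop, assuming nothing about the actual 6.111 implementations (the tiny frame and the threshold are made up): filter the pixels that pass a color test, then reduce them to a center of mass.

```c
#include <stdio.h>

#define W 8
#define H 6

int main(void)
{
    /* Red channel of a tiny made-up frame; the bright blob is the "glove". */
    unsigned char red[H][W] = {
        { 0, 0, 0,   0,   0, 0, 0, 0 },
        { 0, 0, 0, 200, 210, 0, 0, 0 },
        { 0, 0, 0, 220, 230, 0, 0, 0 },
        { 0, 0, 0,   0, 205, 0, 0, 0 },
        { 0, 0, 0,   0,   0, 0, 0, 0 },
        { 0, 0, 0,   0,   0, 0, 0, 0 },
    };
    const unsigned char threshold = 128;

    long count = 0, sum_x = 0, sum_y = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            if (red[y][x] > threshold) {   /* filter step */
                count++;                   /* reduce step */
                sum_x += x;
                sum_y += y;
            }

    if (count > 0)
        printf("blob center of mass: (%.2f, %.2f) over %ld pixels\n",
               (double)sum_x / count, (double)sum_y / count, count);
    else
        printf("no blob found\n");
    return 0;
}
```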

Perhaps it would be useful to produce a general set of filters and reductive analyses. The system, guided by a programmer, would assemble the filters and reductions to produce an output from an input stream: where are the hands in this image stream, what is the word in this audio stream? I realized that training such a system would be simpler, but there is still the problem of calibrating a parameter space for the filters: which frequencies are interesting? What size of red blob in a filtered image qualifies as a hand? How do we handle arbitrary training sets?

This problem is closely related to speech synthesis: what parameters for tone generators and filters are necessary to model a voice? It struck me that perhaps the reason we recognize and understand speech is that we can mimic the words and the intentions expressed. We recognize and understand when someone says "I am happy" because we can both produce the expression and sympathize with the emotion. Recognition is probably lower-hanging fruit than understanding.

The K-lines activated in person A when person B says "happy" are intrinsically connected to the K-lines person A uses to control his own voice. A computer could perform recognition by modulating the controls of a synthesizer to minimize the difference between the heard tone and the synthesized tone. This could apply to a video recognition system as well: attempt to render a scene matching what it sees. To find a hand, it renders a parametric model of a hand, adjusting the parameters to minimize the error between the rendered image and the seen image until we are confident we have located the hand properly.

Since I'm a synth junkie, I think of a computer controlling the knobs on a synth to mimic a sound. By examining the gestures used to produce the sound, the synthesizer would recognize the word hello by realizing it is saying "hello" when it attempts to mimic the input. By translating an input space into a control space, we can produce a set of meaningful symbols on which to perform analysis. Useful control parameters are orthogonal/independent: if we modulate each one in isolation, we can find a per-parameter minimum that is a component of the global minimum. It is also nice if our parameters themselves do not have local minima, in order to simplify minimization.
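
Here is a minimal sketch of that knob-turning idea (the "synth" is just a sine oscillator with two made-up knobs, frequency and amplitude): modulate one knob at a time and keep the setting that minimizes the error against the heard signal.

```c
#include <stdio.h>
#include <math.h>

#define N  512
#define PI 3.14159265358979323846

/* Two-knob "synth": an oscillator with frequency and amplitude controls.
 * Entirely illustrative -- a stand-in for a real synthesizer patch. */
static void synth(double freq, double amp, double *out)
{
    for (int n = 0; n < N; n++)
        out[n] = amp * sin(2.0 * PI * freq * n / N);
}

/* Sum-of-squares error between the synthesized and the "heard" signal. */
static double error(const double *a, const double *b)
{
    double e = 0.0;
    for (int n = 0; n < N; n++)
        e += (a[n] - b[n]) * (a[n] - b[n]);
    return e;
}

int main(void)
{
    double heard[N], trial[N];
    synth(7.0, 0.8, heard);            /* the tone we are trying to mimic */

    /* Coordinate descent: sweep one knob at a time, keep the best value. */
    double best_freq = 1.0, best_amp = 0.1;
    for (int pass = 0; pass < 3; pass++) {
        double best_e = 1e30;
        for (double f = 1.0; f <= 16.0; f += 0.5) {   /* frequency knob */
            synth(f, best_amp, trial);
            double e = error(trial, heard);
            if (e < best_e) { best_e = e; best_freq = f; }
        }
        best_e = 1e30;
        for (double a = 0.1; a <= 1.0; a += 0.05) {   /* amplitude knob */
            synth(best_freq, a, trial);
            double e = error(trial, heard);
            if (e < best_e) { best_e = e; best_amp = a; }
        }
    }
    printf("recovered knobs: freq = %.2f, amp = %.2f\n", best_freq, best_amp);
    return 0;
}
```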

We don't always get that kind of niceness... but perhaps we can take advantage of the continuity of trajectories in the physical world, so that only incremental modulation is required to control and maintain an accurate replication synthesis.