Thursday, September 04, 2008

Parallel Programming is Easy: Making a Framework is Hard

An HPCwire article titled "Compilers and More: Parallel Programming Made Easy?" by Michael Wolfe presents a gloomy outlook for parallel programming:
Every time I see someone claiming they've come up with a method to make parallel programming easy, I can't take them seriously.
People you don't take seriously may take you by surprise. I think the computing industry suffers from the "faster horse" problem: Henry Ford couldn't ask his customers what they wanted because they would have said "a faster horse." The instruction stream is a horse: the industry has already built the fastest horses physically possible, so now the industry is going on a multithreading tangent (multi-horsing?).

While everyone else is still in a horse-race, we're building hybrid engines.

The HPCwire article lists a whole bunch of languages, but whenever someone groans about parallel programming and provides a list of languages, Excel is always left off. Clearly we aren't seeing the forest if we leave out the most widely used language ever (probably by a double or even triple-digit factor). The omission is especially egregious when these lists include other visual dataflow languages like Labview and Simulink (this article mentions dataflow style but none of these particulars). Spreadsheet cells are explicitly parallel and make dataflow and vector programming so simple that almost everyone who has ever used a computer has done it. There's even a well-understood model for event-triggered control-flow macros for those cases where you "need" instruction streams.

So I strongly disagree with the premise that parallel programming aught to be difficult. Parallel programming is the same as spreadsheet programming, it's easy to do and everyone knows how it works already. Especially don't let someone convince you that parallel programming is hard if they work on the hardware-software interface. Many of these people still believe parallel programming involves synchronizing random-access-machines running non-deterministic threads (avoid the pitfalls of horse-race-conditions by considering threads harmful).

Developing a high-performance, real-time spreadsheet framework for a hybrid hardware topology requires substantial effort. Depending on your target hardware architecture, you may need to use threads, vector operations, distributed processes, and hardware description languages to iterate that spreadsheet efficiently. To do this, you need a compiler from the spreadsheet language to each of the hardware models you want to support and you need to generate synchronization code to share precedent cell data across hardware elements. Depending on the memory and interconnect architecture this data synchronization code can get somewhat tricky, but code generation from a spreadsheet is the "tractable" part of the parallel programming problem and makes for good Master's theses if you throw in at least one optimization.

For your PhD you'll have to do something more difficult than just automatically generating parallel code from a partitioned dataflow graph.

Optimal partitioning of parallel programs is obscenely hard (MegaHard as my previous post would have it). In heterogeneous environments that use many of these primitive parallel models, you need to worry about optimally partitioning which cells run on which metal. Partitiong based on computational resources is a pain, but the real difficulty is optimizing for the communication requirements between partitions and the communication constraints between the hardware elements. We are approaching the optimal partitioning problem by assigning a color for each chunk-of-metal. We group spreadsheets cells by color and then profile the computational load of the color group and the communication between color groups using hardware constraints

The HPCwire article does mention communicating sequential processes and dataflow models:
"Until we have machines that implement these [communicating sequential processes and dataflow] models more closely, we need to take into account the cost of the virtualization as well."
We do have machines that implement these models (as Tilera and Altera will attest). They are still as difficult to program as any parallel architecture, but I assure you that once we start to think of these things as "hardware spreadsheets" we will start to see a way out of the parallel programming cave. I wonder if people who describe an FPGA as a "neural-net processor" make the pop-culture connection:


Anonymous said...

How does the spreadsheet model answer basic software needs for parallelism, such which aren't mathematical/vector oriented (web crawling for example)?

Amir said...

we are building a spreadsheet with object-oriented types so you probably need to modify your assumptions a little bit to see how we do non-mathematical stuff (check back to the python-excel demos that let us put lists, dictionaries and objects in cells).

To build any sort of database you will generally need to use some control flow macros to put new rows into the table.

I would start with a column of URLs. Followed by a column of strings fetched from those URLs. Followed by a column of index dictionaries parsed from those strings followed by a column of lists of URL links parsed from the site. your parallelism is the number of rows you have parsing simultaneously. when one row finishes computing the event triggers a control flow macro that can insert elements of the URL list back into the first column of a new row which would trigger the dataflow-waterfall to fill in the remaining columns for that URL.

actually building this will make for a good demo...

Amir said...

btw, you don't actually need to use control flow macros for anything since we reduce all Monadic operations to ports bound to a cell. This allows a built-in mechanism for reflection allowing some cells to atomically modify other cells. The result is efficient memory structures.

A control-flow macro to demonstrate reflection looks like this:

OnCalculate(): if A1: Cell(A2,A3).Formula = A4

here you provide a row a column and a formula and it will atomically overwrite the target formula.

Physical arrays have internal configuration access ports that do this.

Anonymous said...

A fellow I worked with in the mid 1980's at GE's Corporate R&D once commented that spreadsheets were tremedously underappreciated as a computational model. I agree.

I imagine the spreadsheet as a view upon an underlaying CSP model. The spreadsheet's two dimensions can even be extended by providing mechanisms to communicate outside a given spreadsheet to additional spreadsheets (the Excel model of addressing cells in other worksheets.)

A uniform model provides uniform constraints, and uniform constraints make it easier to implement uniform hardware. Devising a standard machine, along the lines of the Transputer, seems the most reasonable approach to me - at least until that General-purpose, Operation-Optimizing Data-Liasing Uber Compiler Kool-aid (GOODLUCK) is actually inventedn (after nearly 60 years of trying.)

Amir said...

resource sharing macrow can identify common functions in a spreadsheet and share pipelined structures using a few simple sharing mechanisms (priority, round robin etc) for FPGA compilation. It's possible to imagine an FPGA large enough to fit every spreadsheet ever made in the CLBs, but it's nice when your design is just so large that it won't fit in what you have. most spreadsheet iterations takes multiple cycles when you have transcendental functions

I'm sure the vendors hate this heresy ("why not just buy MORE FPGAs!!" :) but it's necessary for basically any spreadsheet application we're going to use now.

Anonymous said...

Hi Amir! I had started with FPGA just early this year and your blog has been very helpful in my learning. I had added you in my links. Thank you so much for sharing!

Tom Harris said...

Ever since I saw Lotus 1-2-3 automatically recalculate a spreadsheet full of formulas, I knew that designing a spreadsheet (NOT with macros -- just formulas) was parallel programming. And it's functional programming at the same time. Of course I knew then, and same is true now, that "under the hood" Excel is doing sequential, imperative processing.

What I would look for regarding spreadsheets and parallel processing is something much simpler: mapping spreadsheet cells to parallel processor elements. Anything happening with that?