Yesterday I wrote about GFlops/Watt numbers for AMD's new GPGPU and for ClearSpeed's 2-2.5 GFlops/Watt parts -- it looks like ClearSpeed now has a new card that does 4 GFlops/Watt at double precision.
NVidia has a new Tesla too. According to the FPGA Journal article you can buy 4 TFlops consuming only 700W, or about 5.7 GFlops/Watt. (It's unclear whether the FPGA Journal numbers are for double or single precision; I assume single.) At 10 cents per kilowatt-hour, a Teraflop of Teslas will cost you 153 dollars to run for a year. Not bad.
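Here's the back-of-envelope math in Python (my assumptions, not FPGA Journal's: the 5.7 GFlops/Watt figure and round-the-clock operation):

# Yearly electricity cost for one sustained TFlop of Teslas.
gflops_per_watt = 4000.0 / 700.0                     # ~5.7
watts_per_tflop = 1000.0 / gflops_per_watt           # ~175 W
kwh_per_year = watts_per_tflop * 24 * 365 / 1000.0   # ~1533 kWh
print kwh_per_year * 0.10                            # ~153 dollars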
Frankly, there's too much marketing and handwaving on these specs -- not enough real numbers to draw a conclusion about who dominates in efficiency.
Wednesday, June 18, 2008
Tuesday, June 17, 2008
AMD's new chips, OpenCL
HPCWire reports on AMD's latest GPUs clocking in at 200 GFlops double-precision under 150 Watts, or 1.33 GFlops/Watt. That translates to a double-precision petaflop for 0.75 MW, compared to Roadrunner, which consumes 3 MW. The AMD GPU is about 2-3 times the peak GFlops/Watt of published FPGA floating-point numbers, though I speculate the new Altera Stratix IV may be competitive. ClearSpeed apparently wins the double-precision efficiency competition with its 2 GFlops/Watt and 2.5 GFlops/Watt chips. Performance on specific functions can vary substantially, though there is still no standard language to make it practical to spec.
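A quick sanity check on that petaflop figure (my arithmetic, assuming peak numbers scale linearly and ignoring host systems and cooling):

# Power draw for a double-precision petaflop of the new AMD GPUs.
gflops_per_watt = 200.0 / 150.0    # ~1.33
watts = 1e6 / gflops_per_watt      # 1 PFlop = 1e6 GFlops
print watts / 1e6                  # ~0.75 MW, versus ~3 MW for Roadrunner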
AMD claims that they will support the OpenCL ("Open Computing Language") specification. OpenCL is still nonexistent as far as I can tell. From the HPCWire article:
"In an attempt to unify the programmer interface AMD has announced its support for OpenCL."
Steve Jobs mentioned OpenCL support in Snow Leopard, and now it looks like the Khronos Group is trying to organize the effort to actually make the standard. Intel should join the fun and say Larrabee will support the OpenCL standard.
Thursday, June 12, 2008
"Programmable Arrays" are more than "Reconfigurable HDL Executers"
A blog post by David Pellerin of ImpulseC fame, called "Reconfigurable, Reconshmigurable," links to Vhayu's announcement of compression IP for FPGA-accelerated ticker systems.
To me, the most interesting part of the article is:
"Some Wall Street executives interested in using FPGAs to accelerate applications, or portions of them, however, have expressed the concern that it's hard to find programmers who are skilled at writing applications for FPGAs."
I keep hearing this brought up at low-latency Wall Street IT conferences, so it's definitely a real issue. Reconfigurable computing desperately needs a common open framework that minimizes the learning curve for programming FPGA hardware. The problem is that the FPGA industry has the inertia of its EDA upbringing, so people assume the primitive language for programmable arrays should be a Hardware Description Language -- but finding HDL programmers is hard.
I think it's time to drop the "hardware" from reconfigurable hardware and just think about programmable arrays. From this perspective, it is a bit ironic that Wall Street executives have trouble finding FPGA programmers: programmable arrays have been the primary computational metaphor of financial services since before they even had electronic computers to iterate their spreadsheets for them.
All FPGA hardware is actually programmed by a proprietary bitstream language much more closely related to programming a spreadsheet than to an HDL (specify 2-D grid coordinates, specify a function, connect to precedents). However, instead of providing software tools for programmable arrays, FPGA vendors stick to their EDA roots. Because it has been so profitable, the FPGA industry has fallen into HDL la-la-land while obscuring the low-level interfaces to its physical devices.
I would go so far as to say that there has been no real vendor of hardware programmable arrays since Xilinx stopped documenting how to reconfigure their arrays. They might sell you "field programmable gate arrays" as a naming convention, but what you really get from these vendors is a "reconfigurable HDL executer." If you want to actually use an FPGA like a programmable array, you need to reverse engineer the proprietary bitstreams. The FPGA vendors actually don't have much interest in making their programmable arrays useful as programmable arrays because they make a killing selling reconfigurable HDL execution systems.
But with interest in FPGAs coming from outside the traditional hardware-development niches, vendors quickly realized that they absolutely cannot sell HDL execution systems to people who want to use programmable arrays for their computational needs. Modern forays into C and Matlab synthesis help address this programmability problem for certain markets, but these tools are often entirely reliant on an HDL-centric toolflow and obscure the physical constraints of the underlying programmable array even more. The hiding of low-level abstractions that comes with high-level languages is fine (and even desirable) for end-user application development, but using C as a framework for mapping 4GLs to FPGA hardware is just as backwards as coding a Java VM in Verilog and expecting good performance on a single-threaded CPU.
For the FPGA market to mature into computing applications, FPGA hardware vendors need to get back to selling hardware programmable arrays instead of HDL-executers. If they want to compete with CUDA in HPC, they should take a cue from NVidia and provide complete low-level APIs to their hardware. Instead of hyping easy-to-program high-level languages for particular application niches, the hardware vendors need to focus on making and documenting better low-level languages for their hardware.
The fundamental concept of a programmable array is simple: everyone groks a spreadsheet. No one should ever be forced to target a programmable array as if it were a reconfigurable HDL machine.
Monday, June 09, 2008
Petaflop in the NY Times
A Petaflop of Cell Processors made the NY Times. Highlights of the article: 12960 total Cell chips with (9*12960)= 116640 cores.
The article twice tries to turn the supercomputing top spot into an issue of national pride. It also discusses the difficulty of programming these devices and how the next generation of consumer products will require programming paradigms for massively multicore hardware. The article also mentions that the three types of cores require a heterogeneous partitioner. For now, they are probably doing manual partitioning and making sure their designs are highly symmetric. If we want to build large computational pipelines, we need a hardware-agnostic programming model for parallel programming that handles partitioning, placement and profiling.
According to an OpenFPGA CoreLib presentation from Altera last Thursday, we could probably get a Petaflop by replacing all the Cells in this deployment with FPGAs. It seems plausible that Petaflop-capable FPGA supercomputers will exist and will be better suited to 2-bit DNA problems.
Brute-force scaling and twice the funding will get us an ExaFlop at 32 nm. The next major leap in supercomputing is going to require a materials/fabrication advance. FinFETs and 3-D integration will get us a ZettaFlop in the sub-22nm range.
I expect molar-scale integration using directed self-assembly of reconfigurable arrays will disrupt this process sometime in the 5 to 10 year range. We will then realize the supercomputers we are building to study global warming are the cause of global warming.
Wednesday, May 21, 2008
Altera Incorporates Dynamic Threshold Scaling in 40 nm Stratix IV
In September I wrote:
To optimize for static currents, using dynamic threshold scaling (modulating the body bias voltage of the chip) along with dynamic voltage scaling [for active power] seems to be a viable technique. Here's a spreadsheet model (ODS original) for leakage current in a transistor, varying temperature, power and threshold voltage across a reasonable range.
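For reference, here's a minimal Python sketch of the leakage dependence that spreadsheet captures (I0 and the ideality factor n are hypothetical fitting constants, not values from the model):

import math

def subthreshold_leakage(vth, temp_k, i0=1e-6, n=1.5):
    # Thermal voltage kT/q; k/q = 8.617e-5 V/K.
    vt = 8.617e-5 * temp_k
    # Leakage falls exponentially as the threshold voltage rises, which is
    # why body-biasing Vth upward cuts static power.
    return i0 * math.exp(-vth / (n * vt))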
According to this FPGA Journal article, Altera has incorporated programmable body biasing into the logic blocks of their 40 nm Stratix IV FPGA.
Xilinx will probably also follow suit with dynamic threshold scaling sometime this summer.
Altera claims to have 680K logic elements in their highest capacity offering... I think 640K is all anyone will ever need :)
Sunday, May 11, 2008
Locality Optimization
Digital system designers of FPGAs and ASICs alike become extremely familiar with the place-and-route problem during late-night moments of desperate agony the day before a deadline. Place-and-route is about to become relevant to a very new market as the distinction between FPGA and multicore chips turns into a question of granularity. As we increase the number of cores on a chip, the placement of communicating dataflow kernels will become a primary issue for chip-multiprocessor performance.
The time and power required to perform a dataflow computation are determined by the total wire length over which data must travel. In 2-D tiled structures like FPGAs or multiprocessors, the ratio between the worst-case and best-case placement of two communicating kernels grows as the square root of the number of cores: a bad placement of two communicating kernels in a 1-million-cell array could cost you on the order of 1000 times more wire length than a good placement. This is easily visualized by the graphic below, and the order-square-root-of-cores relationship holds regardless of whether Manhattan (L1) or Euclidean (L2) distance is used (the L1 metric carries a constant factor of 2; the L2 metric a constant factor of sqrt(2)).

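A few lines of Python make the scaling concrete (a sketch assuming a square grid with unit cell pitch):

import math

def placement_ratio(cells):
    # Worst case: opposite corners of a sqrt(cells) x sqrt(cells) grid;
    # best case: adjacent cells, Manhattan distance 1.
    side = math.sqrt(cells)
    worst = 2 * (side - 1)
    best = 1.0
    return worst / best

print placement_ratio(1e6)   # ~2000: an order-1000x penalty for bad placement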
As we add more communicating kernels, it becomes increasingly obvious how poor placement degrades performance. For arbitrary dataflow graphs, the problem of finding the mapping of the graph which minimizes arc length is NP-hard. A generalization of this problem is the quadratic assignment problem (QAP).
Several methods are employed to perform placement, often in combination. A generic way to accelerate place-and-route is to partition the design into sub-graphs. Partitioning dataflow graphs into independently placeable groups allows a hierarchical placer to perform multiple sub-placements in parallel. Partitioning can be used as a general strategy, with QAP solvers or simulated annealing used to solve the placement problem within partitions.
Tools allow a designer to constrain the physical placement of partitions so that they only need to be re-placed when they change. Since development of FPGA modules usually requires multiple static system elements to test the user code, it is nice not to have to work on those sections again. This also allows board vendors to offer a partial bitstream with interface components pre-placed in the FPGA to hide messy interface signals and provide a simple development environment. A partition-constrained place-and-route can then map new user logic independently of system logic.
A spreadsheet programming model directly exposes the locality issue to the programmer. We should expect a spreadsheet consisting of 6-input logic functions on boolean types to map immediately to the logic fabric of a Virtex-5. If we bring the granularity of our spreadsheet up to integer and character operations, we should still expect a somewhat direct placement from the spreadsheet to the FPGA. This does not guarantee that the interconnect can be routed: placement is constrained by routing interconnection, and place-and-route algorithms consider routing while placing components. See ulogic.org for information about the Xilinx bitstream and XDL format as well as the programmable interconnection points.
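For example, a boolean cell formula with up to six precedents -- something like "=A1 and not B1 or (C1 != D1) and (E1 or F1)" in RhoZeta-style Python (a hypothetical formula, just for illustration) -- is exactly the class of function a single Virtex-5 LUT6 implements.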
Here is some Python code to compute the wire-length metrics for an Excel spreadsheet. It doesn't take many lines of code to write macros that do simulated annealing on a spreadsheet. I included code which uses Excel's Cut functionality to move cells while preserving the dataflow graph. The slowness of the COM interface in this example will prevent you from getting far.
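The linked code boils down to a metric like this (a sketch of the idea, not the code from the link; it assumes a dict mapping each cell's (row, col) to a list of its precedents' (row, col) coordinates):

def total_wire_length(graph):
    # Sum Manhattan distances over every cell -> precedent arc.
    total = 0
    for (r, c), precedents in graph.iteritems():
        for (pr, pc) in precedents:
            total += abs(r - pr) + abs(c - pc)
    return total

print total_wire_length({(0, 0): [], (0, 5): [(0, 0)], (3, 5): [(0, 5)]})  # 5 + 3 = 8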
Wednesday, April 09, 2008
"HDL-phobic pansies"
Kevin Morris over at FPGA Journal wrote a really good entry about the evolution of FPGA design and how the tools always target the intended market -- even if the users don't know the underlying execution engine is a reconfigurable array.
So if you want to make a product for the finance dudes, then you're building a spreadsheet -- a conveniently familiar reconfigurable array. The Accelerator thinks quantitative finance is a bunk field for Accelerated Computing enthusiasts. I say that, regardless of the underlying economic/ethical question he raises, having an arbitrary market for low-latency computation isn't necessarily bad. If some combination of low-latency LAPACK routines lets me print money for my own biochemical endeavors, then I'm willing to write them (incidentally, the 18.330 professor who taught me some of those LAPACK routines now works at DE Shaw).
We set up a public SVN to store our spreadsheet tools. See the Catalyst Accelerated Computing Trac page. Anonymous check-in is enabled if you want to add a spreadsheet that does something nifty. I'll update this blog when we have releases.
Monday, March 31, 2008
Addendum to RhoZeta source
To add to my RhoZeta post from a few weeks back, I thought I'd include an example CellManager. This code demonstrates how to make a non-blocking cell manager on a RhoZeta sheet, and how to spawn such a thread for the ActiveSheet in Excel. If you add this to RhoZeta.py, run the code, and then type "NBCM = NBActiveSheet()" into the PythonWin interactive console, you will create a non-blocking-style cell manager in the active sheet.
If the CellManager thread craps out (there's a bug when the dictionary changes size in the middle of computation), you can just recreate it by typing "NBCM = NBActiveSheet()" into the console. (Start the thread with NBCM.start().)
To determine the iteration time, put "=time.clock()" into cell A1, "=A1" into cell A2, and "=A1-A2" into cell A3. You can create a simple shift-register structure by setting A4 to "=A3" and A5 to "=A4" and so on. This allows you to take the average over a series of tests.
You can also create a counter by initializing B1 to "=0" and then changing its formula to "=B1+1", for example.
Non-blocking means all computed assignments are deferred so that all inputs are read from the previous iteration, as opposed to blocking assignments, which read values from the current iteration and assign new values as they are computed. Changing this code to blocking-mode iteration takes few keystrokes (change eval to exec plus a little more massaging, or update valdic as you eval), though I'll leave that as an exercise for you. These Python-interpreted modes are slow because of the dynamic eval of strings in Python; compilation is required to accelerate the function calls. JIT compilation to FPGA/GPU/Multicore is what is really required to accelerate spreadsheet iteration... hence a start-up.
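To make the distinction concrete, here's the difference in plain Python for a two-cell chain A2="=A1", A3="=A2" (a sketch of the semantics, not RhoZeta code):

# Non-blocking: every formula reads a snapshot of the previous iteration.
cells = {'A1': 1, 'A2': 0, 'A3': 0}
prev = dict(cells)
cells['A2'] = prev['A1']   # A2 becomes 1
cells['A3'] = prev['A2']   # sees the OLD A2 (0) -- a shift register

# Blocking: each assignment is visible to the formulas after it.
cells = {'A1': 1, 'A2': 0, 'A3': 0}
cells['A2'] = cells['A1']  # A2 becomes 1
cells['A3'] = cells['A2']  # sees the NEW A2 (1) -- propagates in one pass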
#start copying here:
def SheetNonBlockingAssignments(Sh):
    # Snapshot every cell's value first so that all formulas read the
    # previous iteration's values (non-blocking semantics).
    valdic = {}
    for key, value in Sh.Cells.iteritems():
        valdic[key] = value.Value
    for cell in Sh.Cells.values():
        try:
            if cell.Formula == '':
                cell.Value = None
            elif (cell.Formula[0:2] == "'=") or (cell.Formula[0] == "="):
                # Strip the apostrophe/equals prefix and evaluate the
                # Python expression against the snapshot.
                cell.Value = eval(cell.Formula[1:].lstrip('='), globals(), valdic)
            else:
                # Literal cell: just re-assert its value.
                cell.Value = cell.Value
        except Exception, e:
            cell.Value = e

class NonBlockingCellManager(Timeout):
    def __init__(self, Sheet):
        Timeout.__init__(self, target=SheetNonBlockingAssignments, kwargs={'Sh': Sheet})
        self.set_run_forever()

def NBActiveSheet():
    # 'as' is a reserved word in Python, so call the sheet variable 'sheet'.
    sheet = RZ.UISession.app.ActiveSheet
    NBCM = NonBlockingCellManager(RZ[sheet.Parent.Name][sheet.Name])
    return NBCM
#End copy
Another useful change to the FrontendSynchronizer class looks like this:
if xl.ActiveWindow.DisplayFormulas:
    setv = RangeMap(Excel.strifnotnone, self.RhoZetaSession.RangeFromTuple(vrtuple).Formula)
else:
    setv = RangeMap(Excel.strifnotnone, self.RhoZetaSession.RangeFromTuple(vrtuple).Value)
Press CTRL-` to toggle the DisplayFormulas variable in Excel.
We'll put up source control soon. Feel free to email me at amirh at mit dot edu with questions, comments, or if you want to help.
Friday, March 28, 2008
Xilinx Favors Clusters to FPGA for Application Acceleration
My last post got 40x my usual traffic. Apparently Python spreadsheets are a hot topic. It also looks like reconfigurable dataflow networks aren't popular under the moniker of a four-letter acronym. When you call it a hardware spreadsheet, a 2-D tiled array of functional blocks is suddenly a whole lot more accessible.
A few posts back I recounted the tale (and linked to papers that agreed) that the proprietary nature of the FPGA industry impedes reconfigurable computing research with locked-down binary formats and expensive/closed-source (un-study-able) EDA tools. So now let me create a little more controversy. I think it's about time FPGA manufacturers embraced their reconfigurable computing futures. This means that when a major FPGA manufacturer has an EDA problem requiring accelerated computation, they should go with the FPGA approach for accelerating that problem instead of (or at least in addition to) the Cluster or GPU approach. I'm talking about the SmartXplorer utility added to Xilinx ISE 10.1.
Here's my gripe: why is Xilinx promoting clusters instead of FPGA acceleration for its ISE? Of course, if you're an EDA company trying to deliver a product, you want to use tested technology for parallel computing, like a cluster. But it's not like there aren't examples of accelerating FPGA placement on FPGAs. And everybody using ISE has an FPGA already -- it's sitting there attached over USB JTAG, waiting for a new bitstream to be placed and routed by my cluster -- but we don't have a cluster.
If nothing else, this sends a signal about Xilinx's stance on whether one should invest in FPGA-accelerated computing or a cluster solution (if your problem is sorta similar to the types of problems an FPGA EDA tool might have). Now Altera has to release a GPU-accelerated workflow for bragging rights.
Something really needs to be fixed here.
(edit on March 31, props to Acceleware for the GPU accelerated workflows)
Monday, March 17, 2008
Python Spreadsheets: Like Resolver, Only in Excel and Free
(see http://catalystac.com/trac for the latest source code)
Resolver Systems sells a product that does Python with a spreadsheet UI. Resolver uses their own spreadsheet frontend. Here's a demo of how to add Pythonability to Excel. This code is buggy and barely tested, and it's only a small part of what is an even buggier master's thesis. This demo will show you how to synchronize the visible range in Excel from a running Python session. This allows the Python spreadsheet engine to run independently of Excel, letting me partition and compile sheet-iteration functions onto various co-processors and report the values to the frontend at a lower frequency. I worked on this demo enough to hope that it might work on your machine too.
Grab the code that I pieced together and extract both RhoZeta.py and Excel.py into the same directory. The code is free for you to use: Python is a language that thrives off of open-source sweat. Let me know if you do anything with this (amirh at mit dot edu).
Here's the demo steps with screenshots:
CAUTION: DO NOT RUN THIS WITH OPEN EXCEL WORKBOOKS OR YOU MAY SCREW UP YOUR DATA.
Load RhoZeta.py in PythonWin and click on "Run" and "OK."

After Excel loads you will have an empty workbook. Try typing "=range(0,10)" into A1.
As you might expect from a Python spreadsheet, this produces the repr() of the list as a text string in cell A1. Now type "=dir()" in cell A2 to discover the local context.
When we entered the formula in A2, the only assigned cell was A1. If this spreadsheet were recalculating, we would want cell A2 to constantly update to inform us which cells are in the local context. Now let's see what the global dictionary looks like:
This produces a dictionary representation in A3 (encased by curly {} brackets). In order to access the members of this dictionary we would like to use the form "=A3[key]", but because Excel doesn't like the square [] brackets we need to precede our formula with an apostrophe. If Excel complains about your perfectly reasonable Python formula, you can precede it with an apostrophe (seriously, try it without the apostrophe and enjoy the message).
OK, so we can reference into a dictionary and fetch a function object. In this case "RangeToAddressArray" is a utility function which takes a textual representation of a range and converts it into a 2-D array of addresses. The point here is that we can store a function in a cell and then apply that function using the cell's address.
We can also bind a cell to an anonymous function using lambda. Here's how to create the square function:

Now let's go put a value in B8 and apply that square function:
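The screenshots are gone, but the formulas were presumably along these lines (the cell addresses here are hypothetical, apart from the B8 mentioned above):

B7: =lambda x: x*x
B8: =5
B9: =B7(B8)      (should display 25)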
You can also reference attributes in the usual way. In this case the ExcelEvents class has an attribute RZ which points to the RhoZeta spreadsheet engine. (The binding mechanism is a bit awkward because of the way ExcelEvents is passed to the win32com DispatchWithEvents method).
We can actually use the RZ object here to read from the cells:
Here B13 contains the formula '=B11['Book1'].Sheets. This screenshot also demos how the ActiveCell has its formula overwritten with the internal RhoZeta implementation. The idea was to have editable and draggable formulas, but setting the formulas of a Range doesn't quite work yet. Notice B11 reports "#NAME?" as Excel attempts to execute the formula "=ExcelEvents.RZ", which displays in the formula bar.

So this system doesn't do a lot yet, but hopefully it gives other people a good starting point for tinkering. The remaining spreadsheet engine from my thesis consists of CellManager threads, which compile and run the formulas (using asynchronous, blocking or non-blocking interpretations), and Catalyst threads, which walk the spreadsheet graph and perform optimizations. I'll incorporate these elements and eventually this will suck even less.
You can also have fun graphing from Excel into PyOpenGL, kinda like what this guy did, but in PyOpenGL. Here's a tease:
Friday, March 07, 2008
New Accelerated Computing Blog
Doug O'Flaherty joins Joe Landman and me in trying to sort out, in blog form, what is going on in accelerated computing. More synchronized neurons is probably a good thing. His blog title, "Lead, Follow, or..." poses a good question. The first thing to sort out: are we making waves, riding waves, or just blowing bubbles? Considering that the recent trend of bus-standard openness is his baby, I wonder if he agrees with my take on proprietary standards and the EDA business model miring the FPGA acceleration field (go back two entries).
His first post starts with a bang: "I am not a fan of blogs. In general, they are self indulgent musings in search of an editor." This basically sums up my writing for the past 2 years. Blogging helps the accelerated mind sort: if I didn't blog I wouldn't even know what I was thinking most of the time.
Now that both Joe and I have put a link to his blog, I expect him to make the top of the Google search for "Accelerated Computing" in no time at all :)
Thursday, March 06, 2008
Excel as a 3-D Graphics Engine
Slashdot has a link to an article by Peter Rakos demonstrating how to use Excel to power a 3-D graphics engine.
He touches upon the paradigm shift associated with the spreadsheet dataflow programming model in comparison to traditional sequential imperative programs.
If you've been following my blog you know that my master's thesis argues that a spreadsheet model supporting various sheet calculation modes (non-blocking, blocking and asynchronous) should replace the von Neumann instruction stream as the primitive model for parallel architectures. It tingles to see someone else discovering the same.
Wednesday, February 27, 2008
DRC Stirs the Kool-Aid
FPGA Journal recently published an article by Michael R. D'Amour of DRC Computer titled "Reconfigurable Computing for Acceleration in HPC." It's refreshing to see other FPGA computing enthusiasts populating the internet. The article discusses the marketable benefits of FPGA accelerators and the barriers this field has traditionally faced in its attempts to break into the mainstream. Early in his article, he hits the nail on the head:
"While there were many reasons for this lack of early widespread acceptance, one major issue has been a lack of standards surrounding reconfigurable processors and the absence of a standardized system architecture that can effectively employ them."
Standardization, and open standardization in particular, is the only way a technology can mature from a high-margin, low-volume market to a low-margin, high-volume market. We've already seen FPGAs win embedded markets by integrating standard microcontroller architectures. The reason technological maturation requires standards is that the transition requires competition, and standardization provides the parameters for competition. Proprietary standards simply erect other barriers to entry (NDAs or reverse engineering). Fortunately there are enough of us who think that competing over a small pie is less fun than making the pie bigger, which is the answer to the skeptic's question of why AMD would want open bus standards if they allow someone to compete them off their own bus. AMD is growing an ecosystem in the process, but since they viably compete in the more standard and bigger x86 market, they really don't have much to fear from FPGAs, which are still mired in proprietary programming tools. Still, FPGA computing is growing its niches, and since many people have been working on these standardization problems, Michael is somewhat justified in posing this outlook:
"Many of the technological barriers to widespread use of reconfigurable computers have been overcome, and with standards-based reconfigurable interfaces and hardware plus a growing body of standard language compilers to support the technology, reconfigurable computing is poised to break through as a viable solution for a wide range of commercial and HPC applications."
This statement is mostly accurate since it starts with the word "many." Certainly DRC deserves a lot of credit for producing a viable co-processor following AMD's open bus specs. An equally accurate statement would go like this: many of the technological barriers to widespread use of reconfigurable computers are a long way from being overcome, with proprietary hardware interfaces and expensive development environments impeding development of viable solutions for commercial and HPC applications.
Indeed, Michael approximates this sentiment in his article while explaining why RC hasn't caught on. The problem is, when do we acknowledge the transition has happened? Is it by popular vote of the bloggers? Because my casual observation is that FPGAs are losing to GPUs and multicores there.
I think it has a lot to do with accessibility of the technology. You actually need people building applications. And right now, it's pretty darn tedious and expensive to build any substantial FPGA design (check out the project videos of 300+ MIT students who mostly agree). After you do build large designs, it's non-trivial to target a new architecture. FPGA acceleration requires FPGA acceleration, or at least some semblance of iterative dynamism. I'll know the reconfigurable computing wave is coming when an FPGA is running its own open-source toolchain, because that means Xilinx or Altera or somebody will actually be drinking from their own juice boxes. This is not sarcastic, though. Why should I buy reconfigurable computing if Xilinx and Altera don't even buy it yet?
Languages and tools for FPGAs are perhaps getting better for the HPC niche, but this is only a start, and I will continue to deny that reconfigurable computing has a standard language until someone shows me a language that supports reconfiguration as a primitive notion. I told someone who works at Bluespec about my variant that allows dynamic elaboration and dynamic reassignment of modules, and he wondered how that could be useful since it would take so long to re-target the device after any dynamic changes. The "field programmable" thing is still an inside joke among reconfigurable computing enthusiasts, but as I've said before, marketing reconfigurability is a good way for a computing startup to find no market at all. Everyone at Xilinx knows this because everyone who works there has at some point cynically made the connection between reconfigurable computing and Gallium Arsenide. Verification and low-volume embedded are the FPGA markets, and they pwned microcontrollers to take the latter, but will an FPGA driving general-purpose computing always be a thing of the future?
Despite the tools and proprietary architectures, we will start to see profitable FPGA computing applications pick up, especially where deterministic latency is a big win factor or where the speedup potential is several orders of magnitude. But there is a long and bumpy road for FPGAs to break through to a wide range of potential applications. What's really holding back FPGA computing is more than just standards; it's the inertia of the EDA industry and its market model.
The economic and perhaps cultural barriers for FPGA co-processors are nearly as complex as the technological barriers. The real, and ironic, enemy of FPGA accelerated computing is Xilinx, Altera and the entrenched FPGA EDA industry, who are most responsible for the proprietary nature of the current toolset and architectures. These organizations are actually somewhat threatened by a disturbance to their high margins from the embedded-system and verification markets. Breaking through into computing applications means higher volumes of lower-cost hardware and similarly cheap and familiar tools that don't require special training. If the cost structure doesn't change, then the ROI will rarely be found. Indeed, Xilinx and Altera have resisted commoditization of their high-end components, and they keep their low-level APIs mostly under wraps. Their software is free like free beer, but it isn't Free like freedom yet. If low-level FPGA computing research requires an NDA to access the binary formats, then an iconoclastic high-school student can't just look it up on the net and hack together a compiler over a rainy weekend in Seattle. Didn't Microsoft just open their binary formats? Maybe they could start a trend?
The FPGA EDA community is going to be shaken by a different market model that doesn't take kindly to expensive proprietary tools. The potential scale of a viable FPGA computing industry could compete out-of-existence many current EDA business models that fail to adapt.
Edit on March 3:
Michael's article originally appeared July 13, 2007 in HPCwire
FCCM '07 had a paper which mentioned how proprietary formats have stifled FPGA computing research
A paper and website about reverse-engineering bitstream formats
There's also a comp.arch.fpga thread from 2000 about "FPGA openness" (Google for it)
"While there were many reasons for this lack of early widespread acceptance, one major issue has been a lack of standards surrounding reconfigurable processors and the absence of a standardized system architecture that can effectively employ them."
Standardization, and Open Standardization in particular is the only way a technology can mature from a high margin, low volume market to a low margin, high volume market. We've already seen FPGAs win embedded markets by integrating standard microcontroller architectures. The reason technological maturation requires standards is because the transition requires competition. Standardization provides parameters for competition. Proprietary standards simply erect other barriers to entry (NDAs or reverse engineering). Fortunately there's enough of us who think that competing over a small pie is less fun than making the pie bigger, which is the answer to the skeptics question why AMD would want open bus standards if it allows someone to compete them off their own bus. AMD is growing an eco-system in the process, but since they viably compete in the more standard and bigger x86 market, they really don't have much to fear from FPGAs which are still mired by proprietary programming tools. Still FPGA computing is growing it's niches and since many people have been working on these standardization problems, Michael is somewhat justified in posing this outlook:
"Many of the technological barriers to widespread use of reconfigurable computers have been overcome, and with standards-based reconfigurable interfaces and hardware plus a growing body of standard language compilers to support the technology, reconfigurable computing is poised to break through as a viable solution for a wide range of commercial and HPC applications."
This statement is mostly accurate since it starts with the word "many." Certainly DRC deserves a lot of credit for producing a viable co-processor following AMDs open bus specs. An equivalently accurate statement would go like this: Many of the technological barriers to widespread use of reconfigurable computers are a long way from being overcome, with proprietary hardware interfaces and expensive development environments impeding development of viable solutions for commercial and HPC applications.
Indeed Michael approximates this sentiment in his article while explaining why RC hasn't caught on. The problem is, when do we acknowledge the transition has happened? Is it by popular vote of the bloggers, because my casual observation is that FPGAs are losing to GPUs and Multicores there.
I think it has a lot to do with accessibility of the technology. You actually need people building applications. And right now, it's pretty darn tedious and expensive to build any substantial FPGA design (Check out the project videos of 300+ MIT students who mostly agree). After you do build large designs, its non-trivial to target a new architecture. FPGA acceleration requires FPGA acceleration or at least some semblance of iterative dynamism. I'll know the reconfigurable computing wave is coming when an FPGA is running its own open source tool-chain because that means Xilinx or Altera or somebody will actually be drinking from their own juice-boxes. This is not sarcastic though. Why should I buy reconfigurable computing if Xilinx and Altera don't even buy it yet?
Languages and tools for FPGAs are perhaps getting better for the HPC niche, but this is only a start and I will continue to deny that reconfigurable computing has a standard language until someone shows me a language that supports reconfiguration as a primitive notion. I told someone who works at Bluespec about my variant that allows dynamic elaboration and dynamic reassignment of modules and he wondered how that could be useful since it would take so long to re-target the device after any dynamic changes. The "field programmable" thing is still an inside joke among reconfigurable computing enthusiasts, but as I've said before, marketing reconfigurability is a good way for a computing startup to find no market at all. Everyone at Xilinx knows this because everyone who works there at some point cynically made the connection between reconfigurable computing and Gallium Arsenide. Verification and low-volume embedded are the FPGA markets and they pwnd micro-controllers to take the latter, but will an FPGA driving general purpose computing always be a thing of the future?
Despite the tools and proprietary architectures, we will start to see profitable FPGA computing applications pick up, especially where deterministic latency is a big win-factor or where the speedup potential is several orders of magnitude. But there is a long and bumpy road for FPGAs to break through to a wide range of potential applications. What's really holding back FPGA computing is more than just standards, its the inertia of the EDA industry and its market model.
The economic and perhaps cultural barriers for FPGA co-processors are nearly as complex as the technological barriers. The real, and ironic enemy of FPGA accelerated computing is Xilinx, Altera and the entrenched FPGA EDA industry who are most responsible for the proprietary nature of the current toolset and architectures. These organization are actually somewhat threatened by a disturbance to their high margins from the embedded system and verification markets. Breaking through into computing applications means higher volume of lower cost hardware and similarly cheap and familiar tools that don't require special training. If the cost structure doesn't change then the ROI will rarely be found. Indeed Xilinx and Altera have resisted commoditization of their high-end components and they keep their low-level API's mostly under wraps. Their software is free like free beer, but it isn't Free like freedom yet. If low level FPGA computing research requires an NDA to access the binary formats then an iconoclastic high-school student can't just look it up on the net and hack together a compiler over a rainy weekend in Seattle. Didn't Microsoft just open their binary formats? Maybe they could start a trend?
The FPGA EDA community is going to be shaken by a different market model that doesn't take kindly to expensive proprietary tools. The potential scale of a viable FPGA computing industry could compete out-of-existence many current EDA business models that fail to adapt.
Edit on March 3:
Michael's article originally appeared July 13, 2007 in HPCwire
FCCM '07 had a paper which mentioned how proprietary formats have stifled FPGA computing research
A paper and website about reverse-engineering bitstream formats
There's also a comp.arch.fpga thread from 2000 about "FPGA openness" (Google for it)
Wednesday, January 16, 2008
Tech Predictions for 2008
Sup. Still building a PDP-11 for Ontario Power Generation. I also have a DRC Computer box with an FPGA and a webcam doing image recognition and motion tracking. We're recruiting people for Catalyst Accelerated Computing if you want to have some fun.
Here are my Tech Predictions for 2008
1) P2P Live Video Streaming Goes HD. A cheap dongle will connect your HDMI TV to the internet: "vendors of copyrighted material" will rush to this space in 2008 (like Netflix and Apple TV), but video-phoning and pirates will ultimately be the norm. The Pirate Party will start to make an impact in the Swedish election, but the legal issue of information ownership will continue to waste paper. We've already been playing with our own P2P streaming model for a while, and similar technology is close to taking over BitTorrent. Combined with the Writers' Strike, this technology will end the old media market model. Arr.
2) Software Defined Radio (SDR) -- Google's efforts to free the 700 MHz spectrum are the beginning of a transition in RF policy. The rest of the spectrum will slowly transition to "best-effort" spectrum-sharing protocols (like Ethernet), with SDR chips allowing software-programmable frequency selection and data decoding.
3) Accelerated Computing Moves (closer to) Mainstream. In December of last year, AMD named a VP of Accelerated Computing, and there's a new video about their initiative from CES. An AMD VP is a signal that startups are going to swarm this market now. FPGA and GPU co-processor integration will still suck well into 2009, but people who have mastered this black art will be in high demand to retool Wall Street and other HPC niches where GOPs and dollars are strongly correlated.
4) Gestural Control and Motion Tracking. iPhone multitouch will still have that magic effect for a few more years, but now you can do it with a WiiMote. Webcam hand tracking will start to catch on in 2008. We've been having fun with this on our DRC box. Full-Body DDR will be a major video-game product in 2008 and play off the Rock Band phenomenon to give new meaning to the words: "so you think you can dance?"
5) Social Networking goes Botnet. MySpace sucks. Facebook sucks less. Social networking is a buzzword past its prime. People will realize that a social network and a botnet are roughly the same thing, with the computational components generally replaced by protein. Aggregation will pick up and distributed content caching will start to make more sense. We've been experimenting with distributed computation using Javascript. Combined with some photo-sharing/recognizing Amazon Mechanical Turk thing, we can use the buzzword Web 3.0 to brand the resulting distributed intelligence. Maybe people won't be so creeped out when they realize they are being datamined by Facebook (and everyone else who wants to datamine social network data).
Monday, October 08, 2007
Mandelbrot Set in Excel
As part of my "anything you can do, I can do in Excel" crusade, I made a spreadsheet to compute the Mandelbrot set (2.5 MB file). It takes roughly 5 seconds per iteration on a single-core Pentium 4 at 2.6 GHz. Excel also seems to consume 128 MB of memory to compute the 400 x 350 x 2 sheets of cells (compared to 28 MB just to run Excel). The slowness and memory overhead (probably related) of this operation point to some serious design flaws in the underlying spreadsheet calculation system. Considering the spreadsheet file is around 2.5 MB, I can't see why the whole computation should require more than, say, 5 MB (one copy to store the current iteration and one to store the next, in order to preserve atomicity). Since Excel's iteration uses blocking assignments in reading order, the iteration could actually require no more memory than the 2.5 MB sheet itself if atomicity of the iteration is not a concern. Since this spreadsheet is embarrassingly parallel and requires vastly more floating-point resources than can fit on an FPGA, it makes a good benchmark for a resource-sharing compiler (aka a serializing compiler).
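For reference, the iteration each cell computes is just the usual complex quadratic map split into real arithmetic (presumably the two sheets hold the real and imaginary parts):

\[ x_{n+1} = x_n^2 - y_n^2 + c_x, \qquad y_{n+1} = 2\,x_n y_n + c_y \]

with a cell counted as diverged once \( x_n^2 + y_n^2 > 4 \).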
Time for pictures. These are produced by Excel's conditional cell coloring. The images link to bigger PNG versions.
After 1 iteration (minx = -2, maxy = 1, dx = .01, dy = .01)
After 2 iterations
Several iterations later, it starts to resemble a Mandelbrot set.
Several more iterations and some cells start to diverge.
Everyone's favorite fractal: the Mandelbrot Set!
Zooming in on a particular section and iterating a bunch.
Saturday, September 15, 2007
Datacenter Power Management: Subverting the Dominant Paradigm
The purpose of this entry is to dispel the meme that increasing server utilization is a viable long-term approach to power management. The dominant paradigm holds that virtualization technology will allow you to increase your server utilization and therefore improve total performance per watt. White papers from AMD and Intel both discuss consolidation as a means of data center power reduction, and PG&E incentivizes the use of virtualization technology. While encouraging managed virtual machines is a good model for reducing power consumption, increasing utilization is not a viable long-term strategy for optimizing GOps/Watt.
The point I would like to make is that in order to decrease power consumption, it may be necessary to buy more chips and distribute the load better. As a rule-of-thumb, in traditional CMOS hardware design, power consumption is cubically dependent on clock frequency. In ultra-low-power systems employing subthreshold, adiabatic and asynchronous logic, this speed-power relationship has an even higher order.
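For those wondering where the cubic rule of thumb comes from, the standard back-of-the-envelope derivation goes like this: dynamic CMOS power is

\[ P_{dyn} = \alpha\, C\, V_{dd}^2\, f \]

and since the supply voltage must scale roughly linearly with clock frequency to meet timing, \( V_{dd} \propto f \) gives \( P_{dyn} \propto f^3 \).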
An analogy that makes sense to everyone: when you coast your car, you achieve optimal miles-per-gallon but no meaningful speed. Applying a little gas gives an optimal cruising speed, but pushing the engine to maintain a high speed drops your fuel efficiency again.
As a reference point, you can buy microcontrollers that run at 1 MHz while consuming ~400 micro-Watts. Architectures with finer-grained power and frequency management would allow you to distribute 1000 1-MHz virtual machines onto 1000 1-MHz cores instead of consolidating them onto a single 1-GHz core. By the cubic rule of thumb, each of the 1000 1-MHz cores consumes a billionth of the power, resulting in one millionth the total power consumption. Static currents and fixed overhead cause this cubic model to break down at some "power-optimal clock speed."
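Working the numbers through the cubic rule (and ignoring the static currents and fixed overhead just noted):

\[ P_{1\,\mathrm{MHz}} = \left(\tfrac{1}{1000}\right)^{3} P_{1\,\mathrm{GHz}} = 10^{-9}\, P_{1\,\mathrm{GHz}}, \qquad 1000 \times P_{1\,\mathrm{MHz}} = 10^{-6}\, P_{1\,\mathrm{GHz}} \]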
Dynamic voltage and frequency scaling in multicore arrays may allow each core to have its own power/clock domain in a globally asynchronous model. To optimize for static currents, using dynamic threshold scaling (modulating the body bias voltage of the chip) along with dynamic voltage scaling seems to be a viable technique. Here's a spreadsheet model for leakage current in a transistor, varying temperature, power and threshold voltage across a reasonable range. At lower frequencies, higher threshold voltages can be used to offset leakage power consumption. Such dynamic leakage reduction cannot be achieved using only multi-threshold CMOS (high-threshold power-enable transistors). Since static leakage current is an increasingly dominant factor in chip power consumption as channel lengths shrink to 45 nm and below, combining threshold scaling with power-frequency scaling yields a higher-than-cubic relationship between power consumption and clock speed.
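For reference, the standard first-order subthreshold model shows why threshold scaling has so much leverage (presumably the kind of relationship the spreadsheet above is built on; \( n \) is the subthreshold slope factor and \( V_T = kT/q \) the thermal voltage):

\[ I_{sub} \propto e^{(V_{gs} - V_{th}) / (n V_T)} \]

so leakage falls exponentially as \( V_{th} \) rises, and grows quickly with temperature.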
Instead of increasing the utilization of a single chip, we can fundamentally decrease our need for gigahertz frequency computation and thus decrease power consumption by increasing the parallelism of our systems. Invoking Seymour Cray: while 1024 chickens may not plow a field as quickly as 2 Strong Oxen, they may plow it fast enough and for a lot less chicken-feed.
Monday, September 10, 2007
Evolving Beyond One-Dimensional Programming; Why Foo-C is not a Parallel Programming Environment
In my last blog entry, I introduced the spreadsheet parallel programming model that I will be evangelizing for the next several years. The goal of this entry is to contrast the spreadsheet programming model with existing attempts at C-like parallel programming environments. I argue that one-dimensional von Neumann languages are insufficient for expressing the objectives of a parallel program and that any "parallel programming" extension to C is a temporary solution to a bigger problem.
A programming objective is most generally one of two things: a functional procedure or a state modifying atomic action. In hardware, these notions respectively correspond to resolving a combinatorial circuit and latching a register. In software descriptions these correspond to evaluating a pure-function and assigning the result to the left-hand-side variable. Often these concepts are not separated in the mind of the sequential imperative programmer; each piece of code is thought of as a sequence of instructions that is executed as it is read. This programming model matches the hardware most software engineers have used for the past 60 years and so it is the dominant paradigm.
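In Verilog the two objectives are syntactically distinct. Here is a minimal sketch (module and signal names are my own invention):

    module objectives (input clk, input [7:0] a, b, output reg [7:0] acc);
      // Functional procedure: a combinatorial circuit that resolves
      // continuously -- a pure function of a and b, with no state.
      wire [7:0] sum = a + b;

      // State-modifying atomic action: the register latches its new
      // state atomically on the clock edge.
      always @(posedge clk)
        acc <= acc + sum;
    endmodule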
In the sequential school of thought, functions are just subroutines which return a value. Since most languages do not strictly isolate state assignment from functional value determination, the order in which functional computations and assignments are performed matters to the correctness of the program. Thus a program is generally some contortion within a language to achieve either a functional or atomic-action objective. In this sense, the programmer is devising not a solution to a problem, but an expression of the solution in some language.
Software programmers rarely create code the way we think about the problem, but rather we solve the problem in the manner that the language provides. Computer programming has predominantly been the engineering practice of projecting combinatorial functions or atomic actions into a one-dimensional execution order. The work of a parallelizing compiler may be considered a process of "re-functionalizing" the one-dimensional projection produced by the programmer. Unfortunately, sequential imperative control flow confounds a parallelizing compiler and introduces a bottleneck which cannot easily be anti-projected from one-dimension to many. Speculation seems to be the mode for improving the concurrency of our code, though parallelism is not achieved through brute-force speculation alone since speculation is an inefficient parallelization primitive when we explore too many unlikely paths.
Herein lies the key difference between software and reconfigurable hardware programming. Software is a control-centric model and reconfigurable computing is a data-centric model. Any sort of programmatic control flow in reconfigurable hardware must be explicitly managed by state machines, whereas control flow in an instruction stream is implicit (next instruction) or managed with branching. The C programmer mangles what could have been a multiplexer into control-flow case statements. An if statement evaluates only one path, and only after the predicate resolves, while a multiplexer resolves all of its inputs simultaneously, even as the predicate is being determined. Expressing this sort of parallelism is impossible in C since simultaneity is not a component of the language.
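To make the contrast concrete, here is a sketch in Verilog (the functions of a and b are arbitrary stand-ins of my choosing):

    module mux_demo (input sel, input [7:0] a, b, output [7:0] y);
      // Both candidate results are computed concurrently, in parallel
      // with the predicate itself resolving.
      wire [7:0] fa = a * 3;   // some function f(a)
      wire [7:0] gb = b + 7;   // some function g(b)

      // The multiplexer merely selects between results that already
      // exist. The C version -- if (sel) y = f(a); else y = g(b); --
      // serializes: first resolve sel, then evaluate only the taken arm.
      assign y = sel ? fa : gb;
    endmodule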
Ultimately, the problem for a parallelizing compiler resolving C-like control flow is this: when do we want our conditional statements to be an "if" and when do we want them to be a multiplexer? By writing code in a von Neumann language we force one-dimensionality on our code and explicitly sequentialize what could otherwise be a parallel expression.
Wednesday, September 05, 2007
A Replacement for the von Neumann Model
"What is a von Neumann computer? When von Neumann and others conceived it over thirty years ago, it was an elegant, practical, and unifying idea that simplified a number of engineering and programming problems that existed then. Although the conditions that produced its architecture have changed radically, we nevertheless still identify the notion of "computer" with this thirty year old concept."
John Backus, “Can Programming Be Liberated from the von Neumann Style?” Turing Award Lecture, 1977
Yesterday I submitted my Master's thesis to MIT's department of Electrical Engineering and Computer Science. In my thesis I present my answer to the problem: how do we program parallel architectures?
My answer is to use a spreadsheet as a replacement for the von Neumann model, since modern hardware more accurately resembles an array of processing units than a single instruction-stream executer. Since spreadsheets are already a well-established dataflow development paradigm, tools already exist to make spreadsheet specification extremely rapid and straightforward. Unfortunately, existing spreadsheet applications aren't designed with the intent of being complete programming languages, and they often treat circular references as errors--though you can enable manual iteration in Excel (try making a counter, an IIR filter and a RISC CPU in Excel as an exercise--F9 to iterate). To make spreadsheets a better parallel programming tool I added multiple interpreter modes (asynchronous, blocking and non-blocking assignments like Verilog), behavioral abstraction, guarded atomic action macros and compilation to FPGA, GPU and multicore.
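For readers who haven't written Verilog, here is a toy illustration of the blocking versus non-blocking distinction the interpreter modes borrow (the example is mine, not from the thesis):

    module assign_modes (input clk);
      reg [7:0] a = 1, b = 2;   // updated with blocking assignments
      reg [7:0] x = 1, y = 2;   // updated with non-blocking assignments

      always @(posedge clk) begin
        // Blocking: executes in reading order, like a spreadsheet
        // recalculating cell by cell. Afterwards a == 2 and b == 2;
        // the old value of a is lost.
        a = b;
        b = a;

        // Non-blocking: all right-hand sides are sampled before any
        // assignment commits, like atomically iterating the whole
        // sheet. x and y genuinely swap every cycle.
        x <= y;
        y <= x;
      end
    endmodule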
My thesis title is "Compiling and Optimizing Spreadsheets for FPGA and Multicore Execution"
Here is the abstract:
A major barrier to developing systems on multicore and FPGA chips is an easy-to-use development environment. This thesis presents the RhoZeta spreadsheet compiler and Catalyst optimization system for programming multiprocessors and FPGAs. Any spreadsheet frontend may be extended to work with RhoZeta’s multiple interpreters and behavioral abstraction mechanisms. RhoZeta synchronizes a variety of cell interpreters acting on a global memory space. RhoZeta can also compile a group of cells to multithreaded C or Verilog. The result is an easy-to-use interface for programming multicore microprocessors and FPGAs. A spreadsheet environment presents parallelism and locality issues of modern hardware directly to the user and allows for a simple global memory synchronization model. Catalyst is a spreadsheet graph rewriting system based on performing behaviorally invariant guarded atomic actions while a system is being interpreted by RhoZeta. A number of optimization macros were developed to perform speculation, resource sharing and propagation of static assignments through a circuit. Parallelization of a 64-bit serial leading-zero-counter is demonstrated with Catalyst. Fault tolerance macros were also developed in Catalyst to protect against dynamic faults and to offset costs associated with testing semiconductors for static defects. A model for partitioning, placing and profiling spreadsheet execution in a heterogeneous hardware environment is also discussed. The RhoZeta system has been used to design several multithreaded and FPGA applications including a RISC emulator and a MIDI controlled modular synthesizer.
Wednesday, August 08, 2007
There's probably a few hundred of us
I haven't updated this blog in a while. I've been traveling and writing a thesis full of monadic deliciousness and behavioral invariant transformations--coming soon. I started a job last month making a PDP-11 emulator at Quickware.
There's probably a few hundred of us who are watching the accelerated computing meme spread. It is catching on, and there are probably a few dozen of us ready to put our mouths where the money is when Wall Street catches on too--email me, by the way (my name at where I work).
Just wanted to point out the irony that big supercomputers are used to analyze the carbon dioxide emissions caused by the ever-increasing amount of energy required to power our big supercomputers.
And programming environments always seem to suck unless you build them yourself. Thank Gerry for Scheme though.
Wednesday, April 25, 2007
AMD White Paper
Joe Landman over at scalability.org has written up a white paper for AMD on accelerated computing technology.