Wednesday, June 18, 2008

Chips from NVidia and ClearSpeed

Yesterday I wrote about the GFlops/Watt numbers for AMD's new GPGPU and for ClearSpeed's 2-2.5 GFlops/Watt chips. It looks like ClearSpeed has a new card that does 4 GFlops/Watt at double precision.

NVidia has a new Tesla too. According to the FPGA Journal article, you can buy 4 TFlops consuming only 700 W, or 5.7 GFlops/Watt (it's unclear whether the FPGA Journal numbers are for double or single precision, but I assume single precision). At 10 cents per kilowatt-hour, a teraflop of Teslas will cost you about 153 dollars to run for a year. Not bad.
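If you want to check the arithmetic, here's a quick back-of-the-envelope script. The 4 TFlops and 700 W inputs are just the figures quoted above, taken at face value, and the electricity price is the assumed 10 cents per kWh:

```python
# Back-of-the-envelope check of the Tesla numbers quoted above.
TFLOPS = 4.0           # quoted peak throughput, TFlops
WATTS = 700.0          # quoted power draw, W
PRICE_PER_KWH = 0.10   # assumed electricity price, dollars per kWh

gflops_per_watt = (TFLOPS * 1000.0) / WATTS          # ~5.7 GFlops/Watt
watts_per_tflop = WATTS / TFLOPS                     # 175 W per TFlop
kwh_per_year = watts_per_tflop * 24 * 365 / 1000.0   # ~1533 kWh per TFlop-year
cost_per_tflop_year = kwh_per_year * PRICE_PER_KWH   # ~$153

print(f"{gflops_per_watt:.1f} GFlops/Watt")
print(f"${cost_per_tflop_year:.0f} per TFlop-year")
```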

Frankly, there's too much marketing and handwaving in these specs and not enough real numbers to conclude who dominates in efficiency.

Tuesday, June 17, 2008

AMD's new chips, OpenCL

HPCWire reports on AMD's latest GPUs clocking in at 200 GFlops of double precision performance under 150 Watts, or 1.33 GFlops/Watt. That translates to a double-precision petaflop for 0.75 MW, compared to RoadRunner, which consumes 3 MW. The AMD GPU delivers about 2-3 times the peak GFlops/Watt of published FPGA floating-point numbers, though I speculate the new Altera Stratix IV may be competitive. ClearSpeed apparently wins the double-precision efficiency competition with its 2 GFlops/Watt and 2.5 GFlops/Watt chips. Performance for specific functions can vary substantially, though there is still no standard language that makes it practical to spec.
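Here is the same sanity check for the AMD and RoadRunner figures, using only the numbers reported above:

```python
# Sanity check of the efficiency figures quoted above.
AMD_GFLOPS = 200.0    # reported double precision peak, GFlops
AMD_WATTS = 150.0     # reported power, W
ROADRUNNER_MW = 3.0   # reported RoadRunner power for a petaflop, MW

amd_gflops_per_watt = AMD_GFLOPS / AMD_WATTS   # ~1.33 GFlops/Watt
petaflop_watts = 1e6 / amd_gflops_per_watt     # 1e6 GFlops in a PFlop
petaflop_mw = petaflop_watts / 1e6             # ~0.75 MW

print(f"{amd_gflops_per_watt:.2f} GFlops/Watt")
print(f"{petaflop_mw:.2f} MW per double-precision petaflop "
      f"(vs {ROADRUNNER_MW} MW for RoadRunner)")
```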

AMD claims that it will support the OpenCL ("Open Computing Language") specification. OpenCL is still nonexistent as far as I can tell. From the HPCWire article:

"In an attempt to unify the programmer interface AMD has announced its support for OpenCL."

Steve Jobs mentioned OpenCL support in Snow Leopard, and now it looks like the Khronos Group is trying to organize the effort to actually make the standard. Intel should join the fun and say that Larrabee will support the OpenCL standard.

Thursday, June 12, 2008

"Programmable Arrays" are more than "Reconfigurable HDL Executers"

A blog post by David Pellerin of ImpulseC fame, "Reconfigurable, Reconshmigurable," links to Vhayu's announcement of a compression IP for FPGA-accelerated ticker systems.

To me, the most interesting part of the article is:

"Some Wall Street executives interested in using FPGAs to accelerate applications, or portions of them, however, have expressed the concern that it's hard to find programmers who are skilled at writing applications for FPGAs."

I keep hearing this brought up at low-latency Wall Street IT conferences, so it's definitely a real issue. Reconfigurable computing desperately needs a common open framework that minimizes the learning curve for programming FPGA hardware. The problem is that the FPGA industry has the inertia of its EDA upbringing, so people assume that the primitive language for programmable arrays should be a Hardware Description Language, but finding HDL programmers is hard.

I think it's time to drop the "hardware" from reconfigurable hardware and just think about programmable arrays. From this perspective, it is a bit ironic that Wall Street executives have trouble finding FPGA programmers: programmable arrays have been the primary computational metaphor used by financial services since before they even had electronic computers to iterate their spreadsheets for them.

All FPGA hardware is actually programmed with a proprietary bitstream language that is much more closely related to programming a spreadsheet than to an HDL (specify 2-D grid coordinates, specify a function, connect to precedents). However, instead of providing software tools for programmable arrays, FPGA vendors stick to their EDA roots. Because it has been so profitable, the FPGA industry has fallen into HDL la-la-land while obscuring the low-level interfaces to its physical devices.
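To make the spreadsheet analogy concrete, here is a minimal, purely hypothetical sketch of that cell-oriented view of a programmable array. The names and structure are my own invention for illustration and have nothing to do with any vendor's actual (proprietary) bitstream format:

```python
# Toy model of the "spreadsheet" view of a programmable array: each cell has
# 2-D grid coordinates, a function, and a list of precedent cells it reads.
from typing import Callable, Dict, List, Tuple

Coord = Tuple[int, int]

class Cell:
    def __init__(self, fn: Callable[..., float], precedents: List[Coord]):
        self.fn = fn                  # the function this cell computes
        self.precedents = precedents  # coordinates of the cells it reads

def evaluate(grid: Dict[Coord, Cell], inputs: Dict[Coord, float],
             coord: Coord) -> float:
    """Recursively evaluate a cell from its precedents (assumes no cycles)."""
    if coord in inputs:
        return inputs[coord]
    cell = grid[coord]
    args = [evaluate(grid, inputs, p) for p in cell.precedents]
    return cell.fn(*args)

# A tiny 3-cell "array": (1,0) adds two inputs, (2,0) doubles that sum.
grid = {
    (1, 0): Cell(lambda a, b: a + b, [(0, 0), (0, 1)]),
    (2, 0): Cell(lambda x: 2 * x, [(1, 0)]),
}
print(evaluate(grid, {(0, 0): 3.0, (0, 1): 4.0}, (2, 0)))  # 14.0
```

A real device adds routing, timing, and configuration constraints on top of this, but the underlying programming model is no more exotic than filling in cells that reference other cells.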

I would go so far as to say that there has been no real vendor of hardware programmable arrays since Xilinx stopped documenting how to reconfigure their arrays. They might sell you "field programmable gate arrays" as a naming convention, but what you really get from these vendors is a "reconfigurable HDL executer." If you want to actually use an FPGA like a programmable array, you need to reverse engineer the proprietary bitstreams. The FPGA vendors actually don't have much interest in making their programmable arrays useful as programmable arrays because they make a killing selling reconfigurable HDL execution systems.

But with growing interest in FPGAs outside the traditional hardware-development niches, vendors quickly realized that they absolutely cannot sell HDL execution systems to people interested in using programmable arrays for their computational needs. Modern forays into C and Matlab synthesis help address this programmability problem for certain markets, but these tools are often entirely reliant on an HDL-centric toolflow and obscure the physical constraints of the underlying programmable array even more. The hiding of low-level abstractions that comes with high-level languages is fine (and even desirable) for end-user application development, but using C as a framework for mapping 4GLs to FPGA hardware is just as backwards as coding a Java VM in Verilog and expecting good performance on a single-threaded CPU.

For the FPGA market to mature into computing applications, FPGA hardware vendors need to get back to selling hardware programmable arrays instead of HDL executers. If they want to compete with CUDA in HPC, they should take a cue from NVidia and provide complete low-level APIs to their hardware. Instead of hyping up easy-to-program high-level languages for particular application niches, the hardware vendors need to focus on making and documenting better low-level languages for their hardware.

The fundamental concept of a programmable array is simple: everyone groks a spreadsheet. No one should ever be forced to target a programmable array as if it were a reconfigurable HDL machine.

Monday, June 09, 2008

Petaflop in the NY Times

A Petaflop of Cell processors made the NY Times. Highlights of the article: 12,960 total Cell chips with 9 cores each, for (9 * 12960) = 116,640 cores.

The article tries twice to frame the supercomputing top spot as an issue of national pride. It also discusses the difficulty of programming these devices and how the next generation of consumer products will require programming paradigms for massively multicore hardware. The article also mentions that the three types of cores require a heterogeneous partitioner. For now, they are probably doing manual partitioning and making sure their designs are highly symmetric. If we want to build large computational pipelines, we need a hardware-agnostic parallel programming model that handles partitioning, placement, and profiling.

According to an OpenFPGA Corelib presentation from Altera last Thursday, we could probably get a Petaflop by replacing all the Cells in this deployment with FPGAs. It seems plausible that Petaflop-capable FPGA supercomputers will exist and will be better used for 2-bit DNA problems.

Brute-force scaling and twice the funding will get us an ExaFlop at 32 nm. The next major leap in supercomputing is going to require a materials/fabrication advance. FinFETs and 3-D integration will get us a ZettaFlop in the sub-22 nm range.

I expect molar-scale integration using directed self-assembly of reconfigurable arrays will disrupt this process sometime in the 5 to 10 year range. We will then realize the supercomputers we are building to study global warming are the cause of global warming.