Monday, September 22, 2008

High Performance on Wall Street 2008

A freshly rebranded Catalyst Accelerated Computing went to the High Performance on Wall Street conference today. The usual suspects were present. Speedup results all seemed to match the gender ratio at supercomputing conferences.

FPGAs got bashed in an early session. "It's hard to find people to program FPGAs" came up at least twice during the conference (we consult!). I heard "threads are the future" ... oy. After the early-morning thread-love-fest, the accelerator panel defended the honor of the gate array well.

Technical stuff that irked me:

Multithreading and multicore aren't the same thing. Multicore processors can run multiple processes connected by explicit pipes, or they can run multiple threads sharing memory through a global coherency protocol. Multiple processes with pipes between them is the dataflow or "streaming" model, like a spreadsheet or like "ls -Al | grep foo > bar". The multithreading model should be avoided like sub-prime mortgages, and for the same reason: it causes your system to crash in mysterious ways.
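
To make the contrast concrete, here is a minimal sketch of the processes-and-pipes style in Python (purely illustrative; the stage names and data are made up). Each stage owns its own state and only talks to its neighbors over an explicit channel, which is exactly what makes the dataflow model debuggable:

    from multiprocessing import Process, Queue

    def producer(out_q):
        # stands in for "ls -Al": emit a stream of items
        for name in ["foo.txt", "bar.log", "foobar.c", "baz.h"]:
            out_q.put(name)
        out_q.put(None)  # end-of-stream marker

    def grep(pattern, in_q, out_q):
        # stands in for "grep foo": filter the stream
        while True:
            item = in_q.get()
            if item is None:
                break
            if pattern in item:
                out_q.put(item)
        out_q.put(None)

    def consumer(in_q):
        # stands in for "> bar": sink the stream
        while True:
            item = in_q.get()
            if item is None:
                break
            print(item)

    if __name__ == "__main__":
        q1, q2 = Queue(), Queue()
        stages = [Process(target=producer, args=(q1,)),
                  Process(target=grep, args=("foo", q1, q2)),
                  Process(target=consumer, args=(q2,))]
        for p in stages:
            p.start()
        for p in stages:
            p.join()

No locks, no shared mutable state: each stage could just as well be a separate machine or a block of gates.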

Multithreading is the use of multiple instruction streams sharing a global address space. It was originally a method of hiding latency: when one thread stalls waiting for I/O or memory, the core switches context to another thread. Intel's "hyperthreading" does exactly this, interleaving two thread contexts on one core so that it appears to the OS as two cores. Sharing a global memory space this way hides memory-access latency, which is enormous compared to the clock period. Cores can carry a lot of threads: Sun's open-source OpenSPARC T1 runs 32 hardware threads across its eight cores.
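
For illustration only, here is the latency-hiding idea in miniature (Python threads, with a sleep standing in for a memory or I/O stall; the names and delays are invented):

    import threading, time

    def fetch(name, delay):
        time.sleep(delay)  # stands in for a long memory or I/O stall
        print(name, "done after waiting", delay, "seconds")

    start = time.time()
    threads = [threading.Thread(target=fetch, args=("stream-1", 1.0)),
               threading.Thread(target=fetch, args=("stream-2", 1.0))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("wall time: about %.1f seconds, not 2.0" % (time.time() - start))

The two waits overlap, so the wall time is roughly the longest stall rather than the sum of the stalls.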

Power consumption P = fCV^2. The supply voltage (V) is roughly linearly dependent on f (frequency), because a circuit switching more slowly can get away with less potential, which gives the "cubic rule of thumb" relating clock speed and power. If we use twice the area at half the frequency to do the same work, the switched capacitance (C) doubles while f*V^2 drops by a factor of 8, for a net rule-of-thumb quadratic power savings from parallelism (see Chapter 11.7 of Anantha's book "Digital Integrated Circuits"). Leakage is the dominant factor now, though, and slower-switching circuits can run at a higher threshold voltage to reduce leakage if your device supports dynamic threshold scaling (like the Stratix IV from Altera will).
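
A back-of-the-envelope version of that argument, with normalized (made-up) numbers:

    # dynamic power: P = f * C * V^2, with V assumed proportional to f
    def dynamic_power(f, C, V):
        return f * C * V**2

    baseline = dynamic_power(f=1.0, C=1.0, V=1.0)

    # same work on two parallel units at half the clock:
    # switched capacitance doubles, frequency and voltage both halve
    parallel = dynamic_power(f=0.5, C=2.0, V=0.5)

    print(parallel / baseline)  # 0.25 -- the quadratic rule-of-thumb savings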

The better reason why FPGAs dominate in power performance is the efficiency of the total data-flow distance: far less capacitance has to be switched to move data. As the number of cores increases, there is an O(N^(3/2)) relationship between the number of cores and the degree to which a design can be optimized for process locality (see "locality optimization"). This is why place-and-route is so important for FPGAs.

Now for the fun stuff. Buzzword scoreboard from presentations:

{ "Leverage": 17, "Agility": 4, "Low-Latency": 44, "Accelerate": 176, "Eco-System": 8, "Productivity": 191, "Scalability": 83, "Service-Oriented": 17, "Paradigm":16,"Dynamic":55, "Exploit Multicore": 18, "Future-Proof": 4, "Mainstream": 36, "Seamless": 43, "Cloud": 91, "Heterogeneous": 12, "Efficient": 50, "Enabling": 23, "Integrated": 19, "Interoperability": 24, "Realtime": 12, "Reliability": 13, "High-Availability": 33, "Bottleneck": 26 }

Productivity wins.

I'm particularly amused by the frequency of "mainstream." Mainstream on Wall Street today probably means your firm just shut down, merged, or totally changed business models. Happy Monday for a Wall Street Conference!

Coming soon: a business-plan buzzword-compliance checker to determine if your business plan is syntactically correct and give you a score.
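
A toy sketch of what that checker might look like (the buzzword list and scoring are obviously made up):

    import re
    from collections import Counter

    BUZZWORDS = {"leverage", "agility", "low-latency", "accelerate", "eco-system",
                 "productivity", "scalability", "paradigm", "seamless", "cloud",
                 "future-proof", "mainstream", "synergy"}

    def buzzword_score(text):
        words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
        counts = Counter(w for w in words if w in BUZZWORDS)
        return counts, sum(counts.values())

    counts, score = buzzword_score(
        "Our seamless, future-proof cloud paradigm will accelerate productivity.")
    print(counts)
    print("buzzword-compliance score:", score)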

Wednesday, September 17, 2008

Achronix Goes Balls Out

Congratulations to Achronix on announcing availability of their FPGAs and development boards. The 65nm Speedster reports a 1.5 GHz maximum internal throughput rate and a ton of I/O. The important technology Achronix is introducing to the market is their high-throughput asynchronous pipelining technique. There are numerous white papers on the Achronix site, and research papers from their days in the Cornell Asynchronous FPGA group explain how the "picoPipe" technology works.

While the speed of this beast might get you excited, the reported clock rate doesn't translate into lower pipeline latency; rather, it means you can pipeline your existing design and boost your throughput by 3x over other FPGAs that max out around 500 MHz. As far as FPGAs are concerned, 3x more logic beats 3x more speed any day. Still, if their picoPipe routing architecture can be easily integrated into existing FPGAs, then this technology will be an obvious addition to any FPGA that needs a throughput boost.

For resource-constrained applications, a 3x faster FPGA can use one-third the area to perform the same function via time-division multiplexing ("resource sharing"), but frankly this is comparing apples and oranges, since the 3x higher signal rate in one-third the area comes at a (theoretically quadratic) cost in total power consumption. On the other hand, having more (but slower) logic means you can perform more simultaneous functions instead of just pushing more throughput through existing functions. Having 3x more logic gives you 3x the throughput with a roughly linear increase in power, but 3x more throughput won't let you emulate 3x more logic in general.
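
Here is the trade in miniature, with made-up numbers: three multiplies done spatially in one cycle versus time-multiplexed through one shared multiplier over three faster cycles.

    streams = [(2, 3), (4, 5), (6, 7)]   # three independent operand streams

    # spatial parallelism: three multipliers, all results in cycle 1 (3x area)
    results_parallel = [a * b for (a, b) in streams]

    # time-division multiplexing: one multiplier at 3x the clock,
    # one stream serviced per fast cycle (1x area, 3 cycles)
    results_tdm = []
    for cycle, (a, b) in enumerate(streams, start=1):
        results_tdm.append(a * b)

    assert results_parallel == results_tdm   # same function, different area/power point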

So when we compare the Achronix Speedster to that beast-of-an-FPGA, the 40nm Altera Stratix IV, we have to keep in mind that 1.5 GHz internal throughput is largely a distraction from the end-to-end argument. The Achronix approach uses high-throughput pipelines while the Altera approach uses a metric ton of logic at a lower rate. For blocks like adders, multipliers, FFTs, and floating point units, high-speed pipelined circuits make total sense: you get a smaller die and hence a cheaper chip. But for latency-dependent control logic, I/O-bound processes, and power-constrained circuits, it is unlikely that the chip will be running its high-throughput pipelines at full speed.

So more logic might generally beat higher internal pipeline speed, but more I/O throughput is the definitive tie-breaker for most applications. Here the Speedster is definitely a speed-monster, and its raw I/O throughput will make it a quick favorite for many applications: up to 40 lanes of 10.3 Gbps SerDes and 850 I/O pins at up to 1066 MHz, for a beast that can provide nearly 1.3 Tbps of raw throughput.
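
A quick sanity check on that 1.3 Tbps figure (raw, best-case numbers, counting each pin at 1066 Mbps):

    serdes_gbps = 40 * 10.3         # 40 lanes x 10.3 Gbps  = 412 Gbps
    parallel_io_gbps = 850 * 1.066  # 850 pins x 1066 Mbps ~= 906 Gbps
    print(serdes_gbps + parallel_io_gbps)  # ~1318 Gbps, i.e. roughly 1.3 Tbps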

Achronix knows that more logic beats faster logic in FPGAs and that I/O is king. They also know that the FPGA market is too smart to fall for a clock-rate race. But the deal-breaker, the golden rule of FPGAs, is this: you must have an extremely compelling software workflow if you are going to get designers to adopt your hardware. If Achronix wants to convince me that they've totally pwned the rest of the FPGA market, then they need to provide the "Progressive Insurance" of FPGA tools: a website where I can submit my designs and get back the speed and power specs of a Speedster implementation alongside several Xilinx and Altera FPGAs.

If Achronix is highly dependent on the existing reconfigurable HDL market for tools, and if their hardware advance isn't matched by a software toolchain advance that takes advantage of the new-found throughput, then this technology will have some serious barriers to overcome. It is extremely difficult to automate load-balancing of shared pipelined resources (going from spreadsheet-style RTL with absurdly high resource consumption to implementable resource-sharing HDL is one of those magic automations I implemented for my Master's degree).

I'm not sure that anyone knows what it means to make FPGA tools that don't suck, but I'm convinced that building a community and developing domain-specific tools is a huge part of it. If I were Achronix I would do these things to cultivate a user community:
  1. Get boards out to the undergraduate digital design labs at a bunch of schools
  2. Fund competitions for the best applications in multiple niches
  3. Support Open Source IP and Open Source EDA
Frankly, if you don't give your FPGAs to undergraduates, they'll end up learning how to use your competitors' wares. Xilinx donated a bunch of FPGAs to MIT to replace the old Altera boards we were previously using in 6.111. The result is that every MIT student who has learned to program an FPGA over the past four years knows what "allow unmatched LOC constraints" means in Xilinx ISE, instead of the corresponding idiosyncrasies of Altera's Quartus toolset.

Bottom line: Achronix needs application benchmarks to prove that their hardware has a future and EDA tools to prove that their company has a future.

Thursday, September 04, 2008

Parallel Programming is Easy: Making a Framework is Hard

An HPCwire article titled "Compilers and More: Parallel Programming Made Easy?" by Michael Wolfe presents a gloomy outlook for parallel programming:
Every time I see someone claiming they've come up with a method to make parallel programming easy, I can't take them seriously.
People you don't take seriously may take you by surprise. I think the computing industry suffers from the "faster horse" problem: Henry Ford couldn't ask his customers what they wanted because they would have said "a faster horse." The instruction stream is a horse: the industry has already built the fastest horses physically possible, so now the industry is going on a multithreading tangent (multi-horsing?).

While everyone else is still in a horse-race, we're building hybrid engines.

The HPCwire article lists a whole bunch of languages, but whenever someone groans about parallel programming and provides a list of languages, Excel is always left off. Clearly we aren't seeing the forest if we leave out the most widely used programming language ever (probably by a double- or even triple-digit factor). The omission is especially egregious when these lists include other visual dataflow languages like LabVIEW and Simulink (this article mentions dataflow style but none of these particulars). Spreadsheet cells are explicitly parallel and make dataflow and vector programming so simple that almost everyone who has ever used a computer has done it. There's even a well-understood model of event-triggered control-flow macros for those cases where you "need" instruction streams.

So I strongly disagree with the premise that parallel programming ought to be difficult. Parallel programming is the same as spreadsheet programming: it's easy to do and everyone already knows how it works. Especially don't let someone convince you that parallel programming is hard just because they work on the hardware-software interface. Many of these people still believe parallel programming means synchronizing random-access machines running non-deterministic threads (avoid the pitfalls of horse-race conditions by considering threads harmful).

Developing a high-performance, real-time spreadsheet framework for a hybrid hardware topology requires substantial effort. Depending on your target hardware architecture, you may need threads, vector operations, distributed processes, and hardware description languages to iterate that spreadsheet efficiently. To do this you need a compiler from the spreadsheet language to each of the hardware models you want to support, and you need to generate synchronization code to share precedent-cell data across hardware elements. Depending on the memory and interconnect architecture, this synchronization code can get somewhat tricky, but code generation from a spreadsheet is the "tractable" part of the parallel programming problem and makes for a good Master's thesis if you throw in at least one optimization.
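
As a sketch of that tractable part, here is a toy spreadsheet-as-dataflow evaluator in Python (the cells and formulas are invented; a real backend would farm each ready batch out to threads, vector units, or hardware pipelines instead of a loop):

    from graphlib import TopologicalSorter

    # cell -> (precedent cells, formula over their values)
    sheet = {
        "A1": ((),           lambda: 2),
        "A2": ((),           lambda: 3),
        "B1": (("A1", "A2"), lambda a1, a2: a1 + a2),
        "B2": (("A1",),      lambda a1: a1 * 10),
        "C1": (("B1", "B2"), lambda b1, b2: b1 * b2),
    }

    ts = TopologicalSorter({cell: deps for cell, (deps, _) in sheet.items()})
    ts.prepare()
    values = {}
    while ts.is_active():
        ready = ts.get_ready()        # every cell in this batch is independent,
        for cell in ready:            # so a real backend evaluates them in parallel
            deps, formula = sheet[cell]
            values[cell] = formula(*(values[d] for d in deps))
            ts.done(cell)

    print(values["C1"])   # 100: (A1 + A2) * (A1 * 10)

The scheduler never needs locks because the precedence edges already say exactly which values each cell is allowed to read.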

For your PhD you'll have to do something more difficult than just automatically generating parallel code from a partitioned dataflow graph.

Optimal partitioning of parallel programs is obscenely hard (MegaHard, as my previous post would have it). In heterogeneous environments that use many of these primitive parallel models, you need to worry about optimally partitioning which cells run on which metal. Partitioning based on computational resources is a pain, but the real difficulty is optimizing for the communication requirements between partitions under the communication constraints between the hardware elements. We are approaching the optimal partitioning problem by assigning a color to each chunk of metal: we group spreadsheet cells by color and then profile the computational load of each color group and the communication between color groups against the hardware constraints.
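
A minimal sketch of that bookkeeping, with an invented assignment of cells to two chunks of metal:

    # precedent edges in the cell graph, per-cell compute cost, and a candidate coloring
    edges = [("A1", "B1"), ("A2", "B1"), ("A1", "B2"), ("B1", "C1"), ("B2", "C1")]
    cost  = {"A1": 1, "A2": 1, "B1": 2, "B2": 2, "C1": 4}
    color = {"A1": "cpu", "A2": "cpu", "B1": "fpga", "B2": "fpga", "C1": "fpga"}

    # computational load per color group
    load = {}
    for cell, c in cost.items():
        load[color[cell]] = load.get(color[cell], 0) + c

    # precedent values that must cross a color boundary each iteration
    traffic = sum(1 for src, dst in edges if color[src] != color[dst])

    print(load)     # {'cpu': 2, 'fpga': 8}
    print(traffic)  # 3 values cross the cpu/fpga link every iteration

The partitioner's job is to search over colorings that balance the load numbers while keeping the boundary traffic within what the actual interconnect can carry.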

The HPCwire article does mention communicating sequential processes and dataflow models:
"Until we have machines that implement these [communicating sequential processes and dataflow] models more closely, we need to take into account the cost of the virtualization as well."
We do have machines that implement these models (as Tilera and Altera will attest). They are still as difficult to program as any parallel architecture, but I assure you that once we start to think of these things as "hardware spreadsheets" we will start to see a way out of the parallel programming cave. I wonder if people who describe an FPGA as a "neural-net processor" make the pop-culture connection.