Wednesday, September 17, 2008

Achronix Goes Balls Out

Congratulations to Achronix on announcing availability of their FPGAs and development boards. The 65nm Speedster reports a 1.5 GHz max internal throughput rate and a ton of I/O. The important technology Achronix is introducing to the market is their high-throughput asynchronous pipelining technique. There are numerous white papers on the Achronix site and research papers from their days in the Cornell Asynchronous FPGA group which explain how the "picoPipe" technology works.

While the speed of this beast might get you excited, the clock rate reported doesn't translate to decreased pipeline latency, but rather implies that you can pipeline your existing system and boost your throughput rate by 3x over other FPGAs that max out at 500 MHz. As far as FPGAs are concerned, 3x more logic is better than 3x speed any day. Still, if their picoPipe routing architecture can be easily integrated into existing FPGAs, then this technology will be an obvious addition to any FPGA that needs a throughput boost.
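To make the throughput-versus-latency point concrete, here is a minimal sketch of an idealized pipeline model. The function name, the 2 ns logic delay (picked so the unpipelined path runs at 500 MHz), and the even three-way split are all assumptions for illustration, not Achronix figures.

```python
def pipeline_metrics(logic_delay_ns, stages, register_overhead_ns=0.0):
    """Return (throughput_ghz, latency_ns) for an evenly pipelined path.

    Idealized model: the logic delay is split evenly across the stages and
    each stage adds an optional fixed register/handshake overhead.
    """
    stage_delay_ns = logic_delay_ns / stages + register_overhead_ns
    throughput_ghz = 1.0 / stage_delay_ns   # one result per stage delay
    latency_ns = stage_delay_ns * stages    # time for one input to emerge
    return throughput_ghz, latency_ns


# Hypothetical 2 ns path: unpipelined versus split into three stages.
base = pipeline_metrics(2.0, stages=1)
piped = pipeline_metrics(2.0, stages=3)

print(f"unpipelined: {base[0]:.2f} GHz, {base[1]:.2f} ns latency")
print(f"3 stages:    {piped[0]:.2f} GHz, {piped[1]:.2f} ns latency")
```

In this model the three-stage version delivers results three times as often, but any single result still takes the same 2 ns to come out, and any nonzero per-stage overhead only makes the latency worse.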

For resource constrained applications, a 3x faster FPGA can use one-third the area to perform the same function using time-division multiplexing ("Resource Sharing"), but frankly, this is comparing apples and oranges since the 3x higher signal rate in 1/3 the area comes at a (theoretically quadratically dependent) cost to total power consumption. On the other hand, having more (but slower) logic means you can perform more simultaneous functions instead of only achieving more throughput through existing functions. Having 3x more logic will give you 3x throughput with a similar linear increase in power costs, but 3x more throughput won't allow you to emulate 3x more logic in general.
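As a rough illustration of resource sharing, the sketch below time-multiplexes one multiplier across three channels. The function name and the operand values are made up for the example; it is a behavioral model, not HDL and not anything a vendor tool would produce.

```python
def shared_multiplier(channels):
    """Time-multiplex a single multiplier across several channels.

    channels: list of equal-length lists of (a, b) operand pairs,
              one list per logical multiplier being emulated.
    Returns (per-channel results, number of fast-clock slots used).
    """
    results = [[] for _ in channels]
    fast_slots = 0
    for operands in zip(*channels):              # one slow-clock sample period
        for ch, (a, b) in enumerate(operands):   # one fast-clock slot per channel
            results[ch].append(a * b)            # the single shared multiplier
            fast_slots += 1
    return results, fast_slots


# Three logical multipliers, two samples each (hypothetical data).
channels = [
    [(1, 2), (3, 4)],
    [(5, 6), (7, 8)],
    [(9, 10), (11, 12)],
]
results, fast_slots = shared_multiplier(channels)
print(results)      # products for each channel
print(fast_slots)   # 3 channels x 2 samples = 6 fast slots
```

The trick only works here because all three channels want the same operation. Three different functions would still need three different circuits, which is why 3x the speed does not substitute for 3x the logic in general.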

So when we compare the Achronix Speedster to that beast-of-an-FPGA, the 40nm Altera Stratix IV, we have to keep in mind that 1.5 GHz internal throughput is largely a distraction from the end-to-end argument. The Achronix approach uses high-throughput pipelines while the Altera approach uses a metric-ton of logic at a lower rate. For blocks like adders, multipliers, FFTs, and floating point units, having high-speed pipelined circuits makes total sense to get a smaller die area and hence a lower cost chip, but for latency-dependent control logic, I/O-bound processes and power constrained circuits it is unlikely that the chip will be operating with its high-throughput pipelines at full speed.

So more logic might be generally better than higher internal pipeline speed, but more I/O throughput is the definitive tie-breaker for most applications. Here the Speedster is definitely a speed-monster, and its raw I/O throughput will make it a quick favorite for many applications: up to 40 lanes of 10.3 Gbps SerDes and 850 I/O pins at up to 1066 MHz, for a beast that can provide nearly 1.3 Tbps of raw throughput.
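The 1.3 Tbps figure can be checked with back-of-the-envelope arithmetic. This sketch assumes each general-purpose I/O pin carries one bit per cycle at the quoted 1066 MHz (that is, 1.066 Gbps per pin), which is an assumption on my part about how the number was derived.

```python
# Figures quoted in the post.
SERDES_LANES = 40
SERDES_GBPS_PER_LANE = 10.3
IO_PINS = 850
IO_GBPS_PER_PIN = 1.066   # assumption: 1066 MHz pin rate = 1.066 Gbps per pin

serdes_gbps = SERDES_LANES * SERDES_GBPS_PER_LANE   # high-speed serial lanes
io_gbps = IO_PINS * IO_GBPS_PER_PIN                 # general-purpose I/O pins
total_tbps = (serdes_gbps + io_gbps) / 1000.0

print(f"SerDes: {serdes_gbps:.1f} Gbps")
print(f"I/O:    {io_gbps:.1f} Gbps")
print(f"Total:  {total_tbps:.2f} Tbps")
```

Under that assumption the SerDes lanes contribute about 412 Gbps and the pins about 906 Gbps, so roughly two-thirds of the headline number comes from the ordinary I/O pins rather than the SerDes.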

Achronix knows that more logic beats faster logic in FPGAs and that I/O is king. They also know that the FPGA market is too smart to fall for a clock-rate race. But the deal-breaker and the golden rule of FPGAs is this: you must have an extremely compelling software workflow if you are going to get designers to adopt your hardware. If Achronix wants to convince me that they've totally pwned the rest of the FPGA market, then they need to provide the "Progressive Insurance" of FPGA tools. I want a website where I can submit my designs and get back the speed and power specs of a Speedster implementation alongside those of several Xilinx and Altera FPGAs.

If Achronix is highly dependent on the existing reconfigurable HDL market for tools and if their hardware advance isn't met with a similar software toolchain advance to take advantage of the new-found throughput, then this technology will have some serious barriers to overcome. It is extremely difficult to automate load-balancing of shared pipelined resources (going from a spreadsheet-RTL with absurdly high resource consumption to implementable resource-sharing HDL code is one of those magic automations I implemented for my Master's degree).

I'm not sure that anyone knows what it means to make FPGA tools that don't suck, but I'm convinced that building a community and developing domain-specific tools is a huge part of it. If I were Achronix I would do these things to cultivate a user community:
  1. Get boards out to the undergraduate digital design labs at a bunch of schools
  2. Fund competitions for the best applications in multiple niches
  3. Support Open Source IP and Open Source EDA
Frankly, if you don't give your FPGAs to undergraduates they'll end up learning how to use your competitors' wares. Xilinx donated a bunch of FPGAs to MIT to replace the old Altera boards we were previously using in 6.111. The result is that every MIT student who learned how to program an FPGA for the past four years knows what "allow unmatched LOC constraints" means in Xilinx ISE instead of the similar idiosyncrasies of Altera's Quartus toolset.

Bottom line: Achronix needs application benchmarks to prove that their hardware has a future and EDA tools to prove that their company has a future.

3 comments:

jason said...

I can't entirely agree with you. There's a subset of computing where you can't just throw gates at the problem and you need speed, as well as gates. Signal processing algorithms are one area where you can have FPGA parallelism and a highly pipelined architecture and use them to great benefit. This was one area where an ASIC was almost always required because FPGAs simply didn't run fast enough. It'll be interesting to see how Achronix fills this niche because if they do it well, we could very well be on our way to a powerful method for reconfigurable computing.

Amir said...

Hi Jason,

The latency of the Achronix circuit is not going to change compared to the non-picoPiped logic and you are always critically dependent on that latency for your result in any necessarily serial computation. So the 3x faster Speedster is not going to fill any high performance niche that a 3x more logic FPGA with the same I/O throughput wouldn't fit, but I expect that Achronix can provide it at a lower chip price and higher power consumption.

So as far as latency is concerned a 40 nm FPGA is generally going to beat a 65 nm FPGA just by being twice as dense and an ASIC is going to dominate the FPGAs because it doesn't have programmable interconnect.

Don't get me wrong, the Achronix picoPipe AND higher density is something we definitely want. Slap picoPipe interconnect into the denser FPGAs and you get an immediate boost to your FPGA for a fractional cost to area. The fact that they are behind in process and can still compete in performance means that their interconnect technology is that much more compelling.

Shiraz said...

Amir,

Thinking about asynchronous logic as a ROUTING technology is a mistake. Asynchronous is a whole different systematic methodology of digital logic circuit design. You don't use periodic clocks to make sure all the stages are within the worst case limits of propagation delay, etc. - you enable results to propagate just as soon as they're ready at their actual maximum rate on a gate by gate basis, while making sure of successful completion of each of the operations individually using special asynchronous interlocks. Thus you cannot retrofit these concepts in the traditional sense onto conventional logic technology without adding more transistors and having the designer think a whole lot differently. It is more economical when the logic is purpose-designed for asynchronous use.

Previous FPGAs such as those of Fred Furtek's Concurrent Logic Inc (later bought by Atmel) probably did look into asynchronous implementation as he had research papers and patents relating to these concepts in regard to FPGAs.

What Achronix has done is let the designer design using asynchronous logic ALMOST AS IF he were still using synchronous logic, as a matter of look-and-feel in the front-end tool-chain interface, without having him worry about asynchronous aspects too much. But what happens behind the scenes in the back end and in the actual hardware is a whole other ball game, involving asynchronous logic such as Muller gates and concepts such as operation completion detection. It is immune to metastability: it just has to wait until the metastability resolves. This cannot be replicated merely by throwing three times more logic at it.
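For readers who have not met a Muller gate (C-element) before, here is a minimal behavioral sketch of the textbook two-input version. The class name is made up for illustration, and this says nothing about how Achronix actually builds its circuits.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element.

    The output switches only when both inputs agree; otherwise it holds
    its previous value. That hold behavior is what lets a stage wait for
    all of its inputs to arrive without needing a clock.
    """

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:          # inputs agree: output follows them
            self.out = a
        return self.out     # inputs disagree: output is unchanged


c = CElement()
trace = [c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)
```

The output rises only once both inputs are high and falls only once both are low, so a lone early input never triggers the next stage.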

So it is not merely the interconnect, but the way the entire network of gates operates that increases the speed of asynchronous circuits.