Reconfigurable Computing: 2008

Monday, December 29, 2008

C-to-Verilog.com provides C to HDL as a service

If you follow this blog, then you've read my ramblings on EDA SaaS. An interesting new advance in this area is http://www.c-to-verilog.com. This website lets you compile C code into synthesizable Verilog modules.

Here is a screencast of the service:

I've also discovered a few other websites related to EDA, SaaS and "Cloud Computing." Harry the asic guy's blog covers the burgeoning EDA SaaS market and Xuropa is creating an online community for EDA users and developers. Here's a two-part EETimes piece which talks about EDA SaaS.

I already use a remote desktop connection to run most of my EDA jobs remotely. I've argued that there is a "generation gap" between SaaS and traditional software license markets. The people who used to code in university basements when they were 13 in the 60's and 70's invented the software licensing industry. Now, the people who used to herd botnets when they were 13 are graduating with computer science degrees. The nu-hacker distributes code updates to all his users immediately without forcing anyone to wait through an install.

Friday, December 05, 2008

More versus Faster

/. points to an IEEE Spectrum article titled "Multicore Is Bad News" which discusses the bottleneck associated with getting data from memory into the many cores of a multicore processor.

This is primarily an issue with the programming model we are trying to force on the hardware, not a problem with the real capacity of such hardware. The article specifically says that many HPC processes map cleanly to grids and thus address the locality problems of multicore arrays. I'm a broken record: the problem of locality optimization for mapping dataflow graphs to multicore arrays requires a transition of our primitive computer model from instruction stream executers to spreadsheet iterators.

The article sums up the problem: "Although the number of cores per processor is increasing, the number of connections from the chip to the rest of the computer is not." This is not necessarily true, since we are seeing an increasing number of pins on a chip, but it is also not the main issue.

Even without increasing the rate of dataflow through the processor, we can drastically improve power-performance by slowing down our processing pipelines and distributing load across multiple slower cores. One-thousand 1 MHz cores can perform many of the same tricks as a single 1 GHz core for far fewer Joules. This physical truth is starkly contrary to the popular notion that we should increase utilization to improve performance. Since Moore's law says that transistors for ever more cores will continue to cost less and less (and I predict this will continue long after dimensional scaling ends, due to optimizations in the fabrication methods), we can achieve a decline in operating expense to sustain our growing hunger for computing resources. This requires us to fundamentally change our approach: we cannot continue to expect faster execution, we can only expect more execution. Exponentially more.

Once the industry discovers that it has to transform from isolated general purpose problem solvers to networked special purpose assembly lines for a particular problem, I expect we will see the rise of FPGAs and reconfigurable dataflow pipelines. Any application that can use many thousands of CPUs effectively will probably use many thousands of FPGAs an order of magnitude more efficiently. It's because of the relationship between the granularity/number of cores and the degree to which we can optimize for locality. All the extra dispatch hardware and cache management logic is unnecessary in deterministic dataflow pipelines. We simply need more dense functional area and more pipes for data.

Monday, November 24, 2008

Convey's Reconfigurable Computing System in the NY Times

An article in the New York Times discusses Steve Wallach's Convey Computer Corporation and their new reconfigurable supercomputer.

The Convey machines use FPGAs to augment the CPU's instruction set. From a given FORTRAN or C code, their compiler tools create a "personality," which is Convey's way of re-branding the term "configuration."

Steve has a great quote on his site, which sums up everything I've ever written about the FPGA market on this blog: "The architecture which is simpler to program will win." I was once an intern at Xilinx pushing FORTRAN code through F2C (which redefines "barf" in the f2c.h header file) and massaging the resulting C code through an FPGA compiler. As a general rule, most important milestones in compiler technology diminish the need for a summer intern's manual labor. But while I understand the importance and value of demonstrating the capability of reconfigurable supercomputing on legacy FORTRAN applications, if FPGA supercomputing has a future, I am convinced that we need to break free of the instruction stream control flow model.

I've previously argued that any approach to accelerating sequential imperative languages like C and FORTRAN only extends the von Neumann syndrome and that we need an explicitly parallel, locality-constrained, dataflow model akin to the spreadsheet if we hope to create scalable applications. Moore's law continues to drive down the cost of a transistor, but the speed of these transistors is limited by a power wall; if we are going to continue the geometric cost reduction of Gigaops, then we need languages and programming models that improve with more transistors and not just with faster transistors.

I agree with Ken Iverson's thesis in "Notation as a Tool of Thought." The Sapir-Whorf Hypothesis holds true in computer languages: we are bound to the von Neumann model by the languages we know and use and this is what we must change. In many of his talks and papers, Reiner Hartenstein identifies the dichotomy between von Neumann and reconfigurable computing paradigms. He argues that the divide must be overcome with an updated computer science curriculum.

Programming models aside, the main thing bugging me about the Convey machine is its price tag: $32,000. Here's what Joe at Scalability has to say about pricing FPGA acceleration reasonably:

One company [at SC08] seemed to have a clue on the FPGA side, basically the pricing for the boards with FPGAs were reasonable. Understand that I found this surprising. The economics of FPGA accelerators have been badly skewed … which IMO served to exclude FPGAs from serious considerations for many users. With the rise in GPU usage for computing, some of the FPGA folks finally started to grasp that you can’t charge 10x node price for 10x single core performance.

I really hope that Convey's increased programmability helps them make huge inroads at this price point so that we can expect super high margins from our products when we launch. I don't know how they will compete with XtremeData, DRC, Nallatech, Celoxica or whoever decides to do the same thing all these companies do with Achronix chips (hmmm...)

Tuesday, November 11, 2008

Software Defined Radio: Revolution Not Included

Spectrum policy needs to be re-evaluated in the 21st century. The expansion of the Internet as an organizational model destroys one-way, hierarchically controlled mediums. A Wired article earlier this year observed that the nature of the Internet is to make everything it touches tend to gratis. The Internet and the wireless world are now inextricably linked by devices that support both free protocols on free spectrum and proprietary protocols on licensed spectrum. VoIP-capable WiFi-enabled phones are the beginning of the end of cell-phone service as we know it. Regulated spectrum ownership will transition to universal Ethernet coverage using best effort spectrum sharing protocols.

With the move to digital TV, the bandwidth requirements for television broadcasting decrease substantially, freeing up a large amount of white space bandwidth between existing channels. A programmable transceiver can operate on these white spaces, switching between multiple protocols to serve other white space devices. If we make a transceiver device that can access this huge amount of bandwidth, we can build a substantial, free wireless network.

Google's Larry Page spearheaded the Free the Airwaves campaign which urged the FCC to free these white spaces for unlicensed public use. On election day, the FCC unanimously voted in accordance with these wishes, approving devices that use sensing and geographical information to avoid interference. The white spaces are the first frontier. We should promote a long term vision that would have the FCC open the entire spectrum to sharing protocols. TV and Radio broadcasting are an enormous waste of bandwidth and discriminatory allocation of bandwidth is one of the pillars of inequity in modern American society (think nationally syndicated talk radio). The roadmap for the obsolescence of spectrum ownership should be set so that capital can be allocated properly to progressive technologies. Unfortunately this technology direction is contrary to the current revenue model of the FCC and would likely result in the re-purposing of that organization.

The economic challenge in making this transition is that there are hundreds of billions of wealth units tied to the premise that spectrum can be owned. I expect to see the wireless and media empires crash as the premise of their business model is made unsound by technical obsolescence. The question is: where does all this money go? The answer is nowhere: it just disappears. This effect is similar in substance and larger in magnitude to the hundreds of millions of dollars of market value in newspaper classifieds that disappeared because of Craigslist.

The technical possibility for multi-band and spread spectrum sharing owes itself to the widespread availability of wired Internet infrastructure and the rapid buildup of digital signal processing technology driven by the geometric decrease in transistor cost. Software Defined Radio (SDR) devices like the GNU Radio USRP (FPGA-inside) can be programmed to operate in arbitrary frequency ranges with arbitrary protocols. As more spectrum becomes freed, SDR devices can be reprogrammed in the field to use these new ranges.

This democratization of signal processing technology and communications infrastructure also fueled the cost reduction of music production and distribution that made the recording industry obsolete. Without the teeth of copyright law, and without a monopoly on the electromagnetic spectrum, the decline of the telecoms will be more rapid than the RIAA, and hopefully without the resentment generated by lawsuits against customers.

We have previously seen the effect of technical obsolescence on existing markets and we know the response from the tech that was superseded: the destruction of wealth is often met with lawsuits. Organizations like the RIAA claim that they protect the interests of artists and the industry surrounding them by transforming antiquated copyright laws into censorship laws for the "digital millennium." Unlike the recording industry, there will be fewer legal claws gripping onto a futile business plan during the technical obsolescence of the telecom industry.

Spectral freedom ultimately favors device manufacturers and consumers to the chagrin of telecoms carriers, content owners and distributors (aka "the middle men"). I think the winning business model will beat Apple to the white-space iPhone/laptop, open its API, and have a good transceiving router to back it up. Initially, software controlled protocols and devices that take advantage of them will be patented and will be valuable to carriers who want to lock-in their customers. These proprietary technologies will eventually be replaced by open source alternatives. I expect there will be some legal antagonism between proprietary device manufacturers and the free culture communities, but the futility of protocol licensing is the same as spectrum ownership. Software Defined Radio is software after all, and each FPGA configuration is just a very large number. Be skeptical of the business plan of people who troll electronic bridges for large numbers. Proponents of protocol licensing will point to the number of jobs lost, and appeal to the false-utopia of full-employment using some form of the broken window fallacy (sure the spectrum ownership and patent protection policies are broken, but look at how many people they employ!).

Companies like Motorola and Nokia can use the white-spaces to attract customers to products that are compatible with "WiFi 2.0," as Google likes to call it. They should start to charge the actual price of their equipment so they can become free of their dependency on the carriers. Sprint, Verizon and AT&T should milk those SMS fees while they still can: freedom in the white-spaces will annihilate their industry. The writing is on the wall.

Friday, October 17, 2008

Timing Closure

Doug Amos from Synopsys wrote an article in FPGA Journal about the Timing Closure problem, comparing it to whack-a-mole.

I just got through a painful timing closure process. I feel like I came out of a month-long coma. I'll share some anecdotes about the experience to help people in the same situation.

The "Post-Map" results are never good enough for timing. You will need to run Place-and-Route, so the tools can take a really long time (sometimes multiple hours) to get through compilation. You will want to automate the tools using scripts and run the EDA toolchain on many computers. See catalystac.com/trac for how to use the "subprocess" module in Python with a dictionary specifying the parameter list for all of the Xilinx toolchain (I found similar code on the net and hacked it to pull the project constraints and synthesis options from Excel). Combine this with SimpleXMLRPCServer and you can manage a botnet running multiple tool instances. It is possible to comment out multiple sections of your code using a simple preprocessor, automating many compilation runs to narrow in on which segments of your code need to be fixed for timing. Smaller code compiles faster too, so you can run many permutations over a weekend. (I'll add these scripts to the trac after I have a chance to make them non-specific to this particular project.)
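The scripting idea above can be sketched roughly as follows. This is not the script from the trac; it is a minimal illustration of driving a tool flow from a table of parameter dictionaries with the standard "subprocess" module. The helper names (build_command, run_flow) are hypothetical, and the flags and file names in EXAMPLE_FLOW are assumptions to check against your own tool documentation.

```python
import subprocess

def build_command(tool, params, positional=()):
    """Flatten a {flag: value} dictionary into an argv list for one tool."""
    argv = [tool]
    for flag, value in params.items():
        argv.append(flag)
        if value is not None:          # None marks a bare switch with no argument
            argv.append(str(value))
    argv.extend(positional)
    return argv

def run_flow(stages, runner=subprocess.run):
    """Run each stage in order; return the name of the first failing tool, or None."""
    for tool, params, positional in stages:
        result = runner(build_command(tool, params, positional))
        if result.returncode != 0:
            return tool
    return None

# Placeholder stage table. The tool names follow the usual ISE ordering, but the
# flags and file names are assumptions, not verified command lines.
EXAMPLE_FLOW = [
    ("ngdbuild", {"-uc": "top.ucf"}, ["top.ngc", "top.ngd"]),
    ("map", {"-o": "top_map.ncd"}, ["top.ngd", "top.pcf"]),
    ("par", {"-w": None}, ["top_map.ncd", "top.ncd", "top.pcf"]),
]
```

Calling run_flow(EXAMPLE_FLOW) on a machine with the tools on the PATH would run the stages in sequence. In Python 3 the SimpleXMLRPCServer class lives in xmlrpc.server, and registering a function like run_flow with its register_function method is one plausible way to let a coordinating machine hand jobs to workers.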

Modular partitioning and floorplanning make the compilation and timing closure process a lot easier, but if you want to optimize across hierarchy you can't use them. This is one aspect of behavioral synthesis that needs some serious consideration: how can we avoid running the entire toolchain for minor code modifications? Also since synthesis and optimization can generate weird names for internal signals it is non-obvious what paths are causing the timing errors when they are reported. Usually you can figure out the bad paths by reading the report, but I really wish there was some better way to tie the timing bug back to the code so you know what to modify to fix the bug. There doesn't seem to be a more elegant solution than the brute force method of commenting out sections of the code described in the previous paragraph.
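A brute-force preprocessor of the kind described can be very small. The sketch below assumes the HDL source has been annotated with "// BEGIN name" and "// END name" marker comments around each section you might want to disable; those markers and the function name are made up for illustration and are not part of any tool.

```python
def comment_out_sections(source, disabled, begin="// BEGIN ", end="// END "):
    """Return source with every line inside a disabled section commented out."""
    out = []
    active = None                      # name of the section currently being disabled
    for line in source.splitlines():
        stripped = line.strip()
        if active is None and stripped.startswith(begin):
            name = stripped[len(begin):].strip()
            out.append(line)           # keep the marker itself untouched
            if name in disabled:
                active = name
            continue
        if active is not None and stripped == (end + active).strip():
            active = None
            out.append(line)
            continue
        out.append("// " + line if active is not None else line)
    return "\n".join(out)

EXAMPLE = """assign a = b;
// BEGIN fpu
assign c = d * e;
// END fpu
assign f = g;"""

without_fpu = comment_out_sections(EXAMPLE, {"fpu"})
```

Feeding each variant to the compile flow and recording which ones meet timing narrows down the offending section, at the cost of one full compile per variant.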

Now let me explain some of the timing bugs I found and how I fixed them. My PDP-11 board has an 8 ns clock period for SRAM access with 4 ns per 18 bit word on each half-clock = 4.32 Gbps read and 4.32 Gbps write (full-duplex). The first word arrives before the rising edge and is used as a 16 bit data word; the second word is used as a 16 bit parity word. Each data nibble has 4 parity bits allowing single correction and double detection. A modular test of the ECC would have us believe that from the arrival of the second half-clock word we could determine if there were errors within the 4 ns between the parity word arriving and the next clock cycle. Unfortunately, we discovered that the parity path was causing a timing bug between one of the Data pins and the SingeBitError / ResultValid signal.
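For readers who have not met single-correct, double-detect codes, here is a small software model of one standard construction that matches the "4 parity bits per nibble" description: an extended Hamming (8,4) code. This is an assumption about the scheme, not the board's actual parity equations, and the function names are invented for the sketch.

```python
def encode_nibble(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8                     # index 0: overall parity, 1..7: Hamming positions
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1        # force even parity over all 8 bits
    return code

def decode_nibble(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos            # XOR of the positions of all set bits
    overall = sum(code) & 1
    if overall == 1:                   # odd parity: exactly one bit flipped
        code[syndrome] ^= 1            # syndrome 0 means the overall parity bit itself
        status = "corrected"
    elif syndrome != 0:                # even parity but nonzero syndrome: two flips
        return None, "double"
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

In hardware the syndrome is a handful of XOR trees, which is why the check looks cheap in isolation; the timing problem described above comes from where those trees sit relative to the pins, not from the arithmetic itself.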

The simple solution is to burn a clock cycle on parity correction, at the cost of an extra cycle of latency. It would be nice to have an error correcting memory that doesn't consume an entire clock cycle to compute correctness so we can use the data word for 8ns and simply reject the result using the parity check in 4ns. If we don't know that the word is valid for an entire cycle, then we cannot speculatively use the data word unless we are willing to rewind the pipeline an entire clock cycle, which is certainly possible, but it's a scary proposition nonetheless (better to burn a cycle in this design, since we prefer reliability and simplicity).

Turns out the reason for the timing bug was one of the data pins was attached to a different I/O bank than the other pins and so the routing delay made the critical path. This was discovered by tracing the path in the floorplanner. The solution here was to turn off the pin since we have 18 data bits for memory and only use 16 for our word. The next board revision will also fix this.

The next major timing bug was the operand fetching pipeline somewhere between the Register and Mode and the Virtual Address register (see here for PDP-11 architecture info). The error here was very small (300 picoseconds) and would go away whenever the fan out and fan in of some of the signals were decreased by commenting out some functions.

We have 8 nanoseconds to decode an opcode from the arrival of the memory word. This is enough time to read the register from the register file and decode the address mode, but not enough time to guarantee that the virtual address is ready in the address mode case where we must pre-decrement the register to generate the virtual address. I fixed the timing of the operand fetch stage by only decoding the address mode and fetching the register word during the opcode decode stage. The operand fetch stage now uses the decoded address word and register word to generate a virtual address in a second cycle. It would be possible to optimize this process for the majority of cases, where the register is the virtual address: we can start with that assumption and then invalidate the result when we discover that another cycle is required to decode the address mode.

The most pernicious timing bug involved our floating point multiplier, which was partitioned into multiple DSP48 blocks. The multiplier core was generated by coregen to have enough pipeline stages to meet timing. Compiling it in its own project revealed a minimum clock period of 2.3 ns, but it just barely broke the timing when compiled with everything else: it was off by a factor of the timing uncertainty. We thought that perhaps the tool was retiming the multiplier pipeline to just barely meet timing, and then clock jitter and uncertainty were added later, causing it to break the constraint. We did multiple permutations of synthesis options and ran recompilations to no avail. We also added stages to the pipeline to no avail.

To solve this problem, we just created an entirely new project without any timing constraints and set it to do the best it could, and we met timing with 35 picoseconds to spare. Hallelujah!

If you are in the middle of a painful timing closure, I'm sorry for you, and I hope you can find something useful from this post.

Monday, September 22, 2008

High Performance on Wall Street 2008

A freshly rebranded Catalyst Accelerated Computing went to the High Performance on Wall Street conference today. The usual suspects were present. Speedup results all seemed to match the gender ratio at supercomputing conferences.

FPGAs got bashed in an early session. "It's hard to find people to program FPGAs" came up at least twice during the conference (we consult!). I heard, "threads are the future" ... oy. After the early morning thread-love-fest, the accelerator panel defended the honor of the gate array well.

Technical stuff that irked me:

Multithreading and multicore aren't the same thing. Multicore processors can use multiple processes with explicit pipes or they can use multiple threads with a global memory coherency protocol. Multiple processes with pipes between them is the dataflow or "streaming" model, like a spreadsheet or like "ls -Al | grep foo > bar", while the multi-threading model should be avoided like sub-prime mortgages for the same reason (they cause your system to crash in mysterious ways).
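To make the streaming model concrete, here is a single-process analogy of the "ls -Al | grep foo > bar" pipeline using Python generators. It is only a sketch of the shape of the model: each stage sees nothing but what arrives on its input and there is no shared mutable state. The stage names and file names are invented for the example, and real pipes between processes would add true concurrency on top of this structure.

```python
def source(names):
    """Stand-in for 'ls': emit one item at a time."""
    for name in names:
        yield name

def grep(pattern, lines):
    """Pass along only the items that contain the pattern."""
    for line in lines:
        if pattern in line:
            yield line

def sink(lines):
    """Stand-in for '> bar': drain the pipe into a list."""
    return list(lines)

bar = sink(grep("foo", source(["foo.v", "top.v", "foo_tb.v"])))
```

Because every stage is a pure function of its input stream, the result is deterministic no matter how the stages are scheduled, which is the property the threaded model gives up.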

Multithreading is the use of multiple instruction streams sharing a global address space. It was originally a method of hiding latency by transferring the context of a core to a different thread when you were waiting for I/O or memory. Intel cores support "hyperthreading" which switches context between two threads and makes it seem like it has two cores. This allows the core to share a global memory space and hide memory access latency which is large compared to the clock rate. Cores can have a lot of threads: Sun's open-source Sparc core supports 32 native threads.

Power consumption P = fCV^2. The supply voltage (V) is generally linearly dependent on f (frequency) because we can use less potential to switch at slower frequency, resulting in the "cubic rule of thumb" relating clock-speed and power. If we use twice the area and half the frequency to do the same work, then switched Capacitance (C) is 2x while f*V^2 is 1/8, leading to the rule-of-thumb quadratic power savings from parallelism (see Chapter 11.7 of Anantha's book "Digital Integrated Circuits"). Leakage is the dominating factor now though, and slower switching circuits can operate with higher threshold voltage to lower leakage if your device supports dynamic threshold scaling (like the Stratix IV from Altera will).
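The arithmetic behind that rule of thumb is easy to check numerically. The sketch below uses normalized units and assumes, as the paragraph does, that voltage scales linearly with frequency and that leakage is ignored; the function names are made up for the example.

```python
def dynamic_power(f, c, v):
    """Switching power from the rule P = f * C * V^2 (arbitrary units)."""
    return f * c * v ** 2

def parallel_power_ratio(n):
    """Power of n copies running at f/n relative to one copy at f, same total work.

    Assumes supply voltage scales linearly with frequency and ignores leakage.
    """
    baseline = dynamic_power(f=1.0, c=1.0, v=1.0)
    parallel = dynamic_power(f=1.0 / n, c=float(n), v=1.0 / n)
    return parallel / baseline

ratio_2x = parallel_power_ratio(2)     # 2x area at half the clock
```

Under these assumptions the ratio works out to 1/n^2, so doubling the area at half the clock uses a quarter of the switching power. Real devices have a voltage floor, so the savings flatten out well before the idealized curve does.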

The better reason why FPGAs dominate in power performance is because of the efficient total distance of data-flow, aka much lower capacitance to move data. As the number of cores increases, there is an O(N^(3/2)) relationship between the number of cores and the degree to which a design can be optimized for process locality (see "locality optimization"). This is why place-and-route is so important for FPGAs.

Now for the fun stuff. Buzzword scoreboard from presentations:

{ "Leverage": 17, "Agility": 4, "Low-Latency": 44, "Accelerate": 176, "Eco-System": 8, "Productivity": 191, "Scalability": 83, "Service-Oriented": 17, "Paradigm": 16, "Dynamic": 55, "Exploit Multicore": 18, "Future-Proof": 4, "Mainstream": 36, "Seamless": 43, "Cloud": 91, "Heterogeneous": 12, "Efficient": 50, "Enabling": 23, "Integrated": 19, "Interoperability": 24, "Realtime": 12, "Reliability": 13, "High-Availability": 33, "Bottleneck": 26 }

Productivity wins.

I'm particularly amused by the frequency of "mainstream." Mainstream on Wall Street today probably means your firm just shut down, merged, or totally changed business models. Happy Monday for a Wall Street Conference!

Coming soon: a business-plan buzzword-compliance checker to determine if your business plan is syntactically correct and give you a score.

Wednesday, September 17, 2008

Achronix Goes Balls Out

Congratulations to Achronix on announcing availability of their FPGAs and development boards. The 65nm Speedster reports a 1.5 GHz max internal throughput rate and a ton of I/O. The important technology Achronix is introducing to the market is their high-throughput asynchronous pipelining technique. There are numerous white papers on the Achronix site and research papers from their days in the Cornell Asynchronous FPGA group which explain how the "picoPipe" technology works.

While the speed of this beast might get you excited, the clock rate reported doesn't translate to decreased pipeline latency, but rather implies that you can pipeline your existing system and boost your throughput rate by 3x over other FPGAs that max out at 500 MHz. As far as FPGAs are concerned, 3x more logic is better than 3x speed any day. Still, if their picoPipe routing architecture can be easily integrated into existing FPGAs then this technology will be an obvious addition to any FPGA that needs a throughput boost.

For resource constrained applications, a 3x faster FPGA can use one-third the area to perform the same function using time-division multiplexing ("Resource Sharing"), but frankly, this is comparing apples and oranges since the 3x higher signal rate in 1/3 the area comes at a (theoretically quadratically dependent) cost to total power consumption. On the other hand, having more (but slower) logic means you can perform more simultaneous functions instead of only achieving more throughput through existing functions. Having 3x more logic will give you 3x throughput with a similar linear increase in power costs, but 3x more throughput won't allow you to emulate 3x more logic in general.

So when we compare the Achronix Speedster to that beast-of-an-FPGA the 40nm Altera Stratix IV, we have to keep in mind that 1.5 GHz internal throughput is largely a distraction from the end-to-end argument. The Achronix approach uses high-throughput pipelines while the Altera approach uses a metric-ton of logic at a lower rate. For blocks like adders, multipliers, FFTs, and floating point units, having high-speed pipelined circuits makes total sense to get a smaller die area and hence a lower cost chip, but for latency-dependent control logic, I/O-bound processes and power constrained circuits it is unlikely that the chip will be operating with its high throughput pipelines at full speed.

So more logic might be generally better than higher internal pipeline speed, but more I/O throughput is the definitive tie-breaker for most applications. Here the Speedster is definitely a speed-monster. The raw I/O throughput of this machine will make it a quick favorite for many applications: up to 40 lanes of 10.3 Gbps SerDes and 850 I/O pins up to 1066 MHz for a beast that can provide nearly 1.3 Tbps of raw throughput.
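As a back-of-the-envelope check on that aggregate figure, the sum can be written out directly. The one assumption here, which is not stated in the announcement, is that each general-purpose I/O pin carries one bit per cycle at 1066 MHz.

```python
SERDES_LANES = 40
SERDES_GBPS_PER_LANE = 10.3
IO_PINS = 850
IO_GBPS_PER_PIN = 1.066                # assumption: one bit per pin per cycle at 1066 MHz

serdes_gbps = SERDES_LANES * SERDES_GBPS_PER_LANE
parallel_io_gbps = IO_PINS * IO_GBPS_PER_PIN
total_tbps = (serdes_gbps + parallel_io_gbps) / 1000.0
```

With these inputs the SerDes lanes contribute a bit over 400 Gbps and the parallel pins roughly 900 Gbps, which lands at about 1.3 Tbps in total. Note that most of the raw number comes from the plain I/O pins rather than the headline SerDes lanes.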

Achronix knows that more logic beats faster logic in FPGAs and that I/O is king. They also know that the FPGA market is too smart to fall for a clock-rate race. But the deal-breaker and the golden rule of FPGAs is this: you must have an extremely compelling software workflow if you are going to get designers to adopt your hardware. If Achronix wants to convince me that they've totally pwned the rest of the FPGA market, then they need to provide the "Progressive Insurance" of FPGA tools. I want a website where I can submit my designs and get back the speed and power specs of a Speedster implementation alongside those of several Xilinx and Altera FPGAs.

If Achronix is highly dependent on the existing reconfigurable HDL market for tools and if their hardware advance isn't met with a similar software toolchain advance to take advantage of the new-found throughput, then this technology will have some serious barriers to overcome. It is extremely difficult to automate load-balancing of shared pipelined resources (going from a spreadsheet-RTL with absurdly high resource consumption to an implementable resource-sharing HDL code is one of those magic automations I implemented for my Master's degree).

I'm not sure that anyone knows what it means to make FPGA tools that don't suck, but I'm convinced that building a community and developing domain-specific tools is a huge part of it. If I were Achronix I would do these things to cultivate a user community:

Get boards out to the undergraduate digital design labs at a bunch of schools
Fund competitions for the best applications in multiple niches
Support Open Source IP and Open Source EDA

Frankly, if you don't give your FPGAs to undergraduates they'll end up learning how to use your competitors' wares. Xilinx donated a bunch of FPGAs to MIT to replace the old Altera boards we were previously using in 6.111. The result is that every MIT student who learned how to program an FPGA for the past four years knows what "allow unmatched LOC constraints" means in Xilinx ISE instead of the similar idiosyncrasies of Altera's Quartus toolset.

Bottom line: Achronix needs application benchmarks to prove that their hardware has a future and EDA tools to prove that their company has a future.

Thursday, September 04, 2008

Parallel Programming is Easy: Making a Framework is Hard

An HPCwire article titled "Compilers and More: Parallel Programming Made Easy?" by Michael Wolfe presents a gloomy outlook for parallel programming:

Every time I see someone claiming they've come up with a method to make parallel programming easy, I can't take them seriously.

People you don't take seriously may take you by surprise. I think the computing industry suffers from the "faster horse" problem: Henry Ford couldn't ask his customers what they wanted because they would have said "a faster horse." The instruction stream is a horse: the industry has already built the fastest horses physically possible, so now the industry is going on a multithreading tangent (multi-horsing?).

While everyone else is still in a horse-race, we're building hybrid engines.

The HPCwire article lists a whole bunch of languages, but whenever someone groans about parallel programming and provides a list of languages, Excel is always left off. Clearly we aren't seeing the forest if we leave out the most widely used language ever (probably by a double or even triple-digit factor). The omission is especially egregious when these lists include other visual dataflow languages like Labview and Simulink (this article mentions dataflow style but none of these particulars). Spreadsheet cells are explicitly parallel and make dataflow and vector programming so simple that almost everyone who has ever used a computer has done it. There's even a well-understood model for event-triggered control-flow macros for those cases where you "need" instruction streams.

So I strongly disagree with the premise that parallel programming ought to be difficult. Parallel programming is the same as spreadsheet programming: it's easy to do and everyone knows how it works already. Especially don't let someone convince you that parallel programming is hard if they work on the hardware-software interface. Many of these people still believe parallel programming involves synchronizing random-access-machines running non-deterministic threads (avoid the pitfalls of horse-race-conditions by considering threads harmful).

Developing a high-performance, real-time spreadsheet framework for a hybrid hardware topology requires substantial effort. Depending on your target hardware architecture, you may need to use threads, vector operations, distributed processes, and hardware description languages to iterate that spreadsheet efficiently. To do this, you need a compiler from the spreadsheet language to each of the hardware models you want to support and you need to generate synchronization code to share precedent cell data across hardware elements. Depending on the memory and interconnect architecture this data synchronization code can get somewhat tricky, but code generation from a spreadsheet is the "tractable" part of the parallel programming problem and makes for good Master's theses if you throw in at least one optimization.
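The "iterate the spreadsheet" idea can be shown in a few lines for the single-machine case. This sketch represents each cell as a function plus a list of precedent cells and evaluates them in dependency order with the standard library's graphlib (Python 3.9 or later). The cell names and the iterate_sheet helper are illustrative, not part of any framework described here.

```python
from graphlib import TopologicalSorter

def iterate_sheet(cells):
    """Evaluate cells given as {name: (function, [precedent names])}."""
    graph = {name: deps for name, (_, deps) in cells.items()}
    values = {}
    # static_order() yields every cell after all of its precedents
    for name in TopologicalSorter(graph).static_order():
        fn, deps = cells[name]
        values[name] = fn(*(values[d] for d in deps))
    return values

sheet = {
    "A1": (lambda: 3, []),
    "A2": (lambda: 4, []),
    "B1": (lambda a, b: a * b, ["A1", "A2"]),
    "C1": (lambda b: b + 1, ["B1"]),
}
result = iterate_sheet(sheet)
```

Cells with no path between them (A1 and A2 here) have no ordering constraint, so a compiler is free to place them on different hardware; the dependency lists are exactly the data that the generated synchronization code has to move.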

For your PhD you'll have to do something more difficult than just automatically generating parallel code from a partitioned dataflow graph.

Optimal partitioning of parallel programs is obscenely hard (MegaHard as my previous post would have it). In heterogeneous environments that use many of these primitive parallel models, you need to worry about optimally partitioning which cells run on which metal. Partitioning based on computational resources is a pain, but the real difficulty is optimizing for the communication requirements between partitions and the communication constraints between the hardware elements. We are approaching the optimal partitioning problem by assigning a color for each chunk-of-metal. We group spreadsheet cells by color and then profile the computational load of the color group and the communication between color groups using hardware constraints.
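The profiling step of the coloring approach can be sketched as follows. Given a color per cell, it tallies the compute load of each color group and counts how many values cross between groups. The unit cost per cell, the function name and the example colors are assumptions for illustration; a real profiler would weight cells and links by measured hardware costs.

```python
from collections import Counter

def profile_partition(precedents, color, cost=None):
    """Tally per-color load and cross-color traffic for a colored dataflow graph.

    precedents: {cell: [cells it reads]}, color: {cell: color name},
    cost: optional {cell: compute cost}; every cell costs 1 when omitted.
    """
    cost = cost or {}
    load = Counter()
    traffic = Counter()
    for cell, deps in precedents.items():
        load[color[cell]] += cost.get(cell, 1)
        for dep in deps:
            if color[dep] != color[cell]:
                # one value must cross from the producer's metal to the consumer's
                traffic[(color[dep], color[cell])] += 1
    return load, traffic

precedents = {"A1": [], "A2": [], "B1": ["A1", "A2"], "C1": ["B1"]}
color = {"A1": "cpu", "A2": "cpu", "B1": "fpga", "C1": "fpga"}
load, traffic = profile_partition(precedents, color)
```

Evaluating this cost for one coloring is cheap. The hard part is the search over colorings, which is where the optimization problem earns its MegaHard label.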

The HPCwire article does mention communicating sequential processes and dataflow models:

"Until we have machines that implement these [communicating sequential processes and dataflow] models more closely, we need to take into account the cost of the virtualization as well."

We do have machines that implement these models (as Tilera and Altera will attest). They are still as difficult to program as any parallel architecture, but I assure you that once we start to think of these things as "hardware spreadsheets" we will start to see a way out of the parallel programming cave. I wonder if people who describe an FPGA as a "neural-net processor" make the pop-culture connection:

Friday, August 29, 2008

Megahard Corp: Open Source EDA as-a-Service

Prescience is a quality of those who act on it:

Building very-large-scale parallel computing structures requires simulation and design-cost analysis whose associated optimizations are MegaHard problems (yes, I'm re-branding NP-Hard). A Software-as-a-Service (SaaS) provider of vendor-neutral simulation, synthesis, mapping, and place-and-route tools could change the electronic design automation industry substantially.

If a supercomputer architecture could perform the MegaHard compiler tasks associated with compiling parallel systems in less time than the current tools (within seconds instead of minutes/hours) many designers would gladly pay for access. If the supercomputer uses distributed algorithms, then many people may offer spare cycles to do parts of your MegaHard jobs. Most developers using such a tool would gladly lend their idle cycles to perform graph permutations and compute cost functions.

So let's motivate and create a business model for the tentatively-named, Megahard Corp.

Megahard Corp. (pronounced "Mega Hardcore") has the opposite business model of Microsoft Corp. Instead of selling closed source software and tools as licensed applications, Megahard provides open-source hardware IP and development tools as a service.

The writing is on the wall: we are transitioning from a personal-computing, single-core desktop world to a shared-computing, multicore/FPGA embedded world. Yet the complexity of parallel system designs and the computational strain on tools to compile them is growing faster than the performance of a practical desktop computer. The decision for EDA firms to adopt the Application Service Provider (ASP) model probably made sense at some point in the last millennium: design assumptions are much different when you are connected to a massive parallel computer. Because current tools take much, much longer to compile than the file transfer time, there is potential to improve the compiler market by providing electronic design automation tools online.

So here's the business plan: build a supercomputer and code it to run a bunch of FPGA/multicore compilers and simulators really well. Since it's a supercomputer it should make the problem I/O bound. We can tell people we've got this supercomputer that will take in source and spit out configurations in about as long as it takes for you to upload the diff source and download the diff configuration. Since we have a really big supercomputer with lots of hardware all over the place, you can also target your application to some of our bored metal and provide your own service under our framework. (Sassafras is a good name for a SaaS Framework and can have the leaf-shaped logo)

Since we're thoroughly indoctrinated, we'll build our system as Free software and use all the existing open source compiler architecture too. Being open source means our tool won't be made obsolete by an eventual open source standard (read that again so you get it). Open source frameworks allow capable users to contribute to the product development: and wouldn't you know it, the entire EDA market happens to have degrees in computer science, electrical engineering, mathematics and physics.

There's also that recurring comp.arch.fpga thread complaining about the current tools, the lack of open source tools, and the lack of documentation of the configuration format and interface -- someone must appeal to these people because they have a real point: technology is inhibited by monopolized knowledge. It's harder for features to improve when they are part of closed-source packages because you are automatically limiting the number of developers that can improve it (this situation worsens as you lay off more developers).

Another benefit of using a SaaS tool: your users don't have to waste time upgrading. The only way to convince anyone to wait through a software upgrade process is by announcing new features that claim to make the product better: that's why users wait till the second iteration of a major release. SaaS providers can roll out new features and test the waters at a much higher iteration rate.

When I use the term open source, I mean "free" like freedom, but it's important to point out that previous forays into open source FPGA tools have failed and disappeared because they were merely "open source" and not "free." When I was just starting to care about these sorts of things, GOSPL was being hyped up and eventually was canceled. The code is nowhere to be found because access to the source was restricted by invite-only membership for some nonsense reason: membership exclusivity maintains the integrity of the organization. When giant corporations disingenuously use "open source" as a marketing term for non-free software, the project is destined/deserves to fail as long as the so-called "open" source is never actually provided to freedom-loving eyes.

On the other hand, free software like Debian, or free hardware IP like OpenCores, won't stop being useful just because some organization stops funding it. Free-software projects never die, they just go unmaintained.

Besides, Megahard Corp. follows a SaaS model, so the profitability is from developing and maintaining a parallel supercomputer system to run the free compiler software, instead of distributing that software. Good free software projects encourage users to "help make it better" instead of "help make us richer," though improving products is a good way to increase sales. A sufficiently better compilation service is probably more profitable than selling licenses too -- I would certainly pay more than what I'm currently paying for proprietary software licenses if I could get access to a supercomputer for FPGA compilation that does it faster.

One major problem: If you aren't Altera, Xilinx, Lattice, Actel, Achronix, Tilera, IBM, AMD, Intel etc. then making low-level Multicore and FPGA tools requires a whole bunch of reverse-engineering and "MegaHard Work" (TM). It's probably technically easier to design an FPGA from the bottom up than to reverse engineer the architecture of an existing chip and build a new tool-chain.

Open-source compilers and reverse-engineered, vendor-neutral FPGA tools have both been successful endeavors in the past. Providing Hardware-as-a-Service is still a "cloudy" future, but there are plenty of recent examples where a vendor came into a market ripe for SaaS-ification and redefined the market. I expect that the MegaHard problems associated with design automation make it ripe for a SaaS provider to change the game.

----
(Edit 9/2) If you find positions contrary to the Open Source EDA SaaS argument, please share them with me. Here's an old interview with Grant Martin, Chief Scientist of Tensilica, who argues that we should not hold our breath for Open Source EDA and specifically says:

I think the other issue with EDA is in terms of the general number of users. I don't think there's a large enough number of users in any particular sub-category of tools to really support very much open source development. Open source requires a big community involvement, plus ancillary things being built around that environment to attract companies into the effort.

EDA is still too small to make open source the general model. The idea that all tools will be open source, and that all EDA companies would evolve to a service business model, is unlikely and makes no sense.
....
The business incentives just aren't there to motivate those efforts in an open source environment.

I tend to not trust the beliefs of people who were born before the Internet (aka the over-30 crowd). I think he's missing the perhaps non-obvious point that a service business model can subvert the traditional software business model by offering a faster software service with smoother new-feature roll-outs (perhaps when he says "service business model" he thinks RedHat instead of Google Apps). I also know from my prior immersion at CSAIL, that open source software development does NOT require users or a big community to be involved in development, but only requires one indefatigable iconoclast to chip away at status quo for his own personal reasons that are often incongruous with profit motive. When multiple hackers create disjoint useful tools that converge into a single product, user communities start to form to propel that framework further and potentially crack the existing market open.

I wonder how anyone can reject the service business model future for basically ANY computer tool. Seeing such a future no longer requires prescience, just patience. FPGA accelerated computing will increase the number of EDA users, but none of these users will ever identify with the existing EDA market (there's far too much FUD associated with "hardware" development that they traditionally sell: programmable arrays are more than reconfigurable HDL executers).
(end edit)

Sunday, August 24, 2008

Threads Considered Harmful (for the same reason as Goto)

A month ago, Charles Leiserson wrote a post on the Multicore Blog at Cilk Arts called "The Folly of DIY Multithreading." He provides pthreads and Cilk implementations of a parallel fib() function and offers great advice in his article: "Building a concurrency platform from scratch is a mountain to climb." Professor Leiserson mentioned the Therac-25 radiation therapy machine in his post. In 6.033, every MIT CS student learns about how thread interlocking can lead to confuddling system errors that can kill people. Clearly threads should be considered harmful. Professor Leiserson argues that using a multithreading platform like Cilk will help you address these harmful side-effects of multithreading.

Two years ago, Edward Lee at Berkeley wrote an excellent paper called "The Problem With Threads." In his paper he emphasizes that determinism and composability should be the two major objectives when programming concurrent systems. He describes patterns like MapReduce as "coordination languages" and believes that this model is a fruitful route for parallel programming. He argues that systems that attempt to simplify multithreading "merely chip away at the unnecessarily enormous nondeterminism of the threading model" and that the threading model is "intrinsically intractable." The final paragraph of his essay serves as a manifesto for those of us designing non-threaded frameworks for parallel computing:

If we expect concurrent programming to be mainstream, and if we demand reliability and predictability from programs, then we must discard threads as a programming model. Concurrent programming models can be constructed that are much more predictable and understandable than threads. They are based on a very simple principle: deterministic ends should be accomplished with deterministic means. Nondeterminism should be judiciously and carefully introduced where needed, and should be explicit in programs. This principle seems obvious, yet it is not accomplished by threads. Threads must be relegated to the engine room of computing, to be suffered only by expert technology providers.

Threads aren't just harmful because of non-determinism and interlocking issues. Threads are harmful for the same reason Edsger Dijkstra argued Goto is harmful. It's the same reason John Backus apologized for Fortran in his Turing award lecture. The serial instruction-stream control-flow style of traditional computer programming is the wrong way to approach most electronic data processing systems!

The main point of this blog is to argue that any-kind-of-threading is a folly (even single threading) and to put forward the spreadsheet as a better hardware primitive. While hardware realities are forcing the deprecation of the single-threaded instruction stream primitive model, we should deprecate the rest of the von Neumann model while we're at it. We should avoid the perils of multithreading and transition to a dataflow-centric primitive computing model premised around arrays of reconfigurable dataflow units. Locality constrained dataflow is how electrical engineers design physically realizable systems and programmers will have to follow suit to design concurrent code. As we scale to ever more cores, multi-CPUs will start to resemble FPGAs and addressing the inter-process communication issues will require a more physical dataflow view of the system. (Electrical Engineers don't believe in magic. Computer Scientists don't believe in physics.)

Indeed, part of the folly of DIY dataflow graph (DFG) engines is writing your own multithreaded code to run DFGs on multicore and single threaded executers -- not to discount the NP-Hard problems related to heterogeneous partitioning and FPGA compilation. Let your dataflow framework provider worry about using multithreading platforms to run your DFGs so you can just keep using spreadsheets.

Tuesday, August 12, 2008

MIT Students Use FPGAs to Hack Boston T

(edit 8/19) The judge has lifted the gag order today. (end edit)

/. points to an article about a court issuing a temporary restraining order to block a Defcon presentation by a group of MIT students that hacked the Boston T Charlie Card system. Not only can they print their own magnetic-striped cards with up to $655.36, but they can also crack the RFID cards using the now-widely-known NXP MiFare vulnerability and a "WarCart" they built (USENIX paper: Reverse-Engineering a Cryptographic RFID Tag). These students started this work for 6.857 taught by Ron Rivest. Two of them took 6.111 when I TA'ed: they attempted to build a CDMA Control Channel Traffic Analyzer for their final project. Maybe we should only allow projects that could end up getting us into lawsuits!

The Tech website has the presentation slides with all the legal documents too. The MBTA's strongest argument is based on the premise that it was unclear whether the MIT students were going to present any new vulnerabilities that were not already in the materials they had sent. Such materials could be a potential threat to public safety. The claim posits that Professor Rivest would not have given them an "A" for the project if their work was just magstripe printing and a repeat of the MiFare hack. It was pretty clear (to me) from the evidence that they were only presenting vulnerabilities in the CharlieCard system that they explained in their now-public disclosure to the MBTA as well as demonstrating how poorly accountable the actual security system is currently.

From a free-speech standpoint I think the temporary restraining order may have been appropriate to require the students to explain their project to the court in case they were going to present anything that could present a safety hazard to the subway system. As far as a non-expert Judge is concerned "someone is presenting novel vulnerabilities in the Boston Subway system" should be enough to merit the fullest caution of the court. However, as soon as it is clear to the Judge that the financial interests of the MBTA (and not public safety) are being weighed against the free-speech rights of the MIT students, the TRO should be lifted. The result of this case will probably be the same as the Dutch NXP Mifare case and an appeals court should lift the restraining order in favor of first amendment rights.

(Edit 8/14): Some of the terms of the TRO are ridiculous from a free-speech standpoint -- claim (2) would imply that security is a property established by judicial process. It's self-contradictory to include that in the request. Apparently the Judge denied a reasonable appeal from the EFF to add the term "non-public" to the TRO. Apparently this Judge wants a higher court to clarify that first amendment rights outweigh the financial interests of the MBTA.

The Boston Herald reports that an MBTA board member, Janice Loux, is calling for a security audit and has lost confidence in the General Manager, Daniel Grabauskas. Here's a quote from Grabauskas that contradicts the court's action today to deny the appeal: “We just want to make very clear that we’re not interested in quashing anyone’s First Amendment rights,” Grabauskas said, declining to respond to Loux’s criticisms. “All we’re really asking is that the court bar the release of any nonpublic information that might adversely affect the CharlieCard’s security.”
(end edit)

That said, I think the students probably should have presented the CharlieCard vulnerability anyway since they were definitely not going to present anything that an expert couldn't figure out from the data that is out now -- and with so much more attention because of the injunction. Apparently the MBTA also isn't familiar with the Streisand effect. This aspect only highlights a common management problem and my new favorite aphorism:

Do not trust anyone over 30: they did not grow up with the Internet

Never hire someone for a security position unless they have a solid understanding of memetic diffusion in social networks.

As for the problem of finding qualified security consultants to audit the system, I think the students just volunteered to do it*: considering they have pictures in their presentation documenting how they committed various types of fraud it seems like a fitting punishment to make them fix Skynet. I guarantee that they are capable of building a more secure CharlieCard system on the cheap (like 1/100th of the $180M spent on this already). A short term software-only fix could probably achieve much higher security within weeks. The system should have a built-in RMS notice to encourage card swapping "to limit the amount of information Big Brother [Little Brother] can collect about T riders..." (RMS sends a message like this to the CSAIL list to setup card swapping sessions every so often).

The MIT students have a great slide in their presentation: "Why Brute Force with an FPGA? ... Because it's Fast!" I think it's probably better to make this pitch at the High Performance on Wall Street conference over Defcon. From their slides it seems like they created an FPGA-accelerated 48-bit RFID key cracker using a Python script to generate Verilog. Now they, like h1kari from Pico Computing, are giving presentations to other hackers about the merits of FPGA-accelerated cracking (video of h1kari demoing FPGA accelerated WEP cracking). I remember an all nighter at the Las Vegas airport after Defcon in '06 discussing the state of FPGA programming tools with h1kari... that conference made me realize that the existence of vendor-neutral open-source FPGA tools would be the catalyst to change the perception that FPGAs are difficult to program. It seems like reconfigurable computing has multiple fathers and a whole slew of iconoclastic sons trying desperately to get on FBI watch-lists. The FBI probably doesn't suck at this FPGA-accelerated cracking stuff either, but they find more use for FPGAs in their data-warehouses as I'm sure my readers won't confirm.

To understand how the RFID hack was performed, you should read the Nohl paper from UVA which describes how the physical circuitry of a Mifare Classic RFID was reverse engineered. I think the technology used to determine the digital logic on the chips is more interesting than the rest of the paper about the RFID security. Once they determined the cipher for the Mifare cards, they implemented a naive brute force cracker in 64 Virtex-5 LX50 FPGAs that takes 50 minutes. Then they also reveal a weakness in the random number generator that allows them to essentially eliminate the randomness. They also describe a method of generating codebooks to substantially accelerate the crack.
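To get a feel for what that brute-force figure implies, here is a back-of-the-envelope calculation. It rests on an assumption the paragraph above does not state outright: that the 50 minutes covers an exhaustive search of the full 48-bit key space, spread evenly over the 64 FPGAs.

```python
# Assumption (not confirmed above): 50 minutes is the time to sweep
# all 2**48 keys, with the work split evenly across 64 FPGAs.
KEY_BITS = 48
NUM_FPGAS = 64
SEARCH_SECONDS = 50 * 60

total_keys = 2 ** KEY_BITS
keys_per_second = total_keys / SEARCH_SECONDS
keys_per_second_per_fpga = keys_per_second / NUM_FPGAS

print("total keys:          %d" % total_keys)
print("keys/s (cluster):    %.3e" % keys_per_second)
print("keys/s (per FPGA):   %.3e" % keys_per_second_per_fpga)
```

Under that assumption each FPGA has to test on the order of a billion keys per second, which only works if many cipher cores run side by side on every chip.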

From the slides of their presentation, I believe the MIT students transformed the naive brute-force multivariate quadratic (MQ) cipher cracking problem to a more efficiently implemented SAT problem (which would certainly merit an "A" for the work). FPGAs are perfect for brute-forcing SAT and there are probably 1,000 papers on this topic since it is possibly the most obvious target for FPGA acceleration. From the slides, it seems like the students replicate their cores with a Python script and probably do not contribute anything interesting to accelerating SAT except a novel way to profit from it.

From the comments on this year-old hackaday thread about converting HDTV boards into crypto-crackers, FPGA-accelerated cracking seems to be a good way to get people excited about FPGA technology. It also seems the problem of explaining "what is an FPGA" is still a hurdle to widespread adoption. This is where I say that an FPGA is a hardware spreadsheet, and we call it "Accelerated Computing" to proliferate our favorite meme to replace any other nerd-word you might put in front of "Computing." I also built a spreadsheet to hack the 48-bit RFID keys; it took less than 10 minutes to design and compile to Verilog, but I'm actually afraid to post it now! Here's the anatomy of a massively parallel brute-force cracker: each row tests a unique range of keys and iterates through them until it finds a match, columns of the sheet are "start-key" "end-key" "current-key" "next-key" "hashed-output" and "satisfies target." Super-clever implementers will propagate constants to optimize the hashing function based on the (start, end) range for each row.
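The row layout described above can be sketched in a few lines of Python. This is not the spreadsheet I built and it does not touch the Mifare cipher: `toy_hash` is a made-up stand-in, the key space is shrunk to 16 bits so it finishes instantly, and the function names are hypothetical. On an FPGA every row would advance in the same clock cycle; here a loop visits the rows one after another.

```python
def toy_hash(key):
    """Stand-in mixing function for illustration; NOT the Mifare cipher."""
    return (key * 2654435761) & 0xFFFFFFFF

def make_rows(key_space, n_rows):
    """Split [0, key_space) into contiguous ranges, one per row."""
    step = -(-key_space // n_rows)  # ceiling division
    return [
        {"start": s, "end": min(s + step, key_space), "current": s}
        for s in range(0, key_space, step)
    ]

def iterate(rows, target):
    """One sheet iteration: each row hashes its current key, then advances."""
    for row in rows:
        if row["current"] < row["end"]:
            hashed_output = toy_hash(row["current"])
            if hashed_output == target:  # the "satisfies target" column
                return row["current"]
            row["current"] += 1          # the "next-key" column
    return None

def crack(target, key_space=1 << 16, n_rows=64):
    """Iterate the sheet until some row finds the key or all ranges are done."""
    rows = make_rows(key_space, n_rows)
    for _ in range(-(-key_space // n_rows)):
        found = iterate(rows, target)
        if found is not None:
            return found
    return None

secret = 0xBEEF
print(hex(crack(toy_hash(secret))))
```

The number of sheet iterations is bounded by the size of one row's range, not the whole key space, which is the point of adding rows.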

Coming Soon: A collection of "Threads Considered Harmful" articles about the dangers of programming in parallel control flow style and the merits of dataflow design for easy-to-grok parallelism.

* The second-most important thing I learned at MIT was that things aren't done because you haven't done them. The most important thing I extracted from MIT was an observation about the effectiveness of people at completing a task that Arlow and I made while running house-cleanups in our fraternity. This observation was extremely apparent as a lab assistant watching students learn to develop FPGA applications in teams. We learned that the metric for a person's effectiveness is the product of their motivation and skill. With rare exception the overall effectiveness of grouping people working tightly together is the product of their individual contributions. Skill can exist anywhere in the complex plane. This model explains things like Microsoft, Google, Congress, and many other organizations down to the T. Imaginary skill leads to "security systems" meaning something different with the quotes. Motivated negative skill leads to War (and WarXing my EZ-Pass for profit).

Thursday, July 24, 2008

RhoZeta Excel-Python-OpenGL Demo

In this demo I will show you how to use the RhoZeta Python-Excel bindings to create a non-blocking assignment spreadsheet iteration thread and an OpenGL window to draw cells. This older demo shows more Python evaluate-able code in Excel spreadsheet cells with RhoZeta. Feel free to leave comments and suggestions for future demos.

RhoZeta is a Python spreadsheet language with Excel bindings to allow programmers to process spreadsheet formulas through a Python backend. The cell values are forwarded to the Excel frontend to display in the current Visible Range. The longer term goal is to demonstrate how the spreadsheet iteration model can serve as an open framework for accelerated computing on FPGA, GPU, and Multicore architectures. These architectures consist of physical arrays of cells designed for dataflow processing and vector data parallelism. This locality-constrained, vector-oriented, dataflow programming model finds obvious expression in a spreadsheet. I've argued that this model effectively replaces the von Neumann instruction stream as the primitive model for designing systems on parallel hardware.

The code for this demo is up on the Catalyst Accelerated Computing Trac. Running RhoZeta.py alone does not require PyOpenGL but you won't get the OpenGL window. RhoZetaDemo.py depends on PyOpenGL (which requires setuptools to install) and win32com (included with Python for Windows aka PythonWin). I've only tested this with Excel 2007, but I think the underlying COM interface is still compatible with 2003.

Firstly: this is hacking and not a product yet. DO NOT START THIS WITH ANY OPEN WORKBOOKS OR YOU MAY LOSE YOUR DATA. You were warned.

Run RhoZetaDemo.py and Excel and an OpenGL Window should pop-up:

Put the function "=time.clock()" into cell A1. If the cell manager is functioning, the value of A1 should be forwarded to the Excel frontend on a Timeout.

Since this is a non-blocking assignment interpreter, placing "=A1" into cell A2 will cause A2 to act like a register reflecting the value of A1 from the previous iteration:

We use the fact that A2 is A1 delayed by an iteration cycle to determine the time difference between two cells:

We can repeat the register-delay structure to track the iteration time for multiple clock cycles:

If we repeat this register structure 4 times we may take the average iteration time (try to click-and-drag the formula from cell B4, it should work).

I also whipped up some code to pull the Row Heights and Column Widths into an OpenGL drawing window with some MouseOver detection (only 2-D... I need to learn how to do selection correctly and in 3-D). The OpenGL window synchronizes with Excel when you click in the window. The top left cell should be the same as the top left cell in your Excel visible range. There's also code to draw a bar-graph of the selected cells data, but you'll have to tinker. Expect the next OpenGL demo to be even cooler.

The goal here was to demonstrate a spreadsheet that iterates continuously following a non-blocking semantic as in Verilog non-blocking assignments: each cell is computed from the same pool of values and the next values are assigned simultaneously. Ordinary Excel iteration interprets according to blocking assignment semantics: each cell is computed and assigned in the order in which the table is read. Under Excel's ordinary blocking assignments, assigning A2 the formula "=A1" would read the newly assigned value from the current iteration because A2 comes after A1. We can do shift registers in blocking assignments by reversing the order, i.e. assigning A1 to "=A2".
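The difference between the two semantics is easy to see in a small sketch. This is not RhoZeta's code: cells are plain dictionary entries, formulas are Python lambdas, and the function names are made up. Here the blocking "reversal" is done by reversing the evaluation order rather than swapping which cell holds which formula, which has the same effect.

```python
def step_nonblocking(values, formulas):
    """Every cell reads the old values; all new values are assigned together."""
    return {cell: f(values) for cell, f in formulas.items()}

def step_blocking(values, formulas, order):
    """Cells are computed and assigned one at a time, in reading order."""
    values = dict(values)
    for cell in order:
        values[cell] = formulas[cell](values)
    return values

# A1 counts up; A2 = A1 and A3 = A2 are meant to be a two-stage shift register.
formulas = {
    "A1": lambda v: v["A1"] + 1,
    "A2": lambda v: v["A1"],
    "A3": lambda v: v["A2"],
}
start = {"A1": 0, "A2": 0, "A3": 0}

nonblocking = start
blocking = start
blocking_reversed = start
for _ in range(3):
    nonblocking = step_nonblocking(nonblocking, formulas)
    blocking = step_blocking(blocking, formulas, ["A1", "A2", "A3"])
    blocking_reversed = step_blocking(blocking_reversed, formulas, ["A3", "A2", "A1"])

print(nonblocking)        # A2 and A3 lag A1 by one and two iterations
print(blocking)           # all three cells collapse to the same value
print(blocking_reversed)  # reversing the order restores the register delay
```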

Cells in a blocking assignment iteration are non-commutable, the location of the cell matters to the interpretation. Non-blocking assigned cells may be moved freely. A third assignment type, asynchronous assignment, computes a new value whenever the cell's precedents change as a combinatorial circuit would. Cells with an asynchronous assignment are also commutable. It is generally possible to mix these forms of assignments as in VHDL with mixed variables and signals and continuous assignments. We could presumably support a syntax to replace the leading "=" to allow mixed modes.

The iteration timing here is poor because cells are being dynamically parsed and interpreted by Python using the eval function. We know how to do better by JIT compiling sheet iteration in Python (some code is up in our Trac to do this). The next step in performance is to statically propagate types and compile sheet iteration pipelines to C so we can fork new threads onto a multicore/GPU/FPGA-core and parallelize sheet iteration. We also know how to get this sheet to iterate deterministically with dedicated resources in under 2 nanoseconds using an FPGA synthesis toolchain -- of course measuring the average iteration time for a deterministic iterator would be silly.
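The cheapest first step away from re-parsing strings is to compile each formula once and reuse the code object on every iteration. The sketch below uses only Python's built-in `compile` and `eval`; it is not the JIT code from the Trac, and the three formulas are made up for illustration.

```python
import math

# Hypothetical sheet: cell name -> formula source (without the leading "=").
formulas = {"A1": "math.sqrt(16)", "A2": "A1 + 1", "B1": "A1 * A2"}

# Parse each formula once; iterations reuse the resulting code objects.
compiled = {
    cell: compile(source, "<" + cell + ">", "eval")
    for cell, source in formulas.items()
}

def iterate(values):
    """One non-blocking iteration: all cells read the previous values."""
    env = {"math": math}
    env.update(values)
    return {cell: eval(code, env) for cell, code in compiled.items()}

values = {"A1": 0, "A2": 0, "B1": 0}
for _ in range(3):
    values = iterate(values)
print(values)
```

This removes the parsing cost but still runs interpreted bytecode, so it is only a small step toward the compiled pipelines described above.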

Monday, July 21, 2008

Partitioning Dataflow Graphs

OpenFPGA forked a child called Open Accelerator to organize activities in open accelerated computing. Sounds like my business plan. Let me talk about the hard problem: partitioning.

If you're like me, when you got your heterogeneous supercomputer out of the box, you probably produced some mixture of OpenGL, CUDA, Verilog, pthreads or spethreads, Python and duct-tape. We'll get back to the duct-tape.

The problem of accelerated computing development is working simultaneously with both control flow graphs (CFGs) and dataflow graphs (DFGs) on both control-flow engines (CPU) and data-flow engines (FPGA). I've argued that a spreadsheet framework will take care of the language problem. Even people who don't consider themselves programmers are familiar with the semantics of mixing dataflow in cells with control flow commands in macros. We also know how to translate between control flow and dataflow descriptions. Now it's easy to think-outside-the-box, but actually building a good box for other people to think inside of is not so easy.

The devil is in the duct-tape.

A NUMA without global cache-coherence is too difficult to program without someone hiding the synchronization protocol behind simple spreadsheet semantics. Once you partition cells into separate processing nodes, you have to worry about communicating precedents across partition boundaries. The easy option is to perform a global commit cycle for all shared cells. This works fine when you're only using multithreading to hide memory latency in a single to quad core or if you require all-to-all communication.

If you want to make a real multicore spreadsheet, after you partition the spreadsheet, you'll want to automatically generate code to manage data coherency by explicitly passing messages between communicating partitions. This problem is actually very similar to the routing part of the place-and-route problem, though efficiency is even harder to profile. Optimal partitions need to minimize the total number of consumers for each producer, as well as minimize the total amount of data consumed. We're saved a bit by the fact that spreadsheet dataflow designs tend to have a fixed, regular communication structure. As a result it is possible to generate a deterministic synchronization procedure for any partition.
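Because the dependency structure is fixed, the list of messages each partition must send per iteration can be computed once, ahead of time. Here is a minimal sketch of that idea; the cell names, the partition numbering, and the `sync_schedule` function are hypothetical, and a real generator would also have to order the sends and account for link bandwidth.

```python
from collections import defaultdict

def sync_schedule(precedents, partition):
    """Map (producer partition, consumer partition) -> sorted cells to send.

    A cell appears at most once per partition pair, no matter how many
    consumers it has on the far side.
    """
    sends = defaultdict(set)
    for cell, deps in precedents.items():
        for dep in deps:
            src, dst = partition[dep], partition[cell]
            if src != dst:
                sends[(src, dst)].add(dep)
    return {pair: sorted(cells) for pair, cells in sorted(sends.items())}

# Hypothetical sheet: cell -> list of precedent cells.
precedents = {
    "A1": [],
    "A2": ["A1"],
    "B1": ["A1", "A2"],
    "B2": ["A1", "B1"],
    "C1": ["B2"],
}
# Hypothetical assignment of cells to processing nodes 0, 1 and 2.
partition = {"A1": 0, "A2": 0, "B1": 1, "B2": 1, "C1": 2}

schedule = sync_schedule(precedents, partition)
for (src, dst), cells in schedule.items():
    print("node %d -> node %d: %s" % (src, dst, ", ".join(cells)))
```

The size of each entry in `schedule` is exactly the cross-partition traffic that a partition optimizer is trying to minimize.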

Profiling cross-partition communication is only one of many stumbling blocks for a partition optimizer. Profiling and load-balancing the computation itself is another problem. Here's something I wrote two years ago about using an economic model for profiling the load-balancing on heterogeneous architectures in a run-time environment.

Realistically, developers who want high performance data systems should expect some amount of manual partitioning. The goal of any partitioning tool should be to expose these issues to the user as clearly as possible and make it easy for the user to optimize the design using macros. These sorts of macros will be like a welding-torch compared to the current duct-tape.

Sunday, July 13, 2008

Power Electronics

For this entry, I'm going to take a departure from the usual microelectronics and HPC topics and focus on power electronics topics on which I have been keeping .. current (forgive me).

Ultracapacitors

According to Wikipedia, Lithium Ion batteries can store 160 Wh/kg or 576 kJ/kg with maximum power output of 1800 W/kg. The wiki page on ultracapacitors indicates 120 Wh/kg as a comparison point for Lithium Ion batteries, but Li-ion is obviously somewhere in the 100-200 Wh/kg range...

Solid-state Ultracapacitors support substantially higher power than Li batteries (6 kW/kg from wiki), though the energy per weight efficiency is often an order of magnitude lower. This means they can be charged and discharged much more rapidly than a battery, but contain much less energy. Since there are no chemical processes, the number of charge cycles for the system is many times greater than a Li battery.

MIT Professor Joel Schindall uses aligned nanotubes to increase the surface area for charge storage in an ultracap and aims for 30-60 Wh/kg. Read his article from IEEE Spectrum.

EEStor makes ultracapacitors in an ultrasecretive fashion, but the claimed results beat Li batteries in energy stored per unit weight by a factor of 2 (200-300 Wh/kg, or roughly 1 MJ/kg). Zenn Motors is apparently using this in their electric vehicle offering.

A few months back I was reading about Graphene transistors. Capacitors using Graphene Monolayers seem like a good idea. An Indian team reports 31.9 Wh/kg.
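For anyone who wants to check the unit conversions above, here is a small sketch. It only uses the figures quoted in this post, and the helper name `min_discharge_minutes` is made up for the example. Dividing specific energy by specific power gives the shortest time in which a cell could be drained at its maximum rate.

```python
WH_TO_KJ = 3.6  # 1 Wh = 3600 J = 3.6 kJ

# Specific-energy figures quoted in this post, in Wh/kg.
quoted_wh_per_kg = {
    "Li-ion (Wikipedia)": 160,
    "EEStor claim, low": 200,
    "EEStor claim, high": 300,
    "Graphene ultracap": 31.9,
}

def min_discharge_minutes(wh_per_kg, w_per_kg):
    """Shortest full-discharge time at maximum power, in minutes."""
    return wh_per_kg / w_per_kg * 60.0

for name, wh in quoted_wh_per_kg.items():
    print(f"{name}: {wh} Wh/kg = {wh * WH_TO_KJ:.0f} kJ/kg")

# Li-ion at 160 Wh/kg and 1800 W/kg, as quoted above.
li_ion_minutes = min_discharge_minutes(160, 1800)
print(f"Li-ion minimum discharge time: {li_ion_minutes:.1f} minutes")
```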

Solar Power

The next several years will be the early adoption period for solar energy. Clever mayors will propel political careers on their ability to successfully deploy solar energy in their towns.

Making Solar-Tiles-On-Every-Roof happen is like making One-Laptop-Per-Child happen: the problem isn't just finding the cheapest solar tile you can make so much as having a scalable deployment plan (if you spend too much time making really cheap laptops, you forget that the real problem is the per-child part). A scalable solar plan should put grid-tied panels on roofs while appreciating accelerating returns from volume and adapting to a variety of local financial incentives.

Israeli Bank Hapoalim offers a loan program and Israel Electric offers grid-tie incentives to encourage the adoption of Solar technology.

Without established best practices in these loan and incentive programs, there's a lot of difficult ROI math to convince people to pay for a panel. It seems like the best solution to get solar electricity to market is to provide free solar panels and grid tie installation in exchange for a mortgage redeemable in sub-market-price electricity. Such bonds can tie together with "weather risk bonds" to form a new type of financial product allowing you to invest in solar energy by essentially renting someone's roof to profit off the power generation (roof-renting actually makes sense in more than one way in a foreclosure market). States and cities could provide tax incentives on this type of investment vehicle to make it more attractive against other financial products and to encourage people to invest in regional solar power.

An interesting HPC side note related to weather bonds: I wonder if our superb supercomputer models can be used to predict total solar panel output on a given day...

Superconductors

A superconducting electric car has been built in Japan. The concept seems sound: higher current density means you can use a lighter conductor for the same magnetic field, not to mention the efficiency boost from 0 electrical resistance. There's obviously the added liquid nitrogen tank too, which is also useful when you're trying to get away from a T1000. They claim 10% more range. I wonder if their suspension system uses the Meissner effect...

It seems plausible that the superconducting electric motor could also be used to efficiently compress liquid nitrogen while the thing is recharging too (I have the same inclination to insist on FPGA acceleration for FPGA accelerator compilation).

Nuclear Fusion

Google Tech Talk about Bussard's Polywell Fusion Reactor. The majority of Fusion research is in toroidal Tokamak magnetic containment structures. Bussard's comment is that we have had all these problems generating sustainable fusion with Tokamaks, and yet we look up and see thousands of fusion examples, none of which are toroidal. Instead they are all held together using gravity: a radial 1/r^2 law.

To achieve a similar 1/r^2 radial field, we can create an electron ball in the center of a reactor and use this field to draw fusible ions close to each other. This idea, called inertial electrostatic confinement, had been explored by Bussard's colleagues in the development of the fusor. The fusor used an electrode cage to create the electric field for the electric confinement, but collisions with the cage resulted in a net energy loss and a burnt up electrode cage.

The insight in the Polywell reactor is that trapping electrons in a magnetic field is a lot easier than trapping fusible ions as in a Tokamak (because an electron doesn't weigh a lot). The Polywell creates an electron "wiffle ball" using a polyhedral arrangement of coils to avoid the heating problem with cage-based designs. In his Google Tech Talk video, Bussard describes the engineering challenges and insights discovered during the various design iterations. He died shortly after receiving funding to continue the program, but the work is still continuing.

When you see him present the history of Fusion research and the reason things are the way they are with the massive Tokamaks dominating research, you can tell he just knows that controlling ion momentum in magnetic fields "won't produce fusion power, but it will produce great physics." And so the funding continues.

He knew he was on to something with inertial electrostatic confinement using magnetic fields so he patented the design. Then he improved his design by using circular rings instead of square rings, and then ideally spacing his non-ideal rings. I wonder if they could modulate the current in the rings to produce dynamic stabilization and try to narrow the cusps.

You almost want to build a really large one with superconducting electromagnets and use it to hold a small star in place.

Tuesday, July 01, 2008

FPGA Editor and Google Maps

Another Israeli FPGA guy named Eli Billauer produced a video explaining how to use FPGA Editor. Now that you've watched that video, load up FPGA Editor and click on "Help -> Help Topics" to compare your learning experience. Video feature documentation is a great project for an intern.

FPGA Editor should feel more like Google Maps. Consider the application specification for FPGA Editor and Google Maps: you must navigate a complex map of billions of paths and locations, search for specific locations by name or for nearby locations by keyword/type (I think FPGA Editor has this), easily access external information about specific nodes, and auto-route optimal paths between nodes.

Perhaps the biggest difference between the application specs is that FPGA Editor has to respect the additional constraint that multiple nets may not share the same path, and that fan-out distribution is a different problem from the point-to-point route finding done in Google Maps.

In any event, Google Maps should probably be considered the "standard" interface for these types of things. I wonder how easily searching for "nearby restaurants" could become "nearby blockrams."

Wednesday, June 18, 2008

Chips from NVidia and ClearSpeed

Yesterday I wrote about the GFlops/Watt performance numbers for AMD's new GPGPU and ClearSpeed showing 2-2.5 GFlops/Watt -- It looks like ClearSpeed has a new card that does 4 GFlops/Watt at double precision.

NVidia has a new Tesla too. According to the FPGA Journal article you can buy 4 TFlops consuming only 700 W, or 5.7 GFlops/Watt (it's unclear if the numbers from FPGA Journal are specs for double or single precision, but I assume single precision). At 10 cents per kilowatt hour, a Teraflop of Teslas will cost you 153 dollars to run for a year. Not bad.
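The running-cost figure is easy to reproduce. This is just a sketch of the arithmetic, assuming the card runs flat out at its quoted power for all 8760 hours of the year; the function name is made up for the example.

```python
HOURS_PER_YEAR = 24 * 365

def yearly_cost_per_teraflop(gflops_per_watt, dollars_per_kwh):
    """Electricity cost of sustaining 1 TFlop for a year at peak efficiency."""
    watts = 1000.0 / gflops_per_watt            # 1 TFlop = 1000 GFlops
    kwh_per_year = watts * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * dollars_per_kwh

efficiency = 4000.0 / 700.0                     # 4 TFlops at 700 W
cost = yearly_cost_per_teraflop(efficiency, 0.10)
print(f"{efficiency:.1f} GFlops/Watt, ${cost:.0f} per TFlop-year")
```

Cooling and host power are not included, so a real deployment would cost more.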

Frankly, there's too much marketing and handwaving on these specs --- not enough real numbers to make a conclusion on who dominates in efficiency.

Tuesday, June 17, 2008

AMD's new chips, OpenCL

HPCWire reports on AMD's latest GPUs clocking in with 200 GFlops double precision performance under 150 Watts or 1.33 GFlops/Watt. That translates to a double precision petaflop for .75 MWatts compared to the RoadRunner which consumes 3 MWatts. The AMD GPU is about 2-3 times the peak GFlops/Watt FPGA floating point performance numbers, though I speculate the new Altera Stratix IV may be competitive. ClearSpeed apparently wins the double-precision efficiency competition with 2 GFlops/Watt and 2.5 GFlops/Watt chips. Performance for specific functions can vary substantially, though there is still no standard language to make it practical to spec.
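Here is the same comparison as a quick sketch, using only the peak figures quoted above (peak numbers, so treat the results as lower bounds on real power draw). The helper name is made up for the example.

```python
def megawatts_per_petaflop(gflops_per_watt):
    """Power needed for 1 PFlop (1e6 GFlops) at a given peak efficiency."""
    watts = 1e6 / gflops_per_watt
    return watts / 1e6

amd = megawatts_per_petaflop(200.0 / 150.0)     # 200 GFlops under 150 W
clearspeed_low = megawatts_per_petaflop(2.0)
clearspeed_high = megawatts_per_petaflop(2.5)
roadrunner_mw = 3.0                             # figure quoted in this post

print(f"AMD GPU:    {amd:.2f} MW")
print(f"ClearSpeed: {clearspeed_high:.2f} to {clearspeed_low:.2f} MW")
print(f"RoadRunner: {roadrunner_mw:.2f} MW")
```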

AMD claims that they will support the OpenCL ("Computing Language") specification. OpenCL is still nonexistent as far as I can tell. From the HPCWire article:

"In an attempt to unify the programmer interface AMD has announced its support for OpenCL."

Steve Jobs mentioned OpenCL support in Snow Leopard and now it looks like the Khronos Group is trying to organize the effort to actually make the standard. Intel should join the fun and say that Larrabee will support the OpenCL standard.

Thursday, June 12, 2008

"Programmable Arrays" are more than "Reconfigurable HDL Executers"

A blog by David Pellerin of ImpulseC fame called "Reconfigurable, Reconshmigurable" links to Vhayu's announcement of a compression IP for FPGA accelerated ticker systems.

To me, the most interesting part of the article is:

"Some Wall Street executives interested in using FPGAs to accelerate applications, or portions of them, however, have expressed the concern that it's hard to find programmers who are skilled at writing applications for FPGAs."

I keep hearing this brought up at low-latency Wall Street IT conferences so it's definitely a real issue. Reconfigurable computing desperately requires a common open framework that minimizes the learning curve for programming FPGA hardware. The problem is that the FPGA industry has the inertia of its EDA upbringing, so the result is people think that the primitive language for programmable arrays should be a Hardware Description Language--but finding HDL programmers is hard.

I think it's time to drop the "hardware" from reconfigurable hardware and just think about programmable arrays. From this perspective, it is a bit ironic that Wall Street Executives have trouble finding FPGA programmers: programmable arrays have been the primary computational metaphor used by financial services since before they even had electronic computers to iterate their spreadsheets for them.

All FPGA hardware is actually programmed by a proprietary bitstream language much more closely related to programming a spreadsheet than an HDL (specify 2-D grid coordinates, specify a function, connect to precedents). However, instead of providing software tools for programmable arrays, FPGA vendors stick to their EDA roots. Because it has been so profitable, the FPGA industry has fallen into HDL la-la-land while obscuring the low-level interfaces to their physical devices.
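To make the spreadsheet analogy concrete, here is a toy sketch of a programmable array described exactly that way: grid coordinates, a function, and a list of precedents. This illustrates the metaphor only. It is not any vendor's bitstream format, and the names `Cell` and `step` are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Coord = Tuple[int, int]

@dataclass
class Cell:
    func: Callable[..., int]      # the function configured at this site
    precedents: List[Coord]       # grid coordinates this cell reads from

def step(array: Dict[Coord, Cell], state: Dict[Coord, int]) -> Dict[Coord, int]:
    """One synchronous iteration: every cell recomputes from the old state."""
    new_state = dict(state)
    for xy, cell in array.items():
        new_state[xy] = cell.func(*(state[p] for p in cell.precedents))
    return new_state

# A 1-bit full adder: inputs at (0,0), (0,1), (0,2); sum at (1,0), carry at (1,1).
inputs = [(0, 0), (0, 1), (0, 2)]
array = {
    (1, 0): Cell(lambda a, b, c: a ^ b ^ c, inputs),
    (1, 1): Cell(lambda a, b, c: (a & b) | (c & (a ^ b)), inputs),
}
state = {(0, 0): 1, (0, 1): 1, (0, 2): 0, (1, 0): 0, (1, 1): 0}
state = step(array, state)
print("sum:", state[(1, 0)], "carry:", state[(1, 1)])
```

Each `Cell` plays the role of a 3-input lookup table here. A real device adds routing constraints that this sketch ignores.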

I would go so far as to say that there has been no real vendor of hardware programmable arrays since Xilinx stopped documenting how to reconfigure their arrays. They might sell you "field programmable gate arrays" as a naming convention, but what you really get from these vendors is a "reconfigurable HDL executer." If you want to actually use an FPGA like a programmable array, you need to reverse engineer the proprietary bitstreams. The FPGA vendors actually don't have much interest in making their programmable arrays useful as programmable arrays because they make a killing selling reconfigurable HDL execution systems.

But with interest towards FPGAs outside the traditional hardware development niches, vendors quickly realized that they absolutely cannot sell HDL execution systems to people interested in using programmable arrays for their computational needs. Modern forays into C and Matlab synthesis help to address this programmability problem for certain markets, but these tools are often entirely reliant on an HDL-centric toolflow and obscure the physical constraints of the underlying programmable array even more. The hiding of low-level abstractions that comes with high-level languages is fine (and even desirable) for end-user application development, but using C as a framework for mapping 4GLs to FPGA hardware is just as backwards as coding a Java VM in Verilog and expecting good performance on a single-threaded CPU.

For the FPGA market to mature into computing applications, FPGA hardware vendors need to get back to selling hardware programmable arrays instead of HDL-executers. If they want to compete with CUDA in HPC, they should take a cue from NVidia and provide complete low-level APIs to their hardware. Instead of hyping up easy-to-program high-level languages for particular application niches, the hardware vendors need to focus on making and documenting better low-level languages for their hardware.

The fundamental concept of a programmable array is simple: everyone groks a spreadsheet. No one should ever be forced to target a programmable array as if it were a reconfigurable HDL machine.

Monday, June 09, 2008

Petaflop in the NY Times

A Petaflop of Cell Processors made the NY Times. Highlights of the article: 12960 total Cell chips with (9*12960)= 116640 cores.

The article tries twice to turn the supercomputing top spot into an issue of national pride. It also discusses the difficulty in programming these devices and how the next generation of consumer products will require programming paradigms for massively multicore hardware. The article also mentions the fact that the three types of cores require a heterogeneous partitioner. Now, they are probably doing manual partitioning and making sure their designs are highly symmetric. If we want to build large computational pipelines, we need a hardware-agnostic programming model for parallel programming that handles partitioning, placement and profiling.

According to an OpenFPGA Corelib presentation from Altera last Thursday, we could probably get a Petaflop by replacing all the Cells in this deployment with FPGAs. It seems plausible that Petaflop-capable FPGA supercomputers will exist and will be better used for 2-bit DNA problems.

Brute force scaling and twice the funding will get us an ExaFlop at 32 nm. The next major leap in supercomputing is going to require a materials/fabrication advance. FinFets and 3-D integration will get us a ZettaFlop in the sub-22nm range.

I expect molar-scale integration using directed self-assembly of reconfigurable arrays will disrupt this process sometime in the 5 to 10 year range. We will then realize the supercomputers we are building to study global warming are the cause of global warming.

Wednesday, May 21, 2008

Altera Incorporates Dynamic Threshold Scaling in 40 nm Stratix IV

In September I wrote:

To optimize for static currents, using dynamic threshold scaling (modulating the body bias voltage of the chip) along with dynamic voltage scaling [for active power] seems to be a viable technique. Here's a spreadsheet model (ODS original) for leakage current in a transistor varying temperature, power and threshold voltage across a reasonable range.

According to this FPGA Journal article, Altera has incorporated programmable body biasing into the logic blocks of their 40 nm Stratix IV FPGA.
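To see why raising the threshold voltage through body bias pays off, here is a rough sketch of the textbook subthreshold-leakage model, in which current falls exponentially with threshold voltage. This is not the spreadsheet model mentioned above and has nothing to do with Altera's actual implementation. The prefactor `i0` and slope factor `n` are placeholder values picked for illustration, and a real `i0` also depends on temperature.

```python
import math

K_OVER_Q = 8.617e-5  # Boltzmann constant over electron charge, in V/K

def subthreshold_leakage(vth, temp_k, i0=1e-6, n=1.5, vgs=0.0):
    """Subthreshold current (same units as i0) for threshold vth in volts.

    i0 and n are placeholder values for illustration, not device data.
    """
    thermal_voltage = K_OVER_Q * temp_k          # kT/q, about 26 mV at 300 K
    return i0 * math.exp((vgs - vth) / (n * thermal_voltage))

nominal = subthreshold_leakage(vth=0.30, temp_k=300)
biased = subthreshold_leakage(vth=0.40, temp_k=300)   # +100 mV via body bias
hot = subthreshold_leakage(vth=0.30, temp_k=360)

print(f"+100 mV threshold cuts leakage by {nominal / biased:.1f}x")
print(f"300 K -> 360 K raises leakage by {hot / nominal:.1f}x")
```

Under these assumptions a modest threshold shift buys roughly an order of magnitude in leakage, which is why idle logic blocks are the natural place to apply it.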

Xilinx will probably also follow suit with dynamic threshold scaling sometime this summer.

Altera claims to have 680K logic elements in their highest capacity offering... I think 640K is all anyone will ever need :)

Sunday, May 11, 2008

Locality Optimization

Digital system designers of FPGAs and ASICs alike become extremely familiar with the place-and-route problem during late-night moments of desperate agony the day before a deadline. Place-and-route is about to become relevant to a very new market as the distinction between FPGA and multicore chips turns into a question of granularity. As we increase the number of cores on a chip, the placement of communicating dataflow kernels will become a primary issue for chip-multiprocessor performance.

The time and power required to perform a dataflow computation is determined by the total wire length over which data must travel. In 2-D tiled structures like FPGAs or multiprocessors, the ratio between the worst-case and best-case placement for two communicating kernels grows as the square root of the number of cores: a bad placement of two communicating kernels in a 1 million cell array could cost you on the order of 1000 times more wire length than a good placement. This is easily visualized by the graphic below, and the order-square-root-of-cores relationship holds regardless of whether Manhattan (L1) or Euclidean (L2) distance is used (the constant factor is 2 under the L1 metric and sqrt(2) under the L2 metric).
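A quick sketch of that ratio, assuming a square grid with unit spacing where the best case is two adjacent tiles and the worst case is two opposite corners (the function name is made up for the example):

```python
import math

def worst_case_ratio(num_cores, metric="L1"):
    """Worst-case over best-case wire length for two kernels on a square grid."""
    side = math.isqrt(num_cores)           # assume a side x side array
    best = 1                               # adjacent tiles
    if metric == "L1":
        worst = 2 * (side - 1)             # opposite corners, Manhattan
    else:
        worst = math.sqrt(2) * (side - 1)  # opposite corners, Euclidean
    return worst / best

for cores in (16, 1024, 1_000_000):
    l1 = worst_case_ratio(cores, "L1")
    l2 = worst_case_ratio(cores, "L2")
    print(f"{cores:>9} cores: L1 ratio {l1:.0f}, L2 ratio {l2:.0f}")
```

For a million cells both metrics land between 1000 and 2000, which is the "order of 1000" quoted above.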

As we add more communicating kernels, it becomes more and more obvious how poor placement is detrimental to performance. For arbitrary dataflow graphs, the problem of finding the mapping of the graph which minimizes arc-length is NP-hard. A generalization of this problem is the quadratic assignment problem (QAP).
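To get a feel for why this blows up, here is a brute-force sketch that scores every possible placement of a tiny dataflow graph on a 2x2 grid. The graph, weights and function names are invented for the example. Real placers never enumerate like this, since 16 kernels on 16 sites already means about 2x10^13 placements.

```python
from itertools import permutations

def wire_length(edges, placement):
    """Sum of Manhattan distances over dataflow arcs, weighted by traffic."""
    total = 0
    for (src, dst), weight in edges.items():
        (x0, y0), (x1, y1) = placement[src], placement[dst]
        total += weight * (abs(x0 - x1) + abs(y0 - y1))
    return total

def best_placement(kernels, sites, edges):
    """Exhaustive search: tries every assignment of kernels to sites."""
    best = None
    for perm in permutations(sites, len(kernels)):
        placement = dict(zip(kernels, perm))
        cost = wire_length(edges, placement)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best

kernels = ["a", "b", "c", "d"]
sites = [(x, y) for x in range(2) for y in range(2)]
edges = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "a"): 1}  # a ring

cost, placement = best_placement(kernels, sites, edges)
bad = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
print("best:", cost, placement)
print("a poor placement:", wire_length(edges, bad))
```

The weighted sum in `wire_length` is the QAP objective: traffic between two kernels times the distance between the sites they are assigned to.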

Several methods are employed to perform placement, and they are often combined in hybrid approaches. A generic method to accelerate place-and-route is to partition the design into sub-graphs. Partitioning dataflow graphs into independently placeable groups allows a hierarchical placer to perform multiple sub-placements in parallel. Partitioning can be used as a general strategy, with QAP solvers or simulated annealing methods used to solve the placement problem within partitions.

Tools allow a designer to constrain the physical placement of partitions so that they only need to be re-placed when they change. Since development of FPGA modules usually requires multiple static system elements to test the user code, it is nice to not have to work on these sections again. This also allows board vendors to offer a partial bitstream with interface components pre-placed in the FPGA to hide messy interface signals and provide a simple development environment. A partition-constrained place-and-route can then map new user-logic independent of system-logic.

A spreadsheet programming model directly exposes the locality issue to the programmer. We should expect that a spreadsheet consisting of 6-input logic functions on boolean types should map immediately to the logic fabric of a Virtex-5. If we bring up the granularity of our spreadsheet to integer and character operations we should still expect a somewhat direct placement from the spreadsheet to the FPGA. This does not guarantee that the interconnect can be routed: placement is constrained by routing interconnection and place-and-route algorithms consider routing while placing components. See ulogic.org for information about the Xilinx bitstream and XDL format as well as the programmable interconnection points.

Here is some Python code to compute the wire-length metrics for an Excel spreadsheet. It doesn't take many lines of code to write macros that do simulated annealing on a spreadsheet. I included code which uses Excel's Cut functionality to move cells while preserving the dataflow graph. The slowness of the COM interface in this example will prevent you from getting far.