Wednesday, March 11, 2009

Emulation is the Sincerest Form of Flattery

I apologize for not writing in a while. I have been running the DEC XXDP diagnostic tests on my PDP-11/70 emulator. I just finished making integer divide compatible --- not just to spec, but compatible with all the edge cases of the 11/70 model. I'm impressed with how they designed and debugged this sort of thing back in the 60's and 70's. Each bug probably took about a day to diagnose with hugely expensive oscilloscopes and logic analyzers and another day to fix and test. In my case I can simulate every bit in my entire system at about 1 us of simulation time (125 clock cycles) per second of wall clock time. I can find and fix about 8 to 10 bugs a day. This kind of speedup is what drives Kurzweil's Law of Accelerating Returns.

The more powerful the machines we use to design machines, the better we can evaluate how future machines will operate and emulate how old machines did. Before VHDL and behavioral compilation, engineers used hundreds of pages of flowcharts to describe the microcode for each operation of a computer. The logic and circuitry for each operation were broken down to logic gates. Before programmable interconnect, they used to practice the art of wire-wrapping. Now, universities award Electrical Engineering degrees to students who may never experience the distinct smells of soldering a wire or frying out an IC.

The art evolves and yet here I am puzzling over microstates and logic diagrams drawn up in the 1970's. It is definitely a little ridiculous to mimic all the overflow conditions of old microcode in VHDL for an FPGA implementation. I doubt anyone would ever write code dependent on these corner case technicalities except to debug the microstates of their particular design. Bell and Strecker admit that the flags of the PDP-11 were over-specified. Yet I, like the orthodox follower of an ancient religion, am making sure that we observe the law of the DIV.60 microstate: "CONTINUE DIVIDE IF QUOTIENT WILL BE POS. BUT ALLOW FOR MOST NEG. NO. IF QUOT. IS TO BE NEG." because my division code apparently does not enter the DVE.20 overflow abort state after the first bit has been computed.
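To make the DIV.60 rule concrete, here is a minimal architectural-level sketch in Python of a signed 32-by-16-bit divide with the asymmetric overflow check that the microstate comment describes: a positive quotient may reach 32767, while a negative quotient may reach the most negative number, -32768. The function name and the choice to return None on overflow are my own assumptions for illustration. What the real 11/70 leaves in the registers after an aborted divide is exactly the kind of microcode-specific corner case this sketch does not model.

```python
def pdp11_div(dividend, divisor):
    """Sketch of a signed 32-bit by 16-bit divide (hypothetical helper).

    Returns (quotient, remainder, overflow). The quotient truncates toward
    zero, so a nonzero remainder takes the sign of the dividend.
    Quotient and remainder are None when overflow is flagged.
    """
    if not -2**31 <= dividend < 2**31:
        raise ValueError("dividend must fit in 32 bits")
    if not -2**15 <= divisor < 2**15:
        raise ValueError("divisor must fit in 16 bits")
    if divisor == 0:
        return None, None, True  # divide by zero is flagged as overflow here

    magnitude = abs(dividend) // abs(divisor)
    negative = (dividend < 0) != (divisor < 0)
    quotient = -magnitude if negative else magnitude
    remainder = dividend - quotient * divisor

    # Positive quotients stop at 32767, but a negative quotient is allowed
    # to be the most negative 16-bit number, -32768.
    if not -2**15 <= quotient <= 2**15 - 1:
        return None, None, True
    return quotient, remainder, False
```

For example, pdp11_div(-32768, 1) is legal and yields the most negative quotient, while pdp11_div(32768, 1) overflows even though the magnitude is the same.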

Despite the evolution of physical computing systems, practitioners of the science of computation still base their notion of what a computer does on the ancient art of sequential imperative descriptions. The modern formalization of such descriptions, namely the C language, was in fact born to control the PDP-11. Our model of an algorithm, which the PDP-11 is designed to execute, is based on even older methodologies concerned with instructing a single mathematics student on how to work out a problem with pencil and paper.

In order to usurp the role of the CPU, the next wave of hardware must emulate this functionality first. Evolution requires functional replacements before it allows for improvements. But what if all the possible behaviors of our system are not used by our particular application? What if all the software my PDP-11/70 will run exclusively uses floating point division instead of integer divide?

Now that we have multicore CPUs, we can partition processes into separate pieces of hardware. In reconfigurable hardware, if we know that a process isn't ever going to use integer division we should be able to use the area it occupies for something else. What if we might use DIV, but not all that often: can we use a trap to reconfigure the FPGA whenever DIV is called? This sort of introspection is not currently implemented for FPGAs. If we wanted to run thousands of emulators, we could manage hardware resources effectively so that we don't waste area for our cores. We could similarly manage the interconnect between our emulators if we know the communication topology of our processes.
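As a thought experiment, the trap-and-reconfigure idea can be modeled as a cache of functional units competing for a fixed area budget. The sketch below is a toy Python model, not a description of any existing FPGA flow: the class name, the area numbers, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class ReconfigurableCore:
    """Toy model: functional units are loaded on demand into limited area."""

    def __init__(self, area_budget, unit_areas):
        self.area_budget = area_budget
        self.unit_areas = unit_areas      # operation name -> area it occupies
        self.loaded = OrderedDict()       # loaded units, least recently used first
        self.reconfigurations = 0

    def execute(self, op):
        """Run an operation; return "hit" if its unit was loaded, else "trap"."""
        if op in self.loaded:
            self.loaded.move_to_end(op)   # mark as most recently used
            return "hit"

        # Trap: the unit is not configured, so make room and load it.
        area = self.unit_areas[op]
        if area > self.area_budget:
            raise ValueError(f"{op} can never fit in the area budget")
        while sum(self.loaded.values()) + area > self.area_budget:
            self.loaded.popitem(last=False)   # evict the least recently used unit
        self.loaded[op] = area
        self.reconfigurations += 1
        return "trap"
```

With a budget of 10 and units ADD (2), FDIV (6) and DIV (5), the sequence ADD, FDIV, ADD, DIV, ADD traps three times and evicts FDIV when DIV is first called. Whether that is a good trade depends on how reconfiguration time compares with how often the trap fires, which the model does not capture.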

FPGAs emulate the behavior of an ASIC at perhaps 1/10th of the speed or 1/100th the power efficiency of an ASIC. This means that anything that runs efficiently on an FPGA has a direct path to running hundreds of times more efficiently as an ASIC. If we can establish a fixed process topology for a particular supercomputing system, it ought to be possible to design FPGA and ASIC systems with hundreds or even thousands of incompatible data path units optimized to run particular processes in the system.

Design tools to develop parallel computing systems on FPGAs and ASICs are just starting to exist. For example, DE Shaw Research has built an ASIC supercomputer to perform molecular dynamics simulations with impressive results. I expect that if a $100M supercomputer is worth making, it is worth making an ASIC.

Once we can automate the process of developing ASIC supercomputers, we should look towards wafer-scale and 3-D integration to increase the computational density of our systems. This requires new models for fault tolerance and heat removal, but if a $500M supercomputer is worth making, it is worth making as a thick cylinder of bonded wafers.

Monday, December 29, 2008

C-to-Verilog.com provides C to HDL as a service

If you follow this blog, then you've read my ramblings on EDA SaaS. An interesting new advance in this area is http://www.c-to-verilog.com. This website lets you compile C code into synthesizable Verilog modules.

Here is a screencast of the service:




I've also discovered a few other websites related to EDA, SaaS and "Cloud Computing." Harry the asic guy's blog covers the burgeoning EDA SaaS market and Xuropa is creating an online community for EDA users and developers. Here's a two part EETimes piece which talks about EDA SaaS.

I already use a remote desktop connection to run most of my EDA jobs remotely. I've argued that there is a "generation gap" between SaaS and traditional software license markets. The people who used to code in university basements when they were 13 in the 60's and 70's invented the software licensing industry. Now, the people who used to herd botnets when they were 13 are graduating with computer science degrees. The nu-hacker distributes code updates to all his users immediately without forcing anyone to wait through an install.

Friday, December 05, 2008

More versus Faster

/. points to an IEEE Spectrum article titled "Multicore Is Bad News" which discusses the bottleneck associated with getting data from memory into the many cores of a multicore processor.

This is primarily an issue with the programming model we are trying to force on the hardware and not a problem with the real capacity of such hardware. The article specifically says that many HPC processes map cleanly to grids and thus address the locality problems of multicore arrays. I'm a broken record: the problem of locality optimization for mapping dataflow graphs to multicore arrays requires a transition of our primitive computer model from instruction stream executors to spreadsheet iterators.

The article sums up the problem: "Although the number of cores per processor is increasing, the number of connections from the chip to the rest of the computer is not." This is not necessarily true, since we are seeing an increased number of pins on a chip, but it is also not the main issue.

Even without increasing the rate of dataflow through the processor, we can drastically improve power-performance by slowing down our processing pipelines and distributing load across multiple slower cores. One-thousand 1 MHz cores can perform many of the same tricks as a single 1 GHz core for far fewer Joules. This physical truth is starkly contrary to the popular notion that we should increase utilization to improve performance. Since Moore's law says that transistors for ever more cores will continue to cost less and less (and I predict this will continue long after dimensional scaling ends, due to optimizations in the fabrication methods), we can achieve a decline in operating expense to sustain our growing hunger for computing resources. This requires us to fundamentally change our approach: we cannot continue to expect faster execution, we can only expect more execution. Exponentially more.

Once the industry discovers that it has to transform from isolated general purpose problem solvers to networked special purpose assembly lines for a particular problem, I expect we will see the rise of FPGAs and reconfigurable dataflow pipelines. Any application that can use many thousands of CPUs effectively will probably use many thousands of FPGAs an order of magnitude more efficiently. It's because of the relationship between the granularity/number of cores and the degree to which we can optimize for locality. All the extra dispatch hardware and cache management logic is unnecessary in deterministic dataflow pipelines. We simply need more dense functional area and more pipes for data.

Monday, November 24, 2008

Convey's Reconfigurable Computing System in the NY Times

An article in the New York Times discusses Steve Wallach's Convey Computer Corporation and their new reconfigurable supercomputer.

The Convey machines use FPGAs to augment the CPU's instruction set. From a given FORTRAN or C code, their compiler tools create a "personality," which is Convey's way of re-branding the term "configuration."

Steve has a great quote on his site, which sums up everything I've ever written about the FPGA market on this blog: "The architecture which is simpler to program will win." I was once an intern at Xilinx pushing FORTRAN code through F2C (which redefines "barf" in the f2c.h header file) and massaging the resulting C code through an FPGA compiler. As a general rule, most important milestones in compiler technology diminish the need for a summer intern's manual labor. But while I understand the importance and value of demonstrating the capability of reconfigurable supercomputing on legacy FORTRAN applications, if FPGA supercomputing has a future, I am convinced that we need to break free of the instruction stream control flow model.

I've previously argued that any approach to accelerating sequential imperative languages like C and FORTRAN only extends the von Neumann syndrome and that we need an explicitly parallel, locality-constrained, dataflow model akin to the spreadsheet if we hope to create scalable applications. Moore's law continues to drive down the cost of a transistor, but the speed of these transistors is limited by a power wall; if we are going to continue the geometric cost reduction of Gigaops, then we need languages and programming models that improve with more transistors and not just with faster transistors.

I agree with Ken Iverson's thesis in "Notation as a Tool of Thought." The Sapir-Whorf Hypothesis holds true in computer languages: we are bound to the von Neumann model by the languages we know and use and this is what we must change. In many of his talks and papers, Reiner Hartenstein identifies the dichotomy between von Neumann and reconfigurable computing paradigms. He argues that the divide must be overcome with an updated computer science curriculum.

Programming models aside, the main thing bugging me about the Convey machine is its price tag: $32,000. Here's what Joe at Scalability has to say about pricing FPGA acceleration reasonably:
"One company [at SC08] seemed to have a clue on the FPGA side, basically the pricing for the boards with FPGAs were reasonable. Understand that I found this surprising. The economics of FPGA accelerators have been badly skewed … which IMO served to exclude FPGAs from serious considerations for many users. With the rise in GPU usage for computing, some of the FPGA folks finally started to grasp that you can’t charge 10x node price for 10x single core performance."
I really hope that Convey's increased programmability helps them make huge inroads at this price point so that we can expect super high margins from our products when we launch. I don't know how they will compete with XtremeData, DRC, Nallatech, Celoxica or whoever decides to do the same thing all these companies do with Achronix chips (hmmm...)

Tuesday, November 11, 2008

Software Defined Radio: Revolution Not Included

Spectrum policy needs to be re-evaluated in the 21st century. The expansion of the Internet as an organizational model destroys one-way, hierarchically controlled mediums. A Wired article earlier this year observed that the nature of the Internet is to make everything it touches tend to gratis. The Internet and the wireless world are now inextricably linked by devices that support both free protocols on free spectrum and proprietary protocols on licensed spectrum. VoIP-capable WiFi-enabled phones are the beginning of the end of cell-phone service as we know it. Regulated spectrum ownership will transition to universal Ethernet coverage using best effort spectrum sharing protocols.

With the move to digital TV, the bandwidth requirements for television broadcasting decrease substantially, freeing up a large amount of white space bandwidth between existing channels. A programmable transceiver can operate on these white spaces, switching between multiple protocols to serve other white space devices. If we make a transceiver device that can access this huge amount of bandwidth, we can build a substantial, free wireless network.

Google's Larry Page spearheaded the Free the Airwaves campaign which urged the FCC to free these white spaces for unlicensed public use. On election day, the FCC unanimously voted in accordance with these wishes, approving devices that use sensing and geographical information to avoid interference. The white spaces are the first frontier. We should promote a long term vision that would have the FCC open the entire spectrum to sharing protocols. TV and Radio broadcasting are an enormous waste of bandwidth and discriminatory allocation of bandwidth is one of the pillars of inequity in modern American society (think nationally syndicated talk radio). The roadmap for the obsolescence of spectrum ownership should be set so that capital can be allocated properly to progressive technologies. Unfortunately this technology direction is contrary to the current revenue model of the FCC and would likely result in the re-purposing of that organization.

The economic challenge in making this transition is that there are hundreds of billions of wealth units tied to the premise that spectrum can be owned. I expect to see the wireless and media empires crash as the premise of their business model is made unsound by technical obsolescence. The question is: where does all this money go? The answer is nowhere: it just disappears. This effect is similar in substance and larger in magnitude to the hundreds of millions of dollars of market value in newspaper classifieds that disappeared because of Craigslist.

The technical possibility for multi-band and spread spectrum sharing owes itself to the widespread availability of wired Internet infrastructure and the rapid buildup of digital signal processing technology driven by the geometric decrease in transistor cost. Software Defined Radio (SDR) devices like the GNU Radio USRP (FPGA-inside) can be programmed to operate in arbitrary frequency ranges with arbitrary protocols. As more spectrum becomes freed, SDR devices can be reprogrammed in the field to use these new ranges.

This democratization of signal processing technology and communications infrastructure also fueled the cost reduction of music production and distribution that made the recording industry obsolete. Without the teeth of copyright law, and without a monopoly on the electromagnetic spectrum, the decline of the telecoms will be more rapid than the RIAA, and hopefully without the resentment generated by lawsuits against customers.

We have previously seen the effect of technical obsolescence on existing markets and we know the response from the tech that was superseded: the destruction of wealth is often met with lawsuits. Organizations like the RIAA claim that they protect the interests of artists and the industry surrounding them by transforming antiquated copyright laws into censorship laws for the "digital millennium." Unlike the recording industry, there will be fewer legal claws gripping onto a futile business plan during the technical obsolescence of the telecom industry.

Spectral freedom ultimately favors device manufacturers and consumers to the chagrin of telecom carriers, content owners and distributors (aka "the middle men"). I think the winning business model will beat Apple to the white-space iPhone/laptop, open its API, and have a good transceiving router to back it up. Initially, software controlled protocols and devices that take advantage of them will be patented and will be valuable to carriers who want to lock-in their customers. These proprietary technologies will eventually be replaced by open source alternatives. I expect there will be some legal antagonism between proprietary device manufacturers and the free culture communities, but the futility of protocol licensing is the same as spectrum ownership. Software Defined Radio is software after all, and each FPGA configuration is just a very large number. Be skeptical of the business plan of people who troll electronic bridges for large numbers. Proponents of protocol licensing will point to the number of jobs lost, and appeal to the false-utopia of full-employment using some form of the broken window fallacy (sure the spectrum ownership and patent protection policies are broken, but look at how many people they employ!).

Companies like Motorola and Nokia can use the white-spaces to attract customers to products that are compatible with "WiFi 2.0," as Google likes to call it. They should start to charge the actual price of their equipment so they can become free of their dependency on the carriers. Sprint, Verizon and AT&T should milk those SMS fees while they still can: freedom in the white-spaces will annihilate their industry. The writing is on the wall.

Friday, October 17, 2008

Timing Closure

Doug Amos from Synopsys wrote an article in FPGA Journal about the Timing Closure problem, comparing it to whack-a-mole.

I just got through a painful timing closure process. I feel like I came out of a month-long coma. I'll share some anecdotes about the experience to help people in the same situation.

The "Post-Map" results are never good enough for timing. You will want to run Place-and-Route, and the tools can take a really long time (sometimes multiple hours) to get through compilation. You will want to automate the tools using scripts and run the EDA toolchain on many computers. See catalystac.com/trac for how to use the "subprocess" module in Python with a dictionary specifying the parameter list for all of the Xilinx toolchain (I found similar code on the net and hacked it to pull the project constraints and synthesis options from Excel). Combine this with SimpleXMLRPCServer and you can manage a botnet running multiple tool instances. A simple preprocessor can comment out sections of your code, so you can automate many compilation runs and narrow in on which segments of your code need to be fixed for timing. Smaller code compiles faster too, so you can run many permutations over a weekend. (I'll add these scripts to the trac after I have a chance to make them non-specific to this particular project).
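The scripts on the trac are not reproduced here. The following is only a minimal Python 3 sketch of the same idea, in which SimpleXMLRPCServer now lives in the xmlrpc.server module. The TOOL_ARGS table, the design file names, and the function names are hypothetical, and the argument lists are placeholders to check against the documentation for your tool version.

```python
import subprocess
import sys
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical per-tool argument lists; the file names are placeholders and
# the flags should be verified against your own toolchain before use.
TOOL_ARGS = {
    "map": ["-o", "design_map.ncd", "design.ngd", "design.pcf"],
    "par": ["-w", "design_map.ncd", "design.ncd", "design.pcf"],
}


def build_command(tool, tool_args=TOOL_ARGS):
    """Turn a tool name into an argv list using the parameter dictionary."""
    return [tool] + list(tool_args[tool])


def run_command(command, timeout=None):
    """Run one command and return its exit code and captured output."""
    done = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return {"returncode": done.returncode, "stdout": done.stdout, "stderr": done.stderr}


def run_tool(tool):
    """Run a tool from TOOL_ARGS; only names in the table are accepted."""
    return run_command(build_command(tool))


def serve(host="127.0.0.1", port=8000):
    """Expose run_tool over XML-RPC so a coordinator can dispatch jobs."""
    server = SimpleXMLRPCServer((host, port), allow_none=True)
    server.register_function(run_tool, "run_tool")
    server.serve_forever()
```

A coordinating machine could then call xmlrpc.client.ServerProxy("http://worker:8000").run_tool("par") on each worker, where "worker" stands in for a real host name. Restricting the remote call to names in the table, rather than accepting arbitrary command lines, is a deliberate choice: an XML-RPC endpoint that runs anything it is sent is a botnet in the less flattering sense.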

Modular partitioning and floorplanning make the compilation and timing closure process a lot easier, but if you want to optimize across hierarchy you can't use them. This is one aspect of behavioral synthesis that needs some serious consideration: how can we avoid running the entire toolchain for minor code modifications? Also since synthesis and optimization can generate weird names for internal signals it is non-obvious what paths are causing the timing errors when they are reported. Usually you can figure out the bad paths by reading the report, but I really wish there was some better way to tie the timing bug back to the code so you know what to modify to fix the bug. There doesn't seem to be a more elegant solution than the brute force method of commenting out sections of the code described in the previous paragraph.
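The brute force method can be sketched as a small Python preprocessor. This is an illustration under assumptions, not the script used on the project: it supposes that each removable region of the VHDL source is bracketed by comment markers of the form "-- BEGIN name" and "-- END name", which is a hypothetical convention.

```python
import itertools
import re

# Hypothetical marker convention: "-- BEGIN name" ... "-- END name".
BEGIN = re.compile(r"^\s*--\s*BEGIN\s+(\w+)\s*$")
END = re.compile(r"^\s*--\s*END\s+(\w+)\s*$")


def strip_sections(source, disabled):
    """Comment out every marked section whose name is in `disabled`."""
    out = []
    skipping = None
    for line in source.splitlines():
        begin = BEGIN.match(line)
        end = END.match(line)
        if skipping is None and begin and begin.group(1) in disabled:
            skipping = begin.group(1)
        # Lines inside a disabled section become VHDL comments.
        out.append(line if skipping is None else "-- " + line)
        if skipping is not None and end and end.group(1) == skipping:
            skipping = None
    return "\n".join(out)


def section_names(source):
    """List the section names in the order their BEGIN markers appear."""
    return [m.group(1) for line in source.splitlines() if (m := BEGIN.match(line))]


def variants(source, max_disabled=1):
    """Yield (disabled_names, modified_source) for each combination to try."""
    names = section_names(source)
    for count in range(max_disabled + 1):
        for disabled in itertools.combinations(names, count):
            yield disabled, strip_sections(source, set(disabled))
```

Each variant can be written to its own project directory and pushed through the toolchain. The sketch ignores the practical problem that removing a section can leave signals undriven, so the marked regions have to be chosen so that the design still elaborates without them.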

Now let me explain some of the timing bugs I found and how I fixed them. My PDP-11 board has an 8 ns clock period for SRAM access with 4 ns per 18 bit word on each half-clock = 4.32 Gbps read and 4.32 Gbps write (full-duplex). The first word arrives before the rising edge and is used as a 16 bit data word; the second word is used as a 16 bit parity word. Each data nibble has 4 parity bits, allowing single-error correction and double-error detection. A modular test of the ECC would have us believe that from the arrival of the second half-clock word we could determine if there were errors within the 4 ns between the parity word arriving and the next clock cycle. Unfortunately, we discovered that the parity path was causing a timing bug between one of the Data pins and the SingleBitError / ResultValid signal.
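Four parity bits per data nibble is the budget of an extended Hamming (8,4) code, which is one standard way to get single-error correction and double-error detection. The Python sketch below shows that code as an illustration. It is an assumption that the board's ECC is laid out this way; the bit ordering and function names here are mine.

```python
def encode_nibble(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (a bit list).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions, with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1          # makes the parity of all 8 bits even
    return code


def decode_nibble(code):
    """Return (nibble, status), status in {"ok", "corrected", "double"}."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position         # XOR of the positions of all set bits
    parity_odd = sum(code) & 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Single error: the syndrome names the flipped position, and a
        # syndrome of 0 means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        status = "double"                # two errors: detected, not correctable

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status
```

In hardware the syndrome and the overall parity are both XOR trees over the incoming bits, which is why the check looks like it should fit in a half-clock in isolation. The path from a data pin through those trees to the error flag is the kind of path that showed up in the timing report.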

The simple solution is to burn a clock cycle for parity correction, but that adds a cycle of latency to every memory read. It would be nice to have an error correcting memory that doesn't consume an entire clock cycle to compute correctness so we can use the data word for 8ns and simply reject the result using the parity check in 4ns. If we don't know that the word is valid for an entire cycle, then we cannot speculatively use the data word unless we are willing to rewind the pipeline an entire clock cycle, which is certainly possible, but it's a scary proposition nonetheless (better to burn a cycle in this design, since we prefer reliability and simplicity).

It turns out the reason for the timing bug was that one of the data pins was attached to a different I/O bank than the other pins, and so the routing delay made it the critical path. This was discovered by tracing the path in the floorplanner. The solution here was to turn off the pin since we have 18 data bits for memory and only use 16 for our word. The next board revision will also fix this.

The next major timing bug was the operand fetching pipeline somewhere between the Register and Mode and the Virtual Address register (see here for PDP-11 architecture info). The error here was very small (300 picoseconds) and would go away whenever the fan out and fan in of some of the signals were decreased by commenting out some functions.

We have 8 nanoseconds to decode an opcode from the arrival of the memory word. This is enough time to read the register from the register file and decode the address mode, but not enough time to guarantee that the virtual address is ready in the address mode case where we must pre-decrement the register to generate the virtual address. I fixed the timing of the operand fetch stage by only decoding the address mode and fetching the register word during the opcode decode stage. The operand fetch stage now uses the decoded address word and register word to generate a virtual address in a second cycle. It would be possible to optimize this process for the majority of cases, where the register is the virtual address: we can start with that assumption and then invalidate the result when we discover that another cycle is required to decode the address mode.
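The speculative variant can be stated as a tiny behavioral model in Python. This is a sketch of the idea only, not the VHDL in the design: it covers just three PDP-11 addressing modes for word operands (1 = register deferred, 2 = autoincrement, 4 = autodecrement), and the function name and the cycle counts it returns are assumptions.

```python
def speculative_address(mode, reg_value, step=2):
    """Return (virtual_address, cycles) for a word operand.

    Cycle 1 guesses that the register already holds the address. The guess
    is kept for register deferred (1) and autoincrement (2), where the
    register is used before any update. Autodecrement (4) invalidates the
    guess and spends a second cycle on the pre-decrement.
    """
    guess = reg_value & 0xFFFF
    if mode in (1, 2):
        return guess, 1
    if mode == 4:
        return (reg_value - step) & 0xFFFF, 2
    raise NotImplementedError("index and deferred modes are not modeled here")
```

For instance, speculative_address(4, 0o1000) gives address 0o776 after two cycles, while modes 1 and 2 return 0o1000 after one. Byte operations, which step by 1 on most registers, are left out to keep the sketch short.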

The most pernicious timing bug involved our floating point multiplier, which was partitioned into multiple DSP48 blocks. The multiplier core was generated by coregen to have enough pipeline stages to meet timing. Compiling it in its own project revealed a best-case clock period of 2.3 ns, but it just barely broke the timing when compiled with everything else: it missed by about the amount of the timing uncertainty. We thought that perhaps the tool was retiming the multiplier pipeline to just barely meet timing, and then clock jitter and uncertainty were added later, causing it to break the constraint. We did multiple permutations of synthesis options and ran recompilations to no avail. We also added stages to the pipeline to no avail.

To solve this problem, we just created an entirely new project without any timing constraints and set it to do the best it could, and we met timing with 35 picoseconds to spare. Hallelujah!

If you are in the middle of a painful timing closure, I'm sorry for you, and I hope you can find something useful from this post.

Monday, September 22, 2008

High Performance on Wall Street 2008

A freshly rebranded Catalyst Accelerated Computing went to the High Performance on Wall Street conference today. The usual suspects were present. Speedup results all seemed to match the gender ratio at supercomputing conferences.

FPGAs got bashed in an early session. "It's hard to find people to program FPGAs" came up at least twice during the conference (we consult!). I heard, "threads are the future" ... oy. After the early morning thread-love-fest, the accelerator panel defended the honor of the gate array well.

Technical stuff that irked me:

Multithreading and multicore aren't the same thing. Multicore processors can use multiple processes with explicit pipes or they can use multiple threads with a global memory coherency protocol. Multiple processes with pipes between them is the dataflow or "streaming" model, like a spreadsheet or like "ls -Al | grep foo > bar", while the multi-threading model should be avoided like sub-prime mortgages for the same reason (they cause your system to crash in mysterious ways).
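The shell pipeline above has a direct analogue in Python's subprocess module. The sketch below chains two separate processes with an explicit pipe and no shared memory. The helper name run_pipeline and the two one-line child programs are made up for the example.

```python
import subprocess
import sys


def run_pipeline(lines, needle):
    """Stream `lines` from a producer process into a filter process."""
    producer_src = "import sys\nfor line in sys.argv[1:]:\n    print(line)"
    filter_src = (
        "import sys\n"
        "needle = sys.argv[1]\n"
        "for line in sys.stdin:\n"
        "    if needle in line:\n"
        "        sys.stdout.write(line)"
    )
    producer = subprocess.Popen(
        [sys.executable, "-c", producer_src, *lines],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", filter_src, needle],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.splitlines()
```

Calling run_pipeline(["foo.txt", "bar.txt", "foobar"], "foo") returns ["foo.txt", "foobar"]. The only thing the two stages share is the pipe, so neither can corrupt the other's state, which is the whole argument for the streaming model.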

Multithreading is the use of multiple instruction streams sharing a global address space. It was originally a method of hiding latency by transferring the context of a core to a different thread when you were waiting for I/O or memory. Intel cores support "hyperthreading" which switches context between two threads and makes it seem like it has two cores. This allows the core to share a global memory space and hide memory access latency which is large compared to the clock rate. Cores can have a lot of threads: Sun's open-source Sparc core supports 32 native threads.

Power consumption P = fCV^2. The supply voltage (V) is generally linearly dependent on f (frequency) because we can use less potential to switch at a slower frequency, resulting in the "cubic rule of thumb" relating clock-speed and power. If we use twice the area and half the frequency to do the same work, then switched Capacitance (C) is 2x while f*V^2 is 1/8, so total power is 2 x 1/8 = 1/4, leading to the rule-of-thumb quadratic power savings from parallelism (see Chapter 11.7 of Anantha's book "Digital Integrated Circuits"). Leakage is the dominating factor now though, and slower switching circuits can operate with higher threshold voltage to lower leakage if your device supports dynamic threshold scaling (like the Stratix IV from Altera will).
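The rule of thumb is easy to check numerically. The Python sketch below applies P = fCV^2 under the idealized assumption that voltage scales linearly with frequency. The min_voltage_scale parameter is my own addition, standing in for the fact that real supply voltages cannot be scaled down indefinitely, and the function ignores leakage entirely.

```python
def relative_power(parallelism, min_voltage_scale=0.0):
    """Dynamic power of N slower units relative to one fast unit.

    Assumes P = f * C * V**2, with frequency scaled by 1/N, switched
    capacitance scaled by N, and voltage tracking frequency down to an
    optional floor (min_voltage_scale, a hypothetical parameter).
    """
    freq_scale = 1.0 / parallelism
    voltage_scale = max(freq_scale, min_voltage_scale)
    cap_scale = float(parallelism)
    return cap_scale * freq_scale * voltage_scale ** 2
```

relative_power(2) gives 0.25, the quadratic saving described above. With the voltage pinned at its original value, relative_power(2, 1.0) gives 1.0: the same work costs the same dynamic power, and all of the saving has come from the voltage term.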

The better reason why FPGAs dominate in power performance is because of the efficient total distance of data-flow, aka much lower capacitance to move data. As the number of cores increases, there is an O(N^(3/2)) relationship between the number of cores and the degree to which a design can be optimized for process locality (see "locality optimization"). This is why place-and-route is so important for FPGAs.

Now for the fun stuff. Buzzword scoreboard from presentations:

{ "Leverage": 17, "Agility": 4, "Low-Latency": 44, "Accelerate": 176, "Eco-System": 8, "Productivity": 191, "Scalability": 83, "Service-Oriented": 17, "Paradigm": 16, "Dynamic": 55, "Exploit Multicore": 18, "Future-Proof": 4, "Mainstream": 36, "Seamless": 43, "Cloud": 91, "Heterogeneous": 12, "Efficient": 50, "Enabling": 23, "Integrated": 19, "Interoperability": 24, "Realtime": 12, "Reliability": 13, "High-Availability": 33, "Bottleneck": 26 }

Productivity wins.
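Since the scoreboard is already shaped like a dictionary, ranking it takes a couple of lines of Python. The sketch below ranks the tallies above and adds a hypothetical count_buzzwords helper showing how such tallies could be gathered from text. The conference counts themselves were kept by hand, not by this code.

```python
import re
from collections import Counter

SCOREBOARD = Counter({
    "Leverage": 17, "Agility": 4, "Low-Latency": 44, "Accelerate": 176,
    "Eco-System": 8, "Productivity": 191, "Scalability": 83,
    "Service-Oriented": 17, "Paradigm": 16, "Dynamic": 55,
    "Exploit Multicore": 18, "Future-Proof": 4, "Mainstream": 36,
    "Seamless": 43, "Cloud": 91, "Heterogeneous": 12, "Efficient": 50,
    "Enabling": 23, "Integrated": 19, "Interoperability": 24,
    "Realtime": 12, "Reliability": 13, "High-Availability": 33,
    "Bottleneck": 26,
})


def top_buzzwords(scoreboard, count=3):
    """Return the `count` highest-scoring (buzzword, tally) pairs."""
    return scoreboard.most_common(count)


def count_buzzwords(text, buzzwords):
    """Count case-insensitive whole-word occurrences of each buzzword."""
    lowered = text.lower()
    counts = Counter()
    for word in buzzwords:
        pattern = r"\b" + re.escape(word.lower()) + r"\b"
        counts[word] = len(re.findall(pattern, lowered))
    return counts
```

top_buzzwords(SCOREBOARD) puts Productivity first at 191, followed by Accelerate at 176 and Cloud at 91.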

I'm particularly amused by the frequency of "mainstream." Mainstream on Wall Street today probably means your firm just shut down, merged, or totally changed business models. Happy Monday for a Wall Street Conference!

Coming soon: a business-plan buzzword-compliance checker to determine if your business plan is syntactically correct and give you a score.