Wednesday, March 11, 2009

Emulation is the Sincerest Form of Flattery

I apologize for not writing in a while. I have been running the DEC XXDP diagnostic tests on my PDP-11/70 emulator. I just finished making integer divide compatible: not just compatible with the spec, but with all the edge cases of the 11/70 model. I'm impressed with how they designed and debugged this sort of thing back in the '60s and '70s. Each bug probably took about a day to diagnose with hugely expensive oscilloscopes and logic analyzers, and another day to fix and test. In my case, I can simulate every bit in my entire system at about 1 us of simulation time (125 clock cycles) per second of wall clock time. That is a millionfold slowdown, and yet I can find and fix 8 to 10 bugs a day. This feedback is the engine behind Kurzweil's Law of Accelerating Returns.

The more powerful the machines we use to design machines, the better we can evaluate how future machines will operate and emulate how old machines did. Before VHDL and behavioral compilation, engineers used hundreds of pages of flowcharts to describe the microcode for each operation of a computer, and the logic and circuitry for each operation were broken down into logic gates by hand. Before programmable interconnect, they practiced the art of wire-wrapping. Now universities award Electrical Engineering degrees to students who may never experience the distinct smells of soldering a wire or frying an IC.

The art evolves, and yet here I am puzzling over microstates and logic diagrams drawn up in the 1970s. It is definitely a little ridiculous to mimic all the overflow conditions of old microcode in VHDL for an FPGA implementation. I doubt anyone would ever write code that depends on these corner-case technicalities except to debug the microstates of their particular design. Bell and Strecker admit that the flags of the PDP-11 were over-specified. Yet I, like the orthodox follower of an ancient religion, am making sure that we observe the law of the DIV.60 microstate: "CONTINUE DIVIDE IF QUOTIENT WILL BE POS. BUT ALLOW FOR MOST NEG. NO. IF QUOT. IS TO BE NEG." because my division code apparently does not enter the DVE.20 overflow abort state after the first bit has been computed.
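For the curious, the rule DIV.60 is guarding is easy to state in modern terms: DIV divides a 32-bit register pair by a 16-bit source, and sets V if the quotient cannot fit in 16 bits, a range with exactly one more legal value on the negative side. Here is a minimal sketch of that check in C (the function name and test values are mine, not DEC's):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the DIV overflow rule.  The 16-bit quotient range is
       asymmetric: -32768 is a legal quotient while +32768 is not.
       That is what DIV.60 means by "ALLOW FOR MOST NEG. NO." */
    int div_sets_v(int32_t dividend, int16_t divisor)
    {
        if (divisor == 0)
            return 1;                    /* divide by zero also sets V */
        int32_t q = dividend / divisor;  /* C truncates toward zero, as DIV does */
        return q > 32767 || q < -32768;  /* one extra value on the negative side */
    }

    int main(void)
    {
        printf("%d\n", div_sets_v(-3276800, 100)); /* 0: quotient is exactly -32768 */
        printf("%d\n", div_sets_v(3276800, 100));  /* 1: +32768 does not fit */
        return 0;
    }

The microcode, of course, cannot compute the whole quotient before deciding; it has to predict from the first quotient bit whether the result will land outside this range, which is why the most-negative quotient needs its own carve-out rather than tripping the DVE.20 abort.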

Despite the evolution of physical computing systems, practitioners of the science of computation still base their notion of what a computer does on the ancient art of sequential imperative descriptions. The preeminent modern formalization of such descriptions, the C language, was in fact born to control the PDP-11. And our model of an algorithm, which the PDP-11 is designed to execute, is based on even older methodologies concerned with instructing a single mathematics student how to work out a problem with pencil and paper.

In order to usurp the role of the CPU, the next wave of hardware must emulate this functionality first; evolution requires functional replacements before it allows for improvements. But what if our particular application never exercises some of the behaviors our system provides? What if all the software my PDP-11/70 will ever run uses floating point division exclusively and never touches integer divide?

Now that we have multicore CPUs, we can partition processes onto separate pieces of hardware. In reconfigurable hardware, if we know that a process is never going to use integer division, we should be able to use the area it occupies for something else. What if we might use DIV, but not all that often: can we use a trap to reconfigure the FPGA whenever DIV is called? This sort of introspection is not currently implemented for FPGAs. If we wanted to run thousands of emulators, we could manage hardware resources so that our cores don't waste area on functions they never use. We could similarly manage the interconnect between our emulators if we know the communication topology of our processes.
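To make the trap idea concrete, here is a purely speculative sketch in C. Everything in it is hypothetical: the runtime hooks are stubs standing in for machinery that does not exist for any FPGA today, and the threshold is arbitrary. The one real detail is the opcode check, since DIV is 071RSS octal and already raises the reserved-instruction trap on PDP-11 models that lack it.

    #include <stdint.h>
    #include <stdio.h>

    /* Speculative sketch: DIV is left out of the fabric, so executing it
       raises the reserved-instruction trap.  The handler emulates DIV in
       software and, once it keeps getting hit, asks a hypothetical
       runtime to swap a real divider into the FPGA. */

    struct cpu_state { int32_t regs[8]; };

    /* Hypothetical hooks, stubbed out so the sketch compiles. */
    static void emulate_div(uint16_t op, struct cpu_state *cpu)
    { (void)op; (void)cpu; puts("DIV emulated in software"); }

    static void fpga_swap_in(const char *unit)
    { printf("reconfiguring the fabric with a %s\n", unit); }

    #define DIV_TRAP_THRESHOLD 16            /* arbitrary cutover point */
    static unsigned div_traps;

    void reserved_instruction_trap(uint16_t opcode, struct cpu_state *cpu)
    {
        if ((opcode & 0177000) == 0071000) { /* DIV is opcode 071RSS (octal) */
            emulate_div(opcode, cpu);        /* slow software fallback */
            if (++div_traps == DIV_TRAP_THRESHOLD)
                fpga_swap_in("divide unit"); /* later DIVs hit real hardware */
        }
        /* other reserved opcodes would be dispatched here */
    }

    int main(void)
    {
        /* pretend the running program hits DIV twenty times */
        struct cpu_state cpu = {{0}};
        for (int i = 0; i < 20; i++)
            reserved_instruction_trap(0071027, &cpu); /* an arbitrary DIV encoding */
        return 0;
    }

The interesting design question is the cutover policy: reconfiguration is slow, so you would only want to pay for it once the trap rate shows the area is actually worth spending.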

FPGAs emulate the behavior of an ASIC at perhaps 1/10th the speed or 1/100th the power efficiency of an ASIC. This means that anything that runs efficiently on an FPGA has a direct path to running hundreds of times more efficiently as an ASIC. If we can establish a fixed process topology for a particular supercomputing system, it ought to be possible to design FPGA and ASIC systems with hundreds or even thousands of mutually incompatible data path units, each optimized to run a particular process in the system.

Design tools for developing parallel computing systems on FPGAs and ASICs are only just starting to exist. For example, D. E. Shaw Research has built Anton, an ASIC supercomputer for molecular dynamics simulations, with impressive results. I expect that if a $100M supercomputer is worth making, it is worth making as an ASIC.

Once we can automate the process of developing ASIC supercomputers, we should look towards wafer-scale and 3-D integration to increase the computational density of our systems. This requires new models for fault tolerance and heat removal, but if a $500M supercomputer is worth making, it is worth making as a thick cylinder of bonded wafers.