Monday, November 02, 2009

Altium Nanoboard 3000: Day 1

I received an Altium Nanoboard 3000 with the condition that I provide feedback on my experience using it. I hope this information will be useful to Altium and to other Nanoboard users.

This will be the first in a multi-part series about using the Altium Nanoboard 3000 with the specific angle of teaching a total FPGA newbie how to get started with this board.

A friend of mine is a senior in MIT's Aeronautical Engineering department. He has to build a UAV for a class project and wants to use an FPGA to control servos. He asked me to teach him how to use FPGAs. I told him I was getting a new board aimed at giving newbies like him a good user experience, and that I would teach him how to use it. We started to install the Altium Designer software on my friend's dual-core laptop. "Should we read this license carefully or just accept it like we usually do?"

Eventually Altium Designer finished installing; we started it up and figured out how the board-tied licensing works.

Our attempt to use the board was thwarted by the Xilinx development tools' massive hard drive space requirement and extremely large download. Installing the WebPACK wasn't worth the time it would take to clean up space on his PC and then download 5 GB onto a computer that would rarely have the NanoBoard attached. I decided to install the Altium and Xilinx software on my own laptop and become familiar with it before trying to pass it off to a newbie. My laptop is a single-core (Pentium M), 1 GB RAM Toshiba Satellite R10 tablet, recently reformatted with a fresh install of Windows 7 (which broke a bunch of things, but is fine overall). Altium installed in about 24 minutes from the DVD, and I left the Xilinx WebPACK to download and install over a few hours while I did something else. One of the optional Xilinx DSP tools failed to install because it does not support Windows 7.

This Xilinx dependency isn't necessarily Altium's fault, but if they're trying to focus on user experience, they ought to figure out how to make this all actually work without requiring multiple hours of downloading and installing. The DVD case provided seems more expensive than a USB key that could hold all the required software pre-installed. Frankly, I'm tired of software installation processes; my writings about EDA SaaS perhaps point to a better way: provide all the software on a remote host, give me a small 1 MB bitstream downloader to run locally, and let's get on our jolly way already. Maybe allow a 10-20 MB maximum download and install if you want to include some complex software locally near the board, like an in-circuit debugger. Xilinx's software downloader and installer requires 90 MB before you even start installing the rest of the 6 GB. I just don't see why it is technically necessary to download and install multiple gigabytes of data before I can start pumping bitstreams down a USB cable.

The Altium Designer system requirements indicate that it was not designed to run at 1024 x 768 resolution, and indeed several of the dialog boxes are cut off with no way to scroll or resize. My system barely meets the rest of the minimum specifications. I will probably have to install this software on a better system, but I do not have easy access to one right now.

So I have installed all the required software from Altium and Xilinx onto my laptop and finished all the license registration with Altium and Xilinx. I'm about 20 seconds into the "Simple LED Driver" demo video in the "Training Room" video set, and I cannot insert a new schematic into the project as shown in the demonstration video. If I add a new VHDL file first, I can then insert a new schematic, and I can then remove and close the VHDL file, leaving just a schematic in the project. To isolate the problem I created a new project and attempted to add a new schematic; it failed a few times. I repeated the VHDL insertion, and then adding a new schematic worked without a problem. I restarted Altium Designer, created a new project, and attempted to add a new schematic: it failed again. The VHDL-followed-by-a-schematic trick worked again. I just sent an email to Altium describing this bug.

Certain hints are missing from the first tutorial video that would really help first-time Altium Designer users: "hit spacebar to rotate a schematic symbol," "hold Ctrl while dragging a block to extend the bus connection," "right-click to terminate a bus line without attaching it to a port." Overall, the video is informative enough, and the help search is enough for someone who has done this before and knows the lingo (for example, "net label" is a meaningless term to someone who has never designed electronic systems).

At 3:45 into the "Simple LED Driver" video, I am told to be patient while the Xilinx tools finish running. A few minutes into the build process, the Xilinx Mapper fails with this error:

INFO:Security:56 - Part 'xc3s1400an' is not a WebPack part.
INFO:Security:61 - The XILINXD_LICENSE_FILE environment variable is not set.
INFO:Security:63 - The LM_LICENSE_FILE environment variable is not set.
INFO:Security:68 - Please run the Xilinx License Configuration Manager
(xlcm or "Manage Xilinx Licenses")
to assist in obtaining a license.
ERROR:Security:9 - No 'ISE' feature was available for part 'xc3s1400an'.
ERROR:Security:12 - No 'xc3s1400an' feature was available (-5).
----------------------------------------------------------------------
This is an unfortunate error message to get, because the Altium NanoBoard resources page has a clear link to "Download Xilinx ISE WebPACK 11.1" even though, as the error above says, the xc3s1400an on the board is not a WebPACK part, so the free WebPACK license cannot build for it. Unforgivable.

I ran the Xilinx License Configuration Manager to get a 30-day node-locked evaluation license, so I can target the FPGA on the NanoBoard from my laptop for the next 30 days. My Altium Designer license lasts until next year, though, so either Altium will have to get me another gift or I'll have to see whether I can use my work license for ISE on my laptop. For work I run the Xilinx tools on a beefy machine over a remote desktop connection on a VPN, and I'm not sure whether that license can be transferred to, or otherwise used from, my laptop. I will not be installing Altium Designer on my work computer, since the license is tied to the NanoBoard itself.

After setting up the 30-day node-locked evaluation license from Xilinx, everything runs as expected and as demonstrated in the video: I can control the board's LEDs using sliders in Altium Designer on my PC, setting the red, green and blue values. A very satisfying effect, and after all the initial bumps I look forward to exploring the rest of the tutorials, at least in as much free time as I can scrounge between now and when my temporary evaluation license for ISE expires.

Overall first impressions: Altium has not field-tested the user experience of the NB3000 on an FPGA-n00b. Since I am being their guinea pig I feel like I have to be pretty explicit about how my experience deviates from the marketing claims on their website. Let me break down their claims and give you my verdict:

"Not your average FPGA-based development board" - definitely above average. This board has decidedly more sex-appeal than any other I have received in the past. It even has MIDI I/O already soldered on the board so I don't need to go out and buy an optocouper and a female connector to make MIDI synthesizers.

"you don't need any FPGA design experience to use it" - I sincerely doubt that someone without Xilinx already installed could get through the installation and licensing in less than 5 hours. Assuming the installation and licenses are already set, I don't think a newbie can get through the first 4:35 Simple LED Driver demo video in less an hour. The video should be drawn out to make each step even more explicit (how to make bus connections, rotate schematic symbols, etc). Now that I've taken care of all the licensing, I will see how my Aeronautical Engineering friend gets through the first tutorial video.

"Simply install the software, connect the NanoBoard and you're ready" - Ready to get frustrated with the Xilinx webpack installation. And then get frustrated with the Xilinx license. The initial user experience with the NanoBoard still needs some polish, and I think this article will help them with that.

Monday, August 24, 2009

Intel Buying Everyone

In the past month, Intel has purchased RapidMind and Cilk. I talked about Cilk on this blog a while ago (that post has comments from one of their founders).

This was a good move for Intel. It is probably an attempt to make the eventual release of Larrabee less painful for developers, and it will help put Intel in the lead for parallel programming platforms.

What will this mean for CUDA and OpenCL? (Full disclosure: I own shares in Nvidia).

RapidMind and Cilk are both easier platforms to use than Nvidia's CUDA, but the total number of Teraflops available across all the CUDA-capable nodes out there makes CUDA attractive. Intel still needs silicon to compete with CUDA. RapidMind and Cilk will give Intel's silicon a much more flexible programming model than CUDA gives Nvidia's GPUs, complementing the fact that Intel's silicon will be a much more flexible architecture than Nvidia's GPUs.

Cilk and RapidMind will simplify some of the work of parallelizing library functions, but Intel will be hard-pressed to compete with Nvidia in cost/performance ratio in any application with a strong CUDA library. Nvidia GPUs are already cheap: Intel will have to use their insane operating leverage to compete in the accelerator market on a cost/performance basis. Intel can also win this market from Nvidia by getting their latest integrated graphics chips in all the newest machines and generally by doing things that piss off anti-trust prosecutors.

I'm not very hopeful for OpenCL. Unless Nvidia decides to abandon CUDA or make it isomorphic with OpenCL, OpenCL is DOA. Apple's dependency on Intel means they will eventually find Zen in whatever platform Intel offers them. AMD is the first, and will probably be the only, vendor to support this "Open Standard" for GPGPU and multicore. Unfortunately, they will find themselves the leader in a very small market. AMD needs to focus on crushing Intel in the server market by getting to 32 nm first and releasing octo-core Opterons.

This will be interesting to watch unfold.

Wednesday, August 19, 2009

Power vs Speed

It would appear that we have reached the limits of what it is possible to achieve with computer technology, although one should be careful with such statements, as they tend to sound pretty silly in 5 years - John von Neumann, 1949

The design goals in parallel computing differ between the embedded multicore market and the high performance computing market. On the one hand, more cores can do parallel work and get a job done faster; on the other hand, power efficiency can be increased with no penalty to throughput by doubling the number of cores and halving their clocks. Both switching power and leakage power can be optimized using this strategy: voltage scaling techniques address dynamic current, and threshold scaling addresses leakage. Place-and-route optimization can improve both power efficiency and maximum performance, and a number of advanced circuit design techniques can also address power efficiency.

In traditional CMOS operation, dynamic power P is proportional to C*f*V^2 (C is the capacitance switched per clock cycle, f the switching frequency, and V the supply voltage). If you double the number of cores and halve their frequency, the total capacitance doubles while the f*V^2 term drops to roughly one eighth of its original value: lowering f allows us to proportionally lower V, because we need less potential to charge a capacitor over a longer period. The net effect is the same throughput at roughly a quarter of the power.
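Written out as a first-order sketch (ignoring leakage, interconnect overhead, and the practical limits of voltage scaling), the arithmetic is:

P_{\text{dyn}} \propto C f V^2
\qquad\Longrightarrow\qquad
P' \propto (2C)\cdot\frac{f}{2}\cdot\left(\frac{V}{2}\right)^2 = \frac{1}{4}\,C f V^2

The factor of two from the extra capacitance is outweighed by the factor of eight from f*V^2, assuming V really can track f linearly.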

A circuit designer will often use the latency of an inverter as their unit of propagation delay. For example, a combinatorial unit may have a delay equivalent to 10 inverter delays. If the system clock is generated by a ring oscillator composed of a series of inverters, then the propagation delay of each inverter will increase as we lower the supply voltage. Thus, lowering the supply voltage also lowers the ring oscillator frequency. Since the combinatorial paths in the circuit are specified in units of the ring oscillator's inverter delay, all of the elements in the circuit will still operate with appropriate relative timing after the supply voltage is lowered.
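As a rough rule of thumb for why that happens, the alpha-power-law model (a generic textbook approximation, not tied to any particular process) puts gate delay at

t_d \propto \frac{C_L V_{DD}}{(V_{DD}-V_{th})^{\alpha}}, \qquad 1 < \alpha \le 2

so the inverter delay, and with it the ring oscillator period, stretches out as the supply voltage drops toward the threshold voltage.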

Since the transition speed of a circuit depends on the threshold voltage (see pp 15-18), when we lower our clock frequency we may also raise our threshold voltage to decrease the leakage current. To address static leakage, multithreshold CMOS techniques may be used along with power-enable circuitry. However, multithreshold CMOS will not allow us to decrease leakage dynamically in frequency scaling situations. A while back I made a spreadsheet model of the leakage current (Calc ods format) to demonstrate the benefit of threshold voltage scaling. The threshold voltage is a function of the body bias of a transistor, so additional fabrication steps are required to make it variable. Both Xilinx and Altera have incorporated threshold scaling in their latest FPGAs.
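For readers who don't want to open the spreadsheet, here is a minimal Python sketch of the same idea, using the standard first-order subthreshold current equation; I0, n and the voltages below are made-up illustrative numbers, not measured device parameters or the values from my spreadsheet:

import math

def subthreshold_leakage(vth, vgs=0.0, vds=1.0, i0=1e-7, n=1.5, temp_k=300.0):
    """First-order subthreshold drain current per transistor, in amps."""
    vt_thermal = 1.380649e-23 * temp_k / 1.602177e-19   # kT/q, about 26 mV at 300 K
    return (i0 * math.exp((vgs - vth) / (n * vt_thermal))
                * (1.0 - math.exp(-vds / vt_thermal)))

# Raising Vth from 0.3 V to 0.4 V cuts leakage by roughly exp(0.1 / (n*kT/q)), about 13x here.
for vth in (0.3, 0.4, 0.5):
    print(f"Vth = {vth:.1f} V -> leakage ~ {subthreshold_leakage(vth):.2e} A")

The point of the model is the exponential in the middle: every increment of threshold voltage buys an order-of-magnitude class reduction in leakage, at the cost of slower switching.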

When the supply voltage is lower than the threshold voltage, a CMOS circuit operates in subthreshold mode. Subthreshold switching allows for very low power operation in low speed systems. A major issue with subthreshold circuit design is the tight dependency on parameter accuracy: subthreshold drain currents depend exponentially on the threshold voltage, so small parameter variations can have extremely large effects on the system design. This increases the cost of testing devices and decreases the yield.

As the thickness of the gate dielectric decreases, gate leakage current increases due to electrons tunnelling through the dielectric. "High-K" dielectrics address this issue; Intel uses a hafnium-based dielectric instead of SiO2 in their 45-nm process. The optimal gate thickness for gate leakage depends highly on the expected operating temperature and on the supply and threshold voltages.

One of the main benefits of FPGAs is their versatility across a range of operating conditions. It is suboptimal to use the same circuit in a variety of different regimes: it is best to optimize a circuit around a small operating range. Thus we should expect high performance devices to be different animals from low power devices. In some systems it makes sense to use two separate chips: one for high power, high speed operation and another for low power, low speed operation. This is becoming common practice for laptop GPUs.

A major cost of distributing work is moving data; this can be addressed with a place-and-route optimization that minimizes the signaling distance required for communicating processes. Shorter communication paths can translate into increased power efficiency or reduced latency. When optimizing for latency, the goal is to minimize the maximum-latency path. When optimizing for power, the activity factor of each wire has to be accounted for, so the goal is to minimize the weighted sum of activity times wire length.
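Stated as objectives (my own paraphrase, with d_i the delay of wire i, roughly proportional to its length \ell_i, \alpha_i its activity factor, and P the set of timing paths):

\text{latency:}\;\; \min \; \max_{p \in P} \sum_{i \in p} d_i
\qquad\qquad
\text{power:}\;\; \min \; \sum_{i} \alpha_i\,\ell_i

The two objectives pull the placer in different directions: the latency objective only cares about the single worst path, while the power objective rewards shortening every busy wire, even ones with plenty of timing slack.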

To reduce total interconnect length, 3-D integration can be used to stack circuit elements; in three dimensions it is possible to cut the total amount of interconnect roughly in half. Circuits may be stacked using wafer bonding or epitaxial growth; however, both processes are expensive. A major concern with 3-D integration is heat removal, though IBM has demonstrated water cooling of 3-D integrated chips. The yield of a wafer-bonded circuit depends on the defect density of each bonded component, so defect tolerance must be incorporated into the system design. Another issue to consider is the need for 3-D place-and-route tools.

One of the most costly wires in a system is the clock, which has high activity and tentacles spanning the entire chip. Clock distribution power is often a double-digit percentage of the total system power budget. Asynchronous circuits operate without a clock, using handshakes to synchronize separate components, thereby eliminating the cost of clock distribution.

Adiabatic circuitry uses power clocking and charge-recovery circuitry to asymptotically eliminate switching power as the switching time increases. Combined asynchronous, adiabatic logic uses the asynchronous component handshake as the power clock.
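The asymptotic claim can be made concrete with the usual adiabatic-charging estimate (a textbook first-order result, assuming a load C charged through resistance R by a ramp of duration T much longer than RC):

E_{\text{adiabatic}} \approx \frac{RC}{T}\,C V^2
\qquad\text{vs.}\qquad
E_{\text{conventional}} = \tfrac{1}{2}\,C V^2

so the energy dissipated per transition shrinks toward zero as the ramp time T grows, which is exactly the trade of switching speed for power described above.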

With a number of different technologies available to address power concerns, how can the digital designer rapidly explore the architectural possibilities? Automation tools need to be able to transform a digital design into a low-power subthreshold implementation or into a high speed circuit with dynamic supply and threshold scaling. These tools need to be aware of the power, speed and manufacturing tradeoffs associated with each of these semiconductor technologies. This will almost certainly require multiple vendors' tools playing nice with each other.

Thursday, July 23, 2009

Demand for EDA SaaS

It's been a while; I promise I'll post more when I'm finished with my job.

The blogs have been buzzing about EDA SaaS ("Electronic Design Automation Software as a Service"). In one of my previous posts on the subject, I argued that the complexity of system design is growing faster than the capacity of a reasonable desktop computer, and that this will create demand for hosted EDA tools. I also argued that the ease of rolling out new features and upgrading users to the latest version is a major selling point for EDA SaaS. My experience this past week produced anecdotal evidence for both these points.

I'm coming to the final stages of a PDP-11/70 emulator design, where I have it running test software in simulation. I was running Xilinx ISE 10.1, and about an hour and a half into synthesis I got an out-of-memory error from XST:
ERROR:Portability:3 - This Xilinx application has run out of memory or has encountered a memory conflict. Current memory usage is 2090436 kb. You can try increasing your system's physical or virtual memory. For technical support on this issue, please open a WebCase with this project attached at http://www.xilinx.com/support.

Process "Synthesis" failed
I searched for help on this error and discovered from the Xilinx forums that you can add the /3GB option to a 32-bit Windows machine's boot.ini. A reboot and a couple hours later, I got the same message, only with a larger number for the current memory usage at the time it failed. Before starting to partition my design (something I'll have to do eventually anyway to increase my iteration rate during timing closure), I decided to give it a try on a 64-bit Vista machine. It compiled after several hours, using some ungodly amount of memory.
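For reference, the /3GB change just amounts to appending the switch to the OS entry in C:\boot.ini. The ARC path and description below are a typical example, not my machine's actual entry; yours will differ:

[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect /3GB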

I decided that I should install Xilinx ISE 11.1 on the 32-bit machine and give it a try. After an hour-long installation I have 11.1 running, and after another hour downloading an automatic update to 11.2 I'm ready to go. Running 11.2, the 32-bit machine compiles my design within the 3 GB memory limit.

These problems wouldn't exist in a world where EDA tools are provided as a service. If synthesis tools were hosted on some humongous supercomputer, I wouldn't run out of memory and I wouldn't have to install any software updates. And if the synthesis optimizations and place-and-route could be parallelized across a thousand cores, I might even get my results back in less than a couple of hours.

Anyone want to do this?

---------
Edit 7/24

Another benefit of hosted EDA tools is that errors can be reported directly to the software vendor. This means that your hosted software won't have dozens of users experiencing the same error and not telling anyone.

I started partitioning my design today, and got a wonderfully meaningless error:

INTERNAL_ERROR:Xst:cmain.c:3446:1.47.6.1 - To resolve this error, please consult the Answers Database and other online resources at http://support.xilinx.com

Obviously Xilinx doesn't provide cmain.c as open source, so I can't really figure out what I'm upsetting in the source code. Googling reveals that the Xilinx forums have nothing useful to say about this bug, but I discovered that I am not alone: another Israeli blogger with similar gripes about ISE has hit the same error.

There are thousands of business opportunities that can be created by appending "...that doesn't suck" to the description of an existing product.

Wednesday, March 11, 2009

Emulation is the Sincerest Form of Flattery

I apologize for not writing in a while. I have been running the DEC XXDP diagnostic tests on my PDP-11/70 emulator. I just finished making integer divide compatible: not just with the spec, but with all the edge cases of the 11/70 model. I'm impressed with how they designed and debugged this sort of thing back in the '60s and '70s. Each bug probably took about a day to diagnose with hugely expensive oscilloscopes and logic analyzers, and another day to fix and test. In my case I can simulate every bit in my entire system at about 1 us of simulation time (125 clock cycles) per second of wall clock time, and I can find and fix about 8 to 10 bugs a day. This sort of feedback is what drives Kurzweil's Law of Accelerating Returns.
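Working out the numbers in that sentence (just arithmetic on the figures above):

\frac{125\ \text{cycles}}{1\,\mu\text{s}} = 125\ \text{MHz simulated clock},
\qquad
\frac{1\ \text{s wall clock}}{1\,\mu\text{s simulated}} = 10^{6}\times\ \text{slowdown}

so the simulation chews through roughly 125 target clock cycles per second of wall clock time, and that is still fast enough to close several bugs a day.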

The more powerful the machines we use to design machines, the better we can evaluate how future machines will operate and emulate how old machines did. Before VHDL and behavioral compilation, engineers used hundreds of pages of flowcharts to describe the microcode for each operation of a computer, and the logic and circuitry for each operation was broken down to logic gates. Before programmable interconnect, they practiced the art of wire-wrapping. Now, universities award Electrical Engineering degrees to students who may never experience the distinct smells of soldering a wire or frying an IC.

The art evolves and yet here I am puzzling over microstates and logic diagrams drawn up in the 1970's. It is definitely a little ridiculous to mimic all the overflow conditions of old microcode in VHDL for an FPGA implementation. I doubt anyone would ever write code dependent on these corner case technicalities except to debug the microstates of their particular design. Bell and Strecker admit that the flags of the PDP-11 were over-specified. Yet I, like the orthodox follower of an ancient religion, am making sure that we observe the law of the DIV.60 microstate: "CONTINUE DIVIDE IF QUOTIENT WILL BE POS. BUT ALLOW FOR MOST NEG. NO. IF QUOT. IS TO BE NEG." because my division code apparently does not enter the DVE.20 overflow abort state after the first bit has been computed.

Despite the evolution of physical computing systems, practitioners of the science of computation still base their notion of what a computer does on the ancient art of sequential imperative descriptions. The modern formalization of such descriptions, namely the C language, was in fact born to control the PDP-11. Our model of an algorithm, which the PDP-11 is designed to execute, is based on even older methodologies concerned with instructing a single mathematics student on how to work out a problem with pencil and paper.

In order to usurp the role of the CPU, the next wave of hardware must emulate this functionality first. Evolution requires functional replacements before it allows for improvements. But what if all the possible behaviors of our system are not used by our particular application? What if all the software my PDP-11/70 will run exclusively uses floating point division instead of integer divide?

Now that we have multicore CPUs, we can partition processes onto separate pieces of hardware. In reconfigurable hardware, if we know that a process is never going to use integer division, we should be able to use the area it occupies for something else. What if we might use DIV, but not all that often: can we use a trap to reconfigure the FPGA whenever DIV is called? This sort of introspection is not currently implemented for FPGAs. If we wanted to run thousands of emulators, we could manage hardware resources effectively so that we don't waste area on our cores. We could similarly manage the interconnect between our emulators if we know the communication topology of our processes.

FPGAs emulate the behavior of an ASIC at perhaps a tenth of the speed and a hundredth of the power efficiency. This means that anything that runs efficiently on an FPGA has a direct path to running hundreds of times more efficiently as an ASIC. If we can establish a fixed process topology for a particular supercomputing system, it ought to be possible to design FPGA and ASIC systems with hundreds or even thousands of incompatible data path units optimized to run particular processes in the system.

Design tools for developing parallel computing systems on FPGAs and ASICs are just starting to exist. For example, DE Shaw Research has built an ASIC supercomputer to perform molecular dynamics simulations, with impressive results. I expect that if a $100M supercomputer is worth making, it is worth making as an ASIC.

Once we can automate the process of developing ASIC supercomputers, we should look towards wafer-scale and 3-D integration to increase the computational density of our systems. This requires new models for fault tolerance and heat removal, but if a $500M supercomputer is worth making, it is worth making as a thick cylinder of bonded wafers.