Monday, August 24, 2009

Intel Buying Everyone

In the past month, Intel has purchased RapidMind and Cilk. I talked about Cilk on this blog a while ago (the post has comments from one of their founders).

These were good moves for Intel. They are probably an attempt to make the eventual release of Larrabee less painful for developers, and they should help put Intel in the driver's seat for parallel programming platforms.

What will this mean for CUDA and OpenCL? (Full disclosure: I own shares in Nvidia).

RapidMind and Cilk are both easier platforms to use than Nvidia's CUDA, but the sheer number of Teraflops available across all the CUDA-capable nodes out there makes CUDA attractive. Intel still needs silicon to compete with CUDA. RapidMind and Cilk will give Intel's silicon a much more flexible programming model than CUDA gives Nvidia's GPUs, complementing the fact that Intel's silicon will be a much more flexible architecture than Nvidia's GPUs.

Cilk and RapidMind will simplify some of the work of parallelizing library functions, but Intel will be hard-pressed to compete with Nvidia in cost/performance ratio in any application with a strong CUDA library. Nvidia GPUs are already cheap: Intel will have to use their insane operating leverage to compete in the accelerator market on a cost/performance basis. Intel can also win this market from Nvidia by getting their latest integrated graphics chips in all the newest machines and generally by doing things that piss off anti-trust prosecutors.

I'm not very hopeful for OpenCL. Unless Nvidia decides to abandon CUDA or make it isomorphic with OpenCL, OpenCL is DOA. Apple's dependency on Intel means they will eventually find Zen in whatever platform Intel offers them. AMD is the first, and will probably be the only one, to support this "Open Standard" for GPGPU and multicore. Unfortunately, they will find themselves the leader in a very small market. AMD needs to focus on crushing Intel in the server market by getting to 32 nm first and releasing octo-core Opterons.

This will be interesting to watch unfold.

Wednesday, August 19, 2009

Power vs Speed

"It would appear that we have reached the limits of what it is possible to achieve with computer technology, although one should be careful with such statements, as they tend to sound pretty silly in 5 years." - John von Neumann, 1949

The design goals in parallel computing differ between the embedded multicore market and the high performance computing market. On the one hand, more cores can do parallel work and get a job done faster; on the other hand, power efficiency can be increased with no penalty to throughput by doubling the number of cores and halving their clocks. Both switching power and leakage power can be optimized using this strategy: voltage scaling techniques can address dynamic current, and threshold scaling can address leakage. Place-and-route optimization can improve both power efficiency and maximum performance, and a number of advanced circuit design techniques can also address power efficiency.

In traditional CMOS operation, power P is proportional to C*f*V^2 (C is the capacitance switched per clock cycle, f is the switching frequency, V is the supply voltage). If you double the number of cores and halve their frequency, then the total capacitance doubles while the f*V^2 term drops to an eighth of its original value, cutting total power to roughly a quarter for the same throughput: lowering f allows us to proportionally lower V because we need less potential to charge up a capacitor over a longer period.
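
To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python, assuming (as a simplification) that the supply voltage can be scaled linearly with frequency:

```python
# Rough sketch of the dynamic-power tradeoff described above, under the
# simplifying assumption that supply voltage scales linearly with frequency.
def dynamic_power(cores, cap_per_core, freq, vdd):
    """Switching power: P ~ C * f * V^2, summed over all cores."""
    return cores * cap_per_core * freq * vdd**2

# Baseline: one core at frequency f and supply voltage V (normalized to 1).
base = dynamic_power(cores=1, cap_per_core=1.0, freq=1.0, vdd=1.0)

# Double the cores, halve the frequency, and scale V down with f.
scaled = dynamic_power(cores=2, cap_per_core=1.0, freq=0.5, vdd=0.5)

print(scaled / base)  # ~0.25: same aggregate throughput at a quarter of the power
```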

A circuit designer will often use the latency of an inverter as their unit of propagation delay. For example, a combinatorial unit may have a delay equivalent to 10 inverter delays. If the system clock is generated by a ring oscillator composed of a series of inverters, then the propagation delay of each inverter will increase as we lower the supply voltage. Thus, lowering the supply voltage also lowers the ring oscillator frequency. Since the combinatorial paths in the circuit are measured in units of the inverter delay of the ring oscillator, all of the elements in the circuit will still operate with appropriate relative timing after the supply voltage is lowered.
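
As a rough sketch of that argument (not a real delay model), the snippet below uses the alpha-power-law delay approximation with assumed values for Vth and alpha, and shows that a 10-inverter combinatorial path stays at the same fraction of the ring-oscillator clock period as the supply voltage drops:

```python
# Illustrative only: model each inverter's delay with the alpha-power law,
# t_d ~ Vdd / (Vdd - Vth)^alpha. Vth and alpha are assumed values.
def inverter_delay(vdd, vth=0.3, alpha=1.3):
    return vdd / (vdd - vth) ** alpha

def ring_oscillator_period(vdd, stages=11):
    # One clock period is roughly two trips around an odd-length inverter ring.
    return 2 * stages * inverter_delay(vdd)

def combinatorial_delay(vdd, inverter_delays=10):
    return inverter_delays * inverter_delay(vdd)

for vdd in (1.0, 0.8, 0.6):
    ratio = combinatorial_delay(vdd) / ring_oscillator_period(vdd)
    print(f"Vdd={vdd:.1f}: path delay = {ratio:.3f} clock periods")
# The ratio is constant: every delay scales by the same factor, so relative
# timing is preserved as the supply voltage is lowered.
```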

Since the transition speed of a circuit is dependent on the threshold voltage (see pp 15-18), when we lower our clock frequency, we may also raise our threshold voltage to decrease the leakage current. To address static leakage, multithreshold CMOS techniques may be used along with power-enable circuitry. However, multithreshold CMOS will not allow us to decrease leakage dynamically in frequency scaling situations. A while back I made a spreadsheet model of the leakage current (Calc ods format) to demonstrate the benefit of threshold voltage scaling. The threshold voltage is a function of the body bias of a transistor, so additional fabrication steps are required to make it variable. Both Xilinx and Altera have incorporated threshold scaling in their latest FPGAs.
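
As a rough stand-in for that spreadsheet, here is a tiny Python model of the same idea, using the standard exponential subthreshold leakage relation; the slope factor and threshold voltages below are assumed, not measured:

```python
import math

# Subthreshold leakage scales roughly as exp(-Vth / (n * V_T)).
V_T = 0.026   # thermal voltage at room temperature, volts
n = 1.5       # subthreshold slope factor (assumed)

def relative_leakage(vth):
    return math.exp(-vth / (n * V_T))

low_vth, high_vth = 0.25, 0.35    # volts (illustrative)
print(relative_leakage(high_vth) / relative_leakage(low_vth))
# ~0.08: raising Vth by 100 mV cuts leakage by roughly an order of magnitude
```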

When the supply voltage is lower than the threshold voltage, a CMOS circuit operates in subthreshold mode. Subthreshold switching allows for very low power operation in low speed systems. A major issue with subthreshold circuit design is the tight dependency on parameter accuracy: subthreshold drain currents are exponentially dependent on the threshold voltage, so small parameter variations can have extremely large effects on the system design. This increases the cost of testing devices and decreases the yield.
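
A quick, purely illustrative calculation of that sensitivity, using the same exponential subthreshold relation with assumed numbers:

```python
import math

# Subthreshold drain current goes roughly as exp((Vgs - Vth) / (n * V_T)),
# so a modest Vth shift moves the current by a large multiplicative factor.
V_T, n = 0.026, 1.5   # thermal voltage and slope factor (assumed)

def subthreshold_current(vgs, vth):
    return math.exp((vgs - vth) / (n * V_T))

nominal = subthreshold_current(vgs=0.30, vth=0.35)
fast    = subthreshold_current(vgs=0.30, vth=0.32)   # Vth 30 mV low
slow    = subthreshold_current(vgs=0.30, vth=0.38)   # Vth 30 mV high

print(fast / nominal, slow / nominal)
# ~2.2x and ~0.46x: a +/-30 mV threshold shift roughly doubles or halves the current
```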

As the thickness of the gate dielectric decreases, gate leakage current increases due to the tunnelling of electrons through the dielectric. "High-K" dielectrics address this issue; Intel uses a hafnium-based dielectric instead of SiO2 in their 45-nm process. The optimal gate thickness for gate leakage depends highly on the expected operating temperature and the supply and threshold voltages.

One of the main benefits of FPGAs is their versatility across a range of operating conditions. Still, it is suboptimal to use the same circuit in a variety of different regimes: it is best to optimize a circuit around a small operating range. Thus we should expect high performance devices to be different animals from low power devices. In some systems it makes sense to use two separate chips: one for high power, high speed operation and another for low power, low speed operation. This is starting to become a common practice for laptop GPUs.

A major cost of distributing work is moving data; this can be addressed with a place-and-route optimization that minimizes the signaling distance required for communicating processes. Shorter communication paths can translate into increased power efficiency or reduced latency. When optimizing for latency, the goal is to minimize the maximum-latency path. When optimizing for power, the activity factor of each wire has to be accounted for, so the goal is to minimize the weighted sum of activity times wire length.
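
A toy sketch of the two objectives, with made-up nets, wire lengths, and activity factors:

```python
# Each net: (name, wire_length, activity factor per clock cycle). All numbers
# are made up for illustration.
nets = [
    ("clk_branch", 5.0, 1.0),
    ("data_bus",   8.0, 0.3),
    ("ctrl",       2.0, 0.1),
]

# Latency-driven placement: minimize the worst-case (longest) path.
latency_cost = max(length for _, length, _ in nets)

# Power-driven placement: minimize the activity-weighted total wire length.
power_cost = sum(length * activity for _, length, activity in nets)

print(latency_cost, power_cost)  # 8.0 vs 7.6 for this toy netlist
```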

To reduce total interconnect length, 3-D integration can be used to stack circuit elements. In three dimensions, it is possible to cut the total amount of interconnect in half. Circuits may be stacked using wafer bonding or epitaxial growth; however, both processes are expensive. A major concern with 3-D integration is heat removal; IBM has demonstrated water cooling of 3-D integrated chips. The yield of a wafer-bonded circuit depends on the defect density of each bonded component, so defect tolerance must be incorporated into the system design. Another issue to consider is the need for 3-D place-and-route tools.
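
A back-of-the-envelope yield calculation for a stacked part, assuming a simple Poisson defect model per layer and that any one bad layer kills the whole stack (all numbers are illustrative):

```python
import math

def layer_yield(area_cm2, defects_per_cm2):
    # Simple Poisson yield model: Y = exp(-area * defect_density).
    return math.exp(-area_cm2 * defects_per_cm2)

single = layer_yield(area_cm2=1.0, defects_per_cm2=0.2)   # ~0.82
stacked = single ** 3                                      # three bonded layers
print(single, stacked)  # ~0.82 vs ~0.55: why defect tolerance matters in 3-D stacks
```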

One of the most costly wires in a system is the clock, which has high activity and tentacles spanning the entire chip. Clock distribution power is often a double-digit percentage of the total system budget. Asynchronous circuits operate without a clock, using handshakes to synchronize separate components, thereby eliminating the cost of clock distribution.

Adiabatic circuitry uses power clocking and charge-recovery circuitry to asymptotically eliminate switching power as the switching time increases. Combined asynchronous, adiabatic logic uses the asynchronous component handshake as the power clock.
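
A rough sketch of that asymptotic behavior, using the standard ramp-charging approximation E ~ (RC/T)*C*V^2 against the fixed 0.5*C*V^2 of abrupt CMOS switching; R, C, and V below are assumed, illustrative values:

```python
R, C, V = 1e3, 1e-15, 1.0           # ohms, farads, volts (illustrative)

conventional = 0.5 * C * V**2        # energy lost per abrupt CMOS transition

for T in (1e-11, 1e-10, 1e-9):       # ramp (power-clock) period, seconds
    adiabatic = (R * C / T) * C * V**2
    print(f"T={T:.0e} s  adiabatic/conventional = {adiabatic / conventional:.3f}")
# Dissipation falls off as 1/T, asymptotically approaching zero as switching slows.
```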

With a number of different technologies available to address power concerns, how can the digital designer rapidly explore a range of architectural possibilities? Automation tools need to be able to transform a digital design into a low-power subthreshold implementation or into a high speed circuit with dynamic supply and threshold scaling. These tools need to be aware of the power, speed, and manufacturing tradeoffs associated with each of these semiconductor technologies. This will almost certainly require multiple vendors' tools playing nice with each other.