Wednesday, August 19, 2009

Power vs Speed

It would appear that we have reached the limits of what it is possible to achieve with computer technology, although one should be careful with such statements, as they tend to sound pretty silly in 5 years - John von Neumann, 1949

The design goals in parallel computing differ between the embedded multicore market and the high-performance computing market. On the one hand, more cores can do parallel work and get a job done faster; on the other hand, power efficiency can be increased with no penalty to throughput by doubling the number of cores and halving their clocks. Both switching power and leakage power can be optimized with this strategy: voltage scaling addresses dynamic current, and threshold scaling addresses leakage. Place-and-route optimization can improve both power efficiency and maximum performance, and a number of advanced circuit design techniques can also address power efficiency.

In traditional CMOS operation, dynamic power P is proportional to C*f*V^2 (C is the capacitance switched per clock cycle, f is the switching frequency, V is the supply voltage). If you double the number of cores and halve their frequency, the total switched capacitance doubles while the f*V^2 term drops to one eighth of its original value, for a net power reduction of roughly 4x at the same throughput: lowering f allows us to proportionally lower V because we need less potential to charge a capacitor over a longer period.
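
To put numbers on it, here is a minimal sketch of the scaling argument in Python, assuming the supply voltage can be lowered in proportion to frequency (a simplification that only holds over a limited voltage range):

```python
# Relative dynamic power for "double the cores, halve the clock",
# assuming V scales linearly with f (only valid over a limited range).

def dynamic_power(c, f, v):
    """Dynamic switching power, P ~ C * f * V^2 (arbitrary units)."""
    return c * f * v ** 2

baseline = dynamic_power(c=1.0, f=1.0, v=1.0)

# Two cores: total switched capacitance doubles, f and V are both halved.
scaled = dynamic_power(c=2.0, f=0.5, v=0.5)

print(scaled / baseline)   # 0.25 -> same throughput at ~1/4 the power
```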

A circuit designer will often use the latency of an inverter as their unit of propagation delay. For example, a combinatorial unit may have a delay equivalent to 10 inverter delays. If the system clock is generated by a ring oscillator composed of a series of inverters, then the propagation delay of each inverter will increase as we lower the supply voltage. Thus, lowering the supply voltage also lowers the ring oscillator frequency. Since the combinatorial paths in the circuit are specified in units of the ring oscillator's inverter delay, all of the elements in the circuit will still operate with appropriate relative timing after the supply voltage is lowered.
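
Here is a toy sketch of why the relative timing holds, using an alpha-power-law delay model; the threshold voltage and alpha exponent below are illustrative values, not from any particular process.

```python
# Toy alpha-power-law delay model: gate delay grows as the supply drops,
# but a path of 10 inverter delays and a ring-oscillator clock built from
# the same inverters stretch by the same factor, so relative timing holds.
# V_TH and ALPHA are illustrative, not fitted to any real process.

V_TH, ALPHA = 0.3, 1.3

def inverter_delay(vdd):
    """Relative propagation delay of one inverter at supply vdd."""
    return vdd / (vdd - V_TH) ** ALPHA

def clock_period(vdd, stages=11):
    """Ring oscillator period: 2 * stages * inverter delay."""
    return 2 * stages * inverter_delay(vdd)

def path_delay(vdd, inverter_equivalents=10):
    """A combinatorial path specified as 10 inverter delays."""
    return inverter_equivalents * inverter_delay(vdd)

for vdd in (1.2, 0.9, 0.6):
    ratio = path_delay(vdd) / clock_period(vdd)
    print(f"Vdd={vdd:.1f}  path/period = {ratio:.3f}  (constant across Vdd)")
```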

Since the transition speed of a circuit depends on the threshold voltage (see pp 15-18), when we lower our clock frequency we may also raise our threshold voltage to decrease the leakage current. To address static leakage, multithreshold CMOS techniques may be used along with power-enable circuitry; however, multithreshold CMOS will not allow us to decrease leakage dynamically in frequency-scaling situations. A while back I made a spreadsheet model of the leakage current (Calc ods format) to demonstrate the benefit of threshold voltage scaling. The threshold voltage is a function of the body bias of a transistor, so additional fabrication steps are required to make it variable. Both Xilinx and Altera have incorporated threshold scaling in their latest FPGAs.
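
The core relationship in that spreadsheet can be sketched in a few lines. This is a simplified subthreshold leakage model; the reference current and slope factor are illustrative placeholders, not the numbers from the original spreadsheet.

```python
import math

# Simplified subthreshold leakage: I_leak ~ I0 * exp(-Vth / (n * kT/q)).
# I0 and N are illustrative placeholders, not fitted to any real process.

I0 = 1e-6      # reference leakage current (A) at Vth = 0, illustrative
N = 1.5        # subthreshold slope factor, illustrative
VT = 0.0259    # thermal voltage kT/q at ~300 K (V)

def leakage(vth):
    """Subthreshold leakage current at threshold voltage vth (V)."""
    return I0 * math.exp(-vth / (N * VT))

for vth in (0.25, 0.30, 0.35, 0.40):
    print(f"Vth = {vth:.2f} V  ->  I_leak ~ {leakage(vth):.2e} A")
# Each additional 50 mV of threshold cuts leakage by roughly 3.6x here.
```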

When the supply voltage is lower than the threshold voltage, a CMOS circuit operates in subthreshold mode. Subthreshold switching allows for very low power operation for low speed systems. A major issue with subthreshold circuit design is the tight dependency on parameter accuracy: subthreshold drain currents are exponentially dependent on threshold voltage so small parameter variations can have extremely large effects on the system design. This increases the cost of testing devices and decreases the yield.
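
A quick calculation with the same simplified model shows how punishing that exponential dependence is: a 30 mV threshold shift (a magnitude chosen purely for illustration) changes the subthreshold current by more than a factor of two.

```python
import math

N, VT = 1.5, 0.0259            # slope factor and thermal voltage, as above

# Ratio of subthreshold currents for a 30 mV threshold-voltage shift.
delta_vth = 0.030
ratio = math.exp(delta_vth / (N * VT))
print(f"A {delta_vth * 1000:.0f} mV Vth shift changes I_sub by ~{ratio:.1f}x")
# ~2.2x -- a modest process variation becomes a large spread in behavior.
```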

As the thickness of the gate dielectric decreases, gate leakage current increases due to the tunnelling of electrons through the dielectric. "High-k" dielectrics address this issue; Intel uses a hafnium-based dielectric instead of SiO2 in their 45-nm process. The optimal gate thickness for gate leakage depends highly on the expected operating temperature and on the supply and threshold voltages.

One of the main benefits of FPGAs is their versatility across a range of operating conditions. Yet it is suboptimal to use the same circuit in a variety of different regimes: it is best to optimize a circuit around a small operating range. Thus we should expect high-performance devices to be different animals from low-power devices. In some systems it makes sense to use two separate chips: one for high-power, high-speed operation and another for low-power, low-speed operation. This is becoming common practice for laptop GPUs.

A major cost of distributing work is moving data; this can be addressed with place-and-route optimization to minimize the signaling distance between communicating processes. Shorter communication paths can translate into increased power efficiency or reduced latency. When optimizing for latency, the goal is to minimize the maximum-latency path. When optimizing for power, the activity factor of each wire has to be accounted for, so the goal is to minimize the weighted sum of activity times wire length.
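
The two objectives can be written as two cost functions over the same netlist; this minimal sketch uses made-up wire lengths and activity factors.

```python
# Two placement objectives over the same set of nets:
#   latency: minimize the longest wire (worst-case path)
#   power:   minimize the activity-weighted total wire length
# Lengths and activity factors below are made-up illustrative numbers.

nets = [
    # (wire_length, activity_factor)   activity = toggles per cycle
    (12.0, 0.50),   # e.g. a busy datapath wire
    (30.0, 0.02),   # e.g. a long but rarely toggling control wire
    (8.0,  0.90),
]

def latency_cost(nets):
    """Worst-case wire length: the objective when minimizing latency."""
    return max(length for length, _ in nets)

def power_cost(nets):
    """Activity-weighted wire length: proportional to switching power."""
    return sum(length * activity for length, activity in nets)

print("latency cost:", latency_cost(nets))   # dominated by the 30-unit wire
print("power cost:  ", power_cost(nets))     # dominated by the busy wires
```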

To reduce total interconnect length, 3-D integration can be used to stack circuit elements; in three dimensions it is possible to cut the total amount of interconnect roughly in half. Circuits may be stacked using wafer bonding or epitaxial growth, but both processes are expensive. A major concern with 3-D integration is heat removal; IBM has demonstrated water cooling of 3-D integrated chips. The yield of a wafer-bonded circuit depends on the defect density of each bonded component, so defect tolerance must be incorporated into the system design. Another issue to consider is the need for 3-D place-and-route tools.
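
The yield concern can be illustrated with a simple per-layer model. Assuming a Poisson yield Y = exp(-D*A) for each bonded layer (the defect density and die area below are made-up numbers), the yield of the stack is the product of the layer yields.

```python
import math

# Poisson yield model per layer: Y = exp(-D * A).
# D (defects/cm^2) and A (cm^2) are illustrative, not from any real process.

def layer_yield(defect_density, area):
    return math.exp(-defect_density * area)

def stacked_yield(defect_density, area, layers):
    """Wafer-bonded stack: every layer must be good, so yields multiply."""
    return layer_yield(defect_density, area) ** layers

for layers in (1, 2, 4):
    y = stacked_yield(defect_density=0.5, area=1.0, layers=layers)
    print(f"{layers} layer(s): yield ~ {y:.2f}")
# Yield falls quickly with the number of bonded layers, which is why
# defect tolerance has to be designed into the stack.
```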

One of the most costly wires in a system is the clock, which has high activity and tentacles spanning the entire chip; clock distribution power is often a double-digit percentage of the total system power budget. Asynchronous circuits operate without a clock, using handshakes to synchronize separate components, thereby eliminating the cost of clock distribution.
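
As a rough illustration of the handshake idea (a toy four-phase request/acknowledge exchange, not any particular asynchronous design style):

```python
# Toy four-phase (return-to-zero) request/acknowledge handshake.
# This only illustrates the protocol ordering; real asynchronous circuits
# implement it with completion-detection hardware, not sequential code.

class Channel:
    def __init__(self):
        self.req = False
        self.ack = False
        self.data = None

def sender_put(ch, value):
    ch.data = value
    ch.req = True          # phase 1: sender raises request with valid data

def receiver_take(ch):
    assert ch.req          # receiver waits for the request
    value = ch.data
    ch.ack = True          # phase 2: receiver latches data, raises acknowledge
    return value

def return_to_zero(ch):
    assert ch.ack          # sender waits for the acknowledge
    ch.req = False         # phase 3: sender drops request
    ch.ack = False         # phase 4: receiver drops acknowledge; channel is idle

ch = Channel()
sender_put(ch, 42)
print(receiver_take(ch))   # 42
return_to_zero(ch)
```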

Adiabatic circuitry uses power clocking and charge-recovery circuitry to asymptotically eliminate switching power as the switching time increases. Combined asynchronous-adiabatic logic uses the asynchronous component handshake as the power clock.
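
The asymptotic claim follows from the standard adiabatic charging estimate: ramping a capacitance C through an effective resistance R over a time T dissipates roughly (RC/T)*C*V^2, versus (1/2)*C*V^2 for a conventional switching event, so the loss shrinks as T grows. A sketch with illustrative R, C, and V values:

```python
# Energy dissipated per charging event (illustrative R, C, V values):
#   conventional CMOS switching: E = 1/2 * C * V^2, independent of speed
#   adiabatic (ramped) charging: E ~ (R*C / T) * C * V^2, -> 0 as T grows

R = 1e3       # effective switch resistance (ohms), illustrative
C = 1e-14     # load capacitance (F), illustrative
V = 1.0       # voltage swing (V)

E_conventional = 0.5 * C * V ** 2

for T in (1e-10, 1e-9, 1e-8):      # ramp time in seconds
    E_adiabatic = (R * C / T) * C * V ** 2
    print(f"T = {T:.0e} s: adiabatic/conventional = "
          f"{E_adiabatic / E_conventional:.3f}")
```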

With a number of different technologies available to address power concerns, how can the digital designer rapidly explore a range of architectural possibilities? Automation tools need the ability to transform a digital design into a low-power subthreshold implementation or a high-speed circuit with dynamic supply and threshold scaling, and they need to be aware of the power, speed, and manufacturing tradeoffs associated with each of these semiconductor technologies. This will almost certainly require multiple vendors' tools playing nice with each other.
