In September I wrote:

To optimize for static currents, using dynamic threshold scaling (modulating the body bias voltage of the chip) along with dynamic voltage scaling [for active power] seems to be a viable technique. Here's a spreadsheet model (ODS original) for leakage current in a transistor, varying temperature, power and threshold voltage across a reasonable range.

According to this FPGA Journal article, Altera has incorporated programmable body biasing into the logic blocks of their 40 nm Stratix IV FPGA.

Xilinx will probably also follow suit with dynamic threshold scaling sometime this summer.

Altera claims to have 680K logic elements in their highest capacity offering... I think 640K is all anyone will ever need :)

## Wednesday, May 21, 2008

## Sunday, May 11, 2008

### Locality Optimization

Digital system designers of FPGAs and ASICs alike become extremely familiar with the place-and-route problem during late-night moments of desperate agony the day before a deadline. Place-and-route is about to become relevant to a very new market as the distinction between FPGAs and multicore chips turns into a question of granularity. As we increase the number of cores on a chip, the placement of communicating dataflow kernels will become a primary issue for chip-multiprocessor performance.

The time and power required to perform a dataflow computation are determined by the total wire length over which data must travel. In 2-D tiled structures like FPGAs or multiprocessors, the ratio between the worst-case and best-case placement for two communicating kernels grows as the square root of the number of cores: a bad placement of two communicating kernels in a 1 million cell array could cost you on the order of 1000 times more wire length than a good placement. This is easily visualized by the graphic below, and the order-square-root-of-cores relationship holds regardless of whether Manhattan (L1) or Euclidean (L2) distance is used: the best case is two adjacent cells in either metric, while the worst case is opposite corners of the array, which gives a constant factor of 2 under the L1 metric and a constant factor of sqrt(2) under the L2 metric.
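As a quick sanity check of that scaling, here is a minimal sketch. It assumes a square array of `num_cells` cells with unit spacing and a best case of two adjacent cells; the function name `placement_ratios` is made up for this illustration.

```python
import math

def placement_ratios(num_cells):
    """Worst/best wire-length ratio for two kernels on a square grid.

    Assumes num_cells is a perfect square and cells sit on a unit grid.
    """
    side = math.isqrt(num_cells)
    best = 1.0                      # adjacent cells: distance 1 in L1 and L2
    span = side - 1                 # corner-to-corner offset along each axis
    worst_l1 = 2 * span             # Manhattan distance between opposite corners
    worst_l2 = math.sqrt(2) * span  # Euclidean distance between opposite corners
    return worst_l1 / best, worst_l2 / best

l1_ratio, l2_ratio = placement_ratios(1_000_000)
print(f"L1 ratio: {l1_ratio:.0f}, L2 ratio: {l2_ratio:.0f}")
```

For a 1000 x 1000 array both ratios land in the low thousands, which is where the "order of 1000 times" figure comes from.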

As we add more communicating kernels, the cost of poor placement compounds, since every badly placed arc adds to the total wire length. For arbitrary dataflow graphs, the problem of finding the mapping of the graph which minimizes arc length is NP-hard. A generalization of this problem is the quadratic assignment problem (QAP).
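To make the objective concrete, here is a small sketch that scores a placement by total Manhattan arc length and finds the optimum by exhaustive search. The names `wire_length` and `best_placement` and the four-node example graph are assumptions for illustration. The search tries every assignment, so it is only usable for a handful of nodes, which is exactly why heuristics are needed in practice.

```python
from itertools import permutations

def wire_length(edges, placement):
    """Total Manhattan length of all arcs for a node -> (row, col) mapping."""
    total = 0
    for src, dst in edges:
        (r1, c1), (r2, c2) = placement[src], placement[dst]
        total += abs(r1 - r2) + abs(c1 - c2)
    return total

def best_placement(nodes, edges, sites):
    """Try every assignment of nodes to sites (factorial cost)."""
    best_cost, best_map = None, None
    for perm in permutations(sites, len(nodes)):
        candidate = dict(zip(nodes, perm))
        cost = wire_length(edges, candidate)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, candidate
    return best_cost, best_map

# A ring of four kernels placed on a 2 x 2 array.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
sites = [(r, c) for r in range(2) for c in range(2)]

crossed = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
cost, mapping = best_placement(nodes, edges, sites)
print("crossed placement:", wire_length(edges, crossed))
print("optimal placement:", cost, mapping)
```

Even in this tiny case the crossed placement uses 50% more wire than the optimum.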

Several methods are employed to perform placement, and they are often combined in hybrid approaches. A generic method to accelerate place-and-route is to partition the design into sub-graphs. Partitioning dataflow graphs into independently placeable groups allows a hierarchical placer to perform multiple sub-placements in parallel. Partitioning can be used as a general strategy, with QAP solvers or simulated annealing methods used to solve the placement problem within partitions.
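For a single partition, simulated annealing can be sketched in a few lines. This is a generic textbook-style version, not the algorithm of any particular tool: the swap move, the geometric cooling schedule, and the parameter names (`steps`, `t_start`, `t_end`) are all assumptions chosen for illustration.

```python
import math
import random

def total_wire_length(edges, placement):
    """Sum of Manhattan distances over all arcs."""
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1]) for u, v in edges)

def anneal(edges, placement, steps=5000, t_start=5.0, t_end=0.01, seed=0):
    """Swap pairs of nodes, accepting uphill moves with Boltzmann probability."""
    rng = random.Random(seed)
    current = dict(placement)
    nodes = list(current)
    cost = total_wire_length(edges, current)
    best, best_cost = dict(current), cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        a, b = rng.sample(nodes, 2)
        current[a], current[b] = current[b], current[a]
        new_cost = total_wire_length(edges, current)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        else:
            # Rejected move: undo the swap.
            current[a], current[b] = current[b], current[a]
    return best, best_cost

# A 16-kernel pipeline scattered at random over a 4 x 4 array.
chain = list(range(16))
chain_edges = [(i, i + 1) for i in range(15)]
grid_sites = [(r, c) for r in range(4) for c in range(4)]
random.Random(1).shuffle(grid_sites)
initial = dict(zip(chain, grid_sites))

initial_cost = total_wire_length(chain_edges, initial)
placed, final_cost = anneal(chain_edges, initial)
print("before:", initial_cost, "after:", final_cost)
```

Because every site here is occupied, swapping two nodes is the only move needed; a sparser array would also need moves into empty sites. The lower bound for this pipeline is 15, one unit of wire per arc.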

Tools allow a designer to constrain the physical placement of partitions so that they only need to be re-placed when they change. Since development of FPGA modules usually requires multiple static system elements to test the user code, it is nice not to have to work on these sections again. This also allows board vendors to offer a partial bitstream with interface components pre-placed in the FPGA to hide messy interface signals and provide a simple development environment. A partition-constrained place-and-route can then map new user logic independently of system logic.

A spreadsheet programming model directly exposes the locality issue to the programmer. We should expect that a spreadsheet consisting of 6-input logic functions on boolean types should map immediately to the logic fabric of a Virtex-5. If we bring up the granularity of our spreadsheet to integer and character operations we should still expect a somewhat direct placement from the spreadsheet to the FPGA. This does not guarantee that the interconnect can be routed: placement is constrained by routing interconnection and place-and-route algorithms consider routing while placing components. See ulogic.org for information about the Xilinx bitstream and XDL format as well as the programmable interconnection points.

Here is some Python code to compute the wire-length metrics for an Excel spreadsheet. It doesn't take many lines of code to write macros that do simulated annealing on a spreadsheet. I included code which uses Excel's Cut functionality to move cells while preserving the dataflow graph. The slowness of the COM interface in this example will prevent you from getting far.
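The listing itself did not survive on this page, so what follows is a stand-in sketch rather than the original code. It avoids COM entirely and assumes the sheet has already been read into a plain dict mapping cell addresses to formula strings. The regex only recognises simple A1-style references: ranges, sheet-qualified names, and function names that look like references (such as `LOG10`) are not handled. The names `CELL_REF`, `to_row_col` and `sheet_wire_length` are made up for this example.

```python
import math
import re

# Simple A1-style reference, with optional $ anchors.
CELL_REF = re.compile(r"\$?([A-Z]+)\$?([0-9]+)")

def to_row_col(column_letters, row_digits):
    """Convert ('AB', '12') into 1-based (row, column) numbers."""
    col = 0
    for ch in column_letters:
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(row_digits), col

def sheet_wire_length(formulas):
    """Return (L1, L2) wire length summed over every cell reference."""
    total_l1, total_l2 = 0, 0.0
    for address, formula in formulas.items():
        row, col = to_row_col(*CELL_REF.fullmatch(address).groups())
        for letters, digits in CELL_REF.findall(formula):
            src_row, src_col = to_row_col(letters, digits)
            d_row, d_col = abs(row - src_row), abs(col - src_col)
            total_l1 += d_row + d_col
            total_l2 += math.hypot(d_row, d_col)
    return total_l1, total_l2

# Hypothetical sheet contents: address -> formula.
formulas = {
    "C1": "=A1+B1",
    "C3": "=C1*A3",
    "D4": "=C3+$A$1",
}
sheet_l1, sheet_l2 = sheet_wire_length(formulas)
print(f"Manhattan: {sheet_l1}, Euclidean: {sheet_l2:.2f}")
```

A metric like this is the cost function an annealing macro would try to reduce as it moves cells around.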
