Tuesday, December 26, 2006

An Interpreted Hardware Description Language

Last semester, some of my classmates and I worked on extensions to Bluespec for Prof. Arvind's class on multithreaded parallelism and compilers. Bluespec's site probably has an overview of the language's ideas, but I'll sum them up here. Bluespec consists of modules which contain registers, rules and methods. A register is a mutable state container. A rule is a guarded atomic action. A guard is a predicate procedure (evaluates to true or false) which determines whether the atomic action may be performed. An atomic action is a set of non-blocking assignments and/or method calls that occur instantaneously and completely (all or nothing, with no interruption). A method is similarly a guarded atomic action, except that it may accept arguments when it is invoked and may return a value. If a method call within an atomic action (of a rule or method) has a guard that evaluates to false, then the entire enclosing atomic action may not be performed.

Methods make up the interface to a module, and may be externally invoked. For example, a module may call any of its submodules' methods within an atomic action. Rules, on the other hand, are entirely internal to the module. Rules may be executed when their guards are true, but there is no imperative within the language to force a particular rule to execute when its guard is true. Here is an example of a GCD coded in Bluespec.
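(A sketch from memory, since the original example was linked rather than inlined; the interface details are approximate.)

    interface GCD;
      method Action start(Bit#(32) a, Bit#(32) b);
      method Bit#(32) result();
    endinterface

    module mkGCD (GCD);
      Reg#(Bit#(32)) x <- mkRegU;
      Reg#(Bit#(32)) y <- mkReg(0);

      // rules fire only when their guards hold
      rule swap (x > y && y != 0);
        x <= y;  y <= x;   // non-blocking: both read the old values
      endrule

      rule subtract (x <= y && y != 0);
        y <= y - x;
      endrule

      // methods are guarded too: start only when the answer has settled
      method Action start(Bit#(32) a, Bit#(32) b) if (y == 0);
        x <= a;  y <= b;
      endmethod

      method Bit#(32) result() if (y == 0);
        return x;
      endmethod
    endmodule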

Bluespec works well as an HDL because the guarded atomic action notion is an exceptionally powerful way to think about designing systems. For starters, the language has no concept of time and is implicitly asynchronous (though I presume all physical implementations are synchronous). Rule-based system specification is nice because it provides a clean constructive approach to solving a problem: add rules which transform the system state closer to a terminal state until all possible non-terminal states are accounted for. The other nicety of specifying what the system may do, as opposed to what it must do, is that it leaves room for many automated optimization strategies in the rule-execution system to minimize system costs (power, area, time). This is a rich area of study and will probably produce many graduate theses ;-)

Our project was focused on sequentializing rules in Bluespec to minimize the amount of hardware required to implement a system (trading away speed). For example, the single rule A <= B + C; B <= D + E; could be split into two rules so that only one adder would be required. This computationally invariant transformation was done with valid Bluespec code as both the source and the target.
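Concretely, the shape of the transformation looks like this (a sketch; register declarations elided):

    // before: needs two adders, since the rule body is one atomic action
    rule r;
      A <= B + C;  B <= D + E;
    endrule

    // after: one adder shared across two mutually exclusive rules;
    // a one-bit phase register sequences them and preserves the net effect
    rule r1 (phase == 0);  A <= B + C;  phase <= 1;  endrule
    rule r2 (phase == 1);  B <= D + E;  phase <= 0;  endrule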

The second aspect of our project was to explore how we could create an interpreted Bluespec-like scripting language. The deficiencies we saw in Bluespec are not related to its functionality as an HDL, but are deficiencies common to hardware description in general. HDL types are static and strict, meaning a module or a register is bound to a type at the time it is defined. Some HDLs do have polymorphic type systems (particularly those based on Haskell, such as Bluespec). Also, no hardware description language allows dynamic creation of new modules, rules or methods at runtime. Dynamic elaboration is desirable in the context of reconfigurable arrays. The language we created for the class was called BlueScheme: it inherits dynamic typing and first-class procedures from Scheme, and guarded atomic actions from Bluespec. The result is a language which supports untyped registers, methods and modules, and dynamic modification of modules, methods and rules as an atomic action.

I've continued working on this BlueScheme language independently since the class finished. The rest of this document is a specification of how it is designed. A module consists of three hashtables: for child-modules, rules and methods. Rules have a guard procedure and an atomic action that takes no parameters and may not return a value. Methods have a guard procedure and an atomic action which may accept arguments and return a value. Child-modules are the modules whose methods may be called from within a module's atomic actions (note that every module is its own child and is referred to as "me"). I suppose "child" is a bit of a misnomer, since multiple modules may share a child module (multiple different modules may use the same GCD).
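In Scheme terms, the skeleton is tiny (a minimal sketch assuming SRFI-69 hash tables; the real implementation's constructors differ):

    ; a module is just three tables; rules and methods are (guard . action) pairs
    (define (make-module)
      (let ((children (make-hash-table))   ; name -> module
            (rules    (make-hash-table))   ; name -> (guard . action)
            (methods  (make-hash-table)))  ; name -> (guard . action)
        (let ((m (vector children rules methods)))
          (hash-table-set! children 'me m) ; every module is its own child, "me"
          m)))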

Rules or methods may be attempted, and may succeed or fail. The source of an attempt may be a module which independently attempts a rule, or the user, who may invoke a method of a module. If a module attempts a rule, or a method is externally invoked, and the guard returns false, then the failure is propagated to the source; otherwise, if the guard succeeds, the atomic action will execute. If the atomic action fails, then again the failure signal propagates to the source. If an attempted atomic action succeeds, it returns a committable action. A committable action is a thunk (a procedure with no arguments) which consists of a set of assignments where the values to be assigned are irreducible expressions (i.e. numbers, symbols, lists, strings, modules, methods, rules, etc.). If a method Foo returns a committable action within method Bar's atomic action, Foo's committable action will be propagated together with the result of Bar. If all attempted actions return to the source without failure, then the source commits all of the committable actions by executing the thunks.
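As a sketch, sticking with the (guard . action) pairs from above (the helper names here are hypothetical; the real code threads failure around differently):

    ; attempting a rule: #f on failure, a committable thunk on success
    (define (attempt rule)
      (let ((guard (car rule)) (action (cdr rule)))
        (if (guard)
            (action)    ; action returns a committable thunk,
                        ; or #f if any method call inside it failed
            #f)))

    ; the source commits only if the whole attempt succeeded
    (define (attempt-and-commit! rule)
      (let ((commit (attempt rule)))
        (if commit (begin (commit) #t) #f)))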

Since I have enabled self-modifying modules, I've eliminated registers from the language semantics and replaced them with modules that support set and get methods (calling set modifies the value returned by the get method). This lets me add rules to a register that have a random guard and flip the register's value, or make the register dysfunctional, in order to mimic defective or faulty systems (useful for designing for fault/defect tolerance).
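For flavor, here is a closure sketch of the idea (not the actual module encoding; assumes an MIT-Scheme-style (random 1.0)):

    ; a register reduced to a (get . set) pair of procedures
    (define (make-register init)
      (let ((val init))
        (cons (lambda () val)                           ; get: current value
              (lambda (v) (lambda () (set! val v))))))  ; set: a committable thunk

    ; a fault-injection rule: fires at random and corrupts a one-bit register
    (define (make-flip-rule reg)
      (cons (lambda () (< (random 1.0) 0.01))  ; guard: passes on ~1% of attempts
            (lambda ()
              (let ((v ((car reg))))
                ((cdr reg) (not v))))))        ; action: committable that flips it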

Currently I have only implemented a rule-execution system that attempts rules in round-robin order. There is a lot more work that could be done on this end...
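The scheduler itself is tiny (a sketch, reusing attempt-and-commit! from the earlier sketch):

    ; round-robin: cycle through a non-empty rule list, attempting each in turn
    (define (run-round-robin rules steps)
      (let loop ((rs rules) (n steps))
        (cond ((= n 0) 'done)
              ((null? rs) (loop rules n))          ; wrap around
              (else (attempt-and-commit! (car rs)) ; a failed attempt just skips
                    (loop (cdr rs) (- n 1))))))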

The language currently uses Scheme syntax, though we wrote a Bluespec parser and printer for our project, so a design that does not contain any dynamic parts could already be converted into valid Bluespec and compiled to C or Verilog. I am also interested in seeing how to use this as a runtime language for a dynamically reconfigurable FPGA. One realization would treat the language as an extension of a file system. In this view we treat modules as "directories," rules as running processes scoped within a directory, and methods as executables. When a module needs acceleration, its directory structure could be moved into the FPGA. Presumably it should still be able to interface with modules that are implemented on sequentially executing CPUs.

In order to transform a module structure into an FPGA structure, we need to type-constrain our modules and methods and map all functional expressions to the FPGA primitives (LUTs, adders). I have implemented a transformation which creates submodules for each functional expression: a <= b + c; becomes a <= adder.add(b,c). I've been looking at Lava to get from here into the FPGA.

We can apply computationally invariant transformations to the module structure to optimize for various cost metrics (see my previous entry on using an economic model as an optimization strategy). Module flattening takes an entire sub-directory structure and flattens it into a single directory. Rule merging takes two disjoint processes and combines them, increasing speed and resource consumption (sketched below). Rule sequentialization splits rules up to conserve resources. Place-and-route optimizations should also be done, moving modules around within the FPGA to optimize for speed, power and area.
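Rule merging, for instance, falls out of the (guard . action) representation directly (a sketch; it assumes the two rules touch disjoint state, since the merged committables are applied in sequence):

    ; merged guard: both guards hold; merged action: both succeed, fused commit
    (define (merge-rules r1 r2)
      (cons (lambda () (and ((car r1)) ((car r2))))
            (lambda ()
              (let ((c1 ((cdr r1))) (c2 ((cdr r2))))
                (and c1 c2
                     (lambda () (c1) (c2)))))))  ; commit both as one action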

I will put it all online as soon as I write better documentation and work out some demos. If you're interested in seeing the GCD demo please contact me! The most immediate next step I'd like to take is to compile a streaming language into this world of guarded atomic actions.

Sunday, December 24, 2006

Low Power Filters Using Polyphase Decomposition

Polyphase decomposition is a method of transforming a filter into k subfilters, each running at 1/k of the sampling rate. The benefit is that reducing the rate a filter runs at by a factor of k can reduce the power required to implement it by a factor of roughly k^3, since the clock slows by k and the supply voltage can be scaled down along with it. Of course, if we require k times more filters to implement the system, we may only get a k^2 power benefit. I have to play with some real systems to get data.
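The back-of-envelope version, taking dynamic power as P \propto C V^2 f and assuming the supply voltage scales roughly linearly with clock rate (the assumption doing the work here):

    P \propto C V^2 f
    P_{sub} \propto C (V/k)^2 (f/k) = P / k^3      % one subfilter at 1/k rate
    P_{total} \propto k \cdot P / k^3 = P / k^2    % all k subfilters together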

Friday, December 01, 2006

Asynchronous, Adiabatic Systems

I've been reading a bunch of papers about adiabatic logic, asynchronous logic, and optical interconnect and switching. These seem like three interesting methods that could alleviate the speed/power tradeoff.

When most people in the computing world hear "SOA" they think service oriented architecture. For the real EE geeks, Semiconductor Optical Amplifier may come to mind. Optical circuits could potentially run 100s of times faster than electronic components and consume very little interconnect power.

Adiabatic systems recover the energy used to perform a function by using power-clocks as the 1 and 0 supply rails and maintaining a copy of the function's inverse so that there is no entropy change in the system. The result is that very little energy is lost by the system. As the time for a transition approaches infinity, the energy consumed by the function approaches 0: the energy dissipated per transition scales linearly with the transition rate, so dissipated power scales with its square. Since more logic is required to compute the inverse, there is an area tradeoff. Additionally, energy must be injected into the system to sustain the switching, which adds complexity to the system design.
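The standard result for ramped charging of a capacitive load C through resistance R over a transition time T, next to conventional switching:

    E_{adiabatic} \approx \frac{RC}{T} \, C V^2 \longrightarrow 0 \text{ as } T \to \infty
    E_{conventional} = \tfrac{1}{2} C V^2              % independent of T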

Asynchronous systems do not have a clock, and must rely upon asynchronous handshakes among communicating components to function correctly. These handshakes could potentially provide the energy required for an adiabatic circuit to function. I've heard of Achronix making high-speed asynchronous FPGAs. I'm not sure if asynchronous, adiabatic FPGAs exist yet, but I'm willing to bet they will eventually.

I've been concocting an asynchronous FPGA programming method based on rules and guarded atomic actions. Two points of that scheme worth noting: the lumped RAM-and-router view of an FPGA, and the use of an FSM to locally manage the FPGA's routing table. Locally managed routing presents a miniature microcoded architecture that sequentially implements functions, trading off functional units for RAM and thus enabling "virtualized" structural recursion.

Asynchronous FPGA research at Cornell
Some references on Adiabatic Logic:
Asymptotically Zero Energy Computing Using Split-Level Charge Recovery Logic
Adiabatic CMOS gate and adiabatic circuit design for low-power applications