Tuesday, December 26, 2006
Methods make up the interface to a module, and may be externally invoked. For example, a module may call any of it's submodules methods within an atomic action. Rules, on the other hand, are entirely internal to the module. Rules may be executed when their guards are true, but there is no imperative within the language to force a particular rule to execute when it's guard is true. Here is an example of a GCD coded in Bluespec.
Bluespec works well as an HDL because the guarded atomic action notion is an exceptionally powerful way to think about designing systems. For starters, the language has no concept of time and is implicitly asynchronous (though I presume all physical implementations are synchronous). Rule based system specification is nice because it provide a clean constructive approach to solving a problem: add rules which transform the system state closer to a terminal state until all possible non-terminal states are accounted for. The other nicety of specifying what the system may do as opposed to what it must do is that it allows for many possible automated optimization strategies in the rule-execution system to minimize system costs (power, area, time). This is a rich area of study and will probably produce many graduate theses ;-)
Our project was focused on sequentializing rules in Bluespec to minimize the amount of hardware required to implement a system (at a tradeoff to speed). For example, the single rule: A <= B + C; B <= D + E; could be split into two rules so that only one adder would be required. This computationally invariant transformation was done with valid Bluespec code as the source and target.
The second aspect of our project was to explore how we could create an interpreted Bluespec-like scripting language. The deficiencies we saw in the Bluespec language are not related to it's functionality as an HDL, but are rather deficiencies common to hardware description in general. In general, HDL types are static and strict, meaning a module or a register are bound to a type at the time they are defined. Some HDLs do have polymorphic type systems (particularly those based on Haskell such as Bluespec). Also, there is no Hardware Description Language with methods that allow dynamic creation of new modules, rules or methods. Dynamic elaboration is desireable in the context of reconfigurable arrays. The language we created for the class was called BlueScheme, which inherited the dynamic typing and first-class procedures from the Scheme language, and guarded atomic actions from Bluespec. The result is a language which supports untyped registers, methods and modules and dynamic modification of modules, methods and rules as an atomic action.
I've independently continued working on this BlueScheme language since after the class has finished. The rest of this document is a specification of how it is designed. A module consists of three hashtables for child-modules, rules and methods. Rules have a guard procedure and an atomic action with no parameters that may not return a value. Methods have a guard procedure and an atomic action which may accept arguments and return a value. Child-modules are the modules whose methods may be called from within a modules atomic action (note that every module is it's own child and is refered to as "me"). I suppose "child" a bit of a misnomer, since multiple modules may share a child module (multiple different modules may use the same GCD).
Rules or methods may be attempted and may succeed or fail. The source of an attempt may be a module which independently attempts a rule or the user who may request a method of a module. If a module attempts a rule or a method is externally invoked and the guard returns false then the failure is propagated to the source, otherwise if the guard succeeds the atomic action will execute. If the atomic action fails, then again the failure signal will propagate to the source. If an attempted atomic action succeeds then it returns a committable action. A commitable action is a thunk (procedure with no arguments) which consists of a set of assignments where the value to be assigned are non-reducible expressions (ie numbers, symbols, lists, strings, modules, methods, rules etc.). If a method Foo returns a commitable action within method Bar's atomic action, Foo's commitable action will be propagated together with the result of Bar. If all attempted actions return to the source without failure, then the source will commit all of the commitable actions by executing the thunk.
Since I have enabled self-modifying modules, I've eliminated registers from the language semantics, and replaced them with modules that support set and get methods (calling set modifies the returned value of the get method). This allows me to add rules to registers that have a random guard and flip the register value or make it disfunctional in order to mimic defective or faulty systems (useful for designing for fault/defect tolerance).
Currently I have only implemented a round-robin rule execution system which attempts rules in a round robin fashion. There is a lot more work that could be done on this end...
The language currently uses Scheme syntax, though we wrote a Bluespec parser and printer for our project and a design that does not contain any dynamic parts could already be converted into valid Bluespec and compiled to C or Verilog. I am also interested in seeing how to use this as a runtime language for a dynamically reconfigurable FPGA. One realization would treat the language as an extension of a file system. In this view we treat modules as "directories," treat rules as running processes scoped within a directory and treat methods as executables. When a module needs acceleration it's directory structure could be moved into the FPGA. Presumably it should still be able to interface with modules that are implemented in sequentially executing CPUs.
In order to transform a module structure into an FPGA structure, we need to type-constrain our modules and methods and map all functional expressions to the FPGA primitives (LUTs, adders). I have implemented a transformation which creates submodules for each functional expression: a <= b + c; becomes a <= adder.add(b,c). I've been looking to Lava to get from here into the FPGA.
We can apply computationally invariant transformations to the module structure to optimize for various cost metrics (see my previous entry on using an economic model as an optimization strategy). Module flattening takes an entire sub-directory structure and flattens it into a single directory. Rule merging takes two disjoint processes and combines them (increasing speed and resource consumption). Rule sequentialization split rules up to conserve resources. Place and route optimizations should also be done, moving modules around within the FPGA to optimize for speed, power and area.
I will put it all online as soon as I write better documentation and work out some demos. If you're interested in seeing the GCD demo please contact me! The most immediate next step I'd like to take is to compile a streaming language into this world of guarded atomic actions.
Sunday, December 24, 2006
Friday, December 01, 2006
When most people in the computing world hear "SOA" they think service oriented architecture. For the real EE geeks, Semiconductor Optical Amplifier may come to mind. Optical circuits could potentially run 100s of times faster than electronic components and consume very little interconnect power.
Adiabatic systems recover the energy used to perform a function by using clocks as 1 and 0 and maintaining a copy of the function inverse so that there is no entropy change in the system. The result is that very little energy is lost by the system. As the time for the transition approaches infinite, the energy consumed by the function approaches 0. The amount of energy consumed is dependent on the square of the rate of transition. Since more logic is required to compute the inverse there is an area tradeoff. Additionally, energy must be injected into the system to make up for the switching rate which adds complexity to the system design.
Asynchronous systems do not have a clock, and must rely upon asynchronous handshake among communicating components to function correctly. This asynchronous handshake could potentially provide the energy required for an adiabatic circuit to function. I've heard of Achronix making highspeed asynchronous FPGAs. I'm not sure if asynchronous, adiabatic FPGAs exist yet, but I'm willing to bet they will eventually.
I've been concocting an asynchronous FPGA programming method based on rules and guarded atomic actions. Two points in there worth noting: the lumped RAM and router view of an FPGA and the use of an FSM to locally manage the routing table of an FPGA. Locally managed routing presents a miniature microcoded architectures to sequentially implement functions, trading off functional units for RAM and thus enabling "virtualized" structural recursion.
Asynchronous FPGA research at Cornell
Some references on Adiabatic Logic:
Asymptotically Zero Energy Computing Using Split-Level Charge Recovery Logic
Adiabatic CMOS gate and adiabatic circuit design for low-power applications
Friday, November 24, 2006
I've been researching power-optimized partitioning on tiled architectures. Something will come out of that soon.
Friday, November 03, 2006
As the number of memory ports, functional units and simultaneous instructions rise, the overhead required to manage the re-order-buffer and register renaming on such a behemoth hardware precludes hardware scheduling as the optimal strategy. Especially since we nearly always run code that has executed before, it seems an awful waste of energy to redo the scheduling in hardware due to the requirement for random instruction issue. Granted, multi-threading on a single core produces the effect of random issue, though especially since we are making the transition towards multi-core, the assumption of random issue should be revisited.
Reduction of scheduling overhead can be accomplished by running the data-flow scheduler as a software process either during compilation or during the phase when an instruction page is loaded into the instruction cache. If we use software compilation for such a VLIW processor then the operating system must manages a file-system of OoOE schedules for every function executed and so that an instruction page fetch automatically fetches system optimized schedules. This is practical because bulk memory for such storage is cheap.
With infinitely many renaming slots, our execution schedule in a superscalar processor resembles a Verilog netlist for the optimal pipeline of our system. To support a large number of virtual registers it would be necessary to "delocalize" register files. Mapping virtual registers to physical registers is now more akin to FPGA place-and-route and may be optimized using similar algorithms.
We must also produce a variety of local caches to support loading and storing of operands and results. We should have the ability to escape to a main memory system on a cache miss (perhaps an on-chip level 2 cache that triggers an interrupt on a second miss).
Once we succumb to software optimized hardware scheduling, it's a small logical step to say that our hardware should allow some amount of reconfigurability to maximize functional unit utilization. The structure of such a hardware device would more closely resemble an FPGA than a single core superscalar processor.
All of this ignores branching of course, which presents an interesting and complicated control problem. More on this another day...
Tuesday, October 24, 2006
The internet has made essentially free the "cost-of-distribution" of accessing extra computing power for things as SETI@Home and file sharing networks. In these cases the resource being distributed is essentially free. In contrast, an open source developer must dedicate substantial time to his work. The article poses the question: why would a developer freely distribute open source works? Developing trust and reciprocity, as well as gaining respect of the programming community are offered as reasons. This is true, but I think a major reason why programmers release open source code is because they enjoy the products of their coding and they wish to share the fruits of their labor with others. Open source software is like the grafitti murals of the software industry.
The essay on sharing is by Yochai Benkler of the Yale Law School. Benkler argues that sharing is a new mode of economic production using distributed computing and carpooling as case studies. Carpooling and distributed computed are similar in that they are very wide-scale sharing practice of a privately-owned good, whose excess capacity has established market models. Benkler writes: "In these characteristics carpooling and distributed computing are like peer-to-peer networks, or ad hoc wireless mesh networks, or like the labor individual programmers put into free software."
Benkler uses the notion of "bumpiness" to define the quantization of capactiy. A car is often designed to accomodate multiple passangers and thus generally has excess capactiy. Similarly computers have been designed to operate extremely quickly for media intensive applications, but are often used simply for word processing. In both domains, cars or computers, there is an excess capacity due to the quanitization of the supplied good.
There are established market models to capitalize on quantization inefficiencies. Bus systems meet transportation demand and data centers meet computational demands. In both of these models we consolidate multiple subsystems in order to achieve high utilization. The added efficiency can often present an economic advantage to the bus rider or data-center customer alike. Another relevant example of consolidation and sharing reducing quantization induced capacity innefficiencies are fabless semiconductor companies that eliminate extremely large capital expenditures by outsourcing fabrication.
A great way to identify opportunities for business innovation is by identifying such quantization inefficiencies. In order to identify if there is a market for resource sharing, we must consider the model of both the consumer and the requisite material providers. How must our consumers adapt to apply our innovation? How much dependency is there on prerequisite component providers? In the case of fabless semiconductor industry these may be more specific: how must the engineer specify the IC design? Can fabrication facilities be made "exchangeable?" If there is a variance between fabrication facilities, can we measure tradeoffs affecting the design. For example suppose we may decide between using a 90nm and a 65nm fabrication facility to implement an HDL specification. If the resulting 65 nm chip would have higher unit costs at the anticipated volume level, yet run faster and consume less dynamic power such that, it may be sold at a higher price. Depending on cost analyses, we may need to make a decision between a variety of different options.
The very same decision must be made by airline consumers who must choose between faster or more convenient, but also more expensive flights. If time is a strong constraint, as it may be for a person flying on a tight schedule, then the person may be willing or required to pay a premium price for a particularly convenient flight. An airline must optimize its flight scheduling to make sure each different type of airplane is maximally shared in order to maximize profit. The pricing and scheduling model for an airline company reflects demand for different flights. Jet Blue has been able to dramatically reduce operating costs by managing a homogenous fleet. By reducing operating costs and serving high demand markets they can reduce prices to maximize quantity sold and thus maximize profit. Yet clearly there is a market space for an alternative to Jet Blue, and we will not want to use a passanger ailrliner to ship cargo. Similarly some degree of heterogeneity will maximize the efficiency computing array. Could the management costs of a heterogenous computing array present an overhead such that we would prefer to view computational resources as homogeneous clusters?
In order to have an effective process scheduling system for a heterogenous reconfigurable computer, processes must share resources with varying costs and constraints. In a heterogeneous array, there may be x86s, GPUs, FPGAs, which overlap in functionality, but are optimal for specific instances, just as there are planes, buses, cars, bikes, and feet which can all get you there. By incentivizing "car pooling," we may find ways to maximize efficiency. It's not even a distant a metaphor to suggest that shared resources should get to travel in the fast lane.
An algorithm to optimize computational resource sharing should inherit semantics and structure from a general economic model. Such an economic model would provide a method for multiple locally optimizing (greedy) process scheduling agents to globally optimize the system.
Examining strategies derived from athropological and economic bases provides a good perspective for exploring complex multi-agent management models. Benkler's essay presents a case study on "dynamic ad-hoc carpooling." An example relevant to security, is that a women is much less likely to carpool with two men both of whom she does not know (page 286 of the law journal). Thus "security" and "trust" are relevant motivational factors for making a sharing decision. That same women would probably not have any objection to sharing an airplane regardless of the gender ratio. This is based on a trust relationship with the service provider: each agent sharing the bus has been authenticated prior to enterring the shared arrangement. In a shared, distributed computer system, methods of autheticating agents and forming trust relationships between agents must be established to guarantee some level of security.
Security issues related to technological innovations are usually not unbridgeable gaps in the system design, but rather psychological gaps. Such situations similarly require that the client and vendor have a means of forming a trust relationship or autheticating eachother through other relationships. There may be a psychological barrier to convince someone to outsource their data-center due to security concerns, yet Google mass appeal demonstrates that trust relationships can develop between a user and a comptuing service provider.
Distributed computing networks present two separate security issues on the requisitie provider and the computation consumer. Suppose I were to sell a service using spare cycles for distributed analytics. I buy the spare computing cycles from companies A, B, C, D and E, and use them to perform distributed analytics for company A. I would have to be able to convince A that its analysis will be secure, and not revealed to B through E. Similarly I would have to convince B through E that their internal network is secure when they are serviceing requests from A. This problem is a psychological problem rather than an inherent limitation of the computing system, since all of the computers invovled can be securely consolidated to provide A through E with their computation requirements already. It is likely that this will be the most major limiting factor to utility computing.
Finally, on the economics of sharing and open source programming. During my experience at Xilinx, I got a better understanding of the motivation for closed source tools. It has everything to do with dependency and part lock-in, less to do with support and only psychologically related to security (one of the great one-liner that I took out of DEFCON this year was "The system is secure because it is only accessible by our proprietary readers"). Today I read a thread from 2000 on comp.arch.fpga about FPGA openness. The FPGA hardware market is hindering it's own growth by each manufacturer structuring it's toolset around closed specifications. There is an enormous waste of potential with the goal of disincentivizing interchangeability. This is contrary to the natural tendancy towards commoditization of computing components. Xilinx has been enjoy high margins and even with respectably high volume, though I fully believe they can trigger an increase volume by a factor of 5 without causing margins to drop by more than 1/5. This argument is clearly hand-waving, but the point is that they could be able to take on new customers without losing anything substantial. The first FPGA company (there aren't many to choose from) to open critical parts of their system and toolset will see an appreciable increase in their computing market share because they will be more easily adopted as accelerator co-processors. This may in fact be a case where 1024 chickens (open source programmers) would plow the field faster if only the two strong oxens (Xilinx and Altera) would get their proprietary formats out of the way.
With or without Xilinx, the market for accelerator co-processors will continue to grow as consolidated heterogeneous data-centers become more and more commonplace. I'm willing to bet the next seven years of my life to it.
Thursday, October 19, 2006
Anyway so more on GALS. GALS is nice because it eases clock distribution which consumes a large portion of the power on the chip. With multiple local clocks you can control dynamic voltage and frequency scaling in a localized manner to power optimize a process load balancer. Ring oscillators provide a clock for a small region of the chip which asynchronously communicates with other regions.
Asynchronous communication schemes are pretty well established, especially since the Internet and asynchronous processing methods are emerging now with AJAX applications dispatching client side programs. On chip asynchronous processing should be able to inherit it's programming metaphor from how people run distributed applications on the Internet.
when the client webpage loads it says:
(tell server '(repl-requested))
(tell client `(authenticate ,(pop-authentication-agent!)))
the authentication agent profiles the client REPL (how long does it take to respond? who is the user? do we trust this anonymous user?). After authenticating the client REPL and determining privileges, it tells the client REPL to report to a task manager.
The task manager provides the client with a set of definitions and instructs it to compute some process. At the end of the process the client will return the result and request a new process. If the client side comes upon an undefined symbol it can make a server request to define the symbol. I think we'll get to this point in a few weeks.
What will be really interesting will be to see what kinds of applications can attract a lot of user time, and benefit from distributed computation. Online games with distributable AI seems like a good fusion of these characteristics and might provide an alleviation for the scalability issues associated with high traffic.
This web programming seems like a departure from reconfigurable computing but the same GALS programming provides a strong metaphor for FPGA control and a lot of the process management must be "reconfigurable" in order to tolerate faulty or defective clients.
Here's a well-written article on AJAX.
Wednesday, October 11, 2006
The problem is that reconfigurable computing isn't a hardware problem. It's a software problem and it's a really big and hairy software problem too. There's no clear starting point either: part of the software problem is even just identifying which problems need solving. Most articles about "The Next Big Thing in Computing" have a skeptic keenly observe that tool support for reconfigurable computing is lacking. Lacking tool and system standards we are unlikely to truly produce any "next big thing."
So several companies have come along to deliver compiler tools to make development easier in some way or another. These companies fair better since they can gain traction without enormous expenditure. But reconfigurable computing still hasn't made its promised impact. This is because tools are only a small slice of the software problem. What we have is a chicken and egg problem: do the tools create the demand for applications or does demand for applications create the tools? The real software problem is to produce an important application that relies on reconfigurable computing technology. Driving demand for reconfigurable computing applications will have the effect of growing the toolset for developing such applications.
This feedback is nice, but it cannot be forced by pumping up compiler tools alone. Without substantial attention put into runtime environments and user interfaces it is unlikely that reconfigurable computing applications will take off. If you want to know how I think I can overcome these barriers, please email me.
Tuesday, October 03, 2006
Most profiling tools only consider execution time on a single CPU or a homogeneous array. What metrics are useful for profiling an application on a heterogeneous and reconfigurable platform?
I offer the unit GOPS/$ as a metric for computational efficiency. GOPS stands for “billions of operations per second” and is dependent on the application. TeraFlops, for contrast, is a measure of explicitly floating-point operations. The cost function is dependent both on the type of operation being performed and the computational fabric.
To quantify GOPS I will use the concept of information entropy. If a boolean random variable is evenly distributed, p(0) = p(1) = .5, then we gain 1 bit of information from resolving its value. Consider a two input NOR gate with evenly distributed inputs. The output of the NOR gate has p(0) = .75 and p(1) = .25. Resolving the output of the NOR gate provides us with .811 bits of information.
Consider now if both of the inputs to the NOR gate are independent and have p(1) = .99 and p(0) = .01. The output of our NOR gate now has p(1) = .0001 and p(0) = .9999. Resolving the NOR gate only provides us with .0015 bits of information, substantially less than before. Yet the circuitry providing us with this information has the same cost as in the previous case.
"Information entropy" provides a "physical" basis for measuring an operation. If GOPS is billions of operations per second, then it's "physical" unit is (information entropy / time). GOPS/$ = "information entropy / (time * cost)" If the information entropy of an output pin is small, then it may not be worth the cost of implementing the hardware.
For example, consider an adder whose inputs have high probability of being small and low probability of being large. The information entropy of the output bits is very low for the high order bits. Depending on the costs and probabilities, it may be worthwhile to use an 8 bit adder instead of a 32 bit adder. If there is some finite probability of inputs being larger than 8 bits then we will need some detection circuit to handle this case. This adds a fixed cost to the circuitry. We can quantify the cost as follows:
$(8 bit adder) = [p(<> 8 bit input) * $(> 8 bit addition)] + [$(detection circuitry) + $(adder circuitry)]
If we compare this cost function across a variety of bit widths we can deduce an optimal bit width for our adder. The cost functions don't look exactly like this for all bit widths: if we had used a 4 bit adder for example, the cost for performing 4 bit, 8 bit, 12 bit, and 16 bit addition would all be different and would have to be taken into account.
We also want to consider profiling across multiple co-processors. Suppose we wish to perform N FFT operations of size s and we have the option of using either our CPU, a GPGPU or an FPGA. Let's suppose we only wish to perform 1 FFT of size 128. In this case it may not be worth the overhead of offloading the process to the GPGPU or the FPGA since we only need to perform 1 operation. As a result, GOPS/$ is maximized by using the CPU to compute the FFT.
Consider now that we have 128 FFT operations of size 128. In this case, the throughput benefits associated with offloading the process ammortizes the cost of doing so. We may offload the task to either the FPGA or the GPGPU. If the FPGA already has FFT circuitry configured and assuming it performs FFTs substantially better than the GPGPU, then the task should be performed in the FPGA. However, if the FPGA is not configured with an FFT, then for a problem of this size the overhead associated with configuring the FPGA may preclude using it for this operation. Thus we will use the GPGPU if the FPGA does not already contain an FFT. Now suppose that we want to perform 2048 FFTs of size 2048. The cost of configuring the FPGA for this task is ammortized by the size of the job and thus it will always be beneficial to perform the FFT on the FPGA.
The result of this discourse is that choosing an accelerator methodology in a heterogeneous reconfigurable fabric may be a runtime consideration depending on the size of the operation to be performed and the configuration of the system. A load balancing subsystem will need to simplify the task of profiling an application by determining some high dependency variables. To keep the overhead associated with a run-time load balancer extremely low, we will want to generate a "condition set" at profile-time and link each condition with a particular configuration and methodology.
To manage such a load balancer, I propose using a financial model in which processes get "paid" for performing their tasks and use this money to pay for the resources they use. A well designed economic system will have its basis in meaningful cost metrics. Some primary factors for the cost function is the power, time, thermal dissipation and area required to perform the computation. Remember that GOPS/$ has units of (bits / (time*cost)). We put time into the denominator and into the cost function since all things being equal we would prefer a faster solution for the same cost. If speed costs substantial amounts of energy we will need to take that into consideration. The cost associated with time is split between to factors: the ammortized cost over the device life of the hardware and the urgency of the computation.
The urgency factor of the time cost of an operation is highly dependent on it's location in the pipeline. For example, if task A and task B are both prerequisites for task C, then we will want to accomplish A and B as fast as possible. Suppose that A takes 4 times longer than B if we solely optimize for time. We now have flexibility to minimize the cost of B. For instance we may lower the voltage of the circuitry processing B which will slow the circuit, but may save us substantially in terms of power. Suppose B is a 32 bit addition, we may decide to transition it to an 8 bit adder to save on space though require four times as long to produce a 32 bit result. Depending on the cost functions we may choose to go with a middle-ground: a 16 bit adder with a slightly lower voltage that still completes the task in time. This decision may be made to avoid the opportunity cost associated with not using the circuitry at it's full compute capacity,
Alternatively, we may find that task B is common in our process schedule that we wish to share the resources to perform B with different processes. We may choose among various methods to share B. If task C is highly critical, we will want to use a dedicated-priority sharing manager that will only share B if there is no request for C. Similarly a non-dedicated-priority sharing manager will assign priorities to each of the possible tasks who may want to use its resources. Presumably a task could pay more to have higher priority. A non-priority sharing manager offers its resources at the same price to everyone, with no guarantee that a given task will receive priority, though there will be some guarantee on latency.
An adaptive profiling and load balancing mechanism will also need to be optimized to determine how to minimize the overhead costs associated with profiling and optimization. In order to do this, we will want to keep a strategy database to provide the load balancer with information about how to manage a process topology (a set of processes to be executed simultaneously). We can ascribe a set of modes for a task dispatched by the load-balancer. In the simple "do nothing unless told to" mode, the load-balancer only dispatches based on specific directives from the application. In "simple management mode" the load balancer will use it's strategy database to manage only those process topologies it has encountered before. In "agressive management mode" the load balancer will make assumptions about the process topologies (such as bit width assumptions or timing assumptions) to relate the topology to previously encountered topologies. Presumably there is some gradient of optoins between simple and agressive management modes. We will prefer the simple management mode or the "do nothing unless told to" mode for "semi-constant" (mostly the same process topology through all time) or diagnostic applications which we want to have lower level control over the hardware.
The aggressive mode will be preferable when we have flexibility to tinker with the application while it is running to determine more optimal partitioning configurations. If we take the aggressive mode to it's logical extreme we have "profile-mode," in which execution is extremely slow, yet the load balancer will be able to produce an analysis of the execution of the task on across a variety of platforms and topologies. We would probably want to enter "profile mode" during idle time and we will want to consider process topologies that we have encountered in the past to build up the strategy database.
Sunday, October 01, 2006
Of course the question of which accelerator to use is extremely dependent on profiling tools and compilers, but more importantly, scalability and "future proof"-ness. Here is where I think FPGAs will win since they will continue to be technology drivers and will lead the way to 45nm and below. One other nice facet to having a data center full of FPGAs, is that you can identify useful hardware structures and provide feedback to manufacturers on which features to "harden" into the next fabric.
More important than developing the framework for accelerating applications is actually delivering on the applications even using existing tools. I see a convergence between utility computing (data center outsourcing) and the accelerator co-processor markets: accelerators will benefit from utility computing (where higher utilization will ammortize hardware investment) and data centers will benefit from accelerators (increased GOPS/$).
Friday, September 29, 2006
another article in the register.
"The end result will likely be a multi-core world where the common general purpose cores of today sit alongside FPGAs, networking products and other co-processors tuned to handle specific tasks at remarkable speeds."
Tuesday, September 26, 2006
I think in terms of amortized efficiency benefits, it is likely that the true benefits of multi-core will be "assembly-line" throughput benefits. It is OK to have a very long pipeline spanning 80 processing kernels if such a methodology substantially increases system throughput even at a cost to single request delay. Even many realtime systems will benefit despite these throughput / delay tradeoffs.
Working away on mapping graphs to a lattice....
Thursday, September 21, 2006
Wednesday, September 20, 2006
Tuesday, September 19, 2006
My current plan is to build up an interface that allows me to modify code and provides a clean graphical interface to a the system. After the development environment is working and I have an FPGA board exposed to the internet, I will host the system on an FPGA board to start coding the hardware scheduler.
I've also been cooking up the system to compile Scheme to LUTs. Scheme's "bit-string" abstraction is very useful for encoding LUTs. A brief summary example of the LUT definitions:
(define (lut configuration in)
(if (bit-string? cofiguration) configuration (unsigned-integer->bit-string configuration))
(define (lut2 configuration in1 in2)
(lut configuration (bit-string-append in1 in2))
(macro-define '(name configuration)
'(define (name in1 in2) (lut2 configuration in1 in2))
'((zero2 0) (nor2 1) (pbq 2) (pb 3) (pqb 4) (qb 5) (xor2 6) (nand2 7) (and2 8) (nxor2 9) (q 10) (pborq 11) (p 12) (porqb 13) (or2 14) (one2 15)))
I have written about 8 pages of stuff demonstrating how to compile functions to a hierarchy of LUTs using this abstraction with let to assign temporary names to internal pins. When these functional descriptions get merged with the lattice processing model I will be able to simulate state evolution and delay. I will need to add an algorithm to infer an n-input LUT for arbitrary n-input combinatorial functions in order to map to real hardware.
Sunday, September 10, 2006
I think FPGAs scale better than CPUs in terms of compute density and power efficiency. I also think theat they force a more scalable programming model by virtue of their architecture. Multi-core chips force the question is "how do we manage concurrent processes across a varying size field of processing elements." The answer is closely akin to an FPGA programming methodology. Not surprisingly, there isn't really a good programming methodology for multi-cores or FPGAs.
Most of the "von Neumann" bottlenecks are associated with the cache structures in a microprocessor. Since FPGA's primitive elements are memories I like to think of them as "smart caches" that can maintain data locality better than RAM and can have operations moved to the data instead of the other way around.
I am meeting with Krste tomorrow to get access to the BEE2 board. I will also put up a CVS on fpgaos.com soon.
Thursday, September 07, 2006
In the utility computing market, the major objective of computing providers and consumers will be to boost GOPS/$ for specific applications. How will FPGAs find their way into such a market? Will FPGAs give that group of MIT students a competitive advantage?
Sunday, September 03, 2006
Reconfigurable computers will win in settings where the costs associated with transistioning an application to a reconfigurable computer is low and the GOPS/$ that the reconfigurable solution provides for a the application is higher than other solutions. Virtualization may effectively make a reconfigurable solution easily compatible at a cost to GOPS/$. For some applications it may be worthwhile to tune the engine to maximize computation efficiency.
Friday, September 01, 2006
The CTO of Intel gave a talk at Stanford last week looking to stimulate multithreaded software development. It is not surprising that Intel needs software to target it's multi-core offering. The problems for software developers is to design scalable systems so that as Intel cranks out more and more cores on their chips, we don't have to undergo a software revision each time.
Another article is about algorithms for determining sub-oceanic topology from seismic data. Allan Willsky of "Signals and Systems" fame, has been working for Shell Oil on ways of crunching lots of data to speed up the mapping of underwater salt deposits. Sounds like an opportunity for acceleration.
Tuesday, August 29, 2006
I've been thinking a lot about the implications of software virtualization to FPGA computing. I'm intrigued by the idea of implementing a hypervisor in a tightly coupled FPGA/CPU system. I imagine a system with an FPGA connected to multiple memory units and managing the data flow for the various active virtual machines. The software layer of the hypervisor instructs the FPGA to move data to the processor or to function accelerators within the FPGA. The FPGA would probably host the hypervisor agent in order to concurrently manage instructions from the various virtual machines. Instructions non-native to the CPU may either be executed within the FPGA or "decoded" in the FPGA and passed along to the CPU.
If a virtualization layer sits just above the OS layer consisting of a resource sharing scheduler and a dynamic load balancer then applications would be able to take advantage of fine-grained optimizations while running on a virtualized compatibility layer. The key to making a massively scalable operating system is providing mechanisms for high level agents to inherit from low level optimization. A method for agents to manage "costs" and resource sharing provides an elegant optimization strategy that spans across all granularity levels.
Tuesday, August 22, 2006
Taking advantage of paralellism is a necessary step for the advance of computing. Chip-multiprocessors (CMP) and FPGAs enable concurrent computation paradigms. CMPs allow for thread-level parallelism (TLP), where each core contains an instruction cache consisting of the current set of supported functions. Reconfigruable datapath arrays allow for networks of fixed functions to be interconnected through routers to take advantage of instruction level parallelism (ILP). FPGAs offer even finer granularity of control allowing a network of reconfigurable functions often enabling bit level parallelism (BLP) in addition to ILP and TLP.
The granularity of reconfigurability also has implications to data locality. If we have fine grained control over where we may define memory and registers in our array, then we may localize our variables near the operations that use them. Since memory bandwidth is the primary cause of the "von Neumann bottleneck," on-chip data locality provides a solution.
The cost of reconfigurability is the amount of area required per operation, which implies a lower clock frequency and higher power consumption per function when compared to a fixed implementation. Still it is often impractical to have a fixed-ASIC implemenation for all computing functions. Thus we are left to choose between a reconfigurabile chip and a general purpose CPU. A reconfigurable device can often levarage parallelism to achieve a decrease in total execution time and total power consumption over general purpose microprocessing units.
It may not be the case the a reconfigurable device wins over a general purpose CPU. If a CPU device is more suitable for an application it would be wise to use it. A mixed granular structure incorporating reconfigurable logic within an array of microprocessors can maximize performance by alleviating speed issues for explicitly single threaded applications that cannot leverage BLP, ILP or TLP.
My goal is to create a self optimizing operating system for such reconfigurable heterogeneous arrays. Compatibility with current computing paradigms is a primary objective to minimize barriers to adoption. To maintain compatibility the primary application will be a virtual machine server that manages reconfigurable hardware. The operating system seeks to minimize some cost function while executing some set of processes. This cost function should be based on an economic model of computation concerned with metrics such as power consumption or execution time. If real estate can be used to save energy and time then it has an implied value.
The system is managed by a network of agents. Agents are capable of receiving and responding to signals and objects. Signals and objects are related just as energy and mass are related--sometimes it is useful to look at things as waves, and other times as particles. The agents communicate with one another subject to the constraints of some physical environment just as the propagation of electromagnetic waves is constained by the physical hardware.
To understand the interactions of the network of agents with their environment, it is important to have a model of an environment. An environment consists of a set of state variables, a set of accessors, and a set of state evolution functions. State variables are the information we might wish to know about a system for example the temparature at a location, the configuration of a LUT, the contents of a register or the capacitance of a MOSFET. These examples demonstrate that state variables exist with different scope and different domains.
State variables that are static in scope are the physical constraints that may not be altered by the agents of our operating system. For example, the electro-magnetic constant, the length of copper between two transistors in a circuit, the dopant density of silicon, etc. State variables that are in constant scope provide a configuration layer. This configuration layer may be accessed by reconfiguration agents.
It is generally desirable for things to behave predictably so that we can constrain an environment model and adapt to an evironment. However, this does not mean that we may assume absolute determinism; we should provide a means for semi-static and semi-constant variables that permit some randomness. This will provide support for defective or faulty components.
There should be processes that can simulate a particular environment to allow for behavior predictions. There could also be feedback from some of the physical environment to monitor the system and control the system.
Methods for optimizing the cost function include:
resource sharing and recycling
dynamic load balancing
power and frequency scaling
place and route
These processes are "computationally invariant" which implies that they only alter the cost of execution while the system functionality remains the same.
Monday, August 21, 2006
Tuesday, August 15, 2006
In terms of economics GOPS/Watt may be a more important metric than GOPS/(Fixed Cost) especially if long device life amortizes fixed costs. GOPS/Watt is most critical in battery powered applications.
Anyway, according to my unified theory of reconfigurable computing, these kinds of devices live and die by software support so we'll just have to wait and see.
--Edit June 10, 2008:
Cornell's Asynchronous FPGA group has Technical papers about the research that lead to Achronix products.
Monday, August 14, 2006
I approach reconfigurable computing as a software problem first and a hardware problem second. It's something I believe makes business sense, given that COTS reconfigurable chips already exist but no real software exists for them yet. Since no one in the FPGA world has come up with a software methodology for reconfigurable computing, it's hard to imagine how new hardware that looks not to dissimilar to an FPGA will somehow enable a new software paradigm. The Tech Review article hit on these issues and it looks like there are some Quicksilver alumni on Rapport's staff, so hopefully they'll leverage their experience in this area.
These chips look pretty cool to me. Hopefully I can "get my shit together" quickly enough and get a team of geeks together to make OS support.
Wednesday, August 02, 2006
A few people have commented on why I would want to make the FPGA OS open source. The primary reason is that closed source is counter-academic and creates a barrier to adoption by those who would only use open source (meet the Linux community). The major barrier to FPGA computing is the lack of awareness among computer scientists, and closed source software tools exacerbate that problem. Also, open source guarantees that the best developers out there will have access to improve the code.
It may seem that I'd be missing a huge economic opportunity by going closed source, but if I place more emphasis on advancing computer operating system technology, money will almost definitely find its way to creating new applications. Even though transparancy is the goal of the OS, there will be an awful lot of code that could be optimized for concurrent execution even after an OS is complete. This will certainly be a non-trivial task, that will greatly reward those with expertise.
The amount of coding required to build an operating system for a reconfigurable computer is also way too massive to consider building without the concesus and assistance of the wider community. If everyone "agrees" on the OS then everyone will use it too, which means the effort won't have been wasted.
Monday, July 31, 2006
This market is wide and open. Reconfigurable computing could potentially have an enormous effect on the next revision of the internet infrastructure. Lack of developers is the only reason I can see why this hasn't taken off already. Most people who observe these issues would agree that current FPGA tools mostly suck, and that concurrent application development requires some amount of "magical power." Today it would be almost obscene to go into an enterprise software company and suggest that they should target a reconfigurable computer with their application.
Ah, but look at the way things are headed. Multi-core processors. Virtual Machines. Interpretted web interfaces. For much of the web server market it doesn't really matter whose chip you use as long as you can support Linux, Apache, MySQL, and PHP. Throw in Java and you'll get a huge chunk of the enterprise market too... In fact, start with a Java VM to grab some compute intensive financial service companies. Deliver more bang for the buck in that market, and lots of people will start listening...
Thursday, July 27, 2006
Tuesday, July 25, 2006
coming soon: fpgaos.com -- As soon as I get back to MIT I will be hosting a website on an FPGA and allow people to actively contribute to the development of the hosting platform.
Monday, July 24, 2006
Tuesday, July 18, 2006
Friday, July 14, 2006
been thinking about Virtual Machines and FPGAs. Both of those two terms are increasing in relevance. they will also start to sync up in co-relevance.
i've made some headway on my thesis work lately. i'm working on a "parallel evaluator" for scheme right now that uses multiple evaluators each with their own strategies.
an evaluator consists of:
an evaluator uses it's listening strategy to communicate with its parent, children and partners. the listening strategy assigns priorities to the different ports. The parent spawned the evaluator. The children are spawned by the evaluator. The "partners" are the other evaluators (non parent or children) that the evaluator shares a port with. the objective of an evaluator is its current process.
with this framework i'm going to start to define different ways of computing things. for example: an evaluator might get a request from his parent to do (foo x). he responds with a promise to do (foo x). he might call out to all his partners and children "can anyone do (foo x)" if someone responds and says "i will do (foo x)" when the parent tries to force the promise, he will tell his parent to talk to them for the answer. if he cannot find someone else to do (foo x) he will compute it himself when it is forced. this is the "laziest" evalutation strategy I could think of on the spot.
I need some descriptive language for managing "evaluator teams" in order to handle the complexity of using multiple parallel evaluators. I want to be able to fit notions like combinatorial functions, memory, state machines, datapaths and processing units into this evaluator framework. i made something similar for 6.891 last semester, by creating an array of nodes that acted on their ports, with surface reconfigurability (using "CONFIGURE" messages) however, the most complex functionality I could demonstrate with that was simple arithmetic pipelines.
This gets closer and closer to the "big idea."
--whoa, this blog entry comes up number one on google search for "concurrent evaluator"
Monday, June 05, 2006
Here is the best I can describe the problem: my understanding of the biology is extremely limited, but the underlying algorithmic problem should be clear (my friend is clearly responsible for any biological aspects of this work). Suppose a set of proteins regulate DNA transcription for genes A, B and C but not genes D, E and F, then there must be some a common pattern in the DNA for genes A, B and C that is not in genes D, E, and F. Unfortunately this DNA pattern matching is not necessarily straightforward: it is not simply the case that genes A, B, and C have the sequence "GGACT" in their regulatory region, which is not in the regulatory region of genes D, E and F. The codes may have a more complex "logic" to them. For example, it may be the case that the pattern is of the form "Does not contain (GGATTC or ACCTAG) within 100 base pairs of a code with a Hamming distance of 2 from ACGGTCCGT." If A, B and C match this pattern and D, E and F do not, then this would explain why the transcription factors in question regulate A, B and C and not D, E and F.
After discussing the problem for a while, we came to the perhaps not coincidental conclusion that a genetic algorithm would be best suited for this problem and that exploiting the parallelism and granularity benefits of an FPGA implementation would offer a lot of acceleration potential. For such a genetic algorithm we test populations of "binding strategies" in an environment consisting of a set of genes joined with a 1 or a 0 depending on if the transcription factors in question regulate that gene or not. The fitness of a binding strategy is measured by the number of times the strategy correctly predicts regulation (1) or no regulation (0). As the algorithm progresses, we select the better binding strategies, breed them and mutate them. We may breed two seperate strategies by combining them with a logical combinator or the "within k base pairs of" combinator. We may mutate the strategies by negating them, by changing, adding or removing characters, by adding or removing logical combinators, or by adding "hamming distance of m" conditions (negating the worst performing binding strategy in a population actually makes sense since it is going to produce a more fit member in the next generation). Generally, we would prefer it if our binding strategies were "simple" to avoid overfitting, and there will be a preference for simplicity in the selection strategy.
This problem lends itself to an FPGA implementation since we may implement an entire population of binding strategies and evaluate their fitness in parallel. We may also perform the population breeding process in parallel and reconfigure the system to evalute the next generation. An well designed approach will be able to share logic among binding strategies in a similar fashion as how the Rete algorithm avoids repeating sub-pattern matching. FPGAs have been shown to produce 1000x speedups over CPU implementations for similar genetic algorithms and also shown equivalently absurd speed-ups for pattern matching algorithms. Certainly we will have to test an FPGA impementation against the cluster supercomputer at the Broad Institute.
One neat idea for accelerting genetic algorithms on an FPGA is to use partial reconfiguration to allow dynamic mutation and breeding during execution. If a member of the population is consistently failing after examining only 10% of the environment data, it may as well be killed off and replaced with a child of more successful members of the population. For this specific problem, after a binding strategies fails to predict the transcription factors behavior on 123 genes, we can recycle the hardware it occupies and replace it with the child of a more successful binding strategy. Alternatively, if a binding strategy succeeds 10% more than any other, it may kill off a "weaker" binding strategy and recycle its hardware for one of its children. This may prove a much better genetic algorithm strategy than using discrete generations.
Saturday, May 27, 2006
The remainder of this blog entry was written on 7/19/05.
3-D chips require less wires. As shown in this thesis and paper, 56% less interconnect is required for a 5 layer chip. Wafer bonding has been thoroughly investigated, and processes compatible with standard CMOS are being refined. Tezzaron is using this technology for memory.
The big problems facing the industry are the lack of good design tools and the issues associated with yield and heat. Design tools will be developed as the processes become more refined. Yield issues and heat need to be taken into consideration in the design. Consider if you have an 80% yield on each wafer; when you have 5 layers of silicon--assuming defects are not correlated to the location on the chip, and no defects due to the bonding process--your yield reduces to 33%. Of course, we are able to have more redundancy with more silicon layers, so we can design systems that are fault tolerant (google: fault tolerant architectures. lots of good stuff). The costs of the chips will probably directly represent the decrease in yield -- good designs and tools will save companies a lot of money (though i shouldn't give away my secrets before i patent them :-)
Cooling higher density chips is the major hurdle towards development of 3-D circuits. A few documents hint that microfluidic cooling systems may be the solution. Georgia Tech researchers made an advance on this end a few weeks ago by presenting a microfluidic manufacturing process compatible with standard CMOS
Expect lots of great things in the years to come. For now I expect 3-D integration to creep into specialty mixed signal chips that are extremely expensive, and memory where heat generation is less of a problem. Microfluidic cooling technologies will be adopted in the near term for 2-D high power chips. The first 3-D micro-processor architectures will probably use extra layers for clock distribution, global interconnect systems, and power distribution systems. Caching systems will likely be added as a third layer until new design approaches (and better tools) allow for the design of multi-layer integration with logic interspersed between the layers.
Saturday, May 20, 2006
Paper of the day:
Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks
Steiger, C.; Walder, H.; Platzner, M.;
Computers, IEEE Transactions on
Volume 53, Issue 11, Nov. 2004 Page(s):1393 - 1407
This paper got me thinking about the "shapes" of our reconfigurable units. If we only think it rectangles we'll be horribly limiting ourselves, but if we use more complex shapes they will be extremely difficult to manage. This paper uses the term "fragmentation" to describe the wasteful effect of using rectangles and I think it's a good term to pick up. The effect is similar to fragmentation on a hard disk and "defragmentation" will be an important process to optimizing a 4-Dimensional Schedule (3-D space and time).
I like the fact that every paper I read on a reconfigurable computer operating system says something like "reconfigurable computing operating systems are a rather new area of research." How many times does that get written in the academic world before it can no longer be considered a new area of research?
It's really nifty to think about a "hardware configuration manager" sitting at the very bottom level of an OS. VMware is already in the "hardware virtualization" business. I wonder if anyone there had considered usign FPGAs to accelerate their platform?
"We've now discovered molecules that act just like reconfigurable logic bits," says Hewlett-Packard's Kuekes. "We are proposing fairly simple devices that can be literally grown with chemistry. Then all the complexity will be downloaded into configuration bits once the structure is made." Kuekes expects this technology will come to fruition in about ten years, just about the time silicon will peter out. "Reconfigurable logic won't just be a good idea," says Kuekes. "It will be the only way to do computing in the future."
Tuesday, May 16, 2006
I asked him a question about function multiplexing (setting up multiple configurations for a LUT and using a global switch to select a configuration). I wanted to know if Xilinx had any plans to deliver chips to support it. He said no. He expressed the opinion that the space used by the additional RAM would be better spent on additional LUTs in which one could implement the multiplexed function. Thus it's probably bettter to just implement the multiplexing in the configware. I'm not sure if this changes my opinion on using dynamic partial reconfiguration with function multiplexing to implement a paging mechanism to swap configurations. I suppose if reconfiguration can be done fast enough it would make sense to implement a set of multiplexed functions and alter them. There's a lot of stuff to consider here.
If the use of an operating system for an FPGA catches on, it is likely that morphware fabrics will be tuned specifically for the OS. Dynamic partial reconfiguration is a relatively new area of exploration so the hardware support for them is still limited. There's only one path for reconfiguration on an FPGA (as in you cannot reconfigure multiple blocks simultaneously). If multiple disjoint processes require reconfiguration it would be nice to do the operations in parallel. We shall see...
I turned on anonymous comments on this blog at the request of someone who emailed me. If there are people reading this who are interested in reconfigurable computing, feel free to get in touch with me! If you're in the valley this summer we could meet up and have some beers or something.
Monday, May 15, 2006
Recursive functional programs map to recursive structural circuits. We want an implementation mechanism for recursively defined circuits that uses continuations at break points to dynamically modify the configware layer. A "dispatch" circuit should also be possibility, depending on a request the dispatch circuit will spawn a new circuit. A lot of the implementation for such a system should follow from the implementation of a concurrent evaluator for a functional language.
Sunday, May 14, 2006
I will eventually want to run "virtual" fabrics on my current reconfigurable computing setup so that I can use the virtual fabric to test the scheduling algorithm. I will end up simulating more complex fabrics and with functional multiplexing and better routing options.
I've been thinking about recursively defined hierarchical fabric structures may be interesting substrates for reconfigurable computing. For example we can construct an order M computing node where nodes of order n consist of 4 other nodes of order n-1, an order n datapath, a controller and a 2^n sized memory block. an order n datapath contains n-stage pipeline of reconfigurable units. the controller manages the flowware of the node and moves data from the memory. an order 0 node contains no children or datapath and just consists the memory as a buffer to the output. This was a pretty loose spec, but the idea of recursive structures seems like a good idea especially as an abstraction for defining and dealing with complex heterogenous structures.
Saturday, May 13, 2006
One way to look at this problem: we have to schedule circuits A and B with no connection to eachother. Now using only 2 input muxes merge the circuits into one circuit such that if the mux select is 0 then the system behaves as circuit A and if the mux is in 1 then the system behaves as circuit B. Minimize the amount of logic you need without adding new memory elements. The "size" of the resulting circuit tells you how much A and B share.
Another view is to look at processes as using services which are spawn on the array such as a "square-root" or "vector-add" or "increment" or whatever. The a high-volume "square-root" service will need a dedicated subtract and divide service though a low-volume square-root can use a non-dedicated subtract and divide service and simply use flowware to direct the data through the shared resources. The scheduler makes the decisions about "sharing" resources, and also identifies ways of scheduling the transition between processes by the commonality of their resource usage.
I like the idea of improving an algorithm that promotes sharing. It's very fraternal.
A friend of mine said: "we've been doing von Neumann for 50 years, we couldn't have gotten it right the first time."
Friday, May 12, 2006
Sunday, May 07, 2006
Like so many articles it says that FPGAs are hard to target and you need special programmers to do it. I've probably read a thousand articles and papers that say the same thing.
Still writing my thesis proposal... slow process...
Thursday, May 04, 2006
The following is a revised version of an article I wrote last month. I've been narrowing down my problem description in order to put together a proposal for my Master's Thesis.
Methods for Reconfigurable Computing
There is some work to be done on the expressive powers of FPGA hardware description languages. The problem is that many EDA developers are only improving methods of hardware description. People view circuit emulation as the sole application of FPGAs instead of one of many possible applications. FPGAs are largely used to verify designs and they may possibly even end up in some embedded products. However this mode of operation has created inertia that must be overcome before we start seeing reconfigurable logic arrays in all types of devices. We need to start viewing the FPGA the way we view a sequential CPU: as a means to the execution of an algorithm. C is the "native" language of the sequential Von Neumann processor and it came after several iterations of language development. There isn't such a thing for FPGAs yet; Verilog and VHDL are more like the assembly languages of FPGAs, and they don't even support reconfiguration!
Any system designed to improve FPGA usage should target a wider class of problems. The problem of targetting an application to a grid of reconfigurable functional units is applicable outside the realm of FPGAs. An methodology for programming a network of distributed systems is extremely relevant for cluster supercomputing applications as well as Internet applications. Additionally, multi-core processors are swarming the marketplace and soon they may even incorporate reconfigurable logic for application acceleration and dynamic data routing. FPGAs already contain CPU cores and one can imagine that those CPU cores could contain reconfigurable logic in a circular fashion. Future hardware may be in the form of quantum dot arrays, photonic processors and magnetic spin logic. Practical systems using these technologies may only exist in the form of programmable cellular arrays (fabricated in a wet self-assembly process, no doubt).
Currently, not enough software exists to support the kind of structure that future chips and systems will have. What I view as the major issue facing the FPGA EDA space is one of identity. We should aim for more than creating tools for emulation and verification. We need to create tools that solve the more general problem of targetting arrays of reprogrammable devices. Ideally, the tools should run on any kind of hardware platform, be it an FPGA, a single CPU core, a 16 core processor or a distributed network of CPUs. Tools and languages do exist for developing parallel computing applications, but they are largely disjoint from the FPGA space. I plan to focus my master's thesis work on developing tools to implement such a system on an FPGA.
When I realized that this was primarily a software problem, I wondered why it hadn't been solved by software people. Then I realized that inertia is probably the reason. The majority of the computing world is heavily invested in Von Neumann architecture and is generally unaware of the problems in the FPGA realm. I grew up knowing how to code in C and Assembly, but I never learned Verilog until I took 6.111 as a sophomore at MIT. Even still, many computer scientists at MIT haven't even heard of FPGAs. CS students avoid the "Digital Death Lab" because it has a reputation of being extremely time consuming (it should be if you become impassioned by your project). The experience most EE students have with FPGA technology may look something like this: sitting in the 6.111 lab frustrating over EDA tools that "simulate correctly but then why isn't my circuit working." Then you have to wait 5 minutes for the Xilinx tools to program the chip after you think you've fixed what was broken.
As a lab assistant for 6.111 I regularly hear students complain about the tools; I could write an entire book about the types of errors that aren't really bugs in the design, but rather in the tools. So the current tools need improvement, but even still, the necessary improvement is a step away from design verification tools. Students still approach the FPGA in line with the current dominant paradigm: as a means to emulate an application specific digital system, not as a means to execute an algorithm. This is mostly because we do not have tools with the latter purpose in mind. Students certainly understand that for many applications we can gain massive speedups through parallelism and granularity and that there is a whole class of applications where a single FPGA can outperform a cluster of computers. The functionality to cost ratio is still prohibitive for many applications, but like most things in hardware, that is a matter of scale.
The FPGA market space is heavily invested in hardware description languages. This is because there are methods in place to translate HDL designs to an ASIC implementation and FPGAs are mostly used for design verification. Verilog is based around module "building blocks" with behavioral descriptions and connections with other the modules. This "structural design" method is good for FPGAs and lends itself to a functional programming languages. Since I particularly like Scheme, I'm going to use it for the majority of my coding.
I also think it is simpler to write interpreters and translators in Lisps than in any other languages. If I can command the FPGA from Scheme, making it happen from C, C++, Java, or any other language is a SMOP (a great way to learn a language is to create an interpretter for it!) This is the case for me and probably for all MIT students who have all written a Scheme evaluator in Scheme for the introductory computer science course. If I want to utilize MIT computer science talent on an FPGA programming language project, Lisp is definitely the way to go.
A large part of the problem with HDL tools is that semantic eloquence is vacant from hardware description. Machine design through explicit module definition is not the best way to solve massive and complex problems; Verilog becomes extremely bulky and debugging systems can take an extremely long time. In my 6.111 project I was drawn to the idea of using multiple parallel datapaths and creating an assembler to assign functions to each of the datapaths. I later discovered that this was not an uncommon practice. We would want to go a level higher than this to solve bigger problems in less time: we want our system to spawn specific datapaths for a particular algorithm in order to maximize throughput. The system should generate optimal hardware structures from a high-level algorithm description.
Recursive functional programming allows for us to think of massively sized machines unfolding to solve a problem; often times we would unfold machines much larger than an FPGA could emulate. If we are at 99% resource utilization we may still map our problem to the FPGA, but once we go to 101% our entire design fails. There is currently no good way to implement a system that pushes and pops state and configuration data. We want to be able to dynamically spawn modules and alter the connection between modules to enable the architecture implement systems that are much more complicated than could fit statically on the FPGA.
I recently read a paper about function multiplexing on an FPGA. The idea is that each configurable unit may have multiple configurations. We could utilize this to dynamically transition a module between functionalities: while the FPGA is in configuration 0, we reprogram configuration 1. As soon as configuration 0 has completed its execution, the system switches to configuration 1. We then reprogram configuration 0 and so forth. If we have more than 2 configuration spaces we can even imagine data directed, conditional reconfiguration in which we choose between multiple next configurations based on the state of the system at some branch point.
To do this we would an operating system to manage the resources of an FPGA based computer. Such an operating system should work symmetrically with distributed computing systems and multi-core CPUs. The scope of this problem is much larger than I can accomplish on my own. It will be incredibly important to look at what's already been done and to devise a methodology and a taxonomy that can easily be adopted by the community of FPGA developers.