Saturday, May 27, 2006

3-D Integration

3-D integration is going to become more and more popular over the next few years. I created a thermal cost function for a 3-D FPGA place-and-route last year for 6.374. Prof. Chandrakasan, who taught 6.374, has been researching a 3-D FPGA. I haven't done any further work on it since the class, but I read an article today about integrated cooling which made me think about it again (I've looked into implementing the thermal FEA on an FPGA, but others have already done similar things). Xilinx is certainly interested in this research. FPGAs have inherent fault tolerance, which means that the yield issues associated with 3-D integration can be mitigated (in fact, Xilinx sells its "defective" units as application-specific devices).
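
To give a flavor of what I mean by a thermal cost function, here is a toy version in Scheme: total half-perimeter wirelength plus a power term weighted by how far a block sits from the heat sink. The block representation, the weight alpha, and the layer-depth penalty are all made up for illustration; this is not the 6.374 code.

    ;; A toy placement cost with a thermal term.  The block layout, weights,
    ;; and the layer-depth penalty are illustrative assumptions only.
    (define alpha 0.5)                       ; relative weight of the thermal term

    ;; a block is a list: (x y layer power)
    (define (block-x b) (car b))
    (define (block-y b) (cadr b))
    (define (block-layer b) (caddr b))
    (define (block-power b) (cadddr b))

    ;; half-perimeter wirelength of one net (a net is a list of blocks)
    (define (wirelength net)
      (let ((xs (map block-x net))
            (ys (map block-y net)))
        (+ (- (apply max xs) (apply min xs))
           (- (apply max ys) (apply min ys)))))

    ;; crude thermal penalty: power weighted by distance from the heat sink,
    ;; assuming layer 0 sits next to the sink and inner layers run hotter
    (define (thermal-penalty block)
      (* (block-power block) (+ 1 (block-layer block))))

    ;; total cost the placer tries to minimize
    (define (placement-cost nets blocks)
      (+ (apply + (map wirelength nets))
         (* alpha (apply + (map thermal-penalty blocks)))))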

The remainder of this blog entry was written on 7/19/05.

3-D chips require fewer wires. As shown in this thesis and paper, 56% less interconnect is required for a 5-layer chip. Wafer bonding has been thoroughly investigated, and processes compatible with standard CMOS are being refined. Tezzaron is using this technology for memory.
The big problems facing the industry are the lack of good design tools and the issues associated with yield and heat. Design tools will be developed as the processes become more refined. Yield and heat need to be taken into consideration in the design. Consider an 80% yield on each wafer: with 5 layers of silicon--assuming defects are not correlated with location on the chip, and no defects are introduced by the bonding process--your yield drops to 0.8^5, or about 33%. Of course, we are able to have more redundancy with more silicon layers, so we can design systems that are fault tolerant (google: fault tolerant architectures. lots of good stuff). The cost of the chips will probably directly reflect the decrease in yield -- good designs and tools will save companies a lot of money (though I shouldn't give away my secrets before I patent them :-)

Cooling higher density chips is the major hurdle in the development of 3-D circuits. A few documents hint that microfluidic cooling systems may be the solution. Georgia Tech researchers made an advance on this front a few weeks ago by presenting a microfluidic manufacturing process compatible with standard CMOS.

Expect lots of great things in the years to come. For now I expect 3-D integration to creep into specialty mixed-signal chips that are extremely expensive, and into memory, where heat generation is less of a problem. Microfluidic cooling technologies will be adopted in the near term for 2-D high-power chips. The first 3-D microprocessor architectures will probably use extra layers for clock distribution, global interconnect, and power distribution. Caches will likely be added as a third layer until new design approaches (and better tools) allow for multi-layer integration with logic interspersed between the layers.

Computing Without Computers

Found this article by Ian Page, a researcher in reconfigurable computing and founder of Celoxica. I think he's right on with his observations.

Saturday, May 20, 2006

my daily headache

I've been toying with an IP core that lets me access the FPGA's reconfiguration port from inside the FPGA itself. I've connected the reconfiguration port to a MicroBlaze core running uClinux, but I haven't actually built a way to do anything worthwhile with it yet. I'd like to get Linux running on the dual PowerPCs so I can start kernel hacking and add dynamic reconfiguration support. I'm thinking of developing a "debug" mode that displays a picture of the FPGA on the screen and lets you see and manipulate the configuration.

Paper of the day:
Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks
Steiger, C.; Walder, H.; Platzner, M.
IEEE Transactions on Computers, vol. 53, no. 11, Nov. 2004, pp. 1393-1407

This paper got me thinking about the "shapes" of our reconfigurable units. If we only think in rectangles we'll be horribly limiting ourselves, but if we use more complex shapes they will be extremely difficult to manage. The paper uses the term "fragmentation" to describe the wasteful effect of using rectangles, and I think it's a good term to pick up. The effect is similar to fragmentation on a hard disk, and "defragmentation" will be an important step in optimizing a 4-dimensional schedule (3-D space plus time).
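
To make "fragmentation" concrete, here is a rough Scheme sketch of one way it could be measured on a small occupancy grid: the fraction of free area that cannot be covered by the single largest free rectangle. The grid representation and the metric are my own guesses, not the paper's definition.

    ;; Rough fragmentation metric for a rectangular fabric.  The grid is a
    ;; list of rows, #t = occupied cell, #f = free cell.
    (define (grid-ref grid x y) (list-ref (list-ref grid y) x))
    (define (grid-width grid) (length (car grid)))
    (define (grid-height grid) (length grid))

    ;; is every cell in rows y1..y2, columns x1..x2 free?
    (define (region-free? grid x1 y1 x2 y2)
      (let loop ((y y1))
        (cond ((> y y2) #t)
              ((let scan ((x x1))
                 (cond ((> x x2) #t)
                       ((grid-ref grid x y) #f)
                       (else (scan (+ x 1)))))
               (loop (+ y 1)))
              (else #f))))

    ;; brute-force search for the largest all-free rectangle (fine for toy grids)
    (define (largest-free-rectangle grid)
      (let ((w (grid-width grid)) (h (grid-height grid)) (best 0))
        (do ((y1 0 (+ y1 1))) ((= y1 h) best)
          (do ((x1 0 (+ x1 1))) ((= x1 w))
            (do ((y2 y1 (+ y2 1))) ((= y2 h))
              (do ((x2 x1 (+ x2 1))) ((= x2 w))
                (if (region-free? grid x1 y1 x2 y2)
                    (set! best (max best (* (+ 1 (- x2 x1)) (+ 1 (- y2 y1))))))))))))

    (define (free-area grid)
      (apply + (map (lambda (row)
                      (apply + (map (lambda (cell) (if cell 0 1)) row)))
                    grid)))

    ;; 0 means all free space forms one usable rectangle; closer to 1 means
    ;; the free space is badly fragmented
    (define (fragmentation grid)
      (let ((free (free-area grid)))
        (if (zero? free)
            0
            (- 1 (/ (largest-free-rectangle grid) free)))))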

I like the fact that every paper I read on a reconfigurable computer operating system says something like "reconfigurable computing operating systems are a rather new area of research." How many times does that get written in the academic world before it can no longer be considered a new area of research?

It's really nifty to think about a "hardware configuration manager" sitting at the very bottom level of an OS. VMware is already in the "hardware virtualization" business. I wonder if anyone there has considered using FPGAs to accelerate their platform?

molecular computing

I've often suggested that self-assembled molecular electronics would take the form of reconfigurable arrays. I found this article to back that up.

"We've now discovered molecules that act just like reconfigurable logic bits," says Hewlett-Packard's Kuekes. "We are proposing fairly simple devices that can be literally grown with chemistry. Then all the complexity will be downloaded into configuration bits once the structure is made." Kuekes expects this technology will come to fruition in about ten years, just about the time silicon will peter out. "Reconfigurable logic won't just be a good idea," says Kuekes. "It will be the only way to do computing in the future."

Tuesday, May 16, 2006

function multiplexing

The CTO of Xilinx came to MIT to check out the 6.111 projects and to give a talk on the future of FPGAs. He mentioned most of the stuff I babble about in this blog. He discussed the need for better tools to make targeting FPGAs more transparent, to open the market up to computer scientists (who outnumber us electrical engineers by 100 to 1). He also specifically mentioned dynamic partial reconfiguration and the need for an operating system to support it.

I asked him a question about function multiplexing (setting up multiple configurations for a LUT and using a global switch to select among them). I wanted to know if Xilinx had any plans to deliver chips that support it. He said no. He expressed the opinion that the space used by the additional configuration RAM would be better spent on additional LUTs in which one could implement the multiplexed function, so it's probably better to just implement the multiplexing in the configware. I'm not sure if this changes my opinion on using dynamic partial reconfiguration with function multiplexing to implement a paging mechanism that swaps configurations. I suppose if reconfiguration can be done fast enough, it would make sense to implement a set of multiplexed functions and alter them on the fly. There's a lot to consider here.
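
To make the tradeoff concrete, here is a toy Scheme model of what I mean by function multiplexing: a LUT that stores several truth tables plus a context select that switches among them. The names and the message-passing style are mine, purely for illustration.

    ;; Toy model of a function-multiplexed LUT: several stored truth tables
    ;; plus a global context select.  Purely illustrative.
    (define (make-multiplexed-lut . truth-tables)
      (let ((contexts (list->vector truth-tables))
            (current 0))
        (define (evaluate inputs)                  ; inputs is a list of 0/1 bits
          (list-ref (vector-ref contexts current) (bits->index inputs)))
        (define (switch-context! n) (set! current n))
        (define (dispatch msg)
          (cond ((eq? msg 'eval) evaluate)
                ((eq? msg 'switch!) switch-context!)
                (else (error "unknown message" msg))))
        dispatch))

    ;; interpret a list of bits as a binary number, most significant bit first
    (define (bits->index bits)
      (let loop ((bits bits) (acc 0))
        (if (null? bits) acc (loop (cdr bits) (+ (* 2 acc) (car bits))))))

    ;; context 0 is a 2-input AND, context 1 is a 2-input XOR
    (define lut (make-multiplexed-lut '(0 0 0 1) '(0 1 1 0)))
    ((lut 'eval) '(1 1))          ; => 1  (AND)
    ((lut 'switch!) 1)
    ((lut 'eval) '(1 1))          ; => 0  (XOR)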

If the use of an operating system for an FPGA catches on, it is likely that morphware fabrics will be tuned specifically for the OS. Dynamic partial reconfiguration is a relatively new area of exploration, so hardware support for it is still limited. There's only one reconfiguration path on an FPGA (meaning you cannot reconfigure multiple blocks simultaneously). If multiple disjoint processes require reconfiguration, it would be nice to do the operations in parallel. We shall see...

I turned on anonymous comments on this blog at the request of someone who emailed me. If there are people reading this who are interested in reconfigurable computing, feel free to get in touch with me! If you're in the valley this summer we could meet up and have some beers or something.

Monday, May 15, 2006

recursive circuit definition

One of the problems with most HDLs is that they do not support recursively defined circuits. It's not too hard to build macros for such things in other languages, but as the circuit grows too large the fully unrolled implementation becomes impractical. A mechanism for handling such circuits directly would be really useful.

Recursive functional programs map to recursive structural circuits. We want an implementation mechanism for recursively defined circuits that uses continuations at break points to dynamically modify the configware layer. A "dispatch" circuit should also be a possibility: depending on a request, the dispatch circuit spawns a new circuit. A lot of the implementation for such a system should follow from the implementation of a concurrent evaluator for a functional language.
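
As a tiny example of a recursively defined circuit, here is a ripple-carry adder written as a recursive Scheme function over bit lists; each recursive call corresponds to one more full-adder stage being unfolded. This is just a behavioral sketch of the idea, not configware.

    ;; Recursively defined ripple-carry adder over lists of bits (LSB first).
    ;; Each recursive call "unfolds" one more full-adder stage.
    (define (full-adder a b c-in)
      (list (modulo (+ a b c-in) 2)           ; sum bit
            (quotient (+ a b c-in) 2)))       ; carry out

    (define (ripple-carry-add as bs c-in)
      (if (null? as)
          (list c-in)                         ; final carry-out becomes the top bit
          (let* ((stage (full-adder (car as) (car bs) c-in))
                 (sum (car stage))
                 (carry (cadr stage)))
            (cons sum (ripple-carry-add (cdr as) (cdr bs) carry)))))

    ;; 6 + 3 = 9:  (0 1 1) + (1 1 0)  =>  (1 0 0 1), i.e. binary 1001 LSB first
    (ripple-carry-add '(0 1 1) '(1 1 0) 0)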

Sunday, May 14, 2006

virtual hardware target

In order to maintain target independence, I'm building circuit fabrics in Scheme using the digital circuit simulator as outlined in SICP (section 3.3.4).

I will eventually want to run "virtual" fabrics on my current reconfigurable computing setup so that I can use them to test the scheduling algorithm. I will end up simulating more complex fabrics with function multiplexing and better routing options.
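
For a flavor of what a single fabric cell looks like in this setup, here is a 2-input LUT built from the SICP gates, with the configuration held on four extra wires; driving those wires with set-signal! is the "reconfiguration." It assumes the section 3.3.4 primitives (make-wire, and-gate, or-gate, inverter, set-signal!) are already loaded.

    ;; A 2-input LUT cell for the virtual fabric, modelled on the SICP 3.3.4
    ;; simulator.  Assumes make-wire, and-gate, or-gate and inverter from that
    ;; section are loaded.  Config wires c0..c3 hold the cell's truth table.
    (define (and3 a b c out)                  ; 3-input AND from two 2-input gates
      (let ((t (make-wire)))
        (and-gate a b t)
        (and-gate t c out)))

    (define (or4 a b c d out)                 ; 4-input OR from three 2-input gates
      (let ((t1 (make-wire)) (t2 (make-wire)))
        (or-gate a b t1)
        (or-gate c d t2)
        (or-gate t1 t2 out)))

    (define (lut2 a b c0 c1 c2 c3 out)
      (let ((na (make-wire)) (nb (make-wire))
            (m0 (make-wire)) (m1 (make-wire))
            (m2 (make-wire)) (m3 (make-wire)))
        (inverter a na)
        (inverter b nb)
        (and3 na nb c0 m0)                    ; row a=0, b=0
        (and3 a  nb c1 m1)                    ; row a=1, b=0
        (and3 na b  c2 m2)                    ; row a=0, b=1
        (and3 a  b  c3 m3)                    ; row a=1, b=1
        (or4 m0 m1 m2 m3 out)                 ; 4:1 mux selecting the config bit
        'ok))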

I've been thinking that recursively defined, hierarchical fabric structures may be interesting substrates for reconfigurable computing. For example, we can construct an order-M computing node where a node of order n consists of 4 nodes of order n-1, an order-n datapath, a controller, and a 2^n-sized memory block. An order-n datapath contains an n-stage pipeline of reconfigurable units. The controller manages the flowware of the node and moves data to and from the memory. An order-0 node contains no children or datapath and just consists of the memory acting as a buffer to the output. This is a pretty loose spec, but recursive structure seems promising, especially as an abstraction for defining and dealing with complex heterogeneous fabrics.
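
Here is the loose spec above transcribed into a Scheme constructor, mostly to convince myself the recursion bottoms out sensibly; the field names and the flat list representation are just placeholders.

    ;; Loose transcription of the spec: an order-n node holds four order-(n-1)
    ;; children, an n-stage datapath, a controller slot, and a 2^n-word memory
    ;; block.  Field names and the list layout are placeholders.
    (define (make-node order)
      (if (= order 0)
          (list 'node 0
                'children '()
                'datapath '()                         ; order 0: no datapath
                'controller '()
                'memory (make-vector 1 0))            ; memory as an output buffer
          (list 'node order
                'children (list (make-node (- order 1)) (make-node (- order 1))
                                (make-node (- order 1)) (make-node (- order 1)))
                'datapath (make-vector order 'reconfigurable-unit) ; n-stage pipeline
                'controller 'flowware-controller      ; manages flowware and memory traffic
                'memory (make-vector (expt 2 order) 0))))

    (define (node-order node) (list-ref node 1))
    (define (node-children node) (list-ref node 3))

    ;; total reconfigurable units across all datapaths in an order-m fabric
    (define (datapath-units node)
      (+ (node-order node)
         (apply + (map datapath-units (node-children node)))))

    (define fabric (make-node 3))
    (datapath-units fabric)       ; => 3 + 4*(2 + 4*(1 + 4*0)) = 27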

Saturday, May 13, 2006

scheduler

How do we optimally schedule for an FPGA?

One way to look at this problem: we have to schedule circuits A and B, which have no connection to each other. Now, using only 2-input muxes, merge the two circuits into one circuit such that if the mux select is 0 the system behaves as circuit A, and if the select is 1 it behaves as circuit B. Minimize the amount of logic you need without adding new memory elements. The "size" of the resulting circuit tells you how much A and B share.

Another view is to look at processes as using services which are spawned on the array, such as "square-root" or "vector-add" or "increment" or whatever. A high-volume "square-root" service will need dedicated subtract and divide services, while a low-volume square-root can use non-dedicated subtract and divide services and simply use flowware to direct the data through the shared resources. The scheduler makes the decisions about "sharing" resources, and it also identifies ways of scheduling the transition between processes based on the commonality of their resource usage.
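
Here is a crude sketch of the "commonality" idea: score a candidate transition between two processes by how many of their required service instances overlap. The process descriptions and the scoring are invented for illustration.

    ;; Crude sketch of scoring a process transition by shared services.
    ;; A process is a list: (name (service count) ...).  All names invented.
    (define proc-a '(fft    (multiply 8) (add 8) (square-root 1)))
    (define proc-b '(filter (multiply 4) (add 4)))

    (define (requirements proc) (cdr proc))

    (define (service-count service reqs)
      (cond ((null? reqs) 0)
            ((eq? service (caar reqs)) (cadar reqs))
            (else (service-count service (cdr reqs)))))

    ;; resources reusable across the transition: for each service, the smaller
    ;; of the two counts.  A higher score means a cheaper reconfiguration.
    (define (shared-resources a b)
      (apply +
             (map (lambda (req)
                    (min (cadr req) (service-count (car req) (requirements b))))
                  (requirements a))))

    (shared-resources proc-a proc-b)     ; => 8  (4 multipliers + 4 adders stay put)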

I like the idea of improving an algorithm that promotes sharing. It's very fraternal.

Why reconfigurable hardware?

My belief is that a good programming interface for reconfigurable systems will be necessary for future chips. I believe that manufacturing considerations for future nanoscale ICs will prove fault-tolerant fabrics to be the optimal chip structure. Years beyond that, amorphous and self-assembling cellular structures are likely to replace the fabrication methods of today. It is likely that these devices will be extremely difficult to program, so I'd better think about it now if I want a head start.

A friend of mine said: "we've been doing von Neumann for 50 years, we couldn't have gotten it right the first time."

Friday, May 12, 2006

Thesis Proposal

Proposal for my Master's Thesis

"There is, however, no current hardware limitation preventing a two FPGA system in which each FPGA is responsible for reconfiguring the other (evoking images of Escher's hands drawing each other)."


Sunday, May 07, 2006

Article about Cray XD1

Just read this article about the XD1 from Cray.

Like so many articles it says that FPGAs are hard to target and you need special programmers to do it. I've probably read a thousand articles and papers that say the same thing.

Still writing my thesis proposal... slow process...

Thursday, May 04, 2006

Methods for Reconfigurable Computing

The following is a revised version of an article I wrote last month. I've been narrowing down my problem description in order to put together a proposal for my Master's Thesis.


Methods for Reconfigurable Computing

There is some work to be done on the expressive power of FPGA hardware description languages. The problem is that many EDA developers are only improving methods of hardware description. People view circuit emulation as the sole application of FPGAs instead of one of many possible applications. FPGAs are largely used to verify designs, and they may even end up in some embedded products. However, this mode of operation has created inertia that must be overcome before we start seeing reconfigurable logic arrays in all types of devices. We need to start viewing the FPGA the way we view a sequential CPU: as a means to the execution of an algorithm. C is the "native" language of the sequential von Neumann processor, and it came after several iterations of language development. There isn't such a thing for FPGAs yet; Verilog and VHDL are more like the assembly languages of FPGAs, and they don't even support reconfiguration!

Any system designed to improve FPGA usage should target a wider class of problems. The problem of targeting an application to a grid of reconfigurable functional units is applicable outside the realm of FPGAs. A methodology for programming a network of distributed systems is extremely relevant to cluster supercomputing applications as well as Internet applications. Additionally, multi-core processors are swarming the marketplace, and soon they may even incorporate reconfigurable logic for application acceleration and dynamic data routing. FPGAs already contain CPU cores, and one can imagine that those CPU cores could contain reconfigurable logic in a circular fashion. Future hardware may take the form of quantum dot arrays, photonic processors, and magnetic spin logic. Practical systems using these technologies may only exist in the form of programmable cellular arrays (fabricated in a wet self-assembly process, no doubt).

Currently, not enough software exists to support the kind of structure that future chips and systems will have. What I view as the major issue facing the FPGA EDA space is one of identity. We should aim for more than creating tools for emulation and verification. We need to create tools that solve the more general problem of targeting arrays of reprogrammable devices. Ideally, the tools should run on any kind of hardware platform, be it an FPGA, a single CPU core, a 16-core processor, or a distributed network of CPUs. Tools and languages do exist for developing parallel computing applications, but they are largely disjoint from the FPGA space. I plan to focus my master's thesis work on developing tools to implement such a system on an FPGA.

When I realized that this was primarily a software problem, I wondered why it hadn't been solved by software people. Then I realized that inertia is probably the reason. The majority of the computing world is heavily invested in the von Neumann architecture and is generally unaware of the problems in the FPGA realm. I grew up knowing how to code in C and assembly, but I never learned Verilog until I took 6.111 as a sophomore at MIT. Even now, many computer scientists at MIT haven't heard of FPGAs. CS students avoid the "Digital Death Lab" because it has a reputation for being extremely time consuming (it should be, if you become impassioned by your project). The experience most EE students have with FPGA technology may look something like this: sitting in the 6.111 lab, getting frustrated with EDA tools that "simulate correctly, but then why isn't my circuit working?" Then you have to wait 5 minutes for the Xilinx tools to program the chip after you think you've fixed what was broken.

As a lab assistant for 6.111 I regularly hear students complain about the tools; I could write an entire book about the types of errors that aren't really bugs in the design, but rather in the tools. So the current tools need improvement, but even so, the necessary improvement is a step away from design verification tools. Students still approach the FPGA in line with the current dominant paradigm: as a means to emulate an application-specific digital system, not as a means to execute an algorithm. This is mostly because we do not have tools built with the latter purpose in mind. Students certainly understand that for many applications we can gain massive speedups through parallelism and granularity, and that there is a whole class of applications where a single FPGA can outperform a cluster of computers. The functionality-to-cost ratio is still prohibitive for many applications, but like most things in hardware, that is a matter of scale.

The FPGA market space is heavily invested in hardware description languages. This is because there are methods in place to translate HDL designs to an ASIC implementation, and FPGAs are mostly used for design verification. Verilog is based around module "building blocks" with behavioral descriptions and connections to other modules. This "structural design" method is good for FPGAs and lends itself to functional programming languages. Since I particularly like Scheme, I'm going to use it for the majority of my coding.

I also think it is simpler to write interpreters and translators in Lisps than in any other language. If I can command the FPGA from Scheme, making it happen from C, C++, Java, or any other language is a SMOP (a great way to learn a language is to create an interpreter for it!). This is the case for me and probably for all MIT students, who have written a Scheme evaluator in Scheme for the introductory computer science course. If I want to utilize MIT computer science talent on an FPGA programming language project, Lisp is definitely the way to go.

A large part of the problem with HDL tools is that semantic expressiveness is absent from hardware description. Machine design through explicit module definition is not the best way to solve massive, complex problems; Verilog becomes extremely bulky, and debugging such systems can take an extremely long time. In my 6.111 project I was drawn to the idea of using multiple parallel datapaths and creating an assembler to assign functions to each of the datapaths. I later discovered that this is not an uncommon practice. We want to go a level higher than this to solve bigger problems in less time: we want our system to spawn datapaths specific to a particular algorithm in order to maximize throughput. The system should generate optimal hardware structures from a high-level algorithm description.
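
Here is a toy of what "spawn datapaths from a high-level description" might mean: walk an arithmetic expression and emit a netlist of functional units and the connections between them. The representation and the naming scheme are made up for illustration.

    ;; Toy elaborator: walk an arithmetic expression and emit a netlist of
    ;; functional units plus their input connections.  Entirely illustrative.
    (define unit-counter 0)

    (define (new-unit op)                     ; fresh name like *-1, +-3, ...
      (set! unit-counter (+ unit-counter 1))
      (string->symbol
       (string-append (symbol->string op) "-" (number->string unit-counter))))

    ;; returns a pair (output-signal . netlist); a netlist entry is
    ;; (unit-name op input1 input2)
    (define (elaborate expr)
      (if (pair? expr)
          (let* ((left  (elaborate (cadr expr)))
                 (right (elaborate (caddr expr)))
                 (unit  (new-unit (car expr))))
            (cons unit
                  (append (cdr left) (cdr right)
                          (list (list unit (car expr) (car left) (car right))))))
          (cons expr '())))                   ; a leaf is just a named input wire

    ;; datapath for (a*b) + (c*d): two multipliers feeding an adder
    (cdr (elaborate '(+ (* a b) (* c d))))
    ;; => ((*-1 * a b) (*-2 * c d) (+-3 + *-1 *-2))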

Recursive functional programming allows us to think of massive machines unfolding to solve a problem; oftentimes we would unfold machines much larger than an FPGA could emulate. If we are at 99% resource utilization we may still map our problem to the FPGA, but once we go to 101% our entire design fails. There is currently no good way to implement a system that pushes and pops state and configuration data. We want to be able to dynamically spawn modules and alter the connections between modules, so the architecture can implement systems that are much more complicated than could fit statically on the FPGA.

I recently read a paper about function multiplexing on an FPGA. The idea is that each configurable unit may hold multiple configurations. We could use this to dynamically transition a module between functionalities: while the FPGA is in configuration 0, we reprogram configuration 1. As soon as configuration 0 has completed its execution, the system switches to configuration 1. We then reprogram configuration 0, and so forth. If we have more than 2 configuration spaces we can even imagine data-directed, conditional reconfiguration in which we choose among multiple next configurations based on the state of the system at some branch point.
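
To pin down the ping-pong idea, here is a skeleton of the control loop: while the active context executes, the idle one is rewritten, then the two swap roles. The two helper procedures are stand-in stubs, not real tool or driver calls.

    ;; Skeleton of the ping-pong reconfiguration loop.  run-configuration and
    ;; load-configuration! are stand-in stubs, not real tool or driver calls.
    (define (run-configuration context)
      (display (list 'running 'context context)) (newline))

    (define (load-configuration! context bitstream)
      (display (list 'loading bitstream 'into 'context context)) (newline))

    (define (ping-pong-execute bitstreams)
      (let loop ((active 0) (idle 1) (pending bitstreams))
        (if (null? pending)
            'done
            (begin
              ;; in hardware these two steps would overlap in time
              (run-configuration active)
              (load-configuration! idle (car pending))
              ;; swap the roles of the two contexts and continue
              (loop idle active (cdr pending))))))

    (ping-pong-execute '(config-a config-b config-c))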

To do this we would need an operating system to manage the resources of an FPGA-based computer. Such an operating system should work symmetrically with distributed computing systems and multi-core CPUs. The scope of this problem is much larger than I can accomplish on my own. It will be incredibly important to look at what's already been done and to devise a methodology and a taxonomy that can easily be adopted by the community of FPGA developers.