Thursday, July 24, 2008

RhoZeta Excel-Python-OpenGL Demo

In this demo I will show you how to use the RhoZeta Python-Excel bindings to create a non-blocking assignment spreadsheet iteration thread and an OpenGL window to draw cells. This older demo shows more Python-evaluable code in Excel spreadsheet cells with RhoZeta. Feel free to leave comments and suggestions for future demos.

RhoZeta is a Python spreadsheet language with Excel bindings to allow programmers to process spreadsheet formulas through a Python backend. The cell values are forwarded to the Excel frontend to display in the current Visible Range. The longer term goal is to demonstrate how the spreadsheet iteration model can serve as an open framework for accelerated computing on FPGA, GPU, and Multicore architectures. These architectures consist of physical arrays of cells designed for dataflow processing and vector data parallelism. This locality-constrained, vector-oriented, dataflow programming model finds obvious expression in a spreadsheet. I've argued that this model effectively replaces the von Neumann instruction stream as the primitive model for designing systems on parallel hardware.

The code for this demo is up on the Catalyst Accelerated Computing Trac. Running RhoZeta.py alone does not require PyOpenGL, but you won't get the OpenGL window. RhoZetaDemo.py depends on PyOpenGL (which requires setuptools to install) and win32com (included with Python for Windows, a.k.a. PythonWin). I've only tested this with Excel 2007, but I think the underlying COM interface is still compatible with 2003.

Firstly: this is hacking and not a product yet. DO NOT START THIS WITH ANY OPEN WORKBOOKS OR YOU MAY LOSE YOUR DATA. You were warned.

Run RhoZetaDemo.py, and Excel and an OpenGL window should pop up:


Put the function "=time.clock()" into cell A1. If the cell manager is functioning, the value of A1 should be forwarded to the Excel frontend on a timeout.



Since this is a non-blocking assignment interpreter, placing "=A1" into cell A2 will cause A2 to act like a register, reflecting the value of A1 from the previous iteration:
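Here is a minimal sketch of that non-blocking semantics (my own illustration, not the actual RhoZeta source): every formula is evaluated against the previous iteration's value pool, and all next-values are committed at once. A simple counter formula stands in for "=time.clock()" so the result is deterministic.

```python
def iterate_nonblocking(formulas, values):
    """One iteration: every read sees the old pool; all writes commit together."""
    env = dict(values)  # snapshot of the previous iteration's values
    return {cell: eval(f, {}, env) for cell, f in formulas.items()}

# "=A1" in A2 acts like a register: it lags A1 by one iteration.
formulas = {"A1": "A1 + 1", "A2": "A1"}
values = {"A1": 0, "A2": 0}
for _ in range(2):
    values = iterate_nonblocking(formulas, values)
print(values)  # {'A1': 2, 'A2': 1}: A2 holds A1's value from the previous iteration
```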



We use the fact that A2 is A1 delayed by an iteration cycle to determine the time difference between two cells:


We can repeat the register-delay structure to track the iteration time for multiple clock cycles:


If we repeat this register structure 4 times we may take the average iteration time (try to click-and-drag the formula from cell B4, it should work).
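The register chain can be simulated entirely in Python. This is a sketch under assumptions: the cell addresses are my guesses (the screenshots that showed the real layout aren't reproduced here), and time.perf_counter() stands in for the demo's time.clock(), which modern Python has removed.

```python
import time

# A register chain A1 -> A2 -> A3 -> A4 delays the clock by one iteration
# per cell, so (A1 - A4) / 3 averages the last three iteration times.
formulas = {
    "A1": "time.perf_counter()",  # the 2008 demo used time.clock()
    "A2": "A1",                   # A1 delayed one iteration
    "A3": "A2",                   # delayed two iterations
    "A4": "A3",                   # delayed three iterations
    "B4": "(A1 - A4) / 3",        # average iteration time over three cycles
}

values = {cell: 0.0 for cell in formulas}
for _ in range(10):               # a few warm-up iterations
    env = dict(values, time=time)
    values = {cell: eval(f, {}, env) for cell, f in formulas.items()}

print("average iteration time: %.6f s" % values["B4"])
```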


I also whipped up some code to pull the Row Heights and Column Widths into an OpenGL drawing window with some MouseOver detection (only 2-D... I need to learn how to do selection correctly and in 3-D). The OpenGL window synchronizes with Excel when you click in the window. The top left cell should be the same as the top left cell in your Excel visible range. There's also code to draw a bar graph of the selected cells' data, but you'll have to tinker. Expect the next OpenGL demo to be even cooler.


The goal here was to demonstrate a spreadsheet that iterates continuously following a non-blocking semantic, as in Verilog non-blocking assignments: each cell is computed from the same pool of values and the next values are assigned simultaneously. Ordinary Excel iteration follows blocking assignment semantics: each cell is computed and assigned in the order in which the table is read. Under Excel's ordinary blocking assignments, assigning A2 the formula "=A1" would read the newly assigned value from the current iteration, because A2 comes after A1. We can still build shift registers with blocking assignments by reversing the order, i.e., assigning A1 the formula "=A2".
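The contrast is easy to show in a few lines of Python (a sketch of the two semantics, not RhoZeta or Excel internals): blocking assignment writes each cell as soon as it is computed, so "=A1" in A2 reads A1's new value; non-blocking evaluates every cell against one shared pool of old values.

```python
def step_blocking(formulas, values, order):
    """Excel-style: each cell is assigned immediately, in table-read order."""
    for cell in order:
        values[cell] = eval(formulas[cell], {}, dict(values))
    return values

def step_nonblocking(formulas, values):
    """Verilog-style: all cells read the old pool, then commit together."""
    env = dict(values)
    return {cell: eval(f, {}, env) for cell, f in formulas.items()}

formulas = {"A1": "A1 + 1", "A2": "A1"}
b = step_blocking(formulas, {"A1": 0, "A2": 0}, order=["A1", "A2"])
n = step_nonblocking(formulas, {"A1": 0, "A2": 0})
print(b)  # {'A1': 1, 'A2': 1}: A2 sees the freshly written A1
print(n)  # {'A1': 1, 'A2': 0}: A2 sees last iteration's A1
```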

Cells in a blocking-assignment iteration are non-commutable: the location of a cell matters to its interpretation. Non-blocking assigned cells may be moved freely. A third assignment type, asynchronous assignment, computes a new value whenever the cell's precedents change, as a combinatorial circuit would. Cells with an asynchronous assignment are also commutable. It is generally possible to mix these forms of assignment, as in VHDL with its mix of variables, signals, and continuous assignments. We could presumably support a syntax replacing the leading "=" to allow mixed modes.
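A sketch of that third, asynchronous type (the class and names here are my assumptions, not RhoZeta code): a cell recomputes the moment any precedent changes, with a naive regex scan standing in for a real precedent graph. Cyclic references would loop forever in this toy version.

```python
import re

class AsyncSheet:
    """Toy combinatorial-style sheet: writes propagate to dependents immediately."""
    def __init__(self):
        self.values, self.formulas = {}, {}

    def set_formula(self, cell, formula):
        self.formulas[cell] = formula
        self._recompute(cell)

    def set_value(self, cell, value):
        self.values[cell] = value
        # Recompute every cell whose formula references the changed cell
        # (naive scan; a real engine would keep a precedent graph).
        for c, f in self.formulas.items():
            if cell in re.findall(r"[A-Z]+\d+", f):
                self._recompute(c)

    def _recompute(self, cell):
        self.set_value(cell, eval(self.formulas[cell], {}, dict(self.values)))

s = AsyncSheet()
s.set_value("A1", 2)
s.set_formula("B1", "A1 * 10")
s.set_value("A1", 5)    # B1 recomputes immediately, like a combinatorial circuit
print(s.values["B1"])   # 50
```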

The iteration timing here is poor because cells are being dynamically parsed and interpreted by Python using the eval function. We know how to do better by JIT compiling sheet iteration in Python (some code is up in our Trac to do this). The next step in performance is to statically propagate types and compile sheet iteration pipelines to C so we can fork new threads onto a multicore/GPU/FPGA core and parallelize sheet iteration. We also know how to get this sheet to iterate deterministically with dedicated resources in under 2 nanoseconds using an FPGA synthesis toolchain -- of course, measuring the average iteration time for a deterministic iterator would be silly.
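One easy step in that direction, sketched here with hypothetical formulas rather than the actual Trac code: Python's compile() parses each formula to a code object once, so each iteration only executes cached bytecode instead of re-parsing strings.

```python
import time

formulas = {"A1": "A1 + 1", "A2": "A1"}
# Parse each formula exactly once, up front.
compiled = {cell: compile(f, cell, "eval") for cell, f in formulas.items()}

values = {"A1": 0, "A2": 0}
t0 = time.perf_counter()
for _ in range(10000):
    env = dict(values)  # non-blocking: all reads see the previous pool
    values = {cell: eval(code, {}, env) for cell, code in compiled.items()}
elapsed = time.perf_counter() - t0

print(values)  # {'A1': 10000, 'A2': 9999}
print("%.3f ms for 10k iterations" % (elapsed * 1e3))
```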

Monday, July 21, 2008

Partitioning Dataflow Graphs

OpenFPGA forked a child called Open Accelerator to organize activities in open accelerated computing. Sounds like my business plan. Let me talk about the hard problem: partitioning.

If you're like me, when you got your heterogeneous supercomputer out of the box, you probably produced some mixture of OpenGL, CUDA, Verilog, pthreads or spethreads, Python and duct-tape. We'll get back to the duct-tape.

The problem of accelerated computing development is working simultaneously with both control flow graphs (CFGs) and dataflow graphs (DFGs) on both control-flow engines (CPU) and data-flow engines (FPGA). I've argued that a spreadsheet framework will take care of the language problem. Even people who don't consider themselves programmers are familiar with the semantics of mixing dataflow in cells with control flow commands in macros. We also know how to translate between control flow and dataflow descriptions. Now it's easy to think-outside-the-box, but actually building a good box for other people to think inside of is not so easy.

The devil is in the duct-tape.

A NUMA machine without global cache coherence is too difficult to program unless someone hides the synchronization protocol behind simple spreadsheet semantics. Once you partition cells into separate processing nodes, you have to worry about communicating precedents across partition boundaries. The easy option is to perform a global commit cycle for all shared cells. This works fine when you're only using multithreading to hide memory latency on a single- to quad-core machine, or if you require all-to-all communication.

If you want to make a real multicore spreadsheet, then after you partition the spreadsheet you'll want to automatically generate code to manage data coherency by explicitly passing messages between communicating partitions. This problem is actually very similar to the routing part of the place-and-route problem, though efficiency is even harder to profile. Optimal partitions need to minimize the total number of consumers for each producer, as well as the total amount of data consumed. We're saved a bit by the fact that spreadsheet dataflow designs tend to have a fixed, regular communication structure. As a result it is possible to generate a deterministic synchronization procedure for any partition.
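As a sketch of what such a generated synchronization procedure might look like (the data structures and names are my assumptions, not code from the Trac): given each cell's precedents and a cell-to-partition mapping, a static per-edge send schedule falls out in one pass, computed once at partition time.

```python
from collections import defaultdict

def sync_schedule(precedents, partition_of):
    """precedents: cell -> list of cells it reads.
    partition_of: cell -> node id.
    Returns (src_node, dst_node) -> sorted cells src must send dst each iteration."""
    sched = defaultdict(set)
    for cell, reads in precedents.items():
        for src in reads:
            if partition_of[src] != partition_of[cell]:
                sched[(partition_of[src], partition_of[cell])].add(src)
    return {edge: sorted(cells) for edge, cells in sched.items()}

# Tiny example: B2 and C1 live on node 1 but read cells on node 0.
precedents = {"B1": ["A1"], "B2": ["A1", "B1"], "C1": ["B2"]}
partition_of = {"A1": 0, "B1": 0, "B2": 1, "C1": 1}
print(sync_schedule(precedents, partition_of))  # {(0, 1): ['A1', 'B1']}
```

Because the communication structure is fixed, this schedule never changes at run time, which is what makes the synchronization deterministic.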

Profiling cross-partition communication is only one of many stumbling blocks for a partition optimizer. Profiling and load-balancing the computation itself is another problem. Here's something I wrote two years ago about using an economic model to profile load-balancing on heterogeneous architectures in a run-time environment.

Realistically, developers who want high performance data systems should expect some amount of manual partitioning. The goal of any partitioning tool should be to expose these issues to the user as clearly as possible and make it easy for the user to optimize the design using macros. These sorts of macros will be like a welding-torch compared to the current duct-tape.

Sunday, July 13, 2008

Power Electronics

For this entry, I'm going to take a departure from the usual microelectronics and HPC topics and focus on power electronics topics on which I have been keeping... current (forgive me).

Ultracapacitors

According to Wikipedia, Lithium Ion batteries can store 160 Wh/kg or 576 kJ/kg with maximum power output of 1800 W/kg. The wiki page on ultracapacitors indicates 120 Wh/kg as a comparison point for Lithium Ion batteries, but Li-ion is obviously somewhere in the 100-200 Wh/kg range...
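The unit conversion behind those figures is worth making explicit: 1 Wh = 3600 J, so Wh/kg becomes kJ/kg by multiplying by 3.6.

```python
def wh_per_kg_to_kj_per_kg(wh_per_kg):
    """Specific energy conversion: 1 Wh = 3600 J = 3.6 kJ."""
    return wh_per_kg * 3.6

print(wh_per_kg_to_kj_per_kg(160))  # 576.0 kJ/kg, matching the Li-ion figure above
```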

Solid-state ultracapacitors support substantially higher power than Li batteries (6 kW/kg from wiki), though their energy per weight is often an order of magnitude lower. This means they can be charged and discharged much more rapidly than a battery, but hold much less energy. Since there are no chemical processes, the number of charge cycles for the system is many times greater than for a Li battery.

MIT Professor Joel Schindall uses aligned nanotubes to increase the surface area for charge storage in an ultracap and aims for 30-60 Wh/kg. Read his article from the IEEE spectrum.

EEStor makes ultracapacitors in an ultrasecretive fashion, but the claimed results beat Li batteries in terms of weight to energy stored by a factor of 2 (200-300 Wh/kg or 1 MJ/kg). Zenn Motors is apparently using this in their electric vehicle offering.

A few months back I was reading about graphene transistors. Capacitors using graphene monolayers seem like a good idea. An Indian team reports 31.9 Wh/kg.

Solar Power

The next several years will be the early adoption period for solar energy. Clever mayors will propel political careers on their ability to successfully deploy solar energy in their towns.

Making Solar-Tiles-On-Every-Roof happen is like making One-Laptop-Per-Child happen: the problem isn't just finding the cheapest solar tile you can make so much as having a scalable deployment plan (if you spend too much time making really cheap laptops, you forget that the real problem is the per-child part). A scalable solar plan should put grid-tied panels on roofs while appreciating accelerating returns from volume and adapting to a variety of local financial incentives.

Israeli Bank Hapoalim offers a loan program and Israel Electric offers grid-tie incentives to encourage the adoption of Solar technology.

Without established best practices in these loan and incentive programs, there's a lot of difficult ROI math to convince people to pay for a panel. It seems like the best solution to get solar electricity to market is to provide free solar panels and grid-tie installation in exchange for a mortgage redeemable in sub-market-price electricity. Such bonds could tie together with "weather risk bonds" to form a new type of financial product allowing you to invest in solar energy by essentially renting someone's roof to profit off the power generation (roof-renting actually makes sense in more than one way in a foreclosure market). States and cities could provide tax incentives on this type of investment vehicle to make it more attractive against other financial products and to encourage people to invest in regional solar power.

An interesting HPC side note related to weather bonds: I wonder if our superb supercomputer models can be used to predict total solar panel output on a given day...

Superconductors

A superconducting electric car has been built in Japan. The concept seems sound: higher current density means you can use a lighter conductor for the same magnetic field, not to mention the efficiency boost from zero electrical resistance. There's obviously the added liquid nitrogen tank too, which is also useful when you're trying to get away from a T-1000. They claim 10% more range. I wonder if their suspension system uses the Meissner effect...

It seems plausible that the superconducting electric motor could also be used to efficiently compress liquid nitrogen while the thing is recharging too (I have the same inclination to insist on FPGA acceleration for FPGA accelerator compilation).

Nuclear Fusion

Google Tech Talk about Bussard's Polywell Fusion Reactor. The majority of Fusion research is in toroidal Tokamak magnetic containment structures. Bussard's comment is that we have had all these problems generating sustainable fusion with Tokamaks, and yet we look up and see thousands of fusion examples, none of which are toroidal. Instead they are all held together using gravity: a radial 1/r^2 law.

To achieve a similar 1/r^2 radial field, we can create an electron ball in the center of a reactor and use its field to draw fusible ions close to each other. This idea, called inertial electrostatic confinement, had been explored by Bussard's colleagues in the development of the fusor. The fusor used an electrode cage to create the electric field for confinement, but collisions with the cage resulted in a net energy loss and a burnt-up electrode cage.

The insight in the Polywell reactor is that trapping electrons in a magnetic field is a lot easier than trapping fusible ions as in a Tokamak (because an electron weighs very little). The Polywell creates an electron "wiffle ball" using a tetrahedral arrangement of coils to avoid the heating problem of cage-based designs. In his Google Tech Talk video, Bussard describes the engineering challenges and insights discovered during the various design iterations. He died shortly after receiving funding to continue the program, but the work is still continuing.

When you see him present the history of Fusion research and the reason things are the way they are with the massive Tokamaks dominating research, you can tell he just knows that controlling ion momentum in magnetic fields "won't produce fusion power, but it will produce great physics." And so the funding continues.

He knew he was on to something with inertial electrostatic confinement using magnetic fields so he patented the design. Then he improved his design by using circular rings instead of square rings, and then ideally spacing his non-ideal rings. I wonder if they could modulate the current in the rings to produce dynamic stabilization and try to narrow the cusps.

You almost want to build a really large one with superconducting electromagnets and use it to hold a small star in place.

Tuesday, July 01, 2008

FPGA Editor and Google Maps

Another Israeli FPGA guy named Eli Billauer produced a video explaining how to use FPGA Editor. Now that you've watched that video, load up FPGA Editor and click "Help -> Help Topics" to compare your learning experience. Video feature documentation is a great project for an intern.

FPGA Editor should feel more like Google Maps. Consider the shared application specification for FPGA Editor and Google Maps: you must navigate a complex map of billions of paths and locations, search for specific locations by name or nearby locations by keyword/type (I think FPGA Editor has this), easily access external information about specific nodes, and auto-route optimal paths between nodes.

Perhaps the biggest difference between the application specs is that FPGA Editor must satisfy the additional constraint that multiple nets may not share the same path, and its fan-out distribution differs from the traveling-salesman routing done in Google Maps.

In any event, Google Maps should probably be considered the "standard" interface for these types of things. I wonder how easily searching for "nearby restaurants" could become "nearby blockrams."