OpenFPGA forked a child called Open Accelerator to organize activities in open accelerated computing. Sounds like my business plan. Let me talk about the hard problem: partitioning.
If you're like me, when you got your heterogeneous supercomputer out of the box, you probably produced some mixture of OpenGL, CUDA, Verilog, pthreads or spethreads, Python and duct-tape. We'll get back to the duct-tape.
The problem of accelerated computing development is working simultaneously with both control flow graphs (CFGs) and dataflow graphs (DFGs) on both control-flow engines (CPU) and data-flow engines (FPGA). I've argued that a spreadsheet framework will take care of the language problem. Even people who don't consider themselves programmers are familiar with the semantics of mixing dataflow in cells with control flow commands in macros. We also know how to translate between control flow and dataflow descriptions. Now it's easy to think-outside-the-box, but actually building a good box for other people to think inside of is not so easy.
The devil is in the duct-tape.
A NUMA without global cache-coherence is too difficult to program without someone hiding the synchronization protocol behind simple spreadsheet semantics. Once you partition cells into separate processing node, you have to worry about communicating precedents across partition boundaries. The easy option is to perform a global commit cycle for all shared cells. This works fine when you're only using multithreading to hide memory latency in a single to quad core or if you require all-to-all communication.
If you want to make a real multicore spreadsheet, after you partition the spreadsheet, you'll want to automatically generate code to manage data coherency explicity passing messages between communicating partitions. This problem is actually very similar to the routing part of the place-and-route problem though efficiency is even harder to profile. Optimal partitions need to minimize the total number of consumers for each producer, as well as minimize the total amount of data consumed. We're saved a bit by the fact that spreadsheet dataflow designs tends to have fixed, regular communication structure. As a result it is posisble to generate a deterministic sychronization procedure for any partition.
Profiling cross-partition communication is only one of many stumbling blocks for an partition optimizer. Profiling and load-balancing the computation itself is another problem. Here's something I wrote from two years ago about using an economic model for profiling the load-balancing on heterogeneous architectures in a run-time environment.
Realistically, developers who want high performance data systems should expect some amount of manual partitioning. The goal of any partitioning tool should be to expose these issues to the user as clearly as possible and make it easy for the user to optimize the design using macros. These sorts of macros will be like a welding-torch compared to the current duct-tape.