I have a test in computer archi-torture tomorrow and just worked a problem that asked us to determine the minimum number of cycles needed to execute a typical iteration of a particular loop on a superscalar processor with out-of-order execution, under different assumptions. In the "best case" we were allowed to use as many functional units and memory ports as required, and could fetch and commit as many instructions as we wanted.
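In that best case, the answer reduces to the loop's dependence structure: with unlimited resources, the minimum cycles per iteration is set by the longest loop-carried dependence chain, while finite units add a resource bound. A minimal sketch of both bounds (the instruction mix, latencies, and machine parameters below are made-up assumptions, not the actual exam problem):

```python
# Sketch: lower bounds on cycles per iteration for a software-pipelined loop.
# The minimum initiation interval is max(resource bound, recurrence bound).
# All numbers here are assumed for illustration.
import math

def resource_mii(uses, units):
    """Resource bound: ceil(uses / units) for the most contended resource class."""
    return max(math.ceil(u / n) for u, n in zip(uses, units))

def recurrence_mii(cycles):
    """Recurrence bound: for each dependence cycle, ceil(latency / distance)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

# Assumed loop body: 4 loads + 2 stores (memory), 2 FP adds; one loop-carried
# dependence with total latency 3 and iteration distance 1.
uses  = [4 + 2, 2]   # memory ops, FP ops
units = [2, 1]       # 2 memory ports, 1 FP unit (an assumed modest machine)
rec   = [(3, 1)]     # (latency, distance) of each dependence cycle

mii = max(resource_mii(uses, units), recurrence_mii(rec))
print(mii)  # 3 -- with unlimited units, only the recurrence bound would remain
```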
As the number of memory ports, functional units, and in-flight instructions rises, the overhead of managing the re-order buffer and register renaming on such behemoth hardware precludes hardware scheduling as the optimal strategy. Since we nearly always run code that has executed before, it seems an awful waste of energy to redo the scheduling in hardware just to support random instruction issue. Granted, multi-threading on a single core produces the effect of random issue, but especially as we transition toward multi-core, the assumption of random issue should be revisited.
Reduction of scheduling overhead can be accomplished by running the data-flow scheduler as a software process, either during compilation or when an instruction page is loaded into the instruction cache. If we use software compilation for such a VLIW processor, the operating system must manage a file system of OoOE schedules for every function executed, so that an instruction page fetch automatically fetches the system-optimized schedule. This is practical because bulk memory for such storage is cheap.
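The core of such a software scheduler is straightforward: walk the data-flow graph and, each cycle, issue everything whose operands are ready. A minimal list-scheduler sketch (the instruction names, dependences, and latencies are assumptions; a real pass would also respect issue-width and port limits):

```python
# Minimal list scheduler: given a dataflow graph, issue every ready
# instruction each cycle, as a software pass targeting a wide/VLIW
# machine with unlimited units would.

def list_schedule(deps, latency):
    """deps: {instr: [predecessors]}; latency: {instr: cycles}.
    Returns {instr: issue_cycle}."""
    finish, issue = {}, {}
    remaining = set(deps)
    cycle = 0
    while remaining:
        # An instruction is ready once all its producers have finished.
        ready = [i for i in remaining
                 if all(p in finish and finish[p] <= cycle for p in deps[i])]
        for i in ready:
            issue[i] = cycle
            finish[i] = cycle + latency[i]
            remaining.discard(i)
        cycle += 1
    return issue

# Assumed example: a load -> add -> store chain plus an independent multiply.
deps    = {"ld": [], "add": ["ld"], "st": ["add"], "mul": []}
latency = {"ld": 2, "add": 1, "st": 1, "mul": 3}
sched = list_schedule(deps, latency)
print(sched)  # ld and mul issue at cycle 0; add at cycle 2; st at cycle 3
```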
With infinitely many renaming slots, the execution schedule in a superscalar processor resembles a Verilog netlist for the optimal pipeline of the system. Supporting a large number of virtual registers would require "delocalizing" the register files. Mapping virtual registers to physical registers then becomes more akin to FPGA place-and-route and may be optimized using similar algorithms.
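In that framing, each virtual register wants to land in a register bank near the functional unit that produces or consumes it, subject to bank capacity. A toy greedy-placement sketch of the idea (the bank layout, coordinates, and cost model are all assumptions; real place-and-route would use iterative methods like simulated annealing):

```python
# Toy placement of virtual registers onto banked ("delocalized") register
# files, in the spirit of FPGA place-and-route: greedily pick the cheapest
# bank with a free slot. Geometry and costs below are assumed.

def place_registers(vregs, bank_slots, cost):
    """vregs: list of (name, producing_unit); bank_slots: {bank: free slots};
    cost(unit, bank): wire-distance estimate. Returns {vreg: bank}."""
    free = dict(bank_slots)
    placement = {}
    for name, unit in vregs:
        best = min((b for b in free if free[b] > 0),
                   key=lambda b: cost(unit, b))
        placement[name] = best
        free[best] -= 1
    return placement

# Two functional units at x=0 and x=4; two banks at x=1 and x=3, 2 slots each.
unit_x = {"fu0": 0, "fu1": 4}
bank_x = {"b0": 1, "b1": 3}
cost = lambda u, b: abs(unit_x[u] - bank_x[b])

vregs = [("v0", "fu0"), ("v1", "fu0"), ("v2", "fu1"), ("v3", "fu0")]
placement = place_registers(vregs, {"b0": 2, "b1": 2}, cost)
print(placement)  # v0, v1 fill the near bank b0; v2, v3 spill to b1
```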
We must also provide a variety of local caches to support loading and storing of operands and results, with the ability to escape to a main memory system on a cache miss (perhaps an on-chip level-2 cache that triggers an interrupt on a second miss).
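The escape path can be sketched as a two-level lookup where the second miss raises an exception, standing in for the interrupt that hands the fill off to software. Sizes, latencies, and addresses below are assumptions for illustration only:

```python
# Toy model of the miss-escape path: hit in L1, else L2 (filling L1 on the
# way back), else raise -- a stand-in for the interrupt on a second miss.

class MissEscape(Exception):
    """Both cache levels missed; a software handler fetches from main memory."""

def access(addr, l1, l2):
    if addr in l1:
        return l1[addr], 1       # L1 hit: ~1 cycle (assumed)
    if addr in l2:
        l1[addr] = l2[addr]      # fill L1 on the way back
        return l2[addr], 10      # L2 hit: ~10 cycles (assumed)
    raise MissEscape(addr)       # second miss: escape to software / DRAM

l1, l2 = {}, {0x100: 42}
memory = {0x100: 42, 0x200: 7}   # assumed backing store

print(access(0x100, l1, l2))     # (42, 10) -- L2 hit, fills L1
print(access(0x100, l1, l2))     # (42, 1)  -- now an L1 hit
try:
    access(0x200, l1, l2)
except MissEscape as e:
    addr = e.args[0]
    l2[addr] = l1[addr] = memory[addr]   # handler services the miss
print(access(0x200, l1, l2))     # (7, 1)
```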
Once we accept software-optimized scheduling, it's a small logical step to say that our hardware should allow some amount of reconfigurability to maximize functional-unit utilization. The structure of such a device would more closely resemble an FPGA than a single-core superscalar processor.
All of this ignores branching of course, which presents an interesting and complicated control problem. More on this another day...