/. points to an IEEE Spectrum article title "Multicore Is Bad News" which discusses the bottleneck associated with getting data from memory into the many cores of a multicore processor.
This is primarily an issue with our programming model we are trying to force on the hardware and not a problem with the real capacity of such hardware. The article specifically says that many HPC processes map cleanly to grids thus address the locality problems of multicore arrays. I'm a broken record: the problem of locality optimization for mapping dataflow graphs to multicore arrays requires a transition of our primitive computer model from instruction stream executers to spreadsheet iterators.
The article sums up the problem: "Although the number of cores per processor is increasing, the number of connections from the chip to the rest of the computer is not." This is not necessarily true, since we are seeing increased number of pins on a chip, but it is also not the main issue.
Even without increasing the rate of dataflow through the processor, we can drastically improve power-performance by slowing down our processing pipelines and distributing load across multiple slower cores. One-thousand 1 MHz cores can perform many of the same tricks as a single 1 GHz core for far fewer Joules. This physical truth is starkly contrary to the popular notion that we should increase utilization to improve performance. Since Moore's law says that transistors for ever more cores will continue to cost less and less (and I predict this will continue long after dimensional scaling ends, due to optimizations in the fabrication methods), we can achieve a decline in operating expense to sustain our growing hunger for computing resources. This requires us to fundamentally change our approach: we cannot continue to expect faster execution, we can only expect more execution. Exponentially more.
Once the industry discovers that it has to transform from isolated general purpose problem solvers to networked special purpose assembly lines for a particular problem, I expect we will seem the rise of FPGAs and reconfigurable dataflow pipelines. Any application that can use many thousands of CPUs effectively will probably use many thousands of FPGAs an order of magnitude more efficiently. Its becuase of the relationship between the granularity/number of cores and the degree to which we can optimize for locality. All the extra dispatch hardware and cache management logic is unnecessary in deterministic dataflow pipelines. We simply need more dense functional area and more pipes for data.