The future of high-performance processing is to mix things up: heterogeneous multiprocessing is here to stay.
Almost ten years ago, Chris Rowen, president and CEO of Tensilica, made a prediction about the future of system-on-chip (SoC) design, dominated then by hardwired logic. 'SoC will become a sea of processors. You will have ten to maybe a thousand processors on a chip. The individual processors are tiny, so you will be able to do that.'
Fast-forward ten years and some of that has happened. Take apart a graphics processing unit (GPU) from the likes of AMD or nVidia and it will often have more than ten processor cores inside it. In 2008, Cisco unveiled its QuantumFlow processor, which deploys 40 cores, each of which can perform up to four operations at once.
At the University of North Carolina, Arun Ravindran and colleagues used field-programmable gate arrays to put 40 custom processors in a systolic array that could massively speed up the laborious process of matching genetic sequences. Rather than trying to use general-purpose processors to perform the entire algorithm, the researchers split the algorithm into pieces with some running on a host processor while the meat of the kernel could be deployed on dedicated string-matching processors implemented in programmable logic.
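The kind of split the researchers made can be sketched in software. The code below is an illustrative sketch only, not the UNC design: it computes local-alignment scores for two sequences one anti-diagonal at a time, the update order a systolic array exploits, since every cell on a diagonal can update in the same clock cycle. Function names and scoring values are the author's assumptions.

```python
# Hedged sketch of systolic-style string matching: scores are computed
# wavefront by wavefront (anti-diagonals), mirroring how an array of
# hardware cells would update in lockstep. Scoring values are illustrative.

def match_score(a, b, match=2, mismatch=-1):
    return match if a == b else mismatch

def smith_waterman_wavefront(query, ref, gap=-2):
    """Best local-alignment score, computed one anti-diagonal at a time.
    On a systolic array each cell on a diagonal updates in parallel;
    here we simply iterate over them in software."""
    m, n = len(query), len(ref)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):            # anti-diagonal index i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            H[i][j] = max(0,
                          H[i - 1][j - 1] + match_score(query[i - 1], ref[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

In the hardware version, the host processor would handle setup and traceback while the inner cell update runs in the programmable-logic array.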
This heterogeneous mixture of soft-programmable cores with dedicated logic - a mash-up of cores - is likely to be the future of computation and much of chip design because traditional methods are getting too expensive. It was the reason why Rowen foresaw a shift from 'sea of gates' to 'sea of processors'.
Professor Mark Horowitz from Stanford University says: 'The ASIC business is dying because the design cost is upward of $20m. Writing Verilog is the wrong thing to do: we want to write something that has a longer lifetime than just a single design.'
One future is to create a new computing architecture, says Horowitz, who started a programme at Stanford to 'rethink digital design'. Horowitz says he went to a show for 'homebrew' electronics. 'At first I was offended when they had a microcontroller behind every blinky light. But then I thought: why not?'
Although a massive multiprocessor is not necessarily the endpoint for Horowitz's project at Stanford, his team has been able to experiment with the idea of a 'chip generator' using heterogeneous multiprocessors.
'We use the standard software trick of adding a layer of indirection. You can tune the resources you care about and even change the hardware support. Then you generate the optimised chip. It's a semi-custom system but moves the abstraction to a higher level,' Horowitz explains, adding that the first implementation is a multiprocessor generator. 'We are using Tensilica because it makes it easier to do.
'Using this generic multiprocessor, we looked at what happens if we do H.264 video encoding. Initially, it was 400 to 600 times slower than real time. The students worked to speed up the implementation by using generic data-parallel optimisations. That led to about an order of magnitude improvement. But to really get improvements we had to make very customised changes to the implementation for the particular application we were running. We got to within a factor of three of what the ASIC got. We've got a new design that has got us even closer.'
Horowitz's approach parallels work at Microsoft Research in the UK and the University of Cambridge. Satnam Singh of Microsoft and David Greaves at the university nearby are working on project Kiwi as part of a larger programme, Alchemy.
Singh is working on the assumption that future processors will not be a collection of many processors of the same type all connected together via a single shared memory. He has called that approach 'the path of least imagination'.
Instead, the work at Microsoft Research focuses on heterogeneous computing devices with a mixture of processing elements that includes not only different kinds of processors, such as GPUs, but also programmable logic in the form of field-programmable gate arrays (FPGAs).
These FPGAs may not use the conventional fine-grained architecture, which does not offer good logic density; instead, they may be more coarse-grained. Working with these combinations involves a change in design, but Singh reckons it is possible to have algorithms mapped automatically to a mix of processor elements.
Bill Dally, who moved from Stanford where he researched stream or 'throughput' processing to become chief scientist at nVidia last year, is betting on heterogeneous computing. However, he is less keen on the idea of incorporating programmable logic, at least as it stands today, as it may not provide enough of a speed boost for the most common applications in personal computing, which tend to focus on signal processing in audio and video.
'The real niche for the FPGA is doing things where you aren't doing integer or floating-point arithmetic. But the problem is FPGAs only run at a few hundred megahertz,' Dally says, which means a lot of processing elements are needed to maintain parity with dedicated logic which can run at more than 1GHz.
That is a double whammy for the FPGA: not only do you need a lot of processing elements, each one of them is big. John East, president of Actel, talks ruefully of the way that FPGA IP cores developed in the past decade seemed to offer ASIC users a way to build more flexibility into their devices.
'It seemed like there was going to be something that would work. Everybody wanted to talk to us about that. But it's a dry well. The minute you put the core down on the chip, the FPGA dominates the cost, power and speed of your ASIC,' East says.
A more coarse-grained architecture, as suggested by Singh, could overcome some FPGA density problems. Four-bit arithmetic, for example, works very well in biological applications because there are only four bases in DNA and 20 amino acids in proteins. This was exploited successfully by Ravindran's project. A time-slicing architecture such as Tabula's (see page 33) could also do much to improve density to the point where it makes sense to have some programmable logic on-chip to take care of those instances, such as oddball arithmetic in cryptography or pattern matching in H.264, where a more general-purpose processor is too poorly optimised.
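The density win from narrow symbols can be illustrated in a few lines. Because DNA has only four bases, each symbol fits in two bits, so a single machine word carries dozens of bases and one XOR compares them all at once; this sketch and its encoding table are the author's illustration, not any particular project's format.

```python
# Hedged sketch: packing DNA bases at 2 bits per symbol. A 64-bit word
# holds 32 bases, so narrow datapaths (or narrow FPGA cells) go a long
# way. The encoding chosen here is arbitrary, not a standard.
CODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    word = 0
    for base in seq:
        word = (word << 2) | CODE[base]
    return word

def mismatches(a, b):
    """Count differing bases between two equal-length sequences by
    XOR-ing their packed forms and scanning the 2-bit groups."""
    assert len(a) == len(b)
    diff = pack(a) ^ pack(b)
    count = 0
    for i in range(len(a)):
        if (diff >> (2 * i)) & 0b11:
            count += 1
    return count
```

A hardware cell doing the same comparison needs only a 2-bit XOR and an OR, which is why four-bit (or two-bit) datapaths pack so densely into programmable logic.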
What's important, says Dally, is to get the data in the right place. 'The real thing you want to gain is locality. Don't move data if you can avoid it. FPGAs sometimes have an advantage if you can make the data just flow through the logic. But we can often achieve a very similar effect by staging data through the execution units in our GPUs,' he claims.
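A common software analogue of the locality principle Dally describes is loop tiling: operate on a block of data many times while it sits close to the execution units before fetching the next block. The tiled matrix multiply below is a generic textbook illustration of that idea, not nVidia's implementation; the tile size is an arbitrary choice.

```python
# Hedged illustration of data locality via loop tiling: each tile of A
# and B is reused many times while it stays "close" (cache, shared
# memory or a register file) before the next tile is fetched.

def matmul_tiled(A, B, tile=2):
    """Multiply square matrices A and B block by block."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Inner loops reuse this tile of A and B repeatedly,
                # amortising the cost of moving it into fast storage.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

On a GPU the same staging happens explicitly, with tiles copied into on-chip memory shared by a group of execution units; on an FPGA the data instead flows through the logic itself.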
Dally's view, which should come as little surprise as he works at nVidia, is: 'Ultimately, we will have heterogeneous computers with a few latency-optimised processors. What people call a CPU today. The bulk of the work will be done by throughput-optimised processors: processors that are dominated by throughput rather than single-thread performance.
'A word processor? A CPU is good enough for that. But if you look at the vast majority of emerging applications, they are very parallel. Images and video. They demand throughput,' Dally adds.
Although Dally says GPUs will incorporate some elements of latency-optimised processors to improve their performance on single-threaded code, by not having to deal with caches, branch prediction and other sophisticated techniques for eking performance out of serial code, the GPU and other specialised processors can continue to outgun the CPU.
'Today we have a teraflops GPU and the CPUs are perhaps 10Gflops, a difference of 100 times today versus single core. And we are continuing. We are growing at a rate where, by 2015, we are expecting a 20-fold improvement,' Dally claims. 'And that will come through parallelism. The bulk of the performance will come through having more cores.'
The open question is how the architecture will be balanced between different types of processor. However, programmers will need to get used to the idea of a mash-up computer architecture that assembles many different execution units.