Game on for acceleration
Graphics may be the way forward for desktop processors.
When the CEO of nVidia, Jen-Hsun Huang, went into his annual meeting with financial analysts at the company's headquarters in Santa Clara, he had one thing on his mind: an apparent attempt by Intel to declare that his company and the chips it makes are living on borrowed time.
At the Intel Developer Forum in China, the company's chief technology officer, Pat Gelsinger, claimed: "today's graphics architectures are coming to an end - it's no longer scalable for the needs of the future."
On top of that, Intel executives claimed that discrete graphics processors (GPUs) will be unnecessary for many consumers in the future. In effect, Intel would take away their sockets. As Huang summarised Intel's position to the assembled analysts: "the integrated graphics market will continue to grow... nVidia will be dead soon. The logic is impeccable... nothing is worse for a company than no place to stick it."
The counter-argument from nVidia is that discrete GPUs continue to sell in high volumes - and Intel plans to sell its own Larrabee into those sockets - because they make games look better. The net result is what the industry calls 'double attach'. According to Huang, a total of 366 million graphics chips were sold, including both discrete GPUs and the graphics units in motherboard-control chipsets. But only 270 million host processors were sold. Close to 100 million PCs wound up with two graphics processors inside them last year, and only one of them gets used.
But nVidia aims to increase the influence of the GPU inside the PC, and within computer architecture in general, by turning it into an applications accelerator. The company devised a programming environment for the floating-point units inside the GPU so that they can be deployed as applications accelerators rather than being dedicated to graphics. "It's the highest-volume supercomputer ever created," Huang claimed.
Supercomputer users are beginning to agree, although it's not a market confined to nVidia and its CUDA programming environment - Intel aims to push into the market with Larrabee and its Ct programming environment and AMD's ATI unit provides programming tools for its GPUs.
Having ridden the increases in PC-processor performance for the last ten years with massively parallel architectures, supercomputer users are now looking for additional compute power.
Some have looked at the IBM Cell, some have used field-programmable gate arrays and others are experimenting with GPUs. Researchers such as Greg Peterson of the University of Tennessee's virtual centre for cyber-chemistry envision using all of them.
At the recent Many-Core and Reconfigurable Supercomputing Conference (MRSC) in Belfast, Professor Mike Giles of the University of Oxford said GPUs have the potential to pull away from PC processors: "the move to ever faster clock frequencies hit a brick wall in terms of power consumption: there were too many thermal problems and everything went multicore instead. But graphics chips were into multicore for a long time: they have up to 128 cores. I would say that for a long time they have had more floating-point capability than Intel's or AMD's [PC processor] chips."
The problem, said Giles, is that it was difficult to program them. That is changing, he claimed. He is heading up a short £200,000 project funded by the UK's Engineering and Physical Sciences Research Council to promote the use of technologies such as field-programmable gate arrays (FPGAs) and GPUs to science users.
The glory days when supercomputer makers could influence chip design, spawning custom machines built by Cray Research and Thinking Machines, are long gone. "I feel we can't be a motivating force in the development of chips," said Giles, but he noted: "Scientific computing has more in common with graphics than it has with office computing. One thing I like about GPUs is the low cost of entry."
Among scientific users there is one big concern: today's GPUs are designed to run single-precision floating-point calculations efficiently. But scientific users like being able to use double precision. "Do we need single precision or double precision?" asked Giles. "My memory of developing CFD [computational fluid dynamics] codes is that we went to double precision when it became available at no more cost. I didn't feel there was that much of a need. It was just a no-brainer if it was there for no extra cost. But with single precision, I remember the queasy feeling when getting different results from what I expected: was it due to round-off or some compiler bug that was going to bite me with another test case?"
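The round-off worry Giles describes is easy to reproduce. The sketch below is purely illustrative and is not taken from any of the codes discussed here: it adds 0.1 one million times, once in double precision and once with every intermediate result rounded to single precision. The function names are made up for the example, and Python's struct module stands in for single-precision hardware.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def running_sum(step, count, single):
    """Add `step` to a running total `count` times.

    With single=True, the step and every partial sum are rounded to single
    precision, mimicking arithmetic on a single-precision unit.
    """
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_single(total)
    return total


if __name__ == "__main__":
    exact = 100000.0
    for label, single in (("double", False), ("single", True)):
        total = running_sum(0.1, 1_000_000, single)
        print(f"{label}: total={total!r} error={abs(total - exact):.6g}")
```

The single-precision total drifts much further from 100,000 than the double-precision one: once the running total is large, each addition of 0.1 is rounded to a comparatively coarse grid, and the rounding errors accumulate.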
At the University of Manchester, David Bailey's group has been using ATI cards to simulate the behaviour of particle accelerators. He noted that nVidia has taken a lead over AMD's subsidiary: "everyone is jumping onto CUDA because the [equivalent] AMD stuff hasn't appeared yet. We have been using the Brook compiler environment. It is a bit long in the tooth but it works with most programmable GPUs," Bailey said at MRSC.
"Do we see an improvement in speed? Yes we do," Bailey claimed. The overall gain compared with an implementation running just on the host Opteron was a factor of four. "We are not seeing massive speedup factors. I think that is because we are doing a lot of copies between host and memory. You get a latency hit every time you set up to do that copy."
The issue with GPUs is that, to obtain maximum benefits, you have to spawn many, many threads of control to hide the cost of memory accesses by the GPU. But you have to trade this off against the amount of fast shared memory on the GPU itself: it is easy to run out of it with a lot of threads running. If the application is rewritten to take account of these constraints, it is possible to see speedups of a hundredfold. However, George Constantinides of Imperial College London warns that the performance of GPUs is "brittle".
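The trade-off between thread count and shared memory can be put into numbers. The following is a minimal sketch, not a vendor formula: it assumes threads are launched in fixed-size blocks, that each block reserves a fixed amount of shared memory, and that one multiprocessor offers 16KB of shared memory and room for 768 resident threads. Those two limits, and the function name, are illustrative assumptions rather than figures from this article.

```python
def resident_threads(threads_per_block, shared_bytes_per_block,
                     shared_bytes_available=16 * 1024,
                     max_resident_threads=768):
    """Estimate how many threads can be resident on one multiprocessor.

    The two default limits are illustrative assumptions. Whole blocks only:
    a block that does not fit completely is not scheduled.
    """
    # Limit imposed by the thread budget alone.
    blocks = max_resident_threads // threads_per_block
    # Limit imposed by shared memory, if the block uses any.
    if shared_bytes_per_block > 0:
        blocks = min(blocks, shared_bytes_available // shared_bytes_per_block)
    return blocks * threads_per_block


if __name__ == "__main__":
    # The same 256-thread block with growing shared-memory demands.
    for shared in (4096, 8192, 12288):
        print(shared, "bytes per block ->",
              resident_threads(256, shared), "resident threads")
```

Under these assumptions, tripling the shared memory each block reserves cuts the number of resident threads to a third, leaving fewer threads to hide the cost of memory accesses.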
Working on a simulation to see how radiation damages DNA, Nico Sanna of the Italian interuniversity supercomputing consortium CASPUR found that the speedup can depend greatly on the data itself. "The performance changes radically based on the molecules being simulated themselves. That is a very important message: these codes change," says Sanna. The problem was that the proportion of the functions that could be accelerated by a GPU within the code varied dramatically. For methane, an accelerated exponential function accounted for 30 per cent of the workload; for the much larger fullerene molecule, it was just 2 per cent.
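Amdahl's law shows why those proportions matter so much. The sketch below assumes that the quoted percentages are shares of total run time and that the GPU makes the exponential function ten times faster; the ten-fold figure is a hypothetical chosen for illustration, not a measurement reported by Sanna.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: whole-program speedup when `accelerated_fraction` of the
    run time is made `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


if __name__ == "__main__":
    hypothetical_factor = 10.0  # assumed GPU gain on the accelerated function
    for molecule, fraction in (("methane", 0.30), ("fullerene", 0.02)):
        gain = overall_speedup(fraction, hypothetical_factor)
        ceiling = 1.0 / (1.0 - fraction)  # limit for an infinitely fast GPU
        print(f"{molecule}: {gain:.2f}x overall, at most {ceiling:.2f}x")
```

With these assumptions the methane run gains about 1.37 times overall and the fullerene run about 1.02 times. Even an infinitely fast accelerator could not take them past roughly 1.43 and 1.02 times respectively.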
However, some believe that it is the multicore host processors that are running out of steam and that the only commercial devices with potential are GPUs. John Michalakes, lead software developer for weather research and forecasting at the US National Center for Atmospheric Research, said during nVidia's analyst day that supercomputer users "are standing at the threshold of the petascale", with supercomputers able to perform 1 petaflop/s. Yet such power may not actually be exploitable by forecasters.
"As we reach petascale we are hitting a crisis or tipping point. Supercomputers are using more and more commodity CPUs. Ten years ago there were maybe 1,000. To get up to petascale they are talking about getting up to 100,000 or one million processors. That size of machine is only reasonable if you don't care about the time to solution. We can't increase the size of the problem to scale our way out of the inefficiency that means. The solution is faster processors, not more of them. It is good for us that CUDA has come along at this time."
Michalakes says the centre has adapted some of the weather-forecasting software to run on GPUs. "The part we adapted saw a ten-fold increase in performance. It is only 1 per cent of the code, but we are seeing a 10 per cent speedup. We are adapting a larger percentage of the code to increase the speed of the prediction without increasing the size of the problem," he explains.
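Those figures fit together if the adapted 1 per cent of the code accounts for far more than 1 per cent of the run time. A minimal sketch, assuming that "a 10 per cent speedup" means the whole forecast runs 1.1 times faster and that the adapted part runs exactly ten times faster, inverts Amdahl's law to estimate that share. The function names are illustrative.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def runtime_fraction(overall, factor):
    """Invert Amdahl's law: the share of the original run time that must have
    been accelerated `factor` times to give a whole-program speedup `overall`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / factor)


if __name__ == "__main__":
    share = runtime_fraction(overall=1.1, factor=10.0)
    print(f"accelerated share of run time: {share:.3f}")  # about 0.10
    # Round trip: plugging the share back in recovers the 1.1x speedup.
    print(f"check: {overall_speedup(share, 10.0):.3f}x")
```

On that reading, the adapted 1 per cent of the code was responsible for roughly a tenth of the original run time, which is why accelerating so small a part pays off.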
The performance gains made possible by GPUs in supercomputing may feed back into the architecture of the mainstream PC or even embedded computer. Architects are wondering what shape a future processor will take now that the multicore philosophy is embedded into most processor roadmaps. It may not just be host processors plus arrays of graphics-oriented floating-point processors. Constantinides points out that the main strength of FPGAs in high-performance computing lies in the ability to reorganise the internal structure of a machine to feed data to processors at high speed instead of forcing them to make continual memory requests. This can make FPGAs much better at sustaining performance compared with processors, he says, as they do not have to deal with the penalties of cache misses. The algorithm ends up being designed to tolerate a high-latency memory subsystem. The result, claims Constantinides, is that "gigaflops per watt are very good for FPGAs versus GPUs".
One possibility advanced last year at the IET FPGA Developer's Forum by Satnam Singh, researcher at Microsoft Laboratories in Cambridge, is that the future will be a heterogeneous processor made from conventional processors, GPUs and programmable logic.
The emergence of the heterogeneous multiprocessor architecture will force the software and hardware design communities to come together, Singh argues. But it will mean a more extensive change to hardware-description languages. He claimed that today's hardware-design languages, from Verilog to SystemC, have the wrong semantics for the kind of development he envisages.
"It is going to be a mainstream task to program these things. And it will be done by normal people," Singh claimed. The question that Intel and nVidia will want to answer is: who will own that processor socket?
Intel's foray into graphics with Larrabee points in the direction of the GPU being absorbed into the host processor. However, through a deal with Via Technologies, nVidia has already made an x86 processor with onboard graphics, aimed at sub-notebook PCs.
"The GPU is poised to be a disruptive technology," claims Huang, adding that the combined processor used a new design. "Disruptive technology tends to come up from the bottom. It is easier to double each year than to shrink something down and throw away the legacy stuff. We didn't take a GeForce core and say: 'Make it 2W'. We started from zero and the GPU on here dissipates almost zero watts, just a couple of hundred milliwatts."
Although the GPU may support higher performance at lower power, applications will determine who ends up owning the PC's main socket. If only a small subset of specialist programs need the GPU, the dominance is likely to remain with Intel and other host processor suppliers. But, if more software writers adopt environments such as CUDA, the balance of power could easily shift.