Exascale computers aren't as far-futuristic as you may think - you could build one now, so long as you have a dozen nuclear power stations to run it, reports E&T.
Supercomputer users face a dilemma. The core technology to keep the performance growth in their systems is on track - they just won't be able to afford to run them.
The exponentials that underpin progress in supercomputing are taking over. The jump from 10^15 floating-point calculations per second (1 petaflop) to 10^18 per second (1 exaflop) will demand dramatic changes in both hardware and software design; otherwise the potential gains will not be realised.
Exascale computing attempts to move computing beyond petascale. The initiative has been endorsed by the US Department of Energy and the US National Nuclear Security Administration. As with petascale, exascale's core applications would lie in computation-intensive areas, such as biology, earth science, energy, engineering, and materials science. Few doubt that exascale will become a reality - quite how and when is being debated. The challenges of achieving a thousandfold increase over petascale computing capabilities are considerable.
Wilfried Verachtert, high-performance computing project manager at Belgian research institute IMEC, says supercomputer design runs on a roughly ten-year cycle: 'In 1997, we saw the first terascale machines. A few years ago, petascale appeared. We will hit exascale in around 2018.'
Verachtert adds: 'People ask me, "do we need them?" The simple answer is "yes". There are lots of computationally difficult problems that we need to run.'
Martin Curley, senior principal engineer and director of Intel Labs Europe, agrees: 'Exascale computing can really change our world.' Although Curley claims faster supercomputers can play a role in creating more environmentally friendly technologies, he realises these gigantic machines have their own sustainability problems. 'An exascale computer has the equivalent power of 50 million laptops. Stacked on top of each other, they would be 1,000 miles high and weigh more than 100,000 tonnes.'
Verachtert says the power demand for an exascale computer made using today's technology would keep 14 nuclear reactors running. 'There are a few very hard problems we have to face in building an exascale computer. Energy is number one. Right now we need 7,000MW for exascale performance. We want to get that down to 50MW, and that is still higher than we want.'
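Verachtert's figures can be turned into energy per operation with a little arithmetic. The sketch below uses only the numbers quoted above (7,000MW, 50MW, 14 reactors and 10^18 operations per second); the function and variable names are illustrative and not taken from IMEC.

```python
EXAFLOP = 1e18  # floating-point operations per second


def joules_per_flop(power_mw: float, flops: float = EXAFLOP) -> float:
    """Energy spent on each floating-point operation, in joules."""
    return power_mw * 1e6 / flops


today = joules_per_flop(7_000)  # power quoted for today's technology
target = joules_per_flop(50)    # power target quoted by Verachtert

print(f"today:  {today * 1e12:.0f} pJ per operation")
print(f"target: {target * 1e12:.0f} pJ per operation")
print(f"required efficiency gain: {today / target:.0f}x")
print(f"implied output per reactor: {7_000 / 14:.0f} MW")
```

On these numbers, each operation would have to cost roughly 140 times less energy than it does today to hit the 50MW target.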
Up to now, supercomputer designers have added more and more processors, taking advantage of Moore's Law wherever possible to put more on each chip. To get an exascale computer means changing the way computers process data in the hope of reducing the energy they demand to manageable levels; and programmers will feel the impact this time.
Although it is not yet clear what an exascale supercomputer will need, one thing is certain: it will have a lot of processors. 'The only way to get there is through billion-operation parallelism,' says Curley; while Verachtert asserts: 'We need roughly a million to ten million processor cores.'
Each processor chip sitting inside the racks may have hundreds or thousands of processor cores integrated on it, each one running parallel data streams internally to reach Curley's billion-operation parallelism. As exascale machines are scheduled to appear around eight to ten years from now, silicon technology looks able to deliver this increase. Intel's latest processors, for example, are made with feature sizes as small as 30nm across. The silicon technology available by 2018 will have features as small as 10nm across and be able to squeeze on close to 20 times as many circuits or processors as is possible today.
'If you look out to 11nm, we see clear ways to get to order of 5,000 cores on-chip,' says Professor Bill Dally of Stanford University, and chief scientist at graphics chipmaker NVIDIA, a company viewed by some as occupying a leadership position in resolving some key issues that exascale computing presents.
SGI, manufacturer of mainframes, supercomputers, and high-powered workstations, has begun experimenting with low-power processors in supercomputers. It is using the Atom processors Intel developed for handheld computers in place of the Xeons that are used in its top-end Altix UV machines. The Altix UV machines are used by, among others, Professor Stephen Hawking's Computational Cosmology Consortium (COSMOS) team based at the University of Cambridge, which selected an SGI Altix UV 1000 platform to support its research.
'Recent progress towards a complete understanding of the universe has been impressive, but many puzzles remain,' comments Professor Hawking. 'Cosmology is now a precise science. We need supercomputers to calculate what our theories of the early universe predict and test them against observations of the present universe.'
Supporting up to 16TB of global shared memory in a single system image, Altix UV is based on Intel Xeon processors, and enables scaling from 32 to 2,048 cores with architectural provisioning for up to 262,144 cores.
'We adapted these low-power processors to have a lot of them,' says Eng Lim Goh, CTO of SGI. 'We can exploit the fact that power consumption increases with clock speed. If certain codes are highly parallelisable, why not use many more slower processors? Then you gain the advantage of exponentially lower power consumption.'
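Goh's argument rests on how power scales with clock speed. A common first-order model of CMOS dynamic power is P = C x V^2 x f; if the supply voltage is assumed to scale in proportion to frequency, power per core grows with the cube of the clock speed (a steep polynomial rather than a true exponential). The sketch below applies that assumption to perfectly parallel code and ignores leakage and communication costs. The model and function names are illustrative, not SGI's.

```python
def relative_power(clock_ratio: float) -> float:
    """Dynamic power relative to a baseline core, assuming P ~ C * V^2 * f
    with supply voltage scaling linearly with frequency, so P ~ f^3."""
    return clock_ratio ** 3


def many_slow_cores(slowdown: float) -> tuple:
    """Replace one fast core with `slowdown` cores, each clocked at
    1/slowdown of the original speed. Returns (cores, total relative power)
    for the same aggregate throughput on perfectly parallel code."""
    cores = slowdown
    total_power = cores * relative_power(1.0 / slowdown)
    return cores, total_power


for s in (1, 2, 4):
    cores, power = many_slow_cores(s)
    print(f"{cores} core(s) at 1/{s} clock: {power:.4f}x the baseline power")
```

Under these assumptions, halving the clock and doubling the core count keeps the throughput while cutting power to a quarter. Real processors save less, because voltage cannot be lowered indefinitely and static leakage does not shrink with clock speed.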
Making use of many processors, however, is not always easy, as Intel's Martin Curley points out: 'Even with just 10 to 12 cores, we see the performance of commercial microprocessors begin to degrade as we add more. The biggest single challenge we have is exploiting parallelism.'
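One standard way to reason about these diminishing returns is Amdahl's law, which caps the speed-up of a program by the fraction of it that must run serially. The sketch below is a textbook model rather than Intel's analysis, and the 5 per cent serial fraction is a hypothetical value chosen for illustration. Amdahl's law only predicts that gains level off; the outright degradation Curley describes also involves contention for memory and interconnect, which this model leaves out.

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Ideal speed-up on `cores` processors when `serial_fraction`
    of the work cannot be parallelised (Amdahl's law)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


SERIAL_FRACTION = 0.05  # hypothetical: 5 per cent of the code is serial

for cores in (1, 12, 1_000, 1_000_000):
    speedup = amdahl_speedup(cores, SERIAL_FRACTION)
    print(f"{cores:>9,} cores: speed-up of {speedup:.2f}")
```

With a 5 per cent serial fraction, the speed-up can never exceed 20, however many cores are added, which is why exascale designs put so much weight on exposing parallelism in the software.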
Goh agrees with this proposition - up to a point: 'Some problems are at the opposite end of the spectrum. These kinds of problems will not be applicable to the lower-power processors because they are too communications-intensive. You have to have more local accelerators to reduce the communications cost of passing information, rather than just increasing the processing capability.'
Stanford University's Bill Dally reckons that, ultimately, 'we will have heterogeneous computers with fewer latency-optimised processors: what people call a CPU today. The bulk will be done by a throughput-optimised processor where computational speed is dominated by overall throughput rather than single thread performance.'
Professor Dally devised the idea of the stream processor while at the Massachusetts Institute of Technology (MIT), and the concepts have been embraced most enthusiastically by the makers of graphics processing units (GPUs). The compute engines in GPUs were originally used to work out how 3D objects are shaded, and are heavily multithreaded to hide the latency of memory accesses, which optimises them for throughput provided there are enough threads to keep them occupied. These shader engines have gradually mutated into floating-point units capable of running scientific code.
Field-programmable gate arrays
The GPU is not the only option. SGI, for example, has worked with field-programmable gate arrays (FPGAs) - chips that can be reconfigured on the fly with different circuits. Goh says FPGAs are poorly suited to floating-point-intensive code because floating-point units cannot be implemented efficiently on the reprogrammable fabric. But with short word-lengths and search- or logic-intensive code, FPGAs come into their own because they allow thousands of custom processors to be built in parallel, used, and then scrapped when no longer needed.
Steve Teig, president and CTO of FPGA specialist Tabula, says the architecture of the FPGA makes it possible to rethink how data moves around a computer. Instead of bringing data to the processor, you can reverse the process. 'It ultimately becomes cheaper to reconfigure and compute in place than to send the data somewhere else for computation. I'm trying to look 25 years ahead but at a certain point in time, devices will be designed this way because it takes less energy to reconfigure on the fly than to send the data far away, perform the computation and return it.'
Dally agrees: 'The real thing you want to gain is locality. Don't move data if you can avoid it. But we can do it in a more efficient way than on FPGAs.' However, he adds that FPGAs have a problem with speed: they run at a few megahertz and, right now, are still quite power hungry.
In the meantime, SGI is taking a different approach: using communication accelerators to try to hide the problems caused by parallel algorithms that need to share data among processors.
'With certain codes, no matter how hard you try, you have so much communication involved. That is why we decided to offload communications onto an accelerator,' says its CTO Eng Lim Goh. 'In one application that performed airflow analysis over a truck running across a thousand cores, 40 per cent of the time spent was actual work. The remaining 60 per cent was spent elsewhere and the bulk was on communications. It was not the case with 512 processors, but the moment it hit 1024, the balance shifted. Communications overhead overwhelms the problem. That is why we introduced communications acceleration.'
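Goh's figures can be read as parallel efficiency: if only 40 per cent of the time on 1,024 cores is useful work, the job is getting the output of roughly 410 fully busy cores. The toy model below shows how such a tipping point can arise. It assumes compute time falls as 1/cores while communication time grows linearly with the core count. Both that scaling law and the constant, which is calibrated to reproduce the quoted 40 per cent at 1,024 cores, are assumptions made for illustration and are not SGI measurements.

```python
def efficiency(cores: int, comm_cost: float) -> float:
    """Fraction of wall-clock time spent on useful work in a toy model:
    compute time falls as 1/cores, communication time grows linearly
    with cores (an assumed scaling, not a measured law)."""
    compute = 1.0 / cores
    communicate = comm_cost * cores
    return compute / (compute + communicate)


# Hypothetical constant, calibrated so that 1,024 cores give the
# 40 per cent useful-work figure quoted in the text.
COMM_COST = 1.5 / 1024**2

for cores in (256, 512, 1024, 2048):
    e = efficiency(cores, COMM_COST)
    print(f"{cores:5d} cores: {e:.0%} useful work, "
          f"equivalent to about {e * cores:.0f} fully busy cores")
```

In this toy model, useful work still dominates at 512 cores but not at 1,024, and doubling again to 2,048 cores actually delivers less useful throughput. That is the kind of behaviour that makes offloading communications attractive.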
Encouraged by the results from its Atom-based supercomputer, SGI is looking at other mobile processors that could save even more power, combining them with its communications-offload technology. As the CPUs themselves are quite small, they might even migrate onto the custom communications chips made by HPC vendors such as SGI. 'Integrating ARM processors into the communications engines, that would be interesting. But it is not an area where we have put our feet down and said "this is what we are going to do",' Goh admits.
Meanwhile, as supercomputers head towards the million-processor point, hardware failures will become so frequent that they make the valves of early mainframes seem reliable.
'Even the best supercomputers fail from time to time,' admits IMEC's Wilfried Verachtert. 'We can now measure the time between those failures in days. The best we can do today is try to avoid having programs that run for days and, if they do fail, simply run them again. But, as we scale up, the most optimistic projection is that something in the computer will fail every minute. Something may fail every second with that many processors inside. We have to do something about that.' This leads to the biggest problem of all: programming.
Software soft touch
Software assumes that the machine is 100 per cent reliable - if it has been restarted because a board dies, the software has no awareness of that. Storing an application's state at regular checkpoints may help, so that the application does not have to start from scratch every time something goes wrong; but if you expect a processor to go on strike every second, then checkpointing becomes impractical. The programmers will need to write software that can watch for failures and react by checking for errors and then moving a task to a spare processor so that the application can carry on.
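The point at which checkpointing stops paying off can be estimated with Young's classic first-order approximation, which puts the best interval between checkpoints at the square root of twice the checkpoint cost times the mean time between failures (MTBF). The sketch below applies it to the failure rates mentioned above. The 60-second checkpoint cost is a hypothetical figure, and the overhead estimate is only meaningful while it stays well below 1.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Rough fraction of run time lost to writing checkpoints and redoing
    work when checkpointing at Young's interval. Values near or above 1
    mean the approximation, and checkpointing itself, has broken down."""
    return math.sqrt(2.0 * checkpoint_cost_s / mtbf_s)


CHECKPOINT_COST_S = 60.0  # hypothetical time needed to save application state

for label, mtbf_s in (("a day", 86_400.0), ("a minute", 60.0), ("a second", 1.0)):
    interval = young_interval(CHECKPOINT_COST_S, mtbf_s)
    overhead = overhead_fraction(CHECKPOINT_COST_S, mtbf_s)
    print(f"failure every {label}: checkpoint every {interval:.0f}s, "
          f"estimated overhead {overhead:.0%}")
```

With a failure a day, the estimated overhead is a few per cent. Once failures arrive as often as a checkpoint takes to write, the estimate exceeds 100 per cent, meaning the machine would spend all of its time saving and restoring state.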
The reliability issue may spur development of software techniques that rely on speculation and less tightly coupled communications. Realising that many communications-bound applications slow down because they have to wait for information from other processors, computer scientists have developed techniques to let software predict the most likely result and execute along that path.
If its speculation turns out to be wrong, it can back out of the blind alley, and retry with the correct data. Whether speculation makes sense depends on the predictability of a particular code branch. But the approach has proved successful on a small scale in superscalar processors that can take a branch before the condition that controls which way the software should go has been determined. This kind of programming is very different to what supercomputer programmers are used to; but development will have to change if the exascale machines of the future are not to lie idle.
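The pattern described above can be sketched in a few lines: start the slow communication, do the dependent work on a predicted value while waiting, and then either keep the result or discard it and redo the work once the real value arrives. This is a minimal illustration using a thread to stand in for a remote processor. All of the function names and values are hypothetical, and a real supercomputer code would use its message-passing library rather than Python threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_remote_value() -> int:
    """Stands in for data that arrives late from another processor."""
    time.sleep(0.05)
    return 7


def compute(x: int) -> int:
    """Stands in for local work that depends on the remote value."""
    return x * x + 1


def speculative_step(predicted: int) -> tuple:
    """Run `compute` on a predicted input while the real input is in flight.
    Returns (result, speculation_was_correct)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_remote_value)  # communication in flight
        guess_result = compute(predicted)         # useful work while waiting
        actual = pending.result()                 # the real value arrives
    if actual == predicted:
        return guess_result, True                 # keep the speculative result
    return compute(actual), False                 # back out and redo the work


print(speculative_step(predicted=7))  # prediction correct: (50, True)
print(speculative_step(predicted=3))  # prediction wrong:   (50, False)
```

Either way the answer is correct; the gain comes only when the prediction is right often enough that the time saved outweighs the work thrown away on wrong guesses.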
'Our current UV architecture came about because the industry adapted itself into a corner and we needed to change,' SGI's Goh concludes. 'We hope to open up a new path to adapt along.'